NSF/JISC Repositories Workshop
Donald J. Waters, The Andrew W. Mellon Foundation
April 6, 2007
Not long ago, my colleagues and I at the Mellon Foundation
were reviewing a proposal from Stephen Murray, an architectural
historian who has been developing a database of images, measurements,
virtual reconstructions, and other representations of the features
of several hundred churches in medieval France. We asked
Professor Murray about some changes he was proposing to make
in the database and wondered how these changes would make the
database more attractive and easier for a broad range of students
and scholars to use in their studies and research. He
said that we had asked a “wonderful and complex question” and
then referred us to a masterwork of 19th-century scholarship
that was serving as a model for his own work: Viollet-le-Duc's
ten-volume Dictionnaire raisonné de l'architecture française
du XIe au XVIe siècle (Dictionary of French Architecture
from the 11th to the 16th century).
The
Dictionnaire was published beginning in 1854. Not unlike
the Mellon-funded church project, it exhaustively cataloged
and illustrated medieval design and construction methods from
basic structural to refined decorative techniques. Although
it was a prodigious effort in itself, building this database
was not enough for the author, who feared that without further
work to demonstrate its relevance and significance, the Dictionnaire
would languish and fall into obscurity. According to
Professor Murray, “the formlessness of a dictionary or
database gives the user no particular reason to want to use
it. Realizing this in the 1860s, le-Duc shifted to a
different mode of representation—that of story-telling.” In
subsequent writings, Viollet-le-Duc turned the data compiled in the
Dictionnaire into compelling narratives of how medieval French
architecture changed and developed, giving rise to the Gothic
style. With such demonstrations of how a rich data source
could be mined and used, the Dictionnaire has since been deeply
influential in driving the scholarly understanding of French
medieval architecture as well as revivalist movements among
architects.
Professor
Murray then went on to say in answer to our usability question
about his own database that “without the mechanisms of
the narrative the user will remain unmotivated. The problem
of the cultural and geopolitical context of architectural production
is vast and the user really needs an interlocutor or a teacher
to lead them on. The database cannot fully provide what
a good teacher can—but it can do much more than we have
so far attempted.”[1]
The
purpose of this workshop on repository support for data-driven
science seems to me to be perfectly aligned with Professor Murray’s
ambition to “do much more than we have so far attempted” with
vast arrays of data. My task in the following comments
is to examine this objective from the perspective of scholarly
communications, which includes the processes by which scholars
generate, record, report, preserve, and disseminate their knowledge-building
activities for each other and the larger society. What
are the key factors and how might consideration of these factors
affect what more could be done in designing systems for the
support of data-driven science? I turn first to the definition
of data-driven science.
Data-driven science. Today, the broadening
deployment of computer-based data conversion and capture instruments
and sensors has certainly expanded the scale of humanistic
and scientific data to be digested. Scientists are thus
confronted with information overload in the form of vast arrays
of data that have been generated from vacuum-cleaner surveys
of, among other things, the galaxies, physical phenomena on
earth, and the molecular composition of organic life. Because
these data arrays generally do not lend themselves easily
to controlled experiments or the application of theory, some
scientists have suggested that a new paradigm called data-driven
science is emerging—or needs to emerge—for the
discovery of new knowledge. Instead of depending on theory
and experiment, such a paradigm would depend heavily on the
data-mining, pattern matching and simulation capabilities of
high-performance computing.[2]
One
might reasonably be skeptical of such a sweeping claim of novelty
just as one might observe that nearly every generation complains
that it is uniquely burdened by information overload even as
it develops and adapts tools and methods to bring the problem
under control. However, even if these experiences of
and approaches to data are in fact new to scientists, they
are all too familiar to any humanist who has tried to fathom
the depths of a historical archive, comprehend the semantics
of an unknown language, describe the social rituals of a foreign
culture, reconstruct social practice from archaeological remains,
or, with Professor Murray and Viollet-le-Duc before him, interpret
architectural production within what Murray described as its
vast cultural and geopolitical context. Deep
dives into formless data, pattern-matching to organize and
render those data more manageable, and story-telling, which
after all is at the core of all simulation, are stocks in trade
of humanistic scholarship.
In
other words, as the anecdote from architectural history makes
clear, data-driven scholarship is not such a new paradigm and
may offer a deep methodological basis for collaboration between
scientists and humanists. What is new, however, and
what represents at least in part the challenge and opportunity
of “doing more than we have so far attempted” is
the formalization of very traditional interpretive activities
of data-mining, pattern matching, and story-telling or simulation
in powerful algorithms that represent large and complex sets
of data in terms of multiple features and variables that can
be analyzed, tested, replicated, and changed at the scale and
speed afforded by advanced computation. The promise of
these new automated capabilities is that new knowledge can
be created in ways that were not previously possible. To
achieve this promise a dependable, deeply scaled, and flexible
repository infrastructure is needed for managing the data,
serving them up for various analytical and synthetic tasks,
and capturing the outputs of such scholarly work. This
infrastructure depends in part on the larger context of scholarly
communications, and here I limit my focus to the implications
of three aspects of that larger domain: the general qualities
of the analytic process itself, its grounding in discipline-based
culture, and its economic and legal underpinnings.
The analytic process. One of the fundamental building
blocks of scholarly activity is search—for both basic
discovery and more complex analysis and synthesis. Over
the last fifteen years, search has been a growth industry,
and search for discovery has become almost a commodity item,
having been the subject of intense investment and development
by Google, Amazon, Yahoo, Microsoft, and their predecessors
and competitors. Search is effective as a discovery tool,
however, only insofar as a sufficiently rich body of sources
is comprehensively aggregated to be worth searching. Successful
aggregation of sources at scale is the unsung hero of the success
of the search engine industry, and if we are looking for models
of advanced repositories, it is to the Googles and Amazons
that we must surely look, not just for how they have stored
these aggregations, but also for how they have been gathered
in disparate formats from multiple sources operating under
a variety of business models and intellectual property regimes,
and then normalized and indexed for rapid delivery.
But
search for discovery is only the beginning of the scholarly
process. Scholars then must zero in on the subsets they
have found—the primary and secondary source objects of
interest to their work. They need to pull together these
selected subsets for deeper analysis. The process of
aggregation at this stage is more difficult and complicated
because data need to be scrubbed, normalized, and prepared
in a more rigorous fashion than is likely to be necessary or
affordable to the commodity search engines. The provenance
and authenticity of the information need to be established, rights
cleared, and databases and database schemas created; textual
objects may need to be translated and marked up for grammatical
and structural features as well as semantically according to
certain knowledge structures; numeric data may need conversion
to common measures; and assumptions and guesswork throughout
need to be carefully documented. Over centuries of data-driven
work in the humanities, such processes were codified and standardized
in the hands of what became commonly known as documentary editors. Today,
given the amount of data, the more these processes can be automated,
the better, and these functions are increasingly regarded as “curatorial” rather
than “editorial” in nature. The main difference
between the two designations seems to be that a documentary
editor engages in active and largely manual tasks, while the
data curator tends to take a more hands-off role, and instead
presides over an increasingly automated set of transformations.
More
research is desperately needed, however, to improve the accuracy
and reliability of automated data preparation or “curation” processes. In
addition, specifically with respect to repositories, there
is a two-fold challenge in this area of data curation. First,
the rapid and seamless preparation of data requires standard
protocols and interfaces for moving either the information
or a processing engine in and out of repositories. Second,
much of the preparation or curation work creates intermediate
representations of data. As large-scale analyses in
a variety of projects have shown, including the MONK and Nora
projects[3] for
textual analysis at the University of Illinois, these intermediate
products may themselves be worth saving for purposes of experimental
iteration and replication, and research is much needed to identify
when and under what conditions they are indeed worth saving.
Discipline-based culture. Although one can identify
important challenges associated with the general features of
the scholarly process as it moves from discovery to data preparation
and analysis, it is also necessary to recognize, as numerous
studies have shown, that significant differences in scholarly
practice exist among disciplines and fields of study. It
is within the disciplines where the pressures to innovate and
advance knowledge are greatest, but the investments in automation
are highly uneven, with some fields bursting with energy and
creativity and others operating within relatively static paradigms. At
Berkeley’s Center for Studies in Higher Education, former
University of California Provost, Jud King, and his colleague,
Diane Harley, recently concluded one of these studies focused
on the promotion and tenure decision-making process.[4]
King
and Harley show that recognition of innovative computationally based
forms of scholarship and publication occurs slowly in general,
but that variation is greatest at the discipline level. Moreover,
and perhaps more importantly, they make the useful distinction
between formal and informal modes of communication and observe
that the formal modes, such as publication in peer-reviewed
books and journals, tend to be most deeply resistant to change. After
all, the formal means of establishing scholarly credentials
are the basis on which institutional position, rank, and salary
are allocated, and few scholars are prepared to take significant
action that would disrupt their means of livelihood. King
and Harley go on to observe, however, that the informal realm
is where scholars work with each other on a daily basis, consulting
with one another, letting each other know which techniques worked
and which did not, and what new discoveries they have made. In
this informal realm—at the edge of the reputational and
promotional system, where credentials are being formed rather
than fixed—innovation is easier and more likely to occur.
The
important distinction between formal and informal modes of
scholarly communication helps explain why the physics arXiv,
to which all high-energy physicists routinely deposit their
papers, continues to exist alongside a publication system to
which they also routinely submit their papers, rather than
replacing it, as some have promised for almost two decades
that it would: one is an informal mode of communication and the other
is formal. The innovative automation of the preprint
process in arXiv in the early 1990s was built on
a stunning ethnographic insight about the informal scholarly
communications process in physics, and it has been usefully
extended to other fields in the sciences and the social sciences
where there have been informal traditions of circulating preprints
and working papers. Little innovation has occurred in
this area since the initial breakthrough and, as Paul Ginsparg
reported at Rice University’s De Lange Conference in
March 2007,[5] even
the code base for the arXiv system has changed little since
the mid-1990s. Real innovation in scholarly communications
is now occurring elsewhere in the formal and informal systems
of communications, and continued attention to the potential
interaction between preprints and formal publication threatens
to divert resources from other areas where they might be needed
and better invested.
From
the perspective of repositories and data-driven scholarship,
perhaps the most important opportunity is to focus on the construction
and curation of data sets. It may be sufficient for funding
agencies and journal and book publishers to mandate that original
datasets on which new publications are based be deposited and
maintained in publicly accessible repositories. However,
there are some fields that are thinking even more innovatively
and are trying to build peer-review systems around the data
so that they can be judged formally on qualities of coherence,
design, consistency, reliability of access, and so on. With
JISC support in the UK, scientists and professional associations
in the field of meteorology have joined to establish a new
kind of electronic publication called a data journal, where
practitioners would submit data sets for peer review and dissemination.[6] With
Mellon support, in the field of nineteenth-century literary
studies, Jerry McGann at the University of Virginia has organized
scholarly societies into a federation for the purpose of providing
peer review for data in the form of online documentary editions
of nineteenth-century authors.[7] And
Bernard Frischer, who is a specialist in online virtual reconstructions
of archaeological sites, has received NSF support to plan a
journal-like outlet that would provide peer review of virtual
reconstructions.[8] More
research and experimentation with forms of peer-reviewed data
could have significant impact in helping organize the field
of data curation, provide additional information for promotion
and tenure committees, and avoid wasting resources in a frontal
assault on a long-established and, by many accounts, still
highly valued system of formal publication.
The economy of openness and intellectual property. Another
area of vigorous innovation, also occurring on the edge of
the traditional and formal system of scholarly publication,
will deeply affect the sustainability of new forms of data-driven
scholarship. For such scholarship to thrive, there is
little question that data and other forms of content need to
be open, in the sense that economic, intellectual property,
and other barriers must be low enough to permit an easier flow
of information into rich computational environments. In
many fields, these barriers are falling with innovations such
as embargo periods, moving walls, toll-free access, and special
forms of license. Even major publishers such as Elsevier,
Wiley, and others are actively participating in these activities,
opening their published content. However, these publishers
are not innovating in the direction of greater openness for
its own sake, but to advance innovative new business opportunities
built precisely around new forms of data-mining and other services
that depend on open content. The principle of openness
thus is crucial in the formation of public policy for scholarship
and may be necessary for new forms of sustainable businesses
to emerge that support scholarship, but simple advocacy of
openness for its own sake is not necessarily sufficient or
wise. Here are a few examples of how focused research
in this area of rapid innovation could deepen and sharpen our
thinking about economic and intellectual property policies.
First,
let me draw your attention to complex intellectual property
issues associated with the arrangements between libraries and
commercial entities such as Google, Microsoft, ProQuest, and
others. Peter Kaufman and Ithaka made a useful attempt
in 2004 to analyze the large variety of types of arrangements,
some of which involve secret deals and do not always articulate
coherent and collective educational and public interest objectives.[9] Additional
work is especially needed on the IP issues associated
with emerging commercial services that will likely make use
of open access materials.
Indeed,
sophisticated publishers are increasingly seeing that the availability
of material in open access form gives them important new business
opportunities. That is, they can begin to incorporate
and recombine materials that they and other publishers have
produced with data and other related materials in sophisticated
databases, subject them to sophisticated search, data mining,
and semantic algorithms, and then present these as services
to a variety of specialized audiences willing to pay for the
added value over and above the original content. These
may be desirable outcomes in the end, and certainly present
opportunities for useful partnerships among scholars, libraries,
and publishers. However, what is worrisome about many
arguments in favor of open access is the lack of strategic
thinking about how open access material will actually be used
once it is made available, and the faith-based assumptions
that only beneficial consequences will follow from providing
open access.
One
worry is that open access to traditionally published monographs
and serials will cannibalize the sales of smaller publishers,
pushing them into further decline, and make it difficult for
them to invest in ways to help scholars select, edit, market,
evaluate, and sustain the new products of scholarship represented
in digital resources and databases. The bigger worry,
which is hardly recognized and much less discussed in open
access circles, is that the large, heavily capitalized publishing
firms will exploit open access repositories, cherry-picking
the most valuable open access products, combining them with
the most valuable new databases and resources, and selling
them back to the academy at a significant profit, while chasing
out sources of capital from within the academic community that
are desperately needed to advance scientific, humanistic, and
social science study.
Policy-oriented
research is needed that moves past simple advocacy of
open access and the trendy, glitzy rhetoric about the initial
step of making materials freely available, and focuses strategically
on the full life cycle of scholarly communications. Hard
questions have to be asked: open access for what and for whom,
and how can we ensure that there is sufficient capital for
investment in the dissemination of new and emerging forms of
scholarly output? In the software arena, a variety of
alternatives have been explored and articulated in the form
of open source licenses, some of which facilitate desirable
downstream activities and others of which do not. For content,
options like those afforded by the Creative Commons licenses
are important to consider, but now with wider use, conflicts
are beginning to emerge from the different forms of license,
and especially around the different interpretations of commercial
and non-commercial use. There are no magic bullets in
a highly pluralistic world, and these conflicts and their potential
solutions need to be much better understood.
References
[1] E-mail correspondence from Stephen Murray to Suzanne Lodato,
May 4, 2007, quoted by permission.
[2] See, for example, the testimony of Raymond L. Orbach, Director,
Office of Science, U.S. Department of Energy, before the U.S. House
of Representatives Committee on Science, July 16, 2003. Available
at http://www.science.doe.gov/Sub/speeches/Congressional_Testim/7_16_03_testimony.htm.
[3] See the Nora project description as well as reports and
publications at http://www.noraproject.org. Information on the
MONK project is forthcoming at http://www.monkproject.org.
[4] Diane Harley, Sarah Earl-Novell, Jennifer Arter, Shannon
Lawrence, and C. Judson King, “The Influence of Academic Values on
Scholarly Publication and Communication Practices,” Research and
Occasional Paper Series, CSHE.13.06 (September 2006), Berkeley, CA:
Center for Studies in Higher Education. Available
at http://cshe.berkeley.edu/publications/publications.php?id=232.
[5] Paul Ginsparg, “Read as We May,” presentation at the De Lange
Conference on Emerging Libraries, Rice University, March 6, 2007.
Webcast available
at http://webcast.rice.edu/webcast.php?action=details&event=985.
[6] Dr Alan Gadian, Principal Investigator, The Overlay Journal
Infrastructure for Meteorological Sciences (OJIMS) Project. Available
at http://www.see.leeds.ac.uk/research/ias/dynamics/current/ojims.html.
[7] Bethany Nowviskie and Jerome McGann, NINES: A Federated Model
for Integrating Digital Scholarship, September 2005. Available
at http://www.nines.org/about/9swhitepaper.pdf.
[8] The SAVE (Serving and Archiving Virtual Environments) project.
See http://www.iath.virginia.edu/save/.
[9] Peter B. Kaufman, “Marketing Culture in the Digital Age: A
Report on New Business Collaborations Between Libraries, Museums,
Archives and Commercial Companies,” August 25, 2005. Available
at http://www.intelligenttelevision.com/MarketingCultureinDigitalAge.pdf.