April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

Thoughts on Scale and Complexity

 
   

NSF/JISC Repositories Workshop
Abby Smith, Historian and Cultural Resources Analyst
April 2, 2007

When sitting down to solve a problem, it is good to imagine the size, shape, and dimensions of the problem; then to articulate the problem as precisely as possible, taking particular care with words that bear the burden of defining (and hence, solving) the problem. I offer a few thoughts on how to imagine the scale and complexity of digital data; a few thoughts on the language we use to describe the problem; and a few thoughts on whether existing frameworks and practices can help us—or not.

Scale and complexity: What is the problem?

The organizers of this workshop define the “scale and complexity of data” issue primarily as a technical challenge. What about the imaginative challenge of scale and complexity: our inability to conceive richly and objectively of any particular scale and degree of complexity much beyond the human scale? This inability is probably an innate feature of our minds, given adaptation to the specific parameters of our environment and our physical size. We may study phenomena on the scale of galaxies, viruses, and mesons, but those scales are not native to us. I have observed that extremely large and extremely small scales are often experienced subjectively as confusing, confounding, and at times paralyzing; so, too, are extreme levels of complexity and interdependence. At the same time, I have observed that such scales and levels of complexity can also be experienced with awe, wonder, curiosity, and the other affects that motivate scholarly inquiry. This last fact leads me to believe that the problem of imagining scale and complexity in the digital information environment is a tractable one. But we have not gotten there yet, as scientists and scholars. And once we do, we need to translate that imaginative capacity into operational knowledge and management skills. For, as the organizers of this workshop wisely remind us, the fate of data-driven scholarship will ultimately lie as much in the hands of managers, administrators, and funders as in those of scientists and scholars.

So, imagine this: you are a responsible manager of a reputable stewardship organization and are asked by a member of your governing body to explain how the organization will respond to the recently released report by IDC, “The Expanding Digital Universe,” (http://www.emc.com/about/destination/digital_universe/), a report the member has skimmed. This report forecasts that by 2010 there will be 988 exabytes of data in our world, up to 70% of which will be user-created. (I take that report as an example somewhat arbitrarily. Any publication that mentions petabytes and exabytes could be substituted here.) How precisely do you explain, objectively and reasonably, to both your funders and your staff what such a reality would mean for your organization and its mission? What does that mean for a natural history museum director, a collection development librarian, a museum curator, or a data scientist? How can they not feel confused, confounded, even a tad paralyzed when contemplating this news, at least in the context of everyday decisions about what to collect, what to preserve, what to provide access to, and what resources they will need?

For this is what is at stake: if we are serious about data-driven scholarship, then we must respond decisively but wisely to the imperative to collect, to curate, to preserve, and to provide access to the content that we believe, to the best of our abilities, will have enduring value. And we need to do so now. Thus the challenge of scale and complexity becomes a matter of urgency in our political economy. I hope that this challenge will be addressed technically, but I do not believe that we can get to a technical solution until we can wrap our heads around it imaginatively.

One more point: it’s helpful to remember that in the digital world, content wants to be machine-readable, not human-readable. When most of the historical record (presently in analog form) is in machine-readable form, the problems of migrating content repositories into a digital environment and deciding which born-digital content to collect and how will appear ever so much smaller. People will be able to see what they are now only asked to imagine.

And another point: worrying about what researchers five decades hence will want is a waste of time at best, a dodge at worst. Let's spend our time addressing today's research problems, and tomorrow's.

So, one challenge before us is about vision:

Can we create a frame of reference that is meaningful for people making decisions today and tomorrow about what digital content to collect and commit resources to?

 

You say data, I say content, let's call the whole thing off

Actually, let's not. I’d rather we all work together, each of us tackling the part of the problem most important to us. I like the definition of "data" found in the NSF cyberinfrastructure vision report: "…data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data." The term "content," more frequently used among the general public, the creative industries, and scholars of human culture, literally means “the substantive or meaningful part” of a created work (American Heritage Dictionary, 4th edition). By extension, in digital parlance “content” refers to that which is created when data (that is, data points) are combined in a specific context with the purpose of creating meaning, order, or significance. Content is always characterized by specific behaviors and, loosely speaking, some structure (a certain level of organization or of fixity) that results in such familiar forms as music, webpages, geo-mash-ups, visual patterns, simulations, etc. "Subterranean Homesick Blues," for example, comprises a lot of data points (sound waves) that, taken together in the specific way that Bob Dylan devised a few decades ago, emerge as “content,” that is, a song. (A trendy way of putting this would be to say that content is an emergent form of data, but I am sure researchers who actually need to use the word “emergent” in very specific research contexts would prefer that we leave that word alone, and so I shall.)

Why does this matter? Just because I'd like to see insights and solutions to the scale and complexity problem in the scientific and engineering domains be accessible to those in other domains, especially the general public.

 

One more point, a distinction with a difference: content can be copyrighted. Data, in the sense of facts, cannot be.

And a point about that 70%: people love creating content! Content is meaning. It is expression. It is how we experience ourselves and connect with others.

So here's another challenge for us:
Let us agree that data-driven scholarship and content-driven scholarship are the same thing, and get more people working on the problem.

Lessons learned: Is there any continuity between yesterday, today, and tomorrow?

The biggest problem posed by the scale of digital content production is figuring out what needs to be collected and what does not. For very good reasons, most long-standing stewardship institutions (social science data archives, research libraries, art and natural history museums) have focused on the technological and personnel challenges demanded by the migration from an all-analog environment to a hybrid analog-digital environment. They are particularly concerned about the implications of these changes for their long-term sustainability and their short-term business models. At times they act as if they do not understand that the decisions about what born-digital content to collect will be determinative of both their long-term sustainability and their short-term business models. Okay, so the scale of production of relevant content for institutions has ballooned completely out of proportion to the resources available to capture and manage it. And okay, collecting always involves a commitment of resources both now and into the future, so these decisions are not easy. But without getting the collecting bit right, a lot of resources will be wasted. (And a lot of valuable content will be lost.)

Would it help to know that the problem posed by the scale and complexity of information for content selection is by no means a new problem? Of course, as a historian, I'm inclined to answer that question "yes." To take one example: libraries, even the largest library in the world, have always made trade-offs between comprehensiveness of subject matter coverage and depth of coverage. Of the Library of Congress's collections, numbering well over 130 million items, about a third are bound volumes. A very large portion of the remainder are manuscript archives, most of them personal archives. Even if the scale of any individual collection is familiar to us (the scale of a single individual's life), we still encounter the problem of determining how much documentary evidence is enough to create a meaningful and authentic record of a phenomenon, the phenomenon in this case being a person. In all domains of scholarship, there's a very real question about how much data are needed to have something (a species, the bathysphere, a historical event, Hamlet) authentically represented. Deciding where the boundaries of any person's life begin and end is conceptually very similar to deciding where the boundaries of, say, a webpage lie. People commonly say "Well, a webpage isn't just a page, it's all the links to the page." By this, I believe they mean that what makes a webpage have value as a tool of expression, information, or communication is that it links to other pages, that those pages become part of the communication in toto, and if you leave out too much of all the stuff that makes the webpage meaningful to begin with, you might as well not save the webpage at all.

This is precisely the boundary problem that archivists have faced for a long time when determining the collecting scope for an individual human life. Humans, and certainly noteworthy humans, tend to be interesting because of all the people, places, things, and historical events that they are connected to. What happens when you have a wonderful subject, such as John Muir or Gifford Pinchot, who had wide correspondence during a long life with other wonderful subjects? A good archivist would want to collect as much from that subject and about that subject as possible under one roof. But that is not possible. Given that, what you, as a researcher, would want is access to the content in all the other collecting institutions that hold materials relating to subjects with whom Muir or Pinchot corresponded, and then, of course, to people with whom those subjects corresponded.

And that's just the correspondence link. What about a historical agent, such as Muir or Pinchot, who was widely involved in the growth of the environmental movement and in public policy? What about the institutions of higher education they attended and taught at, the numerous domiciles in which they grew up and lived, the interesting organizations that they were associated with, all the trips they took, etc.?

How is it that archivists are not constantly stymied by these boundary definitions? That's a long story. But they don't always get it right, and their solutions have extreme limitations because they're dealing with physical artifacts. But I do believe that part of solving the scale problem lies in grappling directly with the boundary problem, that is, the scoping problem in collecting. One straightforward way of solving the scale problem, one that should be pretty obvious in the Web 2.0 environment, is to have more people/organizations collecting. The more people/organizations we have collecting, the more finely we can parse the boundary problem. The fewer people/organizations we have collecting, the greater is the pressure on the ones that do to settle boundary problems both at very large scales and very small scales. Working through communities of practice, with each community taking on responsibility for content that is most valuable to them, could go a long way to capturing more, rather than less, content. Of course, this solution would achieve our goals for content sharing only if the collections were open and interoperable. Again, a topic urgent for our political economy and one that, fortunately, is a topic for another breakout group.

So our third challenge: reach out to other communities of practice.