April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

Access Tools: Bridging Individuals to Information

 
   

NSF/JISC Repositories Workshop
Linda Frueh, Internet Archive
April 13, 2007

We are here to discuss the enabling factors for individuals to participate in data-driven research. The goal from the Internet Archive’s perspective is to provide researchers and the general public with powerful and flexible access to stored information. As a community, we have brought several links between information and individual access into production; with the successes have come new opportunities to complete the data-to-researcher bridge and new directions in which to fund research.

1. Information Collection: In Production
The workshop participants represent significant success stories in gathering information into large collections. In order to create meaningful archives, libraries and repositories, we have dealt with rights issues, the cost of collecting and organizing data, and digitization for easy access. We are also managing the challenges of working across multiple institutions to create collections. Tools have been created to enable large-scale information gathering, including on-site digitization, digital database platforms and web harvesting. The Internet Archive has used these tools to create collections numbering in the billions of web pages and in the hundreds of thousands of digitized books, movies and audio recordings. Others have assembled quantitative data sets in the millions of items. These information-gathering tools are most powerful when built on open source principles; they result in some standards across collections, which will later facilitate information sharing. Many such tools are being adopted, shared and further developed by the research community – but new opportunities exist to facilitate digital information collection.

2. Petabyte Storage: In Production
Storage of vast quantities of data is the second challenge taken on by this group. Petabyte-scale storage is now readily available, with stable, replicable architecture. Storage facilities are becoming networked to enable information sharing and exchange. As a group, we’re taking on the management of offsite storage and the adoption of standards to enable internetworking of multiple locations. Further challenges include technology migration and the development of much larger-scale storage capability – perhaps to exabyte scale.

But researchers should not be limited to any single storage facility or organization in plumbing data for new discoveries. Hence, interoperability of storage structures is an opportunity to strengthen the information-to-individual bridge.

3. Preservation: Research Still Needed
Peter Murray-Rust has cited very high figures, 80% and up, for the loss of primary research data post-publication. Preservation of information is an essential task in the information bridge: it enables scientists and scholars to find unexpected and novel associations without having to generate new primary data.

The Internet Archive and others are implementing policies to ensure sustainable, archival preservation by keeping duplicate copies on separate devices and storing multiple copies on different continents, with different administrators. LOCKSS systems and the Internet Archive’s mirror facilities in Amsterdam, Alexandria (Egypt) and San Francisco are examples of these policies in action. The goal is to protect information from the failure or policy changes of any individual system, organization or administrative body. More sophisticated preservation policies, addressing file format issues and technology standards, are sure to be a new frontier for leadership by JISC and NSF.
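The replication policy described above can be illustrated with a minimal sketch. It is not the Internet Archive's or LOCKSS's actual tooling: the `Replica` record, the `policy_violations` helper, the administrator names and the thresholds (at least two copies, two continents, two administrators) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One stored copy of an item (hypothetical record, for illustration only)."""
    device: str          # identifier of the storage device holding the copy
    continent: str       # continent where the device is located
    administrator: str   # organization that administers the device


def policy_violations(replicas, min_continents=2, min_administrators=2):
    """Return a list of policy violations; an empty list means compliant.

    The thresholds are assumed values, not figures from the paper.
    """
    violations = []
    devices = [r.device for r in replicas]
    if len(replicas) < 2:
        violations.append("fewer than two copies")
    if len(set(devices)) < len(devices):
        violations.append("two or more copies share a device")
    if len({r.continent for r in replicas}) < min_continents:
        violations.append("copies span too few continents")
    if len({r.administrator for r in replicas}) < min_administrators:
        violations.append("copies span too few administrators")
    return violations


# Example loosely modeled on the three mirror sites named above;
# the device and administrator names are placeholders.
mirrored = [
    Replica("sf-node-01", "North America", "admin-a"),
    Replica("ams-node-01", "Europe", "admin-b"),
    Replica("alx-node-01", "Africa", "admin-c"),
]
single_site = [
    Replica("sf-node-01", "North America", "admin-a"),
    Replica("sf-node-01", "North America", "admin-a"),
]
```

Here `policy_violations(mirrored)` returns an empty list, while `policy_violations(single_site)` reports a shared device, too few continents and too few administrators. A check of this kind makes the goal explicit: no single system, organization or administrative body should be able to cause the loss of an item.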

4. Access Tools: Research Needed
Ultimately, researchers must be able to get at stored information. Access tools are the last, critical step in supporting individual participation in data-driven research. These tools can be for finding, associating, tagging and downloading information – one node at a time or in bulk. Cornell University is pushing this frontier forward with its link structure analysis of the web and its access tools overlaid on Internet Archive collections. Opportunities abound to encourage and enable better and more powerful access to archived collections. For example, there are approximately 5,000 living, written languages today – how will we support language-independent research in this new digital world? The Internet Archive has 25,000 digitized texts in Arabic; how will a Portuguese-speaking scholar gain useful access to them? Our Television Archive has one million hours of television, none of it indexed. How will scholars and historians be able to review and analyze it?

The Internet Archive has hundreds of thousands of users each day, but we believe that with more
sophisticated access tools it can be millions. Some observations from the Archive’s experience
with access tools so far:

  • Offering programmer access to our storage machines has not attracted much activity, particularly not from researchers and scholars themselves.
  • We have been successful at boosting access with APIs and user-friendly interfaces: e.g. the Wayback Machine, fully searchable text and web collections, and flipbook readers.
  • The Internet field is bursting with community-oriented sites and organizing tools: Wikipedia, YouTube, Google Maps and LibraryThing are some of the successes at attracting tremendous traffic because of their novel interfaces. Their success offers insights into features of interest to individual users – including tagging, annotation, user-generated content and community discussion.
  • To design great access tools we need to understand and incorporate end-user workflow issues; this suggests applications tailored to specific research communities.

What can NSF/JISC do to support individual participation in data-driven research?
Digital collections are coming together in enough quantity to support data-driven research. The limiting factor now is tools that provide great, interactive access to the materials. Opportunities for development in this area include:

  • Sets of open source software that can index collections and make them easily and
    flexibly findable
  • Library-scale machine translation
      - Many languages to many languages
      - Hundreds of thousands to millions of books
  • Library-scale universal OCR
      - Language-independent OCR
      - Multiple languages in many typefaces
  • Massive/bulk digital indexing of video content
  • An open source web search system
  • Tools to enable time-based studies for trend analysis and sociological change

This group can help us, in the words of David MacArthur (NSDL), to “move from the era of developers to the era of end-users.”