Latest page update: 1997 September 7.
It has been more than thirteen years since the publication of Salton and McGill's Introduction to Modern Information Retrieval. The texts that have appeared during these years have either been nontechnical, or focused on a particular aspect of information retrieval (IR). The present text is designed for an introductory graduate level course on the concepts and methods of IR, including many ideas from the IR research that has taken place during the past decade.
User profiles. Current IR systems are more sophisticated than those of earlier years. As a result, tailoring of retrieval results to individual users is now practical. The concept of a user profile, in both a simple (key term) and an extended (user characteristics) form, provides a means of modifying the retrieval process to better fit individual information needs. It can be combined with a query in several ways to provide a more complex retrieval process, better matching an individual's information need.
Multiple reference point systems. If one thinks of a query not as a single stream of terms, but as a set of individual reference points, it is possible to organize the output from a retrieval session to show how documents relate to the individual reference points. This provides the user with a more structured view of the output. In particular, output from a Boolean retrieval session can be broken into subsets corresponding to the various Boolean combinations comprising the query.
Modern document databases. The trend in information retrieval has been to work increasingly with very large document databases, to include both full-text and multimedia documents, to work with multiple databases distributed across a network, and to work with multilanguage databases. This has resulted in research studies (e.g., TREC - Text REtrieval Conference) that focus on these contexts, and in a proliferation of Web search engines that may not be well designed, but that do give access to millions of pages of documents. In addition, as document retrieval becomes more integrated into broad information systems, the problem of blending text retrieval methods, classical database management methods, and image analysis and retrieval methods has taken on major significance.
VIRI: Visual Information Retrieval Interface. As Web search engines and other retrieval methods bring forth thousands of documents in response to a query, VIRIs provide graphical means for the user to assimilate the flood of documents, and to focus attention on "the richest part of the ore." Although VIRIs are a relatively recent concept, nearly one hundred have been proposed, and in most cases developed to prototype stage. Several are now commercially available.
A glossary of terms used in the book is included in the text, and is also available online.
Hints and answers to the exercises in the book are provided online. This material will be updated as new ideas or better answers are developed. For further information, please contact the publisher or the author.
This chapter introduces the distinctions between data and information, and between query and information need. It also discusses the author's concept of an endosystem under the control of the designer and an ectosystem that influences system behavior but is not under designer control.
Commentary: The user is a key element of any information system, and must be considered in system design.
Chapter 2: Document and Query Forms
Document and query forms are compared and contrasted. Based on the fact that some information need statements are longer than some documents, and that the user may want to use a document as the basis for a query, the book comes down on the side of considering a query to be just one form of a document. Boolean queries may be an exception to this, since published documents tend not to be written in the form of Boolean expressions.
Chapter 3: Query Structures
Depending on the retrieval model used, a query may have one of several distinct structures: a Boolean expression, a vector of terms or term weights, a vector of probability or fuzzy measures, or a natural language expression.
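As a rough illustration of how two of these structures differ, the following Python sketch represents one query as a Boolean expression tree and another as a vector of term weights. The three-document collection, the class names (`Term`, `And`, `Or`, `Not`), and the helper functions are hypothetical and are not taken from the book.

```python
from dataclasses import dataclass

# Hypothetical collection: each document is reduced to a set of index terms.
DOCS = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "theory"},
    "d3": {"retrieval", "vector", "model"},
}


# Boolean structure: the query is an expression tree over terms.
@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    operand: object


def matches(query, terms):
    """Return True if a document's term set satisfies the Boolean query."""
    if isinstance(query, Term):
        return query.word in terms
    if isinstance(query, And):
        return matches(query.left, terms) and matches(query.right, terms)
    if isinstance(query, Or):
        return matches(query.left, terms) or matches(query.right, terms)
    if isinstance(query, Not):
        return not matches(query.operand, terms)
    raise TypeError(f"unknown query node: {query!r}")


# information AND (retrieval OR theory) AND NOT vector
boolean_query = And(
    Term("information"),
    And(Or(Term("retrieval"), Term("theory")), Not(Term("vector"))),
)
boolean_hits = sorted(d for d, terms in DOCS.items() if matches(boolean_query, terms))


# Vector structure: the query is a mapping from term to weight.
vector_query = {"information": 1.0, "retrieval": 2.0}


def score(weights, terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(w for term, w in weights.items() if term in terms)


# Rank by descending score, breaking ties by document id.
vector_ranking = sorted(DOCS, key=lambda d: (-score(vector_query, DOCS[d]), d))
```

In this sketch the Boolean query splits the collection into matching and non-matching documents, while the weighted vector produces a ranking of all documents.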
If a query is viewed as a document, then conceptually it resides in the document space, and the retrieval process can be viewed as matching documents to queries, finding those that are in some sense the closest. If a query is viewed as distinct from a document, then the retrieval process becomes one of mapping from the document space into the query space, of transforming the document into a form that can be compared to the query.
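The first view, in which the query resides in the document space, can be sketched in a few lines of Python. This is a minimal illustration that assumes raw term counts as vector components and the cosine measure as the notion of closeness. The function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Map a text (document or query) to a term-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_text, documents):
    """Order document ids from closest to farthest from the query."""
    q = to_vector(query_text)  # the query is treated as one more document
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


documents = {
    "a": "boolean query processing",
    "b": "vector space retrieval model",
    "c": "image retrieval",
}
ranking = rank("vector retrieval", documents)
```

Because the query passes through the same `to_vector` function as the documents, any document text could equally be supplied as the query.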
Commentary: Consider the fact that matching is an asymmetrical process. The user wants documents that match a query, not queries that match a document. Yet the similarity measures that are commonly used are symmetrical in nature. What are the implications of this?
Clustering techniques are almost invariably based on some kind of distance measure, and various audiences, including ASIS and SIGIR, tend very strongly to choose distance rather than angle as an indicator of similarity, when given a choice. Why, then, is the cosine measure so dominant in vector retrieval models?
Chapter 5: Text Analysis
Effective retrieval depends on having a sound analysis of the documents in a collection. Manual and automatic indexing are discussed. Lexical, syntactic, semantic, and pragmatic methods of text analysis are introduced.
Chapter 6: User Profiles and Their Use
One possible reason for the relatively poor performance of information retrieval systems is that they have treated users monolithically. The concept of a user profile, both simple and extended, is introduced. Methods of combining user profile information with an explicit query are presented.
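One simple way to picture such a combination is a weighted blend of a query vector and a simple (key term) profile vector. The Python sketch below assumes that both are stored as term-to-weight mappings and that a single mixing parameter `alpha` controls the balance. This linear blend is an illustrative assumption, not necessarily one of the methods the book presents, and the name `combine` is hypothetical.

```python
def combine(query, profile, alpha=0.7):
    """Blend term weights as alpha * query + (1 - alpha) * profile.

    alpha = 1.0 ignores the profile; alpha = 0.0 ignores the query.
    Both the formula and the default value are assumptions for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    terms = set(query) | set(profile)
    return {
        term: alpha * query.get(term, 0.0) + (1.0 - alpha) * profile.get(term, 0.0)
        for term in terms
    }


# Hypothetical example: a user whose profile emphasizes legal material.
query = {"retrieval": 1.0}
profile = {"retrieval": 1.0, "legal": 1.0}
blended = combine(query, profile, alpha=0.5)
```

The blended vector can then be matched against documents exactly as an ordinary query vector would be, so that two users posing the same query may receive different results.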
One outgrowth of work with user profiles is the realization that a query can be considered in terms of multiple reference points. Individual query terms can be treated separately. Alternatively, reference points can include user profiles, known documents, or related queries. For example, a user might pose three queries on a topic, focusing on the technical, legal, and societal aspects of the topic. The mathematical basis for this approach is discussed, showing that a multiple reference point query can be thought of as a classical query, but with the retrieved set distorted to better match the user's information need.
Chapter 8: Retrieval Effectiveness Measures
The chapter opens with a discussion of precision and recall, the classical effectiveness measures. The shortcomings of these measures are discussed and alternative measures are described.
Chapter 9: Effectiveness Improvement Techniques
The major effectiveness improvement technique is relevance feedback, a well-established procedure. This is discussed, along with a genetic algorithm version of feedback that permits a more widespread exploration of the document space. This exploration may identify two or more distinct sets of documents that respond to the information need.
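For readers unfamiliar with relevance feedback, the following Python sketch shows one common formulation, usually attributed to Rocchio: the query vector is moved toward the centroid of documents the user judged relevant and away from the centroid of those judged nonrelevant. The choice of this formulation, the parameter defaults, and the function names are assumptions made for illustration. The genetic algorithm variant is not shown.

```python
def centroid(vectors):
    """Average a list of term-weight vectors; empty input gives an empty vector."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*query + beta*centroid(relevant) - gamma*centroid(nonrelevant).

    Terms whose updated weight is not positive are dropped, a common
    simplification. The default parameter values are illustrative only.
    """
    pos = centroid(relevant)
    neg = centroid(nonrelevant)
    updated = {}
    for term in set(query) | set(pos) | set(neg):
        weight = (
            alpha * query.get(term, 0.0)
            + beta * pos.get(term, 0.0)
            - gamma * neg.get(term, 0.0)
        )
        if weight > 0:
            updated[term] = weight
    return updated


# Hypothetical feedback round: one relevant and one nonrelevant document.
new_query = rocchio(
    {"retrieval": 1.0},
    relevant=[{"retrieval": 1.0, "feedback": 1.0}],
    nonrelevant=[{"database": 1.0}],
    alpha=1.0,
    beta=0.5,
    gamma=0.25,
)
```

The updated query gains the term "feedback" from the relevant document, and "database" is excluded because its weight would be negative.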
Citation processing, less frequently used than Boolean or vector retrieval methods, nevertheless provides a sound way to identify documents related to a known document. Hypertext links provide another useful linkage between documents, now widely used on the World Wide Web. Image and sound retrieval are also discussed.
Chapter 11: Output Presentation
The traditional output from a retrieval search has been a list of documents. The shortcomings of such a list are described, and VIRIs (visual information retrieval interfaces) are introduced. These provide a more detailed and structured view of the retrieved document set, an individual document, or the document space. They are now being widely studied as interfaces that may help users identify relevant documents.
Chapter 12: Document Access
If a computer-based retrieval system is to be used, documents must be presented in an electronic form. Other than manually retyping paper documents, the most viable ways to present documents seem to be either scanning the paper document into the system, or preparing the document in electronic form originally. Both of these processes are discussed, as they relate to document retrieval.
Chapter 13: The Ectosystem and Policy Issues
The ectosystem does, in some sense, influence and control the endosystem. As the ectosystem includes people, people-related issues such as intellectual property rights, privacy, and security become important. System design alternatives relating to these issues are presented.
Appendix A: String Matching Techniques
Query to document comparison depends fundamentally on matching strings of characters. Four different string matching techniques are discussed.
Appendix B: File Structures
Six different file structures of use in information retrieval are presented.