IST 2140
Information Storage and Retrieval
COURSE DESCRIPTION:
Introduction to storage and retrieval of textual, pictorial, graphic, and voice data. The focus is on effectively interpreting imprecise queries and providing a high-quality response to them from a database of incompletely described "documents."
(Prerequisites: introduction to logic and statistical analysis, familiarity with a high-level programming language)
COURSE OBJECTIVES:
1. to understand the dimensions of the information retrieval problem;
2. to understand the functions of an information retrieval system;
3. to analyse the components of an information retrieval system;
4. to consider the factors which optimize the information retrieval process;
5. to examine current issues in information retrieval.
RECOMMENDED TEXTBOOK:
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley.
ASSESSMENT:
Midterm Exam 30%
Short Papers 20%
Course Project 40%
Participation 10%
SCHEDULE:
Wednesday, 3:00 - 5:50 p.m., Room 404
INSTRUCTOR:
Edie Rasmussen
Office: 646 LIS Building
Tel: (412) 624-9459
Fax: (412) 648-7001
Email: erasmus@mail.sis.pitt.edu
Office Hours: Mon. 2:00-4:00 p.m.
Tues. 9:30-11:30 a.m.
GSA:
Shveta Goel
Office: A-206 IS Building
Office Hours: Mon. 10:30 a.m.-12:30 p.m.
Course Policies
Attendance
Class attendance is required for success in this course, as material will be covered in class which is not included in the textbook. A part of the final grade (10% of the total) will be based on your attendance and participation. If you must miss a class please notify the teaching fellow, and make arrangements to obtain course notes and handouts. Makeup exams for the midterm and final will not be offered except under extreme circumstances.
Plagiarism
It is expected that the work you submit in this course will be your own. While collaboration is allowed for the course project, it should be approved in advance and the nature of each contribution should be specified in the project proposal and the final submission.
The following statement is taken from The Teaching Assistant Experience: A Handbook for Teaching Assistants and Teaching Fellows at the University of Pittsburgh (A.P. Haley and J.M. Nicoll, eds.):
Plagiarism means submitting work as your own that is someone else's. For example, copying material from a book or other source without acknowledging that the works or ideas are someone else's and not your own is plagiarism. If you copy an author's words exactly, treat the passage as a direct quotation and supply the appropriate citation. If you use someone else's ideas, even if you paraphrase the wording, appropriate credit should be given. You have committed plagiarism if you purchase a term paper or submit a paper as your own that you did not write.
Plagiarism is a violation of the University of Pittsburgh's standards on academic honesty, and violations of this policy are taken seriously. From the Guidelines on Academic Integrity: Student and Faculty Obligations and Hearing Procedures (effective September, 1995):
A student has an obligation to exhibit honesty, and to respect the ethical standards of the historical profession in carrying out his or her academic assignments. Without limiting the application of this principle, a student may be found to have violated this obligation if he or she:
1. Presents as one's own, for academic evaluation, the ideas, representations, or words of another person or persons without customary and proper acknowledgment of sources.
2. Submits the work of another person in a manner which represents the work to be one's own. [Quotation ellipsed.]
Special Needs
Students with disabilities who require special accommodations or other classroom modifications should notify the instructor and the University's Office of Disability Resources & Services (DRS) no later than the 2nd week of the term. Students may be asked to provide documentation of their disability to determine the appropriateness of the request. DRS is located in 216 William Pitt Union and can be contacted at 648-7890 (Voice), 624-3346 (Fax), and 383-7355 (TTY).
Students who must miss an exam or class due to religious observances must notify the instructor ahead of time and make alternative arrangements.
Course Outline
Week Date Topic
1 August 28, 2001 Introduction to Course
Information Retrieval Systems and their Design
Lecture 1 -- Powerpoint slides
2 September 4, 2001 Documents and Queries
Representing Document Content
Lecture 2 -- Powerpoint slides
3 September 11, 2001 Information Retrieval Models I
Boolean Model
Vector Model
Lecture 3 -- Powerpoint slides
4 September 18, 2001 Information Retrieval Models II
Probabilistic Models
Cluster-based Retrieval
Language Models
Lecture 4 -- Powerpoint slides
5 September 25, 2001 Implementing IR Systems
Storage
Search Algorithms
Software
Lecture 5 -- Powerpoint slides
6 October 2, 2001 Measuring Effectiveness of IR Systems
Lecture 5 -- Powerpoint slides
7 October 9, 2001 Improving Effectiveness of IR Systems
Relevance Feedback
Query Expansion
Review Session
Lecture 7 -- Powerpoint slides
8 October 16, 2001 Mid-term Exam
9 October 23, 2001 Alternative Retrieval Techniques
Latent Semantic Indexing
Citation-based Retrieval
Hypertext Retrieval
Natural Language Processing
Machine Learning
Lecture 8 -- Powerpoint slides
10 October 30, 2001 Other IR Problems:
Cross-lingual Information Retrieval
Document Representation
Text Summarization
Question-Answering
Text Categorization
Data Mining
11 November 6, 2001 Information Retrieval and the WWW
12 November 13, 2001 Multimedia Information Retrieval
Images
Video
Sound
13 November 20, 2001 Users and Information Retrieval
User Modelling
User Interfaces
Information Visualization
Short Papers Due
November 27, 2001 Thanksgiving - No Class
14 December 4, 2001 Social Issues in IR
Course Review
15 December 11, 2001 Presentation of Course Projects
Assessment
Student work for this course involves several components.
1. A midterm exam on October 16 on the work covered in class to October 9 (the basic information on information storage and retrieval systems).
2. A course project which will involve creating or installing an information storage and retrieval system, loading a set of documents (to be provided) and testing it against a set of queries (also provided). Candidate systems will be identified. The project can be done individually or in groups of 2 or 3. In the final class students will report the results from their system, analyse its strengths and weaknesses, and compare the results across systems.
3. Two short papers, one from a list of topics to be provided from the material covered in the second half of the term, the other a user evaluation of an IR simulation.
4. Participation in the class (attendance and contribution to discussions).
Reserve List
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM.
Chowdhury, G.G. (1999). Introduction to Modern Information Retrieval. London: Library Association.
Frakes, W.B. and Baeza-Yates, R. (eds.) (1992). Information Retrieval: Data Structures & Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley.
Lancaster, F.W. and Warner, A. (1993). Information Retrieval Today. Arlington, VA: Information Resources Press.
Meadow, C.T., Boyce, B.R., and Kraft, D.H. (2001). Text Information Retrieval Systems. San Diego, CA: Academic.
Salton, G. (1989). Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
Witten, I.H., Moffat, A., and Bell, T.C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. San Francisco, CA: Morgan Kaufmann.
Weekly Reading List
Week 1 and 2:
Information Retrieval Systems and their Design
Documents and Queries
Representing Document Content
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch.1, Overview, pp. 1-16.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch.2, Document and query forms, pp. 17-49.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch.5, Text analysis, pp. 105-143.
Lancaster, F.W. and Warner, A. (1993). Information Retrieval Today. Arlington, VA: Information Resources Press. Ch. 1, Some basics of information retrieval, pp. 1-20.
Week 3:
Information Retrieval Models I
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 3, Query structures, pp. 51-78.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 4, The matching process, pp. 79-104.
Week 4:
Information Retrieval Models II
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM. Ch. 2, Modeling, pp. 19-71.
Lavrenko, V. and Croft, B. (2001). Relevance-based language models. SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM. pp. 120-127. Available at http://ciir.cs.umass.edu/~lavrenko/pub/RelevanceModels.pdf
Rasmussen, E. (1992). Clustering Algorithms. In Information Retrieval: Data Structures and Algorithms (W.B. Frakes and R. Baeza-Yates, eds.). Englewood Cliffs, NJ: Prentice Hall. Pp. 419-442.
Salton, G. (1989). Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley. Ch. 10, Advanced information-retrieval models, pp. 313-373.
Week 5:
Implementing IR Systems
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM. Ch. 8, Indexing and searching, pp. 191-228.
Croft, W.B. (2001). An Overview of InQuery as used for the TIPSTER Project. Available at: http://ciir.cs.umass.edu/demonstrations/InQueryRetrievalEngine.html.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Appendix B, File Structures, pp. 305-312.
Witten, I.H. et al. (2001). Greenstone: a comprehensive open-source digital library software system. In: Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, June 2-7, 2001. (New York, NY: ACM). Pp. 113-121. (Software download at http://www.nzdl.org/)
Witten, I.H., Moffat, A., and Bell, T.C. (1999) Chapter 3, Indexing; Chapter 4, Querying, in Managing Gigabytes, 2nd ed. Morgan Kaufmann.
Week 6:
Measuring Effectiveness of IR Systems
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM. Ch. 3, Retrieval Evaluation, pp. 73-97.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 8, Retrieval effectiveness measures, pp. 191-218.
Mizzaro, S. (1997). Relevance: the whole history. Journal of the American Society for Information Science 48(9): 810-832.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing & Management 28(4): 467-490.
Voorhees, E. & Harman, D. (2001). Overview of the Tenth Text REtrieval Conference (TREC-10). In: NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 10) (National Institute of Standards and Technology). http://trec.nist.gov/pubs/trec10/papers/overview_10.pdf
Week 7:
Improving Effectiveness of IR Systems
Efthimiadis, E. (1996). Query expansion. Annual Review of Information Science and Technology 31: 121-187.
Harman, D. (1992). Relevance feedback and other query modification techniques. In: Frakes, W.B. and Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms. Englewood Cliffs, NJ: Prentice-Hall. Pp. 241-263.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch 9, Effectiveness improvement techniques, pp. 219-236.
Salton, G. & Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science 41: 288-297.
Week 9:
Alternative Retrieval Techniques
Chen, H. (1995). Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science 46(3): 194-216.
Dunlop, M.D. and Van Rijsbergen, C.J. (1993). Hypermedia and free text retrieval. Information Processing & Management 29(3): 287-298.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 10, Alternative retrieval techniques, pp. 235-256.
Strzalkowski, T. (1995). Natural language information retrieval. Information Processing & Management 31: 397-417.
Week 10:
Other IR Problems:
Hasnah, A. and Evans, M. (2001). Arabic/English cross language information retrieval using a bilingual dictionary. ACL/EACL Workshop 2001: Arabic Language Processing: Status and Prospects. Available at: http://www.elsnet.org/arabic2001/hasnah.pdf
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 11, Output presentation. Pp. 257-270.
Lam, W., Ruiz, M. & Srinivasan, P. (1999). Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering 11(6): 865-879.
Mani, I. et al. (1998). The TIPSTER SUMMAC Text Summarization Evaluation. Final Report. October, 1998. McLean, VA: MITRE. (Mitre Technical Report MTR 98W0000138). Available at: http://www-nlpir.nist.gov/related_projects/tipster_summac/final_rpt.html
Voorhees, E. (2001). The TREC-10 Question Answering Track. In: NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 10). Available from: http://trec.nist.gov/pubs/trec10/papers/qa10.pdf
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval 1: 69-90.
Week 11:
Information Retrieval and the WWW
Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., & Raghavan, S. (2000). Searching the Web. Stanford University Technical Report 2000-37. [Online] Available at http://dbpubs.stanford.edu/pub/2000-37
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM. Ch. 13, Searching the Web, pp. 367-395.
Schwartz, C. (1998). Web search engines. Journal of the American Society for Information Science 49(11): 973-982.
Week 12:
Multimedia Information Retrieval
Del Bimbo, A. (1999). Visual Information Retrieval. San Francisco: Morgan Kaufmann. Ch. 1, Introduction, pp. 1-28 only.
Gupta, A. & Jain, R. (1997). Visual information retrieval. Communications of the ACM 40(5): 70-79.
McNab, R.J. et al. (1996). Towards the digital music library: tune retrieval from acoustic input. In: Proceedings of the 1st ACM International Conference on Digital Libraries, Bethesda, MD, March 20-23, 1996. (New York, NY: ACM). Pp. 11-18.
Wold, E., Blum, T., Keislar, D. and Wheaton, J. (1999). Classification, search, and retrieval of audio. CRC Handbook of Multimedia Computing 1999. Available at: http://www.musclefish.com/crc/crcwin.html
Yeo, B. & Yeung, M. (1997). Retrieving and visualizing video. Communications of the ACM 40(12): 43-52.
Week 13:
Users and Information Retrieval
Hearst, M.A. (1999). Chapter 10, User Interfaces and Visualization. In Modern Information Retrieval (Baeza-Yates, R. & Ribeiro-Neto, B., eds.). New York: ACM. pp. 257-323.
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 7, Multiple reference point systems, pp. 163-189.
Olsen, K.A. et al. (1993). Visualization of a document collection: the VIBE system. Information Processing & Management 29(1): 69-81.
Shneiderman, B. (1998). Designing the User Interface. 3rd ed. Reading, MA: Addison-Wesley. Ch. 15, Information search and visualization, pp. 509-549.
Week 14:
Social Issues in IR
Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley. Ch. 13, The Ectosystem and policy issues, pp. 281-289.