Latest page update: 1997 February 15.

0-1 vector : Any vector each of whose components is either 0 or 1.

absolute term frequency : The raw count of the number of times that a term occurs in a document or document collection.

abstract : Any brief one or two paragraph description of the contents of a document, usually by the author.

ACM : Association for Computing Machinery, a professional society.

ad hoc query : Any query that is asked once, requiring search of an entire database.

adaptive model : Any data compression method in which the encoding changes or adapts as the statistical properties of an individual document are determined.

agglutinative language : any language in which syntactic relationships are expressed by distinct suffixes.

Aho-Corasick algorithm : a string matching algorithm that uses multiple finite state recognizers for simultaneous matching of several substrings.

algorithm : the specification of a method by which an information system accomplishes a given task.

animation : the presence of motion in a document such as a videotape.

ANSI : American National Standards Institute, the U.S. authority for data encoding and other standards; a text encoding standard.

antonym : a word meaning the opposite of a given word.

approximate match : any matching technique retrieving documents that are similar to, but may not exactly match, the query specification.

arithmetic code : any data compression method that represents an entire document by a single number computed adaptively from the frequencies of letters or pixels within the document.

Arithmetic Mean Coefficient : a similarity measure based on the arithmetic mean.

array : any rectangular arrangement of data, usually of numbers.

ASCII : American Standard Code for Information Interchange, a method for encoding alphanumeric data.

ASIS : American Society for Information Science, a professional society.

atomic data : data that are not subdivided into smaller units.

automatic indexing : indexing that is performed according to an algorithm, without human intervention.

average information content : a measure of how much information is contained in a typical message from a given set of messages.
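
One common way to make this measure concrete is Shannon's entropy, the expected number of bits per message. The sketch below assumes that is the measure intended and estimates message probabilities from observed frequencies; the function name is illustrative only.

```python
import math
from collections import Counter

def average_information_content(messages):
    """Return the Shannon entropy, in bits, of the observed message distribution.

    Assumption: each message's probability is estimated by its relative
    frequency in the `messages` sequence, which must be non-empty.
    """
    counts = Counter(messages)
    total = sum(counts.values())
    # Each message contributes -p * log2(p) bits to the average.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Four equally likely messages give 2 bits per message; a set containing a single repeated message gives 0 bits.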

average precision : a value computed by averaging the precision values at several different recall levels, typically three or eleven levels.

average recall : a value computed by averaging the recall values at several different precision levels, typically three or eleven levels.

average similarity : the average of the similarities of document pairs within a collection.


balance : in an indexing method, having the subcollections identified by index terms be of approximately uniform size.

base representation : representation of a document as a vector of numbers related to every term in the vocabulary used for a collection of documents.

basis vector : one of a set of vectors from which all vectors within a given vector space can be defined.

Bead : a visual information retrieval interface using a landscape metaphor.

bibliography : the list of documents cited by a given document, also called a reference list.

bilevel image : any image in which a pixel has only two values, typically black or white.

binary measure : any measure having only two values.

binary search : a search technique that iteratively discards half of a given set in an effort to locate a desired item. The technique requires a sorted set stored as an array.
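
A minimal sketch of the technique, assuming the sorted set is held in a Python list; the function name and the convention of returning -1 for a missing item are choices made for this illustration.

```python
def binary_search(sorted_items, target):
    """Return the index of `target` in `sorted_items`, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1   # discard the lower half
        else:
            high = mid - 1  # discard the upper half
    return -1
```

Each pass halves the remaining range, so a set of n items needs at most about lg n + 1 comparisons.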

BIRD : a visual information retrieval interface utilizing a separator array to effect sequential development of a Boolean query two terms at a time.

bit : the smallest unit of data, having only two possible values.

bit map : a representation of a 0,1-vector in which each component is represented by a single bit. Used to represent a set, with 1 representing an element (from a universal set) that is included in the set and 0 representing an element that is not in the set.
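
As an illustration, a set can be packed into a single integer, one bit per element of the universal set. The helper names below are hypothetical, and the universal set is assumed to be given as an ordered list.

```python
def to_bit_map(subset, universe):
    """Pack `subset` into an integer; bit i is 1 when universe[i] is in the subset."""
    bits = 0
    for position, element in enumerate(universe):
        if element in subset:
            bits |= 1 << position
    return bits

def from_bit_map(bits, universe):
    """Recover the set of elements whose bits are 1."""
    return {element for position, element in enumerate(universe)
            if (bits >> position) & 1}
```

With this representation, set union and intersection become the bitwise OR and AND of the two integers.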

BMG algorithm : see Boyer-Moore-Galil algorithm.

BookHouse : a visual information retrieval interface using a library metaphor.

Boolean algebra : an algebra based on a certain set of arithmetic rules, used for logical computations. The number of elements in a finite Boolean algebra is always a power of 2. The most common Boolean algebra has only two elements, 0 and 1, and differs from ordinary algebra in that 1 + 1 = 1.

Boolean point : in a visual information retrieval interface, any point representing a Boolean combination of reference terms.

Boolean query : any query in which the individual terms are combined with Boolean or logical connectives.

Boolean retrieval system : any retrieval system using Boolean queries.

Boyer-Moore-Galil algorithm : a string matching algorithm based on matching substrings from the right hand end, rather than the left hand end. This is an O(n) algorithm that in the best case may be faster than other O(n) algorithms by a factor of five or more.

branching factor : in a tree or hierarchical file organization, the maximum number of subunits that a given unit can have.

breeding pair : in a genetic algorithm, two variants that are associated for possible crossover operations.

breeding population : in a genetic algorithm, the replicated population from which the population of variants for the next generation is formed.

broader term : in a thesaurus, any term whose interpretation includes a given term and similar or related terms.

Brown corpus : a well-known frequency study of American texts of various types.

byte : a sequence of eight bits, hence a unit of data having 256 possible values.


caption : any brief text describing a figure in a document.

Cassini oval model : a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the product of the distances from the document to each reference point.

cell : in general, an element position in an array. Specifically, one of four element positions in the separator array for BIRD.

characteristic function : a function whose value is 1 for elements of a given set and 0 for elements not in the set. It is used, for example, to separate documents satisfying a query (1) from those not satisfying it (0).

citation index : an index that lists documents citing a given document.

citation processing : any retrieval technique in which documentary citations are traced to identify documents related to a given one.

city block distance : a distance computed using the sum of the absolute values of distance changes in each direction, so called because it counts the number of blocks traversed in moving from one location to another in a city; L1.

classification bin : in BIRD, a bin in which a selected subset of documents can be stored.

cluster point : any point that represents a cluster of documents.

clustered file : any file in which the data elements are organized by a clustering technique.

clustering technique : any technique by which relationships among data elements such as documents are determined and closely related elements are grouped into clusters.

CNF : see conjunctive normal form.

co-citation : the phenomenon of two documents being cited by a given document, used as a measure of similarity of the two documents.

coefficient of association : any measure of similarity between two documents.

co-filter : use of a user profile as a second reference point in conjunction with a query.

collision : in hashing, the situation in which two data items are assigned to the same location.

communication theory : see transmission theory.

component : an individual element in a vector.

concept : an idea within a document, in contrast to the specific terms used to express that idea.

concordance : an inverted index identifying all occurrences of each term within a body of text.

conditional probability : the probability that a given event occurs, assuming that another event has occurred.

Conditional Probability Coefficient : a similarity measure based on conditional probability.

conjunct : in the disjunctive normal form, a group of individual terms joined by AND; in the conjunctive normal form, a group of disjuncts joined by AND.

conjunctive model : a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the maximum of the distances from the document to each reference point.

conjunctive normal form : a standard form for logical expressions, in which individual terms are joined by OR, and such groups of terms are joined by AND.

conjunctive query : any Boolean query using only AND and NOT.

content-bearing words : words that are deemed to relate to the concepts in a document. See also stop list.

content search : any search to locate a record or document having a specific content.

context-dependent : any character or term whose interpretation depends on the context within which it occurs.

context encoding : any encoding method in which the code for a given symbol depends on the context within which it occurs.

contingency table : any array in which the cells represent specific combinations of conditions, often a 2 x 2 table in which each of two conditions may or may not occur.

continuous tone image : any image in which each pixel may have any of a range of values. Typically the value of an individual pixel is closely related to the values of surrounding pixels.

contour : within a document space, the boundary of a region containing documents to be retrieved.

controlled vocabulary : any restricted set of words and phrases that are used to describe documents within a given set.

copyright : the legal right of an individual or corporation to receive credit for, and benefit from, published works.

Cosine Coefficient : a similarity measure based on the cosine of the angle between two documents as represented by term weight vectors.
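
A minimal sketch of the computation, assuming documents are given as equal-length lists of term weights. Returning 0.0 for an all-zero vector is a convention chosen for this illustration, not part of the definition.

```python
import math

def cosine_coefficient(u, v):
    """Return the cosine of the angle between term weight vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

Documents sharing no terms score 0, and documents whose weight vectors point in the same direction score 1 regardless of their lengths.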

cosine measure : the vector angle between two documents, used as a measure of similarity.

coverage ratio : the proportion of the relevant documents known to the user that are actually retrieved.

cross referencing : in a thesaurus, reference to terms related to the given term.

crossover : in a genetic algorithm, any method of exchanging portions of two variants to create two new variants.

crossover rate : in a genetic algorithm, the fraction of breeding pairs that are chosen for a crossover operation.

current awareness system : any information retrieval system in which users are automatically notified of any new documents that may relate to their interests; also called selective dissemination of information, and routing system.


data : the documents received, stored and retrieved by an information endosystem.

data compression : the encoding of data in less than one byte per character for text, and in as little as one or two bits per pixel for images.

data fusion : the merging of search results from several different databases, possibly using several different search techniques.

data model : in data compression, the model, adaptive or static, used to represent the data.

deep structure : the structure of a sentence related to its meaning, independently of the specific syntax used.

default bus : in a finite state recognizer, a bus that is used for any unspecified character.

deleted average similarity : average similarity of documents within a collection, computed on the assumption that occurrences of a given term have been deleted from the computation.

DeMorgan's Laws : logical laws governing the interaction of AND, OR, and NOT.
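
The two laws are usually stated as NOT (A AND B) = (NOT A) OR (NOT B) and NOT (A OR B) = (NOT A) AND (NOT B). The sketch below checks both over every truth assignment; the function name is illustrative.

```python
from itertools import product

def de_morgan_holds():
    """Check both of DeMorgan's Laws for every combination of truth values."""
    for a, b in product([False, True], repeat=2):
        first_law = (not (a and b)) == ((not a) or (not b))
        second_law = (not (a or b)) == ((not a) and (not b))
        if not (first_law and second_law):
            return False
    return True
```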

deterministic : any algorithm or automaton such that for any given set of data each step has only one possible successor.

device : any computer or other tool used to process information.

Dewey Decimal classification : a system of classifying documents according to contents.

Dice's coefficient : a similarity measure developed by Dice.

dimensional compatibility : in an abbreviated vector representation of documents, the concept that a given position must refer to the same term in each of the documents, whether or not that term occurs.

direct file : any file of documents without an index into it.

direct search : any search of each document within a file to locate those containing a given term.

discriminant function : in probabilistic retrieval, a function that determines whether a given document should be retrieved.

disjunct : in the conjunctive normal form, a group of individual terms joined by OR; in the disjunctive normal form, a group of conjuncts joined by OR.

disjunctive model : a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the minimum of the distances from the document to each reference point.

disjunctive normal form : a standard form for logical expressions, in which individual terms are joined by AND, and such groups of terms are joined by OR.

disjunctive query : A Boolean query using only OR and NOT.

dissimilarity measure : any measure in which high values represent documents that are dissimilar and low values such as 0 represent documents that are similar.

distance measure : any measure relating two entities that satisfies certain conditions: zero distance only between an entity and itself, non-negativity, symmetry, and the triangle inequality.

distance space : any space in which documents are positioned according to their distances from given reference points.

DNF : see disjunctive normal form.

doctrine of fair use : the concept in copyright law that limited individual use of any document is permitted without specific permission of the copyright holder.

document : any stored data record in any form.

document analysis : the process of analyzing a scanned document to determine its components such as headings, paragraphs, and figures.

document cluster : any group of related documents.

document-document matrix : an array used to compare documents within a collection according to a given criterion.

document identifier : any number or other code uniquely identifying a document.

document reference number : the identifier by which an information system refers to a document.

document space : a conceptual space in which documents are distributed according to given characteristics, often term occurrences.

document surrogate : any limited representation of a full document.


EBCDIC : Extended Binary Coded Decimal Interchange Code, a method for encoding alphanumeric data, now largely obsolete.

economy : how well the information system meets the economic goals of the funder.

ectosystem : those system factors that are not under control of the designer, including the people who are involved with the system, the forms in which information is available, and the equipment and technology available for the system.

effectiveness : the quality of the information system response to the information need.

efficiency : the time and effort required for the information system to respond to the information need.

eleven-point average : computation of average precision or average recall from the values at eleven recall or precision points, respectively, namely, at 0.0, 0.1, ..., 1.0.

ellipsoidal model : a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the sum of the distances from the document to each reference point.

endosystem : those system factors that the designer can specify and control, such as the equipment, algorithms, and procedures used.

Euclidean distance : ordinary straight line distance; L2.

exact match : any document that exactly matches the terms and criteria in a query.

exclusive OR : interpretation of "or" as meaning either one or the other but not both.

exhaustivity : the extent to which a given set of index terms covers all topics and concepts met in a document set.

expected search length : the average number of documents to be examined to locate a given number of relevant documents.

expert system : any inferential information system built on a knowledge base.

extended Boolean query : see weighted Boolean query.

extended user profile : any user profile that includes user characteristics that cannot be directly related to terms in documents, such as levels of education and experience.

extract : any brief description of a document formed by selecting certain sentences from the document.

extrinsic measure : any document similarity measure that depends on reference to some point independent of the two documents.


failure link : in the KMP algorithm, a link to be taken when a required character is not present.

fallout : the proportion of non-relevant documents that are retrieved.

feedback : the passing of information between components of a system in response to some action of the system.

file : any collection of documents or other entities organized into a single unit.

fine structure : the detailed representation of data, including encoding methods used.

finite state recognizer : any automaton or algorithm that recognizes a given string of characters.

frustration measure : a performance measure based on the positions of nonrelevant documents in the retrieved set.

full disjunctive normal form : a disjunctive normal form in which each conjunct contains all of the terms or their negations.

full document surrogate : any extended representation of a document, possibly including title, author, author's location, source, date, abstract, subject descriptors and categories, and key terms.

full text : the entire text of a document.

funder : the person or organization who underwrites the cost of operating the information system.

fuzzy matching : a matching process based on the concept of a fuzzy set.

fuzzy query : any query based on the concept of a fuzzy set.

fuzzy set : a set for which each element of the space, rather than being in or out of the set, has an associated membership function representing the belief that that element should be in the set.


generality : the proportion of relevant documents within the entire collection.

genetic algorithm : an iterative process of simultaneously solving many variants of a complex problem, leading to the identification of a near-optimal solution to the problem, so called because of the analogy to breeding within a population of plants or animals.

graph theoretic clustering : any clustering method based on concepts from graph theory.

grayscale image : any monochrome image having several levels of intensity, commonly 256 (8 bits per pixel).

gross structure : the extent and type of formatting that the document exhibits.

GUIDO : a visual information retrieval interface based on the distances of a document from given reference points.


hash function : any function used for key-to-address transformation in hashing.

hashed file : any file whose organization is based on a hashing technique.

hashing : any method of assigning items to computer storage where the storage location is computed from the characteristics of the individual items. In theory hashing permits one step access to any data item.
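
A toy sketch of the idea, assuming string keys and a deliberately simple hash function (the sum of character codes modulo the table size). Both function names are hypothetical, and colliding keys are simply kept together in one slot.

```python
def key_to_address(key, table_size):
    """Compute a storage location from the characters of the key itself."""
    return sum(ord(ch) for ch in key) % table_size

def build_hashed_file(keys, table_size):
    """Place each key in the slot its hash selects; colliding keys share a slot."""
    table = [[] for _ in range(table_size)]
    for key in keys:
        table[key_to_address(key, table_size)].append(key)
    return table
```

Because this function ignores character order, the keys "ab" and "ba" collide, which is why one step access holds only in theory.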

hierarchical file : any file whose contents are organized in a hierarchical or tree-like manner.

highly dissimilar documents : documents that conceptually have very little in common.

highly similar documents : documents that are very close conceptually.

hill climbing : any optimization technique that moves at each step toward a better solution to a problem. The flaw in the technique is that such a direct move may miss the best possible solution, which is reachable only by moving away from a good solution, or by starting at a different point.

home page : in the World Wide Web, any site identified with an individual user or organization.

homograph : a word that is spelled the same as a given word, but has a different meaning.

homonym : a word that sounds the same as a given word, but has a different meaning.

HTML : Hypertext Markup Language, a method for encoding the format of a document on the World Wide Web, including hypertext links.

Huffman code : a static data compression code, widely used because of its ease of development and application.

hypertext : a system of linking a portion of a document to related portions of the same or different documents through direct pointers.


ib. : see ibid.

ibid. : a reference in a footnote or endnote to a document cited in the immediately preceding reference.

ibidem : see ibid.

idf : see inverse document frequency.

idiomatic expression : a phrase that has a conventional meaning seemingly unrelated to the literal meaning of the words.

image database : any collection of images organized for information processing and retrieval.

image processing : the processing of an image to store it in a computer or to analyze its contents and components.

inclusive OR : interpretation of "or" as meaning either one or the other or possibly both.

independence value : a base value used in some similarity measures, related to the statistical independence of two terms or documents.

Index of Independence Coefficient : a similarity measure based on statistical independence.

index structure : the organization of an index system, particularly with reference to the number of levels in the index.

index term : any term used to identify a concept in a document.

indexed file : any file with an associated set of index terms.

indexer-user mismatch : the fact that indexers and users may use different terms to denote a given concept.

indexing : the act of creating an index for a document or a document collection.

indexing language : the language, particularly the set of terms, used to create an index.

InfoCrystal : a visual information retrieval interface based on displaying all Boolean combinations of the reference terms.

information : data that have been matched to a particular information need, having both personal and time-dependent components that are not present in the concept of data.

information content : a measure of the information inherent in a given message or document.

information filtering : the concept of utilizing inexpensive techniques to eliminate most of a document collection from further consideration in relation to a given information need.

Information Navigator : a visual information retrieval interface based on the concept of navigating a document space.

information need : the requirement to store information (data) in anticipation of future use, or to find information (data) in response to a current problem.

information retrieval : the location and presentation to a user of information relevant to an information need expressed as a query.

information retrieval system : any system, usually involving computers, that performs information retrieval.

information theory : see transmission theory.

initial bus : in a finite state recognizer, a bus used when the initial symbol of a substring is recognized.

initial state : the starting state for a finite state recognizer.

inner product : a value computed from two vectors by multiplying corresponding components and adding the results.

integrated media document : any document that may contain text, images, and sound.

integrated media system : any information system for handling integrated media documents.

intellectual property right : the right of the creator of a document to recognition and benefit from his or her work.

Internet : the worldwide system linking computers through telecommunications.

intrinsic measure : any similarity measure that refers only to the documents being compared, not to any external point.

inverse document frequency : the logarithm of the reciprocal of the number of documents in a collection that contain a given term.

inverse document frequency weight : a term weight computed from term frequency and inverse document frequency.

inverse relationship : any relationship between two values such that as one value increases the other decreases. Often a relationship in which the product of the two values is constant.

inverted index : any index into a file arranged so that each term in the index directly identifies the documents containing that term.
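
A minimal sketch, assuming documents are supplied as a dictionary from document identifier to text and that terms are lowercased, whitespace-separated words. The function name is illustrative.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}
```

Looking up a term then yields its documents directly, with no scan of the documents themselves.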

inverted file : any file accessible through an inverted index.

iterative affix removal : any stemming method in which compound affixes are removed iteratively. For example, tionally is shortened to tional, then to tion, then removed entirely.
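
The example can be sketched as a loop that strips one simple suffix per pass. The suffix list and the minimum stem length below are assumptions chosen to reproduce the example, not a complete stemmer.

```python
SUFFIXES = ("ly", "al", "tion")  # hypothetical suffix list for this illustration

def strip_suffixes(word, suffixes=SUFFIXES, min_stem=3):
    """Return the successive forms of `word` as suffixes are removed one at a time."""
    steps = [word]
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            # Remove a suffix only if at least `min_stem` characters remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                steps.append(word)
                changed = True
                break  # restart the scan on the shortened word
    return steps
```

For "conditionally" the successive forms are "conditional", "condition", and "condi", so the compound affix "tionally" disappears in three passes.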


Jaccard's Coefficient : a similarity measure developed by Jaccard.

JBIG : Joint Bilevel Image Experts Group, also a standard developed by the group for encoding bilevel images.

JPEG : Joint Photographic Experts Group, also a standard developed by the group for encoding continuous tone images.

judging dilemma : the problem associated with having a fixed range of judgment values. If a given document is assigned a value near the maximum and a subsequent document is clearly superior, that superiority cannot be accurately represented without reassessing all prior documents.


Karp-Rabin algorithm : see Rabin-Karp algorithm.

key : any term used to access or organize a data file; in hashing, any term used to identify an individual data item.

key phrase : a phrase chosen to represent the content of a document.

key-to-address transformation : in hashing, the computation of the address for a data item from its key.

key value : the value of a key.

keyword : one of a set of individual words chosen to represent the content of a document.

KMP algorithm : see Knuth-Morris-Pratt algorithm.

knowledge : information integrated to form a large, coherent view of a portion of reality.

knowledge base : the stored data, algorithms, facts, concepts, and rules that are representative of one or more selected experts in a particular area.

Knuth-Morris-Pratt algorithm : an O(n) substring matching algorithm based on the use of a finite state recognizer.
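
A compact sketch of the algorithm, in which the finite state recognizer is represented by a table of failure links. The function names and the convention of returning -1 when the substring is absent are choices made for this illustration.

```python
def build_failure_links(pattern):
    """failure[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def kmp_find(text, pattern):
    """Return the index of the first occurrence of `pattern` in `text`, or -1."""
    if not pattern:
        return 0
    failure = build_failure_links(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # follow a failure link instead of rescanning text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

The text is never rescanned: on a mismatch only the position in the pattern changes, which is what keeps the running time linear.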


lack of consistency : the fact that a given indexer or information system user may not be consistent in the use of terms over a period of time.

latent semantic indexing : a technique based on multidimensional scaling for identifying the major concepts in a document or document collection.

Law of Double Negation : the logical rule that states that a second negation cancels out a first negation: NOT NOT A is the same as A.

Lempel-Ziv code : see Ziv-Lempel code.

level of compression : the degree to which a document has been compressed.

lexical similarity : document similarity based solely on the occurrence of words.

lg : a notation for the logarithm to the base 2.

Library of Congress classification : a system of classifying documents according to contents.

Linear Correlation Coefficient : a similarity measure based on the linear correlation between two documents.

linear transformation : any transformation defined by a linear equation.

linearly independent vectors : any set of vectors such that no linear combination, other than the one with all coefficients zero, adds up to the zero vector.

link : in documentation, any relationship between two terms; in the KMP algorithm, any path to be followed in a given circumstance.

linked list search : search through items organized into a linked list.

list of references : the documents cited by a given document.

list of terms : the terms used in a given document or document collection.

listing : the process of creating a list of documents in a file.

logic : any method for reasoning about the relationships among terms.

logic of use : the concept that closely related documents should be within the same section of an indexed file.

logical connective : one of the operators, typically AND, OR, and NOT, used to express the cooccurrence relationships among terms.

logical structure : the conceptual organization of a file, in contrast to its physical organization in a storage system.

Lp metric : one of a family of metrics developed using the pth root of the sum of the pth powers of absolute differences of components.
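
A minimal sketch of the family, assuming vectors are non-empty, equal-length lists of numbers. Treating p = infinity as the largest single difference is the usual limiting case; the function name is illustrative.

```python
def lp_distance(u, v, p):
    """Return the Lp distance between vectors u and v (p >= 1, or float('inf'))."""
    differences = [abs(a - b) for a, b in zip(u, v)]
    if p == float("inf"):
        return max(differences)  # limiting case: largest coordinate difference
    return sum(d ** p for d in differences) ** (1.0 / p)
```

With p = 1 this is the city block distance, with p = 2 the Euclidean distance, and with p = infinity the maximal direction distance.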

LSI : see latent semantic indexing.

LyberWorld : a three-dimensional visual information retrieval interface with document placement based on the ratio of similarities of the document to given reference points.


machine readable medium : any document storage medium that can be read by a computer.

Manhattan distance : see city block distance.

manual indexing : indexing that is done directly by a person, rather than by an algorithm.

mapping : if documents and queries are regarded as having distinct forms, the process of transforming a document into an entity that can be matched to the query.

markup language : any notation used to add information to a text about its formatting.

matching : if documents and queries are regarded as having similar forms, the process of identifying documents that are similar to a given query.

matrix : any rectangular array of numbers.

maximal direction distance : a distance measure that utilizes only the largest of the distances in any coordinate direction; Linf.

measure : any numerical value used for evaluation. A measure may be applied to terms, documents, pairs of documents, retrieval systems, and so forth. The measure may be such that the actual value is important, the relative values are important, or only the ordinal values are important.

medium : any material used to store a document.

membership grade : in fuzzy set theory, the degree of belief that a given element belongs to a given set.

metric : see distance measure.

MIDI : Musical Instrument Digital Interface, a standard for encoding music.

minimal perfect hash function : a perfect hash function that utilizes minimal space.

minimization : any technique for determining a Boolean function that is logically equivalent to a given one and contains a minimum number of operator occurrences.

monotone nondecreasing function : a function f of one variable with the property that if x2 > x1 then f(x2) >= f(x1).

MPEG : Moving Picture Experts Group, also a standard developed by the group for encoding motion pictures and video.

multidimensional scaling : a statistical technique for determining the best set of coordinates or dimensions to represent a given set of data.

multimedia document : see integrated media document.

multimedia system : any information system that can process multimedia documents.

mutation : in genetic algorithms, the technique of randomly replacing a parameter value with a new, randomly chosen value.

mutation rate : in genetic algorithms, the rate at which mutations are applied.


narrower term : in a thesaurus, a term that is more specific than a given term. Entities identified by the narrower term are among those identified by the original term.

n-ary measure : any measure that takes on n different values.

natural language processing : text processing that includes syntactic, semantic, and sometimes pragmatic interpretation techniques.

natural language query : any query stated in English or some other natural language.

natural order : an ordering of entities that people commonly use: numerical order for numbers, alphabetic order for letters and words, calendar order for months and days of the week, and so forth.

negative dictionary : see stop list.

NetScape : one of the most widely used systems for accessing the World Wide Web.

netted file : any file whose structure is based on a network of relationships.

noise : transmission errors that corrupt the original signal; any portion of a document that is not appropriate in response to a given information need.

nondeterministic : any algorithm or automaton which may include a step involving several possibilities with no method of deciding among the possibilities.

normalization : in logic, the process of converting a logical expression to a canonical form such as CNF or DNF.

normalized precision : a measure in which precision is normalized against all relevant documents.

normalized recall : a measure in which recall is normalized against all relevant documents.

normalized similarity measure : any similarity measure that has been adjusted so that the similarity of a document to itself is 1.

novelty ratio : the proportion of the relevant retrieved documents that were previously unknown to the user.

n-simplex : an n-dimensional polytope with n+1 vertices; a triangle for n = 2, a tetrahedron for n = 3.


OCR : see optical character recognition.

one point crossover : in genetic algorithms, a breeding technique using one randomly chosen point, interchanging the portions of the two breeding individuals to the right of that point.
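
A minimal sketch of one point crossover, assuming each individual is a list of equal length (an assumption; the function name `one_point_crossover` and its `point` parameter are hypothetical):

```python
import random

def one_point_crossover(parent_a, parent_b, point=None, rng=None):
    """Exchange the portions of two parents to the right of one point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    if point is None:
        if rng is None:
            rng = random.Random()
        # Choose a point strictly inside the individual so both sides are non-empty.
        point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

For example, crossing `[1, 2, 3, 4]` with `[5, 6, 7, 8]` at point 2 yields `[1, 2, 7, 8]` and `[5, 6, 3, 4]`.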

op. cit. : reference in a footnote or endnote to a previously cited document.

operating curve : for an information retrieval system, a curve plotting the fraction of relevant documents retrieved against the fraction of irrelevant documents retrieved, as the system retrieves increasing numbers of documents. In general, a curve plotting one system characteristic against another as the system carries out its function.

opere citato : see op. cit.

optical character recognition : (OCR) any technique for identifying the distinct segments of a scanned document and for converting the textual portions to ASCII or some other text code.

order of precedence : the order in which a given set of operators is to be processed in the absence of parentheses or other indications to the contrary. For arithmetic this is conventionally unary minus (negative number) before multiplication and division before addition and subtraction, with left-to-right ordering among operators of equal precedence. For logic it is conventionally NOT before AND before OR, with left-to-right order among operators of equal precedence.
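
Python happens to follow the conventions described above for these operators, so a short sketch can check them. The function names `conventional` and `explicit` are hypothetical.

```python
from itertools import product

# Arithmetic: unary minus, then multiplication, then addition.
arithmetic_value = -2 * 3 + 4          # parsed as ((-2) * 3) + 4
left_to_right_value = 8 - 3 - 2        # parsed as (8 - 3) - 2

def conventional(a, b, c):
    # No parentheses: relies on NOT before AND before OR.
    return not a and b or c

def explicit(a, b, c):
    # The same expression with the conventional grouping written out.
    return ((not a) and b) or c

# The two forms agree on every row of the truth table.
agree = all(conventional(a, b, c) == explicit(a, b, c)
            for a, b, c in product([False, True], repeat=3))
```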

ordered minimal perfect hash function : a minimal perfect hash function that also preserves the sorted order of a set of entities.

ordered proximity : any proximity measure in which the order of the words is taken into account.

Overlap Coefficient : a similarity measure based on the overlap in terms between two documents.


partially inverted file : any file with an associated index that includes some, but not all, words.

perfect hash function : any hash function that produces no collisions for a given set of data.

pertinence : a measure of how well a document matches an information need.

phrase : a contiguous set of words within a sentence, usually associated with some concept.

piecewise linear transformation : any transformation that is defined by different linear equations over different ranges of its variables.

Piles : a visual information retrieval interface based on the metaphor of piles of papers on a desk.

POI : see point of interest.

point of interest : a reference point, such as a term, phrase, user profile, or known document; abbreviated POI.

post-filter : application of a user profile to a set of retrieved documents to alter its characteristics, either by excluding some documents or by changing the order in which they are presented.

Postscript : a widely used page description language.

pragmatic factor : any factor involving the specifics of an information seeking situation, such as known documents, user background, and time constraints.

precision : the proportion of retrieved documents that are relevant.

precision-recall graph : a graph plotting precision against recall.

pre-coordinated indexing language : any indexing language in which the terms to be used have been chosen a priori, along with the set of terms that each chosen term is to represent.

pre-filter : application of a user profile to a query before retrieval, to alter the characteristics of the query.

prefix property : the property of an encoding system that no code is the prefix of any other code.
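
A minimal sketch of a check for the prefix property, assuming codes are given as strings of symbols such as "0" and "1" (the function name `has_prefix_property` is hypothetical):

```python
def has_prefix_property(codes):
    """Return True if no code in `codes` is a prefix of another code."""
    ordered = sorted(set(codes))
    # In sorted order, a code that is a prefix of any other code is
    # immediately followed by a code that starts with it, so checking
    # adjacent pairs is sufficient.
    return all(not later.startswith(earlier)
               for earlier, later in zip(ordered, ordered[1:]))
```

The set `0, 10, 110, 111` has the prefix property; the set `0, 01, 11` does not, because `0` is a prefix of `01`.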

pretrieval : a technique for attempting to retrieve the one document most relevant to an anticipated information need before a query is posed.

probabilistic matching : a document-query matching based on the probability that the document will satisfy the query.

probabilistic query : any query with term weightings interpreted as probabilities that the given terms will identify relevant documents.

Probability Difference Coefficient I : a similarity measure based on probability.

Probability Difference Coefficient II : a similarity measure based on probability.

process : the operation of a system, or one step in the operation.

product : the result of a process.

Proportion of Overlap Coefficient : a similarity measure based on the overlap in terms between two documents.

proximity : any measure of the nearness of two term occurrences in a document, usually in terms of the number of intervening terms, or of cooccurrence within a sentence.

proximity operator : any function that identifies pairs of terms satisfying specified proximity conditions.

pseudo-metric : any measure that behaves like a metric, except that two distinct entities may be at a "distance" of 0.


QBIC : an image query system based on color, texture, and rough sketches.

query : the formal expression of an information need.

query-profile interaction : the way in which a query and a user profile are used jointly to identify relevant documents.

query variant : in a genetic algorithm, one of a set of query representations differing only in the term weights.


Rabin-Karp algorithm : a substring matching algorithm based on a hash function, running in O(n) time on average for a text of length n.
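
A minimal sketch of the Rabin-Karp idea, using a rolling polynomial hash. The `base` and `modulus` values are illustrative assumptions, not part of the definition.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the starting positions of every occurrence of `pattern` in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)   # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    matches = []
    for i in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % modulus
    return matches
```

Each window hash is derived from the previous one in constant time, which is what keeps the average cost linear in the length of the text.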

range match : any matching process in which a specified range of values, such as numbers or names, is acceptable.

ReadingRoom : a visual information retrieval interface presenting a self-organizing semantic map of a document set.

recall : the proportion of relevant documents that are retrieved.
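
Recall and precision (defined earlier) can be computed together from the retrieved set and the relevant set. A minimal sketch, assuming documents are represented by hashable identifiers; the function name `precision_recall` is hypothetical, as is the choice to return 0.0 for an empty set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If four documents are retrieved, two of them relevant, and three relevant documents exist in all, precision is 2/4 and recall is 2/3.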

recall effort : the ratio of the number of relevant documents desired to the number of documents the user must examine to find that many relevant documents.

record : any individual entity within a file.

record number : the identifier for a record.

record size : the size of a record, usually in bytes or computer words.

rectangular distance : see city block distance.

Rectangular Distance Coefficient : a similarity measure based on the rectangular or city block distance between two documents.

reference list : see bibliography.

reference point : any point by which a document can be judged; also called a point of interest (POI).

related term : in a thesaurus, a term whose meaning bears some relationship to that of a given term.

relative recall : the ratio of the relevant retrieved documents examined by the user to the number of documents the user would have liked to examine.

relative term frequency : term frequency normalized by the length of a document or by the number of documents in a collection.

relevance : a measure of how well a document matches a query.

relevance feedback : an iterative process in which the user indicates the relevance of documents within a sample retrieval and the system utilizes this information to modify the query.

relevant document : any document that matches a query according to some specified measure.

replication : in a genetic algorithm, the process of duplicating the best of the query variants in one generation in preparation for defining the succeeding generation.

retrieval : see information retrieval.

retrospective search system : any search system that responds to an ad hoc query by searching the entire database for relevant documents.

review : any brief description of a document, written by someone other than the author, often including critical remarks relating the document to the literature of an area.

Rich Text Format : see RTF.

role : the way in which a given term is used in a document.

routing query : any query that is permanently on file, to be matched against any new documents entered into the system.

routing system : see current awareness system.

RTF : Rich Text Format, a method for encoding textual data.

run length encoding : an encoding method based on the lengths of sequences of characters in a document. Most often used for bilevel image encoding, where the characters are individual black or white pixels.
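
A minimal sketch of run length encoding over a string, in which each run is stored as a (symbol, length) pair. The pair representation and the function names are assumptions made for illustration.

```python
from itertools import groupby

def rle_encode(data):
    """Encode a string as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Rebuild the original string from (symbol, run length) pairs."""
    return "".join(symbol * count for symbol, count in pairs)
```

For a bilevel image row written as `WWWBBW` (W for white, B for black), the encoding is three white, two black, one white.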


satisfaction measure : a performance measure based on the positions of relevant documents in the retrieved set.

scaling effect : the effect that increasing the size of a database has on information processing efficiency and effectiveness.

scanner : a device for generating a digital image of a document and entering it into a computer or telecommunication system.

Scatter/Gather : a visual information retrieval interface based on the metaphor of iteratively scattering documents into groups, gathering desired documents together, and rescattering them into new groups.

search technique : any method used to locate a record within a file.

second order filing system : any two-level filing system with the entities in each main file being files.

see reference : in a thesaurus, a reference to a term that is used in place of the given term.

see also reference : in a thesaurus, a reference to a related term.

segmentation : in optical character recognition, the process of identifying the distinct portions of a document, such as headings, sections, paragraphs, and figures.

selective dissemination of information : see current awareness system.

semantic analysis : analysis of a text to determine its meaning.

semantic structure : the structure of a sentence based on its meaning, rather than on the order in which the words appear in the sentence.

semantics : the study of the meaning of text.

semi-static model : in data compression, any model that is basically static, but is reinitiated periodically to better fit the data.

separable verb : in German, any verb whose prefix can be separated from the main portion of the verb, often appearing at the end of the sentence.

Separation Coefficient : a similarity measure based on the proportion of words that are unique to each of two documents.

separator array : in BIRD, the cell array used to separate documents according to the presence or absence of two specified terms.

sequential file : any file whose order is sequential.

sequential search : a search technique that begins at the beginning of a file and sequentially processes the entities in the file.

server : any information professional who operates the system or provides service to the users.

set theoretic clustering : any clustering method based on set theory.

SGML : Standard Generalized Markup Language, a widely used markup language.

short reference : a document reference that includes only title, author, and source.

signal : the bit stream or electromagnetic wave form transmitted from one place to another during information processing.

signal-to-noise ratio : a method of weighting term frequencies based on information content.

signature : any function, often a short string of bits, designed to characterize a particular string or other textual element.

similarity : any comparison of two documents, or a document and a query, to determine how much they relate to the same concepts or information need.

similarity measure : any measure of similarity.

simple linear transformation : see linear transformation.

simple user profile : any user profile stated entirely in terms of keywords.

sliding scale : a measure of recall based on the retrieval of a specific set of documents.

sound processing : the processing of voice, music, and other sound data.

sparse matrix : any matrix containing many zeros or empty cells. Typically fewer than 10% of the cells contain non-zero data.

special index : any index of special characteristics of a text, such as figures, cited authors, or mathematical proofs.

specificity : the depth of coverage of an index. The extent to which specific topic ideas are indexed in detail.

spine : in a finite state recognizer, the sequence of states that recognizes a correct string of characters.

standards : agreed upon rules for the specification, design, and development of entities within a given class.

state : in a finite state recognizer, one of several conditions of the automaton that correspond to recognition of specific sequences of characters.

state transition diagram : a diagram showing the states of a finite state recognizer and the conditions for switching from one state to another.

state transition table : a table showing the states of a finite state recognizer and the conditions for switching from one state to another.

static model : any data compression method using an encoding that is fixed a priori and does not adapt to the characteristics of an individual document.

statistical clustering : any method of clustering based on statistics.

stemming : the removal of suffixes, and sometimes prefixes, from words to arrive at a core that can represent any of a set of related words.

stemming algorithm : any algorithm to perform stemming.

stop list : any list of words to be ignored in information processing. A stop list usually contains the most common words in text, and may include from 15 to approximately 500 words; also called a negative dictionary.

storage structure : the way in which a file is stored in computer memory, in contrast to the logical or conceptual structure of the file.

string-to-string correction : the process of identifying a minimal set of changes necessary to replace one string of characters by another.
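
The size of a minimal set of changes can be computed by dynamic programming. The sketch below assumes the allowed changes are single-character insertions, deletions, and substitutions (the Levenshtein distance); the glossary does not specify the change set, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(source, target):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions that turn `source` into `target`."""
    # previous[j] holds the distance from the processed prefix of
    # `source` to the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # delete s_char
                               current[j - 1] + 1,       # insert t_char
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]
```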

summary : a section at the end of a document providing an overview of the document contents.

Swets' E measure : a measure of effectiveness for an information retrieval system, proposed by Swets and based on the operating curve.

synchronization point : in data compression, a point at which a semi-static encoding is reinitiated.

synonym : any word having the same meaning as a given word.

syntactic ambiguity : the property of natural language that a given sentence may be parsed in two or more distinct ways.

syntactic analysis : analysis of a text to determine its syntactic structure.

syntactic structure : the textual structure imposed by the syntax of a language.

syntax : linguistic rules for composing well-formed sentences.


term : any word or phrase having a distinct meaning.

term discrimination value : a measure of how much a given term contributes to separating a set of documents into distinct subsets.

term-document matrix : an array matching terms to documents, usually containing similarity values.

term-term matrix : an array matching term to term, usually related to documents that contain each pair of terms.

Text Relationship Map : a visual information retrieval interface based on identifying and matching the occurrences of terms within documents.

text tiling : the process of dividing a text into paragraphs or other units, and identifying the occurrences of terms within these units.

tf/idf : a term weighting based on term frequency and inverse document frequency.
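
Many tf/idf variants exist. As a sketch of one common form, the code below assumes the weight of a term in a document is its absolute term frequency multiplied by log(N / n), where N is the number of documents and n is the number of documents containing the term. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document.
    Each document is a list of terms."""
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights
```

Under this form, a term that occurs in every document receives weight 0, since it does nothing to distinguish one document from another.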

thesaurus : a document identifying relationships among terms. Classically these relationships are based on term meanings, but they can also be based on term cooccurrences.

three-point average : computation of average precision or average recall from the values at three recall or precision points, respectively, usually 0.25, 0.5, and 0.75, or 0.2, 0.5, and 0.8.

threshold : a set value of a similarity or other measure. Documents for which the measure value is below the threshold will not be considered.

TileBars : a visual information retrieval interface based on text tiling.

topicality : the extent to which a document relates to a given topic.

total measure : a weighted sum of the satisfaction and frustration measures.

transmission theory : a mathematical theory of signal processing.

TREC : Text REtrieval Conference, an on-going series of information retrieval experiments with very large databases, multimedia databases, and multilanguage databases, involving research groups around the world.

tree-structured file : any file whose conceptual structure is a tree or hierarchy.

triangle inequality : the statement that the sum of the lengths of any two sides of a triangle is at least equal to the length of the third side. A key property of metrics or distance measures.

trie : from retrieval, a tree structure whose vertices are letters in the words of a vocabulary. Used for rapid matching of words in a text.
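
A minimal sketch of a trie built from nested dictionaries, with one level per letter. The end-of-word marker `END` and the function names are assumptions made for illustration.

```python
END = "$end"   # hypothetical marker; longer than one letter, so it cannot clash

def build_trie(words):
    """Build a trie in which each dictionary key is one letter of a word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True   # a complete vocabulary word ends here
    return root

def contains(trie, word):
    """Return True if `word` is a complete word stored in the trie."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return END in node
```

Looking up a word takes one step per letter, regardless of the size of the vocabulary.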

trigger phrase : any phrase in a text that identifies specific features of the text, such as figures, examples, or conclusions.

truth table : an array used for representing or determining the truth or falsity of a logical proposition.

two-point crossover : in a genetic algorithm, a breeding technique using two randomly chosen points, interchanging the portions of the two breeding individuals between the two points.
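
A minimal sketch of two-point crossover, assuming each individual is a list of equal length (an assumption; the function name `two_point_crossover` and its `points` parameter are hypothetical):

```python
import random

def two_point_crossover(parent_a, parent_b, points=None, rng=None):
    """Exchange the portions of two parents lying between two points."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 1:
        raise ValueError("parents must be non-empty and of equal length")
    if points is None:
        if rng is None:
            rng = random.Random()
        # Two distinct cut positions, in increasing order.
        points = sorted(rng.sample(range(len(parent_a) + 1), 2))
    i, j = points
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```

For example, crossing `[1, 2, 3, 4, 5]` with `[6, 7, 8, 9, 10]` at points 1 and 3 yields `[1, 7, 8, 4, 5]` and `[6, 2, 3, 9, 10]`.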


uncontrolled vocabulary : the use of unrestricted terms in indexing.

uniform crossover : in a genetic algorithm, a breeding technique in which it is randomly decided for each element in one of a breeding pair of individuals whether it should be switched with the corresponding element in the other vector in the breeding pair.
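
A minimal sketch of uniform crossover, assuming each individual is a list of equal length and each element is switched with probability 0.5 by default. Both are assumptions; the function name `uniform_crossover` and the `swap_probability` parameter are hypothetical.

```python
import random

def uniform_crossover(parent_a, parent_b, swap_probability=0.5, rng=None):
    """Decide independently for each position whether to switch the
    corresponding elements of the two parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    if rng is None:
        rng = random.Random()
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_probability:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```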

unit circle : the set of points at distance 1 from a given point.

updating : the process of changing entries in a file to make them current.

usefulness : the concept that a retrieved document may be relevant to an information need other than the present one, or that a document relevant to the present need may not be useful since the information in it is already known.

user : the person who either wishes to store information in the system, or to retrieve information from the system.

user-oriented measure : any measure taking into account the individual situation and characteristics of the user, in contrast to one that is uniform for all users.

user profile : any description of the user's interests and background in relation to the information need.


vector : any ordered list of elements, in information retrieval either terms or term weights. The elements are called components.

Vector Angle Coefficient : a similarity measure based on the angle between two documents.

vector model : any retrieval model based on viewing documents and queries as term or term weight vectors.
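
In a vector model, a document and a query can be compared by the angle between their weight vectors. As a sketch, the code below uses the cosine of that angle, one widely used choice; whether it matches the exact formula behind any coefficient named in this glossary is an assumption. The function name `cosine_similarity` is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no terms in common score 0.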

vector of terms : any vector whose components are terms.

VIBE : a visual information retrieval interface based on the ratios of similarities between a document and multiple reference points.

view : any abstracted and organized subset of data.

VIRI : see visual information retrieval interface.

visual information retrieval interface : any 2- or 3-dimensional graphical display showing some of the relationships among documents, or among terms within a given document; abbreviated VIRI.

vocabulary : the set of words used in an information system.

VR-VIBE : a visual information retrieval interface based on VIBE, but introducing a third dimension and virtual reality techniques.


weakly ordered set : any set in which there is an order relation defined between some, but not all, elements. The set can be partitioned into subsets each consisting of elements that have no order relationship among themselves.

weight vector : any vector whose components are term weights.

weighted Boolean query : a modified Boolean query with weights applied to the terms or the Boolean operators.

weighting of terms : the assignment of numerical values to terms, representing their importance in a document or query.

wild card : in a character string representation, a special character for which one or more arbitrary characters may be substituted.

wisdom : a broad view, encompassing all of known reality, governing the use of the information that has been obtained and the knowledge that has been developed, and involving the capacity to make balanced judgments in the light of certain value criteria.

word order : the order in which words occur in a phrase or sentence.

World Wide Web : an outgrowth of the Internet permitting individuals and organizations to make information publicly available, and to access information that others have made publicly available.

WWW : see World Wide Web.


Yule Auxiliary Quantity : a similarity measure developed by Yule.

Yule Coefficient of Colligation : a similarity measure developed by Yule.


Zipf's law : the observation, due to Zipf, that the frequency with which a term occurs in a document collection and its position in a frequency-ranked list of words are approximately inversely related.

Ziv-Lempel code : in data compression, an encoding method in which each occurrence of a substring is encoded by a pointer to a prior substring plus an additional character. Also called Lempel-Ziv code.
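
The description above resembles the variant commonly known as LZ78. A minimal sketch under that assumption, in which each output pair holds the index of a previously seen substring (0 for the empty string) and one additional character; the function names are hypothetical.

```python
def lz78_encode(text):
    """Encode `text` as (index of prior substring, next character) pairs."""
    dictionary = {"": 0}   # substrings seen so far, numbered in order
    output = []
    phrase = ""
    for char in text:
        if phrase + char in dictionary:
            phrase += char                     # keep extending the match
        else:
            output.append((dictionary[phrase], char))
            dictionary[phrase + char] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended in the middle of a known substring.
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the original text from the pairs produced by lz78_encode."""
    phrases = [""]
    pieces = []
    for index, char in pairs:
        phrase = phrases[index] + char
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)
```

Encoding `abababa` gives the pairs (0, a), (0, b), (1, b), (3, a), which stand for the substrings `a`, `b`, `ab`, and `aba`.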