Evaluation of Text, Numeric and Graphical Presentations for Information Retrieval Interfaces:
User Preference and Task Performance Measures

Emile L. Morse
elm2@sis.pitt.edu

Michael Lewis
ml@sis.pitt.edu

Robert R. Korfhage
korfhage@sis.pitt.edu

Kai Olsen
kai.olsen@himolde.no

Department of Information Science and Telecommunications
Visual Information Retrieval Interface Research Group
University of Pittsburgh
Pittsburgh, PA 15260 USA

ABSTRACT

Information retrieval has a long history of dealing with printed materials. More recent work has involved the development of experimental visual interfaces to support users' attempts to access appropriate documents. This research has matured to the point that usability studies and evaluation of approaches to information visualization are needed to guide further development. The reported studies examine the use of alternative document visualizations in tightly controlled settings.

Five types of interface representations were defined: ordered text, ordered icons, a table format, an x-y graph format and a novel spring-based visualization. To assess the relative utility of the various interfaces, we applied two kinds of measures: performance on information retrieval tasks and user preference rankings of the interfaces.

The results show that performance is strikingly different across the range of interface types, with the ordered icon list and text list producing the best results. Users' preferences, however, indicated that the textual format was the least desirable, while both of the visualization methods, i.e., icon list and spring-based visual, were preferred. We conclude that performance is more easily and accurately measured and that user preferences cannot be used alone to determine the utility of interfaces.

  1. INTRODUCTION

    An active area of information retrieval research concerns the development of alternatives to text lists for displaying the results of user-generated queries. Many experimental interfaces have been implemented that use visual methods for presenting results. The data underlying these systems is a vector for each document in a collection, in which the elements of the vector represent key terms. In Boolean systems the values are zero or one. The model is easily extended to handle weights, which might be assigned by a human indexer or, more likely, by a computer program that counts word or phrase frequencies. Other models, such as probabilistic and fuzzy logic models, can be viewed as variants of the basic vector representation, since the weights may be determined using a different algorithm but the outcome is still a vector of term weights.
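The vector representation described above can be illustrated with a minimal Python sketch. The function name `term_vector`, the whitespace tokenization and the example sentence are assumptions made for illustration only ('river' is an invented second term); they are not taken from any of the systems discussed.

```python
from collections import Counter

def term_vector(text, key_terms, boolean=False):
    """Return one value per key term: 0/1 if boolean, else the raw frequency."""
    # Naive tokenization (lowercase, split on whitespace) is an assumption of this sketch.
    counts = Counter(text.lower().split())
    if boolean:
        return [1 if counts[t] > 0 else 0 for t in key_terms]
    return [counts[t] for t in key_terms]

key_terms = ["bank", "river"]                      # hypothetical key terms
doc = "The river bank flooded when the river rose"

print(term_vector(doc, key_terms, boolean=True))   # [1, 1]
print(term_vector(doc, key_terms))                 # [1, 2]
```

A probabilistic or fuzzy model would replace only the weighting step; the result would still be one number per key term.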

    The experimental visualization systems are built from a set of primitive components, including documents and terms as objects; relationships among the objects are variously expressed as distances between them or as differences in attributes such as color, size or shape. Examples of visualizations for information retrieval include BEAD [2], InfoCrystal [7], SOM [4], VIBE [6], VR-VIBE [1] and most recently WebVIBE [5]. BEAD employs a landscape view of an entire document collection. InfoCrystal uses a crystalline lattice structure to represent the relationships of documents by their placement in a web of key terms. SOM creates self-organizing maps of a document collection. Olsen et al. invented VIBE at the University of Pittsburgh. In this system, terms are represented as nodes in a 2-dimensional space and documents are placed in the resulting convex hull according to their relative attraction to the terms. VR-VIBE is a 3-dimensional extension developed at the University of Nottingham. WebVIBE is a Java applet that is a direct descendant of VIBE and was created by removing all but the essential features of VIBE [5]. Its goal is to present the results obtained from existing search engines such as Lycos, Infoseek, or Excite. In these situations many documents, sometimes exceeding tens of thousands, are frequently presented to a bewildered user in a text list ordered by some mystical ranking algorithm. Figure 1 shows a WebVIBE display of a three-term query. The user can explore the piles of documents and, in particular, can see the relationships among the documents and the query terms.
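One simple way to read "placed in the convex hull according to relative attraction" is as a weighted average of the term-node coordinates, with the document's term weights as the weights. The sketch below works under that assumption; it is not the VIBE source code, and the function name `place_document` and the node coordinates are hypothetical.

```python
def place_document(weights, term_positions):
    """Weighted average of term-node coordinates, weighted by the document's term weights."""
    total = sum(weights)
    if total == 0:
        return None  # attracted to no term; this sketch defines no position for it
    x = sum(w * px for w, (px, _) in zip(weights, term_positions)) / total
    y = sum(w * py for w, (_, py) in zip(weights, term_positions)) / total
    return (x, y)

# Hypothetical screen positions for the nodes of a three-term query.
term_positions = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]

print(place_document([1, 1, 0], term_positions))  # (2.0, 0.0): midway between terms 1 and 2
print(place_document([0, 0, 5], term_positions))  # (2.0, 3.0): directly on term 3
```

Because the result is a convex combination of the node positions, every document with at least one nonzero weight falls inside the hull spanned by the terms, and documents with identical weight ratios pile up at the same point.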

    Previous studies of the utility of visual search interfaces are limited. Koshman [3] tested VIBE against a text-based interface (AskSam) with a group of experienced librarian searchers and a group of novices. She found that both groups preferred the text-based system. WebVIBE has only been tested in the context of usability [5] and has not yet been subjected to comparative testing.

  2. METHODS

The 218 subjects for this study were members of undergraduate courses at the University of Pittsburgh or at Molde College, Molde, Norway. The test was administered as a paper-and-pencil exercise during a normal class meeting. Subjects were given a packet containing instructions for completing the experiment, a randomly ordered set of 5 presentation types and a post-test questionnaire. The instructions were read aloud to each class before the booklets were opened. Subjects were instructed to refrain from changing answers on a page after they had flipped to the next page. This constraint was applied to make learning effects easier to detect over the course of the repeated presentation of questions. No restriction was placed on the amount of time for completing the test, but most subjects handed in their booklets in 10-15 minutes.

Approximately half of the subjects received additional explanation of the various interfaces. The information provided was limited to a preview of each type using dummy data, e.g., X and Y rather than actual terms and A and B rather than numeric values.

The five kinds of presentations were: text, icon list, table, graph, and spring display. Figure 2 shows an example of each of these test conditions. Text is ordered so that the items at the head of the list contain both terms, followed by items containing term X but not Y, and the tail of the list contains items with Y but not X. The icon list is presented in the same order as the text; dark shading indicates the presence of a term and white indicates its absence. The table presents counts of documents containing the various combinations of terms. The graph display plots term X along the X-axis and term Y along the Y-axis. The spring display, also called a VIBE display [5], is based on a model in which documents are placed in a display according to the amount of attraction that the document has for the terms placed at the ends of the line segment. In this 2-term instance, documents that are about the term 'bank' are counted at the end of the line labeled 'bank'. Documents that are about both terms are counted at the middle of the segment.
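The list ordering, the table counts and the two-term spring display described above can be sketched as follows. The item labels and their term assignments are made up for illustration, and the helper names `group` and `spring_position` are hypothetical; the actual test materials are those shown in Figure 2.

```python
from collections import Counter

# Hypothetical items: (label, contains X, contains Y)
items = [("d1", False, True), ("d2", True, True), ("d3", True, False),
         ("d4", True, True), ("d5", False, True)]

GROUP_NAMES = ["X and Y", "X only", "Y only", "neither"]

def group(has_x, has_y):
    """0 = both terms, 1 = X but not Y, 2 = Y but not X, 3 = neither."""
    if has_x and has_y:
        return 0
    if has_x:
        return 1
    if has_y:
        return 2
    return 3

# Text and icon lists: both terms first, then X only, then Y only.
ordered = [label for label, x, y in sorted(items, key=lambda it: group(it[1], it[2]))]
print(ordered)  # ['d2', 'd4', 'd3', 'd1', 'd5']

# Table display: counts of items for each combination of terms.
table_counts = Counter(GROUP_NAMES[group(x, y)] for _, x, y in items)
print(dict(table_counts))  # {'Y only': 2, 'X and Y': 2, 'X only': 1}

def spring_position(has_x, has_y):
    """Position on the X-Y segment: 0.0 = X end, 0.5 = midpoint, 1.0 = Y end."""
    total = int(has_x) + int(has_y)
    return None if total == 0 else int(has_y) / total

# Spring display: number of items piled at each point of the segment.
spring_counts = Counter(spring_position(x, y) for _, x, y in items)
print(dict(spring_counts))  # {1.0: 2, 0.5: 2, 0.0: 1}
```

With Boolean term values only three positions occur on the segment, which is why the spring display reduces to three piles of documents in the two-term case.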

For each type of presentation the subject was required to answer two questions.

  A. Circle the item(s) that contain term X and Y.
  B. How many items contain the term X?

After all five interfaces had been seen and used, each subject was asked to produce three separate rankings of the interfaces.

The primary measures of the study are performance and preference. Performance is measured as the number of correct answers to the questions for each display type. In general, the preference results concentrate on the subjects' top choice in each ranking category.
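The performance measure can be sketched as a simple count of correct answers, giving a per-display score of 0 to 2 and a total score of 0 to 10 across the five displays. The data layout, the question labels 'A' and 'B', the answer key and the all-or-nothing scoring of the circling question are assumptions of this sketch, not details reported by the study.

```python
DISPLAYS = ["text", "icons", "table", "graph", "spring"]
QUESTIONS = ["A", "B"]

def score_subject(responses, answer_key):
    """Return (correct answers per display, total correct answers)."""
    per_display = {
        d: sum(responses[d][q] == answer_key[d][q] for q in QUESTIONS)
        for d in DISPLAYS
    }
    return per_display, sum(per_display.values())

# Hypothetical answer key: the same correct answers for every display.
answer_key = {d: {"A": {"d2", "d4"}, "B": 3} for d in DISPLAYS}

# Hypothetical subject who errs on one question for each of two displays.
responses = {d: {"A": {"d2", "d4"}, "B": 3} for d in DISPLAYS}
responses["spring"]["B"] = 2          # wrong count
responses["graph"]["A"] = {"d2"}      # one item missed, scored as incorrect

per_display, total = score_subject(responses, answer_key)
print(per_display)  # {'text': 2, 'icons': 2, 'table': 2, 'graph': 1, 'spring': 1}
print(total)        # 8
```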

 

  3. RESULTS

    Covariates

    To determine whether any of the factors probed in the post-test questionnaire might have confounding effects on the study design, we analyzed the data for covariate effects. Neither overall performance, measured as total correct answers, nor display performance, measured as the number of correct answers per presentation type, was affected by gender (Table 1), age (Table 2), amount of prior computer experience (not shown) or current year in academic program (Table 3).

    Table 1: Performance by gender

                                Male (n=121)    Female (n=93)
    Total Score (mean ± SEM)    6.60 ± 0.25     6.00 ± 0.31
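For readers unfamiliar with the "mean ± SEM" notation used in the tables, the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below uses made-up scores, not data from the study, and the function name `mean_sem` is hypothetical.

```python
import math
import statistics

def mean_sem(scores):
    """Return (mean, SEM), where SEM = sample standard deviation / sqrt(n)."""
    n = len(scores)
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(n)

scores = [4, 6, 6, 8]   # made-up total scores, not data from the study
m, sem = mean_sem(scores)
print(f"{m:.2f} ± {sem:.2f}")   # 6.00 ± 0.82
```

Because the SEM shrinks with the square root of the group size, small groups tend to show wider intervals than large ones.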

    Table 2: Performance by age range (mean ± SEM)

    Age (yr.)    N      Total Score
    <23          118    6.41 ± 0.25
    23-30        71     6.17 ± 0.37
    >30          21     6.48 ± 0.71

    Table 3: Performance by level of academic achievement (mean ± SEM)

    Year in Academic Program    N     Total Score
    1                           71    6.61 ± 0.30
    2                           19    5.63 ± 0.33
    3                           43    6.12 ± 0.46
    4                           72    6.47 ± 0.33
    Graduate                    9     5.67 ± 1.13

    Initial analysis of performance showed a significant effect for country (U.S. vs. Norway); Norwegian students scored higher on all displays except the 'table', for which performance was equivalent in both groups. Once native language was factored in, the difference between test sites disappeared (Table 4). The explanation is that the international students, who made up a relatively high proportion of the Pittsburgh sample, performed significantly more poorly than the native English speakers. The Norwegian sample did not include any non-native speakers of Norwegian.

    Table 4: Total performance based on country of origin

    Test Site    Native Language    N      Total Score
    Norway       Norwegian          75     6.85 ± 0.26
    America      English            114    6.41 ± 0.29
                 Non-English        29     4.62 ± 0.54

    Question Difficulty

    The tasks that the subjects were asked to perform were chosen to represent two of the Boolean combinations that are possible with a 2-term query. Question A corresponds to the logical AND-ing of the terms. Question B simply asks about the presence of a single term. For every presentation type, Question A was answered correctly more often than Question B. The overall performance by question type is shown in Figure 3.
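In Boolean terms, the two tasks amount to the following operations on the item set. The items and their term assignments are invented for illustration, and the variable names are hypothetical.

```python
# Hypothetical items mapped to the set of query terms each one contains.
items = {"d1": {"Y"}, "d2": {"X", "Y"}, "d3": {"X"}, "d4": {"X", "Y"}, "d5": {"Y"}}

# Question A: which items contain X AND Y? (logical AND of the two terms)
question_a = sorted(name for name, terms in items.items() if {"X", "Y"} <= terms)

# Question B: how many items contain X? (presence of a single term,
# which includes the items that also contain Y)
question_b = sum(1 for terms in items.values() if "X" in terms)

print(question_a)  # ['d2', 'd4']
print(question_b)  # 3
```

Note that answering Question B requires combining two groups of items (X and Y, and X only), whereas Question A involves a single group.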

     

    Effect of Instruction

    The groups of subjects that received an abstract overview of the study performed significantly better than subjects who received only logistical instructions. This was true regardless of question type as shown in Figure 4.

    Order of Presentation

    The order of the presentations was randomized to control for order effects. Our results show that a significant amount of learning occurred during the trials. Figure 5 shows that the interfaces that were poorest with respect to performance, i.e., spring model and graph, became more useful if they were presented later in the sequence.

    User Preferences

    Subjects generated three separate rankings of the five presentations. Comparison of the three rankings, however, revealed no significant difference among them. Perhaps the distinction between the question types was not sufficiently salient to the subjects. The 'best' and 'worst' rankings for the 'overall' data are shown in Table 5.

    Table 5: User preferences for presentation styles

                     Best      Worst
    Text             13.5%     47.3%
    Icons            33.3%     8.2%
    Table            15.3%     15.5%
    Spring (VIBE)    28.8%     9.1%
    Graph            9.0%      20.0%

    The most striking finding is that over 60% of users preferred the visual methods, i.e., icon list and spring displays. It is also interesting to note that although performance was superior with the 'text' interface, users disliked it.
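The "over 60%" figure follows directly from Table 5 by adding the 'Best' shares of the two visual displays. The short check below uses only the percentages reported in the table; the variable names are hypothetical.

```python
# 'Best' and 'Worst' percentages as reported in Table 5.
best = {"Text": 13.5, "Icons": 33.3, "Table": 15.3, "Spring (VIBE)": 28.8, "Graph": 9.0}
worst = {"Text": 47.3, "Icons": 8.2, "Table": 15.5, "Spring (VIBE)": 9.1, "Graph": 20.0}

visual = ["Icons", "Spring (VIBE)"]
visual_best = round(sum(best[k] for k in visual), 1)

print(visual_best)                  # 62.1
print(max(best, key=best.get))      # Icons
print(max(worst, key=worst.get))    # Text
```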

    Relationship of performance and preference

    The data were analyzed to determine whether users preferred interfaces with which they were successful over those with which they experienced problems. No patterns were detected that could support this hypothesis. Subjects appeared to prefer interfaces independent of the ability of the interface to support enhanced performance.

  4. CONCLUSIONS

The results of the current study with respect to user preferences for visual displays stand in contrast to previous studies [3]. There are many potential explanations for this difference including a changing user population that is more willing to expect visual interfaces. Other influences might be related to developments in hardware technology that support faster rendering and encourage novel visualization development.

The second key finding of this study is that learning of novel interfaces can be enhanced by several methods. First, we found that a short overview was sufficient to enhance performance on both easy and hard information retrieval tasks. Second, learning could be transferred from one interface to another. Subjects who were presented initially with interfaces such as the 'graph' or the 'spring' performed poorly, but the same interface presented later in the series was easily managed. Taken together, the facts that users prefer visual interfaces and that such interfaces can be learned easily are encouraging.

Aside from user preferences and performance, there are significant findings in this study related to the overall testing paradigm. They relate to the set of interface types, the set of Boolean tasks, and the choice of performance and preference as measures. The display types were chosen to represent a range of presentation methods, each of which is used in one or more information retrieval systems. Differences between types were highly significant, while variation within a particular display was relatively low.

Similarly, the Boolean tasks used in the study showed salient differences between the tasks but little variation within a task. Extension of the current study to address retrieval using queries composed of more terms will allow exploration of a larger number of possible Boolean queries.

Finally, the use of both performance and preference as measures appears to be a reasonable approach. They are not predictive of one another but rather complement and extend each other.

Taken together, our results validate the evaluation approach that we used. Areas for future investigation include increasing the difficulty of the display.

Another intriguing area for exploration would be to create controlled training scenarios to discover the best ways of teaching visualization understanding and interaction.

  5. REFERENCES

  1. Benford, S.D., D. Snowdon, C. Greenhalgh, R. Ingram, I. Knox and C. Brown. 1995. VR-VIBE: a virtual environment for co-operative information retrieval. Eurographics '95, 30th August - 1st September, Maastricht, The Netherlands, 349-360.
  2. Chalmers, M. 1993. Using a landscape to represent a corpus of documents. Proceedings of COSIT '93, Elba. Springer-Verlag, 377-390.
  3. Koshman, S. 1996. VIBE Usability: an Investigation into Visualization Techniques for Information Retrieval. Dissertation. University of Pittsburgh.
  4. Lin, X. 1991. A self-organizing semantic map for information retrieval. Proceedings for the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Oct. 13-16; Chicago, IL), 262-269.
  5. Morse, E., and M. Lewis. 1997. Why information visualizations sometimes fail. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Orlando, FL, October 12-15.
  6. Olsen, K.A., R.R. Korfhage, M.B. Spring, K.M. Sochats, and J.G. Williams. 1993. Visualization of a document collection: The VIBE system. Information Processing and Management. 29(1): 69-81.
  7. Spoerri, A. 1993. Visual tools for information retrieval. Proceedings of the 1993 IEEE Symposium on Visual Languages. Bergen, Norway. Los Alamitos, CA: IEEE Computer Society Press, 160-168.