An analysis of the Greenstone Digital Library Software

 

 

 

 

 

 

 

 

 

 

 

Nghi Dao

 

4/26/2001


 

My objective in this paper is to explain the process by which I tested the Greenstone software and to present my analysis of how the Greenstone digital library software relates to information storage and retrieval.  First, I'll explain what the Greenstone software is.

 

1.  What is the Greenstone Digital Library and how does it work?

 

The Greenstone software is a system for constructing and presenting collections of thousands of documents, including text, images, audio, and video.  There are several ways to find information in most Greenstone collections.  You can search for particular words, or you can browse documents by title or subject.  Each collection can implement any of these features, so each collection offers different features depending on what its developers intend users to see.

 

Many collections have been built with the Greenstone software, and each is organized slightly differently.  Most collections can be accessed by both searching and browsing.  When searching, the Greenstone software does a full-text search, meaning it looks through the entire text of all documents in the collection.  In most collections, the user can choose between indexes built from different parts of the documents.  Some collections have an index of full documents, an index of paragraphs, and an index of titles, each of which can be searched for particular words or phrases.  Using these, the user can find all documents that contain a particular set of words, all paragraphs that contain the set of words, or all documents whose titles contain the words.  Browsing involves lists that the user can examine:  lists of authors, lists of titles, lists of dates, and hierarchical classification structures.  Each collection offers different browsing facilities.
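To make the idea of separate indexes concrete, here is a minimal sketch of my own in Python.  It is not Greenstone's code; the two sample documents and the function names build_index and search are made up for illustration.  It builds one index over titles, one over whole documents, and one over paragraphs, and runs the same query against each.

```python
from collections import defaultdict

def build_index(units):
    """Map each lowercased word to the set of unit ids that contain it."""
    index = defaultdict(set)
    for unit_id, text in units.items():
        for word in text.lower().split():
            index[word.strip('.,;:!?"()')].add(unit_id)
    return index

def search(index, query, match_all=True):
    """Return ids of units containing all (or some) of the query words."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    if not hits:
        return set()
    return set.intersection(*hits) if match_all else set.union(*hits)

# Hypothetical two-document collection, invented for this example.
docs = {
    "d1": {"title": "Growing fruit trees",
           "paragraphs": ["Fruit trees need water.", "Prune in winter."]},
    "d2": {"title": "Water treatment",
           "paragraphs": ["Boil water before drinking.", "Store fruit dry."]},
}

title_index = build_index({d: v["title"] for d, v in docs.items()})
doc_index = build_index({d: " ".join(v["paragraphs"]) for d, v in docs.items()})
para_index = build_index({(d, i): p for d, v in docs.items()
                          for i, p in enumerate(v["paragraphs"])})

print(search(title_index, "fruit water"))   # no single title has both words
print(search(doc_index, "fruit water"))     # both documents contain both words
print(search(para_index, "fruit water"))    # only one paragraph contains both
print(search(title_index, "fruit water", match_all=False))  # either word will do
```

The same query gives different answers depending on which index is chosen, which is why the choice of index matters when searching a collection.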

 

I mentioned earlier that the collections can be searched or browsed.  To make searching possible, Greenstone constructs full-text indexes from the document text.  The indexes enable a user to search on any word in the full text of the body.  Indexes can be searched for particular words, combinations of words, or phrases, and results are ordered according to how relevant they are to the query.  Indexes are created during the building process from the information in the collection information file.  For browsing, metadata such as author, title, date, and keywords form the raw material.  The metadata is either provided explicitly or derived automatically from the documents.  The browsing structure is built from the metadata by a scheme of “classifiers”.  The classifiers build browsing indexes of various kinds:  scrollable lists, alphabetic selectors, dates, and arbitrary hierarchies.  Greenstone allows developers to create their own hierarchy if they choose.
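As a rough illustration of what a classifier does, the sketch below (my own Python, not Greenstone's classifier code) turns metadata into two browsing structures: an alphabetic selector and a hierarchy.  The records, the field names "Title" and "Subject", the "/" path separator, and the "_docs" key are all assumptions made for the example.

```python
from collections import defaultdict

def az_classifier(records, field):
    """Group record ids under the first letter of a metadata field."""
    buckets = defaultdict(list)
    for rec_id, meta in records.items():
        value = meta.get(field)
        if value:  # records lacking the field are left out of the list
            buckets[value[0].upper()].append((value, rec_id))
    return {letter: [rec_id for _, rec_id in sorted(items)]
            for letter, items in sorted(buckets.items())}

def hierarchy_classifier(records, field, sep="/"):
    """Build a nested dict from path-like metadata such as 'Food/Fruit'."""
    root = {}
    for rec_id, meta in records.items():
        path = meta.get(field)
        if not path:
            continue
        node = root
        for part in path.split(sep):
            node = node.setdefault(part, {})
        node.setdefault("_docs", []).append(rec_id)
    return root

# Hypothetical metadata, invented for this example.
records = {
    "d1": {"Title": "Fruit processing", "Subject": "Food/Fruit"},
    "d2": {"Title": "Alcohol problems", "Subject": "Health"},
    "d3": {"Title": "Food safety", "Subject": "Food/Safety"},
}

print(az_classifier(records, "Title"))
print(hierarchy_classifier(records, "Subject"))
```

Both structures are derived entirely from the metadata, so adding a document with the right metadata is enough to make it appear in the browsing lists.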

 

Greenstone creates all searching and browsing structures automatically from the documents themselves, so nothing has to be done manually.  When new documents need to be added to the library, they can be merged into the collection automatically.  This is usually done by a process that wakes up regularly, scouts for new material, and rebuilds the indexes, all without manual intervention.  The source documents for the library can be in a variety of formats and are converted for indexing by plugins.  The plugins supplied with Greenstone process plain text, HTML, Word, and PDF documents, as well as Usenet and email messages.  Greenstone also gives you the ability to develop your own plugins for other document types.
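The plugin idea can be pictured as a lookup from file type to a converter.  The sketch below is my own simplification in Python and does not reflect how Greenstone's plugins are actually written; the names PLUGINS, text_plugin, html_plugin, and import_document are hypothetical, and only plain text and HTML are handled.

```python
import re
from pathlib import Path

def text_plugin(raw):
    """Plain text needs no conversion."""
    return raw

def html_plugin(raw):
    """Crude tag stripping, good enough for a sketch."""
    return re.sub(r"<[^>]+>", " ", raw)

# Hypothetical registry: file extension -> converter function.
PLUGINS = {".txt": text_plugin, ".html": html_plugin, ".htm": html_plugin}

def import_document(name, raw):
    """Return indexable text, or None if no plugin claims the file."""
    plugin = PLUGINS.get(Path(name).suffix.lower())
    if plugin is None:
        return None
    return " ".join(plugin(raw).split())  # normalize whitespace

print(import_document("page.html", "<h1>Fruit</h1><p>and nuts</p>"))
print(import_document("notes.txt", "plain  text"))
print(import_document("report.pdf", "%PDF-1.3"))  # no plugin registered here
```

Supporting a new document type in this picture only means registering one more converter, which is the appeal of the plugin approach.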

 

2.  What is the procedure I used to test the software? 

2.1 Introduction

In this section of my paper, I will explain my test procedure and give further details about the software.  The intention of this section is mainly to give you an idea of the kinds of tests that I performed with the software and a rough estimate of the amount of time that I have spent working with it.  I would characterize the time I have spent on the project as follows:

 

Reading the documentation:  4 hours

Installing the software, learning to use the system, and trying things for the first time:  4 hours

Reading through the Food and Nutrition Library so that my query would be relevant:  4 hours

Following the test procedure and trying specific tests for robustness:  8 hours

Writing this paper:  5 hours

Total time on project:  25 hours

 

 

2.2 Documentation:

Part of the project involved reading the documentation.  There were three documents to read:  an install guide (30 pages), a user's guide (44 pages), and a developer's guide (103 pages).  I read those documents and then tried the features in the procedure below.

 

2.3 Installation

Installation of the Greenstone software was very easy.  I installed the software under Windows 2000 on a Compaq Presario laptop.  The installation only involved inserting the Greenstone CD and running the automatic install program.  I chose all default settings for file locations.  Once the program is installed, it is easily accessed from the Programs menu.

 

2.4 Main Home Page

The Greenstone software allows for the creation of a digital library in HTML format.  The collection can reside on your local hard disk, on the web, or on a CD.  In this case, I installed the software and several collections and demonstration libraries on my hard drive.  The main page of the software is shown in figure 1.  The main page offers access to the Greenstone Demo, the Language Extraction Demo, the Chinese Demonstration collection, the Greenstone Archives, the Food and Nutrition Library, collector information, administration information, and Greenstone background information.  Each collection has its own method of searching for information, and they are all different.  In the interest of seeing what capabilities are available and how queries can be made, I will look only at certain sections of the collections.  This paper is not meant to examine every feature of every collection in detail, but to look at how the information retrieval system is established and used.

 

Figure 1:  Main page of the Greenstone software

 

 

2.5 Chinese Collection Demonstration

 

I first checked out the Chinese Demonstration collection.  This collection can only be searched by title or by document sections.  You can search for all of the key words or some of the key words, and you can also browse through the titles.  At first, I decided to browse through the titles so that I could get an idea of what documents are in the collection.  When I clicked on Titles, all I saw was question marks.  This is shown in figure 2.  I soon found out that the page had defaulted to the Western encoding scheme instead of the Simplified Chinese encoding.  You have to choose the encoding scheme manually from the preferences link in the upper right-hand corner.  The preferences link offers an assortment of options that you can set to your liking.  After I chose the right encoding scheme, the text appeared properly.  This is shown in figure 3.

 

Figure 2:  Chinese demonstration collection with incorrect encoding scheme

 

 

Figure 3:  Chinese Demonstration collection with the correct Simplified Chinese encoding scheme
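The question marks in figure 2 can be reproduced with a few lines of Python.  This is only my guess at the mechanism, not something taken from the Greenstone documentation: the title string is an example I chose, and I use Python's "ascii" codec with errors="replace" to stand in for a reader that cannot interpret the Chinese bytes.

```python
# Assumed example title; it is not taken from the collection itself.
title = "红楼梦"

# Bytes as a Simplified Chinese (GB2312) page would store the title.
raw = title.encode("gb2312")

# Reading the same bytes with the wrong scheme: every byte that is not
# valid ASCII is replaced by a placeholder character.
wrong = raw.decode("ascii", errors="replace")

# Reading the bytes with the scheme they were written in.
right = raw.decode("gb2312")

print(len(raw), "bytes")
print(repr(wrong))
print(right)
```

The bytes on disk never change; only the encoding scheme used to read them does, which is why switching the preference fixed the display without rebuilding anything.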

 

 

I did the following test to see if queries can be performed in English even though the text is in Chinese.  I should note that I am not fully literate in Chinese, only somewhat literate.  In other words, I can pick out words, but I can't read a whole paragraph.  From figure 3, I can tell you that there is a story about a red dream and a story about gold something.  The word for gold is the first word of line three, and the word for red is the first word of line four.  Based on those two colors, red and gold, I wanted to find out if a query in English would return those two stories.  I typed in "red", chose titles as the section to search, and chose to match some of the words; the search resulted in 0 matches.  I typed in "gold", chose titles as the section to search, and the search again resulted in 0 matches.  So, as far as I could tell, an English query will not match text written in Chinese.

 

2.6 Food and Nutrition Library

 

Next, I did more advanced searches with the Food and Nutrition Library.  First, I browsed through the titles just to get an idea of what is there.  This library offers the option to browse by title, by subject, or by the organization that performed the study or offered relief.  I briefly browsed all three sections.  Then I did some searches.  This library allows you to search by chapters, titles, or paragraphs.  You can also specify whether results must contain all of the query words or only some of them.

 


2.7 Comparing queries:

 

Below is a matrix showing the options I used for each query, preceded by an explanation of each available search option.  I bolded the items that changed from query to query.  Below the table, I explain exactly what I was looking for with each query and whether the query was successful.

 

Options for each variable

Select for:  chapters, titles, or paragraphs

Which contains:  some or all of the words

Case difference:  ignore case differences, or upper/lower case must match

Word endings:  ignore word endings, or whole word must match

Query mode:  simple or advanced

               Advanced mode allows Boolean searches with !, &, |, and parentheses.

 

 

 

                                                     Search Options              Search Preferences
#   Query used                                       Select for  Which contains  Case difference  Word endings         Query mode
--  -----------------------------------------------  ----------  --------------  ---------------  -------------------  ------------------
1   Effects malnutrition mortality medical evidence  chapters    all             ignore case      whole word match     simple
2   Same as 1                                        chapters    some            ignore case      whole word match     simple
3   Gardening techniques no rain desert              chapters    some            ignore case      whole word match     simple
4   Gardening techniques no rain desert              chapters    some            ignore case      ignore word endings  simple
5   (food processing) &fruit                         chapters    some            ignore case      ignore word endings  advanced (Boolean)
6   (food processing) &fruit !nut                    chapters    some            ignore case      ignore word endings  advanced (Boolean)
7   Food poison death                                chapters    some            ignore case      ignore word endings  advanced (Boolean)
8   Food | poison | death                            chapters    some            ignore case      ignore word endings  advanced (Boolean)
9   Water treatment facilities                       titles      some            ignore case      ignore word endings  ranked
10  Alcohol problems                                 titles      some            ignore case      ignore word endings  ranked

 

1.      I wanted to find the effects of malnutrition on infant mortality.  I wanted medical evidence that malnutrition can weaken cells and hence increase infant mortality.  With query 1, I found the information I needed, but the query returned more than 50 results, and I had to look through 14 different articles before I found it.

2.      I ran the same query as query 1, but this time with the “which contains” option set to “some” of the words.  This query worked much better than the first.  The results showed 49 documents that matched the query, and the articles I was interested in appeared among the first five results.  So, sometimes less is better.

3.      I wanted to find documents about gardening techniques for areas that have little rain.  I submitted query 3 and found the information I needed with relative ease.

4.      Then I wanted to test the “ignore word endings” feature of the Greenstone software, to see whether I would get results similar to those of query 3.  With query 4, the first five results were different from those of query 3.  Fewer documents were found than with query 3, and the documents seemed to focus more on gardens than on gardening techniques for hostile environments.  For example, some documents were on garden accounting, that is, measuring income from the garden.  This was not what I wanted to see.

5.      In query 5, I wanted to try an advanced query using the Boolean operators “and”, “or”, and “not”.  I first looked for food processing with fruits.  I found documents on fruit processing, but most documents were on fruit and nut processing.  So, I proceeded to query 6.

6.      This query looks for fruit processing but excludes nut processing.  With query 6, there were no articles on nuts, as expected.

7.      I wanted to find articles on food poisoning and its potential to cause deaths.  With query 7, the search showed that no document matched the query.  The system reported word counts of 2 for poison, 9 for death, and 1036 for food.

8.      With this query, which joins the same three words with “or”, I was able to find articles on food poisoning issues.

9.      I wanted to find water treatment problems by searching titles instead of chapters.

10.   I wanted to find alcohol-related problems.  With query 10, I was able to find that information.  This served as a test of a general query.
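The behavior of queries 5 through 8 can be imitated with a small Boolean evaluator.  The Python below is my own sketch, not Greenstone's query parser.  It assumes that words written side by side are joined by an implicit “and”, which is consistent with query 7 returning no matches but is not something I verified in the documentation.  The four sample documents are invented, and the sketch does no stemming and no error handling for malformed or empty queries.

```python
import re

def tokenize(query):
    """Split a query into operators, parentheses, and lowercased words."""
    return re.findall(r"[()!&|]|[^\s()!&|]+", query.lower())

def matches(query, words):
    """Return True if the set `words` satisfies the Boolean query."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "|":
            advance()
            rhs = parse_and()          # parse first, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() not in (None, "|", ")"):
            if peek() == "&":          # "&" is optional: adjacency means "and"
                advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        token = advance()
        if token == "!":
            return not parse_not()
        if token == "(":
            value = parse_or()
            advance()                  # consume the closing ")"
            return value
        return token in words

    return parse_or()

# Hypothetical documents, each reduced to its set of words.
docs = {
    "a": {"food", "processing", "fruit"},
    "b": {"food", "processing", "fruit", "nut"},
    "c": {"food", "poison"},
    "d": {"death"},
}

def run(query):
    return sorted(d for d, words in docs.items() if matches(query, words))

print(run("(food processing) &fruit"))
print(run("(food processing) &fruit !nut"))
print(run("Food poison death"))
print(run("Food | poison | death"))
```

In this toy collection, no document contains all three of food, poison, and death, so the third query finds nothing, while the “or” version matches every document.  That mirrors what I saw with queries 7 and 8.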

 

 

2.8 Creating my own digital library

One of the features that the Greenstone software provides is the ability to create your own digital library from source files that you have.  To test this feature, I had to find some books in electronic form and create my own small digital library.  I decided to create a small digital library on religion.  First, I went to several web sites and got ten books on religion in electronic form.  Then I chose the collector on the front page of the Greenstone software.  From there, it was very simple to create my own digital library.  The Greenstone software has five steps that help you create your own library.

 

The first step is specifying the name and location of the collection.  I filled out text boxes for the name of my collection, an email address, and information about the collection.  Next, I provided the source data for the library.  Then there was a menu to configure the collection, but the standard configuration was fine.  Then I chose to build the collection.  A status bar gives information on exactly what the process is doing.  The Greenstone software has plugins that automatically convert Word, PDF, and plain text files into HTML files.  The collection was created very easily with no problems.  Since it is difficult to show on paper that this collection exists, I will demonstrate it in class during the presentation so that you can see it.

 

 

3.  How does Greenstone relate to what we have learned in the information storage and retrieval class?

 

The Greenstone software is a great piece of software that can create digital libraries very quickly.  It not only creates the digital library but also automatically generates an index.  The index is then used in different ways according to the type of query structure you are interested in using.

 

The weighted vector query is the default query structure used by the Greenstone software.  Neither the software nor the documentation specifically says that it uses a weighted vector query, but based on what I have learned in the information storage and retrieval class, I recognize that the query type is very similar to one.  This is because the index terms are weighted based on the frequency of each word in the library.  The documentation states that a word that occurs more frequently is given a smaller weight than a rarer word.  This weighting is extremely useful as well as practical, especially for searching through special collections.  For example, in the Religion collection that I created, the word “religion” occurs many times because it is a collection on religion.  The Greenstone software automatically gives such a common word less weight, which makes the search more effective.
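To show how this kind of weighting plays out, here is a small sketch using the textbook tf-idf scheme from class.  I am assuming this scheme only for illustration; I do not know the exact formula Greenstone uses.  The three short "documents" and the function name rank are made up.

```python
import math
from collections import Counter

def rank(docs, query):
    """Order documents by the summed tf * idf of the query words."""
    n = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for d, c in counts.items():
        score = sum(c[w] * math.log(n / df[w])
                    for w in query.lower().split() if df[w])
        if score > 0:
            scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical miniature religion collection.
docs = {
    "d1": "fasting and religion and fasting",
    "d2": "religion and pilgrimage",
    "d3": "religion and prayer and fasting",
}

print(rank(docs, "religion fasting"))
print(rank(docs, "religion"))
```

Because “religion” appears in every document, its weight is log(3/3) = 0 and it contributes nothing to any score.  The ranking is decided entirely by the rarer word “fasting”, and a query on “religion” alone cannot separate the documents at all under this scheme.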

 

Although the weighted vector query is the default query structure of the Greenstone software, Greenstone has many other features that are excellent examples of the information retrieval methods I have learned in class.  One of these is the Boolean query.  In the preferences of the Greenstone software, you can choose between a simple search and an advanced search.  An advanced search allows you to perform Boolean queries using “and”, “or”, and “not”.  The advantage of this query type is that it incorporates the presence or absence of words into your search.  The disadvantage is that it does not incorporate word weights into the search.

 

Greenstone can automatically handle stemming if you would like.  There is an option either to match the whole word or to ignore word endings.  This is a useful technique if you are not particular about the endings of the words.
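The effect of ignoring word endings can be sketched with a very crude suffix stripper.  This is my own toy in Python and certainly not the stemmer Greenstone uses; the suffix list and the function names crude_stem and word_matches are assumptions for illustration.

```python
# Suffixes to strip, tried in this order.  A deliberately tiny list.
SUFFIXES = ("ing", "es", "ed", "s")

def crude_stem(word):
    """Strip one common ending, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_matches(query_word, doc_word, ignore_endings):
    """Compare two words with or without their endings."""
    if ignore_endings:
        return crude_stem(query_word) == crude_stem(doc_word)
    return query_word.lower() == doc_word.lower()

print(crude_stem("gardening"), crude_stem("gardens"), crude_stem("garden"))
print(word_matches("gardening", "gardens", ignore_endings=True))
print(word_matches("gardening", "gardens", ignore_endings=False))
```

With endings ignored, “gardening” also matches “gardens” and “garden”.  That would explain why query 4 pulled in documents about gardens in general rather than gardening techniques.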

 

All in all, the Greenstone software is an excellent information storage and retrieval system.  The developers obviously know what they are doing in the information retrieval field and have incorporated many features that are very useful for finding relevant documents.  My only critique of the software is that it is not very intuitive if you have not read the documentation.  On my first try playing with the software, I decided not to read the documentation in order to see how intuitive the interface is, and it wasn’t very intuitive.  But after reading the documentation, it all made sense and the interface was very easy to use.