An analysis of the Greenstone Digital Library Software
Nghi Dao
4/26/2001
My objective in this paper is to explain how I tested the Greenstone software and to analyze how the Greenstone digital library software applies what we have learned about information storage and retrieval. First, I will describe what the Greenstone software is.
1. What is the Greenstone Digital Library and how does it work?
The Greenstone software is a system for constructing and presenting collections of thousands of documents, including text, images, audio, and video. There are several ways to find information in most Greenstone collections: you can search for particular words, or you can browse documents by title or subject. Each collection can implement any of these features, so the features a collection offers depend on what its developers intend users to see.
Many collections have been built with the Greenstone software, and each is organized slightly differently. Most collections can be accessed by both searching and browsing. When searching, Greenstone does a full-text search, meaning it looks through the entire text of all documents in the collection. In most collections, the user can choose between indexes built from different parts of the documents. Some collections have an index of full documents, an index of paragraphs, and an index of titles, each of which can be searched for particular words or phrases. Using these, the user can find all documents that contain a particular set of words, all paragraphs that contain the set of words, or all documents whose titles contain the words. Browsing involves lists that the user can examine: lists of authors, lists of titles, lists of dates, and hierarchical classification structures. Each collection offers different browsing facilities.
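To make the idea of separate document, paragraph, and title indexes concrete, here is a minimal sketch of my own. It is not Greenstone's code; the function names and the two sample documents are made up for illustration. Each index maps a word to the units (documents or paragraphs) that contain it.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_indexes(documents):
    """Build one inverted index per granularity: document, paragraph, title.

    `documents` maps a document id to a dict with "title" and "paragraphs".
    """
    indexes = {"document": defaultdict(set),
               "paragraph": defaultdict(set),
               "title": defaultdict(set)}
    for doc_id, doc in documents.items():
        for word in tokenize(doc["title"]):
            indexes["title"][word].add(doc_id)
        for number, paragraph in enumerate(doc["paragraphs"]):
            for word in tokenize(paragraph):
                indexes["document"][word].add(doc_id)
                indexes["paragraph"][word].add((doc_id, number))
    return indexes

def search(index, query):
    """Return the units that contain every word of the query."""
    postings = [index.get(word, set()) for word in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample data, not taken from any Greenstone collection
documents = {
    "d1": {"title": "Home gardening",
           "paragraphs": ["Dry season gardening needs little water.",
                          "Compost improves the soil."]},
    "d2": {"title": "Water treatment",
           "paragraphs": ["Clean water prevents disease."]},
}
indexes = build_indexes(documents)
```

With this data, searching the document index for "compost water" finds d1, while the same query against the paragraph index finds nothing, because no single paragraph contains both words.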
As I mentioned earlier, collections can be searched or browsed. To make searching possible, Greenstone constructs full-text indexes from the document text. The indexes enable a user to search on any word in the full text of the documents. Indexes can be searched for particular words, combinations of words, or phrases, and results are ordered according to how relevant they are to the query. Indexes are created during a building process from the information in the collection information file. For browsing, metadata such as author, title, date, and keywords forms the raw material. Metadata is either provided explicitly or derived automatically from the documents. The metadata browsing structure is built by a scheme of "classifiers". The classifiers build browsing indexes of various kinds: scrollable lists, alphabetic selectors, dates, and arbitrary hierarchies. Greenstone allows developers to create their own hierarchy if they choose.
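As a rough illustration of what a classifier does, the sketch below builds an alphabetic selector from title metadata. This is my own simplification, not Greenstone's classifier interface; the function name `az_classifier` and the sample metadata are hypothetical.

```python
from collections import defaultdict

def az_classifier(metadata, field):
    """Group document ids under the first letter of a metadata field."""
    buckets = defaultdict(list)
    for doc_id, fields in metadata.items():
        value = fields.get(field)
        if not value:
            continue  # documents without this metadata are left out
        buckets[value[0].upper()].append((value, doc_id))
    # Sort the letters, and sort the entries within each letter by value
    return {letter: [doc_id for _, doc_id in sorted(entries)]
            for letter, entries in sorted(buckets.items())}

# Hypothetical metadata; d4 has no title and so is not listed
metadata = {
    "d1": {"title": "Home gardening"},
    "d2": {"title": "Water treatment"},
    "d3": {"title": "Healthy eating"},
    "d4": {"author": "Unknown"},
}
by_title = az_classifier(metadata, "title")
```

The same function could be pointed at an "author" or "subject" field to produce a different browsing list from the same metadata.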
Greenstone creates all searching and browsing structures automatically from the documents themselves, so nothing has to be done manually. When new documents need to be added to the library, they can be merged into the collection automatically. This is usually done by a process that wakes up regularly, scouts for new material, and rebuilds the indexes, all without manual intervention. The source documents for the library can be in a variety of formats and are converted for indexing by plugins. The plugins supplied with Greenstone process plain text, HTML, Word, and PDF documents, as well as Usenet and email messages. Greenstone also lets you develop your own plugins for other document types.
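The plugin idea can be sketched as a table that maps file types to converters. The code below is only my own illustration of that design, assuming plugins are chosen by file extension; it is not Greenstone's actual plugin interface, and every name in it is hypothetical.

```python
import re
from pathlib import Path

PLUGINS = {}  # maps a file extension to the function that converts it

def plugin(*extensions):
    """Register a converter function for one or more file extensions."""
    def register(func):
        for extension in extensions:
            PLUGINS[extension] = func
        return func
    return register

@plugin(".txt")
def text_plugin(raw):
    """Plain text needs no conversion beyond decoding."""
    return raw.decode("utf-8")

@plugin(".html", ".htm")
def html_plugin(raw):
    """Crudely strip tags and collapse whitespace to get indexable text."""
    text = re.sub(r"<[^>]+>", " ", raw.decode("utf-8"))
    return " ".join(text.split())

def import_document(filename, raw):
    """Convert a source file to indexable text, or None if no plugin fits."""
    handler = PLUGINS.get(Path(filename).suffix.lower())
    return handler(raw) if handler else None
```

Supporting a new document type then means registering one more function, which is the kind of extension the paragraph above describes.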
2. What is the procedure I used to test the software?
2.1 Introduction
In this section, I explain my test procedure and give further details about the software. The intention of this section is mainly to give you an idea of the kinds of tests I performed, and a rough estimate of the amount of time I spent working with the software. I would characterize the time I spent on the project as follows:
Reading the documentation: 4 hours
Installation, learning to use the system, and trying things for the first time: 4 hours
Reading through the Food and Nutrition Library so that my queries would be relevant: 4 hours
Following the test procedure and trying specific tests for robustness: 8 hours
Writing this paper: 5 hours
Total time on project: 25 hours
2.2 Documentation
Part of the project involved reading the documentation. There were three documents to read: an install guide (30 pages), a user's guide (44 pages), and a developer's guide (103 pages). I read those documents and then tried the features in the procedure below.
2.3 Installation
Installation of the Greenstone software was very easy. I installed the software under Windows 2000 on a Compaq Presario laptop. The install only involved inserting the Greenstone CD and running the automatic install program. I chose all default settings for file locations. Once the program is installed, it is easily accessed from the Programs menu.
2.4 Main Home Page
The Greenstone software allows for the creation of a digital library in HTML format. The collection can reside on your local hard disk, on the web, or on a CD. In this case, I installed the software and several demonstration collections on my hard drive. The main page of the software is shown in figure 1. The main page offers access to the Greenstone Demo, the Language Extraction Demo, the Chinese Demonstration collection, the Greenstone Archives, the Food and Nutrition Library, collector information, administration information, and Greenstone background information. Each collection has its own method of searching for information; they are all different. In the interest of seeing what capabilities are available and how queries can be made, I will look only at certain sections of the collections. This paper is not meant to examine every feature of every collection in detail, but to look at how the information retrieval system is established and used.
Figure 1: Main page of the Greenstone software

2.5 Chinese Collection Demonstration
I first looked at the Chinese Demonstration collection. This collection can be searched only by title or by document section, matching either all or some of the key words. You can also browse through the titles. I decided to browse the titles first to get an idea of what documents are in the collection. When I clicked on Titles, all I saw was question marks, as shown in figure 2. I soon found that the page had defaulted to the Western encoding scheme instead of the Simplified Chinese encoding. You have to choose the encoding scheme manually from the preferences link in the upper right-hand corner. The preferences page has an assortment of options that you can set to your liking. After I chose the right encoding, the text appeared properly, as shown in figure 3.
Figure 2: Chinese Demonstration collection with incorrect encoding scheme

Figure 3: Chinese Demonstration collection with the correct Simplified Chinese encoding scheme
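The encoding problem can be reproduced in a few lines. This is only a sketch of the general effect, assuming the text is stored as GB2312 (a common Simplified Chinese encoding) and read as Latin-1 (a common Western encoding); I do not know exactly which encodings the browser and Greenstone used. The sample string, the characters for "red" and "gold", is my own.

```python
sample = "红金"  # the characters for "red" and "gold"; illustrative only

# Bytes as they might be stored under a Simplified Chinese encoding
raw = sample.encode("gb2312")

# Reading the same bytes under a Western encoding gives unrelated characters
wrong = raw.decode("latin-1")

# Reading them under the encoding they were written in recovers the text
right = raw.decode("gb2312")

# A Western encoding has no codes for these characters at all, so forcing
# the text into it replaces each character with a question mark
question_marks = sample.encode("latin-1", errors="replace")

print(repr(wrong))
print(right)
print(question_marks)
```

The bytes never change; only the table used to interpret them does, which is why choosing the right encoding in the preferences fixes the display without reloading the collection.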

I did the following test to see whether queries can be performed in English even though the text is in Chinese. I am only somewhat literate in Chinese; in other words, I can pick out words, but I cannot read a whole paragraph. I can tell from figure 3 that there is a story about a red dream and a story about gold something. The word for gold is the first word of line three, and the word for red is the first word of line four. Based on those two colors in the text, I wanted to find out whether a query in English would return those two stories. I typed in "red", chose titles as the section to search, and chose to match some of the words; the search resulted in 0 matches. I typed in "gold" with the same settings, and the search again resulted in 0 matches. So an English query does not match text written in Chinese; the collection has to be queried in the language of its text.
2.6 Food and Nutrition Library
Next, I did more advanced searches with the Food and Nutrition Library. First, I browsed through the titles just to get an idea of what is there. This library offers options to browse by title, by subject, or by the organization that performed the study or offered relief. I briefly browsed all three sections, and then I did some searches. This library allows you to search by chapters, titles, or paragraphs. You can also select whether results must contain all of the words or may contain only some of them.
2.7 Comparing queries
Below is a matrix showing the options I used for each query. I first explain each option that is available for the search. I also bolded the items that changed from query to query. Below the table, I explain exactly what I was looking for and whether the query was successful.
Options for each variable
Select for: chapters, titles, or paragraphs
Which contains: some or all of the words
Case difference: ignore case differences, or upper/lower case must match
Word endings: ignore word endings, or whole word must match
Query mode: simple or advanced
Advanced mode allows Boolean searches with !, &, |, and parentheses.
Search options: Select for, Which contains. Search preferences: Case difference, Word endings, Query mode.
| # | Query used | Select for | Which contains | Case difference | Word endings | Query mode |
|---|---|---|---|---|---|---|
| 1 | Effects malnutrition mortality medical evidence | chapters | all | ignore case | whole word match | simple |
| 2 | Same as 1 | chapters | some | ignore case | whole word match | simple |
| 3 | Gardening techniques no rain desert | chapters | some | ignore case | whole word match | simple |
| 4 | Gardening techniques no rain desert | chapters | some | ignore case | ignore word endings | simple |
| 5 | (food processing) &fruit | chapters | some | ignore case | ignore word endings | advanced (Boolean) |
| 6 | (food processing) &fruit !nut | chapters | some | ignore case | ignore word endings | advanced (Boolean) |
| 7 | Food poison death | chapters | some | ignore case | ignore word endings | advanced (Boolean) |
| 8 | Food \| poison \| death | chapters | some | ignore case | ignore word endings | advanced (Boolean) |
| 9 | Water treatment facilities | titles | some | ignore case | ignore word endings | ranked |
| 10 | Alcohol problems | titles | some | ignore case | ignore word endings | ranked |
1. I wanted to find the effects of malnutrition on infant mortality. I wanted medical evidence that malnutrition can weaken cells and hence increase infant mortality. With query 1, I found the information I needed, but the query returned more than 50 results and I had to look through 14 different articles before I found it.
2. I wanted to test the same query as query 1, but this time requiring "some" of the words. This query worked much better than the first. The results showed 49 documents that matched the query, and the articles I was interested in appeared in the first five results. So, sometimes less is better.
3. I wanted to find documents about gardening techniques for areas that have little rain. I submitted query 3 and found the information I needed with relative ease.
4. Then I wanted to test the "ignore word endings" feature of the Greenstone software, to see whether I would get results similar to those of query 3. With query 4, the first five results were different from those of query 3. Fewer documents were found than in query 3, and the documents seemed to focus more on gardens than on gardening techniques for hostile environments. For example, some documents were about garden accounting, that is, measuring income from the garden. This was not what I wanted to see.
5. In query 5, I wanted to try an advanced query using Boolean "and", "or", or "not". I first looked for food processing with fruits. I found documents on fruit processing, but most documents were on fruit and nut processing. So I proceeded to query 6.
6. This query looks for fruit processing but not nut processing. With query 6, there were no articles on nuts, as expected.
7. I wanted to find articles on food poisoning and its potential to cause death. With query 7, the search showed that no document matched the query. The system reported word counts of 2 for "poison", 9 for "death", and 1036 for "food".
8. With this query, I was able to find documents on food poisoning issues.
9. I wanted to find water treatment problems by searching titles instead of chapters.
10. I wanted to find alcohol-related problems. With query 10, I was able to find that information. This was intended as a general query.
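The difference between matching "all" and "some" of the words, and the contrast between queries 7 and 8, can be sketched with set operations. This is my own toy illustration with a made-up index, not Greenstone's implementation. I am also assuming that an advanced query with no operators behaves like "all", which would explain why query 7 found nothing; the documentation I read does not confirm this.

```python
def match_all(index, words):
    """Return the documents that contain every one of the words."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()

def match_some(index, words):
    """Return documents containing at least one word, best matches first."""
    counts = {}
    for word in words:
        for doc in index.get(word, set()):
            counts[doc] = counts.get(doc, 0) + 1
    # More matching words first; ties broken by document id
    return sorted(counts, key=lambda doc: (-counts[doc], doc))

# Hypothetical index: word -> ids of documents that contain it
index = {
    "food": {"d1", "d2", "d3"},
    "poison": {"d2"},
    "death": {"d3"},
}
words = ["food", "poison", "death"]

print(match_all(index, words))   # no document contains all three words
print(match_some(index, words))  # documents with two of the words come first
```

In this toy index no document has all three words, so "all" returns nothing, while "some" still returns every document that mentions food and ranks the ones that also mention poison or death at the top.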
One of the features that the Greenstone software provides is the ability to create your own digital library from source files that you have. To test this feature, I had to find some books in electronic form and create my own small digital library. I decided to create a small digital library on religion. First, I went to several web sites and got ten books on religion in electronic form. Then I chose the collector on the front page of the Greenstone software. From there, it was very simple to create my own digital library. The Greenstone software has five steps that help you create your own library.
The first step is specifying the name and location of the collection. I filled out text boxes for the name of the collection, an email address, and information about the collection. Next, I provided source data for the library. Then there was a menu to configure the collection, but the standard configuration was fine. Then I chose to build the collection. A status bar gives information on exactly what the process is doing. The Greenstone software has plugins that automatically convert Word, PDF, and plain text files into HTML files. The collection was created very easily with no problems. Since it is difficult to show in this paper that the collection exists, I will demonstrate it in class during the presentation so that you can see it.
3. How does Greenstone relate to what we have learned in the information storage and retrieval class?
The Greenstone software is a great piece of software that can create digital libraries very quickly. It not only creates the digital library, but also automatically generates an index. The index is then used in different ways according to the type of query structure you are interested in using.
A weighted vector query is the default query structure used by the Greenstone software. Neither the software nor the documentation specifically says that it uses a weighted vector query, but based on what I have learned in the information storage and retrieval class, I recognize that the query type is very similar to one. This is because the terms in the index are weighted based on how frequently each word occurs in the library. The documentation states that a word that occurs more frequently is given a smaller weight than a rarer word. This weighting is both useful and practical, especially for searching special collections. For example, in the religion collection that I created, the word "religion" occurs many times, because it is a collection on religion. The Greenstone software accounts for this automatically by giving such a word little weight, which makes the search more effective.
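The weighting idea can be sketched with a simplified TF-IDF score, where a word's weight shrinks as it appears in more documents. This is a textbook-style illustration of my own, not the formula Greenstone uses (the documentation I read does not give one), and it leaves out the length normalization that a full vector model would apply. The three sample documents are made up.

```python
import math
from collections import Counter

def build_counts(documents):
    """Count how often each word occurs in each document."""
    return {doc_id: Counter(text.lower().split())
            for doc_id, text in documents.items()}

def idf(word, counts):
    """Inverse document frequency: rarer words get larger weights."""
    containing = sum(1 for c in counts.values() if word in c)
    return math.log(len(counts) / containing) if containing else 0.0

def rank(query, counts):
    """Rank documents by the sum of term frequency times word weight."""
    scores = {}
    for doc_id, c in counts.items():
        score = sum(c[word] * idf(word, counts)
                    for word in query.lower().split())
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical religion collection in which every document says "religion"
documents = {
    "d1": "religion and ritual in daily life",
    "d2": "religion and pilgrimage",
    "d3": "religion history",
}
counts = build_counts(documents)
```

Because "religion" occurs in every document, its weight is zero here, so a query for "religion pilgrimage" is decided entirely by the rarer word and only d2 is returned.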
Although the weighted vector query is the default query structure of the Greenstone software, there are many other features in Greenstone that are excellent examples of information retrieval methods I have learned in class. One of these is the Boolean query. In the preferences of the Greenstone software, you can choose to do a simple search or an advanced search. An advanced search allows you to perform Boolean queries with "and", "or", and "not". This kind of query has the advantage of incorporating the presence or absence of words into your search. The disadvantage is that it does not incorporate word weights into the search.
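A Boolean query over an index can be evaluated with set operations: "&" is intersection, "|" is union, and "!" is the complement. The sketch below is my own small evaluator, not Greenstone's parser. It assumes that "!" binds most tightly, then "&", then "|", and that two terms written side by side are combined with "and"; the index and document ids are made up.

```python
import re

def evaluate(query, index, all_docs):
    """Evaluate a Boolean query and return the set of matching document ids."""
    tokens = re.findall(r"[()!&|]|\w+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "|":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        # Adjacent terms are treated as an implicit "and" (an assumption)
        while peek() not in (None, "|", ")"):
            if peek() == "&":
                advance()
            result = result & parse_not()
        return result

    def parse_not():
        token = advance()
        if token == "!":
            return all_docs - parse_not()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token in ("&", "|", ")"):
            raise ValueError("unexpected operator: " + token)
        return set(index.get(token, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result

# Hypothetical index: word -> ids of documents that contain it
index = {"food": {"d1", "d2", "d3"}, "processing": {"d1", "d2"},
         "fruit": {"d1", "d2"}, "nut": {"d2"}}
all_docs = {"d1", "d2", "d3", "d4"}
```

With this index, the query "(food processing) &fruit" matches d1 and d2, and adding "!nut" removes d2. Every matching document counts equally, which is the lack of word weights noted above.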
Greenstone can also handle stemming automatically if you would like. There is an option to match the exact word or to ignore word endings. This is a useful technique if you are not particular about the endings of the word.
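To show what ignoring word endings means, here is a crude suffix stripper of my own. It is far simpler than a real stemming algorithm and is not the one Greenstone uses; the suffix list and function names are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip one common English ending, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "process" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def matches(query_word, document_word, ignore_endings):
    """Compare two words, either exactly or after stripping endings."""
    if ignore_endings:
        return stem(query_word) == stem(document_word)
    return query_word.lower() == document_word.lower()

print(stem("gardening"), stem("gardens"), stem("processing"))
```

With endings ignored, "gardening" also matches "garden" and "gardens", which is consistent with query 4 returning documents about gardens in general and not only about gardening techniques.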
All in all, the Greenstone software is an excellent information storage and retrieval system. The developers obviously know what they are doing in the information retrieval field and have incorporated many features that are very useful in finding relevant documents. My only critique of the software is that it is not very intuitive if you have not read the documentation. The first time I played with the software, I decided not to read the documentation, to see how intuitive the interface was, and it was not very intuitive. But after reading the documentation, it all made sense and the interface was very easy to use.