Documents are important to humans. They are the medium through which much of our cultural heritage is preserved and exchanged. Each of us has spent a good part, if not the majority, of our education processing documents. From birthday cards to birth certificates to college papers to novels, documents play many roles in our lives. Each of us has a personal stake in, and a personal opinion about, how we process documents. At the same time, it is likely that you have no rigorous definition of a document. You know what one is, but you would be hard pressed to define it with a high degree of precision.
This course is about document processing. It is somewhat difficult to know what to teach in a course on document processing, since it is one of the most rapidly evolving areas in information and computer science. Indeed, most recently -- precipitated by the "World Wide Web" -- the changes in document processing have bordered on revolutionary. A more considered review of the history of document processing suggests that over the last several decades we have experienced developments in three stages.
Most people are well aware of the early use of computers to do tedious calculations -- most notably those required for targeting artillery shells. The emergence of computers in the back offices of corporations and banks -- again to do repetitive calculations -- is also fairly well understood. Less well understood is the use of computers in typesetting equipment, where the repetitive calculation was line length: the length of every line had to be computed from the widths of the characters on it. Thus, the early relationship between document processing and computing had to do with repetitive calculations related to typesetting. Computers were also useful in assisting with hyphenation decisions and with spell checking. Simultaneously, a series of researchers were examining how bibliographic records might be stored on computers and searched and accessed more quickly. This gave rise to the development of a series of approaches to information storage and retrieval. The tasks of storage and retrieval, when examined in detail, involve preparing and processing the text using a number of sophisticated algorithms. Finally, early researchers in this field were faced with the need to optimize the use of very expensive resources. Until 1980, memory was counted in thousands of bytes and disk storage in millions. This meant that far less than one book could be held in memory at one time, and a single book could absorb all the storage resources of a given computer. For this and other reasons, compression algorithms were of interest to these researchers. Thus, computer scientists interested in the marriage of computers and documents were intensely interested in the algorithms used to process text.
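To make that repetitive calculation concrete, the sketch below (in Python) breaks a run of words into lines by summing per-character widths -- roughly the computation early typesetting equipment performed for every line. The width table, default width, and line measure are invented for the illustration; they are not data from any real typesetting system.

    # Illustrative sketch: greedy line breaking driven by per-character widths.
    # The width table, default width, and line measure are invented for this example.
    CHAR_WIDTHS = {"i": 3, "l": 3, "j": 3, "m": 9, "w": 9, " ": 4}   # widths in points
    DEFAULT_WIDTH = 6
    LINE_MEASURE = 180                                               # maximum line length in points

    def text_width(text):
        # Sum the width of each character on the (candidate) line.
        return sum(CHAR_WIDTHS.get(c, DEFAULT_WIDTH) for c in text)

    def break_into_lines(words):
        # Add words to the current line until the next word would overflow the measure.
        lines, current = [], []
        for word in words:
            candidate = " ".join(current + [word])
            if current and text_width(candidate) > LINE_MEASURE:
                lines.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        if current:
            lines.append(" ".join(current))
        return lines

    if __name__ == "__main__":
        sample = "documents are the medium through which much of our cultural heritage is preserved and exchanged"
        for line in break_into_lines(sample.split()):
            print(f"{text_width(line):4d}  {line}")

Real typesetters, of course, also had to weigh hyphenation points and justification, which is what made the calculation worth giving to a computer in the first place.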
With the emergence of the PC in the 1980s, text processing became accessible to the personal computer user, and we entered the second era. While 1980 saw the emergence of the PC, which would have a dramatic impact on document processing, three other things were happening out of the mainstream that were equally if not more dramatic. First, in 1980, Xerox unveiled the information system of the future -- an Ethernet-based set of workstations and servers that had a graphical user interface with a mouse and a bitmapped display, a connection to a laser printer, and a model of text processing that was object based. Second, a Ph.D. student at Carnegie Mellon University developed a text processing system called Scribe. It could be viewed simply as the next-generation typesetting system, built upon the insights gained from the then decade-old nroff/troff systems on Unix and the Script system developed by IBM. It was, however, a new kind of system in that, for the first time, users marked up text not in terms of typographic characteristics -- e.g. "18 point bold Helvetica" -- but in terms of structural characteristics -- e.g. "heading". Scribe had made the transition from text processing to document processing. Finally, in 1980, a number of practical hypertext systems were being developed, including Xerox's NoteCards, Knowledge Systems' KMS, MCC's IBIS and gIBIS, and Brown's Intermedia. These efforts were echoes of Vannevar Bush's memex dream and Douglas Engelbart's first hypertext system at the Augmentation Research Center.
Finally, in the 1990s, three things came together that changed the focus yet again. First, the Ethernet, originally developed at Xerox PARC, was now used to connect most computers in offices and academia. These isolated networks began to be connected to each other across a research network developed by the Defense Department -- the Internet. Second, the graphical display developed at Xerox PARC had been through several generations -- XDE, the Mac, X Windows, and finally Microsoft Windows. Graphical user interfaces were now the standard for interaction and made it possible for users to learn new systems in a matter of hours. Third, Tim Berners-Lee, a physicist working at CERN in Switzerland, envisioned a simple network-based protocol and a simple universal naming scheme that could provide a kind of standardized hypertext. In addition, using the standardized form of markup that had developed out of several iterations of Brian Reid's work on Scribe, he defined a standard way of describing documents such that these universally identified documents could be viewed in an appealing graphical format. Thus, the new opportunities and challenges that we face as computer and information scientists have to do with how we manage large collections of standardized documents linked and available across a wide area network.
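The three ingredients described above -- a universal name, a simple network protocol, and a standard way of describing documents -- can be illustrated with a few lines of Python using only the standard library. This is a minimal sketch; the address used is a generic placeholder, not a document associated with this course.

    # Illustrative sketch: a universal name (a URL), a simple network protocol
    # (HTTP), and a standard markup for the retrieved document (HTML).
    from urllib.request import urlopen

    url = "http://example.com/"            # placeholder address, not a course resource
    with urlopen(url) as response:         # the protocol does the fetching
        html = response.read().decode("utf-8", errors="replace")

    # The document comes back marked up structurally (headings, paragraphs, links),
    # in the lineage of the Scribe-style structural markup described above.
    print(html[:200])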
The scope of the course might be outlined by defining document processing. A definition of document processing may be developed from its component terms. Webster provides two definitions for document: "a writing conveying information" and "a material substance having on it a representation of the thoughts of men by means of some conventional mark or symbol." Process is defined as "to subject to some special process or treatment (as in the course of manufacture)." While the implied definition is not quite comprehensive, it does provide a starting point.
Traditionally, courses on document processing have looked at information storage and retrieval, document imaging, or workflow. A course in any one of these areas would be a legitimate focus for a course on document processing. It is my feeling, however, that there are some new areas to be addressed and thought about. To get to a discussion of these new areas, we need to know where we are coming from, what we can do, and how we might go about doing it. Thus, from my perspective, the broad areas to be addressed in the course are the history of document processing, basic technologies, algorithms, standards, systems, and futures.
It is my belief that at the collegiate level, students share the responsibility for the learning experience. The instructor's role should be less to direct and spoon-feed and more to stimulate and guide learning. The instructor and students share the responsibility to make the course work. This means that students should set their own learning goals for a course. If a student has specific goals in a course, they are more likely to work to make the course productive. It is important in this respect that the course be interactive. While this will be difficult with the instructor an ocean away, you are encouraged to talk with each other and with me by e-mail as regularly as possible. You are also encouraged to use CASCADE as a tool to improve communication. While we are together, my lectures will be very interactive -- I will be asking questions about how you see the material. If you are squeamish about being asked questions in class, or get embarrassed, please let me know.
It is the student's responsibility to read and learn the material in the textbooks. It is my job to clarify what the textbooks fail to make clear and to go beyond what is said in the textbooks to new or more difficult ideas. This will be the goal of the lectures I give. I encourage you, if at all possible, to read the relevant chapters from my book during the first week of lectures. This will be difficult, but it will help you a lot. Of course, it goes without saying that you should have read all the assigned materials before I return in November.
While much can be learned by rote memorization, things learned by memorization tend not to be the skills that one generalizes and applies in later life. A different kind of learning takes place when students engage in the process actively. This course is based upon students being actively engaged -- in class, in working on projects and in the reading. I think that real learning occurs when you produce products that work. For that reason I encourage you to undertake some of the recommended projects listed in this syllabus.
The goals of the course are as follows:
Within these broad goals, students will define specific objectives for their own learning during the course.
There are a number of sources that students will use in this course. The primary textbook for the course is:
Electronic Document Management Systems: A Portable Consultant, by Thomas M. Koulopoulos and Carl Frappaolo
You will also read some chapters from one or two books by the instructor which will be provided to you electronically. They are:
Electronic Printing and Publishing: The Document Processing Revolution by Michael B. Spring.
Hands on PostScript by Michael B. Spring and David Dubin.
You will also be pointed to a variety of other reading materials which will be accessible in the library, on line, or through CASCADE.
There are significant differences between the organization of instruction in the US and Norway -- just as there are differences among most educational systems around the world. In the US, a college course is taught by a faculty member who is also responsible for grading. Normally, the grades are based on a mixture of assignments and tests that students complete over the course of study. There may even be components of the grade based upon student participation in class or upon the effort of the entire class on a project of some sort. In Norway, as I understand it, while there may be assignments and other course requirements, they are generally not a component of the grade, which tends to be based more completely upon the assessment of a final exam or project. Generally, an outside faculty member grades these exams or projects. The final determination of how this course will be graded will be made by the Molde College administration and will be discussed with you during the first week. It is anticipated that it will be along traditional Norwegian lines -- i.e. a final written exam.
Below, I have outlined a series of assignments that I would ask US students taking this course to complete, along with the number of hours I would expect them to devote to each. I would encourage you to undertake these assignments. Should you wish to, you may submit the work to me for criticism during the course of study.
Only do one of these. (10 hours)
Only do one design. (20 hours)
Keep in mind that these are suggestions for activities that you might engage in during the course of study. A final decision about what you will do will be made during the week of August 23.
This course will cover the following topics:
The first four topics will be addressed by the instructor during the August class meetings in Molde. The next two topics will be covered by the students and the instructor during the months of September and October. Most of this interaction will take place electronically over the Web, using e-mail and the CASCADE system. The last two topics will be addressed by the students and the instructor during the month of November, during class meetings in Molde. Somewhat more detail is given in the tentative outline below. Keep in mind that some portion of this may be modified by the instructor based on the first week of the course in August.