Document Processing:
WWW and Internet Technology

DIST 2770
Fall, 2002 (03-1)
Monday 6:00-8:50, Room 404
Michael B. Spring
Department of Information Science and Telecommunications
University of Pittsburgh
727 SIS Building
Personal Email:
Class Email:
Office Hours:  Monday-Friday 8:00-6:00
Phone: 412-624-9429

Note 1:  The University will be closed on Monday, September 2nd.  Thus, there will be only one class before the end of the add-drop period on September 6th, 2002.  Students should read the syllabus carefully and be sure they can attend the first lecture on August 26th, 2002 to ensure that they are prepared to take the course.


Note 2:  This course requires students to use multiple programming languages and to work on the Unix operating system.  Most programming examples will be given in C and Java.  Students who are not proficient in at least one of these languages should consult the instructor before taking the course.  Students will learn and program in both Perl and JavaScript.  While no prior knowledge of these languages is required, students who find it difficult to pick up a programming language should be prepared to do a fair amount of additional lab work.  Finally, much of the work will have to be done on the Department’s Unix system.  Familiarity with Unix and the Unix programming and development environments is highly desirable.  If you have not worked in the Unix environment, you should spend some time prior to the term becoming familiar with Unix, Unix editors, and Unix programming and debugging environments.


The focus of this course has changed dramatically over the last decade.  Early versions of this course focused on algorithms and models for text processing, consistent with the need to develop stemming algorithms, stop lists, and compression algorithms. The course used SNOBOL and Unix to look at text processing algorithms. In 1992, a revised course was introduced, looking more at document processes. Text processing was relegated to about a quarter of the course, with a focus on implementation of key algorithms in PostScript and C. The course introduced document design in the context of SGML, development of tools for structured and hypertext document manipulation, and collaborative authoring. With the emergence of the World Wide Web, Java, JavaScript, and Perl became more important and have been introduced.  Recently, SGML has given way to XML, which will eventually replace the anemic HTML standard.  The XML family of standards is the legitimate subject of an entire course and is now being emphasized in the course to the extent that time permits.

These changes reflect developments in document processing that have been spread over a period of four decades. However, the widespread impact of these changes has been more recent—precipitated by the growth of the “World Wide Web”. Today, document processing is one of the most rapidly evolving areas in information and computer science. A brief review of the history of these changes is in order.

Stage 1: Text Processing

Most people are well aware of the early use of computers to do tedious calculations, notably those required for targeting artillery shells, doing census work, and making financial calculations. The emergence of computers in the back offices of corporations and banks is fairly well understood. Less well understood is the use of computers for document processing.  Computer controlled typesetting equipment required millions of repetitive calculations to determine line lengths and hyphenation decisions.  (The length of every line had to be calculated based on the width of each character on the line.  In addition, interword and intercharacter spacing had to be optimized.)  Thus, the early relationship between document processing and computing had to do with repetitive calculations related to typesetting. Computers were also useful in assisting in hyphenation decisions and in spell checking. Simultaneously, a number of researchers were examining how bibliographic records might be stored on computers and searched and accessed more quickly. This gave rise to the development of a series of approaches to information storage and retrieval. The tasks of storage and retrieval, when examined in detail, involve preparing and processing the text using a number of sophisticated algorithms. Finally, early researchers in this field were faced with the need to optimize the use of very expensive resources. Until 1980, bytes of memory were counted in thousands and disk storage was counted in hundreds of thousands. This meant that far less than one book could be held in memory at one time and a single book could absorb all the storage resources of a given computer. For this and other reasons, compression algorithms were of interest to these researchers. In short, computer scientists interested in documents were intensely interested in the algorithms used to process text.

Stage 2: Document Processing

With the emergence of the PC in the 1980s, text processing became widely accessible to the office worker. While the PC would have a dramatic impact on document processing, three other things were happening out of the mainstream that were equally if not more important. First, in 1980, Xerox unveiled the information system of the future: an Ethernet-based set of workstations and servers that had a graphical user interface with a mouse and a bitmapped display, a connection to a laser printer, and a model of text processing that was object based. Second, a PhD student at Carnegie Mellon University developed a text processing system called Scribe. In one sense, Scribe was simply the next generation typesetting system built upon the insights gained from a series of systems (pub, runoff, nroff, script, etc.). More importantly, it was very different in that users marked up text not in terms of typographic characteristics (e.g. “18 point bold Helvetica”) but in terms of structural characteristics (e.g. “title”, “footnote”, “quote”). Scribe marked a shift from text processing to document processing; more technically speaking, it marked the transition from procedural to structural copymarking. Third, during the 1980s, a number of practical hypertext systems were being developed, including Xerox’s NoteCards, Knowledge Systems’ KMS, MCC’s IBIS and gIBIS, and Brown’s Intermedia. These efforts were echoes of Vannevar Bush’s memex dream and Douglas Engelbart’s first hypertext system at the Augmentation Research Center.

Stage 3: Universal Hypertext

In the 1990s, three things came together that changed the focus yet again. First, the Ethernet first developed at Xerox PARC was now used to connect most computers in offices and academia, and these isolated networks began to be connected to each other across the Internet, an evolution of the ARPANET, the research network developed by the Defense Department.  Second, the graphical display developed at Xerox PARC had been through several generations: XDE, the Mac, X Windows, and finally Microsoft Windows. Graphical user interfaces were now the standard for interaction and made it possible for users to learn new systems in a matter of hours. Third, Tim Berners-Lee, a physicist working at CERN in Switzerland, envisioned a simple network-based protocol and a simple universal naming scheme that could provide a kind of standardized hypertext. In addition, using the standardized form of markup called SGML, which was conceptually rooted in the tradition of Brian Reid’s Scribe, he defined a standard way of describing documents such that the universally identified documents could be viewed in an appealing graphical format. Thus, the new opportunities and challenges that we face as computer and information scientists have to do with how we manage large collections of interconnected standardized documents available across a wide area network.

As this course is offered, another stage is emerging in which the nature of the nodes in this universal hypertext is evolving from static documents to dynamic active document forms.  But that story is for the future.

Conduct of the Course

Philosophy of Instruction

DIST 2770 is a graduate course in which students share the responsibility for creating a learning experience. The instructor’s role is less to direct and spoon feed and more to stimulate and guide learning. The instructor and students share the responsibility to make the course work. This means two things:

·       PREPARATION: Some students hate to be lectured to from a book, others love it. Some students hate interactive classes, other students love them. This course will be interactive, and it will involve lectures that move well beyond what is written in the books. If you are squeamish about being asked questions in class, please let me know.  Otherwise, it is my style to challenge you in class to think about the issues and to question you about your grasp of the material.
As I see it, it is the student’s responsibility to read and learn the material in the textbooks. It is my job to clarify what the textbooks fail to make clear and to go beyond what is said in the textbooks to new or more difficult ideas. The lectures will begin with the assumption that students have read and understood what is in the books. If you have not read the assignments prior to class, the lectures will be very difficult to follow. It is important that you come to class prepared to discuss and move beyond what was in the books and readings.

·       ENGAGEMENT: While much can be learned by rote memorization, things learned by memorization tend not to be the skills that one generalizes and applies in later life. A different kind of learning takes place when students engage in the process actively. This course is based upon students being actively engaged in class, in the assignments, in the reading, etc. DIST 2770 is structured to provide a variety of hands-on learning experiences that students will have to struggle at. The goal is to learn by producing products that work. In all cases the products will be both toys and real. That is to say, they will provide real functionality, but at a level that is attainable within the course of a term.


The goals of the course are as follows:

·       to review and appreciate the evolution of electronic printing and publishing and to understand the basic technologies used;

·       to understand the mechanics of WWW protocols, servers, and clients;

·       to develop software to manipulate symbolic and/or image information of various forms, e.g. half-tone images, line drawings, plain text, typographic text, and special text components such as tables, equations, cross references, indices, etc.;

·       to understand the nature, functionality, and limitations of current forms of electronic records, e.g. HTML, XML, SGML, etc., as well as the mechanisms used to manipulate structured electronic records;

·       to learn and use the various languages (Java, C, Perl) currently being used to manipulate information on the World Wide Web;

·       to assess trends in the technology and the probable nature of future formats for electronic records, and within this context understand the implications for selection, use, management and preservation of electronic documents and records;

·       to be able to analyze and design comprehensive systems for the creation, dissemination, storage, retrieval, and use of electronic records and documents.

Within these broad goals, students are encouraged to define specific objectives for their own learning during the course.

Introduction to the Course

At the heart of any system are basic text processing algorithms, and the course begins with both a history of document processing and a quick review of the functions, packages, and libraries that are available.

In addition, because the web is a distributed application, the course will look at client-server document-related protocols, e.g. HTTP. The design of servers will be reviewed to understand the focus of server-side programming.  The design of clients will be examined to introduce the basic paradigm for spiders and agents.  Finally, as time permits, we will examine the emerging XML standards and the impact they will have.


The two main books for the course will be those shown below. Of course, students will also be expected to use the resources available electronically on the Web.

Platinum Edition Using XHTML, XML, and Java 2, by Eric Ladd, Jim O’Donnell, Mike Morgan and Andrew Wyatt
List: $59.99; 2nd edition (book and CD-ROM); Hardcover, 1400 pages
Que, November, 2000, ISBN: 0789724731

CGI Programming: Perl for the World Wide Web, by Jacqueline D. Hamilton
List: $24.95; Paperback, 210 pages
1999, ISBN: 0-9669426-0-4

The second book is a rather straightforward and direct look at server-side coding of CGI programs. We will review it in its entirety. The first book is a comprehensive reference work and will be consulted regularly throughout the course, but will not be read cover to cover. Both books are essential to the course, and while they are somewhat expensive, I think you will find them well worth the cost and useful as references for several years to come.

Course Mechanics

There are several things that you need to keep in mind as you work on this course. At some point you will forget one or another of these things. Try to remember that this is the place to come check for the detail again.

Regarding CASCADE:

Students are encouraged to use a system, CASCADE, that the instructor developed as a part of a research project for the National Institute of Standards and Technology.  CASCADE stands for "Computer Augmented Support for Collaborative Authoring and Document Editing." CASCADE allows users to access a document space and it provides a series of tools that make it easy to browse and interact in the space.

CASCADE is applicable in different ways to the courses I teach. For Client Server, CASCADE is a good example of a three-tier client-server application with more than 100 protocols and a sophisticated set of business rules operating on the DBMS persistent store. For document processing, it provides some sense of what a collaborative authoring system might look like.  For Interactive Systems, the client provides examples of agents, visualization, and accommodation in interface design.  Using CASCADE with its frailties and strengths will give you some sense of the problems inherent in designing interactive client server systems for document processing. The system can serve as a model for systems you might develop related to the final projects in my course.

·       To run CASCADE:

·       (On Solaris machines) Simply type the word cascade on the Solaris systems in the lab.  You will be told the first time that some local information is being set up.  You will also need to set up the server information the first time you run the system (see below).

·       (On lab PCs) Simply select Cascade from the start menu. You may need to set up the server information each time you run the system, as personal profile information is not saved (see below).

·       (On your own PC) Obtain a copy of the CASCADE client from the CASCADE web site.  It is a self-extracting zip file that should install fairly easily on a Win95/98/2000/NT platform.  The setup is fairly automatic. You will also need to set up the server information the first time you run the system (see below).

·       To set up server information:

·       There is a drop down combo box that allows you to set up a name, host, and port name. The name can be any string you want, host for the class accounts is "" and the port is 7000. In class, you will be provided with a username and password that will allow you to login to the CASCADE server.  If you want to try to access the server before getting an account name, the account “guest” with the password “guest” provides minimal read only access.

·       There are a number of ways you can learn more about CASCADE should you wish to.

·       there is an extensive online help system.

·       there is a web site which contains a lot of information about CASCADE.

·       there is a series of ten videos that can be run on a PC.

Regarding Homework Submissions

Assignments emailed to the instructor should be sent to unless you have been specifically instructed to send them to my personal mail account.

There is nothing more frustrating to a student than to have homework not be graded.  There is nothing more frustrating to an instructor than to have homework submitted incorrectly or with insufficient information.  Before you mail an assignment to me, please make sure that it meets the specific requirements for how it is to be submitted.  While there may be additional specific requirements set up in class, the following guidelines should be of help:

·       Papers:

·       Any paper that is submitted should be submitted in duplicate.  It should be carefully proofread and formatted professionally.  The paper should identify you, your email address, your social security number, the course, the term, the CRN, and the assignment for which the paper is submitted.

·       Projects:

·       Any project that is submitted should be thoroughly tested to ensure that I will be able to run it on my machine.  The project source code and executable files should both be included.  The material, if it is extensive, should be zipped up in a zip or jar file.  Care should be taken to make sure that all necessary supporting DBMS and lib or jar files are included.  A readme file should be included that explains any particular constraints or steps that need to be taken.

·       ALL CODE THAT COMES FROM ANY SOURCE OTHER THAN YOUR HEAD NEEDS TO BE FULLY AND CAREFULLY MARKED.  This includes code which you have adapted from some source but which is essentially someone else’s work.  Failure to note such use is cause for a grade of 0 on the assignment and an F in the course.  All of your code should be carefully and professionally commented and explained.

·       In both the mailnote to which the project is attached and in the main file of the project, you should include:

·       The names of all participants

·       Email addresses and social security numbers

·       The course, the term, the CRN

·       The assignment for which the paper is submitted.


Regarding Course Files:

CASCADE will provide access to lectures, PowerPoint slides, reference documents, and sample code. If you are not using CASCADE, copies of sample code and other resource materials are available at  This space will contain sample code and selected reference documents that you may find of use.

Course Requirements

There will be six projects to be completed during the course. The first three of these projects are individual assignments. The fourth and fifth are also individual assignments, but students who wish to develop sophisticated CGI programs or spiders may petition to undertake such a project as a group assignment with up to three people.  Each additional person will increase the expectations for the project by 60% of a one person project.  Thus a three person project should be 220% of a one person project.  (More specifically, 4 CGI programs should become 10.) The final project will be easiest for a small group (2-4 people), but it may be undertaken as an individual project.
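The scaling rule above is simple arithmetic, and a short sketch may make it easier to check. The example is written in Python purely for brevity (it is not one of the course languages), and the function name `expected_scope_percent` is a hypothetical helper, not part of the course materials.

```python
def expected_scope_percent(group_size: int, increment: int = 60) -> int:
    """Return the expected project scope as a percentage of a one person project.

    Each person beyond the first adds `increment` percentage points.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return 100 + increment * (group_size - 1)


# Print the expectation for the group sizes allowed on these assignments.
for size in range(1, 4):
    print(size, "person project:", expected_scope_percent(size), "%")
```

For a three person group the function returns 220, which matches the figure quoted in the paragraph above.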

The Assignments are as follows:

·       Assignment 1: (5 points)  Design an extension to, or replacement for, your personal web site at SIS using advanced features of HTML.  The subject covered on the pages should be your goals, objectives, and interests related to document processing.  There must be at least 4 pages.  The pages must comply with good practice for web page design – i.e. there should be identification of responsible person, date of last modification, contact method, identifiable navigation mechanisms, etc.  The pages will need to use:

§        Tables

§        Imagemaps

§        Frames

§        Graphics

§        CSS Stylesheets

·       Assignment 2: (5 points) Write and debug a program in C on Unix, making use of workshop or dbx to locate problems in the program.  This exercise is meant to familiarize the student with the Unix programming environment.  It is also meant to help the student understand low level programming strategies related to images and strings.  The program will require the student to read and print out information about the contents of a GIF image on disk.  The student will also be required to read and parse a text file looking for certain patterns.  This will include use of the following:

·       the open, read, and write system calls related to file I/O

·       the creation of a structure, an array of structures, and the malloc function

·       the string handling functions

·       Assignment 3: (10 points) Develop a JavaScript program that provides client side error checking of input or client side manipulation of a downloaded data set. Some functions must be written from scratch.  Other functions, appropriately attributed, may be downloaded from the web and used as a part of the system or as the basis for your own functions.  Students who use a piece of code without attributing it will receive 0 on this assignment.

·       Assignment 4: (10 points) Write a program in Perl that analyzes a web server log to provide information about a web site.  The assignment should be used to learn Perl.  Students may provide a set of simple descriptive statistics (number of contacts, number of sessions, number of hosts, number of pages, etc.), or may attempt to tease some higher level information out of the log.  For example, an analysis of the paths users take through the site highlighting “dead spaces”, hallways, or landmarks would qualify.

·       Assignment 5: (20 points) Build a set of no fewer than four CGI programs (Perl, C, or Java) that either:

§        dynamically show the structure of a web site, or

§        process on-line bug reports

·       Assignment 6: (20 points) Build a rudimentary spider that serves as a site mapping tool (with visualization).  The program needs to have two main functions.  It will need to gather the data from a site and it will need to present meta information about what was found to the user.  In terms of data gathering, the spider will need to normalize URLs for comparison, and deal with both HREF and SRC URLs.  Between data gathering and data presentation, one or more files will need to be used to save the results of the search.  Finally, using the stored files, the program will need an interactive component to present the results in some visual way.
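The data gathering step of the spider in Assignment 6 can be sketched as follows. This is a minimal illustration in Python (used for brevity; the assignment itself is expected to be done in one of the course languages), and it covers only link extraction and URL normalization. The names `normalize_url`, `LinkCollector`, and `extract_links` are hypothetical, and the normalization rules shown (lowercasing the host, dropping fragments and default ports) are one reasonable choice, not a required specification.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize_url(base: str, link: str) -> str:
    """Resolve `link` against `base` and reduce it to a comparable form."""
    absolute, _fragment = urldefrag(urljoin(base, link))
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # any user:password@ part is dropped
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


class LinkCollector(HTMLParser):
    """Collect normalized HREF and SRC targets from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so HREF and href both match.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.add(normalize_url(self.base_url, value))


def extract_links(base_url: str, html_text: str) -> list:
    """Return the sorted, de-duplicated links found in `html_text`."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return sorted(collector.links)


page = (
    '<a HREF="../docs/a.html#sec2">A</a>'
    '<img SRC="/img/logo.gif">'
    '<a href="http://EXAMPLE.com:80/docs/a.html">same page</a>'
)
for url in extract_links("http://example.com/site/index.html", page):
    print(url)
```

In this example the first and third links normalize to the same URL, so the page yields two distinct targets. Fetching pages, saving results to files, and the visual presentation are deliberately left out.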

Final Project

(30 points) The final project may be undertaken individually or in a group.  Expectations about project sophistication will be a function of the number of people working together.  Each additional person should add 60% of an individual project's sophistication.  A 6 person group should therefore produce a project 400% as sophisticated as an individual project (100% plus five increments of 60%).  Students may elect any one of the following four topics, given in the order in which they are encouraged to select them:

·        Develop a client server pair in Java for collaborative editing of XML documents that uses XML based messaging for information interchange.  The system must provide for branch and element level locking and shared editing of an individual text element. The instructor will provide a prototype that accomplishes about 70% of the functionality in a very rough form.

·       Design a system that will produce and manage a tutorial Web site.  The goal of the effort will be to write an interactive program that will allow a naïve user to produce a tutorial set of pages in a minimal amount of time without knowing very much about web site design.  The resulting set of pages should include appropriate variation in look and feel while remaining consistent with the other sites produced by the system.  The program should produce no fewer than 10 pages and scripts.  They should include navigation pages, tutorial materials, indices, exercises, tests, bulletin boards, etc.  Ideally, the program would be a simple set of questions that would be answered by the user. After answering these questions, the program would go ahead and write out all the files, create all the directories, etc., that would be needed to begin the web site.

·       Design a site management tool that extends the spider developed in Assignment 6.  The tool should find lost links, new or updated files, and heavily linked files, both target and source.  The tool should provide some basic utilities for fixing selected problems.

·       Conduct a research study on one of the following topics:

o       common characteristics of various types of web sites

o       e-business developments and strategies

o       the development of Web query languages

o       literature review of Web development in a given area such as customer service

o       the use of certificates, directory services, and security mechanisms


Keep in mind that each year, the School of Information Science sponsors the Information Engineering Competition. This awards competition is focused on recognition of excellence in the design and development of tools for information management. There are virtually unlimited opportunities for projects that will not only get you an A in this course, but a $500 award and recognition at Honors convocation and graduation.

Assignment Due Dates

Assignment 1, week 3

Assignment 2, week 4

Assignment 3, week 6

Assignment 4, week 8

Assignment 5, week 10

Assignment 6, week 12

Final Project, last week of class


Grades for the course would then be as follows:

A = 90-100 points

B = 80-89 points

C = 65-79 points

F = 0-64 points

Course Outline

Scope of the Course

A definition of document processing may be developed from the component terms. Webster provides two definitions for document: “a writing conveying information” and “a material substance having on it a representation of the thoughts of men by means of some conventional mark or symbol.” Process is defined as “to subject to some special process or treatment (as in the course of manufacture).” While the definition implied is not quite comprehensive, it does provide a beginning point. The broad areas to be addressed in the course are:

·       Technologies

·       Standards

·       Algorithms and formalisms for text and image manipulation

·       Languages for electronic document processing

·       Document processing systems

·       Hypertext, hypermedia, and database documents

Within this broad scope, we will focus our attention on the topics listed below. These will include an overview of the Web, a look at page design, a look at servers, a look at structured documents, a look at CGI scripts and JavaScript, etc. We will conclude with a look at spiders and infobots and at next generation collaboration tools.

It is important to note that the outline below is very tentative and is subject to change as the term moves on.

Lecture Outline

Week 1: Overview of the Course.

This lecture will introduce the basic objects and issues addressed in the course. The objects addressed in the course include structured documents, hypertext, client-server computing, and protocols. The issues addressed in the course include how documents are created, stored, searched, retrieved, manipulated, and used in this new environment. It will explore old roles that will be reduced in importance and new roles that will be created.

·       Using HTML, XML, and Java

o       Chapter 3: HTML 4.0

o       Chapter 4: Imagemaps

o       Chapter 5: Advanced Graphics

o       Chapter 6: Tables

o       Chapter 7: Frames


Week 2: Conceptual Overview of the Internet and the Web.

This lecture will address the relationship between the Internet and the World Wide Web. It will provide an introduction to the basic concepts of client-server computing and will describe how to create a server. Some of the important Internet protocols will be described in overview. HTTP, version 1, will be explored in detail. The basic framework for the development of a Web server will be outlined.
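As a rough illustration of what the protocol looks like on the wire, the sketch below builds a minimal GET request and parses the head of a response. It assumes HTTP/1.0 (the syllabus says only "version 1"), uses Python for brevity rather than one of the course languages, and does no networking; the function names are hypothetical.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request as the bytes sent to the server."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",  # optional in HTTP/1.0, required in HTTP/1.1
        "",               # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw: bytes):
    """Split a raw response into (version, status code, reason, headers)."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


request = build_get_request("www.example.com", "/index.html")
print(request.decode("ascii"))

sample = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
print(parse_response_head(sample))
```

The request is plain text: a request line, optional header lines, and a blank line. The response has the same shape, with a status line in place of the request line, which is why a simple client or server can be written with little more than string handling and a socket.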

Week 3: JavaScript and Scripting Languages

The concept of active pages, aka Java and ActiveX, will be introduced. Lightweight applications for the web will be explored. The dynamics and economics of client versus server processing will be explored.

·       Using HTML, XML and Java

o       Chapter 18: Introduction to Java Scripting

o       Chapter 19: The Web Browser Object Model

o       Chapter 20: Manipulating Windows and Frames

o       Chapter 21: Using JavaScript to Create Smart Frames

o       Appendix A: JavaScript Language Reference

Week 4: Advanced Page Design: Forms, Style Sheets, and Applets

This lecture will begin with the definition of a set of principles for the design of a good web page. It will continue with the exploration of the capabilities that should be planned for as the web site is expanded to include new and anticipated web capabilities.

·       Using HTML, XML and Java

o       Chapter 8: Forms

o       Chapter 9: Style Sheets

o       Chapter 36: Intro to Java

o       Chapter 37: Applets

o       Chapter 38: Using Input and Interactivity


Week 5: Structured Documents.

This lecture will review the various standards that support structured electronic documents on the Web. The lecture will begin with the history of SGML and an overview of the capabilities intended for SGML. HTML will next be reviewed as an example of an SGML document. Finally, XML will be introduced as the next generation standard. The lecture will also cover the basic functionality of hypertext and will introduce the URL structure, benefits, and liabilities. This lecture will provide a minimal introduction to Markup theory.

·        Using HTML, XML and Java

o       Chapter 11: Introduction to XML

o       Chapter 12: Anatomy of an XML Document

·       XML standard

Week 6: Writing CGI Scripts.

This lecture will introduce the basic paradigm for writing Common Gateway Interface (CGI) scripts. The lecture will include an introduction to shell scripting, Perl, and active pages as mechanisms for preparing pages computationally.

·       CGI Programming: Perl for the WWW

o       Chapters 1-8
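The basic CGI paradigm is that the server passes request data to a program through environment variables (and standard input), and the program writes headers, a blank line, and a page body to standard output. The sketch below shows that shape for a GET request. It is written in Python only as a compact stand-in for the Perl used in the readings; the form field `name` and the function `render_page` are assumptions made for this example.

```python
import html
import os
from urllib.parse import parse_qs


def render_page(environ) -> str:
    """Build a complete CGI response (headers, blank line, body) for a GET request."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    # `name` is a hypothetical form field used only for this example.
    name = query.get("name", ["world"])[0]
    # Escape user input before placing it in the page.
    body = f"<html><body><p>Hello, {html.escape(name)}!</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # A web server running this as a CGI program supplies QUERY_STRING in the
    # environment and relays whatever is written to standard output.
    print(render_page(os.environ), end="")
```

Keeping the page construction in a function that takes the environment as an argument makes the script easy to try from the command line, for example by setting QUERY_STRING by hand before running it.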

Week 7: CGI functions.

This lecture will introduce several potential applications using scripts to define dynamic pages. Surveys, logs, DBMS access, and other techniques for scripts will be demonstrated. In addition, security concerns and access restrictions in this kind of environment will be introduced.

·       CGI Programming: Perl for the WWW

o       Chapters 11-16

Week 8: XML, XPATH, and XSLT

This lecture will introduce the basic functionality of XML. It will also explore the relative merits of the various standards that are used for component parts of documents, as well as the various helper applications that are used by clients. The supplementary formats (and corresponding helper applications) will include: PostScript, Portable Document Format (PDF), Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), Windows Waveform Audio File (WAV), and Audio Video Interleave (AVI).

·       Using HTML, XML and Java

o       Chapter 11: Intro to XML

o       Chapter 12: Anatomy of an XML Document

o       Chapter 13: Creating XML documents

o       Chapter 14: Creating XML DTDs

o       Chapter 15: Notations and Entities

o       Chapter 16: Document Validation

Week 9: Web Servers.

This lecture will provide an overview of the capabilities and limitations of World Wide Web servers. The development of DBMS connections, authentication, and security will be explored. The benefits and limitations of stateless and idempotent servers will be explored.

·       Using HTML, XML and Java

o       Chapter 40: Network Programming

o       Chapter 41: Security

Week 10: Spiders and Knowbots.

This lecture will cover the design and implementation of spiders and other web crawling technologies.  It will explore issues related to the normalization of documents and of links.


Week 11: Spiders Revisited

Week 12: Next Generation Web Capabilities.

This lecture will explore the impact of the next generation of standards for the Web, including WebDAV. It will also explore the techniques that should be used to develop an organizational web site, and the capabilities and characteristics of three basic kinds of web sites: informational/marketing sites, intranet sites, and electronic commerce sites. Various capabilities and services such as security, searching, and visualization will be explored.

·       Using HTML, XML and Java

o       Chapter 20: Cookies and State Maintenance

o       Chapter 23: Dynamic HTML

o       Chapter 26: Webcasting (also 27-30)

Week 13: More on XML

This lecture will explore the intricacies of XML, looking at schema, RDF, and the XML formatting language.

Week 14: Collaboration Environments on the Web.

This lecture will explore the capability of the Web and the Internet to support various collaboration efforts—such as collaborative authoring or computer supported collaborative work.