
Introduction to SGML

Computer publishing technologies have changed how documents are produced, stored, and distributed. Documents are increasingly available in electronic formats, but the plethora of formats has itself become a problem. The Standard Generalized Markup Language (SGML), ISO 8879, was developed to provide a single interchange format and a sophisticated structuring format that brings rigor and capability to document processing. It is a technology that allows additional information to be added to electronic documents so that the value of the documents is maximized. That value includes the ability to manage and access documents, to automate their processing, and to detect structural errors. Examples of additional information include specifications for arranging and formatting the document for different types of output (electronic vs. paper), cross-reference links to other documents, and the structure of the document itself, which facilitates information finding and document merging.

Word processing packages and computer typesetting systems commonly use procedural markup (Helvetica font, 18 points, etc.), which specifies how text will be processed or will appear on output. In contrast, SGML uses structural markup (example, figure), which describes structured information independently of how the information is processed. Every SGML-based document requires a description, or set of rules, for the structured information in the document. This description is called a document type definition (DTD), and SGML provides a standard syntax for expressing DTDs. Any information that is marked up or added in an SGML document must follow the descriptions in its DTD. In other words, a DTD defines the valid components of a document and the rules for how subparts are organized. A DTD is an explicit statement of the preferences about the document that an author holds during authoring. Documents of the same type can share the same DTD, and each document written is considered a document instance.

A DTD is the core element of SGML documents. A DTD consists of definitions of the SGML markup (tags) allowed in a document. The definitions also include a formal description of the document and of the relationships among the elements (such as chapters, footnotes, or indices) within the document. A marked-up unit, an "element", has a start tag <Q>, content, and an end tag </Q>. For example, a title might be marked with the title tag as <TITLE> Document Processing </TITLE>. Markup can be applied to graphics, images, and other entities as well as text.
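The three parts of an element can be made concrete with a minimal Python sketch. It is an illustration only, not an SGML parser: the pattern and the function name split_element are invented for this example, and the sketch assumes an element with no attributes, no nested markup, and an explicit end tag, whereas real SGML also permits attributes and tag minimization.

```python
import re

# One element with no nested markup: start tag, content, matching end tag.
# (Assumption: no attributes and no tag minimization.)
ELEMENT = re.compile(
    r"<(?P<name>[A-Za-z][A-Za-z0-9.-]*)>(?P<content>[^<]*)</(?P=name)>"
)


def split_element(text):
    """Split a simple marked-up element into start tag, content, and end tag."""
    match = ELEMENT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a simple element: {text!r}")
    name = match.group("name")
    return {
        "start_tag": f"<{name}>",
        "content": match.group("content").strip(),
        "end_tag": f"</{name}>",
    }


parts = split_element("<TITLE> Document Processing </TITLE>")
print(parts)
```

The back-reference in the pattern requires the end tag to name the same element as the start tag, so a string such as <TITLE> ... </Q> is rejected.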

While the construction of a DTD and the markup of a document involve many considerations and can be extraordinarily complex, it is not unfair to think of an SGML document as consisting of three things. We believe this is a simple way to think about SGML:

  1. An SGML declaration that specifies parameters used to determine whether an SGML parser can interpret a document. For most users, SGML declarations are invisible.

  2. A DTD that contains definitions of the allowable markup. In practice, a document may carry a reference to the location of the DTD source rather than the entire set of definitions prepended to it. A DTD specifies a general model for document elements.

  3. A document instance that consists of structured, marked-up text (elements). The DTD stays fixed while the contents of document instances change: a DTD is written so that an unlimited number of document instances can be generated from it. In other words, one DTD could control a thousand documents that look different.

  
Figure: SGML, DTDs, and SGML Documents
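As a rough illustration of the second and third parts, the Python sketch below separates a small document into its DTD reference and its document instance. Everything in the sample is hypothetical: the memo document type, the file name memo.dtd, and the function split_document are invented for this example. The SGML declaration does not appear because, as noted above, it is normally invisible to users.

```python
import re

# Hypothetical sample: the DOCTYPE line names the root element and points to
# an external DTD file ("memo.dtd" is an invented name) instead of carrying
# the full set of definitions.
SAMPLE = """<!DOCTYPE memo SYSTEM "memo.dtd">
<memo><title>Status</title><para>All tests pass.</para></memo>"""

DOCTYPE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[A-Za-z][A-Za-z0-9.-]*)\s+SYSTEM\s+'
    r'"(?P<system_id>[^"]*)"\s*>'
)


def split_document(text):
    """Return the DTD reference and the document instance that follows it."""
    match = DOCTYPE.search(text)
    if match is None:
        raise ValueError("no DOCTYPE declaration with a SYSTEM identifier")
    return {
        "root_element": match.group("root"),
        "dtd_location": match.group("system_id"),
        "instance": text[match.end():].strip(),
    }


doc = split_document(SAMPLE)
print(doc["root_element"], doc["dtd_location"])
print(doc["instance"])
```

Replacing the instance while leaving the DOCTYPE line unchanged gives a different document of the same type, which is the sense in which one DTD controls many documents.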

The World Wide Web has made HTML (HyperText Markup Language) widely recognized. All the documents on the World Wide Web are document instances of one DTD (HTML). It is useful to keep in mind that HTML is really nothing more than a single DTD, and some might say a simple and weakly written one. There are two implications of this. First, because the HTML DTD is written to allow very flexible markup, it does not yield all of the advantages SGML might provide via a highly structured DTD. For example, the current HTML DTD does not have an "attribute" for the last revision date or the "author" of an individual element. Second, there are many other DTDs that might be used, such as those developed by the Text Encoding Initiative (TEI), the Association of American Publishers (AAP), the Air Transport Association (ATA), the DOD (CALS), O'Reilly & Associates (DocBook), etc.

SGML software, generally called an SGML parser or a markup parser, is required to interpret and process an SGML document. Among other things, a parser checks whether the document's structure is consistent with the description in a DTD and reports any inconsistencies found. Users can create a marked-up document with a regular text editor and validate it with a separate markup parser. Alternatively, an interactive SGML editor offers more convenience by validating the document while the author is structuring it.
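The consistency check described above can be sketched in Python under some loud assumptions. The memo document type and its three declarations are hypothetical, the content models are transcribed by hand into regular expressions rather than read from a DTD, and Python's XML parser stands in for an SGML parser, which works here only because the sample instances spell out every end tag.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for this hypothetical DTD:
#   <!ELEMENT memo  (title, para+)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT para  (#PCDATA)>
# Each content model is a regular expression over the child element names,
# where every name is followed by one space.
CONTENT_MODELS = {
    "memo": r"title (para )+",
    "title": r"",
    "para": r"",
}


def validate(element, errors=None):
    """Collect structural inconsistencies between an element tree and the models."""
    if errors is None:
        errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        errors.append(f"undeclared element <{element.tag}>")
        return errors
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        errors.append(
            f"<{element.tag}> contains [{children.strip()}], "
            "which its content model does not allow"
        )
    for child in element:
        validate(child, errors)
    return errors


good = ET.fromstring("<memo><title>Status</title><para>One.</para><para>Two.</para></memo>")
bad = ET.fromstring("<memo><para>One.</para><title>Late</title></memo>")
print(validate(good))
print(validate(bad))
```

A real parser derives the content models from the DTD itself and also checks attributes and entities; this sketch only shows the kind of report a parser produces when the element structure departs from the declared model.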

SGML document processing covers the whole life cycle of a document: creation, management, and utilization. In document creation, a created document must have valid SGML markup, and the result of document creation includes both a DTD and a document instance. When an author merges several document pieces into a single document, the parsing process ensures that the new document still conforms to the DTD's definitions. Once a DTD is created, it can be reused and distributed to other people. Document management systems may have an SGML parser that parses a document into fragments according to a DTD before storing them in a database. Conversely, when a document is retrieved, the system must extract the document fragments and reassemble them into the original form. Lastly, documents can be used in various ways. For example, an SGML document may be processed for display on an output device. Document exchange, revision, and retrieval all benefit from preserving documents in SGML form.
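The store-and-reassemble cycle of a document management system might look roughly like the following Python sketch. The table layout, the function names store and retrieve, and the memo sample are all assumptions made for illustration. Python's XML parser again stands in for an SGML parser, and the sketch handles only a root element without attributes whose content consists entirely of child elements.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, root_tag TEXT)")
conn.execute(
    "CREATE TABLE fragments (doc_id TEXT, position INTEGER, tag TEXT, markup TEXT)"
)


def store(conn, doc_id, source):
    """Parse a document and store each top-level element as one fragment."""
    root = ET.fromstring(source)
    conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, root.tag))
    for position, child in enumerate(root):
        markup = ET.tostring(child, encoding="unicode")
        conn.execute(
            "INSERT INTO fragments VALUES (?, ?, ?, ?)",
            (doc_id, position, child.tag, markup),
        )


def retrieve(conn, doc_id):
    """Reassemble the stored fragments, in order, inside the root element."""
    (root_tag,) = conn.execute(
        "SELECT root_tag FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT markup FROM fragments WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    )
    body = "".join(markup for (markup,) in rows)
    return f"<{root_tag}>{body}</{root_tag}>"


source = "<memo><title>Status</title><para>One.</para><para>Two.</para></memo>"
store(conn, "m1", source)
print(retrieve(conn, "m1"))
```

Because each fragment is stored with its element name, the database can also answer structural queries, such as selecting every para fragment, without reassembling whole documents.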

While HTML may not be a good DTD, it has introduced the world to structured documents; it has introduced hundreds of thousands of people to structural markup. At the same time, it has ignored the capability, power, and requirements of more general SGML applications. It is also unlikely that Web browser developers will embrace the cost of a full SGML parser implementation. Therefore, it is unlikely that full SGML capabilities will be added to servers and clients, which implies that Web browsers will never recognize other DTDs. For this reason, an intermediate solution has been proposed: the Extensible Markup Language (XML), which is being developed by the W3C. It maintains all the critical components of SGML but reduces the complexity of document processing by excluding the features of SGML that are difficult to implement and generally unused. Specifically, XML will provide the ability to define new tags and attributes in a document and will allow Web clients to recognize the richness and complexity of SGML documents.
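A small Python sketch shows what defining new tags and attributes looks like in XML. The memo document type, the author and revised attributes, and the sample values are all hypothetical. The parser in Python's standard library is non-validating: it reads the declarations inside the DOCTYPE but does not enforce them, so this demonstrates the syntax rather than validation.

```python
import xml.etree.ElementTree as ET

# A hypothetical document type declared inside the document itself. The
# "author" and "revised" attributes are invented to show per-element metadata.
SOURCE = """<!DOCTYPE memo [
  <!ELEMENT memo (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
  <!ATTLIST para
      author  CDATA #REQUIRED
      revised CDATA #IMPLIED>
]>
<memo>
  <title>Status</title>
  <para author="Lee" revised="1997-01-31">All tests pass.</para>
  <para author="Kim">Release is on schedule.</para>
</memo>"""

root = ET.fromstring(SOURCE)
for para in root.iter("para"):
    # "revised" is optional (#IMPLIED), so fall back to a placeholder.
    print(para.get("author"), para.get("revised", "unknown"), para.text)
```

The attributes here record the author and last revision date of an individual element, the kind of information that the HTML DTD discussed earlier does not provide.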






Michael Spring
Fri Jan 31 13:59:00 EST 1997