Welcome to my weblog. It is an unconventional blog in that I am not planning to post daily or weekly, but only as topics of interest emerge. I enjoyed playing a little with my initials and the word blog and am amused by the fact that it is as much something I am slogging through as something I am blogging about. This listing only shows the five most recent posts.
I will try to discipline myself to keep a more or less regular set of reflections coming, but I can't promise. I have disabled commenting and discussion as it ended up being more maintenance and cleanup than I cared to deal with. That doesn't mean your comments and thoughts aren't welcome. Should you wish to comment on what I have said, I will be happy to add your comments verbatim so long as they are not spam. Simply send an email to me at Pitt -- see my home page. I will insert it in the appropriate post with attribution if you wish. Please reference the title and date of the post on which you are commenting. Also, if you want to suggest a topic that might be covered or discussed, let me know and I will try to include it.
Here is access to my mBsLOG as an RSS feed.
The Next Generation Web and Social Capital (December 17, 2008)
Over the last month or so, a number of incidents have occurred that cause me to reflect on social capital and “Web 2.0”. I was surprised looking back on the BLOG that I had not addressed this issue directly in any of the posts. The need to address this topic began with an oversight board for the school. Last year, they advised the Dean that we should be doing more on social networking and Web 2.0. When I heard that was going to become a priority, I was a little perturbed. As far as I was concerned, we (and particularly I) had been working in this area for almost a decade. The fact that the board was not apprised of this work bothered me. At first, I blamed the Dean for not being aware of the work. In the last analysis, I blame myself for not talking more about it. This year, I made a presentation to the board in which I reviewed some of our work. I was pleased to hear that they were favorably impressed with our attention to the matter.
A part of the presentation had to do with work on collaborative authoring funded by NIST in the late nineties and doctoral dissertations that resulted from that work. We built a system that was designed to speed international standards development. In many ways we succeeded, but like so many other basic initiatives of that period, our work was swamped by the tsunami known as the World Wide Web. Two of the dissertations that came out of that work included Bordin Sapsomboon’s “Shared Defect Detection : The Effects of Annotations in Asynchronous Software Inspection” and Vichita Vathanophas’s “The Use Of Peripheral Social Awareness Tools In Collaborative Systems.” Both dissertations were published in 2000. Bordin’s was very traditional and demonstrated that defect detection could be improved using social software inspection. Vichita’s was more radical. She used the extensive logs maintained by the system to provide an indication of how people felt about the project they were working on and how willing they would be to contribute. For many, her dissertation smacked of big brother. I believe that what she was doing in the late nineties was no different than what is happening today. It was simply that the data collection and use was more explicit. Both of these dissertations demonstrated well-controlled studies of the impact of social networking systems.
Shortly after I made the presentation to the Board, I was asked to speak to various groups of students about the topic. In that process, I began to use the terms aggregate annotations and social capital as important concepts behind social networking and social tagging systems. I have addressed the issue of aggregate annotations in another post on this blog. (See my Seminar on Annotation Aggregation for more information.) Someone asked me about the term “social capital” and I did a web search so as to give them a reference. I was surprised to find one of my website pages on the first page of the search results! I found it referred to a doctoral seminar I gave in January of 1997. The seminar was inspired by a talk Robert Putnam had given at the first annual conference on leveraging cyberspace in October of 1996, which was co-sponsored by XEROX PARC and NIST. I had been invited to talk about Multi-level Navigation of Document Spaces. At this inaugural, and final, conference I was mesmerized by Robert Putnam, Mark Weiser, John Seely Brown, and Paul Saffo. Truth be told, I thought every presenter at that intimate conference was spectacular. (See http://nvl.nist.gov/pub/nistpubs/jres/102/3/j23mol.pdf.) Returning from the conference, with Robert Putnam’s research and challenge clear in my mind, I wrote up the charge for the doctoral seminar. It began with the following:
This seminar explores two questions. The first question is "what is social capital?" Assuming we can come to a consensus answer to this question along the lines that have already been suggested by Putnam and others, the second and more interesting question to be addressed in this seminar is "how might systems be designed to prevent the erosion of, or encourage the development of, social capital?" (See my Seminar on Social Capital.)
Would that I had followed up intelligently on my own hunch! I might not be writing about this, but sitting on top of LinkedIn or one of the other social networking sites!
Fast forwarding to today, we might ask similar questions: “What is Web 2.0 and where are we going?” Given all the confusion about Web 2.0, Web 3.0, and all of the technologies and applications, my personal preference is to ask what the Next Generation Web (NGW) might look like. In June of 2008, Cormode and Krishnamurthy of AT&T Labs published a wonderful article on the evolution of the Web. In my opinion it is the single most intelligent article on the topic. (See Graham Cormode and Balachander Krishnamurthy, Key differences between Web 1.0 and Web 2.0. First Monday, Volume 13 Number 6 - 2 June 2008.) The article is worth reading in its entirety several times. For purposes of this discussion, I combine several of their elegant observations as follows:
Moving forward, how do we understand what is going on and more importantly predict where we might productively move? It may be that the call for Web Science by Tim Berners-Lee and others is the answer. Being somewhat more of a traditionalist, I like the arguments put forward by Ed Chi of PARC. (See Ed H. Chi, The Social Web: Research and Opportunities, IEEE Computer, Volume 41 Number 9, September 2008, pp. 88-91.) Chi begins with a suggestion that the social web currently consists of three kinds of activities – information foraging, sharing and tagging, and collaborative creation. It makes sense to me to think about research aimed at “developing new theories and algorithms to model, mine, and understand socially constructed knowledge structures and social information networks.” This may indeed be exactly the same goal as others would set for “Web Science.” For me, the name of the discipline is not as important as the research questions. We have enough flexibility within our current disciplines to reach out collaboratively to address the basic questions. What is most important is that we forge intelligent questions based on a grounded conceptual framework.
Two final notes. I need to write a post for this blog on Knowledge Management and Collective Intelligence. Over the years, I have talked with disdain about these topics. Over the last couple of years, I have changed my position. It is becoming clear to me that there are occasions when it is important to make tacit knowledge explicit. Indeed, this has become for me the mantra of knowledge management. The example that I use most frequently relates to the vast store of knowledge that existed in the brains of nuclear engineers who worked for Westinghouse. With the resurgence of interest in nuclear power, it has become apparent to some that the vast store of knowledge that existed in the heads of those engineers has diminished as they have retired and passed away. If some kind of social system to capture this information had been in place at Westinghouse over the last 50 years, it might be possible today to go back and harvest the nuggets of knowledge and resurrect a nuclear program at Westinghouse more easily than will now be possible. IBM and others have recognized this and begun to develop aggressive programs that may serve to allow for better knowledge management.
Regarding collective intelligence, I have had a similar epiphany. It is based mainly on the work of one of my recent PhD students, Worasit Choochaiwattana, who developed a retrieval system based on the social bookmarking site delicious. The research was able to show that resources retrieved through his system were rated as slightly better than those retrieved by Google. The key here is that the set of resources used in his system was much smaller than the set used in Google – by three orders of magnitude. The implication of this finding for the size of the server farm needed as the base for the search engine is staggering. What makes this possible? There are two things. First and foremost are the rather brilliant algorithms Worasit developed. Second, and equally significant, along a very different dimension, is the filtering of the resources on which the search was conducted. As anyone who has searched recently understands, the number of “noise” resources that are returned as highly ranked is on the increase. Personally, I find little comfort in the fact that many other people have encountered the same problems I encounter. It used to be when I searched, I found people who were answering the question. Today, I find many people who are asking the same question. As it ends up, people don’t bookmark question pages much. They tend to bookmark pages with answers. It is this collective intelligence of bookmarking that delicious harvests. You may suggest that this is more common sense than intelligence, and I won’t argue. At the same time, I am coming to believe that we will find important ways to make use of this phenomenon, whatever we choose to call it. Personally, I am not opposed to calling it collective intelligence.
In conclusion, there are rich histories of the study of important concepts such as social capital which might inform our invention of a second generation of the web. It is important to rise above the rapid evolution of the web and the technologies employed and ask simple fundamental questions about what is going on. One of the most central of these concepts is that of social capital. Others include collective intelligence and annotation aggregation. At heart, the next generation web is about people as first class entities!
Will this course be a lot of work? (November 18, 2008)
I like to write blog entries that expose some information or contribute positively to how I see the issues in the field of Information Science. This entry tends to be something of a gripe and therefore I feel a strong need to preface, or maybe I should say justify, my remarks. I tend to be rather demanding as a teacher. At the same time, I work hard to help students. As one student put it, “he kicked us into gear so fast I picked up the material from his tough-love attitude, enthusiasm, and extremely clear and aggressive teaching style.” Obviously, I selected a quote that I think puts me in a favorable light. By way of more objective assessment, over more than a decade, course reviews have been consistently favorable – all within the top 25% of courses rated. Equally important to me, 90% of the students indicate that they learned much more in this course than in other courses they have taken. Ok, hopefully you will accept that I am not simply whining about students, but trying to express something positive about the way I would like students to approach their responsibility for learning.
A student came into my office the other day and asked if I was busy. I wanted to say “no I have just been waiting for someone to walk-in unannounced to disrupt my concentration on the rather difficult problem I was trying to solve,” but I held my tongue and replied “what can I do for you.” The student wanted to take one of my basic courses, but hadn’t programmed in a long time, and wondered if there would be much programming. I said there would be. He asked if the course would be a lot of work. I said yes. He asked if it would be too much work for him. I said I didn’t know. I asked some questions about how well he had programmed, how much he remembered, what kind of work load he was used to, etc. This went on for about 15 minutes – with the same question being asked over and over again. Basically, he wanted me to assure him that the amount of work in the course would be reasonable by his standards. I finally had to tell him that I could not assure him the course would be easy. I told him I thought he could learn a lot if he took the course. Finally I told him that the syllabus and the notes for every lecture were online for his review. He left rather disappointed. I believe he wanted to take the course because a lot of people talk about how much they learn in this particular course – it has a good reputation, but they also complain that it is a lot of work. Like many other visitors to my office, I suspect that this young man had heard that a number of students have reported that the ability to talk about things learned in my classes helps in job interviews. He wanted the advantage of having things to talk about in job interviews –– but wasn’t sure he wanted to expend the energy to acquire the knowledge. This phenomenon has led me to make a standard disclaimer in the first lecture of most of my courses. “You may have heard that some students report that talking about things they learned in my course helped them in a job interview. 
Please note (and I know some students who have made this claim) that what they are saying is “what THEY LEARNED in my course helped them in the interview.” They are not saying “the COURSE HELPED them” or “what I TAUGHT HELPED them.” I offer you an opportunity to engage in deep learning of the subject matter, so that you really understand the subject and can talk about it. If you do not internalize the concepts and principles, you will leave just as ignorant as you arrived.”
I spend between seven and fifteen hours in preparation for every three-hour lecture I give, and generally speaking, in the subjects I teach, that prep time does not decrease in a second or third offering of a course – because the landscape is changing. I have, on more than one occasion, spent in excess of two days solving a problem that once solved allowed me to explain it simply and completely to students in 15 minutes. I am not bragging about having simplified the concept or complaining about the time required to distill and simplify new concepts for the classroom. That is what I am paid to do. What bothers me is that many students don’t have a corresponding perspective, i.e. the amount that I learn will be correlated with the time spent in learning. I tell students at the beginning of each course that there will be 15 three-hour lectures – 45 hours. I don’t give midterms or finals, so they get a full 45-hour exposition. My expectation is that a graduate student spends three hours outside of class for every hour in class. To make the multiplication easier, I round 45 to 50 and multiply by three. They should plan on 150 hours of work outside the classroom over the term. I allocate 50 of those to reading the textbooks and other material I provide. I am a slow reader. Others are faster. I remind them that if they are very slow readers either because they don’t read much or because they are not native speakers of English, reading time may exceed 50 hours. That leaves 100 additional hours that they will need to commit to the homework, exercises, and projects. The assignments and projects that make up the course each contribute between 0 and X points to their grade. The sum of the X’s is 100. Moreover, a project that contributes 10 points to their final grade has been constructed such that it should require about 10 hours of effort. If you are really prepared, you might be able to do it in an hour or two. If you are not well prepared, it might take 15-20. 
If you need less time, it is because you are well prepared. If you need more time, it is because you lack prerequisite skills. These time discrepancies are not my concern – i.e. I will not make the course harder because some students come prepared, nor will I make it easier because some students are ill prepared. (I must admit that this is not completely true. Ten years ago, I devoted an entire lecture teaching html. Today, I assume students come to graduate school with knowledge of html.)
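For readers who like to see the arithmetic laid out, the time budget above can be sketched in a few lines of Python. The function names and the "preparation factor" are my own illustrative assumptions, not part of any course material; the numbers are the ones given in the text.

```python
import math

LECTURES = 15                 # three-hour lectures per term
HOURS_PER_LECTURE = 3
OUTSIDE_PER_CLASS_HOUR = 3    # expected outside hours per hour in class
READING_HOURS = 50            # allocation for textbooks and other readings


def workload_budget():
    """Return the term's hour budget as described in the post."""
    class_hours = LECTURES * HOURS_PER_LECTURE              # 45
    rounded = math.ceil(class_hours / 10) * 10              # 45 rounded up to 50
    outside_hours = rounded * OUTSIDE_PER_CLASS_HOUR        # 150
    assignment_hours = outside_hours - READING_HOURS        # 100
    return {
        "class_hours": class_hours,
        "outside_hours": outside_hours,
        "reading_hours": READING_HOURS,
        "assignment_hours": assignment_hours,
    }


def estimated_hours(points, preparation_factor=1.0):
    """Hours for an assignment worth `points`, at about one hour per point.

    `preparation_factor` is a hypothetical knob: well below 1.0 for a
    well-prepared student, 1.5 to 2.0 for an ill-prepared one.
    """
    return points * preparation_factor


budget = workload_budget()
```

With 100 assignment hours spread over 100 grade points, a 10-point project comes out at about 10 hours for the average student, and `estimated_hours(10, 2.0)` gives the 20 hours an ill-prepared student might need.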
My favorite assignment in this category is the first assignment in a course on client server systems. The assignment involves correcting a thirty-line piece of code. It is important to note that the code “worked” flawlessly. That is, they were given client software, which included the code they were reviewing. The code compiled without error, linked without error, and executed against server code I provided and ran without error. I tell the students when I give them the code that the corrections can be made with fewer than 100 keystrokes. (Actually, in an efficient editor, it is fewer than 20 mouse and keystroke actions.) Less than 10 percent of the students achieve a perfect score, most get a little more than half of the points, and some get less than 4 out of the ten points. Some students claimed to have spent well in excess of twenty hours on the assignment. My in-class correction of the code, along with a DETAILED explanation, takes less than 15 minutes. This assignment is a very extreme example of a prepared mind versus an unprepared mind struggling with a problem. In general, I can do one of my assignments in about half the time I calculate for students. I have no doubt that for some it takes twice as much as the average time calculated. (As a footnote, the client server assignment had logical errors that normal use never triggered. The point of the assignment was to emphasize the importance of flawless programming in an environment where hackers exploit logical errors in code that needs to run 24X7 in a hostile environment. Getting the code to compile, link, and execute is not the issue. Writing correct and secure code is the issue. Footnote to the footnote, I am always flabbergasted to learn that the current generation of IDE-based programmers seldom comprehend the difference between compiler, linker and runtime errors. 
Further, they seldom have a good mental model of the types of libraries and the different compiler and linker options – they don’t have much hope of understanding how paragraph alignment of variable storage can impact buffer overflow errors!)
To return to the question at hand – i.e. “Will this course be a lot of work?” Wrong question, I think. A much better question would be “what will this course give me an opportunity to learn?” The answer to the first question is what most students want. I would like students interested in taking my courses to think about the cost-benefit ratio of the course. The amount of work in any course is a function of your preparation to take the course. Courses cover material from a beginning point defined by the course description and the explicit, assumed, and implicit prerequisites. Explicit prerequisites are those listed in the course description. If the course requires “Data Structures”, data structures will be used, but will not be taught. The assumed prerequisites are those that may have been stated elsewhere. For example, to be admitted, graduate students must have a structured programming language. Implicit prerequisites are things like the ability to understand English; to read a textbook, reference manual, journal article, and program; to calculate arithmetic expressions and statistical measures; to follow an algorithm or logical inference, etc. The amount of work you will do will reflect that required by a solid graduate course plus all of the work you will need to do to make up for the deficiencies in your prior education. I am sorry if that sounds harsh, but we need to draw the line somewhere. Otherwise we find ourselves in a position where advanced graduate courses need to teach basic sentence construction, addition and division, and common sense logic. That leaves little time to discuss the implications of procedural, object-oriented, and declarative programming languages or the vagaries of regular expressions and the XML Path language. I look forward to seeing you in my courses and hope they will provide a productive framework within which you can learn some things that will serve you well in your professional career – beginning with your first job interview.
The Science of Information (November 5, 2008)
This post is a revision of a post on the science of information that I wrote on October 23, 2008. The earlier post, which makes up the last part of this post starting with the paragraph that begins "In 1969, Herb Simon…" was done in a rush and ended up providing a grossly inadequate response to the topic. I have added some material to overcome the inadequacy of the discussion, at least as I see it. This post now replaces the earlier post which has been removed.
In 1979, as a part of my dissertation work, I struggled with the question of what constituted a profession. As a part of that work, I asked the question of how professions differed from disciplines, which include the sciences. That led me to a 1966 book by King and Brownell (King, A.R. and Brownell, J.A. The Curriculum and the Disciplines of Knowledge: A Theory of Curriculum Practice. New York: John Wiley and Sons, 1966.) In their treatise, they lay out and discuss the characteristics of a discipline. These include obvious characteristics -- e.g. it is a community of persons, an expression of human imagination, a tradition; derivative characteristics -- e.g. there is a specialized language, it is an "instructive community", it has a literature, etc.; and what I consider the fundamental characteristics -- it has a domain of inquiry, a mode of inquiry, and a conceptual structure. While I would bow to my colleagues in History and Philosophy of Science who I suspect have much better models, King and Brownell provide a simple framework that makes sense to me. The three characteristics I call fundamental are where I focus this discussion. (Although I must admit that it is fun to ask how the Science of Information defines a "valuative and affective stance.") Let me address the domain, conceptual structure, and method of inquiry for a science of information.
It is pretty clear that the domain of physics is the physical universe, the domain of biology is living organisms, the domain of literature is writings, etc. At gross levels, these domains are difficult to constrain, but as we talk about astrophysics, or vertebrate biology, the domains seem to become a little more sharply defined. Sometimes, they get fuzzier -- e.g. molecular biology, or social psychology, but let us avoid that confusion and simply ask what the domain of information science might be. I am not very happy when the discussion turns to everything being information and leads information science to have a domain which includes all the other disciplines. The disciplines get more clear as they get more focused, or as we shall see below, the conceptual structure gets more clear and universal. I am also not happy when we suggest that we don't need to define or circumscribe the definition of information to have a science of it. I would suggest that lacking a reasonable definition of living organisms would make it very difficult to define what biology is all about. A related issue here has to do with outliers. While you and I would have no trouble agreeing that a vertebrate or a plant is a living organism, there are surely some fungi and other fringe entities that lack one or more of the attributes we use to define living organisms. We can argue about these, but Biology began by demarcating the 99% of the domain we agree is living organisms. (Actually, I don’t think anyone knew about the special cases until much later in time.) When it comes to information, it seems we only want to argue about the fringe of the domain and ignore the 99% that is at the core. So what is the domain of information science? I would say that it is the messages exchanged between humans that change the state of what the receiving human "knows" -- another definitional problem, but we will get to it in a future post. 
The interested reader should see my earlier posting -- September 25, 2008 on a definition of information.
Ok, we are coming to grips with what we want to study. What is the conceptual structure we overlay on the phenomenon? In physics, we have had a number of conceptual models of the physical world, at both microscopic and macroscopic levels. Newtonian mechanics worked for a long time. Quantum mechanics takes another view, not necessarily contradictory, but in some arenas of matter, more explanatory. I will be careful not to anger my colleagues in Physics by exposing any more of my ignorance of the subtleties of the conceptual domain. Suffice it to say for my purposes here, that in the Newtonian conceptual structure we find concepts like force = mass * acceleration. This concept is not a part of the domain of inquiry, it is a part of the conceptual structure that is overlaid to explain something about the domain. So, what is the conceptual structure of the science of information? Some might suggest that it is Claude Shannon's conceptualization of information as the sum, over the components of the message, of the log of the inverse of each component's probability. I would suggest this is a good start -- we have seen our Newton but are still awaiting Einstein. Moreover, while force is one small part of physics that was conceptualized, Shannon's measure of the amount of information in a communications channel is yet a smaller component of the conceptual framework that needs to be defined for a science of information. Would that I could share with you a comprehensive conceptual structure for information, or better yet a grand unified theory. Sometimes, I sense that I see something, but all too often what seemed so clear in a state of deep thought vaporizes as I work it. What I am convinced of is that information is a phenomenon worthy of our study and the domain can be demarcated. Further, if we "discipline" ourselves, we can begin to develop a conceptual structure. As in all the disciplines, that conceptual structure will evolve and face radical points of evolution over time. 
Today we are at a very primitive beginning with a few giants such as Claude Shannon and Alan Turing who have provided some first efforts at a conceptual structure.
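To make Shannon's measure a little more concrete, here is a minimal sketch in Python. It uses the standard formulation, in which a symbol with probability p carries log2(1/p) bits and the average over symbols is the entropy. Estimating the probabilities from the message's own symbol frequencies is a simplifying assumption of mine, and the function names are illustrative only.

```python
import math
from collections import Counter


def self_information(p):
    """Bits carried by one symbol of probability p: log2(1/p)."""
    return math.log2(1.0 / p)


def message_information(message):
    """Total bits in a message: the sum over its symbols of log2(1/p).

    Assumption: symbols are independent and p is estimated from the
    symbol frequencies within the message itself.
    """
    counts = Counter(message)
    n = len(message)
    return sum(self_information(counts[s] / n) for s in message)


def entropy(message):
    """Average bits per symbol: H = sum over distinct symbols of p * log2(1/p)."""
    counts = Counter(message)
    n = len(message)
    return sum((c / n) * self_information(c / n) for c in counts.values())
```

A message of one repeated symbol, such as "aaaa", carries no information under this measure, while "abab" carries one bit per symbol. Note that the measure says nothing about what the message means to the person receiving it, which is part of why I regard it as only a start.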
And now we turn to the method of inquiry. There are two answers to this question. The first is a simple evolutionary answer. I think, again I would bow to my colleagues in History and Philosophy of Science, that most disciplines have evolved from an early period in which the primary mode of inquiry was simple observation and classification to a more evolved mode that was more formal and which enabled assessment of the validity of the conceptual model via replicable evaluation. For most sciences, this more formal method has become some variation of the scientific method. I suspect that the maturity of the science of information warrants a longer period of observation and classification to build a base of concepts that we may later be able to relate. I like Zipf's law, and Metcalfe's law, and the many others that are little more than observational science, but in academia, we are always driven to the more formal methods, and we "know" that the best are those of the old natural sciences. Herb Simon has suggested that might not be the most appropriate methodology, and I whole-heartedly agree.
In 1969, Herb Simon wrote “The Sciences of the Artificial” in which he discussed the differences between natural and artificial sciences. As I read the book (I am holding the second, 1981, edition), he was encouraging his colleagues to develop a new paradigm for conducting research in the “design sciences.” What is cogent in these remarks, I credit to Herb Simon. What is silly in my remarks, I take full responsibility for. Surely, this brief entry cannot do justice to the carefully reasoned arguments he posits in a little over 200 pages. Let me begin with two passages from the book:
My dictionary defines “artificial” as “Produced by art rather than nature; not genuine or natural; affected; not pertaining to the essence of matter.” It proposes, as synonyms: affected, factitious, manufactured, pretended, sham, simulated, spurious, trumped up, unnatural. As antonyms, it lists: actual, genuine, honest, natural, real, truthful, unaffected. Our language seems to reflect man’s deep distrust of his own products. I shall not try to assess the validity of that evaluation or explore the possible psychological roots. But you will have to understand me as using “artificial” in as neutral a sense as possible, as meaning man-made as opposed to natural. (2nd edition, page 6)
And:
… hence we can set the boundaries for sciences of the artificial:
- Artificial things are synthesized (though not always or usually with full forethought) by man.
- Artificial things may imitate appearances in natural things while lacking, in one or many respects, the reality of the latter.
- Artificial things can be characterized in terms of functions, goals, and adaptation.
- Artificial things are often discussed, particularly when they are being designed, in terms of imperatives as well as descriptives. (2nd edition, page 8)
In some sense, rereading his words for the fourth time in as many decades, I feel that there is little left to be said – he really has said it all. I firmly believe, as I have described in other posts, that information is an artifact of the human effort to communicate. If that is the case, information science is an artificial science, and not a natural science. Natural sciences endeavor to describe and explain the natural world around us and that natural world is a given. Artificial sciences endeavor to improve the design of the artifacts that we create. As Simon points out, talk about artificial anything and the sense is that it isn’t as good as the natural thing. Natural sweetener is obviously better than artificial sweetener. An artifact is a construct of human imagination. A science of artifacts or the artificial is a science of the things we build.
In academia, there is strong pressure to do good research. Many times this is equated to descriptive and explanatory research focused on the natural world around us. We can’t make a pulsar something it is not. We simply try to explain it. In their research, engineers would be like physicists and doctors would be like biologists. Maybe we need to rethink the paradigm of our science, more focused on the matter of our science – artifacts – than on the paradigms of those who study nature.
In a previous post, I talked about structured documents. These are not a product of nature, but things constructed by humans. We have the ability to define and redefine them so as to meet our needs to build systems of artifacts. For example, consider a stipulated definition that defines them as sequences of symbols. If we find that we can’t communicate what we wish to via the existing symbol set, we can change the symbol set. As another example, if we find a sequenced set of symbols does not provide adequate facility to manipulate and control the document, we can define a document as a directed acyclic graph of elements over that symbol set. This might allow us to do partial locking and structural analysis. We could take the example further and introduce attributes and metadata to the model to give us additional capability. This kind of design science is very different from the descriptive natural science that says a document is what it is and it is our goal to describe it in its natural form.
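The two stipulated definitions can be sketched side by side in Python. This is only an illustration under my own assumptions: the `Element` class, its field names, and the locking rule are hypothetical and do not describe any particular document system.

```python
from dataclasses import dataclass, field
from typing import Optional

# Definition 1: a document is a sequence of symbols.
flat_document = "Notice. First point. Notice."


# Definition 2: a document is a directed acyclic graph of elements
# over that symbol set, with attributes and metadata added.
@dataclass
class Element:
    name: str
    text: str = ""
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    locked_by: Optional[str] = None

    def lock(self, user):
        """Partial locking: lock only this element, not the whole document."""
        if self.locked_by not in (None, user):
            return False
        self.locked_by = user
        return True


def count_elements(root):
    """Structural analysis: count elements by name, visiting shared ones once."""
    counts, seen, stack = {}, set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:        # an element reachable by two paths
            continue
        seen.add(id(node))
        counts[node.name] = counts.get(node.name, 0) + 1
        stack.extend(node.children)
    return counts


# One paragraph is shared by two sections, so this is a graph, not a tree.
notice = Element("para", "Notice.")
doc = Element("document", attributes={"status": "draft"}, children=[
    Element("section", children=[notice, Element("para", "First point.")]),
    Element("section", children=[notice]),
])
```

With the flat string, two authors cannot safely edit at once and there is nothing to count but characters. With the graph, one author can hold the first section while another holds the second, and a question such as "how many paragraphs does this document have?" has a direct answer.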
In these examples, we are describing a science that changes the object of study so as to better achieve the design goals set for it. If our cars don’t provide adequate crash protection, we redesign them to provide better crash protection. Similarly, if buildings don’t survive earthquakes, we redesign them so that they will. If the structures and mechanisms by which we create and share information are inadequate, we need to build new structures that allow us to achieve our goals. I am reminded of the Serpent, who in Act 1 of George Bernard Shaw’s Back to Methuselah, says to Eve: “You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'."
“Structured Documents” – Concept and Form (October 10, 2008)
I was in a discussion with my PhD students this week and the subject of structured documents came up. I was flabbergasted by some of the thoughts that were expressed and by the lack of agreement about what was meant by a structured document, both conceptually and technically. In this posting, I would like to address the issue of structured documents. In my conclusion, I begin what will be another discussion about the appropriate level of document structuring.
More frequently than I would like, students mention memex, Bush, Engelbart, NLS, hypertext, and the World Wide Web when they start to talk about structured documents. While hypertext documents are interesting, and do have a structure associated with them, they have very little to do with “structured documents” as I understand and talk about them. Using a document style or theme in Word does not result in a structured document – it results in a document that is styled not structured. A form can be a structured document, but forms per se are often more like records than structured documents. The World Wide Web is not the origin of structured documents, but structured documents do become more important in the context of the World Wide Web. Technically, an html document is a structured document, but it is a little like saying three kids arguing in a school yard are a legislative body.
Let’s begin at a very simple level. We might make an argument that any string is a structured document. Here I use the term string technically – it is a sequence of characters that may include control characters such as tabs and newlines. In this case, each character has a position in the string. We can say some things about the number of words in the string, the number of lines, the number of sentences and paragraphs – which gets somewhat complicated, etc. There are many possibilities at this low level, and while a string is a structure, it does not define what we mean by a structured document. Ok, let’s now move back through the history of the written word and take a look at structured documents in two eras, the non-digital era and the digital era.
We might imagine a document such as the Iliad or one of the books of the bible. We have a story from the oral tradition that is written down as remembered. It is a stream of characters, or words. It tells a story. It has a beginning and an end. Surely it is logically structured. It may or may not have a title. It may or may not have the signature or the name of an author. It may or may not have parts. If the story is well told, one suspects that it is conceptually well structured. What makes a “good” story is in large part the quality of the flow in the story. Given that it is a transcription of what originated as an oral presentation, it may have little literary structure. As a document, these manuscripts, and many manuscripts produced before mass production printing, have little formal structure.
Fast forward to a modern textbook. It has a title page. The title page has the title of the book, the author or authors, the publisher, and the city or cities in which the publisher has offices. The title may consist of a title and subtitle, but the title is singular. There is one publisher. There may be one or many authors. The title page is followed by a “cataloging page”, which my colleagues tell me is simply known as the “verso of the Title Page”, that contains among other things disclaimers, publisher address information, ISBN number, Library of Congress cataloging information, copyright, etc. These pages may be preceded by advertising pages, and are followed by dedication pages, forewords, table of contents, acknowledgements, and then the book proper. Normally, the book is made up of a series of chapters, but it may consist of “parts” which have chapters and the chapters may have sections and subsections. Within these structures, there can be paragraphs, figures, tables, examples, etc.
A modern textbook is a highly structured document. Over time, books have taken on a more structured form. A textbook tends to be more highly structured than a trade book. A scholarly journal tends to be more highly structured than a magazine. A business letter tends to be more highly structured than a personal letter. An academic CV tends to be more highly structured than a resume. In the real world, documents have differing levels of structure that are appropriate to the purpose of the document. The source and force of that structure varies greatly. The source may be regulatory, contractual, or consensual. The form of proposals for government funding is a matter of regulation. The provision of information to be included in a textbook by publisher Y is contractual. The information contained in a course syllabus, at least at my institution, is more a matter of general consensus. In all of these cases, no judgment is made about the appropriateness or sensibility of this structuring.
The history of the application of computer technology to document processing is long and complex. For the purpose of this discussion, I will divide it into four eras. The first era is the digital typesetting era. This era tends to be associated with procedural copy marking. In conventional publishing, layout editors knew how they would “structure” a textbook. This structure was reflected by the graphical layout of the “elements”. Layout editors learned that the title page was to be a recto page – generally the first page of the book. The index went at the back of the book and generally used two columns and a type size smaller than the body type in the book. Computer scientists worked to develop languages to instruct computerized typesetters to change fonts, margins, horizontal and vertical alignment, spacing, etc. Just as a layout editor would place layout copy marks in the manuscript for the typesetter, the user of early formatting software placed procedural commands in the text file. The high point in this era may well have been the development of TeX by Donald Knuth.
The second era is the heterogeneous device era. There was a period of time when line printers, laser printers, CRT screens, dot matrix printers, typesetters, and robot typewriters all coexisted. During this period, all the procedural languages (Script, Runoff, TeX, and a slew of PC-based languages such as WordStar, PeachText, etc.) were evolving into macro languages such as GML, XICS, and LaTeX. This era also saw the emergence of structural copymarking. In some ways, the difference between macros and structural copy marking is marginal. In other ways, it is very significant. Some people credit Charles Goldfarb with the “discovery” of structural copymarking and structured documents. I tend to credit Brian Reid who developed Scribe. Here’s the deal. A macro says that @title is associated with a set of procedural copymarks. It is easier to remember than all of the details, and it adds some standardization. A structural copymark says that @title is a copymark that can appear in a unit of the document called a @titlepage, and not elsewhere. Brian Reid developed Scribe to allow users to output their work to multiple devices. Thus, he created macros for each of the devices with common names. Then, he decided to go a step further. He developed what he called make files that contained information about the components in about a dozen types of documents – letter, slide, report, article, manual, etc. It was also possible, although very difficult, to specify new types of documents. Here’s the important thing. Generally speaking a macro was designed to aggregate procedural copymarks and execute them all at once. So, sometext @macro othertext would result in output where othertext came after sometext but was in a different style. In Scribe, you would say @chapter(SomeText) and that would cause SomeText to start on a new page, be set in some particular font, AND be saved to create a table of contents for the document.
Similarly, text@footnote[moretext] and text would cause “text and text” to be output with a superscripted number after the first text and “moretext” to be output at the bottom of the page with a matching superscript. Reid was in part responding to the need to deal with heterogeneous devices and was trying to make his “descriptive markup” easier to use. While GML and SGML get a lot of the credit – appropriately in terms of the standardization and generalization of the concept – Reid’s Scribe provided the earliest functional effort to develop structured documents that were much more than simple macro languages.
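The difference between a macro and a structural copymark can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the procedural command strings, the `expand_macro` and `placement_ok` functions, and the `StructuredDocument` class are all invented here and are not Scribe's actual syntax or implementation.

```python
# Hypothetical procedural commands; these are NOT real Scribe or TeX syntax.
MACROS = {
    "title": [".newpage", ".font bold 18", ".center"],
}


def expand_macro(name, text):
    """A macro only bundles procedural commands in front of the text."""
    return MACROS[name] + [text]


# A structural mark also knows where in the document it may appear.
ALLOWED_PARENTS = {"title": {"titlepage"}}


def placement_ok(mark, parent):
    """True if `mark` may appear inside `parent` (unlisted marks go anywhere)."""
    return mark not in ALLOWED_PARENTS or parent in ALLOWED_PARENTS[mark]


class StructuredDocument:
    """Structural marks format the text AND record facts about the document."""

    def __init__(self):
        self.output = []     # formatted stream sent to the output device
        self.toc = []        # chapter titles saved for a table of contents
        self.footnotes = []  # footnote bodies held for the bottom of the page

    def chapter(self, title):
        self.output.extend([".newpage", ".font bold 18", title])
        self.toc.append(title)

    def footnote(self, text):
        self.footnotes.append(text)
        marker = f"[{len(self.footnotes)}]"
        self.output.append(marker)
        return marker
```

The macro produces formatted output and nothing else. The structural versions of chapter and footnote produce the same kind of output but also leave behind a table of contents and a numbered footnote list, and the placement check refuses a title outside a title page.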
The third era is the WYSIWYG era. While the WYSIWYG era brought on by the development of the Alto and the STAR at the Xerox Palo Alto Research Center did a great many things to make our life better, the bitmapped screen and the laser printer caused some problems as well. Because these were “all points addressable” devices, there was no need to deal with the different idiosyncrasies of different devices. Whatever you could put on the screen could be printed. In this era, styles technology was dominant. It was now possible to select a different font and type size for each word on the screen. Sweep out some selection of text, and make it the same as some other selection of text. It became possible to do infinitely complex procedural copy marking without ever knowing one of the commands. It was – is – a struggle to get people to use styles consistently. It is just too easy to do anything we damn well please. The power of descriptive or structural copy marking was lost on those who now could do anything they wanted.
The fourth era is the WWW era. As Berners-Lee moved forward the concept of a universal repository for information, he developed a mechanism for identifying resources – the URL, a mechanism for transporting requests and responses – http, and a mechanism for representing resources – html. Berners-Lee was familiar with SGML because technical papers at CERN were formatted using an SGML Document Type Definition (DTD). He decided that he could write a DTD that could be used to represent documents in his system. It is not clear whether he actually intended to develop a universal document type or just the first and simplest of what would be many document types. It is clear that his document type was minimalist – a valid html document need only include a title element in the head, but most browsers were happy with less than that – i.e. nothing. Further, while it may not have been his intention, users used his elements as macros rather than descriptive or structural markup. Tables were used to establish formats and block quotes served to indent whole documents. The abuses were many. It quickly became clear that more was needed if we were to be able to unambiguously identify the author, publisher, publication date and structure of these resources. Thus, we began the process of revising SGML to occupy a smaller, more applicable footprint to solve some of the semantic issues related to resources on the web. Many, many coordinated pieces would be required to make this work.
XML and the family of XML standards – xslt, xpath, xsl-fo, schema, schema datatype, xlink, xforms, xquery, etc. – provide the standards, specifications, and technologies to create structured documents of varying levels. As most readers will know, this is done by specifying a schema. Schemas are hard to understand, and even the XML Schema Compact Syntax (XSCS) (Wilde, Erik, and Stillhard, Kilian. A Compact XML Schema Syntax. In Proceedings of XML Europe 2003. London. May 2003. (http://dret.net/netdret/docs/wilde-xmleurope2003.html)) can be daunting. Without endeavoring to specify a new syntax, let me suggest a simpler way to imagine a document type. Using simplified notations based on regular expressions, Backus Naur form, and the original SGML DTD syntax, let me specify the following rules:
Thus a simple definition might be given as follows:
Document type ::= Memo
Memo ::= headings[1,1], body[1,1], addenda[0,1]
headings ::= to[1,*] , from[1,1] , (date[1,1] & subject[1,1] )
body ::= paragraph[1,*]
addenda ::= cc[0,*] & enc[0,1]
to, from, subject, paragraph, cc, enc ::= {string}
date ::= {date}
In this example we define a memo as having headings (required and only once), body (required and only once), and addenda (optional but not more than once) sections in that order. The headings component contains one or more to components followed by one from component followed by one date and one subject where they can be in any order. All of these are defined as strings without further sub components. Even in such a simple example, there are many design decisions and complexities such as:
Further, these design decisions – as with most things in our document world – cannot be made in the abstract, but must reflect the consensus of the users of these types of documents lest they be ignored. Consider for example a structured document definition of a syllabus. Make it too structured and faculty will ignore it. Make it such that it constrains nobody and likely it will have minimal useful structure. I can imagine a thousand useful things that might be done if syllabi for all college courses used some standard format that contained a reasonable level of required detailed content. I can also imagine that the faculty in one small department of information science would not be able to come to an agreement as to what it should contain. Imagine getting all of the faculty in all of the departments of all of the colleges and universities in the United States to agree!
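Returning to the memo definition given earlier, here is a minimal sketch in Python of how such a content model might be checked. It rests on several assumptions: the `CONTENT_MODELS` table and the `is_valid` function are hypothetical, the "&" connector is read as "these groups in either order", a document is represented as nested lists of (name, content) pairs, and the {date} type is treated as a plain string rather than checked.

```python
import re

# One regular expression per element, matched against the space-terminated
# names of its children. Element names follow the memo definition.
CONTENT_MODELS = {
    "memo": r"headings body (addenda )?",
    "headings": r"(to )+from (date subject |subject date )",
    "body": r"(paragraph )+",
    "addenda": r"((cc )*(enc )?|(enc )?(cc )*)",
}


def is_valid(name, content):
    """Check `content` against the content model for element `name`.

    `content` is a string for leaf elements, or a list of
    (child_name, child_content) pairs for elements with children.
    """
    model = CONTENT_MODELS.get(name)
    if model is None:
        # Leaf element such as to, from, subject, paragraph, cc, enc, date.
        return isinstance(content, str)
    if isinstance(content, str):
        return False
    child_names = "".join(child + " " for child, _ in content)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(is_valid(child, sub) for child, sub in content)


# Hypothetical sample memos.
good_memo = [
    ("headings", [("to", "Staff"), ("from", "Dean"),
                  ("subject", "Budget"), ("date", "2008-10-10")]),
    ("body", [("paragraph", "Please review the attached budget.")]),
]
bad_memo = [
    ("headings", [("to", "Staff"), ("subject", "Budget"),
                  ("date", "2008-10-10")]),  # the required from is missing
    ("body", [("paragraph", "Please review the attached budget.")]),
]
```

Even this toy checker forces design decisions: whether a cc may follow an enc, whether a date is validated, and whether an empty addenda section is allowed. Each is a choice that the users of memos, not the schema author, would have to agree on.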
So, what is appropriate structuring? I begin with a simple classification of document types. Documents can be personal, group, organizational, enterprise (cross organizational), and archival. Documents may migrate from one category to another. At the personal document end of the continuum, the demand for enforced structure is minimal. I write a note to myself in any form I care to. My diary can be kept as I please. On the other hand, documents that need to be exchanged between organizations A, B, and Z need more structure. Consider as one example student transcripts. These documents need to contain certain information and we know what that is. I would contend that given the potential of structured documents, we should over-structure so long as we can do it without increasing the burden of authorship. If we can structure our personal diaries such that they can serve as archival documents, we will have wasted our time with my diary, but the benefit of having the diary of Colin Powell in a structured form might prove invaluable.
All of this has ignored the telling of a good story. In 1956, I began the process of learning how to tell a structured story in writing. This process went on for 12 years of daily instruction through high school. It involved learning to diagram sentences, outline topics, write good paragraphs, etc. It continued more seriously, but in formal ways less regularly through four years of college. Training then continued on the job – my first boss was a master writer of more than two dozen books. It also involved an intense period of mentored training as I wrote my dissertation. I believe that after 23 years, I had actually learned enough to consider myself a reasonably good writer of a structured story – my dissertation. Another ten years, including multiple articles, proposals, books, etc., brought me to a point where I consider myself roughly competent. Before we can learn to write good structured documents using XML and the tools that will emerge, we will need some significant training, beginning in grade school, about how to make the best use of these capabilities. For those of us now in our latter years, it is unlikely, even if we are good writers conceptually, that we will feel and embrace the potential of structured documents – just let me do it my own damn way.
So What is Information (September 25, 2008)
I have been a faculty member in the School of Information Sciences -- formerly the Graduate School of Library and Information Science -- for more than two decades now. When I arrived, the name of the department was the Interdisciplinary Department of Information Science. In the early 1990's it was changed to the Department of Information Science and Telecommunications. Recently, as part of a school wide restructuring, it became the Graduate Information Science and Technology Program. The constant in all the changes and evolution would seem to be Information Science -- or the Science of Information. I will address the “science” of information in a future piece focusing on King and Brownell's work on communities of discourse and Simon's thoughts on sciences of the artificial. For now, I would like to focus on information.
Let's begin simply. My birth date is May 27th. That is probably information for you. It would not be information for my mother were she still alive. You may not value the information, but I suspect you would agree that it was something you did not know. Somewhere, the "fact" that May 27th is my birthday is recorded, but that is not information in and of itself -- we refer to it as a record. It might or might not be information to you. Thus we have some first conversational concepts that we can use to define information. Let's begin with the fact that the measure of information is relative to the recipient. What is information to one human is not information to another. Is it possible to "inform" a building, or a computer, or an automobile? I am not quite sure how to answer that. On the one hand, I am prepared to state that what we refer to as information is tightly bound to the human experience. Some of my colleagues would argue that ant scent trails constitute information, and that computers produce information displays -- sometimes regardless of whether they are used by humans. I am prepared to engage these arguments and be convinced that information is a concept that has a scope beyond the human experience, but for now, I will ask you to accept a temporary stipulation that sans humans there is no information. The reason for asking for this stipulated limit is a desire to be able to easily build a more complete definition. If we expand the scope too quickly to all these arguable extensions, some of the points I wish to make will be much more difficult. So for this argument, information is a function of the human experience. If a tree falls in the wood and there is no human, it may make a sound, but we do not have any information about the fact that it made a sound, and we do not know that it fell, nor why it fell.
We may now take another step. We are stipulating that the receiver is restricted to entities we call humans. Further I propose that the measure of information is dependent upon the receiver. We need to somehow qualify what it is that makes something information to one human but not another. We know what it is informally. Let's see if we can make it a little more formal. If I already know a fact, receipt of that same fact is not information. Messages contain information when the message is about something not already known to the receiver. At the risk of moving too fast, what I know, the knowledge I have, acts as the mediator of whether a message contains information.
This leaves us with several concepts that are of use in furthering our inquiry. The first is the notion of a store of information which we will call knowledge. The second is the notion of messages which are delivered to us. The third is the notion of the contents of the message measured along a dimension we call information. A message may contain no information, a little information, or a lot of information. We may have a lot of knowledge, a little knowledge, or no knowledge. What we know may be partitioned into domains. I know a lot about Pittsburgh, a little about Bangkok and New York, and next to nothing about Nairobi. We could go on for a while here, but let me simply add one more caveat at this level. Information may be true or false. I may be misinformed. I may receive a lot of bad information. Pretty neat. Ok, there is a lot more at this level, like the value of information, but we will leave those discussions for now and turn to the messages, and then to disembodiment of information.
We suggested above that "messages may carry information". What is a message? Again, I am going to suggest that a message is something received by a human. Unlike information, it would be hard to argue that only humans process messages. Here I would have to agree that computers process messages and that the scent trails laid down by ants constitute messages to other ants. So I fully agree that a message is a very general concept and surely not restricted to the human experience. At the same time, I will stipulate that for now, I am only speaking about messages that are received by humans. I would suggest that there are two broad classes of messages that are processed by humans. The first is messages that consist of raw data processed from the environment. The sound of screeching brakes, the sound of water running, the smell of smoke, a sunset, the warmth of the summer sun. All of these experiences, whether direct or indirect (e.g. a picture of a sunset, or a recording of a gunshot), may contain information. We don't always define these experiences as "messages" and that is fair, but I would argue that these sensations or experiences should be defined as low grade messages. There is much to be discussed here about how these patterns of signals move from signal to data (pattern) to information -- it is raining outside. It has a lot to do with the knowledge we bring to bear on the signals. Those signals that make no sense -- that form no pattern -- are commonly referred to as noise. One of the goals of natural science is to turn noise into information. So at this level of messaging, we begin to introduce the concept of noise as meaningless (information-less) patterns. We can come back to this concept and mine it further, but for this discussion, I want to turn from low grade messages, to high grade messages.
If we agree that we can grudgingly refer to sensory experiences as messages that can carry information, what is it that we really want to think of as a message? I think it is pretty easy to agree that a message involves a sender and a receiver and that it is pretty easy to imagine one human sending a message to another human. Again, I acknowledge that “message” is a very general concept, but here I am talking about exchanges between humans. An email message constitutes a first class message. It might be considered a "prototype" for messages, in the same way that psychologists suggest that a robin is the "prototype" for bird. This does not mean that a penguin or ostrich is not a bird, they are just not as prototypical as a robin. Let's expand our discussion about human to human messages. If my sister kicks me under the table during a conversation over Thanksgiving dinner, she sends me a message -- probably about the foot I am about to stick in my mouth. So, a hug, a kiss, a punch, can all be messages exchanged between humans. And for millions of years, that was how humans exchanged messages. With the development of spoken language about forty thousand years ago, our ability to exchange messages greatly increased. Four to five thousand years ago with the advent of written language, our ability to exchange messages was increased again. Spoken language allowed for same-time-same-place messaging. The technology of written language allowed for messaging across time and space. We did not have to be at Gettysburg on the afternoon of Thursday, November 19, 1863 to get the message from the president of the United States -- "that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion."
So, information may be contained in messages which vary from the "low grade" messages that contain raw sensory data that must be interpreted to the "high grade" messages that consist of sets of symbols that are constructed explicitly to facilitate the communication of information between humans. Indeed, I would suggest that we could profitably restrict the study of information to these high grade messages consisting of symbols. Once we understand information in this form, we can extend our exploration to all the other forms -- including the scent trails of ants. In this discussion, I turn to one final topic in my trail -- and that is not the scent trail, but coding and the beginning of formalizing a definition for information.
For several thousand years, we have been using language to exchange information in encoded form. To a large extent this is what I was referring to when I talked about the disembodiment of information. This has been a great boon to the advancement of civilization. We have not only encoded the individual pieces -- I was tempted to say bits, but that would be premature -- but we have organized these pieces and begun the process of assessing the validity of the information. This process is deserving of study in its own right. We can say that there was a lot of information in a given book. We can identify new information. We can label information public or private. We can store, transmit and access ever greater amounts of information in various forms of messages. Can we measure information? I don't think we yet have adequate ways to do this, but there have been some interesting developments.
One of the most interesting came from Claude Shannon and Warren Weaver. It has been overplayed in some circles, but it is interesting for what it is. At the core, they hypothesized that one measure of information could be the probability of a given piece of information. If the probability of a given message is unity -- there is no information. If the probability of a message is 50% there is some information in the message. If the probability of a message is 25% there is more information in the message, etc. Working for Bell Labs, Shannon was interested in how much bandwidth was needed to communicate a message, or how much space was needed to store a message. With the advent of digital computers using binary units to store a message, it became useful to think about how many binary digits would be required to store a message. Shannon suggested (a simplistic explanation) that one measure of information was I = log(1/p), where p is the probability of the message. We can thus say that if information is to be stored as an array of binary digits and the probability of the message is .5, we would need log2(1/.5) bits to store the message. The base 2 log of 1/.5 (i.e. 2) is 1. When the probability of a message is 50/50, I can record the message as a 0 or 1 in one binary digit. If the probability of a given message is .00390625 I would need log2(1/.00390625) or 8 bits. In this case, my magic number of .0039 is 1/256. What Shannon is saying is that if I wish to be able to have a message that can be any one of 256 equally likely symbols, I would need 8 bits to represent it. A rich analysis of information is possible based on a measure of information as the probability of a message. It opens the doors to computation, transmission, encryption, compression, and correction of messages stored in digital form. It provides a simple yet rich mathematical theory that allows us to do all sorts of things, and it is completely consistent with the discussion of information being put forward here.
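The arithmetic in this simplified account is easy to check. The short Python sketch below computes log2(1/p) for the probabilities mentioned above; the function name `information_bits` is just an illustrative choice, and this covers only the single-message measure described here, not Shannon's full theory.

```python
import math


def information_bits(probability: float) -> float:
    """Bits of information in a message of the given probability: log2(1/p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return math.log2(1 / probability)


for p in (1.0, 0.5, 0.25, 0.00390625):
    print(p, information_bits(p))
# prints:
# 1.0 0.0
# 0.5 1.0
# 0.25 2.0
# 0.00390625 8.0
```

A certain message carries zero bits, a coin flip carries one, and one symbol out of 256 equally likely symbols carries eight, matching the figures in the text.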
While Shannon and Weaver's definition is direct, elegant and powerful, I believe it lacks some of the richer notions I have tried to put forward here. It is not that their definition is wrong. Rather we require complementary definitions that allow us to examine other aspects of the phenomenon.
I have attempted here to argue for a definition of information that allows us to meaningfully partition the space and study the phenomenon. The core of my argument is based on information as a part of the human experience. Without humans, we can't talk about information. With humans, we can talk about information as a measure of the degree to which a message transforms the state of awareness, the knowledge structure of the receiver. We can partition messages into at least two groups -- those received via direct observation of natural phenomena and those received via some form of symbolic communication from another human. I would argue that while a comprehensive study of information is desirable, it may be more productive to begin with the analysis of symbolic messages between humans. Based on models developed in a simplified context, the theory and concepts of information might later be extended more broadly.
So this is how I would begin the definition of information.
Aesthetics (September 10, 2008)
I was lecturing last night on e-business web site design. In the first hour I talked about making a business case for a web project, covering business goals, process reengineering, return on investment, etc. I told the students that their “project proposal” was not about building cute websites, but about building cost effective systems that advanced business goals and provided a strong return on investment. Later in the class I turned my attention to various techniques and architectures for styling HTML and XML. It covered the structure and scope of Cascading Style Sheets (CSS) 1 and 2 and a comparison with XSL/FO (the Extensible Stylesheet Language Formatting Objects). In the process of the lecture, there were a number of times that I digressed into a presentation on font metrics and design issues that were more about aesthetics than e-business productivity. I think the lapses in focus may have been due to several discussions I have had with colleagues about how courses on interface design have migrated to courses on web design. I won’t digress much here from the topic of aesthetics, but suffice it to say that when I built the course on interface design years before the web, the focus was on a variety of principles and techniques for building quality interfaces – iPhone type interfaces that wrap around humans. For the most part, the web is the worst place to teach these principles and most of what people do on the web makes quality interface designers shudder. But back to font metrics, web design, and aesthetics.
Back in the early eighties I was heavily involved in the design of formatting software for the early laser printers, work heavily influenced by the challenges of achieving high quality graphic effects using low resolution laser printers. We thought a lot about type design, kern pairs, hyphenation algorithms, automatic page layout, white space allocation, etc. Last night, the lecture on style sheet design caused me to digress to aesthetics as I touched on several of these points. Let me give just a couple of examples. When certain proportionally spaced characters are juxtaposed, the result is aesthetically poor. As I recollect, the three most frequently occurring combinations are Yo, We, and Ta. Compared to say ll or TT, the apparent space between Y and o in Yo is too great and o needs to be “kerned” or moved back toward the Y to make the spacing look right. Thus, You looks better with kerning. I don’t think I have ever heard a concern in web design about the effect of kern pairs.
The next digression came in talking about font size -- as measured in points. This led to a discussion of ems, picas, and x-height. The measurement of font size derives from the size of the block of metal type on which the character was placed. In the image, the box surrounding the Y represents the type slug. For a 12 point font, that box is 12 points high, where a point is about 1/72.27th of an inch. Rounding to 72 points per inch, a 12 point font is 1/6th of an inch and a 36 point font is ½ of an inch. Beyond that trivia, good type design assigns different x-heights (the height of those components of a font that fill the area of the small letter x). For example, the “bowl” of the letter b in the image is determined by the x-height of the font. Open Word, type “goodbye my darling”, highlight it, select format font, set the size at 12 pts, and scroll through thirty or forty typefaces. Watch the relative size of the o’s and ask yourself how it impacts readability. There will be a lot going on, but generally, relatively larger x-heights tend to make fonts more readable.
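The point arithmetic above is easy to check in a few lines of Python. This is only a sketch; the constant and function names are my own, and it simply compares the traditional printer's point (about 1/72.27 inch) with the rounded figure of 72 points per inch.

```python
TRADITIONAL_POINTS_PER_INCH = 72.27  # traditional printer's point
ROUNDED_POINTS_PER_INCH = 72.0       # the convenient rounded value

def points_to_inches(points, points_per_inch=ROUNDED_POINTS_PER_INCH):
    """Convert a size in points to inches."""
    return points / points_per_inch

for size in (12, 36):
    rounded = points_to_inches(size)
    traditional = points_to_inches(size, TRADITIONAL_POINTS_PER_INCH)
    print(f"{size} pt: {rounded:.4f} in (rounded), {traditional:.4f} in (traditional)")
```

The two conventions differ by less than half a percent, which is why the rounded 1/6 inch and ½ inch figures are good enough for discussion.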
Similarly, the mixture of font types (serif and sans serif) for headings and body text can have an impact on both readability and page indexing -- the ability to quickly scan for topics and text. How many web designers mix the font styles, heights, metrics, and colors, not to mention use padding and spacing, to improve the accessibility and readability of a page? Of course many of the pros do, but I suspect that has more to do with graphic designer oversight than with web page designer knowledge.
We have gained much by moving to structural copymarking and by automating much of the display technology, but I fear we have lost, or failed to pass on to the new web designers, much of what we have learned about the aesthetics and readability of text. I am not complaining about the progress we have made in computing and document processing. I love this brave new world. I am suggesting that it has made all of us, myself included, a little more lazy and complacent. In the August 2008 edition of Communications of the ACM, Donald Knuth talked about the publication of Volume 2 of The Art of Computer Programming:
One of the greatest disappointments in my whole life was the day I received in the mail the new edition of The Art of Computer Programming Volume 2, which was typeset with my fonts and which was supposed to be the crowning moment of my life, having succeeded with the TeX project. I think it was 1981, and I had the best typesetting equipment, and I had written a program for the 8-bit microprocessor inside. It had 5,000 dots-per-inch, and all the proofs coming out looked good on this machine. I went over to Addison-Wesley, who had typeset it. There was the book, and it was in the familiar beige covers. I opened the book up and I'm thinking, "Oh, this is going to be a nice moment." I had Volume 2, first edition. I had Volume 2, second edition. They were supposed to look the same. Everything I had known up to that point was that they would look the same. All the measurements seemed to agree. But a lot of distortion goes on, and our optic nerves aren't linear. All kinds of things were happening. I burned with disappointment. I really felt a hot flash, I was so upset. It had to look right, and it didn't, at that time. (CACM, August 2008, 51(8) page 33)
Professor Knuth reminds us that it is not only the content we produce, but the presentation of that content that is important. If Donald Knuth believes it is important enough to devote a decade of his productive energies to better presenting his brilliant work on algorithms, I would suggest that it behooves those of us less prolific in the production of significant new content to spend some time understanding and implementing techniques for the aesthetic presentation of those meager ideas we wish to share.
Online Education (August 21, 2008)
Online education is a topic surfacing more and more frequently in graduate professional schools at universities like the University of Pittsburgh. I find myself increasingly ambivalent about the topic and about the push to "make it so." My ambivalence comes from some history. First, while I have been programming since 1969, and have been on the technical faculty in Information Science for more than two decades, my academic preparation was in education, specifically in the area of structured curriculum design. Second, for about 15 years I served as an administrator and director of the distance education program at the University of Pittsburgh. The unit was responsible for delivering more than 150 courses per year to more than 2000 students across forty departments and three schools. Third, some of my early research was on assessing the relative quality of face to face and distance education. I served as an evaluator for the Commission on Higher Education of the Middle States Association with particular attention to non-traditional institutions. Fourth, I have experimented with a number of systems and techniques for delivering the content of my courses using various forms of technology that are not time or space bound -- a number of online lectures are mounted on my website in various forms of completion. This is all to say that at heart I am conversant with the various formats and technologies for distance and online education. Further, I would like to believe that I understand both the theory and practice of making it work. Yet I am resistant to some of the administrative mandates to "make it so".
My resistance comes at two levels. The first relates to focus and commitment. The second relates to the demands and rewards of technology. I discuss both of these points below in more detail.
While we build new dorms and classrooms at a cost of hundreds of millions of dollars without any letup, our investment in cyberspace is minimal at best. Yes, the money spent on cyberinfrastructure is increasing, but seldom do we talk in terms of a five year or ten year plan in the same way we talk about physical infrastructure. Granted, it is hard to plan far in advance given the rate of technological change, but it is possible to think about the future in terms of alternatives. I foolishly suggested to our Chancellor almost a quarter century ago that we should make an investment in technology equivalent to the investment we were making in buildings. If we begin to offer all of our education in selected areas -- graduate professional programs, to select a target -- we need a dramatically different physical infrastructure complemented by a significantly larger technical infrastructure. We also should consider that not all online education is equal. Our model should be Amazon, or Google. What I mean to say here is that Amazon is not just another bookstore, it is THE bookstore. Google is not just another search engine a.k.a. library, it is THE information source. At the risk of offending my professional colleagues, most of us are not good enough to be an educational Google or Amazon. There are faculty who are good enough, and they should be the focal point of the prototypical online courses. Again, I am reminded of some history. When I was director of External Studies, the composite rank of the faculty was the highest of any group teaching undergraduates on campus. It was because we targeted full professors as those best able to express their lectures in writing. It should be no different with online education.
I would suggest that a strategy for mounting a successful online education effort should be more than offering courses online. There are at least four first targets for online education.
I have said more than I intended in this post, but not quite as much as I feel needs to be said. You may have an inkling from what I said about world class courses that a really good online education course is not simply some video and notes online with a periodic discussion. There is a lot of technology that can be brought to bear, and while some of it is new, some of it is actually quite old.
Last night, teaching e-business, I reminded the students that e-business is not simply about the use of technology. It is more about improving the bottom line via technology. This means one of several things, but the two most frequent goals are increased sales and improved productivity. If you spend $1,000,000 to offer new online education programs and simply shift your population from the classroom to their home, you have lost -- increased cost without increased revenue. Similarly, if you install course management software that decreases faculty productivity, you are not engaged in good e-business. So, it should be the case that effective online education is better, easier, faster, more efficient for both faculty and students. It should open new markets, or DRAMATICALLY improve customer satisfaction -- leading to increased donations from alumni, etc. Seldom do I see these assessment criteria applied. Our course management system must be great because it is costing us X million dollars a year. As best I can tell, few people are asking if it is making the faculty and students happier, more productive, or more efficient.
With no effort to be exhaustive, and because I am getting sleepy -- as you may be -- here are just a couple of the dozens of ways we could make online education better than -- not just as good as -- traditional education.
Online education is in our future, but we have not yet taken the time to plan an articulate set of goals, or made the investment to build the kind of infrastructure that makes this next generation of quality educational experiences a reality. It is not sufficient to say "make it so" unless the money, incentive, infrastructure, and most importantly vision are in place.
What we still don't do in collaborative authoring (August 13, 2008)
In the post prior to this one, I tried to answer the question "What happened to the research on 'collaborative authoring.'" In that post, I made a casual claim that today's web systems are lacking in several ways. This post speaks briefly to what we haven't yet gotten a grip on as we explore wiki and blog spaces. It is a possible roadmap for development of these new web based systems.
The research on collaborative authoring (August 10, 2008)
I was recently asked what happened to the research on collaborative authoring that seemed to have died out around 2000. The question came to me because of the work I and my PhD students -- notably Bordin Sapsomboon, Wasu Chapanon and Marut Buranarach -- had done on a system called CASCADE -- Computer Augmented Support for Collaborative Authoring and Document Editing. I was reminded of a literature search this past summer on web-based group decision support systems undertaken by a beginning PhD student who was visiting for the summer. Much of the significant literature seemed to have been developed in the early to mid nineties. Since that time, there has been little new work. It is all too easy to simply blame the web. It does indeed seem that much of what students want to do today is simply build little toy systems that provide one small aspect of a solution to problems studied more seriously in the pre-web years. As with most situations, the question would appear to have a more complex answer.
What happened to research on collaborative authoring? Over the years, my own reflections have turned up several reasons why more is not being done.
Educating for the Future (July 30, 2008)
I received a note recently from a 1991 graduate of our program. Michal was a great student, and wrote a great article on anticipatory standards while he was here. He has worked in corporate positions, in a startup, as a consultant, and most recently as a government contractor. His note made reference to the things he recalled from his education that seemed bogus now, and things that seemed to look forward. In part he wrote: "I can recall creating a DOS-based hypertext program for your document processing class and on demonstrating it seeing some of your students understand for the first time what hyperlinking really meant. Now’s hypertext surrounds us!"
I also recollect those early years with DOS machines and all the fun we had writing programs to control the color registers for the monitors, directly manipulate the ports, edit the FAT tables, and so many other things. Given the work at Xerox PARC, and other places, on hypertext, it was only reasonable for us to look at the technology. In the mid eighties, Xerox gave me several 8100's with both the office and development software to build our own systems. Notecards was a sophisticated hypertext system and a thing of beauty to work with. The knowledge provided by Xerox about what was possible, coupled with the accessibility of the DOS machine, made it easy to build a simpler but nonetheless functional series of hypertext systems.
Michal's note makes me wonder whether I will get another note two decades from now reflecting on something we are doing today. I would like to think the work we are doing on the social periphery in collaboration systems, or ontology development, or aggregate annotations will have some impact. At the same time, I am at a point in my career where I grow fearful that I am losing touch with the direction and shape of the technology trajectory. There is so much happening and I find it hard to see the themes and the directions. Sometimes, as I think is the case for most old curmudgeons, it appears that we are breaking no new ground, but simply revisiting, out of ignorance, what we learned many years ago.
In responding to Michal's observations, I tried my best to think about what we should be teaching today to educate our students for the coming years. In part my response said "I have taken a good portion of this summer to work on some ideas about where we are going. Two things keep banging me on the head, and I have been trying to think about what they mean.
This summer, I spent an inordinate amount of time writing programs that analyzed RSS feeds I like to read. As a result, I now have a prototype feed reader that analyzes what I read, statistically clusters it, breaks out important words and topics, makes them into weighted anchors that "attract" incoming news articles and visualizes my information space to let me know if I want to think about something. Statistical inferences, aggregate annotations, and visualization of ad hoc information stores in real time all seem to offer endless vistas for development.
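To give a flavor of what "weighted anchors that attract incoming news articles" might look like in code, here is a minimal sketch in Python. It is not the prototype described above: the anchor topics, their weights, and the function names are all invented for illustration, and plain word counts with cosine similarity stand in for whatever statistics a real reader would use.

```python
import math
import re
from collections import Counter

def term_weights(text):
    """Lowercased word counts; a stand-in for real term weighting."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def nearest_anchor(article, anchors):
    """Return the name of the anchor that most strongly attracts the article."""
    vector = term_weights(article)
    return max(anchors, key=lambda name: cosine(vector, anchors[name]))

# Hypothetical anchors; in practice these would be learned from reading history.
anchors = {
    "typography": Counter({"font": 3, "kerning": 2, "type": 1}),
    "education": Counter({"online": 3, "course": 2, "faculty": 1}),
}

print(nearest_anchor("New font tools improve kerning", anchors))
```

A fuller version would update the anchor weights as new articles are read, which is where the clustering and visualization would come in.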
