mBsLOG

    Welcome to my weblog. It is an unconventional blog in that I am not planning to post daily or weekly, but only as topics of interest emerge. I enjoyed playing a little with my initials and the word blog and am amused by the fact that it is as much something I am slogging through as something I am blogging about. This listing only shows the five most recent posts.

    • Here is an index of all the topics with direct links to the post.
    • Here are the posts from 2007.
    • Here are the posts from 2008.

    I will try to discipline myself to keep a more or less regular set of reflections coming, but I can't promise. I have disabled commenting and discussion as it ended up being more maintenance and cleanup than I cared to deal with. That doesn't mean your comments and thoughts aren't welcome. Should you wish to comment on what I have said, I will be happy to add your comments verbatim so long as they are not spam. Simply send an email to me at Pitt -- see my home page. I will insert it in the appropriate post, with attribution if you wish. Please reference the title and date of the post on which you are commenting. Also, if you want to suggest a topic that might be covered or discussed, let me know and I will try to include it.

    Here is access to my mBsLOG as an RSS feed.


    Wed, 17 Dec 2008

    The Next Generation Web and Social Capital (December 17, 2008)

    Over the last month or so, a number of incidents have occurred that cause me to reflect on social capital and “Web 2.0”. I was surprised looking back on the BLOG that I had not addressed this issue directly in any of the posts. The need to address this topic began with an oversight board for the school. Last year, they advised the Dean that we should be doing more on social networking and Web 2.0. When I heard that was going to become a priority, I was a little perturbed. As far as I was concerned, we (and particularly I) had been working in this area for almost a decade. The fact that the board was not apprised of this work bothered me. At first, I blamed the Dean for not being aware of the work. In the last analysis, I blame myself for not talking more about it. This year, I made a presentation to the board in which I reviewed some of our work. I was pleased to hear that they were favorably impressed with our attention to the matter.

    A part of the presentation had to do with work on collaborative authoring funded by NIST in the late nineties and doctoral dissertations that resulted from that work. We built a system that was designed to speed international standards development. In many ways we succeeded, but like so many other basic initiatives of that period, our work was swamped by the tsunami known as the World Wide Web. Two of the dissertations that came out of that work were Bordin Sapsomboon’s “Shared Defect Detection: The Effects of Annotations in Asynchronous Software Inspection” and Vichita Vathanophas’s “The Use Of Peripheral Social Awareness Tools In Collaborative Systems.” Both dissertations were published in 2000. Bordin’s was very traditional and demonstrated that defect detection could be improved using social software inspection. Vichita’s was more radical. She used the extensive logs maintained by the system to provide an indication of how people felt about the project they were working on and how willing they would be to contribute. For many, her dissertation smacked of big brother. I believe that what she was doing in the late nineties was no different from what is happening today. It was simply that the data collection and use were more explicit. Both of these dissertations demonstrated well controlled studies of the impact of social networking systems.

    Shortly after I made the presentation to the Board, I was asked to speak to various groups of students about the topic. In that process, I began to use the terms aggregate annotations and social capital as important concepts behind social networking and social tagging systems. I have addressed the issue of aggregate annotations in another post on this blog. (See my Seminar on Annotation Aggregation for more information.) Someone asked me about the term “social capital” and I did a web search so as to give them a reference. I was surprised to find one of my website pages on the first page of the search results! I found it referred to a doctoral seminar I gave in January of 1997. The seminar was inspired by a talk Robert Putnam had given at the first annual conference on leveraging cyberspace in October of 1996, which was co-sponsored by XEROX PARC and NIST. I had been invited to talk about Multi-level Navigation of Document Spaces. At this inaugural, and final, conference I was mesmerized by Robert Putnam, Mark Weiser, John Seely Brown, and Paul Saffo. Truth be told, I thought every presenter at that intimate conference was spectacular. (See http://nvl.nist.gov/pub/nistpubs/jres/102/3/j23mol.pdf.) Returning from the conference, with Robert Putnam’s research and challenge clear in my mind, I wrote up the charge for the doctoral seminar. It began with the following:

    This seminar explores two questions. The first question is "what is social capital?" Assuming we can come to a consensus answer to this question along the lines that have already been suggested by Putnam and others, the second and more interesting question to be addressed in this seminar is "how might systems be designed to prevent the erosion of, or encourage the development of, social capital?" (see my Seminar on Social Capital.)
    Had I followed up intelligently on my own hunch, I might not be writing about this, but sitting on top of LinkedIn or one of the other social networking sites!

    Fast forwarding to today, we might ask similar questions. “What is Web 2.0 and where are we going?” Given all the confusion about Web 2.0, Web 3.0 and all of the technologies and applications, my personal preference is to ask what the Next Generation Web (NGW) might look like. In June of 2008, Cormode and Krishnamurthy of AT&T Labs published a wonderful article on the evolution of the Web. In my opinion it is the single most intelligent article on the topic. (See Graham Cormode and Balachander Krishnamurthy, Key differences between Web 1.0 and Web 2.0. First Monday, Volume 13 Number 6 - 2 June 2008.) The article is worth reading in its entirety several times. For purposes of this discussion, I combine several of their elegant observations as follows:

    • Web 1.0 is a place that serves content; pages are first class entities; and organization tends to be hierarchical based on structured content
    • Web 2.0 is a destination that serves interests; users are first class entities; and organization tends to be network based on users forming connections
    For me, this provides a most astute explanation of what is going on and why it is going on. Forget Atom/RSS, wikis, Ajax, and other technologies and forms. The key is that we are moving from information to interaction. I often describe the evolution of the web as moving from information to interaction to transaction to transformation. Granted, this is an evolutionary model that is most appropriate for describing the evolution of e-business sites. In that case, there is a distinction between interaction and completed transactions and yet another distinction between transactions that mimic traditional business and transformation of the business to new forms. In the case of NGW sites, it may be that ever more intimate and new forms of interaction are the end goal. Thinking about this leads one to marvel at the insight that leads to the new services that emerge daily from these social networking sites.

    Moving forward, how do we understand what is going on and, more importantly, predict where we might productively move? It may be that the call for Web Science by Tim Berners-Lee and others is the answer. Being somewhat more of a traditionalist, I like the arguments put forward by Ed Chi of PARC. (See Ed H. Chi, The Social Web: Research and Opportunities, IEEE Computer, Volume 41 Number 9, September 2008, pp. 88-91.) Chi begins with a suggestion that the social web currently consists of three kinds of activities – information foraging, sharing and tagging, and collaborative creation. It makes sense to me to think about research aimed at “developing new theories and algorithms to model, mine, and understand socially constructed knowledge structures and social information networks.” This may indeed be exactly the same goal as others would set for “Web Science.” For me, the name of the discipline is not as important as the research questions. We have enough flexibility within our current disciplines to reach out collaboratively to address the basic questions. What is most important is that we forge intelligent questions based on a grounded conceptual framework.

    Two final notes. I need to write a post for this blog on Knowledge Management and Collective Intelligence. Over the years, I have talked with disdain about these topics. Over the last couple of years, I have changed my position. It is becoming clear to me that there are occasions when it is important to make tacit knowledge explicit. Indeed, this has become for me the mantra of knowledge management. The example that I use most frequently relates to the vast store of knowledge that existed in the brains of nuclear engineers who worked for Westinghouse. With the resurgence of interest in nuclear power, it has become apparent to some that this store of knowledge has diminished as those engineers have retired and passed away. If some kind of social system to capture this information had been in place at Westinghouse over the last 50 years, it might be possible today to go back and harvest the nuggets of knowledge and resurrect a nuclear program at Westinghouse more easily than will now be possible. IBM and others have recognized this and begun to develop aggressive programs that may serve to allow for better knowledge management.

    Regarding collective intelligence, I have had a similar epiphany. It is based mainly on the work of one of my recent PhD students, Worasit Choochaiwattana, who developed a retrieval system based on the social bookmarking site delicious. The research was able to show that resources retrieved through his system were rated as slightly better than those retrieved by Google. The key here is that the set of resources used in his system was much smaller than the set used in Google – by three orders of magnitude. The implication of this finding for the size of the server farm needed as the base for the search engine is staggering. What makes this possible? There are two things. First and foremost are the rather brilliant algorithms Worasit developed. Second, and equally significant, along a very different dimension, is the filtering of the resources on which the search was conducted. As anyone who has searched recently understands, the number of “noise” resources that are returned as highly ranked is on the increase. Personally, I find little comfort in the fact that many other people have encountered the same problems I encounter. It used to be when I searched, I found people who were answering the question. Today, I find many people who are asking the same question. As it ends up, people don’t bookmark question pages much. They tend to bookmark pages with answers. It is this collective intelligence of bookmarking that delicious harvests. You may suggest that this is more common sense than intelligence, and I won’t argue. At the same time, I am coming to believe that we will find important ways to make use of this phenomenon, whatever we choose to call it. Personally, I am not opposed to calling it collective intelligence.

    In conclusion, there are rich histories of the study of important concepts such as social capital which might inform our invention of a second generation of the web. It is important to rise above the rapid evolution of the web and the technologies employed and ask simple fundamental questions about what is going on. One of the most central of these concepts is that of social capital. Others include collective intelligence and annotation aggregation. At heart, the next generation web is about people as first class entities!



    Tue, 18 Nov 2008

    Will this course be a lot of work? (November 18, 2008)

    I like to write blog entries that expose some information or contribute positively to how I see the issues in the field of Information Science. This entry tends to be something of a gripe and therefore I feel a strong need to preface, or maybe I should say justify, my remarks. I tend to be rather demanding as a teacher. At the same time, I work hard to help students. As one student put it, “he kicked us into gear so fast I picked up the material from his tough-love attitude, enthusiasm, and extremely clear and aggressive teaching style.” Obviously, I selected a quote that I think puts me in a favorable light. By way of more objective assessment, over more than a decade, course reviews have been consistently favorable – all within the top 25% of courses rated. Equally important to me, 90% of the students indicate that they learned much more in this course than in other courses they have taken. Ok, hopefully you will accept that I am not simply whining about students, but trying to express something positive about the way I would like students to approach their responsibility for learning.

    A student came into my office the other day and asked if I was busy. I wanted to say “no, I have just been waiting for someone to walk in unannounced to disrupt my concentration on the rather difficult problem I was trying to solve,” but I held my tongue and replied “what can I do for you?” The student wanted to take one of my basic courses, but hadn’t programmed in a long time, and wondered if there would be much programming. I said there would be. He asked if the course would be a lot of work. I said yes. He asked if it would be too much work for him. I said I didn’t know. I asked some questions about how well he had programmed, how much he remembered, what kind of work load he was used to, etc. This went on for about 15 minutes – with the same question being asked over and over again. Basically, he wanted me to assure him that the amount of work in the course would be reasonable by his standards. I finally had to tell him that I could not assure him the course would be easy. I told him I thought he could learn a lot if he took the course. Finally I told him that the syllabus and the notes for every lecture were online for his review. He left rather disappointed. I believe he wanted to take the course because a lot of people talk about how much they learn in this particular course – it has a good reputation, but they also complain that it is a lot of work. I suspect that this young man, like many other visitors to my office, had heard that a number of students have reported that the ability to talk about things learned in my classes helps in job interviews. He wanted the advantage of having things to talk about in job interviews – but wasn’t sure he wanted to expend the energy to acquire the knowledge. This phenomenon has led me to make a standard disclaimer in the first lecture of most of my courses. “You may have heard that some students report that talking about things they learned in my course helped them in a job interview. 
Please note that I know some of the students who have made this claim, and I can assure you that what they are saying is “what THEY LEARNED in my course helped them in the interview.” They are not saying “the COURSE HELPED them” or “what I TAUGHT HELPED them.” I offer you an opportunity to engage in deep learning of the subject matter, so that you really understand the subject and can talk about it. If you do not internalize the concepts and principles, you will leave just as ignorant as you arrived.”

    I spend between seven and fifteen hours in preparation for every three-hour lecture I give, and generally speaking, in the subjects I teach, that prep time does not decrease in a second or third offering of a course – because the landscape is changing. I have, on more than one occasion, spent in excess of two days solving a problem that, once solved, allowed me to explain it simply and completely to students in 15 minutes. I am not bragging about having simplified the concept or complaining about the time required to distill and simplify new concepts for the classroom. That is what I am paid to do. What bothers me is that many students don’t have a corresponding perspective, i.e. the amount that I learn will be correlated with the time spent in learning. I tell students at the beginning of each course that there will be 15 three-hour lectures – 45 hours. I don’t give midterms or finals, so they get a full 45 hour exposition. My expectation is that a graduate student spends three hours outside of class for every hour in class. To make the multiplication easier, I round 45 to 50 and multiply by three. They should plan on 150 hours of work outside the classroom over the term. I allocate 50 of those to reading the textbooks and other material I provide. I am a slow reader. Others are faster. I remind them that if they are very slow readers either because they don’t read much or because they are not native speakers of English, reading time may exceed 50 hours. That leaves 100 additional hours that they will need to commit to the homework, exercises, and projects. The assignments and projects that make up the course each contribute between 0 and X points to their grade. The sum of the X’s is 100. Moreover, a project that contributes 10 points to their final grade has been constructed such that it should require about 10 hours of effort. If you are really prepared, you might be able to do it in an hour or two. If you are not well prepared, it might take 15-20. 
If you need less time, it is because you are well prepared. If you need more time, it is because you lack prerequisite skills. These time discrepancies are not my concern – i.e. I will not make the course harder because some students come prepared, nor will I make it easier because some students are ill prepared. (I must admit that this is not completely true. Ten years ago, I devoted an entire lecture teaching html. Today, I assume students come to graduate school with knowledge of html.)
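    The time budget described above can be written out as a short calculation. This is only a sketch of the arithmetic in the preceding paragraph; the function and variable names are hypothetical, and the figures (15 lectures, 3 hours each, rounding 45 to 50, 50 hours of reading, 100 points in total) are the ones quoted in the text.

```python
# Sketch of the course time budget described in the text.
# All names here are illustrative; only the numbers come from the post.

LECTURES = 15
HOURS_PER_LECTURE = 3
OUTSIDE_HOURS_PER_CLASS_HOUR = 3
ROUNDED_CLASS_HOURS = 50      # 45 rounded up "to make the multiplication easier"
READING_HOURS = 50
TOTAL_POINTS = 100


def workload_budget():
    """Return the hours a student should plan for over the term."""
    class_hours = LECTURES * HOURS_PER_LECTURE                            # 45
    outside_hours = ROUNDED_CLASS_HOURS * OUTSIDE_HOURS_PER_CLASS_HOUR    # 150
    assignment_hours = outside_hours - READING_HOURS                      # 100
    return {
        "class_hours": class_hours,
        "outside_hours": outside_hours,
        "reading_hours": READING_HOURS,
        "assignment_hours": assignment_hours,
    }


def nominal_hours(points):
    """Nominal effort for an assignment worth `points` of the final grade."""
    budget = workload_budget()
    return points * budget["assignment_hours"] / TOTAL_POINTS


if __name__ == "__main__":
    print(workload_budget())
    print(nominal_hours(10))   # a 10-point project is sized for about 10 hours
```

    The nominal figure is for an average, adequately prepared student; as noted above, a well prepared student may need far less time and an ill prepared one considerably more.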

    My favorite assignment in this category is the first assignment in a course on client server systems. The assignment involves correcting a thirty-line piece of code. It is important to note that the code “works” flawlessly. That is, students are given client software, which includes the code they are reviewing. The code compiles without error, links without error, and runs without error against server code I provide. I tell the students when I give them the code that the corrections can be made with fewer than 100 keystrokes. (Actually, in an efficient editor, it takes fewer than 20 mouse and keystroke actions.) Fewer than 10 percent of the students achieve a perfect score, most get a little more than half of the points, and some get fewer than 4 of the 10 points. Some students have claimed to have spent well in excess of twenty hours on the assignment. My in-class correction of the code, along with a DETAILED explanation, takes less than 15 minutes. This assignment is a very extreme example of a prepared mind versus an unprepared mind struggling with a problem. In general, I can do one of my assignments in about half the time I calculate for students. I have no doubt that for some it takes twice as much as the average time calculated. (As a footnote, the client server assignment has logical errors that normal use never triggers. The point of the assignment is to emphasize the importance of flawless programming in an environment where hackers exploit logical errors in code that needs to run 24x7 in a hostile environment. Getting the code to compile, link, and execute is not the issue. Writing correct and secure code is the issue. Footnote to the footnote, I am always flabbergasted to learn that the current generation of IDE-based programmers seldom comprehend the difference between compiler, linker and runtime errors. 
Further, they seldom have a good mental model of the types of libraries and the different compiler and linker options – they don’t have much hope of understanding how paragraph alignment of variable storage can impact buffer overflow errors!)

    To return to the question at hand – i.e. “Will this course be a lot of work?” Wrong question, I think. A much better question would be “what will this course give me an opportunity to learn?” The answer to the first question is what most students want. I would like students interested in taking my courses to think about the cost-benefit ratio of the course. The amount of work in any course is a function of your preparation to take the course. Courses cover material from a beginning point defined by the course description and the explicit, assumed, and implicit prerequisites. Explicit prerequisites are those listed in the course description. If the course requires “Data Structures”, data structures will be used, but will not be taught. The assumed prerequisites are those that may have been stated elsewhere. For example, to be admitted, graduate students must have a structured programming language. Implicit prerequisites are things like the ability to understand English; to read a textbook, reference manual, journal article, and program; to calculate arithmetic expressions and statistical measures; to follow an algorithm or logical inference, etc. The amount of work you will do will reflect that required by a solid graduate course plus all of the work you will need to do to make up for the deficiencies in your prior education. I am sorry if that sounds harsh, but we need to draw the line somewhere. Otherwise we find ourselves in a position where advanced graduate courses need to teach basic sentence construction, addition and division, and common sense logic. That leaves little time to discuss the implications of procedural, object-oriented, and declarative programming languages or the vagaries of regular expressions and the XML Path language. 
I look forward to seeing you in my courses and hope they will provide a productive framework within which you can learn some things that will serve you well in your professional career – beginning with your first job interview.



    Wed, 05 Nov 2008

    The Science of Information (November 5, 2008)

    This post is a revision of a post on the science of information that I wrote on October 23, 2008. The earlier post, which makes up the last part of this post starting with the paragraph that begins "In 1969, Herb Simon…" was done in a rush and ended up providing a grossly inadequate response to the topic. I have added some material to overcome the inadequacy of the discussion, at least as I see it. This post now replaces the earlier post which has been removed.

    In 1979, as a part of my dissertation work, I struggled with the question of what constituted a profession. As a part of that work, I asked the question of how professions differed from disciplines, which include the sciences. That led me to a 1966 book by King and Brownell (King, A.R. and Brownell, J.A. The Curriculum and the Disciplines of Knowledge: A Theory of Curriculum Practice. New York: John Wiley and Sons, 1966.) In their treatise, they lay out and discuss the characteristics of a discipline. These include obvious characteristics -- e.g. it is a community of persons, an expression of human imagination, a tradition; derivative characteristics -- e.g. there is a specialized language, it is an "instructive community", it has a literature, etc.; and what I consider the fundamental characteristics -- it has a domain of inquiry, a mode of inquiry, and a conceptual structure. While I would bow to my colleagues in History and Philosophy of Science who I suspect have much better models, King and Brownell provide a simple framework that makes sense to me. The three characteristics I call fundamental are where I focus this discussion. (Although I must admit that it is fun to ask how the Science of Information defines a "valuative and affective stance.") Let me address the domain, conceptual structure, and method of inquiry for a science of information.

    It is pretty clear that the domain of physics is the physical universe, the domain of biology is living organisms, the domain of literature is writings, etc. At gross levels, these domains are difficult to constrain, but as we talk about astrophysics, or vertebrate biology, the domains seem to become a little more sharply defined. Sometimes, they get fuzzier -- e.g. molecular biology, or social psychology, but let us avoid that confusion and simply ask what the domain of information science might be. I am not very happy when the discussion turns to everything being information and leads information science to have a domain which includes all the other disciplines. The disciplines get more clear as they get more focused, or as we shall see below, the conceptual structure gets more clear and universal. I am also not happy when we suggest that we don't need to define or circumscribe the definition of information to have a science of it. I would suggest that lacking a reasonable definition of living organisms would make it very difficult to define what biology is all about. A related issue here has to do with outliers. While you and I would have no trouble agreeing that a vertebrate or a plant is a living organism, there are surely some fungi and other fringe entities that lack one or more of the attributes we use to define living organisms. We can argue about these, but Biology began by demarcating the 99% of the domain we agree is living organisms. (Actually, I don’t think anyone knew about the special cases until much later in time.) When it comes to information, it seems we only want to argue about the fringe of the domain and ignore the 99% that is at the core. So what is the domain of information science? I would say that it is the messages exchanged between humans that change the state of what the receiving human "knows" -- another definitional problem, but we will get to it in a future post. 
The interested reader should see my earlier posting -- September 25, 2008 on a definition of information.

    Ok, we are coming to grips with what we want to study. What is the conceptual structure we overlay on the phenomenon? In physics, we have had a number of conceptual models of the physical world, at both microscopic and macroscopic levels. Newtonian mechanics worked for a long time. Quantum mechanics takes another view, not necessarily contradictory, but in some arenas of matter, more explanatory. I will be careful not to anger my colleagues in Physics by exposing any more of my ignorance of the subtleties of the conceptual domain. Suffice it to say for my purposes here, that in the Newtonian conceptual structure we find concepts like force = mass * acceleration. This concept is not a part of the domain of inquiry, it is a part of the conceptual structure that is overlaid to explain something about the domain. So, what is the conceptual structure of the science of information? Some might suggest that it is Claude Shannon's conceptualization of information as the probability-weighted sum of the logs of the inverse probabilities of the components of the message. I would suggest this is a good start -- we have seen our Newton but are still awaiting Einstein. Moreover, while force is one small part of physics that was conceptualized, Shannon's measure of the amount of information in a communications channel is yet a smaller component of the conceptual framework that needs to be defined for a science of information. Would that I could share with you a comprehensive conceptual structure for information, or better yet a grand unified theory. Sometimes, I sense that I see something, but all too often what seemed so clear in a state of deep thought vaporizes as I work it. What I am convinced of is that information is a phenomenon worthy of our study and the domain can be demarcated. Further, if we "discipline" ourselves, we can begin to develop a conceptual structure. As in all the disciplines, that conceptual structure will evolve and face radical points of evolution over time. 
Today we are at a very primitive beginning with a few giants such as Claude Shannon and Alan Turing who have provided some first efforts at a conceptual structure.
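    For readers who want to see Shannon's measure concretely, here is a minimal sketch of the standard entropy formula, H = sum over symbols of p * log2(1/p), in bits per symbol. It is an illustration only, not something from the original post: the function name is hypothetical, and it makes the simplifying assumption that each symbol's probability can be estimated from its frequency in the message itself.

```python
import math
from collections import Counter


def shannon_entropy(message):
    """Average information per symbol, in bits.

    Assumption: symbol probabilities are estimated from the
    symbol frequencies within `message` itself.
    """
    counts = Counter(message)
    total = len(message)
    # Each symbol contributes p * log2(1/p), where p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    for text in ("aaaa", "abab", "abcd"):
        print(text, shannon_entropy(text))
```

    A message made of one repeated symbol carries no information under this measure, while a message whose symbols are all equally likely carries the most. That is the whole of what the measure captures, which is why it is only a small component of the conceptual structure discussed above.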

    And now we turn to the method of inquiry. There are two answers to this question. The first is a simple evolutionary answer. I think, again I would bow to my colleagues in History and Philosophy of Science, that most disciplines have evolved from an early period in which the primary mode of inquiry was simple observation and classification to a more evolved mode that was more formal and which enabled assessment of the validity of the conceptual model via replicable evaluation. For most sciences, this more formal method has become some variation of the scientific method. I suspect that the early stage of the science of information warrants a longer period of observation and classification to build a base of concepts that we may later be able to relate. I like Zipf's law, and Metcalfe's law, and the many others that are little more than observational science, but in academia, we are always driven to the more formal methods, and we "know" that the best are those of the old natural sciences. Herb Simon has suggested that this might not be the most appropriate methodology, and I whole-heartedly agree.

    In 1969, Herb Simon wrote “The Sciences of the Artificial” in which he discussed the differences between natural and artificial sciences. As I read the book, I am holding the second, 1981, edition, he was encouraging his colleagues to develop a new paradigm for conducting research in the “design sciences.” What is cogent in these remarks, I credit to Herb Simon. What is silly in my remarks, I take full responsibility for. Surely, this brief entry cannot do justice to the carefully reasoned arguments he posits in a little over 200 pages. Let me begin with two passages from the book:

    My dictionary defines “artificial” as “Produced by art rather than nature; not genuine or natural; affected; not pertaining to the essence of matter.” It proposes, as synonyms: affected, factitious, manufactured, pretended, sham, simulated, spurious, trumped up, unnatural. As antonyms, it lists: actual, genuine, honest, natural, real, truthful, unaffected. Our language seems to reflect man’s deep distrust of his own products. I shall not try to assess the validity of that evaluation or explore the possible psychological roots. But you will have to understand me as using “artificial” in as neutral a sense as possible, as meaning man-made as opposed to natural. (2nd edition, page 6)

    And:

    … hence we can set the boundaries for sciences of the artificial:
    1. Artificial things are synthesized (though not always or usually with full forethought) by man.
    2. Artificial things may imitate appearances in natural things while lacking, in one or many respects, the reality of the latter.
    3. Artificial things can be characterized in terms of functions, goals, and adaptation.
    4. Artificial things are often discussed, particularly when they are being designed, in terms of imperatives as well as descriptives. (2nd edition, page 8)

    In some sense, rereading his words for the fourth time in as many decades, I feel that there is little left to be said – he really has said it all. I firmly believe, as I have described in other posts, that information is an artifact of the human effort to communicate. If that is the case, information science is an artificial science, and not a natural science. Natural sciences endeavor to describe and explain the natural world around us and that natural world is a given. Artificial sciences endeavor to improve the design of the artifacts that we create. As Simon points out, talk about artificial anything and the sense is that it isn’t as good as the natural thing. Natural sweetener is obviously better than artificial sweetener. An artifact is a construct of human imagination. A science of artifacts or the artificial is a science of the things we build.

    In academia, there is strong pressure to do good research. Many times this is equated to descriptive and explanatory research focused on the natural world around us. We can’t make a pulsar something it is not. We simply try to explain it. Under that view of research, engineers would be like physicists and doctors would be like biologists. Maybe we need to rethink the paradigm of our science, focusing more on the matter of our science – artifacts – than on the paradigms of those who study nature.

    In a previous post, I talked about structured documents. These are not a product of nature, but things constructed by humans. We have the ability to define and redefine them so as to meet our needs to build systems of artifacts. For example, consider a stipulated definition that defines them as sequences of symbols. If we find that we can’t communicate what we wish to via the existing symbol set, we can change the symbol set. As another example, if we find a sequenced set of symbols does not provide adequate facility to manipulate and control the document, we can define a document as a directed acyclic graph of elements over that symbol set. This might allow us to do partial locking and structural analysis. We could take the example further and introduce attributes and metadata to the model to give us additional capability. This kind of design science is very different from the descriptive natural science that says a document is what it is and it is our goal to describe it in its natural form.
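
    As a rough illustration of the two stipulated definitions, the following Python sketch models a document first as a plain sequence of symbols and then as a graph of elements with attributes. The class name Element and its fields are hypothetical, chosen only for this example, and the graph shown is a tree, which is the simplest case of a directed acyclic graph.

```python
from dataclasses import dataclass, field

# Model 1: a document stipulated to be a sequence of symbols.
flat_document = "staffBudget is due Friday.Send questions to me."


# Model 2: a document redefined as a graph of elements over the same
# symbol set. "Element" and its fields are hypothetical names.
@dataclass
class Element:
    name: str
    text: str = ""
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    locked: bool = False

    def find(self, name):
        """Yield this element and every descendant with the given name."""
        if self.name == name:
            yield self
        for child in self.children:
            yield from child.find(name)

    def symbols(self):
        """Flatten the element back into a sequence of symbols."""
        return self.text + "".join(c.symbols() for c in self.children)


memo = Element("memo", children=[
    Element("to", "staff", attributes={"role": "recipient"}),
    Element("body", children=[
        Element("paragraph", "Budget is due Friday."),
        Element("paragraph", "Send questions to me."),
    ]),
])

# Partial locking: lock one paragraph and leave the rest editable.
first_paragraph = next(memo.find("paragraph"))
first_paragraph.locked = True

# Structural analysis: count the elements of a given kind.
paragraph_count = len(list(memo.find("paragraph")))
```

    The second model still reduces to the first (memo.symbols() returns the flat string), but it also supports operations such as locking a single paragraph that the flat string cannot express directly.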

    In these examples, we are describing a science that changes the object of study so as to better achieve the design goals set for it. If our cars don’t provide adequate crash protection, we redesign them to provide better crash protection. Similarly, if buildings don’t survive earthquakes, we redesign them so that they will. If the structures and mechanisms by which we create and share information are inadequate, we need to build new structures that allow us to achieve our goals. I am reminded of the Serpent, who in Act 1 of George Bernard Shaw’s Back to Methuselah, says to Eve: “You see things; and you say ‘Why?’ But I dream things that never were; and I say ‘Why not?’”



    Fri, 10 Oct 2008

    “Structured Documents” – Concept and Form (October 10, 2008)

    I was in a discussion with my PhD students this week and the subject of structured documents came up. I was flabbergasted by some of the thoughts that were expressed and by the lack of agreement about what was meant by a structured document, both conceptually and technically. In this posting, I would like to address the issue of structured documents. In my conclusion, I begin what will be another discussion about the appropriate level of document structuring.

    A Couple Caveats

    More frequently than I would like, students mention memex, Bush, Engelbart, NLS, hypertext, and the World Wide Web when they start to talk about structured documents. While hypertext documents are interesting, and do have a structure associated with them, they have very little to do with “structured documents” as I understand and talk about them. Using a document style or theme in Word does not result in a structured document – it results in a document that is styled not structured. A form can be a structured document, but forms per se are often more like records than structured documents. The World Wide Web is not the origin of structured documents, but structured documents do become more important in the context of the World Wide Web. Technically, an html document is a structured document, but it is a little like saying three kids arguing in a school yard are a legislative body.

    Let’s begin at a very simple level. We might make an argument that any string is a structured document. Here I use the term string technically – it is a sequence of characters that may include control characters such as tabs and newlines. In this case, each character has a position in the string. We can say some things about the number of words in the string, the number of lines, the number of sentences and paragraphs – which gets somewhat complicated, etc. There are many possibilities at this low level, and while a string is a structure, it does not define what we mean by a structured document. Ok, let’s now move back through the history of the written word and take a look at structured documents in two eras, the non-digital era and the digital era.

    Structured Documents in the Pre-Digital Era

    We might imagine a document such as the Iliad or one of the books of the Bible. We have a story from the oral tradition that is written down as remembered. It is a stream of characters, or words. It tells a story. It has a beginning and an end. Surely it is logically structured. It may or may not have a title. It may or may not have the signature or the name of an author. It may or may not have parts. If the story is well told, one suspects that it is conceptually well structured. What makes a “good” story is in large part the quality of the flow in the story. Given that it is a transcription of what originated as an oral presentation, it may have little literary structure. As documents, these manuscripts, and many manuscripts produced before mass production printing, have little formal structure.

    Fast forward to a modern text book. It has a title page. The title page has the title of the book, the author or authors, the publisher, and the city or cities in which the publisher has offices. The title may consist of a title and subtitle, but the title is singular. There is one publisher. There may be one or many authors. The title page is followed by a “cataloging page”, which my colleagues tell me is simply known as the “verso of the Title Page”, that contains among other things disclaimers, publisher address information, ISBN number, Library of Congress cataloging information, copyright, etc. These pages may be preceded by advertising pages, and are followed by dedication pages, forewords, table of contents, acknowledgements, and then the book proper. Normally, the book is made up of a series of chapters, but it may consist of “parts” which have chapters and the chapters may have sections and subsections. Within these structures, there can be paragraphs, figures, tables, examples, etc.

    A modern textbook is a highly structured document. Over time, books have taken on a more structured form. A text book tends to be more highly structured than a trade book. A scholarly journal tends to be more highly structured than a magazine. A business letter tends to be more highly structured than a personal letter. An academic CV tends to be more highly structured than a resume. In the real world, documents have differing levels of structure that are appropriate to the purpose of the document. The source and force of that structure varies greatly. The source may be regulatory, contractual, or consensual. The form of proposals for government funding is a matter of regulation. The provision of information to be included in a textbook by publisher Y is contractual. The information contained in a course syllabus, at least at my institution, is more a matter of general consensus. In all of these cases, no judgment is made about the appropriateness or sensibility of this structuring.

    Structuring of Digital Documents

    The history of the application of computer technology to document processing is long and complex. For the purpose of this discussion, I will divide it into four eras. The first era is the digital typesetting era. This era tends to be associated with procedural copy marking. In conventional publishing, layout editors knew how they would “structure” a textbook. This structure was reflected by the graphical layout of the “elements”. Layout editors learned that the title page was to be a recto page – generally the first page of the book. The index went at the back of the book and generally used two columns and a type size smaller than the body type in the book. Computer scientists worked to develop languages to instruct computerized typesetters to change fonts, margins, horizontal and vertical alignment, spacing, etc. Just as a layout editor would place layout copy marks in the manuscript for the typesetter, the user of early formatting software placed procedural commands in the text file. The high point in this era may well have been the development of TeX by Donald Knuth.

    The second era is the heterogeneous device era. There was a period of time when line printers, laser printers, CRT screens, dot matrix printers, typesetters, and robot typewriters all coexisted. During this period, all the procedural languages – Script, Runoff, TeX, and a slew of PC-based languages such as WordStar and PeachText – were evolving into macro languages such as GML, XICS, and LaTeX. This era also saw the emergence of structural copymarking. In some ways, the difference between macros and structural copymarking is marginal. In other ways, it is very significant. Some people credit Charles Goldfarb with the “discovery” of structural copymarking and structured documents. I tend to credit Brian Reid, who developed Scribe. Here’s the deal. A macro says that @title is associated with a set of procedural copymarks. It is easier to remember than all of the details, and it adds some standardization. A structural copymark says that @title is a copymark that can appear in a unit of the document called a @titlepage, and not elsewhere. Brian Reid developed Scribe to allow users to output their work to multiple devices. Thus, he created macros for each of the devices with common names. Then, he decided to go a step further. He developed what he called make files that contained information about the components in about a dozen types of documents – letter, slide, report, article, manual, etc. It was also possible, although very difficult, to specify new types of documents. Here’s the important thing. Generally speaking, a macro was designed to aggregate procedural copymarks and execute them all at once. So, sometext @macro othertext would result in output where othertext came after sometext but was in a different style. In Scribe, you would say @chapter(SomeText) and that would cause SomeText to start on a new page, be some particular font, AND be saved to create a table of contents for the document.
Similarly text@footnote[moretext] and text would cause “text and text” to be output with a superscripted number after the first text and “more text” to be output at the bottom of the page with a matching superscript. Reid was in part responding to the need to deal with heterogeneous devices and was trying to make his “descriptive markup” easier to use. While GML and SGML get a lot of the credit – appropriately, in terms of the standardization and generalization of the concept – Reid’s Scribe provided the earliest functional effort to develop structured documents that were much more than simple macro languages.

    The third era is the WYSIWYG era. While the WYSIWYG era brought on by the development of the Alto and the STAR at the Xerox Palo Alto Research Center did a great many things to make our life better, the bitmapped screen and the laser printer caused some problems as well. Because these were “all points addressable” devices, there was no need to deal with the different idiosyncrasies of different devices. Whatever you could put on the screen could be printed. In this era, styles technology was dominant. It was now possible to select a different font and type size for each word on the screen. Sweep out some selection of text, and make it the same as some other selection of text. It became possible to do infinitely complex procedural copy marking without ever knowing one of the commands. It was – is – a struggle to get people to use styles consistently. It is just too easy to do anything we damn well please. The power of descriptive or structural copy marking was lost on those who now could do anything they wanted.

    The fourth era is the WWW era. As Berners-Lee moved forward the concept of a universal repository for information, he developed a mechanism for identifying resources – the URL, a mechanism for transporting requests and responses – http, and a mechanism for representing resources – html. Berners-Lee was familiar with SGML because technical papers at CERN were formatted using an SGML Document Type Definition (DTD). He decided that he could write a DTD that could be used to represent documents in his system. It is not clear whether he actually intended to develop a universal document type or just the first and simplest of what would be many document types. It is clear that his document type was minimalist – a valid html document need only include a title element in the head, but most browsers were happy with less than that – i.e. nothing. Further, while it may not have been his intention, users used his elements as macros rather than descriptive or structural markup. Tables were used to establish formats and block quotes served to indent whole documents. The abuses were many. It quickly became clear that more was needed if we were to be able to unambiguously identify the author, publisher, publication date and structure of these resources. Thus, we began the process of revising SGML to occupy a smaller, more applicable footprint to solve some of the semantic issues related to resources on the web. Many, many coordinated pieces would be required to make this work.

    The Current State of Structured Documents

    XML and the family of XML standards – xslt, xpath, xsl-fo, schema, schema datatype, xlink, xforms, xquery, etc. – provide the standards, specifications, and technologies to create structured documents of varying levels. As most readers will know, this is done by specifying a schema. Schemas are hard to understand, and even the XML Schema Compact Syntax (XSCS) (Wilde, Erik, and Stillhard, Kilian. A Compact XML Schema Syntax. In Proceedings of XML Europe 2003. London. May 2003. (http://dret.net/netdret/docs/wilde-xmleurope2003.html)) can be daunting. Without endeavoring to specify a new syntax, let me suggest a simpler way to imagine a document type. Using simplified notations based on regular expressions, Backus Naur form, and the original SGML DTD syntax, let me specify the following rules:

    • The first line of the definition specifies the name of the document
    • The second and successive lines use the BNF form to define ever finer components of document. What precedes the ::= is the object being defined, what follows is the definition.
    • The left side of each line specifies the element(s) being defined.
    • The right side of each line provides the model for the element(s)
    • The element model will consist of elements or primitives
    • While primitives could be of many types and be extensible, we will restrict ourselves here to three primitives {string}, {number}, and {date}
    • To be complete, every element must ultimately be defined in terms of primitives
    • The element model uses parentheses to group the components
    • The components of the model must be connected using one of three connectors – ‘,’ indicating a sequence, ‘|’ indicating a choice, or ‘&’ indicating that the members of the set of components so connected may be in any order.
    • Each component of the model must be modified by two numbers in brackets, where [1,1] means required exactly once and [n,m] means at least n times and as many as m times. The second number, m, may be unbounded, in which case a * is used.

    Thus a simple definition might be given as follows:

    Document type ::= Memo
    Memo ::= headings[1,1] , body[1,1] , addenda[0,1]
    headings ::= to[1,*] , from[1,1] , (date[1,1] & subject[1,1])
    body ::= paragraph[1,*]
    addenda ::= cc[0,*] & enc[0,1]
    to, from, subject, paragraph, cc, enc ::= {string}
    date ::= {date}

    In this example we define a memo as having headings (required and only once), body (required and only once), and addenda (optional but not more than once) sections in that order. The headings component contains one or more to components followed by one from component followed by one date and one subject, where the last two can be in any order. Apart from date, which is a {date}, all of these are defined as strings without further subcomponents. Even in such a simple example, there are many design decisions and complexities such as:

    • Granularity of component modeling: I could have specified that to or from was a person or an organization. We then could have specified further component structure – such as a person who has a lastname, firstname, and optional title.
    • Semantics of components: I chose to use semantics based on common terms used in memos. I might have chosen to use author instead of from and recipients instead of to. Keep in mind that we could still generate a memo using “To:” and “From:” followed by the strings in the recipient and author components.
    • Structuring of the canonical form: I used the general form of the components as we would see them in a typical memo. XSLT allows us to have presentation forms that differ from the canonical structure. Thus, I might have constructed my memo with two components – metadata and content. The cc and enc components could have been in the metadata and the body in the content. For presentation purposes, parts of the metadata could be extracted and placed after the content.
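
    To make the notation concrete, here is a minimal Python sketch that translates the Memo element models above into regular expressions over child element names and checks a list of children against them. The helper names (elem, seq, any_order, valid_children) are hypothetical. The sketch checks only the order and count of child elements, not the {string} and {date} primitives, and it assumes that ‘&’ means the connected components may appear as whole groups in any order.

```python
import re
from itertools import permutations


def elem(name, low, high):
    """Regex for name[low,high]; high=None stands for '*' (unbounded)."""
    upper = "" if high is None else str(high)
    return f"(?:{re.escape(name)} ){{{low},{upper}}}"


def seq(*parts):
    """The ',' connector: components in the order given."""
    return "".join(parts)


def any_order(*parts):
    """The '&' connector: components in any order (an assumed reading)."""
    return "(?:" + "|".join("".join(p) for p in permutations(parts)) + ")"


# Element models taken from the Memo definition above.
MODELS = {
    "memo": seq(elem("headings", 1, 1), elem("body", 1, 1),
                elem("addenda", 0, 1)),
    "headings": seq(elem("to", 1, None), elem("from", 1, 1),
                    any_order(elem("date", 1, 1), elem("subject", 1, 1))),
    "body": elem("paragraph", 1, None),
    "addenda": any_order(elem("cc", 0, None), elem("enc", 0, 1)),
}


def valid_children(element, child_names):
    """True if the child element names satisfy the element's model."""
    tokens = "".join(name + " " for name in child_names)
    return re.fullmatch(MODELS[element], tokens) is not None


print(valid_children("headings", ["to", "to", "from", "subject", "date"]))
print(valid_children("headings", ["from", "to", "date", "subject"]))
```

    The first call satisfies the headings model because date and subject may appear in either order; the second does not because from precedes to. A finer-grained model, such as a person with a lastname and firstname, would simply add more entries to MODELS.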

    Further, these design decisions – as with most things in our document world – cannot be made in the abstract, but must reflect the consensus of the users of these types of documents lest they be ignored. Consider for example a structured document definition of a syllabus. Make it too structured and faculty will ignore it. Make it such that it constrains nobody and likely it will have minimal useful structure. I can imagine a thousand useful things that might be done if syllabi for all college courses used some standard format that contained a reasonable level of required detailed content. I can also imagine that the faculty in one small department of information science would not be able to come to an agreement as to what it should contain. Imagine getting all of the faculty in all of the departments of all of the colleges and universities in the United States to agree!

    Appropriate Structuring

    So, what is appropriate structuring? I begin with a simple classification of document types. Documents can be personal, group, organizational, enterprise (cross organizational), and archival. Documents may migrate from one category to another. At the personal document end of the continuum, the demand for enforced structure is minimal. I write a note to myself in any form I care to. My diary can be kept as I please. On the other hand, documents that need to be exchanged between organizations A, B, and Z need more structure. Consider as one example student transcripts. These documents need to contain certain information and we know what that is. I would contend that given the potential of structured documents, we should over structure so long as we can do it without increasing the burden of authorship. If we can structure our personal diaries such that they can serve as archival documents, we will have wasted our time with my diary, but the benefit of having the diary of Colin Powell in a structured form might prove invaluable.

    All of this has ignored the telling of a good story. In 1956, I began the process of learning how to tell a structured story in writing. This process went on for 12 years of daily instruction through high school. It involved learning to diagram sentences, outline topics, write good paragraphs, etc. It continued more seriously, but in formal ways less regularly through four years of college. Training then continued on the job – my first boss was a master writer of more than two dozen books. It also involved an intense period of mentored training as I wrote my dissertation. I believe that after 23 years, I had actually learned enough to consider myself a reasonably good writer of a structured story – my dissertation. Another ten years, including multiple articles, proposals, books etc. brought me to a point where I consider myself roughly competent. Before we can learn to write good structured documents using XML and the tools that will emerge, we will need some significant training, beginning in grade school, about how to make the best use of these capabilities. For those of us now in our latter years, it is unlikely, even if we are good writers conceptually, that we will feel and embrace the potential of structured documents – just let me do it my own damn way.



    Thu, 25 Sep 2008

    So What is Information (September 25, 2008)

    I have been a faculty member in the School of Information Sciences -- formerly the Graduate School of Library and Information Science -- for more than two decades now. When I arrived, the name of the department was the Interdisciplinary Department of Information Science. In the early 1990's it was changed to the Department of Information Science and Telecommunications. Recently, as part of a school wide restructuring, it became the Graduate Information Science and Technology Program. The constant in all the changes and evolution would seem to be Information Science -- or the Science of Information. I will address the “science” of information in a future piece focusing on King and Brownell's work on communities of discourse and Simon's thoughts on sciences of the artificial. For now, I would like to focus on information.

    Information is in the “I” of the beholder

    Let's begin simply. My birth date is May 27th. That is probably information for you. It would not be information for my mother were she still alive. You may not value the information, but I suspect you would agree that it was something you did not know. Somewhere, the "fact" that May 27th is my birthday is recorded, but that is not information in and of itself -- we refer to it as a record. It might or might not be information to you. Thus we have some first conversational concepts that we can use to define information. Let's begin with the fact that the measure of information is relative to the recipient. What is information to one human is not information to another. Is it possible to "inform" a building, or a computer, or an automobile? I am not quite sure how to answer that. On the one hand, I am prepared to state that what we refer to as information is tightly bound to the human experience. Some of my colleagues would argue that ant scent trails constitute information, and that computers produce information displays -- sometimes regardless of whether they are used by humans. I am prepared to engage these arguments and be convinced that information is a concept that has a scope beyond the human experience, but for now, I will ask you to accept a temporary stipulation that sans humans there is no information. The reason for asking for this stipulated limit is a desire to be able to easily build a more complete definition. If we expand the scope too quickly to all these arguable extensions, some of the points I wish to make will be much more difficult. So for this argument, information is a function of the human experience. If a tree falls in the wood and there is no human, it may make a sound, but we do not have any information about the fact that it made a sound, and we do not know that it fell, nor why it fell.

    We may now take another step. We are stipulating that the receiver is restricted to entities we call humans. Further I propose that the measure of information is dependent upon the receiver. We need to somehow qualify what it is that makes something information to one human but not another. We know what it is informally. Let's see if we can make it a little more formal. If I already know some fact, receipt of the same fact is not information. Messages contain information when the message is about something not already known to the receiver. At the risk of moving too fast, what I know, the knowledge I have, acts as the mediator of whether a message contains information.

    This leaves us with several concepts that are of use in furthering our inquiry. The first is the notion of a store of information which we will call knowledge. The second is the notion of messages which are delivered to us. The third is the notion of the contents of the message measured along a dimension we call information. A message may contain no information, a little information, or a lot of information. We may have a lot of knowledge, a little knowledge, or no knowledge. What we know may be partitioned into domains. I know a lot about Pittsburgh, a little about Bangkok and New York, and next to nothing about Nairobi. We could go on for a while here, but let me simply add one more caveat at this level. Information may be true or false. I may be misinformed. I may receive a lot of bad information. Pretty neat. Ok, there is a lot more at this level, like the value of information, but we will leave those discussions for now and turn to the messages, and then to disembodiment of information.

    Messages

    We suggested above that "messages may carry information". What is a message? Again, I am going to suggest that a message is something received by a human. Unlike information, it would be hard to argue that only humans process messages. Here I would have to agree that computers process messages and that the scent trails laid down by ants constitute messages to other ants. So I fully agree that a message is a very general concept and surely not restricted to the human experience. At the same time, I will stipulate that for now, I am only speaking about messages that are received by humans. I would suggest that there are two broad classes of messages that are processed by humans. The first is messages that consist of raw data processed from the environment. The sound of screeching brakes, the sound of water running, the smell of smoke, a sunset, the warmth of the summer sun. All of these experiences, whether direct or indirect (e.g. a picture of a sunset, or a recording of a gun shot), may contain information. We don't always define these experiences as "messages" and that is fair, but I would argue that these sensations or experiences should be defined as low grade messages. There is much to be discussed here about how these patterns of signals move from signal to data (pattern) to information -- it is raining outside. It has a lot to do with the knowledge we bring to bear on the signals. Those signals that make no sense -- that form no pattern -- are commonly referred to as noise. One of the goals of natural science is to turn noise into information. So at this level of messaging, we begin to introduce the concept of noise as meaningless (information-less) patterns. We can come back to this concept and mine it further, but for this discussion, I want to turn from low grade messages, to high grade messages.

    If we agree that we can grudgingly refer to sensory experiences as messages that can carry information, what is it that we really want to think of as a message? I think it is pretty easy to agree that a message involves a sender and a receiver and that it is pretty easy to imagine one human sending a message to another human. Again, I acknowledge that “message” is a very general concept, but here I am talking about exchanges between humans. An email message constitutes a first class message. It might be considered a "prototype" for messages, in the same way that psychologists suggest that a robin is the "prototype" for bird. This does not mean that a penguin or ostrich is not a bird, they are just not as prototypical as a robin. Let's expand our discussion about human to human messages. If my sister kicks me under the table during a conversation over Thanksgiving dinner, she sends me a message -- probably about the foot I am about to stick in my mouth. So, a hug, a kiss, a punch, can all be messages exchanged between humans. And for millions of years, that was how humans exchanged messages. With the development of spoken language about forty thousand years ago, our ability to exchange messages greatly increased. Four to five thousand years ago with the advent of written language, our ability to exchange messages was increased again. Spoken language allowed for same-time-same-place messaging. The technology of written language allowed for messaging across time and space. We did not have to be at Gettysburg on the afternoon of Thursday November 19, 1863 to get the message from the president of the United States -- "that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion."
So, information may be contained in messages which vary from the "low grade" messages that contain raw sensory data that must be interpreted to the "high grade" messages that consist of sets of symbols that are constructed explicitly to facilitate the communication of information between humans. Indeed, I would suggest that we could profitably restrict the study of information to these high grade messages consisting of symbols. Once we understand information in this form, we can extend our exploration to all the other forms -- including the scent trails of ants. In this discussion, I turn to one final topic in my trail -- and that is not the scent trail, but coding and the beginning of formalizing a definition for information.

    Coding information

    For several thousand years, we have been using language to exchange information in encoded form. To a large extent this is what I was referring to when I talked about the disembodiment of information. This has been a great boon to the advancement of civilization. We have not only encoded the individual pieces -- I was tempted to say bits, but that would be premature -- but we have organized these pieces and begun the process of assessing the validity of the information. This process is deserving of study in its own right. We can say that there was a lot of information in a given book. We can identify new information. We can label information public or private. We can store, transmit and access ever greater amounts of information in various forms of messages. Can we measure information? I don't think we yet have adequate ways to do this, but there have been some interesting developments.

    One of the most interesting came from Claude Shannon and Warren Weaver. It has been overplayed in some circles, but it is interesting for what it is. At the core, they hypothesized that one measure of information could be the probability of a given piece of information. If the probability of a given message is unity -- there is no information. If the probability of a message is 50% there is some information in the message. If the probability of a message is 25% there is more information in the message, etc. Working for Bell Labs, Shannon was interested in how much bandwidth was needed to communicate a message, or how much space was needed to store a message. With the advent of digital computers using binary units to store a message, it became useful to think about how many binary digits would be required to store a message. Shannon suggested (a simplistic explanation) that one measure of information was "I=log 1/probability of the message". We can thus say that if information is to be stored as an array of binary digits and the probability of the message is .5, we would need log2 1/.5 bits to store the message. The base 2 log of 1/.5 (i.e. 2) is 1. When the probability of a message is 50/50, I can record the message as a 0 or 1 in one binary digit. If the probability of a given message is .00390625 I would need log2 1/.00390625 or 8 bits. In this case, my magic number of .0039 is 1/256. What Shannon is saying is that if I wish to be able to have a message that can be any one of 256 different equally likely symbols, I would need 8 bits to represent it. A rich analysis of information is possible based on a measure of information as the probability of a message. It opens the doors to computation, transmission, encryption, compression, and correction of messages stored in digital form. It provides a simple yet rich mathematical theory that allows us to do all sorts of things, and it is completely consistent with the discussion of information being put forward here.
While Shannon and Weaver's definition is direct, elegant and powerful, I believe it lacks some of the richer notions I have tried to put forward here. It is not that their definition is wrong. Rather we require complementary definitions that allow us examine other aspects of the phenomenon.
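    The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplified measure as described here, log base 2 of 1 over the probability of the message; the function name self_information_bits is made up for illustration and is not Shannon's notation.

```python
import math


def self_information_bits(probability: float) -> float:
    """Return log2(1/p): the bits needed for a message of probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be greater than 0 and at most 1")
    return math.log2(1.0 / probability)


# The four cases discussed in the text: certainty, 50%, 25%, and 1 in 256.
for p in (1.0, 0.5, 0.25, 1 / 256):
    print(f"probability {p} -> {self_information_bits(p)} bits")
```

    A certain message carries 0 bits, a 50/50 message 1 bit, and one message out of 256 equally likely ones 8 bits, which matches the figures in the paragraph.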

    Conclusion

    I have attempted here to argue for a definition of information that allows us to meaningfully partition the space and study the phenomenon. The core of my argument is based on information as a part of the human experience. Without humans, we can't talk about information. With humans, we can talk about information as a measure of the degree to which a message transforms the state of awareness, the knowledge structure of the receiver. We can partition messages into at least two groups -- those received via direct observation of natural phenomena and those received via some form of symbolic communication from another human. I would argue that while a comprehensive study of information is desirable, it may be more productive to begin with the analysis of symbolic messages between humans. The theory and concepts of information developed from models in that simplified context might later be extended more broadly.

    So this is how I would begin the definition of information.

    • Information is a concept tied to the human experience.
    • A message is a set of signals that move from a source to a receiver. Here, we are concerned with messages where the receiver is a human.
    • When a message moves from a human source to a human recipient using a symbolic form, we define the message as a first class message.
    • Information is a function which is a measure of the extent to which a message changes the knowledge state of a recipient of the message.
    • A message in a symbolic form may be defined as disembodied information. Further, one measure of the information in a message that has been proposed is the log of 1 divided by the probability of the message.
    • While information is defined here as a measure that is relative to an individual human, there is also a desire to provide a quantification that is independent of an individual human.
    • We have only touched on knowledge -- what an individual knows, but it should be equally clear that symbol sets allow us to "record knowledge" and thus there must be ways to quantify what I know and what humanity knows.
    • There are evident properties of knowledge that require clearer definition, such as its validity, its partitionability, its divisibility, etc.
    • Similarly, there are evident properties of information that must be clearly defined, such as its value, its clarity, its validity, etc.

    [/2008/9] permanent link


    Wed, 10 Sep 2008

    Aesthetics (September 10, 2008)

    I was lecturing last night on e-business web site design. In the first hour I talked about making a business case for a web project, covering business goals, process reengineering, return on investment, etc. I told the students that their “project proposal” was not about building cute websites, but about building cost effective systems that advanced business goals and provided a strong return on investment. Later in the class I turned my attention to various techniques and architectures for styling HTML and XML. The lecture covered the structure and scope of Cascading Style Sheets (CSS) 1 and 2 and a comparison with XSL/FO (the XML StyleSheet Language/Formatting Objects). In the process of the lecture, there were a number of times that I digressed into a presentation on font metrics and design issues that were more about aesthetics than ebusiness productivity. I think the lapses in focus may have been due to several discussions I have had with colleagues about how courses on interface design have migrated to courses on web design. I won't digress much here from the topic of aesthetics, but suffice it to say that when I built the course on interface design years before the web, the focus was on a variety of principles and techniques for building quality interfaces -- iPhone type interfaces that wrap around humans. For the most part, the web is the worst place to teach these principles and most of what people do on the web makes quality interface designers shudder. But back to font metrics, web design, and aesthetics.

    Back in the early eighties I was heavily involved in the design of formatting software for the early laser printers, work shaped by the challenges of achieving high quality graphic effects using low resolution laser printers. We thought a lot about type design, kern pairs, hyphenation algorithms, automatic page layout, white space allocation, etc. Last night, the lecture on style sheet design caused me to digress to aesthetics as I touched on several of these points. Let me give just a couple of examples. When certain proportionally spaced characters are juxtaposed, the result is aesthetically poor. As I recollect, the three most frequently occurring combinations are Yo We Ta. Compared to, say, ll or TT, the apparent space between Y and o in Yo is too great and o needs to be “kerned” or moved back toward the Y to make the spacing look right. Thus, You looks better with kerning. I don’t think I have ever heard a concern in web design about the effect of kern pairs.

    The next digression came in talking about font size – as measured in points. This led to a discussion of ems, picas, and x-height. The measurement of font size derives from the size of the block of metal type on which the character was placed. In the image, the box surrounding the Y represents the type slug. If that box is 12 points high, the font is a 12 point font – a point is about 1/72.27th of an inch. Rounding, a 12 point font is 1/6th of an inch and a 36 point font is ½ of an inch. Beyond that trivia, good type design assigns different x-heights (the height of those components of a font that fill the area of the small letter x). For example, the “bowl” of the letter b in the image is determined by the x-height of the font. Open Word, type “goodbye my darling”, highlight it, select format font, set the size at 12 pts, and scroll through thirty or forty type faces. Watch the relative size of the o's and ask yourself how it impacts readability. There will be a lot going on, but generally, relatively larger x-heights tend to make fonts more readable.
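    For anyone who wants to check the rounding, here is a small Python sketch of the conversion. It assumes the traditional printer's point of 1/72.27 inch mentioned above; the constant and function names are made up for illustration.

```python
POINTS_PER_INCH = 72.27  # traditional printer's point, the figure cited in the post


def points_to_inches(points: float) -> float:
    """Convert a type size in points to inches."""
    return points / POINTS_PER_INCH


# The two sizes mentioned in the text.
for size in (12, 36):
    print(f"{size} pt is about {points_to_inches(size):.3f} in")
```

    The results land just under 1/6 and 1/2 of an inch, which is why the rounded figures are the ones people remember.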

    Similarly, the mixture of font types (serif and sans serif) for headings and body text can have an impact on both readability and page indexing – the ability to quickly scan for topics and text. How many web designers mix the font styles, heights, metrics, and colors, not to mention use padding and spacing, to impact the accessibility and readability of a page? Of course many of the pros do, but I suspect that has more to do with graphic designer oversight than with web page designer knowledge.

    We have gained much by moving to structural copymarking and by automating much of the display technology, but I fear we have lost, or failed to pass on to the new web designers, much of what we have learned about the aesthetics and readability of text. I am not complaining about the progress we have made in computing and document processing. I love this brave new world. I am suggesting that it has made all of us, myself included, a little more lazy and complacent. In the August 2008 edition of Communications of the ACM, Donald Knuth talked about the publication of Volume 2 of The Art of Computer Programming:

    One of the greatest disappointments in my whole life was the day I received in the mail the new edition of The Art of Computer Programming Volume 2, which was typeset with my fonts and which was supposed to be the crowning moment of my life, having succeeded with the TeX project. I think it was 1981, and I had the best typesetting equipment, and I had written a program for the 8-bit microprocessor inside. It had 5,000 dots-per-inch, and all the proofs coming out looked good on this machine. I went over to Addison-Wesley, who had typeset it. There was the book, and it was in the familiar beige covers. I opened the book up and I'm thinking, "Oh, this is going to be a nice moment." I had Volume 2, first edition. I had Volume 2, second edition. They were supposed to look the same. Everything I had known up to that point was that they would look the same. All the measurements seemed to agree. But a lot of distortion goes on, and our optic nerves aren't linear. All kinds of things were happening. I burned with disappointment. I really felt a hot flash, I was so upset. It had to look right, and it didn't, at that time. (CACM, August 2008, 51(8) page 33)

    Professor Knuth reminds us that it is not only the content we produce, but the presentation of that content that is important. If Donald Knuth believes it is important enough to devote a decade of his productive energies to better presenting his brilliant work on algorithms, I would suggest that it behooves those of us less prolific in the production of significant new content to spend some time to understand and implement techniques for aesthetic presentation of those meager ideas we wish to share.

    [/2008/9] permanent link


    Thu, 21 Aug 2008

    Online Education (August 21, 2008)

    Online education is a topic surfacing more and more frequently in graduate professional schools at universities like the University of Pittsburgh. I find myself increasingly ambivalent about the topic and about the push to "make it so." My ambivalence comes from some history. First, while I have been programming since 1969, and have been on the technical faculty in Information Science for more than two decades, my academic preparation was in education, specifically in the area of structured curriculum design. Second, for about 15 years I served as an administrator and director of the distance education program at the University of Pittsburgh. The unit was responsible for delivering more than 150 courses per year to more than 2000 students across forty departments and three schools. Third, some of my early research was on assessing the relative quality of face to face and distance education. I served as an evaluator for the Commission on Higher Education of the Middle States Association with particular attention to non-traditional institutions. Fourth, I have experimented with a number of systems and techniques for delivering the content of my courses using various forms of technology that are not time or space bound -- a number of online lectures are mounted on my website in various forms of completion. This is all to say that at heart I am conversant with the various formats and technologies for distance and online education. Further, I would like to believe that I understand both the theory and practice of making it work. Yet I am resistant to some of the administrative mandates to "make it so".

    My resistance comes at two levels. The first relates to focus and commitment. The second relates to the demands and rewards of technology. I discuss both of these points in more detail below.

    Focus and Commitment

    While we build new dorms and classrooms at a cost of hundreds of millions of dollars without any letup, our investment in cyberspace is minimal at best. Yes, the money spent on cyberinfrastructure is increasing, but seldom do we talk in terms of a five year or ten year plan in the same way we talk about physical infrastructure. Granted, it is hard to plan far in advance given the rate of technological change, but it is possible to think about the future in terms of alternatives. I foolishly suggested to our Chancellor almost a quarter century ago that we should make an investment in technology equivalent to the investment we were making in buildings. If we begin to offer all of our education in selected areas -- graduate professional programs, to select a target -- we need a dramatically different physical infrastructure complemented by a significantly larger technical infrastructure. We also should consider that not all online education is equal. Our model should be Amazon, or Google. What I mean to say here is that Amazon is not just another bookstore, it is THE bookstore. Google is not just another search engine, a.k.a. library, it is THE information source. At the risk of offending my professional colleagues, most of us are not good enough to be an educational Google or Amazon. There are faculty who are good enough, and they should be the focal point of the prototypical online courses. Again, I am reminded of some history. When I was director of External Studies, the composite rank of the faculty was the highest teaching undergraduates anywhere on campus. It was because we targeted full professors as those best able to express their lectures in writing. It should be no different with online education.

    I would suggest that a strategy for mounting a successful online education effort should be more than simply offering courses online. There are at least four first targets for online education.

    • World Class Courses offered on the internet should be designed to capture the entire market. The goal should be to be the singular brand for that service. The money should be invested with the aim of capturing the world. The content, the presentation, the services, the experience should all be first rate and designed to replace all equivalent courses offered by any other institution. This is what I mean by the Google/Amazon model. The fear that some have that institutions will lose students to online education is a valid one if the online classes are first rate. (I am not worried yet based on what I have seen.) Wouldn't it be exciting if the offerings of institutions were cut a hundredfold while the students in each of those offerings were increased a hundredfold? Each institution would offer its world class signature courses and students would have the benefit of a combined education across institutions that was unparalleled by any offering, in any form, from a single institution. Imagine what the dynamics would be if you could have a class taught by the best faculty and serviced by the best PhD students with small group discussions among the thousands of enrolled students going on 24 hours a day! I get so excited about what such a course might be like that I can hardly contain myself, but this is not the vision I hear being articulated and surely I don't hear plans to allocate enough money and resources to do it. More than 20 years ago, I worked on a project supported by the Annenberg Foundation to offer a national telecourse that was aired on PBS. It was called Planet Earth, and it was endorsed by the National Academy of Sciences. It was the first telecourse where the course materials were developed by an AAU institution. Our research goal, which we accomplished, was to be able to produce a custom textbook, which was integral, for each institution offering the course. We were able to produce camera ready copy of each ~600 page textbook in about 2 hours. It required an hour of mainframe computer time and about an hour of printing time. Today, this is not trivial, but it could be much more easily accomplished on a standard PC. The cost for the authoring, automation, and execution of the 300 individual textbooks was about $200,000 -- or roughly $600 per master copy. (The cost of producing the video was another $1.5 million.) Today, for my first world class online course, I would set the goal of having a personalized set of materials for each student enrolled, and I would work toward an adaptive system that was continually adjusting to the particular needs and learning difficulties of the enrolled students. This is my vision of a world class course offering online. I would guess you could do it for well under a million dollars, and by my calculation if you attracted 1000 students at $1000/enrollment for the best class on X in the world, you would be at break-even. I seldom hear administrators talking about this kind of vision.
    • Nuggets would be those course offerings that can be mined from the knowledge already well formed in faculty. Nuggets, I have long held, abound in institutions of higher education. They are easy to imagine, and I believe almost as easy to find. At a theoretical level, imagine that everyone who has taught and done research for 10 or more years finds themselves at a cocktail party where for some reason they are motivated to explain what they know best to one of the guests who shows a genuine interest. They wax eloquent for about a half hour and make clear something the guest could never have understood by reading for days. The faculty member has thought about it long and hard, tried to explain it to children, undergrads, and PhD students. They know it cold and they know how to explain why it is so exciting. I believe that there is, on average, one nugget per senior faculty member at a large research university. Granted, some faculty will be barren, but there will be others who have four or five. I would guess that at the University of Pittsburgh, there are about 3000 nuggets that could be mined, and at 30 minutes apiece that is 1500 hours of stimulating and provocative content. I would further be willing to bet that it would be relatively cheap to mine, and that at least 150 hours could be combined to form some new degree for people from 50-70 who want to know a little about all the aspects of our world from first rate minds that can explain it to an educated person. Even if you couldn't make a new degree program, think of the value of such a collection for public and alumni relations. When I suggest nugget production and mining to my colleagues, their eyes glaze over. They don't know how they would sell a new degree program, or how to do alumni relations, or why it is important to let the public know the exciting parts of what we are doing. (BTW, at my website, in my online lectures, I mined what I hope are a couple of my own nuggets. My best effort is 26 minutes and 34 seconds on the last twenty years of my research -- "The Document Processing Revolution".) Nuggets are the low hanging fruit of online education.
    • Building Blocks are those course components that are worth building for reuse. (The argument might be somewhat reminiscent of the move to consolidate statistics courses years ago.) These would be topics covered in more than one course that others would use because they can't do it better, and maybe because it is not what they do best. Some nominees might be "how to properly cite references in a paper", "the assessment of statistical measures used in a research paper", "how to make notes on a book", "measures of central tendency and deviation." My personal take on one such topic is far less general, but it is of growing interest to my colleagues. I have been working on SGML and XML for more than two decades. As XML grows in popularity and its uses increase, I am asked more and more frequently to deliver a lecture or share my lecture notes. I suspect that from XML, to RFID, to TEI, to NMR, to ... there are topics that we would love to have others present for us as building blocks. I am not sure what the economic model for this is, but I don't think it is hard to develop one.
    • Content Focused Instruction is my name for that instruction that is good in any form because it is the content that is critical. I argued back in the late 70's, while developing statewide continuing education courses for physicians delivered late at night or early in the morning by the Pennsylvania Public Television Network, that we should be working toward a "television of abundance". (By the way, the TV series was called Physician Update, and it was a joint product of Pitt, Penn State, and Temple and it carried continuing education credit for physicians.) My argument then, and to some extent today, was that as we move from a few broadcast channels to hundreds of cablecast and now internet channels, content quality will trump production quality and we would see program selection guided more by the content than the production value. This is a lot of what is being done today, but I must admit that when I was talking about low fidelity in the 1970s, I couldn’t have imagined just how embarrassing some of what is being produced today would be. If content is to trump production quality, there had better be high quality content, not just some mindless drivel.

    Use of Technology

    I have said more than I intended in this post, but not quite as much as I feel needs to be said. You may have an inkling from what I said about world class courses that a really good online education course is not simply some video and notes online with a periodic discussion. There is a lot of technology that can be brought to bear, and while some of it is new, some of it is actually quite old.

    Last night, teaching e-business, I reminded the students that e-business is not simply about the use of technology. It is more about improving the bottom line via technology. This means one of several things, but the two most frequent goals are increased sales and improved productivity. If you spend $1,000,000 to offer new online education programs and simply shift your population from the classroom to their home, you have lost -- increased cost without increased revenue. Similarly, if you install course management software that decreases faculty productivity, you are not engaged in good e-business. So, it should be the case that effective online education is better, easier, faster, more efficient for both faculty and students. It should open new markets, or DRAMATICALLY improve customer satisfaction -- leading to increased donations from alumni, etc. Seldom do I see these assessment criteria applied. Our course management system must be great because it is costing us X million dollars a year. As best I can tell, few people are asking if it is making the faculty and students happier, more productive, or more efficient.

    With no effort to be exhaustive, and because I am getting sleepy -- as you may be -- here are just a couple of the dozens of ways we could make online education better than -- not just as good as -- traditional education.

    • Consider for example a multiple choice test. In class, you give the exam, score it, give it back, and discuss it in class. If you are doing it online, the test can be different for each student, students can be given immediate feedback, branching can allow a check to see if the question may have been confusing or whether the student might really understand the concept. Immediately after the test is administered, review material can be suggested based on the analysis of the answers. Wouldn't that be something?
    • Consider questions addressed to the instructor. Imagine they are computer mediated. Imagine question context, question, and answer are stored in a database. Further imagine that the next time a linguistically similar question is asked in the same context, the system asks the student if the previous answer helps. Think about just three of the implications. As an instructor, my effort is leveraged: the more I work with the system, the more I am freed from having to answer the same question personally. From the student's point of view, it may be the case that after the course has been offered a couple of times, my question will be answered not a day after I ask it but in a second! Finally, meta analysis of the data after a period of time might suggest revisions to the material!
    • Develop social awareness. About ten years ago, I participated in an online conference in which the participants were represented as a set of small squares on the left side of the screen -- there were about 200 of them. As the conference began the open squares turned white as people logged in. During the presentation, they stayed white, or turned blue, or turned red. Blue meant I am with you but bored, move faster; red meant I am lost and I need more info. I think yellow meant something as well. People could type questions at any point in time and they were filtered by staff and passed on to the instructor in real time. We all had feedback at many levels about the presentation.
    • For now, my final observation is about authoring. When I discovered I could take a standard PowerPoint slide set, voice narrate it, and turn it into html, I immediately did that. When I discovered it only worked in Internet Explorer, I stopped. Doing a reasonable quality online course should not be an order of magnitude harder than doing regular teaching. A little harder is fine, and just as easy is better yet. The payoff of doing an online course should be at least as good as the payoff from doing a regular lecture and potentially better (e.g. the development of a question/answer system that saves me time). We do not yet have the specialized authoring tools that make online education that easy. They are coming, but universities need to make significant investments.
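    The question and answer idea in the list above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any existing system: the class name AnswerMemory, the context labels, and the 0.7 threshold are all hypothetical, and "linguistically similar" is approximated crudely by character-level matching with the standard library's difflib rather than by real language analysis.

```python
from difflib import SequenceMatcher


class AnswerMemory:
    """Stores (context, question, answer) triples and suggests past answers."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold  # hypothetical cutoff for "similar enough"
        self.entries = []           # list of (context, question, answer)

    def record(self, context: str, question: str, answer: str) -> None:
        """Remember how the instructor answered a question in a given context."""
        self.entries.append((context, question.lower(), answer))

    def suggest(self, context: str, question: str):
        """Return the best stored answer from the same context, or None."""
        best_answer, best_score = None, 0.0
        for stored_context, stored_question, answer in self.entries:
            if stored_context != context:
                continue  # only reuse answers given in the same context
            score = SequenceMatcher(None, stored_question, question.lower()).ratio()
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None


memory = AnswerMemory()
memory.record("week 3: style sheets", "What is a kern pair?",
              "Two letters whose spacing is adjusted so they look evenly set.")
print(memory.suggest("week 3: style sheets", "what is a kern pair"))
```

    When suggest returns None the question would go to the instructor as usual, and the new answer would be recorded so that the next student asking something similar gets it in a second rather than a day.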

    Online education is in our future, but we have not yet taken the time to plan an articulate set of goals, or made the investment to build the kind of infrastructure that makes this next generation of quality educational experiences a reality. It is not sufficient to say "make it so" unless the money, incentive, infrastructure, and most importantly vision are in place.

    [/2008/8] permanent link


    Wed, 13 Aug 2008

    What we still don't do in collaborative authoring (August 13, 2008)

    In the post prior to this one, I tried to answer the question "What happened to the research on 'collaborative authoring.'" In that post, I made a casual claim that today's web systems are lacking in several ways. This post speaks briefly to what we haven't yet gotten a grip on as we explore wiki and blog spaces. It is a possible roadmap for development of these new web based systems.

    1. Document locking. Ever try to have people really work on a document together? It is damn near impossible. We struggle to define schemas that really model complex documents. Most people like html because it has no structure. On the other hand, good xml documents are really rich, complex trees. They have a predictable structure. This allows branch pruning and grafting, which in turn allows for fine grained and coarse grained locks. We still do not see intuitive and easy to use document locking models.
    2. Document access controls. When we built CASCADE, we used five access levels -- executive, authoring, editing, commenting, and reading. Subsequent work suggests that it may be appropriate to have as many as seven. Given these new models, it is relatively easy to use the existing work on role and time based access control to begin to build an easy to understand and use access control system.
    3. User and group awareness. Increasingly, systems are tailored to individual needs. It is the only way of dealing with information overload. What has happened since the last time I looked at the system, and what demands my attention? Tell me only what I want/need to know and hide the rest. Similarly, whether I am a leader or follower, I need to be aware of what my teammates are doing in some meaningful and simple to interpret way.
    4. Wow tools. There are any number of tools that can be built once a base framework is in place. One of my favorites from a decade ago was what we called the comment report. Basically, every comment made in CASCADE was classified along up to four dimensions. I frequently used the dimensions of target audience, status, and type. So, a given comment might be an objection, which was open, and targeted at the editor. The comment report allowed you to select any number of pieces of the document, any or all classes of all the dimensions, and then have the system build a summary or detailed report. So, I could ask for all objections that were open and targeted at the editor. The system would produce a list of the 3 or 300 comments in a second and build a report that acted as an ad hoc hypertext document that would with a click take me to that portion of a vast document where the comment was located. Similarly, the data structures allowed me to access information about what a group, individual, set of groups, set of individuals were doing in terms of a large enumerated set of action types, across the project as a whole or any subset of files or folders. Again, the results were an active hypertext report. There were dozens of these tools that reduced hours of drudgery to seconds. But they were all dependent upon the infrastructure.
    5. Enhanced Communication. The term deixis refers to aspects of a communication whose interpretation depends on knowledge of the context in which the communication occurs. So for example, a commenting system that places the comment in context allows a comment like "what's this". It is easy to type, with the meaning based on context. When one looks at wikis that allow comments only on the page as a whole or big sections, deixis is much more difficult. Wouldn't it be nice to comment on a word, a sentence, a person in an image, a small fragment of a video, etc.? These add complications in coding and nightmares related to editing, but they are all theoretically possible. Of course context is potentially far more complicated. Who am I, who is the communication with, what is the nature of the hat I am wearing, etc. all impact what the communication means. Our auxiliary communication tools are all relatively primitive and isolated. Imagine systems that switch from voice to text to images as needed by the context. Imagine that people receive information in a form appropriate to their preferences.
    6. Lost in space. Perhaps one of the most frustrating parts of blogs and wikis for me is the lack of a visual navigation structure that allows me a high level overview of the structure. I am not pushing CASCADE, but it had a feature I really liked. It began with the login. I was presented with a list of all my projects and a summary of the activity in each project since I last visited. The summary was a number that reflected the number of distinct atomic activities since my last visit -- examples of atomic activities included comments made, comments answered, comments reclassified, documents added, documents edited, documents deleted, etc. There were about 40 of them. For each project I would get a single number which aggregated them all -- and keep in mind, one of the wow tools allowed me to see a list of those of interest to me that was an active hypertext structure. Next, I always entered the project at the root, and could always get back to the root. (Never too lost in space.) At the root, one would normally find a set of folders and a few documents. Folders had light type on dark backgrounds. Documents had dark type on light backgrounds. Dark blue folders were like those you know. Dark brown folders were ordered -- i.e. you could add a folder or document without specifying the order. The system allowed for other folder types to be defined. Images were generally light blue, GIFs had red type, TIFs had blue type, etc. Text was light yellow, XML used blue type, ASCII used black, etc. You get the idea. Finally, there was a thin red line across the bottom of the icon that indicated the number of documents in a folder or the number of comments in the document. It was amazing how, with a little practice and orientation, this system of visual navigation greatly reduced the feeling of being lost in hyperspace.
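    To give a feel for how little machinery a tool like the comment report in item 4 needs once the infrastructure is in place, here is a minimal Python sketch. It is not CASCADE's code or data model: the Comment fields, the location strings, and the function name comment_report are hypothetical, chosen only to mirror the three dimensions mentioned above (target audience, status, and type).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    location: str  # hypothetical pointer to the fragment being commented on
    audience: str  # who the comment is targeted at
    status: str    # e.g. "open" or "closed"
    kind: str      # e.g. "objection" or "suggestion"
    text: str


def comment_report(comments, **criteria):
    """Return the comments whose fields match every requested value."""
    return [
        c for c in comments
        if all(getattr(c, name) == value for name, value in criteria.items())
    ]


comments = [
    Comment("standard/section-2/para-4", "editor", "open", "objection", "Scope is unclear."),
    Comment("standard/section-2/para-9", "editor", "closed", "objection", "Term undefined."),
    Comment("standard/annex-a/figure-1", "author", "open", "suggestion", "Add a legend."),
]

# All objections that are open and targeted at the editor.
report = comment_report(comments, kind="objection", status="open", audience="editor")
for c in report:
    print(f"{c.location}: {c.text}")
```

    In a real system each line of the report would be a link back into the document, which is what made the report an ad hoc hypertext rather than just a list.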

    [/2008/8] permanent link


    Sun, 10 Aug 2008

    The research on collaborative authoring (August 10, 2008)

    I was recently asked what happened to the research on collaborative authoring that seemed to have died out around 2000. The question came to me because of the work I and my PhD students -- notably Bordin Sapsomboon, Wasu Chapanon and Marut Buranarach -- had done on a system called CASCADE -- Computer Augmented Support for Collaborative Authoring and Document Editing. I was reminded of a literature search this past summer on web-based group decision support systems undertaken by a beginning PhD student who was visiting for the summer. Much of the significant literature seemed to have been developed in the early to mid nineties. Since that time, there has been little new work. It is all too easy to simply blame the web. It does indeed seem that much of what students want to do today is simply build little toy systems that provide one small aspect of a solution to problems studied more seriously in the pre-web years. As with most situations, the question would appear to have a more complex answer.

    What happened to research on collaborative authoring? Over the years personal reflections have included several reasons why more is not being done.

    1. Maybe we were fooling ourselves. The research that many of us were doing in the mid to late nineties looked to "augment" the "collaborative authoring" process and make it better. My own system, CASCADE, was designed to be both a functional system and a testbed for research on the complex phenomenon. At the heart of our work was the assumption that there are people who do, or want to, write collaboratively. Our prototype documents were national and international standards. When all was said and done, despite committees of hundreds, the writing was really still done by only one or two people with lots of comments from others. Indeed, the same seems to hold true in the academic world. While there may be many authors on the proposal or the paper, the vast majority of the work is done by one person. So, observation one is that "collaborative authoring" may be a pipe dream. Generally speaking, individuals write and, depending on their situation, make use of comments and feedback from other people. Word with commenting and change tracking facilities may be enough.
    2. The demands of group authoring. In order to accomplish many of the goals of collaborative authoring, such as fine grain locks and style consistency, we needed editors that were somewhat more structured than people were used to -- e.g. we were using XML as the basis of our locking system. User feedback made clear that any editor was great so long as it was exactly like Word, or whatever editor the user was familiar with. At the time, and to a lesser extent still today, Word couldn't do what we needed to do, so we built our own more structured editors. It was an uphill struggle to get end users to accept a more structured approach. Beyond this, even if we could have mimicked Word, we still would have imposed more structure. Corporate documents are not like personal documents, but people don't like to hear that. (As a plug for my own work, I tell my students that the best way to classify documents is as personal, group, corporate, enterprise, or archival. Each is increasingly more restricted and standards bound. It may not be important how you write a personal love note, but documents that flow across enterprises have to be understood by all who encounter them. The ultimate demands for structure occur when a document -- a birth certificate or an academic record -- has to survive time and space for uses that may not yet be defined -- a requirement for archival documents.) Our reality suggested more structure was needed. The user demand was to be able to do whatever they wanted.
    3. Microsoft does what people want. Word got a lot better at supporting informal collaboration. The emergence and refinement of the track changes and SharePoint linkages for Word are a real boon. As the de facto standard, Word, got better, the demand for specialized software diminished. Even if Word is still less than what we were trying to do, it was close enough.
    4. We all fell in love with the Web. I was actively engaged in collaborative authoring research when the web appeared. I was well aware of Berners-Lee's vision of the WWW as an interactive system. When WebDAV extensions came along, we moved a little closer to authoring as well as dissemination, but it clearly was not enough. Given a few more years we got blogs and wikis. While a wiki does only a tiny fraction of what we wanted to do, for the class of people who like a little collaboration with minimal structure, it was more than enough. So the Web took the low end of the market out. In addition, the people who would be the early adopters -- the young people -- think wikis are great and see no reason for real collaborative tools. With time, the growing array of content management systems will fill much of the gap. It still turns my stomach when I ask PhD students to assess the capabilities of the best wikis. They seldom if ever come up with the dozen or so things they don't do that we were working on in the late 90's. For today's generation, concentrated analysis of what would make for great collaborative systems is weak.
    5. The rise of the firewalls. Related to the web, our software was connection oriented client server software. We had state information and knew who was working on what and when. This required a concurrent connection oriented client server system. The protocol ran on a port of its own -- i.e. it did not piggy back on a well known protocol. This was common for client server systems design in the pre web era. As our project moved forward, so did corporate firewalls, and between 1996 and 2000 they were being locked down tightly to prevent the growing cadre of hackers trying to penetrate corporate resources. Nothing could get out from individual desktops unless it was directed to port 80. (At one point I wrote a tunneling proxy program that maintained state, but worked through idempotent HTTP connections to satisfy the network Nazis.) The situation has improved a little since then, but not much. The rich array of distributed software using sophisticated client server protocols was sent into oblivion as firewalls emerged and demanded that anything that was to be done would have to be done through the web protocol -- a stateless request/response protocol.
    6. Research on the social periphery. Some of the most exciting work we were doing was related to adaptation and the social periphery. That is, in face-to-face collaboration you begin to work as a seamless team and you get to read your teammates' eyes and perspiration levels -- you can sense who is with you and who is against you. We were working to infer that social periphery based on micro actions and make it visible to team members. As you might guess, trying to make explicit what the best team leaders have been doing for years was not exactly something people endorsed. Too much like big brother. This was the most exciting research, and the most controversial.
    7. The mistaken belief that the goal is efficiency. My work was based on the fact that it takes about 10 years and 10 million dollars to produce a national or international standard in IT. We were going to reduce that to 5 years and 5 million dollars, and I am still convinced we could. The big savings came in supporting synchronous and asynchronous collaboration on the document via the computer rather than traveling to cities like Paris, London, Tokyo, San Francisco, etc., three times a year to meet and iron out details. We also eliminated all the formal paper ballots and mailings. Well, guess what. The senior engineers really didn't want to be that efficient. They were at a point in their career where regular travel to global cities for meetings with old friends -- look at the rosters of standards committees -- was a boon, not an obstacle. Further, the SDO secretariats at the time were made up of older engineers and the notion of substituting computer based systems for the paper systems that justified their large staff and corporate existence was not quite what they had in mind. ANSI, ISO, and the ITU also had regulations that were paper based in actual specification. Much of this has now changed with the relentless pressure to adopt technology into work processes. While IT standardization is the world I am most familiar with, if you look at the FDA requirements for new drug applications, or FAA requirements for plane documentation, a lot of them during that period were still dependent on paper trails for auditing.
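
    The firewall problem in point 5 comes down to keeping "who is working on what" alive across a stateless request/response protocol. One common way to do that is sketched below: the server hands out a session token and every later request carries it. This is a generic illustration and not the tunneling proxy mentioned above; the SessionStore class, the request fields, and the action names are all hypothetical, and plain dictionaries stand in for HTTP requests so the sketch runs without a network.

```python
import uuid


class SessionStore:
    """Server-side state looked up by a token sent with every request."""

    def __init__(self):
        self._sessions = {}  # token -> {"user": str, "locks": set of section ids}

    def handle(self, request):
        """Handle one self-contained request dict and return a response dict."""
        action = request.get("action")
        if action == "open":
            # No connection is kept open; the token is the only link to the state.
            token = uuid.uuid4().hex
            self._sessions[token] = {"user": request["user"], "locks": set()}
            return {"ok": True, "token": token}

        session = self._sessions.get(request.get("token"))
        if session is None:
            return {"ok": False, "error": "unknown session"}

        if action == "lock":
            session["locks"].add(request["section"])
            return {"ok": True, "locks": sorted(session["locks"])}
        if action == "status":
            return {"ok": True, "user": session["user"],
                    "locks": sorted(session["locks"])}
        return {"ok": False, "error": "unknown action"}


store = SessionStore()
token = store.handle({"action": "open", "user": "alice"})["token"]
store.handle({"action": "lock", "token": token, "section": "3.2"})
status = store.handle({"action": "status", "token": token})
print(status)
```

    Each call to handle is independent, which is what lets this style of exchange pass through a port 80 only firewall; the cost is that the server no longer learns when a client has simply gone away, which a connection oriented protocol gets for free.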

    [/2008/8] permanent link


    Wed, 30 Jul 2008

    Educating for the Future (July 30, 2008)

    I received a note recently from a 1991 graduate of our program. Michal was a great student, and wrote a great article on anticipatory standards while he was here. He has worked in corporate positions, in a startup, as a consultant, and most recently as a government contractor. His note made reference to the things he recalled from his education that seemed bogus now, and things that seemed to look forward. In part he wrote: "I can recall creating a DOS-based hypertext program for your document processing class and on demonstrating it seeing some of your students understand for the first time what hyperlinking really meant. Now hypertext surrounds us!"

    I also recollect those early years with DOS machines and all the fun we had with writing programs to: control the color registers for the monitors, directly manipulate the ports, edit the FAT tables and so many other things. Given the work at Xerox PARC, and other places, on hypertext, it was only reasonable for us to look at the technology. In the mid eighties, Xerox gave me several 8100's with both the office and development software to build our own systems. NoteCards was a sophisticated hypertext system and a thing of beauty to work with. The knowledge provided by Xerox about what was possible coupled with the accessibility of the DOS machine made it easy to build a simpler but nonetheless functional series of hypertext systems.

    Michal's note makes me wonder whether I will get another note two decades from now reflecting on something we are doing today. I would like to think the work we are doing on the social periphery in collaboration systems, or ontology development, or aggregate annotations will have some impact. At the same time, I am at a point in my career where I grow fearful that I am losing touch with the direction and shape of the technology trajectory. There is so much happening and I find it hard to see the themes and the directions. Sometimes, as I think is the case for most old curmudgeons, it appears that we are breaking no new ground, but simply revisiting, out of ignorance, what we learned many years ago.

    In responding to Michal's observations, I tried my best to think about what we should be teaching today to educate our students for the coming years. In part my response said "I have taken a good portion of this summer to work on some ideas about where we are going. Two things keep banging me on the head, and I have been trying to think about what they mean.

    • About four years ago, I started to digitize all of my personal analog history. This past year I completed the digitization of every video, slide, negative, and audio tape I had collected. My long dead father now speaks of his childhood on a CD. All of the video tape of my children dating back to 1986 is now on a DVD. The 30 minute super 8mm movie I made in 1968 is now also a DVD! My sons, one of whom just graduated college and one of whom will be a junior, will each inherit about a terabyte of digital data which will be my best effort to record their lives as well as my meager accomplishments. So, knock on the head number one is how does the world change if we can develop a stable, mine-able, complete, coherent digital life history. This will probably be possible, if not commonplace, for my children's children. It is the basis of Gordon Bell's MyLifeBits project. The opportunities and challenges of these kinds of repositories are many and varied -- and perfect for exploratory projects. Maybe the next hypertext project should be organization and use of life repositories.
    • The second knock on the head has been some work on "aggregate annotation information" or at least that was the title of the last doctoral seminar I did on the topic. Here's the skinny. Google is the best search engine because it uses PageRank -- named after Larry Page. Page said the most important pages will be those pointed to by the most pages that are important pages. This algorithm is quite rich and more complicated than this description, but it describes the essence of the theory. One of my doctoral students -- now graduated -- used the social bookmarking system delicious to do some searching. Using bookmarked pages from delicious, he got search results that were slightly better than Google. They weren't significantly better statistically, but the key really is that they were not significantly worse -- which I would argue they should have been. What is amazing is that by "reducing" the web to only those pages that were bookmarked, we eliminated 99.9% of the pages. Delicious has only one page for every 1000 in Google. Thus with a server farm 1/1000 the size of Google we were able to produce results as good as Google's. Why? Because we don't bookmark junk! Now all we need to do is collect the bookmarks of all the smart people in the world, and we will have a great filter -- aggregate annotation information.
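
    The idea that important pages are those pointed to by other important pages can be sketched as a simple iteration. This is the simplified textbook form, not Google's production algorithm; the four-page link graph, the damping value of 0.85, and the function name pagerank are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate page importance from a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of incoming links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its importance everywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a, b, and d all point to c; c points back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

    In this toy graph, page a ends up ranked well above b and d even though all three have no more than one incoming link, because a's single link comes from the most important page. Restricting the input to bookmarked pages, as in the delicious experiment, would amount to running the same kind of ranking over a far smaller links dictionary.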

    This summer, I spent an inordinate amount of time writing programs that analyzed RSS feeds I like to read. As a result, I now have a prototype feed reader that analyzes what I read, statistically clusters it, breaks out important words and topics, makes them into weighted anchors that "attract" incoming news articles and visualizes my information space to let me know if I want to think about something. Statistical inferences, aggregate annotations, and visualization of ad hoc information stores in real time all seem to offer endless vistas for development.
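
    One way to picture weighted anchors "attracting" incoming articles is as a similarity score between each anchor's weighted terms and the words in a new article. The sketch below assumes cosine similarity and hand-picked anchor weights; it is not the prototype reader itself, and the anchor names, weights, and helper functions are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical anchors: topic name -> weighted terms. In a real reader these
# weights would come from statistical analysis of previously read articles.
ANCHORS = {
    "collaboration": {"wiki": 2.0, "authoring": 3.0, "collaborative": 3.0},
    "search": {"search": 3.0, "bookmarks": 2.0, "ranking": 2.0},
}


def term_counts(text):
    """Count lowercase alphabetic words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(weights, counts):
    """Cosine similarity between an anchor's term weights and an article's counts."""
    dot = sum(weight * counts.get(term, 0) for term, weight in weights.items())
    norm_a = math.sqrt(sum(weight * weight for weight in weights.values()))
    norm_b = math.sqrt(sum(count * count for count in counts.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def strongest_anchor(article, anchors):
    """Return the anchor that pulls hardest on the article, plus all scores."""
    counts = term_counts(article)
    scores = {name: cosine(weights, counts) for name, weights in anchors.items()}
    return max(scores, key=scores.get), scores


headline = "New study compares search ranking built from shared bookmarks"
best, scores = strongest_anchor(headline, ANCHORS)
print(best, scores)
```

    The scores could also drive a visualization directly, for example by placing each incoming article closer to the anchors that score it highest.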

    [/2008/7] permanent link



     
     
