Contribution to `Post Digital Library Futures'
Gio Wiederhold, Emeritus Professor, Stanford University
26 May 2003
The problems addressed in this note focus on reducing the
human overhead in obtaining information from Digital Library
and general web resources, while retaining the valuable contents.
It is intended to deal with a problem recognized many years
ago by Herb Simon:
"What information consumes is rather obvious; it consumes
the attention of its recipients. Hence a wealth of information
creates a poverty of attention, and a need to allocate that
attention efficiently among the overabundance of information
sources that might consume it."
However, allocation is not a
simple task. Getting the right stuff to the right person has
many aspects, and means supporting a variety of technologies,
and understanding their benefits, costs and interactions. The
total of available attention in the world may well be less
than the total available information. We talk about billions
of on-line webpages, and a hidden web that is yet larger. And
yet, because so much potentially valuable information is lacking,
many initiatives are funded to put more on the web. A crucial
task is hence the reduction of available information to actionable
information, i.e., the specific information that will cause
a change in behavior, a reduction in further work, or the making
of a decision.
Many technologies to filter information have been investigated
in the past, and we list some of those, rapidly moving to harder
and more speculative tasks. Several are in routine use.
- Ranking by document contents. Associated with ranking
is the assumption that the consumer will only consider a
few documents at the top of the list; a minimal
term-weighting sketch follows this list.
- Ranking by authority. Giving preference to documents
published at a site that is valued in a context; for
scholarly work that would be a journal versus a workshop
report, for many other sources it would be a recent
document. Ranking by reference authority -- Google's
PageRank algorithm -- extracts communal knowledge as
evidenced by the references given; a sketch of that
iteration also follows the list.
- Elimination of redundancy. If similar documents are
retrieved, present only one of them, say the latest one,
or one selected by another suitable criterion; see the
near-duplicate sketch after this list.
- Differences among documents: obtaining what is
different between a known document and a new one. The
task may be as simple as looking for additional material
in a new version (a line-level sketch follows the list),
or as hard as requiring a deep analysis of both documents
and a comparison on a higher level of abstraction.
- Determining the novelty of a new document with respect
to a given document collection can be seen as a
generalization of the high-level-of-abstraction approach
when comparing two documents.
- Determining novelty with respect to an individual. If
all the knowledge held by an individual could be
captured, then one could truly find out what material
that individual would find useful. Since there would be
too much, domain emphasis is needed, and the unsolved
(unsolvable?) problem of `common knowledge' should be
avoided.
- Abstraction of textual documents to retain essentials.
There has been work on selecting sentences that appear to
represent the contents (a frequency-based sketch follows
the list); better abstractions can be gained for
domain-specific texts, such as pathology reports. An
interesting task here would be automatic annotation of
gene sequences from the relevant literature.
- Abstraction of the contents of document collections is
an obvious generalization. That task will require
integration, and also semantic matching if the sources
used are autonomous.
- A complementary source here is data mining. I'd keep
data mining as such out of the scope of digital library
initiatives, but linking data-mining results with
information from textual sources would strengthen the
users' results.
- Reduction of textual information into a visual
presentation is a further step. It would require the
competence of doing abstraction and the ability to place
the result into some model that has a temporal or spatial
aspect: progress notes for a patient, the description of
an exploratory journey, or the progress of a scientific
project.
- Moving yet to a greater level of difficulty is
populating an analytic model with such information. Here
again domain specialization will be needed to achieve
success. Having an analytic model will allow
manipulation, not only to discern novelty, but also to
serve as a representation of normal behavior, if the
domain can be well characterized. Domains that may lend
themselves are corporate finances, from 10-K and similar
documents. Harder would be domains such as ecological
processes and global change. A challenge would be
metabolic models, supporting an understanding of food,
drug, and environmental effects on organisms.
- Having populated models will allow support of two
further challenges. The first one is support for
predictive capabilities. Current information
technologies, data mining, and digital libraries are seen
as supporting decision-making, but fall short of
providing the needed infrastructure. The decision-maker
will copy the resulting information into a spreadsheet,
and then add formulas to make extrapolations into the
future for various scenarios of investment and their
probabilities. It should be obvious that information
systems should not terminate their support at the past,
but should be able to extrapolate forward as well (a
small extrapolation sketch follows the list). And the
outcomes of alternative futures should be readily
comparable.
- The second, yet harder challenge is the discovery of
abnormal situations, as when we try to use information
systems to look for terrorists. Data mining discovers
common patterns, i.e., reasonably frequent relationships
among data and, by abduction, the processes that generate
them; that serves marketing folk, but not intelligence
tasks. One can locate unusual or abnormal behavior with
respect to a specific model, perhaps one based on recent
incidents, as the linking of flight-school enrollments
and behavior there. Finding unknown abnormal linkages
requires populating a large model with normal findings,
since abnormality can only be identified if normality can
be quantified (a toy sketch follows the list).
Unfortunately, such models will be large, since observed
data, say travel patterns, aggregate activities in many
domains: here business, holidays, family visits, and
emergencies.
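
To make the content-ranking item concrete, here is a minimal
sketch of TF-IDF term weighting in Python; the tokenizer, toy
corpus, and scoring details are illustrative assumptions, not
anything this note prescribes.

    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def rank_by_contents(query, docs):
        # Rank documents by the TF-IDF weight of the query terms
        # they contain; highest-scoring documents come first.
        n = len(docs)
        tokenized = [tokenize(d) for d in docs]
        df = Counter(t for toks in tokenized for t in set(toks))
        def score(toks):
            tf = Counter(toks)
            return sum((tf[t] / len(toks)) * math.log(n / df[t])
                       for t in set(tokenize(query)) if t in tf)
        return sorted(range(n), key=lambda i: score(tokenized[i]),
                      reverse=True)

    docs = ["digital libraries and attention", "attention is scarce",
            "libraries hold documents"]
    print(rank_by_contents("attention libraries", docs))  # [0, 1, 2]

The consumer is assumed to inspect only the first few indices
returned, per the assumption named in the list.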
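
For ranking by reference authority, a bare power-iteration
sketch in the spirit of PageRank; the damping factor 0.85 is
the commonly published default, and the small citation graph
is invented.

    def pagerank(links, damping=0.85, iters=50):
        # links: dict mapping each node to the nodes it references.
        nodes = list(links)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for n, outs in links.items():
                targets = outs or nodes   # dangling nodes spread evenly
                for m in targets:
                    new[m] += damping * rank[n] / len(targets)
            rank = new
        return rank

    citations = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": ["c"]}
    print(sorted(pagerank(citations).items(), key=lambda kv: -kv[1]))

Documents referenced by well-referenced documents float to the
top, which is the communal-knowledge effect the item describes.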
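
For the elimination of redundancy, a sketch that keeps one
representative per group of near-duplicates, using cosine
similarity over word counts; the 0.9 threshold is an arbitrary
illustrative choice.

    import math
    from collections import Counter

    def cosine(a, b):
        ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(ca[t] * cb[t] for t in ca)
        na = math.sqrt(sum(v * v for v in ca.values()))
        nb = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def drop_near_duplicates(docs, threshold=0.9):
        # Keep the first of any group of similar documents; feed the
        # list in latest-first order to retain the latest version.
        kept = []
        for d in docs:
            if all(cosine(d, k) < threshold for k in kept):
                kept.append(d)
        return kept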
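
The simple form of the document-difference task, finding
additional material in a new version, is close to what a
line-level diff already provides; here is a sketch using
Python's standard difflib. The deep, higher-abstraction
comparison the item mentions remains open.

    import difflib

    def new_material(old_text, new_text):
        # Return lines present in the new version but not the old.
        diff = difflib.unified_diff(old_text.splitlines(),
                                    new_text.splitlines(), lineterm="")
        return [line[1:] for line in diff
                if line.startswith("+") and not line.startswith("+++")]

    print(new_material("line one\nline two",
                       "line one\nline two\nline three"))  # ['line three']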
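
For abstraction of a single text by sentence selection, a
frequency-based sketch along the lines of the early work
mentioned above; splitting on periods is a crude stand-in for
real sentence segmentation, and domain-specific abstraction
would need far more than this.

    from collections import Counter

    def summarize(text, n_sentences=3):
        # Select the sentences whose words are most frequent overall,
        # on the theory that frequent terms mark representative content.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        freq = Counter(w for s in sentences for w in s.lower().split())
        def weight(s):
            words = s.lower().split()
            return sum(freq[w] for w in words) / len(words)
        best = set(sorted(sentences, key=weight, reverse=True)[:n_sentences])
        return ". ".join(s for s in sentences if s in best) + "."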
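
For the predictive-support challenge, a least-squares trend
extrapolated under alternative scenarios, standing in for the
decision-maker's spreadsheet formulas; every figure below is
invented.

    def fit_line(xs, ys):
        # Ordinary least squares for y = a + b*x.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        return my - b * mx, b

    years = [1999, 2000, 2001, 2002, 2003]
    revenue = [10.0, 12.1, 13.9, 16.2, 18.0]   # invented 10-K-style figures
    _, slope = fit_line(years, revenue)
    for scenario, factor in [("pessimistic", 0.5), ("baseline", 1.0),
                             ("optimistic", 1.5)]:
        forecast = revenue[-1] + factor * slope * (2005 - years[-1])
        print(scenario, round(forecast, 1))

An information system that retains such models could compare
the outcomes of the alternative futures directly, instead of
terminating its support at the past.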
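
For the discovery of abnormality, a toy version of the point
that abnormality is only identifiable if normality can be
quantified: model a traveller's normal trip frequency, then
flag observations that deviate strongly. A real model would
have to separate business, holiday, and family activity;
every number here is invented.

    import statistics

    def flag_abnormal(history, current, z_cut=3.0):
        # Flag an observation far outside the quantified normal range.
        mean = statistics.mean(history)
        sd = statistics.pstdev(history)
        if sd == 0:
            return current != mean
        return abs(current - mean) / sd > z_cut

    trips_per_month = [2, 3, 2, 4, 3, 2, 3, 2]   # invented normal behavior
    print(flag_abnormal(trips_per_month, 11))    # True: worth a closer look
    print(flag_abnormal(trips_per_month, 3))     # False: within normal range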
Many of these tasks may be handled semi-automatically, i.e.,
with human supervision, before full automation can be achieved.
But semi-automatic systems should have the capability to learn
from those interventions, so that the human load is reduced
over time. All these tasks can be expanded by adding adjectives
such as `distributed', `multi-media', or `ubiquitous', but those
won't change the scientific issues.
In order to assess the costs and benefits of alternative technologies,
the setting has to be quantified. In some settings the cost
of missing a source entry (Type 1 error) is high; in other
settings the cost of having to reject irrelevant entries (Type
2 errors) is high. For instance, the cost of missing a terrorist
is indeed high, but many schemes now being considered fail
because technologies that have a low rate of Type 1 errors
are typically associated with a huge rate of Type 2 errors,
so that even at a low cost per rejection there may be
no acceptable cost/benefit ratio. A small worked example follows.
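
To make that quantification concrete, a small sketch of the
expected-cost comparison; every rate and cost below is
invented for illustration.

    def error_costs(volume, miss_rate, miss_cost, false_hit_rate,
                    reject_cost):
        # Return (Type 1 cost, Type 2 cost) over `volume` screened items.
        type1 = volume * miss_rate * miss_cost
        type2 = volume * false_hit_rate * reject_cost
        return type1, type2

    # A scheme tuned for very few misses, at the price of a 5%
    # false-hit rate costing $50 per rejection.
    t1, t2 = error_costs(1_000_000, miss_rate=1e-9, miss_cost=1e9,
                         false_hit_rate=0.05, reject_cost=50)
    print(t1, t2)   # 1000000.0 2500000.0: the false hits dominate

Even at a low cost per rejection, the sheer volume of Type 2
errors can swamp the budget, which is the failure mode noted
above.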
To support web-based businesses, as envisaged in the semantic
web initiatives, a very low rate of Type 2 errors, i.e., false
hits, will be needed. Businesses already today routinely
pre-qualify the suppliers that they will consider dealing with.
The potential cost of getting the wrong stuff, getting it late,
or obtaining the wrong information about stuff is so high that
the benefit of `getting a good deal', say getting some supplies
at 5% less, is relatively negligible. Here smaller, well-qualified
sets of candidate suppliers are certainly preferable.
Assignment of costs to these two types of errors also depends
on one's background. Often senior people, having grown up in
an information-poor setting, will want to get all the information.
It is often the generation in the trenches that realizes that
there is too much to devote attention to.