CS 745/845 - Introduction to Digital Libraries
Dr. Kurt Maly
Michael L. Nelson
Spring 1998, Thursdays, 7:15-9:45 pm, Education Building, Room 155
Grading: Take Home Mid-Term Exam (30%), Project (70%).
Practical Digital Libraries: Books, Bytes, and Bucks is recommended,
but not required. Other readings will be assigned.
References used in the slides and lectures
Project Proposals By Last Name (2/22/98)
Digital Libraries (DLs) are an increasingly popular research area that encompasses
more than traditional information retrieval or database methods and
techniques. We will cover a brief history of DL development, with emphasis on
World Wide Web implementations. Case studies will be performed on various
DLs. The class will focus heavily on project work, especially writing and/or
coding. At the end of the course, students will be prepared to develop,
evaluate, or apply digital library technologies in their work environment.
- What is a digital library (DL)? How is a DL different from a database?
From the World Wide Web (WWW)? From traditional Information Retrieval (IR)?
- History of Scientific and Technical Information (STI) distribution.
Recent trends in readership, authoring, and economics.
- STI literature: grey & white.
- Traditional library vs. DL
- Slides: week1.ppt
- Vannevar Bush, "As We May Think", Atlantic Monthly, July 1945.
In a very real sense, Dr. Bush invented the field of digital libraries and
all we are doing is implementing the vision he laid out 50+ years ago. No
serious discussion of DLs can begin without mention of this article.
- Esler & Nelson, "Evolution of Scientific and Technical Information
Distribution," JASIS 49(1), 1998.
This article provides the background and motivation for the DL work that
is being done at NASA LaRC and ODU. Relevant to the immediate lecture is
the discussion of the current journal system, grey literature, and non-report
STI (e.g., software, datasets, images).
- Publishing models, DL input, scanning, re-keying, OCR
- SGML and other formats
- Metadata: purpose, formats, use
- Slides: week2.ppt
- Ginsparg, P., "First Steps Toward Electronic Research Communication,"
Computers in Physics, 8, 1994, pp. 390-396.
Not surprisingly, much of the literature concerning the intersection
between the Internet and scholarly publication has been by those who are
(openly or secretly) fascinated by the scholarly publication process itself.
Most of the discussions are simply proposals on how to re-implement existing
processes (review, payment, etc.) using the new technology. Ginsparg is
one of the first to argue that Internet technologies can radically alter
the research communication model, and not simply provide "electronic
journals." This is possible by having the research authors
directly input their results into a DL, bypassing the traditional
journal routing/review system.
- Lesk, M. "Books Into Bytes," Scientific American, March, 1997.
http://community.bellcore.com/lesk/sciam97/sciam97.html (original version)
http://www.sciam.com/0397issue/0397lesk.html (published version)
Most of the material in this article appears in the textbook. Using many
non-STI examples, Lesk gives brief examples and economic figures for
scanning, OCRing, rekeying, and media conversion. There are also mentions
of metadata use (Text Encoding Initiative), markup languages (SGML), and
format permanence. The focus of this article is converting information
into digital formats.
- Heery, R. "Review of Metadata Formats," Program 30(4), October 1996.
There is a lot of literature about metadata ("data about data"); Heery gives a
readable overview with a small number of relevant examples. In traditional
libraries, the distinction between metadata and data was well established:
the metadata provided a "pointer" to the physical location of the object.
In a DL, this distinction has blurred, and DL builders must understand
the subtle distinctions when populating their DLs. "MARC" and "Dublin
Core" are among the relevant sample formats discussed.
- Searching, retrieval, indexing: algorithms and applications
- Metrics and evaluation: issues and case studies
- Slides: week3.ppt
- Gerald Salton, "A New Comparison Between Conventional Indexing (MEDLARS)
and Automatic Text Processing (SMART)," Cornell University Computer
Science Technical Report TR71-115, December 1971.
Finding an appropriate, on-line reading for traditional information
retrieval is difficult. IR is a mature field, so overview material exists
only in textbooks and the current research literature deals mainly with
esoterica. The above technical report is an early result from Gerald
Salton, a pioneer in modern IR, and features the SMART on-line retrieval
system, a testbed for many IR experiments. This report highlights
many key issues in IR: precision and recall; boolean searching vs vector
matching; relevance feedback; manual indexing vs automatic processing.
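Several of the issues named above can be made concrete with a small sketch. The code below is only an illustration, not the SMART system or the MEDLARS procedure: the three-document collection, the function names, and the use of raw term frequencies are all assumptions made for this example. It computes precision and recall for one query and contrasts boolean AND matching with cosine vector matching.

```python
import math
from collections import Counter


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def boolean_and(query, doc):
    """Boolean AND: the document matches only if it contains every query term."""
    return set(query.lower().split()) <= set(doc.lower().split())


def cosine(query, doc):
    """Vector matching: cosine similarity of raw term-frequency vectors."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy collection, invented for illustration only.
DOCS = {
    "d1": "digital library metadata formats",
    "d2": "library catalog indexing by hand",
    "d3": "automatic text indexing and retrieval",
}

query = "automatic indexing"

# Boolean search returns an unordered set of exact matches.
boolean_hits = [k for k, text in DOCS.items() if boolean_and(query, text)]

# Vector matching ranks every document, including partial matches.
ranked = sorted(DOCS, key=lambda k: cosine(query, DOCS[k]), reverse=True)
```

The boolean search either accepts or rejects a document, while the vector model also ranks partial matches such as d2. That ranking is what makes a precision/recall trade-off visible as more of the ranked list is examined.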
- Digital Libraries: history, definition, characteristics, architectures
- Kahn/Wilensky Framework and its derivatives
- Slides: week4.ppt
- The lecture for Week 4 will focus on what is known as the Kahn-Wilensky
Framework, and some of its derivatives and implementations. Most DL projects
began in an ad-hoc, bottom-up fashion that facilitated quickly providing
access to large stores of information. Kahn & Wilensky provide a strong,
consistent, and extensible framework for storing, managing and accessing
digital objects. The KWF will provide the basis for the next generation
of production DLs. As such, knowledge of the KWF will allow us to
provide a more thoughtful critique of the upcoming DL case studies.
- Robert Kahn and Robert Wilensky, "A Framework for Distributed Digital
Object Services," cnri.dlib/tn95-01, May, 1995.
This is the original document outlining what has become known as the
Kahn-Wilensky Framework. No implementation issues are discussed, but many
key concepts and terms are introduced. A working understanding of this
document is necessary for a significant remainder of the class.
- C. Lagoze and D. Ely, "Implementation Issues in an Open Architectural
Framework for Digital Object Services", Cornell CS Dept
TR95-1540, September 12, 1995.
This document discusses some of the design concepts introduced in the KWF and
provides a treatment that is one step closer toward implementation.
- C. Lagoze, R. McGrath, E. Overly and N. Yeager, "A Design for
Inter-Operable Secure Object Stores (ISOS)", Cornell CS Dept TR95-1558
November 27, 1995.
This document is another iteration in the design process of KWF issues.
Its discussion includes a proposed CORBA implementation.
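As a rough aid for this week's readings, the sketch below models the idea of storing and accessing digital objects through a repository, with each object named by a unique identifier (a handle). It is not taken from any of the three papers: the class names, method names, and the repository name are assumptions made for illustration, and real implementations involve much more (naming authorities, access control, and so on). The handle string is simply the identifier of the Kahn/Wilensky report cited above.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A unit of content plus the identifier and metadata that travel with it."""
    handle: str                      # globally unique name for the object
    data: bytes                      # the content itself
    key_metadata: dict = field(default_factory=dict)


class Repository:
    """Hypothetical store that accepts deposits and serves objects by handle."""

    def __init__(self, name):
        self.name = name
        self._store = {}

    def deposit(self, obj):
        # Refuse to overwrite: one handle names exactly one object here.
        if obj.handle in self._store:
            raise ValueError(f"handle already in use: {obj.handle}")
        self._store[obj.handle] = obj
        return obj.handle

    def access(self, handle):
        # Raises KeyError if the handle is unknown to this repository.
        return self._store[handle]


class HandleResolver:
    """Hypothetical lookup service: maps a handle to the repository holding it."""

    def __init__(self):
        self._locations = {}

    def register(self, handle, repository):
        self._locations[handle] = repository

    def resolve(self, handle):
        return self._locations[handle]


repo = Repository("hypothetical-repo")
resolver = HandleResolver()
obj = DigitalObject(
    "cnri.dlib/tn95-01",
    b"...report text...",
    {"title": "A Framework for Distributed Digital Object Services"},
)
resolver.register(repo.deposit(obj), repo)

# A client knows only the handle; the resolver tells it where to ask.
found = resolver.resolve("cnri.dlib/tn95-01").access("cnri.dlib/tn95-01")
```

Separating the resolver from the repository means an object can move between repositories without its name changing, which is one reason location-independent identifiers matter in this design space.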
- DL Case Studies: Netlib, NHSE, WATERS, LTRS, NTRS, LISAR/LAVA, NIX, UCSTRI, Physics e-print, NCSTRL
- Slides: week5.ppt
- This week we begin looking at the architecture of some significant WWW DLs.
I have tried to pick readings that are representative of classes of DLs.
In particular, these papers show DLs that have an increasing level of
sophistication that comes at the expense of greater synchronization and
participation requirements of the remote sites.
- M. VanHeyningen, "The Unified Computer Science Technical Report Index:
Lessons in Indexing Diverse Resources," Proceedings of the 2nd
International World Wide Web Conference, October 19-21, 1994, pp. 535-543.
This paper documents the development of the UCSTRI system. UCSTRI was
interesting in that it was an early effort that provided a surprisingly good
search interface to a collection of anonymous FTP servers. The key is
that the anonymous FTP sites did nothing to participate in UCSTRI; their
contents were cataloged and heuristics applied to guess the formats, etc.
This is a DL equivalent of current meta-searchers like Altavista, Infoseek
- M. Nelson and M.-H. Maa, "Optimizing the NASA Technical Report Server,"
Internet Research, 6(1), 1996, pp. 64-70.
This paper gives an overview of some architectural improvements in NTRS,
most notably integrating parallel searching. NTRS is a gateway to 15+
different DLs, and requires the remote sites to meet a minimal requirement
for participation (in contrast to UCSTRI). Most of the individual nodes
in NTRS are of the hybrid http/ftp server variety discussed in the
- J. Davis and C. Lagoze, "The Networked Computer Science Technical Report
Library," Cornell CS TR96-1595, July, 1996.
This paper gives an overview of NCSTRL and the protocol that it is built
upon, Dienst. NCSTRL follows the independent, distributed publisher model
similar to DLs discussed in the above papers, but differs in that it
requires the installation of a sophisticated suite of software for
publication management, indexing, and serving. NCSTRL has 100+
participants, and Dienst is probably the most sophisticated and rich WWW
DL system in widespread use.
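The parallel searching mentioned for NTRS above can be illustrated with a short sketch. This is not the NTRS code: the node names, holdings, and simulated delay are invented for the example, and a thread pool is just one plausible way to query several remote nodes at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_node(name, holdings, delay=0.05):
    """Build a stand-in for a remote DL node; `delay` simulates network latency."""
    def search(term):
        time.sleep(delay)
        return [(name, title) for title in holdings if term.lower() in title.lower()]
    return search


# Hypothetical nodes and holdings, invented for illustration only.
NODES = [
    make_node("node-a", ["Wing design report", "Digital library gateway"]),
    make_node("node-b", ["Library automation survey"]),
    make_node("node-c", ["Propulsion test data"]),
]


def search_serial(term, nodes):
    """Query each node in turn; total time is the sum of the node delays."""
    results = []
    for node in nodes:
        results.extend(node(term))
    return results


def search_parallel(term, nodes):
    """Query all nodes at once; total time is roughly the slowest node's delay."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = pool.map(lambda node: node(term), nodes)
        # pool.map yields results in node order, so output matches the serial run.
        return [hit for hits in per_node for hit in hits]
```

Both functions return the same hits in the same order. The difference is only in elapsed time, which for a gateway to many nodes is dominated by waiting on the network rather than by local computation.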
- DL Case Studies: DLI, JSTOR, LOC, STELAR, ADS, NACA, CORE
- Slides: week6.ppt
- This week we continue discussing the architecture and status of some
functioning DLs. Some notable projects such as JSTOR (www.jstor.org) and
the American Memory Project (rs6.loc.gov) have no published material about
their architecture. We will discuss their observed and inferred
architecture, however. We will also go over some of the recent changes in
the Dienst 4.0 -> 4.1 protocol.
- A. Accomazzi, G. Eichhorn, M. J. Kurtz, C. S. Grant, S. S. Murray,
"Astronomical Information Discovery and Access: Design and Implementation
of the ADS Bibliographic Services," Astronomical Data Analysis Software
and Systems VI, Vol. 125, 1997, pp. 357-360.
This short paper discusses the Astrophysics Data System (ADS;
http://ads.harvard.edu) and its current mechanisms for accessing its holdings.
The ADS is a NASA funded effort that does a remarkable job of providing
access to abstracts, articles (in both scanned and native formats), and
datasets. You are highly encouraged to take a tour of this service.
- R. E. McGrath, "UIUC DLI Project Scale-up: A Technical Evaluation,"
December 15, 1996.
This (unpublished?) paper provides a thorough examination of the
architecture of the Digital Library Initiative (DLI) project at the
University of Illinois. The UIUC portion of the DLI program focuses on
"federating" the journal output of several professional societies (IEEE,
ASCE, AIAA, etc.), providing multi-disciplinary access to many traditional
journals using a variety of on-line mechanisms, WWW and otherwise. This
prototype is not publicly accessible, though one would assume differently
after a cursory review of the literature. However, given the broad scope
and high profile the project enjoys, it merits review in our discussion.
The UIUC DLI home page is at: http://dli.grainger.uiuc.edu/.
- Buckets: Use, current status, Bucket Matching System (BMS), related work
- NCSTRL+: modifications to Dienst to support "clustering" of the contents
- Aggregation use in DLs
- Intelligent agents and use in DLs
- Slides: week7.ppt
- Michael L. Nelson, Kurt Maly, Stewart N. T. Shen and Mohammad Zubair,
"NCSTRL+: Adding Multi-Discipline and Multi-Genre Support to the Dienst
Protocol Using Clusters and Buckets," ODU CS TR-97-40, December 1997.
This paper gives an overview of the LaRC / ODU joint program: NCSTRL+.
NCSTRL+ extends the Dienst protocol (4.0) to incorporate the ability to
"cluster" subsets of a DL's holdings. This includes the ability to
cluster along: publishing institution, subject category, and genre of the
STI. The NCSTRL+ project also includes the development of buckets.
A revised version of this paper will appear in the proceedings
of "Advances in Digital Libraries 98."
- Michael L. Nelson, Kurt Maly, Stewart N. T. Shen and Mohammad Zubair,
"Buckets: Aggregative, Intelligent Agents for Publishing," ODU CS TR-97-41,
There is some overlap between this paper and the above paper, but this one
focuses just on buckets and is thus able to discuss them in more detail.
This paper also discusses some of the projects that are similar to buckets.
- Meta searching: historical perspective (archie, veronica, etc), issues
- Current trends: commercial systems (Yahoo, Lycos, etc.), robots, STARTS, Lyceum
- Slides: week8.ppt
- Ming-Hokng Maa, Sandra L. Esler and Michael L. Nelson, "Lyceum:
A Multi-Protocol Digital Library Gateway," NASA TM-112871, July 1997.
This week we look at metasearching, robots and directories. The common
requirement for these applications is that the information to be indexed
and served to the user is "out there" and may not be known a priori.
Lyceum is a proof-of-concept meta-DL constructed by searching
individual DL nodes. The nodes in Lyceum are of different protocols,
and Lyceum performs conversion of the queries to the protocols of the
target DLs. Lyceum does "HTML-scraping" to present the search results to
- L. Gravano, C.-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS:
Stanford Proposal for Internet Meta-Searching," Proc. of the 1997 ACM
SIGMOD International Conference On Management of Data, 1997.
STARTS approaches metasearching by defining an interoperability protocol to
be implemented by the different search engines. STARTS defines the
mechanism by which proxies can query indices (of differing protocols) and
have enough standard meta-information to filter, rank, and display the
- C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, Michael F.
Schwartz, and Duane P. Wessels, "Harvest: A Scalable, Customizable
Discovery and Access System," Technical Report CU-CS-732-94, Department of
Computer Science, University of Colorado, Boulder, August 1994 (revised
If you don't want to do protocol conversion for different indices,
or you cannot rely on the indices to comply with a protocol such as
STARTS, then for some applications it is reasonable to gather the remote
information yourself. The architecture of most commercial systems
(Altavista, Lycos, etc.) is proprietary information, but Harvest is a
freely available and popular system for gathering and serving remote
information that incorporates all the general components of its commercial
brethren. It has a clean, modular design and has the ability to
hierarchically arrange different Harvest servers.
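The gather-it-yourself approach described for Harvest can be sketched in a few lines. Harvest's own components are called Gatherers and Brokers, but everything else below is an assumption made for illustration: the record format, the method names, and the example URLs are invented, and the "remote sites" are just an in-memory dictionary rather than real network fetches.

```python
from collections import defaultdict

# Stand-in for remote servers; URLs and contents are hypothetical.
SITE_A = {
    "ftp://site-a.example/reports/tr1.txt": "digital library architecture",
    "ftp://site-a.example/reports/tr2.txt": "parallel search gateway",
}
SITE_B = {
    "http://site-b.example/papers/p1.html": "digital object framework",
}


def gather(site):
    """Gatherer role: visit each remote item and emit a compact summary record."""
    for url, text in site.items():
        yield {"url": url, "terms": sorted(set(text.lower().split()))}


class Broker:
    """Broker role: index summary records and answer queries from the index."""

    def __init__(self):
        self.records = []
        self._index = defaultdict(set)   # term -> set of URLs

    def load(self, records):
        for rec in records:
            self.records.append(rec)
            for term in rec["terms"]:
                self._index[term].add(rec["url"])

    def query(self, *terms):
        """Return URLs containing every query term (boolean AND)."""
        sets = [self._index.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []


# One broker per site, fed by that site's gatherer.
broker_a, broker_b = Broker(), Broker()
broker_a.load(gather(SITE_A))
broker_b.load(gather(SITE_B))

# Hierarchical arrangement: a top-level broker indexes the other brokers' records.
top = Broker()
top.load(broker_a.records)
top.load(broker_b.records)
```

Because brokers exchange summary records rather than full documents, a higher-level broker can cover many sites without re-fetching their contents, which is the point of arranging servers hierarchically.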
- Non-textual DLs: Software, Datasets, Video, Image, Audio, Geographic, others
- Slides: week10.ppt
- R. J. McNab, L. A. Smith, D. Bainbridge and I. H. Witten, "The New Zealand
Digital Library MELody inDEX," D-Lib Magazine, May 1997.
This paper describes the MELDEX system, which is designed to search and
retrieve musical recordings based on humming into a microphone. The audio
input is converted into regular musical notation, then regular string
matching is done to come up with a "best match" for the song that fits
- S.-F. Chang, J. R. Smith, H. J. Meng, H. Wang, and D. Zhong, "Finding
Images/Video in Large Archives: Columbia's Content-Based Visual Query
Project," D-Lib Magazine, February 1997.
This paper describes the SaFe (Spatial and Feature query system) and other projects
at Columbia University. It combines textual searching with content-based
searching for videos and images. SaFe extracts "features" and text from
images and video and uses them to build its index.
- L. D. Bergman, V. Castelli, C.-S. Li, "Progressive Content-Based Retrieval
from Satellite Image Archives," D-Lib Magazine, October 1997.
This paper gives a good background on the issues of searching geo-spatial
data. Extensive examples are given, but I don't believe there is a demo
- J. Frew, M. Freeston, R. B. Kemp, J. Simpson, T. Smith, A. Wells, and Q.
Zheng, "The Alexandria Digital Library Testbed," D-Lib Magazine,
This paper gives an architectural overview of the Alexandria Digital
Library. It does not specifically focus on the issues of searching
geo-spatial information, but since Alexandria does have a workable on-line
demo, and Alexandria is a DLI-funded initiative, I have included this
paper in our readings.
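The MELDEX reading earlier this week describes turning a hummed tune into notation and then using string matching to find a best match. The sketch below illustrates that general idea only and is not the MELDEX algorithm: it skips the audio step entirely, assumes the input is already a list of MIDI note numbers, and uses an up/down/repeat melodic contour with plain edit distance as stand-in techniques. The tiny tune database is invented for the example.

```python
def contour(pitches):
    """Encode a note sequence as steps: U (up), D (down), R (repeat)."""
    return "".join(
        "U" if b > a else "D" if b < a else "R"
        for a, b in zip(pitches, pitches[1:])
    )


def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete from s
                cur[j - 1] + 1,            # insert into s
                prev[j - 1] + (cs != ct),  # substitute or match
            ))
        prev = cur
    return prev[-1]


# Hypothetical database: opening notes as MIDI numbers.
TUNES = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}


def best_match(hummed, tunes=TUNES):
    """Return (title, distance) for the tune whose contour is closest to the input."""
    q = contour(hummed)
    return min(
        ((title, edit_distance(q, contour(notes))) for title, notes in tunes.items()),
        key=lambda pair: pair[1],
    )


# "Twinkle" hummed a whole step higher, with one wrong note near the end.
hummed = [62, 62, 69, 69, 71, 70, 69]
```

Using the contour rather than absolute pitch makes the match independent of the key the user hums in, and the edit distance tolerates a few wrong notes. Both properties matter when the query comes from an untrained singer.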
- Intellectual property: copyright, security, commerce, terms and conditions
- Slides: week11.ppt
- U. Kohl, J. Lotspiech, M. A. Kaplan, "Safeguarding Digital Library
Contents and Users: Protecting Documents Rather Than Channels," D-Lib
Magazine, September 1997.
This document gives an overview of security in a DL context. The authors
contrast securing the communications channel via SSL, SHTTP, etc. with
securing the digital object itself. The latter philosophy, of digital
objects being responsible for their own "security," is similar to
digiboxes and buckets, discussed in lecture 7. This allows what they
refer to as "super-distribution," in which the object is free to "move"
around. This also fits with the model of "dumb repositories, smart objects"
we have put forth with buckets.
- R. Stallman, "Why Software Should Not Have Owners," 1994.
Richard Stallman is head of the Free Software Foundation, and both he and
Linus Torvalds (creator of Linux) are the de facto point men for the small but
vocal free software community. I present this reading half in jest, since
I've already disclosed that I'm an information radical ;-), and half seriously
since most of the concepts put forth in Chapter 10 of the Lesk textbook
can be considered as "philosophy." Perhaps the trouble we have
encountered in adapting intellectual property to a digital environment
should be considered an invitation to re-examine some of our
presuppositions about information.
- Guest Lectures:
- Richard S. McGinnis, Head of the EOSDIS DAAC Program Office
Richard will talk about the architecture, services, and direction of the
Earth Observing System Data and Information System Distributed Active
Archive Center (EOSDIS DAAC). See http://eosweb.larc.nasa.gov/ for more information.
- Mike M. Little, Head of the Information Management Branch
Mike will talk about the Langley Technical Library, Digital Libraries and
how they fit in with an information architecture for NASA Langley Research
Center. See http://library-www.larc.nasa.gov/ for more information.
- Common Object Request Broker Architecture (CORBA), Internet Inter-ORB Protocol (IIOP), General Inter-ORB Protocol (GIOP), Stanford's Infobus Project.
- Slides: week13.ppt
- Steve Vinoski, "CORBA: Integrating Diverse Applications Within Distributed
Heterogeneous Environments," IEEE Communications Magazine, 4(2), Feb. 1997
This paper gives an overview of the Common Object Request Broker
Architecture (CORBA). CORBA is a core technology for establishing a
communications infrastructure between heterogeneous, distributed objects.
CORBA is being developed by the Object Management Group consortium and is
generically known as "middleware." However, it will likely have significant
impact on the digital library community, esp. in projects such as buckets.
- M. Baldonado, C.-C. K. Chang, L. Gravano, A. Paepcke, "The Stanford
Digital Library Metadata Architecture," International Journal of Digital
Libraries, 1(2), September 1997, pp. 108-121.
This paper gives an overview of the Stanford University projects that are
a part of the DLI program. Included in this review is STARTS, which we
have already studied, and the Infobus, a project for DL communications
that uses CORBA as the core technology.
- Collaboration and Digital Libraries
- Slides: week14.ppt
- L. M. Simmons, Jr., "Collaboration Dreams (Guest Editorial)," D-Lib
Magazine, March 1997.
This editorial provides an insightful overview of collaboration. Simmons
begins with the explanation of the role of collaboration among humans in
general, then of collaboration in a research environment. He then
describes the function of the ideal electronic collaboration tool. A
short, readable article that motivates discussion of collaboration.
- M. Van Alstyne & E. Brynjolfsson, "Could the Internet Balkanize Science,"
Science, 294, November, 1996, pp. 1479-1480.
Van Alstyne & Brynjolfsson offer a less optimistic view of electronic
collaboration. Their thesis is that the electronic access to colleagues
that allows geographic barriers to be overcome for collaboration could
also produce a homogeneity of thought. Geographic balkanization
could be replaced by electronic balkanization, and the necessary
ingredient for multi-disciplinary collaboration could be lost.
- M. Bowman & B. Camargo, "Digital Libraries: The Next Generation in File
System Technology," D-Lib Magazine, February 1998.
In the spirit of collaboration, this paper describes file systems and the area
of intersection between digital libraries and file systems. Not only are
file systems available with richer semantics for collaboration than many
of the current tools using http for transport, there are also
object-oriented experimental file systems in development that could be
used to deliver a digital object in the KWF sense.
- Off -- Finish up the projects!
Exams end 5/7/98.