Computer Science Department Events
PhD defense for Paul Flynn on Wednesday, April 2nd at 1:30pm in the E&CS building.WEDNESDAY APRIL 2
E&CS Building RM 3316 (3rd floor conference room)
Thesis title: Document Classification in Support of Automated Metadata Extraction from Heterogeneous Collections
Speaker: Paul Flynn
Dr. Steven Zeil
Dr. Kurt Maly
Dr. M. Zubair
Dr. Harris Wu, IT and Decision Sciences
Abstract: A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time consuming, the task of identifying metadata fields by inspecting the document is easy for a human. The visual cues in the formatting of the document along with accumulated knowledge and intelligence make it easy for a human to identify various metadata fields. Even with the best possible automated procedures, numerous sources of error exist, including some that cannot be controlled, such as scanned documents with text obscured by smudges, signatures, or stamps. A commercially viable process for metadata extraction must remain robust in the presence of these external sources of error as well as in the face of the uncertainty that accompanies any attempts to automate "intelligent" behavior. While extraction accuracy and completeness must be the primary goal of an extraction system, the ability to detect and report questionable results is equally important for a production quality system, since it promotes confidence in the system.
We have developed and demonstrated a novel system for extracting metadata. First, a document is examined in an attempt to recognize it as an instance of a known document layout. Then a template, a scripted description of how to associate blocks of text in the layout with metadata fields, is applied to the document to extract the metadata. The extraction is validated after post-processing to evaluate the quality of the extraction and, if necessary, to flag untrusted extractions for human recognition.
The success or failure of the template approach is directly tied to document classification, which is the ability to match the document to the proper template correctly and consistently. Document classification in our system is implemented as a module which applies every template available in the system to a document to find candidate templates that extract any data at all. The candidate templates are evaluated by a validation module to select the best performing template. This method is called "post hoc" classification. Post hoc classification is not only effective at selecting the correct class but it also excels at minimizing false positives. It is, however, very sensitive to changes in the template collection and to poorly written templates.
While this dissertation examines the evolution and all the major components of an automated metadata extraction system, the primary focus is on the problem of document classification. The main thrust of my research has been investigating alternative methods of document classification to replace or supplement post hoc classification. I experimented with machine learning techniques as an additional input factor for the post hoc classification script or the final validation script.