Extracting Semistructured Data from the Web: The DEByE Approach

Alberto H. F. Laender
Department of Computer Science
Federal University of Minas Gerais
Belo Horizonte, Brazil

A large portion of the Web is composed of pages that can be regarded as containers of useful semistructured data that, although not readily available for automated processing, can be identified, extracted, and manipulated independently. In this talk we present DEByE (Data Extraction By Example), an approach to extracting data from Web sources based on a small set of examples specified by the user from a sample page. The novelty of this approach is that the user specifies examples according to a structure of their choosing, and that this structure is described at example specification time. To specify the examples, the user interacts with a tool we developed that adopts nested tables as its visual paradigm. Nested tables are simple and intuitive, and they shield the user from technical details of the extraction problem (such as HTML tags, formatting operators, and learning automata). The examples provided by the user are then used to generate patterns that allow extracting data from new pages. For the extraction, DEByE adopts a new bottom-up procedure that, as our experiments demonstrate, is very effective with various Web sources. In this talk we also discuss the use of the DEByE approach within the framework of the WebDL architecture for building an ETD digital library from the Web.

About the speaker: Alberto H. F. Laender received the B.S. degree in Electrical Engineering and the M.Sc. degree in Computer Science from the Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, in 1974 and 1979, respectively, and the Ph.D. degree in Computing from the University of East Anglia, Norwich, UK, in 1984.
He joined the Computer Science Department of the Universidade Federal de Minas Gerais in 1975, where he is currently a Full Professor and the head of the Database Research Group. He was also the Coordinator of the Computer Science Graduate Program (1987-89 and 1993-96). In 1997, he was a visiting scientist at the Hewlett-Packard Palo Alto Laboratories. He has served as a program committee member for several international conferences on databases and Web-related topics. He was one of the program committee co-chairs of the 19th International Conference on Conceptual Modeling, held in Salt Lake City, Utah, in October 2000, and the program committee chair of the 9th International Symposium on String Processing and Information Retrieval, held in Lisbon, Portugal, in September 2002. He is also a founder member of the Brazilian Computer Society and an Editorial Board member of the Journal of the Brazilian Computer Society and of the Information Systems Review. His research interests include conceptual database modeling, database design methods, database user interfaces, semistructured data, and Web data management.