Information Extraction with Markov Random Fields
Andrew McCallum, Ph.D., University of Massachusetts Amherst
Wednesday, May 7, 2003, 10:30 a.m.
6225 Medical Sciences Center, 1300 University Ave.
As we continue to be flooded with increasing amounts of text data, we have a growing need for tools that will not only allow us to retrieve documents, but also mine the structured data buried inside their natural language text. Such a structured representation then enables automated methods to model outliers, predict the future, and provide decision support. For example, U.S. ports would be safer if automated processes could detect suspicious patterns in the shipping manifests of cargo container ships; a system could suggest which DNA array experiments may be most fruitful by mining facts from biology research articles; and we could better predict long-term weather trends by building a weather database from large collections of thousand-year-old Chinese diary entries.
Information extraction is the process of filling a structured database from unstructured text. It is a difficult statistical and computational problem often involving hundreds of thousands of variables, complex algorithms, and noisy and sparse data.
In this talk I will briefly review previous work in finite-state, conditionally trained Markov random field models for information extraction, and then describe three pieces of recent work: (1) the application of conditional Markov random fields to extraction of tables from government reports, (2) feature induction for these models, applied to named entity extraction, and (3) a new random field method for noun co-reference resolution that has strong ties to graph partitioning.
Andrew McCallum is an Associate Professor at the University of Massachusetts Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990s he was a Research Scientist and Coordinator at Justsystem Pittsburgh Research Center. He received his PhD in computer science from the University of Rochester in 1995, and was a post-doctoral fellow at Carnegie Mellon University in 1996. He is on the editorial board of the Journal of Machine Learning Research, and has co-organized numerous technical workshops. For the past eight years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, document classification, finite state models, and learning from combinations of labeled and unlabeled data. Web page: http://www.cs.umass.edu/~mccallum.