Automatic XML Markup of Documents

Abstract

This is a proposal for automatically marking up documents with XML using Bayesian network based language models in an interactive tool. The purpose of this document is to get feedback on the idea and to find a sponsor for implementing the proposal and gathering empirical results on the quality of the proposed system.

Problem description

Most legacy documents are in a variety of non-XML formats, which makes it difficult to find information in them through search engines or other electronic means. Marking up the relevant parts of these documents using XML makes it possible to apply standard XML tools to common information retrieval tasks. However, converting legacy data to XML is expensive and tends to be labour intensive.

This document targets extracting semantic information from the text inside paragraphs. This is different from extracting semantic information based on the structure of a document, which is targeted in a separate proposal.

In practice, (semi-)automatic markup is based on pattern matching and word list matching. These techniques require domain-specific knowledge to be modelled, which costs considerable effort from both development staff and domain experts. Furthermore, there is an ongoing maintenance cost to keep the markup rules up to date.
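To make the rule-based baseline concrete, here is a minimal Python sketch of pattern matching combined with word list matching. The tag names, the date pattern, and the city list are invented for illustration; a real rule set would be far larger, which is exactly where the modelling and maintenance cost comes from.

```python
import re

# Hypothetical word list; in practice a domain expert has to supply and maintain it.
CITY_WORDS = ["Amsterdam", "Utrecht"]

# Hypothetical pattern for dates such as "7 Nov 2000".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"
)
CITY_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CITY_WORDS)) + r")\b")


def rule_based_markup(text):
    """Wrap dates (pattern match) and known cities (word list match) in XML tags.

    XML escaping of the input is deliberately left out of this sketch.
    """
    text = DATE_PATTERN.sub(lambda m: "<date>" + m.group(0) + "</date>", text)
    return CITY_PATTERN.sub(r"<city>\1</city>", text)


print(rule_based_markup("Meeting in Utrecht on 7 Nov 2000."))
# Meeting in <city>Utrecht</city> on <date>7 Nov 2000</date>.
```

Every new entity type or spelling variant needs another rule or list entry, which is the cost the proposal aims to reduce.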

Language Models

It seems a good idea to use the knowledge and experience of the natural language processing field for extracting semantics from plain text. Language models are a way of modelling semantics in text using hidden Markov models [1]. Hidden Markov models are a special form of Bayesian network [2,3]. It therefore seems natural to combine the efforts of these two fields of research and design language models using Bayesian networks with hidden variables that allow more dependencies than hidden Markov models, in order to overcome the limitations of currently used models.
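As a point of reference, the sketch below shows how a plain hidden Markov model can assign a hidden tag to each word using Viterbi decoding. It is not the proposed Bayesian network model, which would allow more dependencies; the states, the probabilities, and the `unseen` smoothing constant are all invented for illustration, and the word list is assumed to be non-empty.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable hidden tag sequence for `words` under an HMM.

    Words missing from an emission table get the small probability `unseen`
    (an ad hoc smoothing assumption for this sketch).
    """
    def emit(state, word):
        return math.log(emit_p[state].get(word, unseen))

    # best[i][s] = (log-probability of the best path ending in s at word i,
    #               previous state on that path)
    best = [{s: (math.log(start_p[s]) + emit(s, words[0]), None) for s in states}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, back = max(
                (prev[p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + emit(s, word), back)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model: "O" means "no tag", "NAME" means "inside a person name".
states = ["O", "NAME"]
start_p = {"O": 0.8, "NAME": 0.2}
trans_p = {"O": {"O": 0.8, "NAME": 0.2}, "NAME": {"O": 0.5, "NAME": 0.5}}
emit_p = {
    "O": {"please": 0.3, "contact": 0.3},
    "NAME": {"Remco": 0.5, "Bouckaert": 0.5},
}

print(viterbi(["please", "contact", "Remco", "Bouckaert"],
              states, start_p, trans_p, emit_p))
# ['O', 'O', 'NAME', 'NAME']
```

In an HMM each tag depends only on the previous tag and each word only on its own tag; a more general Bayesian network could relax those restrictions.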

Naturally, stemming and stopword list techniques are part of the mix.
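For readers unfamiliar with these two preprocessing steps, a rough Python sketch follows. The stopword list is a tiny illustrative sample, and the suffix stripper is a crude stand-in for a real stemmer (it is not the Porter algorithm).

```python
# Illustrative stopword sample, not a standard list.
STOPWORDS = {"the", "of", "a", "is", "to", "and", "in"}

# Crude suffix stripping for illustration only; NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalise(text):
    """Lower-case, split on whitespace, drop stopwords, and stem the rest."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOPWORDS]


print(normalise("Marking the documents"))
# ['mark', 'document']
```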

An advantage of a natural language processing based approach is that the model can learn from text that has already been marked up: as the amount of marked-up XML increases, the model can be expected to improve.
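One simple way to picture learning from marked-up text is to count which words occur inside which tags; such counts could then feed the probability tables of a language model. The sketch below uses Python's standard `xml.etree.ElementTree` module. The sample document and its tag names are invented, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def count_words_per_tag(xml_text):
    """Count how often each word occurs directly inside each non-root element."""
    counts = defaultdict(Counter)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element is root:
            continue  # skip the enclosing paragraph itself
        counts[element.tag].update((element.text or "").split())
    return counts


# Invented example of already marked-up text.
sample = "<p>Contact <name>Remco Bouckaert</name> before <date>7 Nov 2000</date>.</p>"
counts = count_words_per_tag(sample)
print(counts["name"]["Remco"], counts["date"]["2000"])
# 1 1
```

The more marked-up text is fed through such a step, the more reliable the resulting counts become.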

Practical implementation

Automatic markup procedures can be applied by batch processing files. However, quality control then needs to be done afterwards, which makes the task tedious and error-prone. Furthermore, it is difficult to distinguish the parts of a document that need not be marked up (e.g. bibliographies, computer code fragments, or passages that have been cited from other documents).

Another approach is to have an application with a GUI where a user indicates the part of the text to be marked up by putting the cursor at the start of the text to be processed. Alternatively, the user can select a block of text to be processed and then press the AutoMarkup button. This kicks off the conversion routine; whenever it finds a piece of text that looks like it needs tagging, it asks the user for confirmation. This approach is ideal for providing relevance feedback.
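The confirmation loop described above might be sketched as follows. This is only an assumption about how such a routine could be organised: the function name, the `(start, end, tag)` suggestion format, and the `confirm` callback (which stands in for the GUI dialog) are all hypothetical, and the suggested spans are assumed not to overlap.

```python
def interactive_markup(text, suggestions, confirm):
    """Apply only the suggested tags that the user confirms.

    suggestions: list of (start, end, tag) character spans, non-overlapping
    confirm:     callable(fragment, tag) -> bool, e.g. a GUI dialog
    Returns the marked-up text and the list of accepted suggestions.
    """
    accepted = []
    result = text
    # Work from right to left so that earlier character offsets stay valid.
    for start, end, tag in sorted(suggestions, reverse=True):
        fragment = text[start:end]
        if confirm(fragment, tag):
            result = result[:start] + f"<{tag}>{fragment}</{tag}>" + result[end:]
            accepted.append((start, end, tag))
    return result, accepted


text = "Contact Remco Bouckaert by 7 Nov 2000"
name_start = text.index("Remco")
date_start = text.index("7 Nov")
suggestions = [
    (name_start, name_start + len("Remco Bouckaert"), "name"),
    (date_start, date_start + len("7 Nov 2000"), "date"),
]

# Stand-in for the user: accept names, reject dates.
marked, accepted = interactive_markup(text, suggestions, lambda frag, tag: tag == "name")
print(marked)
# Contact <name>Remco Bouckaert</name> by 7 Nov 2000
```

The accepted and rejected suggestions are the kind of relevance feedback that could later be used to retrain the model.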

Online conversion business model

Provide application service through remote conversion. Benefits:

Risks:

Conclusion

In this document, a proposal is made for attacking the problem of automatic information extraction from legacy documents as part of an XML conversion process. Current research in language models, combined with a GUI that provides a semi-interactive workflow, makes this a very promising approach to information extraction.

Due to a lack of data and resources, no empirical data is available to support this claim as yet. Therefore, I am looking for a sponsor to implement the proposed system and perform experiments.

If you are interested or have any comments, please contact Remco Bouckaert.


1: Jay Ponte (1998), A language modeling approach to information retrieval, PhD Thesis, University of Massachusetts Amherst.

2: Judea Pearl (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, California.

3: S. L. Lauritzen and D. J. Spiegelhalter (1988), Local computations with probabilities on graphical structures and their application to expert systems (with discussion), Journal of the Royal Statistical Society, Series B 50, 157-224.


Last updated: 7 Nov 2000 webmaster