Conversion to XML: An Interactive Bayesian approach

Abstract

This is a proposal for converting documents from commonly used file formats to XML. The purpose of this document is to get feedback on the idea and find a sponsor for implementing the proposal and gathering emperical results of the quality of the proposed system.

Problem Statement

There are many benefits in using XML as file format for documents. However, XML is a relatively new standard and though the benefits are well known, many companies are using legacy file formats for their documentation. One of the main reasons not to use XML is the huge costs involved in converting the legacy documents into XML. The conversion cost can easily supercede the costs of introducing new IT systems for supporting XML for example.

Another reason that witholds the acceptance of XML in a company is the exchange of documents with other parties. If those parties do not use the same form of XML (which is very likely in general) a conversion process is required every time a document is sent or recieved.

In short, conversion of documents into XML is an issue that keeps many organisations back from reaping the benefits promised by XML.

Commonly Used Techniques

Most used techniques for outsourcing are batch tools, converting documents from generally used file formats such as MS Word, Word Perfect, ASCII and RTF to XML conforming to a 'rainbow' DTD. A rainbow DTD is a DTD that mainly contains elements for marking up the style with which the text is marked up, but does not contain any semantics. For pragmatic reasons, it is easy to do an in-between step utilising MSWord to convert general file formats to RTF and then convert the RTF file to XML (a free omnimark script is available to do this last step but commercial filters are available as well ). What remains is an application specific step for converting the XML in the rainbow DTD to XML conforming to the application specific DTD.

The biggest problem is in converting from XML in the rainbow DTD to application specific DTDs. Good progress can be made by applying pattern matching rules. Any technique targeting pattern matching (e.g. algorithms from machine learning) can be used alternatively. But this will in most cases result in XML that anly partially conforms to the target DTD. Therefore, a post-processing step is required, which is often done manually. For example, by starting an XML editor with the partially converted file and manually apply correct tagging to the document. This is a labor intensive and expensive task.

Observation

Most often, the conversion process consists of a sequential number of steps without any feedback between steps. Therefore, parts of the document which may be hard to convert by the RTF to XML converter cannot be helped by human intervention.

Interactive Alternative

At least one way to progress seems to be to make the process cyclic instead of sequential. However, now the problem occurs that the knowledge contained in the xml2xml converter operates on files with conforming to a Rainbow DTD, while the resulting file is in the target DTD. So, it seems to be hard to replace the first XML file by the second and just rerun the xml2xml converter.

The alternative is to have two inputs into the xml2xml converter.

An interactive Bayesian approach to XML Conversion

In this section, a Bayesian approach is introduced which allows interactive human intervention in the conversion process.

Bayesian approach

Bayesians have a simple and straightforward look at the world: anyone starts with some belief called the prior, for example 'most sheep are white'. Every time an observation is made, the belief is updated. Every time a white sheep is encountered, the belief that all sheep are black is increased (this is a very common event in New Zealand). But it is also possible that the belief is not reinforced but decreased, e.g., by observation of a black sheep. The resulting belief is called the posterior.

Slightly more formally, starting with a probability distribution P(X) representing the prior belief we observe some data D. Then, we can update our belief with Bayes formula:

P(X|D) = P(X) P(D|X) / P(D)

This factor can be used for selecting a model that best represents the data as follows: consider two models M1 and M2. Selecting the best of both is achieved by calculating

P(X=M1|D) = P(X=M1) P(D|X=M1) / P(D)

P(X=M2|D)   P(X=M2) P(D|X=M2) / P(D)

But since P(D) is constant and appears in both terms, this can be simplified to

P(X=M1|D) = P(X=M1) P(D|X=M1)

P(X=M2|D)   P(X=M2) P(D|X=M2)

The resulting fraction (called Bayesian factor) is used to decide which model to choose: if it is over 1, model M1 is selected, otherwise model M2 is preferred.

Bayesian approach to XML conversion

In the context of the previous section, the aim of the conversion to XML is to find a model M for the document that maximizes the probability P(M|D) where D is the source document conforming to the rainbow DTD. This probability can be calculated using Bayes formula as P(M|D) = P(M) P(D|M) / P(D) µ P(M) P(D|M) where P(M) is the prior probability of the model. This prior probability may initialy be uniform over all possible models (i.e., not putting any preference on any particular conversion). However, once the conversion has passed a single cycle where a user has made some corrections in the converted document, the prior probability can be calculated on basis of the corrections of the user. For example, if the user has indicated where the title of the document is, any conversion that results in a tagging that has the title in a different place can get a prior probability of 0 and will hence be ignored.

This Bayesian approach remain two problems to be solved:

In general, it is not possible to calculate these terms exactly because of the computational complexity of the term. Therefore, independence assumptions have to be made in order to make the calculation computational feasable. The following sections presents just one way of making such assumptions, but other methods could be probed as well.

Calculating P(M|D)

An XML instance can be interpreted as an instance of a language defined by a DTD. From the observation that a DTD specifies a regular language, using standard parsing theory, we derive that every element model can be rewritten in a grammar that has only rules of the form S->AB and S->a where capitals are non-terminals and lower case letters terminals. Of course, #PCDATA sections should be dealt with as special case and can be interpreted as terminals. This means that the parse tree of an XML instance using such a grammer is a binary tree where the leaves are the terminals. So,

P(M|D) = P(Tree,Leaves|D)

Assume that application of the rules is independent of the data, then

P(Tree,Leaves|D) = P(Tree|D)P(Leaves|D,Tree) = P(Tree|D)P(Leaves|D,Tree)

Now,

P(Tree) = P(depth first applied rules) = ProdrulesP(rulei|rule1...rulei-1)

It seems reasonable to assume application of the rules in the grammar are mutually independent because the partial tree induced by rule1...rulei-1 contains no information except for the node that is the parent node for rulei. So, we can write,

P(Tree) = Prodnode in TreeP(rule applied|parent node)

The probabilities P(rule applied|parent node) can be estimated from previously (manually) converted documents. Estimates could be improved over time, which seems especially usefull in the situation that style of the input documents changes over time (i.e., this takes care of the situation that the probability with which input documents are generated is not stationair).

Leaves to be resolved the term P(Leaves|D,Tree). We assume that every data leave can be mapped onto a tree leave where #PCDATA leaves are automatically mapped onto #PCDATA leaves and other leaves are mapped according to an emperically determined probability. After some math P(Leaves|D,Tree) can be written as

P(Leaves|D,Tree)=ProdiP(Tree Leavei| Data Leavei)P(outputrulei)/P(Tree Leavei)

where outputrulei the rule applied to get the leave token, Tree Leavei and i ranges over all tree leaves. Again, all probabilities again can be estimated from existing converted documents.

Calculating P(M) on basis of updated document

Initially, no model can be assumed to be prefered over other documents. This makes the distribution uniform, i.e., P(M) = C > 0 for any M. When the user corrects a document, a simple approach is to assume that the user is always correct, which seems a practical assumption. Therefore, any documents that does not parse into a tree with the provided correction can be assumed to be incorrect and hence can be assigned a prior of zero. Since documents conforming to the correction can still not be differentiated, they could get a constant prior P(M') = C' > 0 for models M' that take into acount the users corrections.

Interactive Bayesian approach

The whole process can be supported by a nice GUI, where the File/Open menu allows for opening a document in any common file format. The program would perform the first cycle automatically on the document and show the document marked up with XML. For example as follows:

The GUI allows for manipulating the tree, and hence the markup. Every time the tree is manipulated, the XML2XML process can be rerun and a new tree formed.

Benefits and Risks

The Bayesian approach allows for explicit modeling of user input and also shows explicitly the independence assumptions that are made. This makes it possible to improve on the model by changing assumptions. An emperical study is required to test the power of the model based on the proposed independency assumptions. The proposed bayesian model makes it possible to improve the model by improving on the assumptions.

Online application servive provider may be possible as business model.

Conclusion

In this document, a proposal is made for attacking the problem of converting documents from commonly used file formats to XML. It is observed that conversion costs may be dramatically decreased by making the process interactive using a Bayesian approach.

Due to lack of data and resources, no emperical data is available to support this claim as yet. Therefore, I am looking for a sponsor to implement the proposed system and perform experiments.

If you are interested or have any comments, please contact Remco Bouckaert.


Last updated: 7 Nov 2000 webmaster