Topic maps and graphical structures

Content:

What is a topic map?

The topic map standard is an ISO standard (see this document for details) that aims at increasing navigatability of the internet. A set of resources can be found at www.topicmap.com. In particular, the TAO of Topic Maps is a good introduction.

In principle, a topic map consists of a set of

There are also some

Applications of topic maps include

Topic maps and probabilities

Topic maps offer a graphical structure over the topics with associations as edges. However, the current standard does not provide ways to quantify the strength of the links. So it looks only natural to somehow add numbers to the links. Furthermore, interpreting the numbers as probabilities gives directly a wide range of techniques and theory available to interpret, learn and manipulate the numbers.

The standard is in development. This implies that there may be a possibility to influence the standard by introducing probabilities in the framework. This raises the following questions:

Why not start from scratch in developing a graphical model based knowledge representation for Internet navigation?

A lot of thought have been put in developing Topic maps, both from a theoretical as well as from a practical perspective. Though most of the theoretical background stems from a logical perspective, it would at least be interesting to investigate whether the topic map formalism can be extended and improved by introducting probabilities.

Another reason is the fact that because topic maps form an ISO standard it can be expected that in the near future tools will be developed for developing topic maps and navigating them. If a simple and powerfull probabilistic extention can be developed, the acceptance of probabilistic topic maps in the wider community may have a large head start over a freshly developed one.

How to extend topic maps with probabilities?

An obvious and simple extention is by extending associations with frequencies of occurance. For example, given the topics 'egg' and 'yolk', three usefull frequencies may be considered:

With these frequencies, the probability of the topic 'yolk' can be calculated, given that 'egg' is the topic of interest, and vice versa.

How topic maps with probabilities better than without?

A sound theoretical probabilistic foundation can be formulated. Such a foundation makes simplification assumptions explicit, which may explain the behavior of topic map applications and furthermore provides a tool for improving the topic map formalism.

Applications can exploit the strength of the relation between topics through associations for example by showing the distance of topics in a 'brainmap' inverse proportional to the probability of related topics.

Queries on topic maps can exploit the strength of the relation between topics by ranking documents. Ranking is more difficult of associations and occurrances are not quantified.

Open problems

How to obtain a list of topics?

Preferably, the list of topics in the topic map is not generated manually. One way to obtain the list is to get an existing one. The alternative is to generated the list automatically from a set of documents. This is highly related to a research area in information retrieval concentrating on automatic index generation and concept extraction. (See e.g. this article.)

How to obtain the associations?

Discovering associations between topics can be considered a problem of learning a structure of a graphical model. On the other hand, there is a large body of resarch devoted to the automatic creation of thesauri, which seems to be highly related to this problem. Therefore, it seems a good idea to learn the lessons from this field before trying to apply the techniques from the graphical model learning community. (See links on IR at the end of this page for a starting point.)

How to interpret frequencies as probabilities?

In the proposal above, frequencies can be interpreted as probabilities. Are there alternative representations possible? An alternative of the above is to store frequencies of occurance of topics for each topic (e.g. frequency of occurrance of 'egg') and frequencies of occurance of associations with each association (e.g. frequency that egg and yolk occur together in documents). This still allows to calculat the same probabilities (e.g. because the "frequency that egg but not yolk occur in documents" = "frequency of occurrance of 'egg'" - "frequency that egg and yolk occur together in documents"). Are there more representations possible? What are the pro's and con's of these alternatives? What can be said about obtaining frequencies from expert opinions? Is integer representation sufficient or can a finer representation be justified?

How to interpret frequencies as conditional independency statements?

When the previous question has been answered, what can be said about the conditional independence structure of the resulting network? Can the conditional independency relation be exploited for efficient propagation of probabilities through the network?

How to propagate probabilities?

Which assumptions have to be made to make propagation possible? In the proposal above, only second order interactions (in the terminology of contingency table analysis) are considered. Which assumption will be violated in practice? Can this be fixed without adding too much to the computational effort required to do the propagation?

Tamas Rudas suggested interpreting the frequencies on associations as conditions on marginals over two variables and using iterative proportional fitting as a mechanism for propagation. How would this work out in practice?

Given that a network may contain a huge number of nodes (over 1.000.000) can assumptions be formulated about the impact of propagation on only a small part of the network? Can the error introduced by such assumptions be estimated?

Combinator operator for topic maps?

Since no single engine can cover the whole Internet, it is a sensible assumption that topic maps from different sources may need to be combined to get a more powerfull mechanism for navigation. What are reasonable assumptions under which to combine topic maps with overlapping domains? In the frequency proposal above, combining associations and adding frequencies of joint associations may be such a combination operator.

From the contingency table literature, it is known that there are cases where the combination of three constraints (e.g., one marginal constraing over AB, one over AC and one over BC) cannot be represented by any joint distribution over the domain (i.e, no P(ABC) would exist satisfying the conditions). Which assumptions are reasonable in such situations in order still to be able to combine information of both networks?

A techique suggested in the thesaurus merging literature suggest to link toppics that have differente terms but many common 'children' and 'parents' (where children and parents should be interpreted in the context of a thesaurus). For example, two different toppics could be considered to be the same if

#common neighbours
------------------------------------------------------------> threshhold
sqrt(#neighbours term 1 * #neighbours term 2)

(Chapter 14 of FORSYTH, R., and R. RADA. 1986. Machine Learning -- Applications in Expert Systems and Information Retrieval. West Sussex, England: Ellis Horwood Series in Artificial Intelligence). How can this technique be exploited for combining toppic maps? Are there other techniques from thesaurus merging that can be exploited?

How to benchmark a topic map?

Commonly used measures for the quality of an information retrieval system are Precision and Recall. How to measure the effectiveness of topic maps with respect to other information retrieval formalisms? How to measure the effectiveness of topic maps with and without probabilistic interpretation?

Benchmark databases:

(See also Mira 99: workshop on Evaluating interactive information retrieval and an Online IR course. )

Topic map advanced features

In the previous, only topics, associations and occurrances were mentioned, but there are also more advanced features for providing context (e.g., Tweetie is a bird implies that tweetie can fly in a general context, but in a non-monotone logic context more often than not Tweetie is a flightless bird.) Do these context feature occur naturally in a probabilistic context? If not, what would the impact and interpretation be when adding frequencies to contexts? Which other formalism / interpretations are possible and useful?


IR links

Information Retrieval, 2nd Edition [textbook] C.J. van Rijsbergen, 1979

list of IR resources

IR resources

WebKB web mining project


Last updated: 4 Nov 2000 webmaster