Abstract

Most text classification methods treat each document as an independent instance. However, in many text domains, documents are linked and the topics of linked documents are correlated. For example, web pages on related topics are often connected by hyperlinks, and scientific papers from related fields are commonly linked by citations. We propose a unified probabilistic model for both the textual content and the link structure of a document collection. Our model is based on the recently introduced framework of Probabilistic Relational Models (PRMs), which allows us to capture correlations between linked documents. We show how to learn these models from data and use them efficiently for classification. Since exact methods for classification in these large models are intractable, we use belief propagation, an approximate inference algorithm. Belief propagation automatically induces a very natural behavior, in which our knowledge about one document helps us classify related ones, which in turn help us classify others. We present preliminary empirical results on a dataset of university web pages.
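To make the collective behavior described above concrete, the following is a minimal sketch, not the paper's PRM or its learned potentials: a pairwise Markov network over documents in which each document carries a local topic distribution (e.g. from a text-only classifier) and each link carries a compatibility matrix favoring correlated topics. The names `phi`, `psi`, and `loopy_bp`, and all numbers, are hypothetical and chosen only to illustrate how belief propagation lets confidently classified documents pull ambiguous linked documents toward the right topic.

```python
import numpy as np

# Illustrative sketch only: a pairwise Markov network over documents.
# phi[i] is document i's local evidence over topics; psi is a link
# compatibility matrix favoring agreement between linked documents.

def loopy_bp(phi, edges, psi, iters=50):
    """Sum-product loopy belief propagation; returns per-document beliefs."""
    n, k = phi.shape
    # messages m[(i, j)]: message from document i to linked document j
    msgs = {(i, j): np.ones(k) / k for i, j in edges}
    msgs.update({(j, i): np.ones(k) / k for i, j in edges})
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # combine i's local evidence with messages from i's other neighbors
            prod = phi[i].copy()
            for nb in neighbors[i]:
                if nb != j:
                    prod *= msgs[(nb, i)]
            m = psi.T @ prod            # propagate through the link potential
            new[(i, j)] = m / m.sum()   # normalize for numerical stability
        msgs = new
    beliefs = phi.copy()
    for i in range(n):
        for nb in neighbors[i]:
            beliefs[i] *= msgs[(nb, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Toy example (hypothetical numbers): 3 documents, 2 topics. Document 2 is
# ambiguous from its text alone, but its links to confidently classified
# documents pull it toward topic 0.
phi = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.5, 0.5]])
edges = [(0, 2), (1, 2)]
psi = np.array([[0.8, 0.2],    # linked documents tend to share topics
                [0.2, 0.8]])
print(loopy_bp(phi, edges, psi).round(3))
```

Running the toy example shows document 2's belief shifting toward topic 0, which is the chain of influence the abstract describes: knowledge about one document helps classify its neighbors, which in turn help classify others.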