TopSBM: Topic Models based on Stochastic Block Models
Installation and usage instructions
To run it yourself, you need to:
- Install graph-tool.
- Download the SBM topic model package:
  - using git:
    git clone https://github.com/martingerlach/hSBM_Topicmodel.git
  - or download the zip file.
- Explore the tutorials using Jupyter notebooks.
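Before opening the notebooks, it can help to confirm that graph-tool is visible to the Python interpreter you plan to use. The snippet below is a minimal sketch, not part of the package: `has_module` is a hypothetical helper, and the only fact it relies on is that graph-tool is imported under the module name `graph_tool`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be located.

    The module is only looked up, not imported, so the check is cheap.
    """
    return importlib.util.find_spec(name) is not None


# graph-tool is imported in Python as `graph_tool`.
status = "found" if has_module("graph_tool") else "missing"
print(f"graph_tool: {status}")
```

If the module is reported as missing, the notebooks are probably running in a different environment from the one where graph-tool was installed.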
Explanation: Topic modeling with text data
Topic models are a popular way to extract information from text data,
but their most popular flavours (based on Dirichlet priors, such as LDA)
make unreasonable assumptions about the data, which severely limits
their applicability. Here we explore an alternative way of doing topic
modelling, based on stochastic block models (SBM), thus exploiting a
mathematical connection with finding community structure in networks.
To briefly illustrate some of the limitations of Dirichlet-based topic
modelling, consider the simple multi-modal mixture of three topics shown
below on the left. Since the Dirichlet distribution is unimodal, it
severely distorts the topics inferred by LDA, as shown in the middle,
even though it is just a prior distribution over a heterogeneous topic
mixture. The SBM formulation, on the other hand, can easily accommodate
this kind of heterogeneity, since it is based on more general priors
(see the references below).
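The contrast between a unimodal prior and a multi-modal topic mixture can be checked numerically. The following is a minimal sketch under stated assumptions: the concentration parameters are illustrative values chosen here (all greater than 1, where the Dirichlet density has a single interior mode), the three-component mixture is a stand-in for a heterogeneous corpus, and none of the numbers are taken from the figure or from the paper.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log-density of a Dirichlet(alpha) distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def mixture_pdf(x, components):
    """Density of an equal-weight mixture of Dirichlet components."""
    total = sum(math.exp(dirichlet_logpdf(x, alpha)) for alpha in components)
    return total / len(components)


# Points on the 3-topic simplex: the centre, and one point near each corner.
center = (1 / 3, 1 / 3, 1 / 3)
corners = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.1, 0.1, 0.8)]

# Illustrative parameters (assumptions of this sketch).
single = (2.0, 2.0, 2.0)
multimodal = [(16.0, 2.0, 2.0), (2.0, 16.0, 2.0), (2.0, 2.0, 16.0)]

single_center = math.exp(dirichlet_logpdf(center, single))
single_corners = [math.exp(dirichlet_logpdf(p, single)) for p in corners]
mix_center = mixture_pdf(center, multimodal)
mix_corners = [mixture_pdf(p, multimodal) for p in corners]

# A single Dirichlet puts more density at the centre than near any corner,
# while the mixture puts more density near every corner than at the centre.
print("single Dirichlet peaks at centre:", single_center > max(single_corners))
print("mixture peaks near the corners:  ", min(mix_corners) > mix_center)
```

A single Dirichlet with these parameters cannot place separate peaks near all three corners at once, which is the kind of mismatch the paragraph above describes.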
In addition to this, the SBM method is based on a nonparametric
“symmetric” formulation that allows for the simultaneous hierarchical
clustering of documents as well as words. Due to its nonparametric
Bayesian nature, the number of topics in each category, as well as the
shape and depth of the hierarchy, are automatically determined from the
posterior distribution according to the statistical evidence available,
avoiding both overfitting and underfitting.
To illustrate the application of the method using real data, we show
below an example using Wikipedia articles.
Example: 63 Wikipedia articles related to Physics
The method divides both the documents and words into hierarchical
groups. The divisions found in the first three hierarchical levels can
be inspected below.
Topic modeling beyond just text data
In many cases, we have additional information available about the documents, such as metadata or hyperlinks.
We can incorporate these different types of data into a unified topic model by extending the above approach using multilayer stochastic block models.
In this approach, each type of data is represented as a different layer in the network.
We can represent the information contained in the Wikipedia dataset with the following three layers:
- Text: The bipartite network with words and documents as nodes, where each edge records the number of times a word appears in a document (same as above).
- Hyperlinks: The network of documents, where a directed edge corresponds to a hyperlink from one document to another.
- Metadata: The bipartite network of categories and documents, where an edge indicates that a category contains a given document.
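The three layers above can be sketched as plain edge lists before any model is fitted. The example below is a minimal illustration, not the package's own data format: the toy documents, links and categories are invented for this sketch, and the helper names (`text_layer`, `hyperlink_layer`, `metadata_layer`) are hypothetical.

```python
from collections import Counter

# Hypothetical toy data standing in for the Wikipedia dataset.
docs = {
    "Quantum mechanics": "wave particle energy wave",
    "Thermodynamics": "energy heat entropy",
    "Optics": "wave light lens",
}
links = [("Quantum mechanics", "Optics"), ("Thermodynamics", "Quantum mechanics")]
categories = {
    "Modern physics": ["Quantum mechanics"],
    "Classical physics": ["Thermodynamics", "Optics"],
}


def text_layer(docs):
    """Bipartite document-word edges, weighted by word counts."""
    edges = Counter()
    for title, text in docs.items():
        for word in text.split():
            edges[(title, word)] += 1
    return edges


def hyperlink_layer(links, docs):
    """Directed document-to-document edges, restricted to known documents."""
    return [(src, dst) for src, dst in links if src in docs and dst in docs]


def metadata_layer(categories, docs):
    """Bipartite category-document edges."""
    return [
        (category, title)
        for category, titles in categories.items()
        for title in titles
        if title in docs
    ]


layers = {
    "text": text_layer(docs),
    "hyperlinks": hyperlink_layer(links, docs),
    "metadata": metadata_layer(categories, docs),
}

for name, edges in layers.items():
    print(name, len(edges), "edges")
```

Documents appear as nodes in every layer, which is what lets a multilayer model cluster them using all three sources of evidence at once. In a real pipeline, word, document and category nodes would need distinct identifiers so that a word and a category with the same spelling are not merged.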
Using this representation, we can obtain a hierarchical clustering of words (i.e. topics) and documents based not only on the text but also on the link and metadata information.
We illustrate the differences in the document clustering when taking into account different types of data.
References:
- M. Gerlach, T. P. Peixoto, and E. G. Altmann, A network approach to topic models, Science Advances 4, eaaq1360 (2018) or [arXiv:1708.01677].
- T. P. Peixoto, Bayesian stochastic blockmodeling, in Advances in Network Clustering and Blockmodeling (2019) or [arXiv:1705.10225].
- C. C. Hyland, Y. Tao, L. Azizi, M. Gerlach, T. P. Peixoto, and E. G. Altmann, Multilayer Networks for Text Analysis with Multiple Data Types, EPJ Data Science 10, 33 (2021).
- H. Chan, TopSBM, v1.0, Australian Text Analytics Platform, Software (2024).