A basic tutorial.
%load_ext autoreload
%autoreload 2
import os
import pylab as plt
%matplotlib inline
import graph_tool.all as gt
from sbmtm import sbmtm
1) A list of documents, where each document is a list of words.
2) A list of document titles (optional).
The example corpus consists of 63 articles from Wikipedia taken from 3 different categories (Experimental Physics, Chemical Physics, and Computational Biology).
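As a minimal illustration of the input format described above, the sketch below builds both lists from a few made-up lines rather than from the Wikipedia corpus. The sample strings and the names `raw_lines`, `toy_texts`, and `toy_titles` are invented for this example.

```python
# Hypothetical toy data: one document per line, words separated by whitespace.
raw_lines = [
    "laser beam optics laser",
    "protein folding simulation protein",
]

# texts: a list of documents, each document being a list of word tokens
toy_texts = [line.split() for line in raw_lines]

# titles: one label per document, in the same order as the texts
toy_titles = ["Doc_A", "Doc_B"]

# Every document needs exactly one title.
assert len(toy_texts) == len(toy_titles)

print(toy_texts[0])
```

The files `corpus.txt` and `titles.txt` loaded below follow the same layout: one document (or one title) per line.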
path_data = ''
## texts
fname_data = 'corpus.txt'
filename = os.path.join(path_data,fname_data)
with open(filename,'r') as f:
    x = f.readlines()
texts = [h.split() for h in x]
## titles
fname_data = 'titles.txt'
filename = os.path.join(path_data,fname_data)
with open(filename,'r') as f:
    x = f.readlines()
titles = [h.split()[0] for h in x]
i_doc = 0
print(titles[i_doc])
print(texts[i_doc][:10])
## we create an instance of the sbmtm-class
model = sbmtm()
## we have to create the word-document network from the corpus
model.make_graph(texts,documents=titles)
## we can also skip the previous step by saving/loading a graph
# model.save_graph(filename = 'graph.xml.gz')
# model.load_graph(filename = 'graph.xml.gz')
## fit the model
gt.seed_rng(32) ## seed for graph-tool's random number generator --> same results
model.fit()
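To make the idea of a word-document network concrete, here is a rough sketch that counts how often each word occurs in each document. It assumes that the network has one edge per word occurrence, so each count is the multiplicity of the edge between a document node and a word node. The helper `word_document_edges` is hypothetical and is not part of `sbmtm`, which builds a graph-tool graph instead.

```python
from collections import Counter

def word_document_edges(texts, titles):
    """Count (title, word) pairs; each count is treated as an edge multiplicity."""
    edges = Counter()
    for title, words in zip(titles, texts):
        for word in words:
            edges[(title, word)] += 1
    return edges

# Toy input, invented for illustration.
toy_edges = word_document_edges(
    [["laser", "beam", "laser"], ["protein", "beam"]],
    ["Doc_A", "Doc_B"],
)

# "laser" occurs twice in Doc_A, so that edge has multiplicity 2.
print(toy_edges[("Doc_A", "laser")])
```

Edges only ever connect a document to a word, never two documents or two words, which is what makes the network bipartite.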
The output shows the (hierarchical) community structure in the word-document network as inferred by the stochastic block model:
The result is a grouping of nodes into groups on multiple levels in the hierarchy:
model.plot(filename='tmp.png',nedges=1000)
For each word group on a given level of the hierarchy, we retrieve its $n$ most common words. These word groups are the topics!
model.topics(l=1,n=20)
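The idea of listing the most common words per group can be sketched in plain Python. The example assumes that a word-to-group assignment is already known (here a hand-written dictionary) and ranks words by raw counts. The function `top_words_per_group` is a hypothetical helper, and `sbmtm` may rank or normalize words differently.

```python
from collections import Counter, defaultdict

def top_words_per_group(texts, word_group, n):
    """Return {group: [(word, count), ...]} with the n most frequent words per group."""
    counts = defaultdict(Counter)
    for words in texts:
        for word in words:
            counts[word_group[word]][word] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}

# Toy corpus and a made-up assignment of words to two groups.
example_texts = [["laser", "beam", "laser"], ["protein", "cell", "protein", "protein"]]
example_word_group = {"laser": 0, "beam": 0, "protein": 1, "cell": 1}

example_topics = top_words_per_group(example_texts, example_word_group, n=1)
print(example_topics)
```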
Which topics contribute to each document?
## select a document (by its index)
i_doc = 0
print(model.documents[i_doc])
## get a list of tuples (topic-index, probability)
model.topicdist(i_doc,l=1)
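One simple way to think about such a list of (topic-index, probability) tuples is as the fraction of a document's word tokens that fall into each word group. The sketch below computes exactly that under this assumption. The function `topic_distribution` and the word-to-group dictionary are invented for illustration and do not reproduce the internals of `model.topicdist`.

```python
from collections import Counter

def topic_distribution(words, word_group):
    """Return [(group, fraction of tokens in that group), ...] sorted by group index."""
    counts = Counter(word_group[word] for word in words)
    total = sum(counts.values())
    return [(group, count / total) for group, count in sorted(counts.items())]

# Made-up assignment of words to two topics.
demo_word_group = {"laser": 0, "beam": 0, "protein": 1}

# Three of the four tokens belong to topic 0, one belongs to topic 1.
demo_dist = topic_distribution(["laser", "beam", "protein", "laser"], demo_word_group)
print(demo_dist)
```

By construction, the fractions for a non-empty document sum to 1.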
The stochastic block model clusters the documents into groups. We do not need to run an additional clustering to obtain this grouping.
model.clusters(l=1,n=5)
Application -- Finding similar articles:
For a query article, we return all articles from the same group.
## select a document (index)
i_doc = 2
print(i_doc,model.documents[i_doc])
## find all articles from the same group
## print: (doc-index, doc-title)
model.clusters_query(i_doc,l=1)
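The lookup behind this kind of query is straightforward once every document has a group label. The sketch below assumes a plain list of group labels, one per document, and returns (doc-index, doc-title) pairs. The function `same_group` is hypothetical, and leaving the query document out of the result is a choice made here, not a statement about how `clusters_query` behaves.

```python
def same_group(doc_index, doc_groups, titles):
    """Return (index, title) for all other documents sharing the query's group."""
    target = doc_groups[doc_index]
    return [
        (i, titles[i])
        for i, group in enumerate(doc_groups)
        if group == target and i != doc_index
    ]

# Made-up group labels for four documents.
query_groups = [0, 1, 0, 0]
query_titles = ["Doc_A", "Doc_B", "Doc_C", "Doc_D"]

# Documents 2 and 3 share group 0 with the query document 0.
print(same_group(0, query_groups, query_titles))
```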
In the stochastic block model, word nodes and document nodes are clustered into separate sets of groups.
The group membership can be represented by the conditional probability $P(\text{group}\, |\, \text{node})$. Since words and documents belong to different groups (the word-document network is bipartite), we can show the two memberships separately:
p_td_d,p_tw_w = model.group_membership(l=1)
plt.figure(figsize=(15,4))
plt.subplot(121)
plt.imshow(p_td_d,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Document group membership $P(bd | d)$')
plt.xlabel('Document d (index)')
plt.ylabel('Document group, bd')
plt.colorbar()
plt.subplot(122)
plt.imshow(p_tw_w,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Word group membership $P(bw | w)$')
plt.xlabel('Word w (index)')
plt.ylabel('Word group, bw')
plt.colorbar()
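A conditional probability $P(\text{group}\, |\, \text{node})$ of this kind can be obtained by normalizing a table of counts over the groups for each node. The sketch below assumes a nested list `counts[b][v]` with groups as rows and nodes as columns, matching the orientation of the plots above, and assumes every node has at least one count. The function `membership_probabilities` and the numbers are invented for illustration; this is not the implementation of `model.group_membership`.

```python
def membership_probabilities(counts):
    """Normalize counts[b][v] over groups b so that each node column sums to 1."""
    n_groups = len(counts)
    n_nodes = len(counts[0])
    column_sums = [
        sum(counts[b][v] for b in range(n_groups)) for v in range(n_nodes)
    ]
    return [
        [counts[b][v] / column_sums[v] for v in range(n_nodes)]
        for b in range(n_groups)
    ]

# Two groups (rows) and three nodes (columns), with made-up counts.
toy_counts = [
    [2, 0, 1],
    [0, 4, 1],
]

toy_membership = membership_probabilities(toy_counts)
print(toy_membership)
```

A node assigned to a single group gives a column containing one 1 and zeros elsewhere, which appears as a single bright cell per column in an `imshow` plot.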