An Introduction to Unsupervised Topic Segmentation with Implementation in Python

2023-03-21
15 min read

Text Segmentation is a task in Natural Language Processing that aims to divide a text document into semantically or topically coherent sections. This is useful for creating topic-specific summaries, organising long documents into sections for ease of reading, reducing noise, improving information retrieval, and more.

The goal of the task is to identify breakpoints between pairs of sentences where the topic deviates significantly. It is not necessary to know the number or identities of the topics themselves beforehand, and classifying the derived segments into meaningful categories can be done as a later step.

This task was first described in the TextTiling paper as an unsupervised learning task. Although it is possible to build a supervised model with sufficient training data, doing so would require additional effort for data annotation, and the model may overfit to the domain of the training data. Therefore, in this post, we will focus on the unsupervised approach, which generalises better across document types and domains.

This type of semantic segmentation is distinct from sentence boundary disambiguation, as we are not concerned with the syntactic structure of text, but rather the topical changes within it.

This approach can also be applied directly to transcribed speech. Human conversations, however, have other cues that can signal a topic shift, such as pauses or changes in tone and pitch. This technique cannot leverage such cues directly, as it only uses lexical information to identify boundaries, but it could potentially do so with some modifications.

The algorithm

This technique operates by running a sliding window over sentences in the input text and calculating similarity scores between adjacent windows. Post-processing is then performed on these scores to extract boundary candidates to split the document.

The scores can be calculated using a text similarity function that takes two pieces of text as input and returns a score between 0 and 1. This function can be based on cosine similarity between bag-of-words features, topics derived from LDA, Word2Vec, large language models like BERT, and so on.
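As a quick illustration, here is a minimal sketch of one such similarity function, using cosine similarity over bag-of-words counts. This helper is not part of the pipeline below; it just shows the expected interface.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(text_a, text_b):
    # Cosine similarity between bag-of-words count vectors, in the range [0, 1].
    vectors = CountVectorizer().fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0][0]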

In the next sections, we will implement the above algorithm to perform topic segmentation on a news article from the internet.

Preliminary Steps

We will use a Washington Post article as a sample document. I copied the text content and put it into a file called sample.txt where the paragraphs are separated by newlines.

Loading the Text

We can read the file using the following code; texts is a list containing the paragraphs of the document.

with open('input/sample.txt') as f:
	texts = f.read().splitlines()

The paragraphs need to be segmented into sentences, and each sentence has to be tokenised into words to carry out later steps. For both of these tasks, we will use the English Spacy model.

import spacy
nlp = spacy.load('en_core_web_sm')

en_core_web_sm is an English language model for Spacy and can be downloaded via:

python -m spacy download en_core_web_sm

Details of other models are available on the Spacy website.

Splitting into Sentences

Since the period also marks abbreviations, decimals, and more, detecting sentence boundaries in English is not entirely straightforward and is a research topic in its own right. Spacy has a built-in sentence segmenter that does a good enough job most of the time.

The following code will parse the text with the Spacy model and break the paragraphs into sentences.

sents = []
for text in texts:
    doc = nlp(text)
    for sent in doc.sents:
        sents.append(sent)

Next, we will apply the previously described algorithm to group the sentences into semantically related segments. The results can vary based on the type of text similarity function used. We will cover two approaches to similarity computation: one using a topic model and one using a neural language model.

a) Topic Based Segmentation

Latent Dirichlet Allocation (LDA) is a topic model that can learn topics from unannotated text corpora. Under this model, each document can be decomposed into a probability distribution over topics, and each topic in turn is a probability distribution over words.

The original TextTiling paper uses cosine similarity of bag-of-words features as a measure of semantic similarity. TopicTiling extended it by using topic IDs computed by LDA instead of bag-of-words features. This makes sense because boundaries where the topics change significantly are good candidates for topic segmentation.

The only caveat is that LDA requires a training corpus to generate high-quality topics. Since we do not have such a corpus on hand, we will simply use the given document itself to build the topic model.

Tokenisation, Lemmatisation and Filtering

LDA tends to generate topics consisting of only numbers and stop words, due to their higher frequency. To generate high-quality topics, it is necessary to remove these unnecessary tokens in advance.

Here we will tokenise the text; remove punctuation, stop words, and short tokens; and lowercase and lemmatise the remaining tokens. All steps are performed using the same Spacy model as before.

MIN_LENGTH = 3
tokenized_sents = [[token.lemma_.lower() for token in sent 
                    if not token.is_stop and 
                    not token.is_punct and token.text.strip() and 
                    len(token) >= MIN_LENGTH] 
                  for sent in sents]

Building the topic model

Gensim is a useful library for building topic models and other unsupervised NLP models. It provides a simple interface that requires only a few lines of code.

from gensim import corpora, models
N_TOPICS = 5
N_PASSES = 5
dictionary = corpora.Dictionary(tokenized_sents)
bow = [dictionary.doc2bow(sent) for sent in tokenized_sents]
topic_model = models.LdaModel(corpus=bow, id2word=dictionary, 
                              num_topics=N_TOPICS, passes=N_PASSES)

Here dictionary is a mapping from all tokens in the dataset to their unique ids. bow is simply the original corpus converted to a sparse bag-of-words vector representation. In addition to these inputs we also need to specify the number of topics. We set it to 5 as our dataset is very small.
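To make these structures a bit more concrete, we can peek at the encoding of the first sentence; the actual ids and counts will depend on the fitted dictionary.

# Each bag-of-words entry is a (token_id, count) pair.
print(bow[0])
# Map the ids back to their tokens for readability.
print([(dictionary[token_id], count) for token_id, count in bow[0]])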

We can check the generated topics by running:

topic_model.show_topics()

This returns, for each topic, a weighted list of its most probable words:

[(0, '0.050*"court" + 0.026*"judge" + 0.018*"republican" + 0.018*"supreme" + 0.018*"far" + 0.018*"president" + 0.010*"texas" + 0.010*"contender" + 0.010*"opening" + 0.010*"see"'),
 (1, '0.031*"court" + 0.030*"5th" + 0.030*"circuit" + 0.021*"orleans" + 0.021*"new" + 0.012*"judge" + 0.012*"federal" + 0.012*"trump" + 0.012*"president" + 0.011*"biden"'),
 (2, '0.006*"conservative" + 0.005*"circuit" + 0.005*"appeals" + 0.005*"5th" + 0.005*"lean" + 0.005*"u.s." + 0.005*"long" + 0.005*"court" + 0.005*"new" + 0.005*"orleans"'),
 (3, '0.045*"court" + 0.019*"case" + 0.019*"law" + 0.019*"georgetown" + 0.019*"supreme" + 0.019*"federal" + 0.019*"conservative" + 0.019*"judge" + 0.011*"say" + 0.011*"gervasi"'),
 (4, '0.037*"school" + 0.030*"judge" + 0.023*"texas" + 0.023*"law" + 0.016*"controversial" + 0.016*"trump" + 0.016*"conservative" + 0.016*"5th" + 0.016*"pick" + 0.016*"circuit"')]

Each run of LDA training may generate different topics and the overall quality might vary.

Topic Inference

Now we will use the topic model on the same document to infer the topics, ignoring topics with probability less than a small threshold value. This will give us a topic distribution for each sentence.

THRESHOLD = 0.05
doc_topics = list(topic_model.get_document_topics(bow, minimum_probability=THRESHOLD))

doc_topics will look something like this for the first 10 sentences:

[[(3, 0.9553131)],
 [(0, 0.965127)],
 [(1, 0.92673564)],
 [(1, 0.96641266)],
 [(0, 0.95975894)],
 [(0, 0.8991579)],
 [(0, 0.9526328)],
 [(0, 0.9528284)],
 [(4, 0.93837)],
 [(4, 0.9651593)]]

As we can see, each sentence here is assigned a single dominant topic, but multiple topics per sentence are also possible.

Next we select the top 3 topics for each sentence.

k = 3
top_k_topics = [[t[0] for t in sorted(sent_topics, key=lambda x: x[1], reverse=True)][:k] 
                for sent_topics in doc_topics]

At this point, it can be interesting to visualise the topic distributions to see if we can identify potential break points by visual inspection. We can plot a scatterplot over the above result using pyplot:
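A minimal sketch of this plot follows; the exact styling of the original figure is an assumption.

import matplotlib.pyplot as plt

# Scatter each sentence index against the topic ids assigned to it.
x = [i for i, topics in enumerate(top_k_topics) for _ in topics]
y = [topic for topics in top_k_topics for topic in topics]
plt.scatter(x, y)
plt.xlabel('sentence index')
plt.ylabel('topic id')
plt.show()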

Window-based similarity calculation

The next steps are pretty straightforward: we pass a sliding window over the topic assignments above and collect all the topics that appear in each window.

from itertools import islice

def window(seq, n=3):
    # Slide a window of size n over seq, yielding one tuple per position.
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

The function window performs a sliding window operation on the input sequence. We will apply it to the topics extracted in the previous step.

from itertools import chain
WINDOW_SIZE = 3
window_topics = window(top_k_topics, n=WINDOW_SIZE)
window_topics = [list(set(chain.from_iterable(window))) 
                for window in window_topics]

window_topics now looks like the following, where each element contains the topics that appear in a 3-sentence window:

[[0, 1, 3],
 [0, 1],
 [0, 1],
 [0, 1],
 [0],
 [0],
 [0, 4],
 [0, 4],
 [1, 4],
 [1, 3, 4]]

Binarization

We would like to convert the above into binary indicator vectors for the cosine similarity calculation. Sklearn's MultiLabelBinarizer lets us do this easily.

from sklearn.preprocessing import MultiLabelBinarizer
binarizer = MultiLabelBinarizer(classes=range(N_TOPICS))
encoded_topic = binarizer.fit_transform(window_topics)

encoded_topic now contains the following, where a 1 at an index means that the corresponding topic appears at least once in that window. Thus, instead of a bag-of-words, we effectively compute a bag-of-topics.

[[1, 1, 0, 1, 0], 
 [1, 1, 0, 0, 0], 
 [1, 1, 0, 0, 0], 
 [1, 1, 0, 0, 0], 
 [1, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0], 
 [1, 0, 0, 0, 1], 
 [1, 0, 0, 0, 1], 
 [0, 1, 0, 0, 1], 
 [0, 1, 0, 1, 1]]

Coherence score calculation

Using the above result, we simply have to compute the cosine similarity between adjacent pairs of windows. The similarity measures the coherence between two windows of text, and a low value indicates a possible topic shift.

from sklearn.metrics.pairwise import cosine_similarity
coherence_scores = [cosine_similarity([pair[0]], [pair[1]])[0][0] 
                for pair in zip(encoded_topic, encoded_topic[1:])]

We can visualise it as a line plot, where the x-axis represents a boundary index and the y-axis denotes the coherence score between the pair of sliding windows on either side of that boundary.

Computing Depth Scores

In principle, we are interested in the local minima of the above plot. However, we need a way to quantify the depth of these minima in order to differentiate significant topic shifts from minor ones.

We quantify this via a new measure called the depth score, which measures how deep a minimum is relative to the peaks around it. This depth corresponds directly to how strongly the topic changes between two adjacent sentences.

Here, we will use the version of the depth score formula from the TopicTiling paper.
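Matching the get_depths implementation further below, the depth score at a boundary index $p$ can be written as:

$depth(p) = (hl(p) + hr(p) - 2 c_p) / 2$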

To understand the above formula, we need to refer to the coherence score diagram above. $p$ refers to the index of the boundary we wish to calculate the depth for, in this case $p=7$.

$c_p$ is simply the coherence score at point $p$.

$hl$ is a function that takes a point $p$ as input and climbs left from it for as long as the coherence score keeps increasing, returning the score at the peak it reaches. $hr$ does the same, but climbs to the right.

Thus, given a valley whose deepest point is at $p$, $hl(p)$ and $hr(p)$ return the heights of the two edges of that valley, and $c_p$ is the height of its lowest point.

Given a sequence of scores $seq$ and an index $i$, the function climb below returns either $hl(i)$ or $hr(i)$, depending on the mode argument.

def climb(seq, i, mode='left'):
    # Walk left (or right) from index i while the scores keep strictly
    # increasing, and return the highest score reached.
    if mode == 'left':
        while True:
            curr = seq[i]
            if i == 0:
                return curr
            i = i - 1
            if not seq[i] > curr:
                return curr
    if mode == 'right':
        while True:
            curr = seq[i]
            if i == (len(seq) - 1):
                return curr
            i = i + 1
            if not seq[i] > curr:
                return curr

Using the function above, we can apply the depth score formula to calculate the depths for each point in an input sequence. get_depths performs this operation.

import numpy as np

def get_depths(scores):
    depths = []
    for i in range(len(scores)):
        score = scores[i]
        l_peak = climb(scores, i, mode='left')
        r_peak = climb(scores, i, mode='right')
        # Depth score: average height of the two valley edges above this point.
        depth = 0.5 * (l_peak + r_peak - (2 * score))
        depths.append(depth)
    return np.array(depths)

We run it as

depth_scores = get_depths(coherence_scores)

This will return the depth score sequence. We notice that the valleys have been replaced by peaks, and most of the other variation in the sequence has been filtered out. A plot of depth_scores is as follows.

Removing plateaus from depth scores

A problem with the above scores is that some valleys tend to be relatively wide, resulting in multiple points with high depth close to each other, in a plateau shape. This can create multiple segments that are very close to each other.

To avoid this, we sharpen the plateaus by applying a local maxima filter. For each point, we only keep it if it is a local maximum among its adjacent points. The scipy routine argrelmax allows us to do this.

from scipy.signal import argrelmax

def get_local_maxima(depth_scores, order=1):
    maxima_ids = argrelmax(depth_scores, order=order)[0]
    filtered_scores = np.zeros(len(depth_scores))
    filtered_scores[maxima_ids] = depth_scores[maxima_ids]
    return filtered_scores

We run it as

filtered_scores = get_local_maxima(depth_scores, order=1)

A plot of filtered_scores gives:

After applying the above filter, we notice that a lot of noise has been removed and the peaks are sharper, removing ambiguity in selecting segmentation candidates.

Computing threshold

At this point, we could set an arbitrary threshold value and simply apply it to select segmentation points. However, threshold selection is difficult, as the scale of the scores computed so far has no inherent meaning.

The TextTiling paper describes an approach to automatically calculate a suitable threshold using the mean and standard deviation of the depth scores.

$threshold = mean(depths) - std(depths) / 2$

The following function will calculate the threshold automatically using the above equation:

def compute_threshold(scores):
    s = scores[np.nonzero(scores)]
    threshold = np.mean(s) - (np.std(s) / 2)
    # threshold = np.mean(s) - (np.std(s))
    return threshold

and we apply it as

threshold = compute_threshold(filtered_scores)

Running the above function on these scores results in a threshold value of 0.2327.

Selecting Segments

We can get the segment ids by using the above threshold on the filtered depth scores.

def get_threshold_segments(scores, threshold=0.1):
    segment_ids = np.where(scores >= threshold)[0]
    return segment_ids

The above function is applied to get the segment ids:

segment_ids = get_threshold_segments(filtered_scores, threshold)

The results look like this:

array([3, 5, 7])

We can also plot the threshold and selected segments. The yellow line represents the threshold and the red dots represent the selected segments with highest topic shift.
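A minimal pyplot sketch of such a figure; the colours follow the description above, and other styling choices are assumptions.

import matplotlib.pyplot as plt

# Filtered depth scores with the automatic threshold and the selected boundaries.
plt.plot(filtered_scores)
plt.axhline(y=threshold, color='y')
plt.scatter(segment_ids, filtered_scores[segment_ids], color='r')
plt.xlabel('boundary index')
plt.ylabel('depth score')
plt.show()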

Applying to text

Since the above ids represent gaps between adjacent windows, we need to translate them back to sentence gap ids. For this, we add the WINDOW_SIZE parameter we used previously.

segment_indices = segment_ids + WINDOW_SIZE

segment_indices now contains

array([ 6, 8, 10])

Finally, we will convert these ids to index slices to apply to the original sentences

segment_indices = [0] + segment_indices.tolist() + [len(sents)]
slices = list(zip(segment_indices[:-1], segment_indices[1:]))

This returns

[(0, 6), (6, 8), (8, 10), (10, 19)]

Now we can simply apply these slices to our original document

segmented = [sents[s[0]: s[1]] for s in slices]

The results can be seen in this text file. The SEGMENT tag represents the breakpoints.
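A minimal sketch of how such a tagged output could be written; the output path and the exact tag format are assumptions and may differ from the linked file.

# Write each segment's sentences followed by a SEGMENT marker line.
with open('output/segmented.txt', 'w') as f:
    for segment in segmented:
        f.write(' '.join(sent.text for sent in segment) + '\n')
        f.write('=== SEGMENT ===\n')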

Analysis

The text is broken into 4 segments based on the topic model. Upon manual inspection, we notice that the segments do somewhat correspond to topic shifts.

  • The first segment gives an overview of the story.
  • The second segment talks mainly about the new judges.
  • The third segment details a quote by one of the judges.
  • The fourth segment covers the remaining story.

Higher quality segments could be extracted via this method by training a topic model on a large news corpus with a higher number of topics.

b) LLM Based Segmentation

In the previous approach we used a topic model to identify similarities between text segments. A limitation of this approach is that a topic model has to be trained on a domain-specific corpus beforehand. Even then, the quality of the generated topics may vary on each run and the model may not work well on texts from different domains.

Large Language Models (LLMs) like BERT can provide an alternative approach to text similarity computation. These are generally pre-trained on a large text corpus covering various domains and can thus be used out-of-the-box for various tasks.

We can create a variation of topic-tiling where we use an LLM for the similarity computation part while keeping everything else the same.

Initialising the model

The SentenceTransformer class from the sentence-transformers library provides a simple interface for accessing various large language models and applying them directly to sentence units.

from sentence_transformers import SentenceTransformer
MODEL_STR = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_STR)

The model all-MiniLM-L6-v2 is a variation of Microsoft's MiniLM model, further fine-tuned on a large sentence-pairs dataset. According to the sentence-transformers documentation, it is suitable for semantic search and is intended to be used with sentences and short paragraphs. This seems like a good match for our task.

Applying the Sliding Window

Since there is no model training step, we can proceed directly to sliding window computation. Since MiniLM uses its own tokeniser, we can also skip the tokenisation and preprocessing steps.

In the following code we apply the window function directly to the raw sentences.

WINDOW_SIZE = 3
window_sent = list(window(sents, WINDOW_SIZE))
window_sent = [' '.join([sent.text for sent in window]) for window in window_sent]

Each element of window_sent now contains WINDOW_SIZE consecutive sentences joined into a single string.

Model Inference

We can encode the windowed sentences directly using MiniLM

encoded_sent = model.encode(window_sent)

encoded_sent is now an array of shape (17, 384), where 17 is the number of windows and 384 is the dimensionality of the language model's embedding space.

Coherence Score Computation

We can directly use cosine similarity on the encoded sentences to compute the coherence scores.

coherence_scores = [cosine_similarity([pair[0]], [pair[1]])[0][0] for pair in zip(encoded_sent, encoded_sent[1:])]

Depth Score Computation

Now we can compute the depth scores

depth_scores = get_depths(coherence_scores)

Filtering the Depth Scores

The filtering step is also the same as before

filtered_scores = get_local_maxima(depth_scores, order=1)

Thresholding Step

Finally, we can compute the threshold and generate segments.

threshold = compute_threshold(filtered_scores)
segments_ids = get_threshold_segments(filtered_scores, threshold)

This time we get a threshold value of 0.1127.

We can generate a plot summarising the thresholding step.

Segmentation Output

We can repeat the previous steps to generate the segmented sentences. The final output can be seen in this file. There are some minor differences from the topic segmentation result: the first and third segments are pretty generic, but the second segment seems to focus on details about a certain court building in New Orleans.

Model Evaluation

So far, we haven't covered the numerical evaluation of the techniques described in this article. An easy way to evaluate topic segmentation is to concatenate two snippets from different articles and then segment the result using one of the previous approaches.

The algorithm can then be evaluated on its ability to identify the point where one article changes into the other, as sketched below. More rigorous evaluation would, however, require hand-annotated data.
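A minimal sketch of this synthetic evaluation, assuming hypothetical load_sentences and segment_sentences helpers that wrap the sentence splitting and segmentation pipeline from this post:

# load_sentences and segment_sentences are hypothetical wrappers around the
# pipeline above; segment_sentences returns the predicted boundary indices.
sents_a = load_sentences('input/article_a.txt')
sents_b = load_sentences('input/article_b.txt')
combined = sents_a + sents_b
true_boundary = len(sents_a)

predicted = segment_sentences(combined)

# Count a hit if any predicted boundary falls within a small tolerance of the join.
TOLERANCE = 2
hit = any(abs(p - true_boundary) <= TOLERANCE for p in predicted)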

This article gives a good overview of some evaluation metrics for text segmentation as well as datasets in the research literature.

Conclusion

In this article I described an approach to text segmentation using topic models and LLMs. However, any other similarity function is also applicable, as long as it returns a similarity value between pairs of texts.

In a later article I will cover hands-on evaluation of topic segmentation in detail.

Code and Interactive Demo

A notebook containing the code in this post can be accessed here.

An interactive demo built using Streamlit is hosted here.

References

  1. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages
  2. TopicTiling: A Text Segmentation Algorithm based on LDA
  3. Text Segmentation - Approaches, Datasets, and Evaluation Metrics