What does it mean to a computer for 2 sets of documents (texts) to be similar?

Methods & their Limitations

Posted by Alfred Prah on April 06, 2023 · 6 mins read
How does a computer know whether 2 sets of text are similar, and how does it compute a score for that similarity (or lack thereof)? For as long as Natural Language Processing (NLP) has been a field of study, the goal has been to help computers understand and process human language. In this article, we will explore a huge piece of this goal: what it means to a computer for 2 sets of text to be similar, and how this similarity can be measured using popular methods like cosine similarity. I'll also briefly introduce other methods, along with the pros & cons of each.

Text Similarity

Measuring similarity between 2 pieces of text can be useful in various contexts, such as building a search engine, sentiment analysis, document summarization, and chatbots.

Let's take an example to better understand this concept via the cosine similarity approach. Suppose we have 2 pieces of text:
    Text A: Alfred is a Data Scientist
    Text B: Alfred is going to school for Data Science

To calculate the cosine similarity between these 2 texts, we first represent them as vectors using a bag-of-words model, which counts the frequency of each word in the text.

Bag-of-words representations:
    Text A: {"Alfred": 1, "is": 1, "a": 1, "Data": 1, "Scientist": 1}
    Text B: {"Alfred": 1, "is": 1, "going": 1, "to": 1, "school": 1, "for": 1, "Data": 1, "Science": 1}

We can then represent each text as a vector over the combined vocabulary of both texts (Alfred, is, a, Data, Scientist, going, to, school, for, Science), using a 0 for words that do not appear. Both vectors must have the same number of dimensions for the comparison to work:
    Text A: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    Text B: [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]

To calculate the cosine similarity (CS) between these 2 vectors, we use the formula:
    CS(A,B) = dot_product(A, B) / (magnitude(A) * magnitude(B))

Applying this formula to our example gives a dot product of 3 (the texts share "Alfred", "is", and "Data"), magnitudes of √5 and √8, and a cosine similarity of 3 / (√5 × √8) ≈ 0.47, which indicates that these 2 texts are moderately similar. Note that "Scientist" and "Science" count as entirely different words here: a bag-of-words model has no notion that they are related.
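The calculation above can be sketched in a few lines of plain Python, counting words in each text and applying the formula over the combined vocabulary (the function name is illustrative):

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using bag-of-words counts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the words the two texts share; words missing
    # from either text contribute 0 and can be skipped.
    dot = sum(a[w] * b[w] for w in a if w in b)
    mag_a = math.sqrt(sum(c * c for c in a.values()))
    mag_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (mag_a * mag_b)

score = cosine_similarity("Alfred is a Data Scientist",
                          "Alfred is going to school for Data Science")
print(round(score, 2))  # 0.47
```

Because every word outside the shared vocabulary contributes 0 to the dot product, we only need to iterate over words that appear in both texts.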

Other popular methods for measuring text similarity

There are other methods for measuring text similarity in NLP. Here are some of them:
  • Jaccard similarity: This method measures similarity based on the number of common words between 2 sets of text. It calculates the size of the intersection divided by the size of the union of the sets.
  • Euclidean distance: This method calculates the straight-line distance between 2 vectors in a high-dimensional space. Unlike cosine similarity, it is a distance measure: smaller values indicate more similar texts.
  • Manhattan distance: This method calculates the distance between 2 vectors by summing the absolute differences of their components.
  • Pearson correlation coefficient: This method measures the linear correlation between 2 sets of data. It calculates the covariance divided by the product of the standard deviations of the 2 sets.
  • Edit distance: This method measures the number of operations required to transform one set of text into another. The operations can be insertion, deletion, or substitution of characters.
  • Deep learning approach: This method uses neural networks to learn the patterns and relationships between words and sentences. This approach has shown promising results in many NLP tasks, such as language translation and sentiment analysis.
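Two of the methods above, Jaccard similarity and edit distance, are simple enough to sketch directly in Python (the function names are illustrative; the edit distance shown is the classic Levenshtein dynamic-programming version):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the word intersection divided by the size of the word union."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b)

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute cs -> ct
        prev = curr
    return prev[-1]

print(jaccard_similarity("Alfred is a Data Scientist",
                         "Alfred is going to school for Data Science"))  # 0.3
print(edit_distance("kitten", "sitting"))  # 3
```

For our running example, the texts share 3 words out of a union of 10, so the Jaccard similarity is 0.3 — notably lower than the cosine score, since Jaccard ignores word frequencies entirely.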

Limitations of these methods

While these methods can be useful for measuring text similarity, they share some limitations worth keeping in mind. Here are some of them:
  • Bag-of-words models don't capture word order: In bag-of-words models, the order of words in a sentence is not taken into account. As a result, 2 sentences with different word order but similar meaning may be considered dissimilar.
  • High-dimensional space: As the number of unique words in the text increases, the dimensionality of the vector space also increases, making it harder to compute similarity accurately.
  • Meaning of words: These methods do not consider the meaning of the words used in the text, which can lead to inaccuracies. For example, the sentences "I love ice cream" and "I hate ice cream" might be considered dissimilar, even though they have opposite meanings.
  • Deep learning models may require large datasets: Deep learning models can achieve state-of-the-art performance in NLP tasks, but they often require large amounts of data to train effectively. This can be a limitation in cases where the amount of available data is limited.
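The "meaning of words" limitation is easy to demonstrate: a bag-of-words cosine score rates the two opposite-meaning ice cream sentences above as quite similar, because they differ by only one word. A minimal sketch (reusing the same bag-of-words approach as before):

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity; blind to word meaning."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a if w in b)
    mag_a = math.sqrt(sum(c * c for c in a.values()))
    mag_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (mag_a * mag_b)

# Opposite meanings, but 3 of 4 words match in each sentence.
print(round(cosine_similarity("I love ice cream", "I hate ice cream"), 2))  # 0.75
```

A score of 0.75 for sentences with opposite sentiment is exactly the kind of inaccuracy that meaning-aware approaches (such as learned word embeddings) try to address.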

    I've personally considered creating synthetic data to increase text corpus sizes and seen promising results, but that naturally requires a lot of intentionality in training and testing to ensure you aren't evaluating models on synthetic data that may be unlikely to "occur" in the real world. Deep learning models may also be computationally expensive. A work-around I've found for this is transfer learning: retraining only the final layer(s) of a pre-trained model with similar functionality, so we can train on our "smaller data" while still benefiting from what the model learned from much larger datasets. However, the models themselves can still be impractical to use on smaller devices or in real-time applications.

Despite these limitations, these methods remain useful for a wide range of applications, and new techniques are being developed to address these issues.
In conclusion, measuring text similarity is an important task in natural language processing, and several methods exist for achieving this goal. The choice of which method to use depends on the specific task and context, and it is important to consider the limitations of each method when applying them.