Augmenting Lexicons Automatically: Clustering Semantically Related Adjectives


Abstract:

Our work focuses on identifying various types of lexical data in large corpora through statistical analysis. In this paper, we present a method for grouping adjectives according to their meaning, as a step towards the automatic identification of adjectival scales. We describe how our system exploits two sources of linguistic knowledge in a corpus to compute a measure of similarity between two adjectives, using statistical techniques and a clustering algorithm for grouping. We evaluate the significance of the results produced by our system for a sample set of adjectives.

(K. McKeown, V. Hatzivassiloglou)

https://dl.acm.org/doi/10.3115/1075671.1075732

https://dl.acm.org/doi/pdf/10.3115/1075671.1075732

https://www.semanticscholar.org/paper/Augmenting-Lexicons-Automatically%3A-Clustering-McKeown-Hatzivassiloglou/162078c7170d5f6a3a2805439a5529e1a636f12e

Extracts

This paper is about teaching computers to understand words, especially adjectives (describing words like "big," "small," "happy," or "sad"). The authors want to group adjectives that have similar meanings, like putting "hot" and "warm" in one group and "cold" and "cool" in another. This helps computers learn how words are related and can be used to figure out scales, like how "hot" is stronger than "warm."

To do this, the authors use two main things:

  1. Linguistic knowledge: They use rules and patterns about how words work in sentences.

  2. Statistical techniques: They look at how often words appear together in a big collection of texts (called a corpus) to figure out which words are similar.

They also use a special computer program (a clustering algorithm) to group the adjectives based on their similarities. Finally, they test their system on a set of adjectives to see if it works well.

In simple terms, this paper is about teaching computers to group describing words by their meanings, so they can better understand how we use language.

Exercise

Python Programming Exercise: Grouping Adjectives by Similarity

Objective:

Use statistical techniques and a clustering algorithm to group adjectives based on their similarity in a corpus.


Steps:

  1. Preprocess the Corpus: Tokenize the text and extract adjectives.

  2. Compute Word Similarity: Use co-occurrence statistics or word embeddings to measure similarity between adjectives (an embedding-based sketch follows this list).

  3. Cluster Adjectives: Apply a clustering algorithm (e.g., K-Means) to group similar adjectives.

  4. Evaluate Results: Visualize or analyze the clusters to see if they make sense.
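
For the embedding-based option mentioned in step 2, one possibility is gensim's Word2Vec. This is a minimal sketch of the API, not the paper's method (the paper predates word embeddings and computes similarities from corpus statistics), and a corpus this small will not produce meaningful vectors:

import nltk
from gensim.models import Word2Vec

nltk.download('punkt')  # run once

# Toy two-sentence corpus, tokenized and lowercased
sentences = [
    nltk.word_tokenize(s.lower())
    for s in ["The hot coffee warmed her cold hands.",
              "The fast car overtook the slow truck."]
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)
print(model.wv.similarity('hot', 'cold'))  # cosine similarity of the two word vectors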


Code Implementation:


import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Download NLTK resources (run once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# Sample corpus
corpus = [
    "The big dog barked loudly at the small cat.",
    "The happy child played with a sad puppy.",
    "A bright sun shone over the dark clouds.",
    "The hot coffee warmed her cold hands.",
    "The fast car overtook the slow truck.",
    "The loud music drowned out the quiet whispers.",
    "The tall man stood next to the short woman.",
    "The strong wind blew away the weak branches."
]

# Step 1: Preprocess the corpus and extract adjectives
def extract_adjectives(corpus):
    adjectives = set()
    stop_words = set(stopwords.words('english'))
    for sentence in corpus:
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)
        for word, pos in pos_tags:
            if pos.startswith('JJ') and word.lower() not in stop_words:  # JJ, JJR, JJS = adjective tags
                adjectives.add(word.lower())
    return list(adjectives)

adjectives = extract_adjectives(corpus)
print("Adjectives in the corpus:", adjectives)

# Step 2: Compute similarity between adjectives using co-occurrence
def compute_cooccurrence_matrix(corpus, adjectives):
    vectorizer = CountVectorizer(vocabulary=adjectives, binary=True)
    X = vectorizer.fit_transform(corpus)
    cooccurrence_matrix = (X.T @ X).toarray()  # Symmetric count matrix: entry (i, j) = sentences containing both adjectives
    return cooccurrence_matrix

cooccurrence_matrix = compute_cooccurrence_matrix(corpus, adjectives)
print("Co-occurrence Matrix:\n", cooccurrence_matrix)

# Step 3: Cluster adjectives using K-Means
def cluster_adjectives(cooccurrence_matrix, adjectives, n_clusters=3):
    # n_init=10 keeps results reproducible across scikit-learn versions
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(cooccurrence_matrix)
    return clusters

n_clusters = 3  # Number of clusters (e.g., groups of similar adjectives)
clusters = cluster_adjectives(cooccurrence_matrix, adjectives, n_clusters)

# Step 4: Visualize the clusters
def visualize_clusters(cooccurrence_matrix, adjectives, clusters):
    pca = PCA(n_components=2)  # Reduce the co-occurrence vectors to 2D for plotting
    reduced_matrix = pca.fit_transform(cooccurrence_matrix)

    plt.figure(figsize=(8, 6))
    # Color each point by its cluster assignment so the groups are visible
    plt.scatter(reduced_matrix[:, 0], reduced_matrix[:, 1], c=clusters, cmap='viridis')
    for i, adj in enumerate(adjectives):
        plt.text(reduced_matrix[i, 0], reduced_matrix[i, 1], adj, fontsize=9)
    plt.title("Adjective Clusters")
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.show()

visualize_clusters(cooccurrence_matrix, adjectives, clusters)

# Step 5: Print the clusters
for cluster_id in range(n_clusters):
    cluster_adjs = [adj for adj, cl in zip(adjectives, clusters) if cl == cluster_id]
    print(f"Cluster {cluster_id + 1}: {cluster_adjs}")

Explanation of the Code:

  1. Adjective Extraction:

    • The code uses NLTK to tokenize each sentence and keep the words whose part-of-speech (POS) tag marks an adjective.

  2. Co-occurrence Matrix:

    • A matrix counts how often each pair of adjectives appears in the same sentence (a worked toy example follows this list).

  3. Clustering:

    • The K-Means algorithm groups adjectives with similar co-occurrence patterns.

  4. Visualization:

    • PCA (Principal Component Analysis) reduces the co-occurrence matrix to 2D, and points are colored by cluster.

  5. Cluster Output:

    • The code prints the adjectives grouped into clusters.
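
To make item 2 concrete, here is a tiny worked example with a hypothetical three-word vocabulary, showing how X.T @ X turns a binary sentence-by-word matrix into pairwise co-occurrence counts:

# Toy illustration of the co-occurrence matrix: X is the binary
# sentence-by-word matrix, and X.T @ X counts shared sentences per word pair.
from sklearn.feature_extraction.text import CountVectorizer

toy = ["hot coffee, cold hands", "fast car, hot engine"]
v = CountVectorizer(vocabulary=['hot', 'cold', 'fast'], binary=True)
X = v.fit_transform(toy)
print((X.T @ X).toarray())
# [[2 1 1]    'hot' appears in both sentences, once with 'cold', once with 'fast'
#  [1 1 0]    'cold' co-occurs with 'hot' once, never with 'fast'
#  [1 0 1]]   'fast' co-occurs with 'hot' once, never with 'cold'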

Example Output:

For the given corpus, the output might look like this (the adjective order can vary, since the adjectives are collected into a set):


Adjectives in the corpus: ['big', 'small', 'happy', 'sad', 'bright', 'dark', 'hot', 'cold', 'fast', 'slow', 'loud', 'quiet', 'tall', 'short', 'strong', 'weak']

Cluster 1: ['big', 'small', 'tall', 'short']
Cluster 2: ['happy', 'sad', 'bright', 'dark']
Cluster 3: ['hot', 'cold', 'fast', 'slow', 'loud', 'quiet', 'strong', 'weak']
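
K-Means treats the co-occurrence rows as points in Euclidean space. A variation closer in spirit to clustering from pairwise similarities (and one use for the cosine_similarity import above) is hierarchical clustering on a cosine-dissimilarity matrix. A minimal sketch, assuming scikit-learn 1.2+ (older versions name the metric argument affinity):

# Variation: agglomerative clustering on precomputed cosine dissimilarities.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

dissimilarity = 1 - cosine_similarity(cooccurrence_matrix)
agg = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
agg_clusters = agg.fit_predict(dissimilarity)
print(agg_clusters)  # one cluster label per adjective, same order as `adjectives`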

What You’ll Learn:

  • Text preprocessing: How to extract specific parts of speech (e.g., adjectives) from a corpus.

  • Co-occurrence analysis: How to measure word relationships based on their appearance in the same context.

  • Clustering: How to group similar words using machine learning algorithms.

  • Visualization: How to visualize high-dimensional data in 2D for better understanding.


This exercise is a great way to explore how statistical techniques and clustering can be used to analyze and group words based on their meanings and usage in a corpus.