Word Association Norms, Mutual Information, and Lexicography

Abstract:

The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.

(Kenneth Ward Church, Patrick Hanks)

https://www.semanticscholar.org/paper/Word-Association-Norms%2C-Mutual-Information%2C-and-Church-Hanks

https://aclanthology.org/P89-1010.pdf

Key Concepts

  1. Mutual Information:

    • MI compares the joint probability of observing two events (or words) together, P(x,y), with the probability of observing them independently, P(x)P(y).

    • If P(x,y) is significantly larger than P(x)P(y), it indicates a strong association, resulting in I(x,y)>0.

    • Conversely, if P(x,y) is similar to P(x)P(y), then I(x,y)≈0, suggesting no significant relationship.

    • If x and y are in complementary distribution, they rarely or never occur together, so P(x,y) approaches zero and I(x,y) becomes strongly negative (I(x,y) ≪ 0). The defining formula appears just after this list.
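
For reference, the paper's definition of the mutual information between words x and y, comparing the joint probability with the product of the marginals:

```latex
I(x, y) \equiv \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```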

  2. Estimation of Probabilities:

    • The probabilities P(x) and P(y) are estimated by counting occurrences in a corpus, denoted as f(x) and f(y), and normalizing by the total corpus size N.

    • Joint probabilities P(x,y) are estimated by counting how often x is followed by y within a specified window size w (e.g., 5 words). A sketch of both estimates follows this list.
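
A minimal Python sketch of these estimates, assuming a pre-tokenized corpus; the function name association_ratio and its interface are illustrative, not from the paper:

```python
import math
from collections import Counter

def association_ratio(tokens, x, y, window=5):
    """Estimate I(x, y) = log2(P(x, y) / (P(x) P(y))) from a token list.

    P(x) and P(y) are f(x)/N and f(y)/N; P(x, y) is f(x, y)/N, where
    f(x, y) counts how often x is followed by y within `window` words.
    """
    N = len(tokens)
    unigrams = Counter(tokens)
    f_xy = 0
    for i, tok in enumerate(tokens):
        if tok == x:
            # Count occurrences of y in the `window` words after x.
            f_xy += tokens[i + 1 : i + 1 + window].count(y)
    if f_xy == 0:
        return float("-inf")  # never co-occur: log of zero probability
    p_xy = f_xy / N
    return math.log2(p_xy / ((unigrams[x] / N) * (unigrams[y] / N)))
```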

  3. Window Size:

    • The choice of window size affects the type of relationships captured:

      • Smaller windows identify fixed expressions (like idioms).

      • Larger windows capture broader semantic relationships.

    • A window size of 5 words is chosen as a compromise: large enough to capture meaningful semantic relationships, small enough to stay sensitive to local adjacency (see the example after this list).
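
Using the association_ratio sketch above on a toy corpus (invented here purely for illustration), varying the window contrasts tight lexical patterns with looser semantic ones:

```python
tokens = ("the doctor asked the nurse to set off the chart "
          "while the doctor paged the nurse again").split()

# A tight window favors adjacent fixed expressions such as "set off",
# while a wider window also picks up looser pairs like doctor/nurse.
print(association_ratio(tokens, "set", "off", window=2))
print(association_ratio(tokens, "doctor", "nurse", window=5))
```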

  4. Count Threshold:

    • The authors set a threshold, discarding pairs with very small counts (e.g., f(x,y) < 5), to keep the association ratio stable; low counts yield unreliable estimates. A sketch of this filtering step follows below.
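
A hedged sketch of how such a cutoff could be applied when enumerating candidate pairs; candidate_pairs and min_count are illustrative names, though the 5-count threshold itself is the paper's:

```python
from collections import Counter

def candidate_pairs(tokens, window=5, min_count=5):
    """Count ordered (x, y) co-occurrences and drop unreliable rare pairs."""
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pair_counts[(x, y)] += 1
    # Pairs seen fewer than min_count times give unstable ratios; drop them.
    return {pair: f for pair, f in pair_counts.items() if f >= min_count}
```
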
  5. Symmetry in Probabilities:

    • MI is symmetric, I(x,y) = I(y,x), because P(x,y) = P(y,x) when word order is ignored; the relationship holds regardless of which word comes first.

    • The association ratio, however, is not symmetric: f(x,y) counts how often x precedes y within the window, so it encodes linear precedence. This asymmetry can reveal interesting biases in the data, such as syntactic patterns or sociolinguistic trends (demonstrated below).
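
Continuing the toy corpus above, the same pair in the two orders can score very differently; here the reversed order happens never to occur inside the window at all:

```python
# f(doctor, nurse) counts "doctor ... nurse"; f(nurse, doctor) the reverse.
print(association_ratio(tokens, "doctor", "nurse", window=5))  # finite value
print(association_ratio(tokens, "nurse", "doctor", window=5))  # -inf here
```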

Lab Work:

https://colab.research.google.com/drive/1f5yfmhAocDZ9bHeg1QKXY_086dEHIVTi