Research Article: The natural selection of words: Finding the features of fitness

Date Published: January 28, 2019

Publisher: Public Library of Science

Author(s): Peter D. Turney, Saif M. Mohammad, Richard A. Blythe.


We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word’s length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.

Partial Text

Words are a basic unit for the expression of meanings, but the mapping between words and meanings is many-to-many. Many words can have one meaning (synonymy) and many meanings can be expressed with one word (polysemy). Generally we have a preference for one word over another when we select a word from a set of synonyms in order to convey a meaning, and generally one sense of a polysemous word is more likely than the other senses. These preferences are not static; they evolve over time. In this paper, we present work on improving our understanding of the evolution of our preferences for one word over another in a set of synonyms.

Much has been written about the evolution of words. Van Wyhe [16] provides a good survey of early research. Gray, Greenhill, and Ross [17] and Pagel [18] present thorough reviews of recent work. Mesoudi [19] gives an excellent introduction to work on the evolution of culture in general. In this section, we present a few relevant highlights from the literature on the evolution of words.

Predicting the rise and fall of words in a synset could be viewed as a time series prediction problem, but we prefer another point of view. Fifty years from now, will ecstatic still dominate its synset, or will it perhaps be replaced by rapt? This is a classification problem, rather than a time series prediction problem. The classes are winner and loser.
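The classification framing can be sketched in a few lines of Python. This is only an illustration: the function names (`leader`, `label_synset`, `leadership_changed`), the tie-breaking rule, and the frequency values are assumptions made for the example, not taken from the paper or its dataset.

```python
from typing import Dict


def leader(freqs: Dict[str, float]) -> str:
    """Return the highest-frequency word in a synset.

    Ties go to the first word listed (an assumption for this sketch).
    """
    return max(freqs, key=freqs.get)


def label_synset(future_freqs: Dict[str, float]) -> Dict[str, str]:
    """Label each word 'winner' or 'loser' according to the future leader."""
    top = leader(future_freqs)
    return {w: ("winner" if w == top else "loser") for w in future_freqs}


def leadership_changed(current: Dict[str, float], future: Dict[str, float]) -> bool:
    """True when the future leader differs from the current leader."""
    return leader(current) != leader(future)


# Invented frequencies, for illustration only.
current = {"ecstatic": 3.0, "rapt": 1.0}
future = {"ecstatic": 2.0, "rapt": 2.5}

print(label_synset(future))
print(leadership_changed(current, future))
```

Each word in a synset becomes one labeled example, so a single synset contributes one winner and one or more losers to the dataset.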

Now that we have training and testing datasets, we apply supervised learning to predict when the leadership of a synset will change. We do this in three more steps, as follows.

This section presents four sets of experiments. The first experiment evaluates the system as described above; we call this system NBCP (Naive Bayes Change Prediction). The second experiment evaluates the impact of removing features from NBCP to discover which features are most useful. The third experiment varies the cycle length from thirty years to sixty years. The final experiment takes a close look at the model that is induced by the naive Bayes classifier, in an effort to understand what it has learned.
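To make the idea of a naive Bayes model over word features concrete, here is a minimal sketch. It is not the NBCP implementation: the three features in `word_features`, the `TinyNaiveBayes` class, the smoothing scheme, and the toy training rows are all assumptions invented for illustration. The excerpt says only that the features are based on word length, the characters in the word, and historical frequencies.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def word_features(word: str, history: List[float]) -> Dict[str, str]:
    """Hypothetical features: length, final three characters, frequency trend."""
    trend = "rising" if history[-1] > history[0] else "not_rising"
    return {"length": str(len(word)), "suffix": word[-3:], "trend": trend}


class TinyNaiveBayes:
    """Categorical naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training

    def fit(self, rows: List[Dict[str, str]], labels: List[str]) -> "TinyNaiveBayes":
        for feats, label in zip(rows, labels):
            self.class_counts[label] += 1
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def log_scores(self, feats: Dict[str, str]) -> Dict[str, float]:
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in feats.items():
                count = self.value_counts[(label, name)][value]
                # The extra +1 reserves probability mass for unseen values.
                denom = n + self.alpha * (len(self.vocab[name]) + 1)
                score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, feats: Dict[str, str]) -> str:
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)


# Toy training rows with invented frequency histories, not real corpus data.
rows = [
    word_features("realize", [1.0, 2.0, 3.0]),
    word_features("realise", [3.0, 2.0, 1.0]),
    word_features("organize", [1.0, 1.0, 2.0]),
    word_features("organise", [2.0, 1.0, 1.0]),
]
labels = ["winner", "loser", "winner", "loser"]
model = TinyNaiveBayes().fit(rows, labels)
print(model.predict(word_features("recognize", [1.0, 2.0, 4.0])))
```

Because naive Bayes scores each feature independently, the per-feature log probabilities can be read off directly, which is one reason such a model lends itself to the kind of inspection described in the final experiment.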

Throughout this work, our guiding principle has been simplicity, based on the assumption that the evolution of words is a complex, noisy process, requiring a simple, robust approach to modeling. Therefore we chose a classification-based analysis, instead of a time series prediction algorithm, and a naive Bayes model, instead of a more complex model. The success of our approach is encouraging, and it suggests there is more signal and structure in the data than we expected. We believe that more sophisticated analyses will reveal interesting phenomena that our simpler approach has missed.

This work demonstrates that change in which word dominates a synset is predictable to some degree; change is not entirely random. It is possible to make successful predictions several decades into the future. Furthermore, it is possible to understand some of the causes of change in synset leadership.



