Collaborators: Barend Beekhuizen, Suzanne Stevenson
Summary
Recent literature claims that prototype-based distributional semantic models (DSMs) fall short of instance-based DSMs in representing the multiple senses of ambiguous words (homonyms or polysemes). We show that Word2Vec, a prototype-based DSM, has a model-internal way of disambiguating senses using its two sets of vectors. Moreover, we demonstrate the model's robustness to infrequent meanings in an experiment with controllably generated "pseudo-homonyms".
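The "two sets of vectors" are the input (word) and output (context) embeddings that Word2Vec learns jointly. One way to exploit them for disambiguation, as a minimal sketch assuming a gensim model trained with negative sampling (the model path, scoring function, and example context are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical trained model: model.wv.vectors holds the input ("word")
# vectors and model.syn1neg the output ("context") vectors.
model = Word2Vec.load("word2vec.model")  # illustrative path

def sense_score(sense_word, context_words, model):
    """Score a candidate sense by dotting its input vector with the output
    vectors of the observed context, mirroring the training objective."""
    w_in = model.wv.vectors[model.wv.key_to_index[sense_word]]
    ctx = [model.syn1neg[model.wv.key_to_index[c]]
           for c in context_words if c in model.wv.key_to_index]
    return float(np.mean([w_in @ c for c in ctx])) if ctx else float("-inf")

# Which meaning component better predicts this context?
context = ["ate", "a", "slice", "of"]
best = max(["pizza", "water"], key=lambda s: sense_score(s, context, model))
```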
Pseudo-Homonyms
Homonyms are ambiguous words with multiple unrelated meanings (e.g., river BANK vs. financial BANK). The computational treatment of their meaning components is challenging, because the intended meaning in context is not easy to recover without a sense-annotated corpus.
This is why we opt for pseudo-homonyms: we select two real words (e.g., pizza and water) and replace every instance of both words in a corpus with a single shared token (e.g., pizzaxwater; the orthography of course does not matter). Training a word embedding algorithm (e.g., Word2Vec) on this modified corpus yields an embedding for the newly created pseudo-homonym. This way, we can study the properties of the pseudo-homonym embedding in relation to its two meaning components, whose embeddings we obtain from the unmodified corpus.
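As a concrete illustration, a minimal sketch of the replacement step (the token names and toy corpus are illustrative; any Word2Vec implementation can then be trained on the modified corpus):

```python
from gensim.models import Word2Vec

def merge_components(tokens, w1="pizza", w2="water", merged="pizzaxwater"):
    """Replace every occurrence of either component word with one shared token."""
    return [merged if t in (w1, w2) else t for t in tokens]

# Toy corpus of tokenized sentences (illustrative)
corpus = [["we", "ordered", "pizza"], ["she", "drank", "some", "water"]]
modified = [merge_components(sent) for sent in corpus]

# Training on the modified corpus yields an embedding for the pseudo-homonym
model = Word2Vec(sentences=modified, vector_size=100, min_count=1)
```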
One caveat is that if we select the two component words randomly, the resulting pseudo-homonyms do not resemble real homonyms: a classifier distinguishing real from fake homonyms reaches 67% accuracy when the pseudo-homonyms use randomly selected components. To make the pseudo-homonyms resemble real homonyms more closely, we match their components to real homonyms on a number of psycholinguistic properties (called covariates, since they often correlate with results in human experiments). After matching, the classification accuracy drops to 56%, meaning the matched pseudo-homonyms are nearly indistinguishable from real homonyms.
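A minimal sketch of such a matching step, under the assumption of greedy nearest-neighbour matching in standardized covariate space (the paper's actual matching procedure and covariates may differ):

```python
import numpy as np

def match_to_real(real_cov, candidate_cov):
    """For each real homonym (row of standardized covariates such as frequency
    or concreteness), pick the closest still-available candidate pair."""
    chosen, available = [], set(range(len(candidate_cov)))
    for r in real_cov:
        dists = {i: np.linalg.norm(r - candidate_cov[i]) for i in available}
        best = min(dists, key=dists.get)
        chosen.append(best)
        available.remove(best)
    return chosen
```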
This part of the project was primarily my contribution.
Result
[Figure: disambiguation accuracy of Word2Vec (blue) and ITS (orange) on pseudo-homonyms with varying component frequency balance, relative to a most-frequent-meaning baseline.]
In the figure above, every dot is a pseudo-homonym. We controllably generate pseudo-homonyms whose components differ in relative frequency.
We observe that Word2Vec (blue), a prototype-based model, consistently improves on the baseline (always guessing the more frequent meaning), and it degrades gracefully as the baseline becomes harder to beat. In contrast, ITS (orange), an instance-based model, starts to perform worse than the baseline as the component frequencies become more unbalanced. This demonstrates that although Word2Vec uses a single vector representation for each word, it can distinguish the senses of an ambiguous word better than some instance-based models.
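To make the baseline concrete: its accuracy is simply the relative frequency of the dominant meaning, so it rises, and becomes harder to beat, as the components grow more unbalanced (a sketch; the variable names are illustrative):

```python
def baseline_accuracy(freq_a, freq_b):
    """Always guessing the more frequent meaning is right exactly as often
    as that meaning occurs."""
    return max(freq_a, freq_b) / (freq_a + freq_b)

print(baseline_accuracy(50, 50))  # 0.5: balanced components, easy to beat
print(baseline_accuracy(90, 10))  # 0.9: unbalanced, hard to beat
```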
This work resulted in a conference paper: Beekhuizen, B., Cui, C. X., & Stevenson, S. (2019). Representing lexical ambiguity in prototype models of lexical semantics. In Proceedings of the Cognitive Science Society.