Spectral Methods for Modeling Language
We use spectral methods (SVD) to build statistical language
models. The resulting vector models of language are then used to
predict a variety of properties of words, including their entity type
(e.g., person, place, organization, ...), their part of speech, and
their "meaning" (or at least their word sense). Canonical Correlation
Analysis (CCA), a generalization of Principal Component Analysis (PCA),
gives context-oblivious vector representations of words. More
sophisticated spectral methods are used to estimate Hidden Markov
Models (HMMs) and generative parsing models.
These methods give state estimates for words and phrases based on
their contexts, and probabilities for word sequences, which
in turn can be used to improve performance on many NLP tasks.
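As a rough illustration of the CCA step (this is a minimal sketch, not the code used in these papers; the toy data, matrix names, and dimensions are illustrative assumptions), the following builds two views of each word token, a one-hot word view and a context-count view, and recovers context-oblivious word vectors from an SVD of the whitened cross-covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each row is a word token described by two views.
# W: one-hot indicator of the token's word identity (the "word" view).
# C: counts of the words appearing near the token (the "context" view).
n_tokens, vocab_size, k = 1000, 50, 10
word_ids = rng.integers(0, vocab_size, n_tokens)
W = np.zeros((n_tokens, vocab_size))
W[np.arange(n_tokens), word_ids] = 1.0
C = rng.poisson(0.2, size=(n_tokens, vocab_size)).astype(float)

# Center both views.
W = W - W.mean(axis=0)
C = C - C.mean(axis=0)

def inv_sqrt(S, eps=1e-3):
    """Regularized inverse square root of a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(S + eps * np.eye(S.shape[0]))
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

Cww = W.T @ W / n_tokens          # within-view covariances
Ccc = C.T @ C / n_tokens
Cwc = W.T @ C / n_tokens          # cross-covariance between the views

# CCA reduces to an SVD of the whitened cross-covariance matrix.
U, s, Vt = np.linalg.svd(inv_sqrt(Cww) @ Cwc @ inv_sqrt(Ccc))

# Rows of this matrix are k-dimensional, context-oblivious word vectors,
# one per word type in the vocabulary.
word_vectors = inv_sqrt(Cww) @ U[:, :k]
```

Because the word view is a one-hot indicator, each row of the resulting projection matrix serves directly as the embedding of one word type; the published methods add further steps (e.g., the two-step CCA of the ICML 2012 paper) on top of this basic construction.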
Core to this work is the eigenword,
a real-valued vector associated with a word that captures its meaning,
in the sense that distributionally similar words have similar
eigenwords. Eigenwords are computed as the singular vectors of the
co-occurrence matrix of words and their contexts.
They can be context-oblivious (the vector does not depend
on the word's context, only on
the word) or context-sensitive (the vector depends on the context).
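A minimal sketch of this computation, using a toy corpus, a one-token context window, and dimension k = 3 as illustrative assumptions (not our released software), is:

```python
import numpy as np

# Toy corpus; in practice the co-occurrence counts come from a large corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Word-context co-occurrence matrix: rows are words, columns are the words
# observed in a +/-1 token window around them.
M = np.zeros((V, V))
for t, w in enumerate(corpus):
    for u in (t - 1, t + 1):
        if 0 <= u < len(corpus):
            M[idx[w], idx[corpus[u]]] += 1

# Eigenwords are the singular vectors of this co-occurrence matrix.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 3

# Context-oblivious representation: one k-dimensional vector per word type.
eigenwords = U[:, :k] * s[:k]

# Context-sensitive representation: project a particular token's observed
# context counts onto the right singular vectors (here, a token whose
# neighbor happens to be "cat").
context_counts = np.zeros(V)
context_counts[idx["cat"]] += 1
context_state = Vt[:k] @ context_counts
```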
For more information:
- Eigenword collections and software
- Using Regression for Spectral Estimation of HMMs, SLSP 2013: Jordan Rodu, Dean P. Foster, Weichen Wu, and Lyle H. Ungar
- Experiments with Spectral Learning of Latent-Variable PCFGs, NAACL 2013: Shay Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar
- Multi-View Learning of Word Embeddings via CCA, NIPS 2011: Dhillon, Foster, and Ungar
- Spectral Dimensionality Reduction for HMMs, arXiv 2012: Foster, Rodu, and Ungar
- Spectral Learning of Latent-Variable PCFGs, ACL 2012: Cohen, Stratos, Collins, Foster, and Ungar
- Spectral Dependency Parsing with Latent Variables, EMNLP-CoNLL 2012: Dhillon, Rodu, Collins, Foster, and Ungar
- Two Step CCA: A New Spectral Method for Estimating Vector Models of Words, ICML 2012: Paramveer Dhillon, Jordan Rodu, Dean Foster, and Lyle Ungar (with supplemental material)
Our 2013 NAACL tutorial on Spectral Learning Algorithms for Natural Language Processing has more references.
Key Collaborators
home: ungar@cis.upenn.edu