Idea matching and measurement
The partisan dimensions of religious rhetoric: Merging qualitative and natural language processing approaches to measure Congressional behavior
Natural language processing for analyzing disaster recovery trends expressed in large text corpora
We are developing a new natural language processing (NLP) method to facilitate analysis of text corpora that describe long-term recovery. The method aims to let users measure the degree to which user-specified propositions about potential issues are expressed within the corpora, with those measurements serving as a proxy for the disaster recovery process. The method employs a statistical syntax-based semantic matching model and was trained on a standard, publicly available dataset. We applied the NLP method to a news story corpus describing the recovery of Christchurch, New Zealand after the 2010-2011 Canterbury earthquake sequence. We used the model to compute semantic measurements of multiple potential recovery issues as expressed in the Christchurch news corpus, which spans 2011 to 2016. We evaluated the method's outputs through a user study involving twenty professional emergency managers. User study results show that the model can be effective when applied to a disaster-related news corpus, and 85% of study participants expressed interest in a way to measure recovery issue propositions in news or other corpora. We are encouraged by the potential for future applications of our NLP method in after-action learning, recovery decision making, and disaster research.
Semantic matching against a corpus: New applications and methods
Preprint; presented at NW-NLP
We consider the case of a domain expert who wishes to explore the extent to which a particular idea is expressed in a text collection. We propose the task of semantically matching the idea, expressed as a natural language proposition, against a corpus. We create two preliminary tasks derived from existing datasets, and then introduce a more realistic one on disaster recovery designed for emergency managers, whom we engaged in a user study. On the latter, we find that a new model built from natural language entailment data produces higher-quality matches than simple word-vector averaging, both on expert-crafted queries and on ones produced by the subjects themselves. This work provides a proof-of-concept for such applications of semantic matching and illustrates key challenges.
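As a concrete illustration of the matching task, the sketch below implements the simple word-vector-averaging baseline mentioned in the abstract: each sentence is embedded as the mean of its word vectors, and a proposition is scored against every corpus sentence by cosine similarity. This is a minimal sketch of the baseline only, not the entailment-based model from the paper; the `word_vectors` lookup and function names are illustrative assumptions.

```python
# Word-vector-averaging baseline for matching a proposition against a corpus.
# `word_vectors` is an assumed {token: np.ndarray} lookup (e.g., from GloVe);
# it is a hypothetical stand-in, not the paper's actual resources.
import numpy as np

def embed(sentence, word_vectors):
    """Average the vectors of in-vocabulary tokens; None if none are known."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def match_scores(proposition, corpus, word_vectors):
    """Cosine similarity of the proposition embedding to each corpus sentence."""
    p = embed(proposition, word_vectors)
    scores = []
    for sent in corpus:
        s = embed(sent, word_vectors)
        if p is None or s is None:
            scores.append(0.0)  # no overlap with the vocabulary: no evidence
        else:
            denom = np.linalg.norm(p) * np.linalg.norm(s) + 1e-12
            scores.append(float(p @ s / denom))
    return scores
```

Ranking corpus sentences by these scores gives the kind of match list a domain expert would inspect; the paper's finding is that a model trained on entailment data produces higher-quality matches than this averaging baseline.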
Syntax, semantics, and representations
Parsing with multilingual BERT, a small corpus, and a small treebank
In Findings of EMNLP (2020); presented at SIGTYP
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, for which labeled and unlabeled data are too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.
Situating sentence embedders with nearest neighbor overlap
As distributed approaches to natural language semantics have developed and diversified, embedders for linguistic units larger than words have come to play an increasingly important role. To date, such embedders have been evaluated using benchmark tasks (e.g., GLUE) and linguistic probes. We propose a comparative approach, nearest neighbor overlap (N2O), that quantifies similarity between embedders in a task-agnostic manner. N2O requires only a collection of examples and is simple to understand: two embedders are more similar if, for the same set of inputs, there is greater overlap between the inputs' nearest neighbors. Though applicable to embedders of texts of any size, we focus on sentence embedders and use N2O to show the effects of different design choices and architectures.
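The core of N2O as described above is simple enough to sketch directly: embed a corpus under two embedders, and for each query item compare the two sets of k nearest neighbors. The following is a hedged sketch under the assumption that each embedder is represented by a precomputed matrix of sentence vectors; all function names are illustrative, not the paper's implementation.

```python
# Sketch of nearest neighbor overlap (N2O) between two sentence embedders.
# vectors_a and vectors_b are (n_sentences, dim) arrays: the same corpus
# embedded by embedder A and embedder B, respectively.
import numpy as np

def nearest_neighbors(vectors, query_idx, k):
    """Indices of the k nearest corpus items to the query, by cosine similarity."""
    q = vectors[query_idx]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-12)
    sims[query_idx] = -np.inf  # exclude the query itself
    return set(np.argsort(-sims)[:k])

def n2o(vectors_a, vectors_b, query_indices, k=5):
    """Average fraction of k-nearest-neighbors shared by the two embedders."""
    overlaps = []
    for i in query_indices:
        nn_a = nearest_neighbors(vectors_a, i, k)
        nn_b = nearest_neighbors(vectors_b, i, k)
        overlaps.append(len(nn_a & nn_b) / k)
    return float(np.mean(overlaps))
```

An N2O of 1.0 means the two embedders induce identical neighbor sets for every query; values near 0 mean they organize the corpus very differently. Note that the measure needs only raw text examples, never task labels, which is what makes it task-agnostic.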
Improving natural language inference with a pretrained parser
We introduce a novel approach for incorporating syntax into natural language inference (NLI) models. Our method uses contextual token-level vector representations from a pretrained dependency parser. Like other contextual embedders, our method is broadly applicable to any neural model. We experiment with four strong NLI models (the decomposable attention model, ESIM, BERT, and MT-DNN) and show consistent accuracy improvements across three NLI benchmarks.
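One common way to use such token-level representations, sketched below, is to concatenate each token's parser-derived vector onto the NLI model's own token embedding before encoding. This is an assumed integration strategy for illustration only; the abstract does not specify the paper's exact mechanism, and all names here are hypothetical.

```python
# Hedged sketch: augmenting an NLI model's per-token inputs with contextual
# vectors from a pretrained dependency parser, by concatenation.
import numpy as np

def augment_tokens(word_embeddings, parser_states):
    """Concatenate per-token word embeddings with parser hidden states.

    word_embeddings: (n_tokens, d_word) array from the NLI model's embedder.
    parser_states:   (n_tokens, d_parse) array from the pretrained parser,
                     aligned token-for-token with the same sentence.
    Returns a (n_tokens, d_word + d_parse) array fed to the NLI encoder.
    """
    assert word_embeddings.shape[0] == parser_states.shape[0], "token counts must match"
    return np.concatenate([word_embeddings, parser_states], axis=1)
```

Because the augmentation happens at the input representation, it leaves the downstream architecture untouched, which is consistent with the abstract's claim that the method applies broadly to any neural model.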
Analysis of news text
PolitiFact language audit
Technical report for PolitiFact (2018).
We report on attempts to use currently available automated text analysis tools to identify, through language, possible biased treatment by PolitiFact of Democratic vs. Republican speakers. We begin by noting that there is no established method for detecting such differences, and indeed that "bias" is complicated and difficult to operationalize into a measurable quantity. This report includes several analyses that are representative of the tools available from natural language processing as of this writing. For each, we offer (i) what we would expect to see in the results if the method picked up on differential treatment of Democrats vs. Republicans, (ii) what we actually observe, and (iii) potential problems with the analysis; in some cases we also suggest (iv) future analyses that might be more revelatory.
Gender demographics of invited seminar speakers reflect gender disparities of faculty hosts
Rachel A. Hutto, Lisa Voelker, Jacob J. O'Connor, Lucy H. Lin, Natalia Mesa, Claire Rusch
Cascading failures in financial networks
Lucy H. Lin
Undergraduate senior thesis (advisor: Andrea LaPaugh).