Lucy H. Lin
PhD thesis (2021), advised by Noah A. Smith.
For social scientists and other data practitioners, the abundance of available digital text data is a rich potential source for understanding social phenomena. Practitioners have therefore increasingly applied text analysis methods to relevant corpora to help answer their substantive research questions; common abstractions for these analyses include text classification, topic modeling, and fixed keyword matching. While these tools are powerful, they impose strong assumptions about the structure of human language (e.g., documents as bags of words) and consequently limit the kinds of inferences that practitioners can draw from corpora. Conversely, richer models from the natural language processing community, trained on large corpora, do not necessarily transfer to the needs of practitioners' applications.
In this work, we propose semantic comparison as another lens for studying social phenomena in text data. We introduce two novel applications of semantic comparison methods for which standard abstractions are insufficient. First, we demonstrate the utility of finding semantic matches of a query sentence in a broader corpus through two case studies: community recovery after the 2010-2011 Christchurch, New Zealand earthquake sequence, as expressed in local news text; and policy attitudes in the United States Congress across 2000-2013, as expressed in archived websites from the .gov domain. We discuss the model selection and end-user challenges involved, and introduce a procedure (nearest neighbor overlap) to compare sentence embedder behavior in the context of a corpus.
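The abstract does not define nearest neighbor overlap, so the following is only a minimal sketch of one plausible reading: for each query sentence, retrieve its k nearest corpus sentences under each of two embedders, and average the fraction of neighbors the two embedders share. The function names, the use of cosine similarity, and the assumption that embeddings are precomputed as NumPy arrays are all illustrative choices, not details taken from the thesis.

```python
import numpy as np


def top_k_neighbors(query_vecs, corpus_vecs, k):
    """Indices of the k most cosine-similar corpus rows for each query row.

    Hypothetical helper: cosine similarity is an assumed choice of metric.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (num_queries, num_corpus_sentences)
    # Sort each row by descending similarity and keep the first k indices.
    return np.argsort(-sims, axis=1)[:, :k]


def nearest_neighbor_overlap(queries_a, corpus_a, queries_b, corpus_b, k):
    """Mean fraction of shared top-k neighbors between two embedders.

    Row i of queries_a and queries_b must embed the same query sentence,
    and row j of corpus_a and corpus_b the same corpus sentence. The two
    embedders may use different dimensionalities.
    """
    nn_a = top_k_neighbors(queries_a, corpus_a, k)
    nn_b = top_k_neighbors(queries_b, corpus_b, k)
    overlaps = [
        len(set(a.tolist()) & set(b.tolist())) / k
        for a, b in zip(nn_a, nn_b)
    ]
    return float(np.mean(overlaps))


# Toy usage with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
corpus_a, queries_a = rng.normal(size=(50, 8)), rng.normal(size=(5, 8))
corpus_b, queries_b = rng.normal(size=(50, 16)), rng.normal(size=(5, 16))

same = nearest_neighbor_overlap(queries_a, corpus_a, queries_a, corpus_a, k=5)
diff = nearest_neighbor_overlap(queries_a, corpus_a, queries_b, corpus_b, k=5)
```

Under this reading, a score near 1 would suggest that two embedders retrieve much the same sentences from the corpus at hand, while a score near 0 would suggest they behave quite differently on it, which is the kind of corpus-specific comparison the procedure is meant to support.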
Second, we discuss sensationalism in medical journalism and the possible utility of NLP, particularly semantic comparison, in identifying sensationalized text. We survey past studies across communications, medicine, and psychology to illustrate the complexity of how and why sensationalism manifests in the health communications pipeline. In doing so, we critique the common NLP setup of attempting to label social phenomena in text with high accuracy and provide recommendations for developing user-facing NLP systems that seek to identify or reduce the occurrence of sensationalism.