6 Week 4: Natural language, complexity, and similarity

This week we will delve more deeply into how language is used in text. In previous weeks, we tried out two main techniques, both of which rely, in different ways, on counting words. This week, we will think about some more sophisticated techniques for identifying and measuring language use, as well as how to compare texts to each other. The article by Gomaa and Fahmy (2013) provides an overview of different approaches to measuring text similarity. We will cover these technical dimensions in the lecture.
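To make the idea of comparing texts concrete, here is a minimal sketch in Python of two widely used similarity measures: Jaccard similarity (overlap between sets of words) and cosine similarity (the angle between word-count vectors). The function names, the simple regular-expression tokenizer, and the example sentences are illustrative assumptions, not taken from the readings or the lecture.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard_similarity(text_a, text_b):
    """Share of unique words that the two texts have in common (0 = none, 1 = all)."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not set_a and not set_b:
        return 1.0  # convention assumed here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two texts' word-count vectors."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical example documents
doc_1 = "the cat sat on the mat"
doc_2 = "the dog sat on the log"

print(round(jaccard_similarity(doc_1, doc_2), 3))
print(round(cosine_similarity(doc_1, doc_2), 3))
```

Note that Jaccard similarity ignores how often a word appears, whereas cosine similarity takes word frequency into account. This is one reason the two measures can give different answers for the same pair of texts.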

The article by Urman, Makhortykh, and Ulloa (2021) investigates a key question in contemporary communications research, namely what information we are exposed to online, and shows how we might compare web search results using similarity measures. The Schoonvelde et al. (2019) article, on the other hand, looks at the “complexity” of texts and compares how politicians of different ideological stripes communicate.
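As an illustration of what measuring “complexity” can look like in practice, the sketch below computes the Flesch Reading Ease score, one common readability formula based on average sentence length and average syllables per word. It is offered as an example of this family of measures, not necessarily the measure used in the readings. The syllable counter is a crude heuristic (it counts groups of vowels), and the function names and example sentences are assumptions made for illustration.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (heuristic only)."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate text that is easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts
simple_text = "We will cut taxes. We will build roads."
complex_text = "Fiscal consolidation necessitates comprehensive institutional reorganisation."

print(flesch_reading_ease(simple_text))
print(flesch_reading_ease(complex_text))
```

The short, plain sentences score much higher (easier) than the single sentence built from long words. Because the formula rewards short words and short sentences regardless of meaning, it is also a useful starting point for thinking about what biases such measures might carry.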

Questions:

  1. How do we measure linguistic complexity/sophistication?
  2. What biases might be involved in measuring sophistication?
  3. What other applications might there be for similarity measures?

Required reading:

  • Urman, Makhortykh, and Ulloa (2021)
  • Schoonvelde et al. (2019)
  • Gomaa and Fahmy (2013)

Further reading:

  • Voigt et al. (2017)
  • Peng and Hengartner (2002)
  • Lowe (2008)
  • Bail (2012)
  • Ziblatt, Hilbig, and Bischof (2020)
  • Benoit, Munger, and Spirling (2019)

Slides: