2 Week 2: Tokenization and word frequencies

When approaching large-scale quantitative analyses of text, a key task is identifying and capturing the unit of analysis. One of the most commonly used approaches, across diverse analytical contexts, is text tokenization: splitting the text into word units such as unigrams, bigrams, and trigrams.
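
The idea can be illustrated in a few lines of Python. The example sentence and the simple whitespace tokenizer below are assumptions for illustration only; real corpora typically need more careful preprocessing (punctuation, casing, encoding).

```python
# A minimal sketch of word tokenization and n-gram construction in plain Python.
# The sentence and the whitespace-based tokenizer are illustrative assumptions.
from collections import Counter

text = "the cat sat on the mat and the cat slept"

# Unigrams: lowercase, then split on whitespace.
unigrams = text.lower().split()

# Bigrams and trigrams: contiguous word sequences of length 2 and 3.
bigrams = [tuple(unigrams[i:i + 2]) for i in range(len(unigrams) - 1)]
trigrams = [tuple(unigrams[i:i + 3]) for i in range(len(unigrams) - 2)]

# Word-frequency counts: the basic quantity behind this week's readings.
print(Counter(unigrams).most_common(3))   # e.g. [('the', 3), ('cat', 2), ...]
print(Counter(bigrams).most_common(2))
```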

The chapters by Manning, Raghavan, and Schütze (2007), listed below, provide a technical introduction to the task of “querying” text according to different word-based queries. This is the task we will study in the hands-on assignment for this week.
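
As a rough sketch of what a word-based query involves, the toy example below builds a term-to-document lookup over a few invented documents and answers a two-term query. It is an illustrative assumption, not the implementation from the assigned chapters.

```python
# A toy term-to-document index: map each term to the set of documents
# containing it, then answer a simple two-term query.
# The documents are invented for illustration.
from collections import defaultdict

docs = {
    1: "tokenization splits text into words",
    2: "word frequencies change over time",
    3: "frequencies of words in large corpora",
}

index = defaultdict(set)
for doc_id, doc_text in docs.items():
    for term in doc_text.lower().split():
        index[term].add(doc_id)

# Documents containing both "frequencies" and "words": set intersection.
print(sorted(index["frequencies"] & index["words"]))   # [3]
```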

For the seminar discussion, we will focus on some widely cited examples of research in the applied social sciences that employ token-based, or word-frequency, analyses of large corpora. The first, by Michel et al. (2011), uses the enormous Google Books corpus to measure cultural and linguistic trends. The second, by Bollen et al. (2021a), uses the same corpus to demonstrate a more specific change over time: so-called “cognitive distortion.” In both examples, we should be attentive to the questions of sampling covered in previous weeks. This question is central to the back-and-forth in the short responses and replies to the articles by Michel et al. (2011) and Bollen et al. (2021a).

Questions:

  1. Tokenizing and counting: what does this capture?
  2. Corpus-based sampling: what biases might threaten inference?
  3. If you had to write a critique of either Michel et al. (2011) or Bollen et al. (2021a), what would it focus on?

Required reading:

Further reading:

  • Rozado, Al-Gharbi, and Halberstadt (2021)
  • Alshaabi et al. (2021)
  • Campos et al. (2015)
  • Greenfield (2013)

Slides: