1 Week 1: Retrieving and analyzing text

Our first task when conducting large-scale text analyses is gathering and curating the text itself. This is the focus of the chapters by Manning, Raghavan, and Schütze (2007) listed below. Here, you’ll find an introduction to different ways in which we can reformat and ‘query’ text data in order to begin asking questions of it. In computer science and natural language processing, this is often referred to as “information retrieval”, and it is the foundation of many search processes, including web search.
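To make the idea of reformatting text so that it can be queried more concrete, here is a minimal sketch in Python of an inverted index, the data structure that underlies Boolean retrieval. The three toy documents, the simple tokenizer, and the function names (tokenize, build_inverted_index, boolean_and) are assumptions made for illustration only; they are not taken from the readings.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens


def build_inverted_index(documents):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the sorted ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical toy corpus, keyed by document id.
documents = {
    1: "Web search relies on information retrieval.",
    2: "Sampling texts can introduce bias.",
    3: "Information retrieval begins with an index of the texts.",
}

index = build_inverted_index(documents)
print(boolean_and(index, "information", "retrieval"))  # [1, 3]
print(boolean_and(index, "texts"))                     # [2, 3]
```

The point of the reformatting step is that a query no longer has to scan every document: each term is looked up once, and the sets of matching documents are intersected.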

The articles by Tatman (2017) and Pechenick, Danforth, and Dodds (2015) will be the focus of our seminar (Q&A). These articles will get us thinking about the fundamentals of text discovery and sampling. When reading the articles, we should think about where we are locating our texts, how we are sampling them, what biases might inhere in this sampling process, and what these texts represent; i.e., which population or phenomenon of interest they allow us to make inferences about.

Questions for seminar:

Where do we access text? What do we need to consider when doing so?
How do we sample texts?
What biases do we need to keep in mind?

Required reading:

Tatman (2017)
Pechenick, Danforth, and Dodds (2015)
Manning, Raghavan, and Schütze (2007) (chs. 1 and 10): https://nlp.stanford.edu/IR-book/information-retrieval-book.html
Krippendorff (2004) (ch. 6)

Further reading:

Olteanu et al. (2019)
Biber (1993)
Barberá and Rivero (2015)

Slides:

Week 1 Slides

Introduction to R

2 Week 2: Tokenization and word frequencies