Course Overview

In recent years, the use of computational techniques for the quantitative analysis of text has exploded. The volume of text data to which we now have access in the digital age is enormous. This has led social scientists to seek out new means of analyzing text data at scale.

We will see that text records, be they in the form of digital traces left on social media platforms, archived works of literature, parliamentary speeches, video transcripts, or print news, can help us answer a huge range of important questions.

Learning outcomes

This course will give students training in the use of computational text analysis techniques. The course will prepare students for dissertation work that uses textual data and will provide hands-on training in the use of the R programming language and (some) Python.

The course will provide a venue for seminar discussion of examples using these methods in the empirical social sciences as well as lectures on the technical and/or statistical dimensions of their application.

Course structure

We will be using this online book for the ten-week course in “Computational Text Analysis” (PGSP11584). Each chapter contains the readings for that week. The book also includes worksheets with example code for how to conduct some of the text analysis techniques we discuss each week.

Each week (with the partial exception of week 1), we will be discussing, alternately, the substantive and technical dimensions of published research in the empirical social sciences. The readings for each week generally contain two “substantive” readings—that is, examples of the application of text analysis techniques with empirical data—and one “technical” reading that focuses mainly on the statistical and computational aspects of a given technique.

We will study first the technical aspects of analytical approaches and, second, the substantive dimensions of these applications. This means that, when discussing the readings, we will be able to discuss how satisfactory a given approach is for illuminating the question or topic at hand.

Lectures will primarily be focused on the technical dimensions of a given technique. The seminar (Q&A) that follows will give us the opportunity to study and discuss questions of social scientific interest, and how computational text analysis has been used to answer these.

Course pre-preparation

NOTE: Before the lecture in Week 2, students should complete two introductory R exercises. Those students who have already done this for my courses in Semester 1 do not need to do this.

For those who haven’t done any of the pre-preparation tasks already, you should, first, consult the worksheet here, which is an introduction to setting up and understanding the very basics of working in R. Second, Ugur Ozdemir has provided a more comprehensive introductory R course for the Research Training Centre at the University of Edinburgh, and you can follow the instructions here to access this.

Reference sources

There are several other reference texts that will be of use during this course:

Wickham, Hadley and Garrett Grolemund. R for Data Science: https://r4ds.had.co.nz/
Silge, Julia and David Robinson. Text Mining with R: https://www.tidytextmining.com/
- For learning tidytext, this online tutorial will be used: https://juliasilge.shinyapps.io/learntidytext/
(later in the course) Hvitfeldt, Emil and Julia Silge. Supervised Machine Learning for Text Analysis in R: https://smltar.com/

In several weeks, we will also be referring to two other textbooks, available online, on information retrieval and text processing. These are:

Jurafsky, Dan and James H. Martin. Speech and Language Processing (3rd ed. draft): https://web.stanford.edu/~jurafsky/slp3/
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval: https://nlp.stanford.edu/IR-book/information-retrieval-book.html

Assessment

Fortnightly worksheets

Each fortnight, I will provide you with one worksheet that walks you through how to implement a different text analysis technique. At the end of these worksheets you will find a set of questions. You should buddy up with someone else in your class and go through these together.

This is called “pair programming” and there’s a reason we do this. Firstly, coding can be an isolating and difficult thing—it’s good to bring a friend along for the ride! Secondly, if there’s something you don’t know, maybe your buddy will. This saves you both time. Thirdly, your buddy can check your code as you write it, and vice versa. Again, this means both of you are working together to produce and check something as you go along.

At the subsequent week’s lecture, I will pick on a pair at random to answer each one of that worksheet’s questions (i.e., there is ~1/3 chance you’re going to get picked each week). I will ask you to walk us through your code. And remember: it’s also fine if you struggled and didn’t get to the end! If you encountered an obstacle, we can work through that together. All that matters to me is that you try.

The remainder of the seminar on worksheet weeks will be dedicated to seminar discussion where we discuss the readings together.

Fortnightly flash talks

On the weeks where you are not going to be tasked with a coding assignment, you’re not off the hook… I will again be selecting a pair at random (the same as your coding pair) to talk me through one of the readings. I will pick a different pair for each reading (i.e., ~1/3 chance again).

Don’t let this be a cause of great anguish: I just want thirty seconds to a few minutes where you lay out for me at least one (but preferably two or three) criticisms you had of any of the articles that are required reading for that week.

Here, you will want to think about whether the article really answered the research question, whether the data was appropriate for answering that question, whether the method was appropriate for answering that question, and whether the results show what the author claims they show.

The remainder of the seminar on flash talk weeks will be dedicated to group work where we go through the coding worksheet together.

Final assessment

Assessment takes the form of one summative assessment. This will be a 4,000-word essay on a subject of your choosing (with my prior approval). For this, you will be required to select from a range of data sources I will provide. You may also suggest your own data source.

You will be asked to: a) formulate a research question; b) use at least one computational text analysis technique that we have studied; c) conduct an analysis of the data source you have selected; d) write up the initial findings; and e) outline potential extensions of your analysis.

You will then provide the code you used in reproducible (markdown) format and will be assessed on both the substantive content of your essay contribution (the social science part) and your demonstrated competency in coding and text analysis (the computational part).
