11 Week 6 Demo

11.1 Setup

First, we’ll load the packages we’ll be using in this week’s brief demo.

library(topicmodels)
library(dplyr)
library(tidytext)
library(ggplot2)
library(ggthemes)

Estimating a topic model requires us first to have our data in the form of a document-term-matrix. This is another term for what we have referred to in previous weeks as a document-feature-matrix.

We can take some example data from the topicmodels package. This text is from news releases by the Associated Press. It consists of around 2,200 articles (documents) and over 10,000 terms (words).

data("AssociatedPress", 
     package = "topicmodels")

To estimate the topic model we need only to specify the document-term-matrix we are using, and the number (k) of topics that we are estimating. To speed up estimation, we are here only estimating it on 100 articles.

lda_output <- LDA(AssociatedPress[1:100,], k = 10)

We can then inspect the contents of each topic as follows.

terms(lda_output, 10)

##       Topic 1    Topic 2     Topic 3      Topic 4    
##  [1,] "percent"  "police"    "soviet"     "central"  
##  [2,] "new"      "i"         "union"      "noriega"  
##  [3,] "company"  "man"       "year"       "peres"    
##  [4,] "oil"      "state"     "president"  "official" 
##  [5,] "duracell" "fire"      "questions"  "panama"   
##  [6,] "gas"      "new"       "bar"        "snow"     
##  [7,] "million"  "rating"    "batalla"    "nation"   
##  [8,] "year"     "three"     "last"       "northern" 
##  [9,] "record"   "greyhound" "multistate" "president"
## [10,] "prices"   "mrs"       "officials"  "southern" 
##       Topic 5     Topic 6    Topic 7   Topic 8       Topic 9    
##  [1,] "two"       "i"        "percent" "soviet"      "bush"     
##  [2,] "year"      "bush"     "year"    "new"         "dukakis"  
##  [3,] "roberts"   "barry"    "prices"  "officers"    "i"        
##  [4,] "soviet"    "campaign" "month"   "polish"      "people"   
##  [5,] "years"     "moore"    "rate"    "florio"      "waste"    
##  [6,] "people"    "study"    "report"  "settlements" "warming"  
##  [7,] "congress"  "like"     "months"  "union"       "global"   
##  [8,] "i"         "asked"    "economy" "exxon"       "city"     
##  [9,] "north"     "children" "rose"    "government"  "front"    
## [10,] "officials" "last"     "index"   "money"       "president"
##       Topic 10        
##  [1,] "new"           
##  [2,] "bank"          
##  [3,] "administration"
##  [4,] "california"    
##  [5,] "percent"       
##  [6,] "farmer"        
##  [7,] "i"             
##  [8,] "thats"         
##  [9,] "magellan"      
## [10,] "spacecraft"

We can then use the tidy() function from tidytext to gather the relevant parameters we’ve estimated. To get the \(\beta\) per-topic-per-word probabilities (i.e., the probability that the given term belongs to a given topic) we can do the following.

lda_beta <- tidy(lda_output, matrix = "beta")

lda_beta %>%
  arrange(-beta)

## # A tibble: 104,730 × 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     7 percent 0.0331
##  2    10 new     0.0165
##  3    10 bank    0.0159
##  4     8 soviet  0.0150
##  5     1 percent 0.0122
##  6     9 bush    0.0118
##  7     9 dukakis 0.0118
##  8     8 new     0.0111
##  9     2 police  0.0104
## 10     6 i       0.0103
## # ℹ 104,720 more rows

And to get the \(\gamma\) per-document-per-topic probabilities (i.e., the probability that a given document (here: article) belongs to a particular topic) we do the following.

lda_gamma <- tidy(lda_output, matrix = "gamma")

lda_gamma %>%
  arrange(-gamma)

## # A tibble: 1,000 × 3
##    document topic gamma
##       <int> <int> <dbl>
##  1       76    10  1.00
##  2       81     4  1.00
##  3        6     4  1.00
##  4       43     5  1.00
##  5       95    10  1.00
##  6       77     9  1.00
##  7       29     8  1.00
##  8       80     1  1.00
##  9       57     5  1.00
## 10       25    10  1.00
## # ℹ 990 more rows

And we can easily plot our \(\beta\) estimates as follows.

lda_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  scale_y_reordered() +
  theme_tufte(base_family = "Helvetica")

Which shows us the words associated with each topic, and the size of the associated \(\beta\) coefficient.

10 Week 6: Unsupervised learning (topic models)

12 Week 7: Unsupervised learning (word embedding)