24 Exercise 8: Supervised learning

24.1 Introduction

The hands-on exercise for this week focuses on how to classify a sample of text.

24.1.1 Data

We will be classifying the same tweets as discussed in Barrie (2023).

I benefited form this worksheet when preparing this tutorial.

You can do this locally on your computers with:

tweets_sample  <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-ranked.rds?raw=true")))
tweet_id user_username text source possibly_sensitive author_id lang conversation_id created_at user_name user_profile_image_url user_location user_verified user_description user_url user_created_at user_protected user_pinned_tweet_id retweet_count like_count quote_count user_tweet_count user_list_count user_followers_count user_following_count sourcetweet_type sourcetweet_id sourcetweet_text sourcetweet_lang sourcetweet_author_id in_reply_to_user_id date
1558572382947876865 crabcrawler1 RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP Twitter Web App FALSE 1324742403492892673 qme 1558572382947876865 2022-08-13T21:53:07.000Z Crab Man https://pbs.twimg.com/profile_images/1587559247637843970/CNfOGD_7_normal.jpg Ohio, USA FALSE DONATE PLEASE https://t.co/BF7MOhA7d0 https://t.co/qzu1mWI2EM NEWs and politics. By night time I normally watch TV and movies. Also, go to church on Sunday. NA 2020-11-06T15:56:36.000Z FALSE 1590851209111670784 1 0 0 83259 152 30253 1806 retweeted 1558570726965350401 @ crabcrawler1 https://t.co/HcKNbVEQIP qme 1531709708783984642 NA 2022-08-13
1579236823527608321 HankVenture5 Blocked and reported https://t.co/18aKrUQy92 Twitter for iPhone FALSE 1285388676751437824 en 1579236823527608321 2022-10-09T22:26:14.000Z Hank Venture https://pbs.twimg.com/profile_images/1389707714826182656/EFOyaLz2_normal.jpg NA FALSE Slightly right-of-center so evil probably. Gen-X. Stamp Collector. Team Venture Stan account. Chief Executive Officer of #HankCo NA 2020-07-21T01:39:30.000Z FALSE 1491907548827381761 0 12 0 31043 10 3887 930 quoted 1579236650244517889 No matter which city you’re in, all pizza is pretty much the same en 11203972 NA 2022-10-09
1591122529694818304 krus_chiki @ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store Twitter for iPhone FALSE 922267018526707712 en 1591120235104727040 2022-11-11T17:35:47.000Z krus🪖 https://pbs.twimg.com/profile_images/1227595520773885952/8MWK0ssD_normal.jpg NA FALSE cossack. rare rug dealer. sells military surplus. super villains, inc. https://t.co/52TOsDw8YZ 2017-10-23T01:02:46.000Z FALSE 1378899646630801416 0 8 0 41302 94 39752 1646 NA NA NA NA NA 1563662952892227584 2022-11-11
1578034288842264582 entelechiada gm https://t.co/5maKDeYAHV Twitter for iPhone FALSE 1415419642693111809 und 1578034288842264582 2022-10-06T14:47:47.000Z Entelechiada https://pbs.twimg.com/profile_images/1501024250114678787/NFVa-9Xe_normal.jpg United States FALSE entelechy + enchiladas: here for the good stuff https://t.co/iG5j2CQlzw 2021-07-14T21:15:41.000Z FALSE 1565024760718798848 0 8 0 3504 7 594 747 NA NA NA NA NA NA 2022-10-06
1573056421914382336 EITC_Official RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one… Twitter for iPhone FALSE 1520028377457078275 en 1573056421914382336 2022-09-22T21:07:31.000Z 👁 Inside The Classroom https://pbs.twimg.com/profile_images/1522240813442342912/6oXnbTIV_normal.jpg San Antonio, TX FALSE Providing receipts that refute, “it’s not happening.” Videos belong to their respective owners. https://t.co/qI9VxrbMp2 2022-04-29T13:13:46.000Z FALSE 1592899675530866689 53 0 0 7250 73 18285 947 retweeted 1571349132090179589 This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one of the best experiences [she has] had so far as an educator.” https://t.co/8DvkicyG5O en 1520028377457078275 NA 2022-09-22
1583998206547226627 MsAvaArmstrong RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full… Twitter for iPhone FALSE 2449913803 en 1583998206547226627 2022-10-23T01:46:16.000Z AvaArmstrong, 🇺🇸Author https://pbs.twimg.com/profile_images/1377022534235918341/r_zFt102_normal.jpg Last Outpost FALSE Thriller-Romance author on AMAZON. MY OPINION STATED HERE. Expert at triggering Leftists. Trump won. CONTENT OF CHARACTER. Taken by @ EagleEyeFlyer ❤️ NO DM🚫 NA 2014-04-17T15:02:47.000Z FALSE 1591443575249768448 518 0 0 785719 374 157472 86624 retweeted 1583992381535186944 President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full speech on Rumble: https://t.co/XJnh6tcghu https://t.co/xx9A404nF0 en 4041824789 NA 2022-10-23

Now, we need first to add some labels to these data. Specifically, we’re interested in the “toxicity” of tweet content.

How can we do this?

Well, I have provided you with already labelled data in the below:

tweets_sample  <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-tox-ranked.rds?raw=true")))
tweet_id user_username text TOXICITY
1558572382947876865 crabcrawler1 RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP 0.1004571
1579236823527608321 HankVenture5 Blocked and reported https://t.co/18aKrUQy92 0.0673801
1591122529694818304 krus_chiki @ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store 0.0570059
1578034288842264582 entelechiada gm https://t.co/5maKDeYAHV 0.0138227
1573056421914382336 EITC_Official RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one… 0.0426573
1583998206547226627 MsAvaArmstrong RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full… 0.0451312

These tweets were actually labelled by another machine learning engine (we’ll talk about this later). But for now, we’re going to pretend they were labelled by humans. And we’re going to take a subset of the data to train our own classifier to label the rest of the dataset.

We’re going to say that anything above a score of .5 is “toxic.”

library(caret)
library(rsample)

# Select just the columns we need
tweets_tox_select <- tweets_tox_sample %>%
  select(tweet_id, user_username, text, TOXICITY) %>%
  mutate(toxbin = ifelse(TOXICITY>=.5, 1, 0))

set.seed(123)
split <- initial_split(tweets_tox_select, prop = .7, strata = "TOXICITY")
train <- training(split)
test  <- testing(split)

table(train$toxbin) %>% prop.table()
## 
##          0          1 
## 0.95927602 0.04072398
table(test$toxbin) %>% prop.table()
## 
##          0          1 
## 0.95438596 0.04561404

24.2 Naïve Bayes

This section was adapted from here. I thank Bradley Boehmke for providing these materials.

Bayesian probability is built on the idea of conditional probability, the probability of event A given that event B has occurred [P(A|B)].

For our Twitter data, this means we are interested in a tweet being “toxic” \(C_k\) (where \(C_{yes} = \text{toxic}\) and \(C_{no} = \text{non-toxic}\)) given that its predictor values are \(x_1, x_2, ..., x_p\). This can be written as \(P(C_k|x_1, ..., x_p)\).

The Bayesian formula for calculating this probability is

\[ P(C_k|X) = \frac{P(C_k) \cdot P(X|C_k)}{P(X)} \quad (1) \]

where:

  • \(P(C_k)\) is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of a tweet being toxic or not. And we know this is around 5%.

  • \(P(X)\) is the probability of the predictor variables (same as \(P(C_k|x_1, ..., x_p)\)). This will be the text of the tweets.

  • \(P(X|C_k)\) is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e. toxic or non-toxic), what is the probability of observing the predictor values.

  • \(P(C_k|X)\) is called our posterior probability. By combining our observed information, we are updating our a priori information on probabilities to compute a posterior probability that an observation has class \(C_k\).

library(SnowballC)
library(naivebayes)
library(tm)

# Preprocess the text data
# Create a Corpus from the text column
train_corpus <- Corpus(VectorSource(train$text))
test_corpus <- Corpus(VectorSource(test$text))

# Text preprocessing
preprocess <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

train_corpus <- preprocess(train_corpus)
test_corpus <- preprocess(test_corpus)

# Create a document-term matrix
train_dtm <- DocumentTermMatrix(train_corpus)
test_dtm <- DocumentTermMatrix(test_corpus, control=list(dictionary=Terms(train_dtm)))

# Convert dtm to matrix
train_matrix <- as.matrix(train_dtm)
test_matrix <- as.matrix(test_dtm)

# Fit Naive Bayes model
# Ensure factors are factors, and predictors are in the correct format
train$toxbin <- factor(train$toxbin)

# Use the train function from caret to train the model
trControl <- trainControl(method = "cv", number = 10)
nb_model <- train(x = train_matrix, y = train$toxbin, method = "naive_bayes", trControl = trControl)

# Predict on test data
test$toxbin <- factor(test$toxbin) # Make sure the test labels are also factors
predictions <- predict(nb_model, test_matrix)

# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test$toxbin)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 272  13
##          1   0   0
##                                           
##                Accuracy : 0.9544          
##                  95% CI : (0.9233, 0.9755)
##     No Information Rate : 0.9544          
##     P-Value [Acc > NIR] : 0.5730645       
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.0008741       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9544          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9544          
##          Detection Rate : 0.9544          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

Great! We have 95% accuracy. That’s good, right? WRONG.

This is a common issue in imbalanced datasets where one class is significantly more prevalent than the other. Here are some key terms to remember regarding the model’s performance:

  • The Kappa statistic is 0, which indicates that the model is no better than random chance when taking into account the imbalance of the classes.
  • The Sensitivity (also known as Recall or True Positive Rate) is 1, which means that the model correctly identified all non-toxic tweets (class 0) as such. However, this is not informative since there are almost no toxic tweets (class 1) to begin with.
  • The Specificity is 0, indicating that the model did not correctly identify any toxic tweets (class 1). This means the model failed to identify the minority class entirely.
  • The Pos Pred Value (or Precision) for the non-toxic class is the same as the accuracy, which is again not informative due to the lack of true positives for the toxic class.
  • Balanced Accuracy is 0.5, which is the average of sensitivity and specificity. Since specificity is 0, this metric shows that the model is ineffective for the minority class.

24.2.1 Alternatives to locally trained models

We can then use the following code to classify this content. This code connects to the Google Perspective classifying engine.

All we need to do is tell it which features of text we want it to classify. We can do so using the peRspective library in R.

library(peRspective)
library(dplyr)
library(ggplot2)

models <- c(peRspective::prsp_models)
models_subset <- models[c(1:5, 7, 9:10, 12, 14)]
models_subset

toxtwts <- tweets_sample %>%
  prsp_stream(text = text,
              text_id = tweet_id, 
              score_model = models_subset,
              verbose = T,
              safe_output = T)

colnames(toxtwts) <- c("tweet_id", "error", models_subset)

tweets_sample_tox_r <- tweets_sample %>%
  left_join(toxtwts, by = "tweet_id")

And then we’re back to the data we started with!