24 Exercise 8: Supervised learning

24.1 Introduction

The hands-on exercise for this week focuses on how to classify a sample of text.

24.1.1 Data

We will be classifying the same tweets as discussed in Barrie (2023).

I benefited form this worksheet when preparing this tutorial.

library(dplyr)

You can do this locally on your computers with:

tweets_sample  <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-ranked.rds?raw=true")))

tweet_id	user_username	text	source	possibly_sensitive	author_id	lang	conversation_id	created_at	user_name	user_profile_image_url	user_location	user_verified	user_description	user_url	user_created_at	user_protected	user_pinned_tweet_id	retweet_count	like_count	user_tweet_count	user_list_count	user_followers_count	user_following_count	sourcetweet_type	sourcetweet_id	sourcetweet_text	sourcetweet_lang	sourcetweet_author_id	in_reply_to_user_id	date
1558572382947876865	crabcrawler1	RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP	Twitter Web App	FALSE	1324742403492892673	qme	1558572382947876865	2022-08-13T21:53:07.000Z	Crab Man	https://pbs.twimg.com/profile_images/1587559247637843970/CNfOGD_7_normal.jpg	Ohio, USA	FALSE	DONATE PLEASE https://t.co/BF7MOhA7d0 https://t.co/qzu1mWI2EM NEWs and politics. By night time I normally watch TV and movies. Also, go to church on Sunday.	NA	2020-11-06T15:56:36.000Z	FALSE	1590851209111670784	1	0	83259	152	30253	1806	retweeted	1558570726965350401	@ crabcrawler1 https://t.co/HcKNbVEQIP	qme	1531709708783984642	NA	2022-08-13
1579236823527608321	HankVenture5	Blocked and reported https://t.co/18aKrUQy92	Twitter for iPhone	FALSE	1285388676751437824	en	1579236823527608321	2022-10-09T22:26:14.000Z	Hank Venture	https://pbs.twimg.com/profile_images/1389707714826182656/EFOyaLz2_normal.jpg	NA	FALSE	Slightly right-of-center so evil probably. Gen-X. Stamp Collector. Team Venture Stan account. Chief Executive Officer of #HankCo	NA	2020-07-21T01:39:30.000Z	FALSE	1491907548827381761	0	12	31043	10	3887	930	quoted	1579236650244517889	No matter which city you’re in, all pizza is pretty much the same	en	11203972	NA	2022-10-09
1591122529694818304	krus_chiki	@ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store	Twitter for iPhone	FALSE	922267018526707712	en	1591120235104727040	2022-11-11T17:35:47.000Z	krus🪖	https://pbs.twimg.com/profile_images/1227595520773885952/8MWK0ssD_normal.jpg	NA	FALSE	cossack. rare rug dealer. sells military surplus. super villains, inc.	https://t.co/52TOsDw8YZ	2017-10-23T01:02:46.000Z	FALSE	1378899646630801416	0	8	41302	94	39752	1646	NA	NA	NA	NA	NA	1563662952892227584	2022-11-11
1578034288842264582	entelechiada	gm https://t.co/5maKDeYAHV	Twitter for iPhone	FALSE	1415419642693111809	und	1578034288842264582	2022-10-06T14:47:47.000Z	Entelechiada	https://pbs.twimg.com/profile_images/1501024250114678787/NFVa-9Xe_normal.jpg	United States	FALSE	entelechy + enchiladas: here for the good stuff	https://t.co/iG5j2CQlzw	2021-07-14T21:15:41.000Z	FALSE	1565024760718798848	0	8	3504	7	594	747	NA	NA	NA	NA	NA	NA	2022-10-06
1573056421914382336	EITC_Official	RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one…	Twitter for iPhone	FALSE	1520028377457078275	en	1573056421914382336	2022-09-22T21:07:31.000Z	👁 Inside The Classroom	https://pbs.twimg.com/profile_images/1522240813442342912/6oXnbTIV_normal.jpg	San Antonio, TX	FALSE	Providing receipts that refute, “it’s not happening.” Videos belong to their respective owners.	https://t.co/qI9VxrbMp2	2022-04-29T13:13:46.000Z	FALSE	1592899675530866689	53	0	7250	73	18285	947	retweeted	1571349132090179589	This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one of the best experiences [she has] had so far as an educator.” https://t.co/8DvkicyG5O	en	1520028377457078275	NA	2022-09-22
1583998206547226627	MsAvaArmstrong	RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full…	Twitter for iPhone	FALSE	2449913803	en	1583998206547226627	2022-10-23T01:46:16.000Z	AvaArmstrong, 🇺🇸Author	https://pbs.twimg.com/profile_images/1377022534235918341/r_zFt102_normal.jpg	Last Outpost	FALSE	Thriller-Romance author on AMAZON. MY OPINION STATED HERE. Expert at triggering Leftists. Trump won. CONTENT OF CHARACTER. Taken by @ EagleEyeFlyer ❤️ NO DM🚫	NA	2014-04-17T15:02:47.000Z	FALSE	1591443575249768448	518	0	785719	374	157472	86624	retweeted	1583992381535186944	President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full speech on Rumble: https://t.co/XJnh6tcghu https://t.co/xx9A404nF0	en	4041824789	NA	2022-10-23

Now, we need first to add some labels to these data. Specifically, we’re interested in the “toxicity” of tweet content.

How can we do this?

Well, I have provided you with already labelled data in the below:

tweets_sample  <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-tox-ranked.rds?raw=true")))

tweet_id	user_username	text	TOXICITY
1558572382947876865	crabcrawler1	RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP	0.1004571
1579236823527608321	HankVenture5	Blocked and reported https://t.co/18aKrUQy92	0.0673801
1591122529694818304	krus_chiki	@ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store	0.0570059
1578034288842264582	entelechiada	gm https://t.co/5maKDeYAHV	0.0138227
1573056421914382336	EITC_Official	RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one…	0.0426573
1583998206547226627	MsAvaArmstrong	RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full…	0.0451312

These tweets were actually labelled by another machine learning engine (we’ll talk about this later). But for now, we’re going to pretend they were labelled by humans. And we’re going to take a subset of the data to train our own classifier to label the rest of the dataset.

We’re going to say that anything above a score of .5 is “toxic.”

library(caret)
library(rsample)

# Select just the columns we need
tweets_tox_select <- tweets_tox_sample %>%
  select(tweet_id, user_username, text, TOXICITY) %>%
  mutate(toxbin = ifelse(TOXICITY>=.5, 1, 0))

set.seed(123)
split <- initial_split(tweets_tox_select, prop = .7, strata = "TOXICITY")
train <- training(split)
test  <- testing(split)

table(train$toxbin) %>% prop.table()

## 
##          0          1 
## 0.95927602 0.04072398

table(test$toxbin) %>% prop.table()

## 
##          0          1 
## 0.95438596 0.04561404

24.2 Naïve Bayes

This section was adapted from here. I thank Bradley Boehmke for providing these materials.

Bayesian probability is built on the idea of conditional probability, the probability of event A given that event B has occurred [P(A|B)].

For our Twitter data, this means we are interested in a tweet being “toxic” \(C_k\) (where \(C_{yes} = \text{toxic}\) and \(C_{no} = \text{non-toxic}\)) given that its predictor values are \(x_1, x_2, ..., x_p\). This can be written as \(P(C_k|x_1, ..., x_p)\).

The Bayesian formula for calculating this probability is

\[ P(C_k|X) = \frac{P(C_k) \cdot P(X|C_k)}{P(X)} \quad (1) \]

where:

\(P(C_k)\) is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of a tweet being toxic or not. And we know this is around 5%.
\(P(X)\) is the probability of the predictor variables (same as \(P(C_k|x_1, ..., x_p)\)). This will be the text of the tweets.
\(P(X|C_k)\) is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e. toxic or non-toxic), what is the probability of observing the predictor values.
\(P(C_k|X)\) is called our posterior probability. By combining our observed information, we are updating our a priori information on probabilities to compute a posterior probability that an observation has class \(C_k\).

library(SnowballC)
library(naivebayes)
library(tm)

# Preprocess the text data
# Create a Corpus from the text column
train_corpus <- Corpus(VectorSource(train$text))
test_corpus <- Corpus(VectorSource(test$text))

# Text preprocessing
preprocess <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

train_corpus <- preprocess(train_corpus)
test_corpus <- preprocess(test_corpus)

# Create a document-term matrix
train_dtm <- DocumentTermMatrix(train_corpus)
test_dtm <- DocumentTermMatrix(test_corpus, control=list(dictionary=Terms(train_dtm)))

# Convert dtm to matrix
train_matrix <- as.matrix(train_dtm)
test_matrix <- as.matrix(test_dtm)

# Fit Naive Bayes model
# Ensure factors are factors, and predictors are in the correct format
train$toxbin <- factor(train$toxbin)

# Use the train function from caret to train the model
trControl <- trainControl(method = "cv", number = 10)
nb_model <- train(x = train_matrix, y = train$toxbin, method = "naive_bayes", trControl = trControl)

# Predict on test data
test$toxbin <- factor(test$toxbin) # Make sure the test labels are also factors
predictions <- predict(nb_model, test_matrix)

# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test$toxbin)
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 272  13
##          1   0   0
##                                           
##                Accuracy : 0.9544          
##                  95% CI : (0.9233, 0.9755)
##     No Information Rate : 0.9544          
##     P-Value [Acc > NIR] : 0.5730645       
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.0008741       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9544          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9544          
##          Detection Rate : 0.9544          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

Great! We have 95% accuracy. That’s good, right? WRONG.

This is a common issue in imbalanced datasets where one class is significantly more prevalent than the other. Here are some key terms to remember regarding the model’s performance:

The Kappa statistic is 0, which indicates that the model is no better than random chance when taking into account the imbalance of the classes.
The Sensitivity (also known as Recall or True Positive Rate) is 1, which means that the model correctly identified all non-toxic tweets (class 0) as such. However, this is not informative since there are almost no toxic tweets (class 1) to begin with.
The Specificity is 0, indicating that the model did not correctly identify any toxic tweets (class 1). This means the model failed to identify the minority class entirely.
The Pos Pred Value (or Precision) for the non-toxic class is the same as the accuracy, which is again not informative due to the lack of true positives for the toxic class.
Balanced Accuracy is 0.5, which is the average of sensitivity and specificity. Since specificity is 0, this metric shows that the model is ineffective for the minority class.

24.2.1 Alternatives to locally trained models

We can then use the following code to classify this content. This code connects to the Google Perspective classifying engine.

All we need to do is tell it which features of text we want it to classify. We can do so using the peRspective library in R.

library(peRspective)
library(dplyr)
library(ggplot2)

models <- c(peRspective::prsp_models)
models_subset <- models[c(1:5, 7, 9:10, 12, 14)]
models_subset

toxtwts <- tweets_sample %>%
  prsp_stream(text = text,
              text_id = tweet_id, 
              score_model = models_subset,
              verbose = T,
              safe_output = T)

colnames(toxtwts) <- c("tweet_id", "error", models_subset)

tweets_sample_tox_r <- tweets_sample %>%
  left_join(toxtwts, by = "tweet_id")

And then we’re back to the data we started with!

23 Exercise 7: Sampling text information

25 Exercise 9: Validation