24 Exercise 8: Supervised learning
24.1 Introduction
The hands-on exercise for this week focuses on how to classify a sample of text.
24.1.1 Data
We will be classifying the same tweets as discussed in Barrie (2023).
I benefited form this worksheet when preparing this tutorial.
You can do this locally on your computers with:
tweets_sample <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-ranked.rds?raw=true")))
tweet_id | user_username | text | source | possibly_sensitive | author_id | lang | conversation_id | created_at | user_name | user_profile_image_url | user_location | user_verified | user_description | user_url | user_created_at | user_protected | user_pinned_tweet_id | retweet_count | like_count | quote_count | user_tweet_count | user_list_count | user_followers_count | user_following_count | sourcetweet_type | sourcetweet_id | sourcetweet_text | sourcetweet_lang | sourcetweet_author_id | in_reply_to_user_id | date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1558572382947876865 | crabcrawler1 | RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP | Twitter Web App | FALSE | 1324742403492892673 | qme | 1558572382947876865 | 2022-08-13T21:53:07.000Z | Crab Man | https://pbs.twimg.com/profile_images/1587559247637843970/CNfOGD_7_normal.jpg | Ohio, USA | FALSE | DONATE PLEASE https://t.co/BF7MOhA7d0 https://t.co/qzu1mWI2EM NEWs and politics. By night time I normally watch TV and movies. Also, go to church on Sunday. | NA | 2020-11-06T15:56:36.000Z | FALSE | 1590851209111670784 | 1 | 0 | 0 | 83259 | 152 | 30253 | 1806 | retweeted | 1558570726965350401 | @ crabcrawler1 https://t.co/HcKNbVEQIP | qme | 1531709708783984642 | NA | 2022-08-13 |
1579236823527608321 | HankVenture5 | Blocked and reported https://t.co/18aKrUQy92 | Twitter for iPhone | FALSE | 1285388676751437824 | en | 1579236823527608321 | 2022-10-09T22:26:14.000Z | Hank Venture | https://pbs.twimg.com/profile_images/1389707714826182656/EFOyaLz2_normal.jpg | NA | FALSE | Slightly right-of-center so evil probably. Gen-X. Stamp Collector. Team Venture Stan account. Chief Executive Officer of #HankCo | NA | 2020-07-21T01:39:30.000Z | FALSE | 1491907548827381761 | 0 | 12 | 0 | 31043 | 10 | 3887 | 930 | quoted | 1579236650244517889 | No matter which city you’re in, all pizza is pretty much the same | en | 11203972 | NA | 2022-10-09 |
1591122529694818304 | krus_chiki | @ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store | Twitter for iPhone | FALSE | 922267018526707712 | en | 1591120235104727040 | 2022-11-11T17:35:47.000Z | krus🪖 | https://pbs.twimg.com/profile_images/1227595520773885952/8MWK0ssD_normal.jpg | NA | FALSE | cossack. rare rug dealer. sells military surplus. super villains, inc. | https://t.co/52TOsDw8YZ | 2017-10-23T01:02:46.000Z | FALSE | 1378899646630801416 | 0 | 8 | 0 | 41302 | 94 | 39752 | 1646 | NA | NA | NA | NA | NA | 1563662952892227584 | 2022-11-11 |
1578034288842264582 | entelechiada | gm https://t.co/5maKDeYAHV | Twitter for iPhone | FALSE | 1415419642693111809 | und | 1578034288842264582 | 2022-10-06T14:47:47.000Z | Entelechiada | https://pbs.twimg.com/profile_images/1501024250114678787/NFVa-9Xe_normal.jpg | United States | FALSE | entelechy + enchiladas: here for the good stuff | https://t.co/iG5j2CQlzw | 2021-07-14T21:15:41.000Z | FALSE | 1565024760718798848 | 0 | 8 | 0 | 3504 | 7 | 594 | 747 | NA | NA | NA | NA | NA | NA | 2022-10-06 |
1573056421914382336 | EITC_Official | RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one… | Twitter for iPhone | FALSE | 1520028377457078275 | en | 1573056421914382336 | 2022-09-22T21:07:31.000Z | 👁 Inside The Classroom | https://pbs.twimg.com/profile_images/1522240813442342912/6oXnbTIV_normal.jpg | San Antonio, TX | FALSE | Providing receipts that refute, “it’s not happening.” Videos belong to their respective owners. | https://t.co/qI9VxrbMp2 | 2022-04-29T13:13:46.000Z | FALSE | 1592899675530866689 | 53 | 0 | 0 | 7250 | 73 | 18285 | 947 | retweeted | 1571349132090179589 | This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one of the best experiences [she has] had so far as an educator.” https://t.co/8DvkicyG5O | en | 1520028377457078275 | NA | 2022-09-22 |
1583998206547226627 | MsAvaArmstrong | RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full… | Twitter for iPhone | FALSE | 2449913803 | en | 1583998206547226627 | 2022-10-23T01:46:16.000Z | AvaArmstrong, 🇺🇸Author | https://pbs.twimg.com/profile_images/1377022534235918341/r_zFt102_normal.jpg | Last Outpost | FALSE | Thriller-Romance author on AMAZON. MY OPINION STATED HERE. Expert at triggering Leftists. Trump won. CONTENT OF CHARACTER. Taken by @ EagleEyeFlyer ❤️ NO DM🚫 | NA | 2014-04-17T15:02:47.000Z | FALSE | 1591443575249768448 | 518 | 0 | 0 | 785719 | 374 | 157472 | 86624 | retweeted | 1583992381535186944 | President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full speech on Rumble: https://t.co/XJnh6tcghu https://t.co/xx9A404nF0 | en | 4041824789 | NA | 2022-10-23 |
Now, we need first to add some labels to these data. Specifically, we’re interested in the “toxicity” of tweet content.
How can we do this?
Well, I have provided you with already labelled data in the below:
tweets_sample <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-tox-ranked.rds?raw=true")))
tweet_id | user_username | text | TOXICITY |
---|---|---|---|
1558572382947876865 | crabcrawler1 | RT @ Fire_fux: @ crabcrawler1 https://t.co/HcKNbVEQIP | 0.1004571 |
1579236823527608321 | HankVenture5 | Blocked and reported https://t.co/18aKrUQy92 | 0.0673801 |
1591122529694818304 | krus_chiki | @ eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store | 0.0570059 |
1578034288842264582 | entelechiada | gm https://t.co/5maKDeYAHV | 0.0138227 |
1573056421914382336 | EITC_Official | RT @ EITC_Official: This 8th grade science teacher at @ RVKPanthers explains that one of her boy students wearing fingernail polish was “one… | 0.0426573 |
1583998206547226627 | MsAvaArmstrong | RT @ RSBNetwork: President Donald J. Trump: "They’re coming after me because I am fighting for you, and that is true." Join us for the full… | 0.0451312 |
These tweets were actually labelled by another machine learning engine (we’ll talk about this later). But for now, we’re going to pretend they were labelled by humans. And we’re going to take a subset of the data to train our own classifier to label the rest of the dataset.
We’re going to say that anything above a score of .5 is “toxic.”
library(caret)
library(rsample)
# Select just the columns we need
tweets_tox_select <- tweets_tox_sample %>%
select(tweet_id, user_username, text, TOXICITY) %>%
mutate(toxbin = ifelse(TOXICITY>=.5, 1, 0))
set.seed(123)
split <- initial_split(tweets_tox_select, prop = .7, strata = "TOXICITY")
train <- training(split)
test <- testing(split)
table(train$toxbin) %>% prop.table()
##
## 0 1
## 0.95927602 0.04072398
table(test$toxbin) %>% prop.table()
##
## 0 1
## 0.95438596 0.04561404
24.2 Naïve Bayes
This section was adapted from here. I thank Bradley Boehmke for providing these materials.
Bayesian probability is built on the idea of conditional probability, the probability of event A given that event B has occurred [P(A|B)].
For our Twitter data, this means we are interested in a tweet being “toxic” \(C_k\) (where \(C_{yes} = \text{toxic}\) and \(C_{no} = \text{non-toxic}\)) given that its predictor values are \(x_1, x_2, ..., x_p\). This can be written as \(P(C_k|x_1, ..., x_p)\).
The Bayesian formula for calculating this probability is
\[ P(C_k|X) = \frac{P(C_k) \cdot P(X|C_k)}{P(X)} \quad (1) \]
where:
\(P(C_k)\) is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of a tweet being toxic or not. And we know this is around 5%.
\(P(X)\) is the probability of the predictor variables (same as \(P(C_k|x_1, ..., x_p)\)). This will be the text of the tweets.
\(P(X|C_k)\) is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e. toxic or non-toxic), what is the probability of observing the predictor values.
\(P(C_k|X)\) is called our posterior probability. By combining our observed information, we are updating our a priori information on probabilities to compute a posterior probability that an observation has class \(C_k\).
library(SnowballC)
library(naivebayes)
library(tm)
# Preprocess the text data
# Create a Corpus from the text column
train_corpus <- Corpus(VectorSource(train$text))
test_corpus <- Corpus(VectorSource(test$text))
# Text preprocessing
preprocess <- function(corpus) {
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
return(corpus)
}
train_corpus <- preprocess(train_corpus)
test_corpus <- preprocess(test_corpus)
# Create a document-term matrix
train_dtm <- DocumentTermMatrix(train_corpus)
test_dtm <- DocumentTermMatrix(test_corpus, control=list(dictionary=Terms(train_dtm)))
# Convert dtm to matrix
train_matrix <- as.matrix(train_dtm)
test_matrix <- as.matrix(test_dtm)
# Fit Naive Bayes model
# Ensure factors are factors, and predictors are in the correct format
train$toxbin <- factor(train$toxbin)
# Use the train function from caret to train the model
trControl <- trainControl(method = "cv", number = 10)
nb_model <- train(x = train_matrix, y = train$toxbin, method = "naive_bayes", trControl = trControl)
# Predict on test data
test$toxbin <- factor(test$toxbin) # Make sure the test labels are also factors
predictions <- predict(nb_model, test_matrix)
# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test$toxbin)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 272 13
## 1 0 0
##
## Accuracy : 0.9544
## 95% CI : (0.9233, 0.9755)
## No Information Rate : 0.9544
## P-Value [Acc > NIR] : 0.5730645
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.0008741
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9544
## Neg Pred Value : NaN
## Prevalence : 0.9544
## Detection Rate : 0.9544
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
Great! We have 95% accuracy. That’s good, right? WRONG.
This is a common issue in imbalanced datasets where one class is significantly more prevalent than the other. Here are some key terms to remember regarding the model’s performance:
- The
Kappa
statistic is 0, which indicates that the model is no better than random chance when taking into account the imbalance of the classes. - The
Sensitivity
(also known as Recall or True Positive Rate) is 1, which means that the model correctly identified all non-toxic tweets (class 0) as such. However, this is not informative since there are almost no toxic tweets (class 1) to begin with. - The
Specificity
is 0, indicating that the model did not correctly identify any toxic tweets (class 1). This means the model failed to identify the minority class entirely. - The
Pos Pred Value
(or Precision) for the non-toxic class is the same as the accuracy, which is again not informative due to the lack of true positives for the toxic class. -
Balanced Accuracy
is 0.5, which is the average of sensitivity and specificity. Since specificity is 0, this metric shows that the model is ineffective for the minority class.
24.2.1 Alternatives to locally trained models
We can then use the following code to classify this content. This code connects to the Google Perspective classifying engine.
All we need to do is tell it which features of text we want it to classify. We can do so using the peRspective
library in R.
library(peRspective)
library(dplyr)
library(ggplot2)
models <- c(peRspective::prsp_models)
models_subset <- models[c(1:5, 7, 9:10, 12, 14)]
models_subset
toxtwts <- tweets_sample %>%
prsp_stream(text = text,
text_id = tweet_id,
score_model = models_subset,
verbose = T,
safe_output = T)
colnames(toxtwts) <- c("tweet_id", "error", models_subset)
tweets_sample_tox_r <- tweets_sample %>%
left_join(toxtwts, by = "tweet_id")
And then we’re back to the data we started with!