Building the Term Document Matrix
A Term Document Matrix (TDM) is a matrix that records how often each term occurs in each document of a collection. Rows correspond to terms and columns to documents; the transposed layout, with documents in rows, is usually called a Document Term Matrix. Each cell holds the frequency of one term in one document.
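As a rough illustration of that layout, the sketch below builds a tiny term document matrix in base R. The three documents are made-up examples, not the tweets analyzed in this post, and splitting on whitespace stands in for the fuller cleaning and stemming a real pipeline would apply.

```r
# Toy documents (hypothetical, for illustration only)
docs <- c(
  doc1 = "ukraine needs time",
  doc2 = "time for change in ukraine",
  doc3 = "change must come"
)

# Lowercase each document and split it into terms on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Long format: one row per (term, document) occurrence
long <- data.frame(
  term = unlist(tokens, use.names = FALSE),
  doc  = rep(names(docs), lengths(tokens))
)

# Cross-tabulate so that rows are terms and columns are documents
tdm <- unclass(table(term = long$term, doc = long$doc))

# Total frequency of each term across all documents, most frequent first
term_freq <- sort(rowSums(tdm), decreasing = TRUE)

print(tdm)
print(head(term_freq, 3))
```

A sorted frequency vector such as `term_freq` is the kind of input that a bar chart or word cloud of the most common words would typically be drawn from.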
Visualizing the Most Common Words
Generate a Bar Chart
Generate a Word Cloud
Topic Modeling
Determine the ideal number of topics and identify them.
fit models... done.
calculate metrics:
CaoJuan2009... done.
Arun2010... done.
Deveaud2014... done.
- The CaoJuan2009 and Arun2010 metrics suggest a small number of topics, with 3 and 5 topics respectively standing out as points of interest.
- Deveaud2014 suggests that even fewer topics (2) might be optimal.
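The metric names above match those reported by the ldatuning package, where CaoJuan2009 and Arun2010 are read as "lower is better" and Deveaud2014 as "higher is better". Assuming that convention, the sketch below shows how a candidate topic count could be picked from a table of metric values. The numbers are hypothetical placeholders, not the values computed for this analysis.

```r
# Hypothetical metric values, for illustration only
# (NOT the results behind this analysis)
metrics <- data.frame(
  topics      = 2:6,
  CaoJuan2009 = c(0.30, 0.22, 0.25, 0.27, 0.31),
  Arun2010    = c(9.1, 8.4, 8.0, 7.6, 7.9),
  Deveaud2014 = c(1.20, 1.10, 1.05, 0.98, 0.90)
)

# CaoJuan2009 and Arun2010 are minimized; Deveaud2014 is maximized
best <- c(
  CaoJuan2009 = metrics$topics[which.min(metrics$CaoJuan2009)],
  Arun2010    = metrics$topics[which.min(metrics$Arun2010)],
  Deveaud2014 = metrics$topics[which.max(metrics$Deveaud2014)]
)

print(best)
```

When the metrics disagree, as they do here, the final choice remains a judgment call that also depends on how interpretable the resulting topics are.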
     Topic 1   Topic 2    Topic 3  Topic 4
[1,] "corrupt" "macron"   "ukrain" "ukrain"
[2,] "time"    "must"     "need"   "situat"
[3,] "countri" "ukrain"   "time"   "seem"
[4,] "govern"  "unitedst" "must"   "unitedst"
[5,] "must"    "see"      "chang"  "macron"
Sentiment Analysis in R
Sentiments in texts can be classified as positive, neutral, or negative. They can also be quantified using a numerical scale to express the intensity of the sentiment.
Sentiment Analysis using Syuzhet Method
Extract sentiment scores and view initial elements and summaries.
Code
# Calculate sentiments using the Syuzhet method
syuzhet_vector <- get_sentiment(text, method="syuzhet")
# Display first few entries of the sentiment scores
head(syuzhet_vector)
[1] 1.30 0.00 -0.60 0.65 -1.00 -0.25
Code
# Generate summary statistics for the Syuzhet sentiment scores
summary(syuzhet_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.5000 -0.7500 -0.1250 -0.1324 0.5000 3.9000
Sentiment Analysis using Bing Method
Apply the Bing method, inspect the first few entries, and summarize.
Code
# Calculate sentiments using the Bing method
bing_vector <- get_sentiment(text, method="bing")
# Display first few entries
head(bing_vector)
[1] 2 0 -1 0 0 -1
Code
# Summary statistics
summary(bing_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.0000 -1.0000 0.0000 -0.3594 0.0000 3.0000
Sentiment Analysis using AFINN Method
Apply the AFINN method, inspect the first few entries, and summarize.
Code
# Calculate sentiments using the AFINN method
afinn_vector <- get_sentiment(text, method="afinn")
# Display first few entries
head(afinn_vector)
[1] 3 0 -1 2 -3 -2
Code
# Summary statistics
summary(afinn_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-11.0000 -2.0000 0.0000 -0.6894 1.0000 10.0000
The matrix below compares the sign of the first six scores from each method, one row per method in the order Syuzhet, Bing, AFINN.
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 0 -1 1 -1 -1
[2,] 1 0 -1 0 0 -1
[3,] 1 0 -1 1 -1 -1
Bing Method: This method scores each word on a binary scale (word scores are summed per text, so a text's total can fall outside this range) where:
- -1 represents negative sentiment
- +1 denotes positive sentiment
AFINN Method: This approach scores each word on an integer scale (word scores are summed per text) ranging from:
- -5 (most negative)
- +5 (most positive)
Syuzhet Method: This technique uses the syuzhet package's default lexicon, which assigns each word a decimal score between -1 (most negative) and +1 (most positive). It is distinct from the NRC emotion lexicon, which is applied separately in the Emotion Analysis section.
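The three methods share the same basic mechanism: look up each word in a lexicon and add up the scores of the words that match. The sketch below illustrates this in base R with a hypothetical four-word lexicon on an AFINN-style scale. The words, scores, and the `score_sentence` helper are invented for illustration and are not taken from the real lexicons or from the syuzhet package.

```r
# Hypothetical mini-lexicon (scores invented for illustration)
toy_lexicon <- c(corrupt = -3, crisis = -3, hope = 2, support = 2)

# Hypothetical helper: sum the lexicon scores of the words in a sentence
score_sentence <- function(sentence, lexicon) {
  # Lowercase, then split on anything that is not a letter
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  # Words missing from the lexicon come back as NA and are ignored
  sum(lexicon[words], na.rm = TRUE)
}

negative_score <- score_sentence("Corrupt government deepens the crisis", toy_lexicon)
positive_score <- score_sentence("Hope and support", toy_lexicon)
neutral_score  <- score_sentence("The meeting is today", toy_lexicon)

print(c(negative_score, positive_score, neutral_score))
```

Because scores are summed, a text with several matching words can score well beyond the per-word scale, which is consistent with the summary ranges reported for each method above.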
To compare the results of the different methods, their outputs must first be brought to a common scale, because each method uses a different rating system. A practical approach in R is the sign function, which:
- Converts all positive numbers to 1
- Converts all negative numbers to -1
- Keeps zero values unchanged as 0
This simplification allows for direct comparison across different sentiment analysis methods.
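The sketch below applies `sign()` to the first six scores reported for each method earlier in this post and stacks the results for comparison. The score values are copied from the `head()` outputs above. The variable names and the agreement calculation are illustrative additions.

```r
# First six scores per method, copied from the head() outputs above
syuzhet_head <- c(1.30, 0.00, -0.60, 0.65, -1.00, -0.25)
bing_head    <- c(2, 0, -1, 0, 0, -1)
afinn_head   <- c(3, 0, -1, 2, -3, -2)

# Reduce every score to -1, 0, or 1 and stack the methods as rows
signs <- rbind(
  syuzhet = sign(syuzhet_head),
  bing    = sign(bing_head),
  afinn   = sign(afinn_head)
)

# Share of entries on which all three methods agree in sign
agree <- mean(
  signs["syuzhet", ] == signs["bing", ] &
    signs["bing", ] == signs["afinn", ]
)

print(signs)
print(agree)
```

For these six entries the methods agree on four; the two disagreements come from texts that Bing scores as 0 while the other two methods do not.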
Emotion Analysis
The NRC Word-Emotion Association Lexicon (EmoLex) facilitates the classification of words according to their association with various emotions and sentiments. EmoLex categorizes English words into eight distinct emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). Further details on EmoLex can be found on Saif Mohammad’s website.
The get_nrc_sentiment function generates a data frame where each row corresponds to a specific sentence from the analyzed text. This data frame has ten columns, one for each of the eight emotions and the two sentiment valences.
anger anticipation disgust fear joy sadness surprise trust negative positive
1 0 0 0 0 0 0 0 0 0 2
2 0 1 0 0 0 0 0 1 1 0
3 0 1 0 0 1 0 0 1 1 1
4 1 2 0 1 1 1 1 2 1 3
5 0 0 0 0 0 0 0 1 0 0
6 0 1 0 1 0 1 0 0 1 1
7 0 1 0 2 0 2 0 1 2 1
8 1 0 0 1 1 1 0 1 1 2
9 0 0 0 1 0 1 0 1 2 1
10 2 0 1 1 1 1 1 1 1 1
The next step is to create two charts to help visually analyze the emotions in the headline text. The first tallies the total number of words in the text associated with each of the eight emotions.
To better understand the main emotions in the headlines, the second shows these counts as proportions of the whole, that is, what share of the emotion-associated words was categorized under each emotion.
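The counts and proportions behind such charts can be sketched in base R. The example below uses only the ten rows of get_nrc_sentiment output printed above, so its totals describe those ten rows rather than the full dataset. The variable names are illustrative.

```r
# The ten rows of NRC output shown above, entered column by column
nrc_head <- data.frame(
  anger        = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 2),
  anticipation = c(0, 1, 1, 2, 0, 1, 1, 0, 0, 0),
  disgust      = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  fear         = c(0, 0, 0, 1, 0, 1, 2, 1, 1, 1),
  joy          = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 1),
  sadness      = c(0, 0, 0, 1, 0, 1, 2, 1, 1, 1),
  surprise     = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
  trust        = c(0, 1, 1, 2, 1, 0, 1, 1, 1, 1),
  negative     = c(0, 1, 1, 1, 0, 1, 2, 1, 2, 1),
  positive     = c(2, 0, 1, 3, 0, 1, 1, 2, 1, 1)
)

# Keep the eight emotions and leave out the two valence columns
emotions <- c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")

# Chart 1: total number of word-emotion associations per emotion
emotion_totals <- colSums(nrc_head[, emotions])

# Chart 2: each emotion's share of all emotion associations
emotion_share <- emotion_totals / sum(emotion_totals)

print(sort(emotion_totals, decreasing = TRUE))
print(round(sort(emotion_share, decreasing = TRUE), 3))
```

Either vector can be passed to `barplot()` to draw the corresponding chart.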
Citation
@article{infoepi_lab2024,
author = {{InfoEpi Lab}},
publisher = {Information Epidemiology Lab},
title = {Sentiments and {Emotion} in {Doppelgänger} {Tweet} {Text}},
journal = {InfoEpi Lab},
date = {2024-05-08},
url = {https://infoepi.org/posts/2024/05/08-doppelganger-tweet-text.html},
langid = {en}
}