Monthly Sentiment Analysis
This notebook ingests the preprocessed data from ../interim/text
(downloaded by download_datasets.ipynb)
and uses a text classifier from the flair package
to identify the overall sentiment for each month.
Finally, the data is saved as a single CSV file and pushed to remote storage for visualization with Superset.
import pandas as pd
import os
import datetime
from dotenv import load_dotenv
import seaborn as sns
from collections import Counter
import sys
import matplotlib.pyplot as plt
sys.path.append("../..")
from src import utils
from flair.data import Sentence
from flair.models import TextClassifier
load_dotenv("../../.env")
True
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")
LAST_MONTH_DATE = datetime.datetime.now().replace(day=1) - datetime.timedelta(
days=1
)
year = LAST_MONTH_DATE.year
month = LAST_MONTH_DATE.month
if os.getenv("RUN_IN_AUTOMATION"):
df = pd.read_csv(
f"{BASE_PATH}/interim/text/fedora-devel-{year}-{month}.mbox.csv"
)
df.head()
else:
df = utils.load_dataset(f"{BASE_PATH}/interim/text/")
df.head()
df.shape
(4928, 3)
df["Date"] = pd.to_datetime(df["Date"])
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()
| | Message-ID | Date | Body | Chunk |
|---|---|---|---|---|
| 0 | <c0266d75-cda5-dbac-765d-121bd7cfdf68@fedorapr... | 2018-04-01 14:16:20+09:00 | Hello, all: I am going to update oniguruma to ... | 2018-04-01 |
| 1 | <20180401074215.3B11460C851A@bastion01.phx2.fe... | 2018-04-01 07:42:15+00:00 | No missing expected images. Passed openQA test... | 2018-04-01 |
| 2 | <20180401100600.GA24503@branched-composer.phx2... | 2018-04-01 10:06:00+00:00 | OLD: Fedora2820180327.n.0 NEW: Fedora282018033... | 2018-04-01 |
| 3 | <20180401115240.4EE6B60874A4@bastion01.phx2.fe... | 2018-04-01 11:52:40+00:00 | No missing expected images. Failed openQA test... | 2018-04-01 |
| 4 | <20180401163037.11073.48046@mailman01.phx2.fed... | 2018-04-01 16:30:37+00:00 | Maybe one could package VeraCrypt? It is a qui... | 2018-04-01 |
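The "Chunk" column above buckets each timestamp into the first day of its month. As a minimal standard-library sketch of that logic, isolated from the DataFrame:

```python
import datetime


def month_chunk(ts: datetime.datetime) -> datetime.date:
    # Bucket a timestamp into the first day of its month,
    # mirroring the "Chunk" column above.
    return datetime.date(ts.year, ts.month, 1)


# The first email's date maps to the 2018-04-01 chunk.
print(month_chunk(datetime.datetime(2018, 4, 1, 14, 16, 20)))  # 2018-04-01
```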
Sentiment Analysis
We will start by classifying a single email in order to understand the workflow of using the flair
package. The general workflow is to tokenize the text into a Sentence object, then use a pretrained text classifier to evaluate the sentiment of the text as a whole.
Single sentence
sentence = Sentence(df.iloc[0]["Body"])
print(sentence)
Sentence: "Hello , all : I am going to update oniguruma to 6.8.1 on rawhide and F28 . Compared to 6.7.1 , two symbols are removed ( although I dont think these symbols are used by other packages ) and so this will cause soname bump . I will rebuild all affected packages below : jq0:1.511.fc28.x8664 kitutuki0:0.9.618.fc28.x8664 luarex0:2.8.01.fc28.x8664 mfiler30:4.4.916.fc28.x8664 mfiler40:1.3.111.fc28.x8664 ochusha0:0.6.0.10.14.cvs20100817T0000.fc28.1.12.x8664 onigurumadevel0:6.7.11.fc28.x8664 phpmbstring0:7.2.41.fc29.x8664 saphire0:3.6.515.fc28.x8664 slangslsh0:2.3.21.fc29.x8664 xyzsh0:1.5.811.fc28.x8664 Regards , Mamoru" [− Tokens: 68]
classifier = TextClassifier.load("en-sentiment")
2021-04-07 11:00:58,857 loading file /Users/isabelzimmerman/.flair/models/sentiment-en-mix-distillbert_4.pt
classifier.predict(sentence)
print("Sentence above is: ", sentence.labels[0])
labels = sentence.labels[0]
Sentence above is: NEGATIVE (0.9974)
labels.value
'NEGATIVE'
We see here that a seemingly neutral sentence is classified as negative with very high (99.74%) confidence. This suggests the model's accuracy may be limited on our domain, and we may want to explore other options. However, we will check a few months of data to see whether this issue persists.
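One low-effort mitigation (not implemented in this notebook) would be to only trust high-confidence predictions and treat the rest as neutral. The helper below is a hypothetical sketch of that idea, assuming the label value and confidence score that flair exposes on each prediction:

```python
def soften_label(value: str, score: float, threshold: float = 0.9) -> str:
    # Hypothetical helper: keep the predicted label only when the
    # classifier's confidence clears the threshold; otherwise call it NEUTRAL.
    return value if score >= threshold else "NEUTRAL"


# The sentence above (NEGATIVE at 0.9974 confidence) would still count as negative,
# but a borderline 0.55 prediction would be set aside as NEUTRAL.
print(soften_label("NEGATIVE", 0.9974))  # NEGATIVE
print(soften_label("NEGATIVE", 0.55))    # NEUTRAL
```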
Now, we will wrap this same workflow in a function so that we can iterate through the months and see the overall sentiment.
def get_monthly_sentiment(corpus, chunk):
    """Classify each email body in `corpus` and count the sentiment labels for the month `chunk`."""
    sentiment_per_email = []
    for body in corpus:
        sentence_body = Sentence(body)
        classifier.predict(sentence_body)
        sentiment_per_email.append(sentence_body.labels[0].value)
    monthly_sentiment = Counter(sentiment_per_email).most_common(2)
    sentiment_df = pd.DataFrame(
        monthly_sentiment, columns=["sentiment", "count"]
    )
    sentiment_df.insert(0, "month", chunk)
    return sentiment_df
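To see how the `Counter` step shapes the result, here is a toy run with hand-made labels (no classifier involved):

```python
from collections import Counter

# Pretend these are the per-email labels for one month.
labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

# most_common(2) returns the two labels with the highest counts,
# as (label, count) pairs sorted by count descending.
counts = Counter(labels).most_common(2)
print(counts)  # [('NEGATIVE', 3), ('POSITIVE', 2)]
```

Those pairs feed directly into the `pd.DataFrame(..., columns=["sentiment", "count"])` call above.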
# For each document, determine if positive or negative sentiment occurs more often
months = df.Chunk.unique()
# Only classify the first few months for this demo, since generating results is slow
subset_months = months[0:3]
# DataFrame.append was removed in pandas 2.0, so collect the per-month
# frames in a list and concatenate them at the end.
monthly_frames = []
for month in subset_months:
    corpus = df[df.Chunk == month].Body
    monthly_frames.append(get_monthly_sentiment(corpus, month))
monthly_sentiment = pd.concat(monthly_frames)
print(monthly_sentiment)
month sentiment count
0 2018-04-01 NEGATIVE 494
1 2018-04-01 POSITIVE 160
0 2020-05-01 NEGATIVE 1136
1 2020-05-01 POSITIVE 437
0 2020-09-01 NEGATIVE 1033
1 2020-09-01 POSITIVE 277
We can see from just these few months that the overall sentiment appears to be overwhelmingly negative. We'll take this with a grain of salt for a few reasons, primarily:
- This is a pre-made model, not trained on our data specifically.
- Fedora's mailing list is highly technical, and words common in that domain may carry different connotations than they do in everyday language.
The takeaway here is not to expect toxicity from mailing lists; in fact, it's probably quite the opposite, since we saw neutral sentences being classified as negative. Regardless, this demo shows rough tooling that can be used to understand the sentiment in mailing lists. Below we plot a summary of the data, keeping in mind that this is a tiny subset meant as a demonstration rather than an analysis.
monthly_sentiment["month"] = monthly_sentiment["month"].astype(
"datetime64[ns]"
)
monthly_sentiment["count"] = monthly_sentiment["count"].astype("int64")
print(monthly_sentiment.dtypes)
month datetime64[ns]
sentiment object
count int64
dtype: object
# look at range of negative to positive sentiments
sns.barplot(data=monthly_sentiment, x="sentiment", y="count")
plt.show()
# look at change in sentiment over the months
sns.lineplot(data=monthly_sentiment, x="month", y="count", hue="sentiment")
plt.show()
# Save the metric (takes about 67 MB in csv data)
file = "monthly_sentiment.csv"
folder = f"{BASE_PATH}/processed/sentiment/"
if not os.path.exists(folder):
os.makedirs(folder)
fullpath = os.path.join(folder, file)
monthly_sentiment.to_csv(fullpath, header=False)
Conclusion
In the end, this notebook saved a small subset of data that showed an overwhelmingly negative sentiment for the mailing list. This tooling is rough and will probably need more fine-tuning before being pushed into automation.
It may be useful to expand our efforts by looking at other Python packages or models, since this one does not seem to give us any deep insight out of the box.