
Monthly Sentiment Analysis

This notebook ingests the preprocessed data from ../interim/text downloaded by download_datasets.ipynb and uses a text classifier from the `flair` package to identify the sentiment for each month.

Finally, the data is saved as a single csv file and pushed to remote storage for visualization with Superset.

[16]
import pandas as pd
import os
import datetime
from dotenv import load_dotenv
import seaborn as sns
from collections import Counter
import sys
import matplotlib.pyplot as plt

sys.path.append("../..")
from src import utils

from flair.data import Sentence
from flair.models import TextClassifier

load_dotenv("../../.env")
True
[2]
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")

LAST_MONTH_DATE = datetime.datetime.now().replace(day=1) - datetime.timedelta(
    days=1
)
year = LAST_MONTH_DATE.year
month = LAST_MONTH_DATE.month
[3]
if os.getenv("RUN_IN_AUTOMATION"):
    df = pd.read_csv(
        f"{BASE_PATH}/interim/text/fedora-devel-{year}-{month}.mbox.csv"
    )
    df.head()

else:
    df = utils.load_dataset(f"{BASE_PATH}/interim/text/")
    df.head()
[4]
df.shape
(4928, 3)
[5]
df["Date"] = df["Date"].apply(lambda x: pd.to_datetime(x))
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()
Message-ID Date Body Chunk
0 <c0266d75-cda5-dbac-765d-121bd7cfdf68@fedorapr... 2018-04-01 14:16:20+09:00 Hello, all: I am going to update oniguruma to ... 2018-04-01
1 <20180401074215.3B11460C851A@bastion01.phx2.fe... 2018-04-01 07:42:15+00:00 No missing expected images. Passed openQA test... 2018-04-01
2 <20180401100600.GA24503@branched-composer.phx2... 2018-04-01 10:06:00+00:00 OLD: Fedora2820180327.n.0 NEW: Fedora282018033... 2018-04-01
3 <20180401115240.4EE6B60874A4@bastion01.phx2.fe... 2018-04-01 11:52:40+00:00 No missing expected images. Failed openQA test... 2018-04-01
4 <20180401163037.11073.48046@mailman01.phx2.fed... 2018-04-01 16:30:37+00:00 Maybe one could package VeraCrypt? It is a qui... 2018-04-01

Sentiment Analysis

We will start by classifying a single sentence in order to understand the workflow of the flair package. The general workflow is to tokenize the sentence into words, then use a text classifier to evaluate the sentence's sentiment as a whole.

Single sentence

[6]
sentence = Sentence(df.iloc[0]["Body"])
[7]
print(sentence)
Sentence: "Hello , all : I am going to update oniguruma to 6.8.1 on rawhide and F28 . Compared to 6.7.1 , two symbols are removed ( although I dont think these symbols are used by other packages ) and so this will cause soname bump . I will rebuild all affected packages below : jq0:1.511.fc28.x8664 kitutuki0:0.9.618.fc28.x8664 luarex0:2.8.01.fc28.x8664 mfiler30:4.4.916.fc28.x8664 mfiler40:1.3.111.fc28.x8664 ochusha0:0.6.0.10.14.cvs20100817T0000.fc28.1.12.x8664 onigurumadevel0:6.7.11.fc28.x8664 phpmbstring0:7.2.41.fc29.x8664 saphire0:3.6.515.fc28.x8664 slangslsh0:2.3.21.fc29.x8664 xyzsh0:1.5.811.fc28.x8664 Regards , Mamoru" [− Tokens: 68]
[8]
classifier = TextClassifier.load("en-sentiment")
2021-04-07 11:00:58,857 loading file /Users/isabelzimmerman/.flair/models/sentiment-en-mix-distillbert_4.pt
[9]
classifier.predict(sentence)
[10]
print("Sentence above is: ", sentence.labels[0])
labels = sentence.labels[0]
Sentence above is: NEGATIVE (0.9974)
[11]
labels.value
'NEGATIVE'

We see here that a seemingly neutral sentence is classified as negative, with a very high (99.74%) certainty that this is correct. This highlights the fact that there may be flaws in the accuracy of our model, and we may want to explore other options. However, we will check a few months of data to see if this issue persists.

Now, we will use this same workflow in a function in order to iterate through months and see the overall sentiment.

[12]
def get_monthly_sentiment(corpus, chunk):
    """Classify each email body in `corpus` and count labels for month `chunk`."""
    sentiment_per_email = []

    for body in corpus:
        sentence_body = Sentence(body)
        classifier.predict(sentence_body)
        sentiment_per_email.append(sentence_body.labels[0].value)

    monthly_sentiment = Counter(sentiment_per_email).most_common(2)
    sentiment_df = pd.DataFrame(
        monthly_sentiment, columns=["sentiment", "count"]
    )
    sentiment_df.insert(0, "month", chunk)

    return sentiment_df
[13]
# For each document, determine if positive or negative sentiment occurs more often

months = df.Chunk.unique()

# just looking at first few months for a demo, due to length of time for generating results
subset_months = months[0:3]

monthly_frames = []
for month in subset_months:
    corpus = df[df.Chunk == month].Body
    monthly_frames.append(get_monthly_sentiment(corpus, month))

# DataFrame.append was removed in pandas 2.0; concatenate the per-month frames instead
monthly_sentiment = pd.concat(monthly_frames)

print(monthly_sentiment)
        month sentiment  count
0  2018-04-01  NEGATIVE    494
1  2018-04-01  POSITIVE    160
0  2020-05-01  NEGATIVE   1136
1  2020-05-01  POSITIVE    437
0  2020-09-01  NEGATIVE   1033
1  2020-09-01  POSITIVE    277

We can see from just these few months that the overall sentiment appears to be overwhelmingly negative. We'll take this with a grain of salt for a few reasons, primarily:

  1. This is a pre-made model, not trained on our data specifically.
  2. The technical vocabulary of Fedora's mailing list includes words that do not carry the same connotation there as in everyday language.
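One cheap way to probe the second point is to count how often routine technical terms that a general-purpose sentiment model may read as negative appear in the corpus. A minimal sketch on a toy stand-in for `df["Body"]` (the word list is an illustrative assumption, not a vetted set):

```python
import pandas as pd

# Toy stand-in for df["Body"]; in the notebook this would be the real column.
bodies = pd.Series([
    "Failed openQA tests: build still broken",
    "Please kill the stale process before rebuilding",
    "Thanks, the update looks great!",
])

# Routine technical terms a general-purpose model may score as negative.
trigger_words = ["failed", "kill", "bug", "broken"]
pattern = "|".join(trigger_words)

hit_rate = bodies.str.contains(pattern, case=False).mean()
print(f"{hit_rate:.0%} of messages contain a 'negative-looking' technical term")
```

A high hit rate would suggest the negative skew is partly an artifact of domain vocabulary rather than actual tone.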

The takeaway is not that we should expect toxicity from mailing lists; if anything, the opposite, since we saw neutral sentences being classified as negative. Still, this demo shows rough tooling that can be used to gauge the sentiment in mailing lists. Below we summarize and plot the data, keeping in mind that this is a tiny subset meant as a demonstration rather than an analysis.

[14]
monthly_sentiment["month"] = monthly_sentiment["month"].astype(
    "datetime64[ns]"
)
monthly_sentiment["count"] = monthly_sentiment["count"].astype("int64")
print(monthly_sentiment.dtypes)
month        datetime64[ns]
sentiment            object
count                 int64
dtype: object
[17]
# look at range of negative to positive sentiments
sns.barplot(data=monthly_sentiment, x="sentiment", y="count")
plt.show()
[18]
# look at change in sentiment over the months
sns.lineplot(data=monthly_sentiment, x="month", y="count", hue="sentiment")
plt.show()
[19]
# Save the metric (takes about 67 MB in csv data)
file = "monthly_sentiment.csv"
folder = f"{BASE_PATH}/processed/sentiment/"
if not os.path.exists(folder):
    os.makedirs(folder)

fullpath = os.path.join(folder, file)
monthly_sentiment.to_csv(fullpath, header=False)
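Because the csv is written with header=False (and pandas' default integer index as the first column), any downstream consumer has to supply the column names itself. A minimal sketch of reading the file back, using two sample rows from the output above:

```python
import io
import pandas as pd

# Two sample rows in the shape to_csv(header=False) produces:
# index, month, sentiment, count
csv_text = "0,2018-04-01,NEGATIVE,494\n1,2018-04-01,POSITIVE,160\n"

# The file has no header row, so names= must be provided on read.
restored = pd.read_csv(
    io.StringIO(csv_text),
    names=["row", "month", "sentiment", "count"],
    index_col="row",
)
print(restored)
```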

Conclusion

In the end, this notebook saved a small subset of data that showed an overwhelmingly negative sentiment for the mailing list. This tooling is rough and will probably need more fine-tuning before being pushed into automation.

It may be useful to expand our efforts to other Python packages or models, since this one does not seem to give us any deep insights out of the box.
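As a point of comparison before reaching for another package, even a tiny hand-rolled lexicon baseline can serve as a sanity check against the classifier's labels, since it at least scores neutral release announcements as neutral. The word lists below are illustrative assumptions, not a curated sentiment lexicon:

```python
import string

# Illustrative word lists (assumptions), not a real sentiment lexicon.
POSITIVE_WORDS = {"thanks", "great", "works", "fixed"}
NEGATIVE_WORDS = {"broken", "regression", "crash", "fails"}

def lexicon_sentiment(text):
    """Label text by counting positive vs. negative lexicon hits."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"

print(lexicon_sentiment("No missing expected images."))   # NEUTRAL
print(lexicon_sentiment("Thanks, the fix works great"))   # POSITIVE
```

More realistic alternatives would be an off-the-shelf lexicon model such as VADER, or fine-tuning a transformer on a sample of hand-labeled mailing-list messages.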
