Contributor Analysis
This notebook ingests the preprocessed data in ../interim/metadata, which is downloaded by download_datasets.ipynb, and quantifies the activity of individual contributors to the mailing list, including the frequency of messages that start a thread and the frequency of replies to existing threads. Both analyses are performed at monthly intervals.
Finally, the analyses are merged and saved as a single CSV file that is pushed to remote storage for visualization with Superset.
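The split between thread-starting messages and replies can be approximated from the subject line alone. The following is a minimal sketch on made-up rows, assuming that a subject beginning with "Re: " marks a reply; the `toy` frame and the `kind` column are illustrative names and not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for the real metadata; only the column names
# ("Chunk", "Subject") mirror the notebook.
toy = pd.DataFrame(
    {
        "Chunk": ["2021-01", "2021-01", "2021-01", "2021-02"],
        "Subject": [
            "License change: R-usethis GPLv3 -> MIT",
            "Re: License change: R-usethis GPLv3 -> MIT",
            "Re: License change: R-usethis GPLv3 -> MIT",
            "Fedora 33 Mass Rebuild",
        ],
    }
)

# Assumption: a "Re: " prefix identifies a reply, anything else starts a thread.
is_reply = toy["Subject"].str.startswith("Re: ")
toy["kind"] = is_reply.map({True: "replies", False: "initial_posts"})

# One row per month, one column per message kind.
activity = toy.groupby(["Chunk", "kind"]).size().unstack(fill_value=0)
print(activity)
```

Counting with `size()` and `unstack(fill_value=0)` keeps months that have no replies in the result with an explicit zero.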
[1]
import pandas as pd
import os
import datetime
import re
from pathlib import Path
from dotenv import load_dotenv
import matplotlib.pyplot as plt
import seaborn as sns
load_dotenv("../../.env")
import sys
sys.path.append("../..")
from src import utils
[2]
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data/")
[3]
df = utils.load_dataset(f"{BASE_PATH}/interim/metadata/")
[5]
df.head()
Message-ID | Date | Subject | |
---|---|---|---|
0 | <CAJCQCtQ=Rif-5LcbHPB=CJ3+c7U20yqyiqShVTTaQH7P... | Thu, 31 Dec 2020 17:49:41 -0700 | Re: Fedora 34 Change: Enable btrfs transparent... |
1 | <45a1abbe39ccd56d4dd3b62d09214e78ae5fa699.came... | Thu, 31 Dec 2020 17:10:44 -0800 | Re: Thoughts about packaging a standalone pyth... |
2 | <alpine.DEB.2.22.394.2012312201030.356797@bear... | Thu, 31 Dec 2020 22:02:44 -0500 | Re: Thoughts about packaging a standalone pyth... |
3 | <CAJP_izfLFAVL4CTFzS7=W7T6izt_WQG81Ys=GqdQ3yZG... | Thu, 31 Dec 2020 23:03:56 -0500 | License change: R-usethis GPLv3 -> MIT |
4 | <20210101073305.9666D3052DF1@bastion01.iad2.fe... | Fri, 01 Jan 2021 07:33:05 +0000 | Fedora-Cloud-33-20210101.0 compose check report |
Minor preprocessing
Here we do some minor cleaning of the "subject", "text" and "date" fields so that the dataframe can be rearranged with all messages from the same thread grouped together.
[6]
# strip a leading "Re: " so that replies share the subject of their thread
def match_threads(subject):
    return re.sub(r"^{0}".format(re.escape("Re: ")), "", subject)
# remove trailing emails
def remove_email(text):
    return re.sub(r" <.*", "", text)
# convert date string to datetime object
def parse_date(date):
    return pd.to_datetime(date)
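`match_threads` removes only a single leading "Re: ", so subjects such as "RE: Re: ..." or "Fwd: ..." would still be counted as separate threads. A more tolerant variant might look like the sketch below; `normalize_subject` is a hypothetical helper, not part of the notebook or of `src.utils`.

```python
import re

# Matches any run of leading reply/forward markers ("Re:", "RE:", "Fw:", "Fwd:"),
# with or without surrounding whitespace.
_PREFIXES = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", flags=re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Return the subject with all leading reply/forward markers removed."""
    return _PREFIXES.sub("", subject).strip()


print(normalize_subject("Re: RE: Fwd: Gating packages in Rawhide"))
print(normalize_subject("License change: R-usethis GPLv3 -> MIT"))
```

Subjects that merely begin with the letters "Re", such as "Regarding ...", are left untouched because the pattern requires a colon right after the marker.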
[7]
# apply our transformations
df["Subject"] = df["Subject"].apply(match_threads)
df["Date"] = df["Date"].apply(parse_date)
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()
Message-ID | Date | Subject | Chunk | |
---|---|---|---|---|
0 | <57d76b65c8c848f7e1b83e56ff8f094ce3855479.came... | 2018-01-01 02:42:21 | Anything we can do to temporarily halt new bug... | 2018-01-01 |
1 | <20180101172438.GA2871@flame.pingoured.fr> | 2018-01-01 17:24:38 | [Bug 1529276] New: findbugs-contrib-7.2.0.sb i... | 2018-01-01 |
2 | <20180101220004.0632660A400B@fedocal02.phx2.fe... | 2018-01-01 22:00:04 | [Fedocal] Reminder meeting : Modularity Office... | 2018-01-01 |
3 | <20180101220004.0E97560A400C@fedocal02.phx2.fe... | 2018-01-01 22:00:04 | [Fedocal] Reminder meeting : Modularity Office... | 2018-01-01 |
4 | <20180101221314.GA52721@rawhide-composer.phx2.... | 2018-01-01 22:13:15 | Fedora rawhide compose report: 20180101.n.0 ch... | 2018-01-01 |
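The raw `Date` values are RFC 2822 strings with differing UTC offsets, so a message sent late on the last day of a month can fall into either month depending on the reference time zone. The sketch below uses only the standard library and assumes UTC as the reference, which the notebook does not state; `month_chunk` is an illustrative name.

```python
from datetime import date, timezone
from email.utils import parsedate_to_datetime


def month_chunk(raw_date: str) -> date:
    """Map an RFC 2822 date string to the first day of its month in UTC."""
    # parsedate_to_datetime returns an aware datetime when the string carries
    # a numeric offset; a "-0000" offset would yield a naive datetime instead.
    parsed = parsedate_to_datetime(raw_date).astimezone(timezone.utc)
    return date(parsed.year, parsed.month, 1)


# 17:49 at UTC-7 on 31 December is already 1 January in UTC.
print(month_chunk("Thu, 31 Dec 2020 17:49:41 -0700"))
print(month_chunk("Fri, 01 Jan 2021 07:33:05 +0000"))
```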
Quantify contributor activity
[8]
# Count messages per month over the entire dataset
Monthly = df.groupby("Chunk").count()
Monthly.head(15)
Chunk | Message-ID | Date | Subject |
---|---|---|---|
2018-01-01 | 1429 | 1429 | 1429 |
2018-02-01 | 1226 | 1226 | 1226 |
2018-03-01 | 1231 | 1231 | 1231 |
2018-04-01 | 678 | 678 | 678 |
2018-05-01 | 723 | 723 | 723 |
2018-06-01 | 1054 | 1054 | 1054 |
2018-07-01 | 1031 | 1031 | 1031 |
2018-08-01 | 802 | 802 | 802 |
2018-09-01 | 832 | 832 | 832 |
2018-10-01 | 987 | 987 | 987 |
2018-11-01 | 1086 | 1086 | 1086 |
2018-12-01 | 706 | 706 | 706 |
2019-01-01 | 1003 | 1003 | 1003 |
2019-02-01 | 1008 | 1008 | 1008 |
2019-03-01 | 1041 | 1041 | 1041 |
[9]
sns.set(rc={"figure.figsize": (20, 10)})
plt.plot(Monthly.Subject)
plt.title("Number of emails per month")
plt.ylabel("Number of emails")
plt.xlabel("Date")
plt.show()
[10]
# count the number of unique threads (subjects) per month
Subject = df.groupby(["Chunk"]).Subject.unique().apply(len)
Subject.head(5)
Chunk
2018-01-01 331
2018-02-01 262
2018-03-01 356
2018-04-01 258
2018-05-01 207
Name: Subject, dtype: int64
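Counting threads this way treats every distinct cleaned subject within a month as one thread. The same count can probably be obtained more directly with `nunique`, as in this small sketch on toy data (the `toy` frame is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Chunk": ["2018-01", "2018-01", "2018-01", "2018-02"],
        "Subject": ["A", "A", "B", "A"],
    }
)

# The notebook's formulation: collect the unique subjects, then take the length.
via_unique = toy.groupby("Chunk")["Subject"].unique().apply(len)

# Equivalent one-step formulation when no subjects are missing.
via_nunique = toy.groupby("Chunk")["Subject"].nunique()

print(via_unique.tolist(), via_nunique.tolist())
```

One difference to keep in mind: `nunique` skips missing values by default, whereas `unique` keeps them, so the two can disagree if some subjects are NaN.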
[11]
plt.plot(Subject)
plt.title("Number of threads per month")
plt.ylabel("Number of threads")
plt.xlabel("Date")
plt.show()
[12]
# top 5 subjects (threads) per month
top_subjects = {}
for month in df.Chunk.unique():
    # wrap the list in another list so that each month fills a single cell
    top_threads = [
        df[df["Chunk"] == month].Subject.value_counts().head(5).index.tolist()
    ]
    # thread_count = df[df["Chunk"] == month].Subject.value_counts().head(3).tolist()
    # current_month = list(zip(top_threads,thread_count))
    top_subjects[month] = top_threads
[13]
x = pd.DataFrame(top_subjects).T
x
0 | |
---|---|
2018-01-01 | [EPEL support in "master" branch (aka speeding... |
2018-02-01 | [[ACTION NEEDED] Missing BuildRequires: gcc/gc... |
2018-03-01 | [Gating packages in Rawhide, Goodbye nvr.rspli... |
2018-04-01 | [Status of OwnCloud/NextCloud, Intent to orpha... |
2018-05-01 | [Prioritizing ~/.local/bin over /usr/bin on th... |
2018-06-01 | [F29 System Wide Change: Make BootLoaderSpec t... |
2018-07-01 | [[HEADS UP] Removal of GCC from the buildroot,... |
2018-08-01 | [Proposal: Reduce *-devel packages dependencie... |
2018-09-01 | [Semi-serious proposal: drop all optional entr... |
2018-10-01 | [Fedora should replace mailing lists with Disc... |
2018-11-01 | [Fedora Lifecycles: imagine longer-term possib... |
2018-12-01 | [Fedora 30 System-Wide Change proposal: Remove... |
2019-01-01 | [F30: System-Wide Change proposal: DNF UUID, F... |
2019-02-01 | [Orphaned packages that will be retired (and e... |
2019-03-01 | [More than 10% of all Fedora spec files are no... |
2019-04-01 | [Can we maybe reduce the set of packages we in... |
2019-05-01 | [Upgrade to F30 gone wrong, Fedora 31 System-W... |
2019-06-01 | [Modularity vs. libgit, Fedora 31 System-Wide ... |
2019-07-01 | [Rolling out Phase I of rawhide package gating... |
2019-08-01 | [Fedora Workstation and disabled by default fi... |
2019-09-01 | [Fedora 31 System-Wide Change proposal (late):... |
2019-10-01 | [Modularity and the system-upgrade path, Defin... |
2019-11-01 | [Modularity: The Official Complaint Thread, Fe... |
2019-12-01 | [Fedora 32 System-Wide Change proposal: Disall... |
2020-01-01 | [Git Forge Requirements: Please see the Commun... |
2020-02-01 | [Ideas and proposal for removing changelog and... |
2020-03-01 | [Fedora 33 System-Wide Change proposal: ELN Bu... |
2020-04-01 | [CPE Git Forge Decision, Fedora 33 System-Wide... |
2020-05-01 | [Is dist-git a good place for work?, Re-Launch... |
2020-06-01 | [Fedora 33 System-Wide Change proposal: Make n... |
2020-07-01 | [The future of legacy BIOS support in Fedora.,... |
2020-08-01 | [EarlyOOM +ZRAM Only, Fedora 33 Mass Rebuild, ... |
2020-09-01 | [This is bad, was Re: Fedora 33 System-Wide Ch... |
2020-10-01 | [[ELN] gcc is going to be updated to gcc11 in ... |
2020-11-01 | [Fedora 34 Change: Route all Audio to PipeWire... |
2020-12-01 | [Fedora 34 Change: GitRepos-master-to-main (Se... |
2021-01-01 | [Proposal to deprecated `fedpkg local`, Fedora... |
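The month-by-month loop filters the full dataframe once per month. A grouped version that yields one list per month in a single pass might look like the following sketch; `top_n` and the toy data are illustrative and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Chunk": ["2018-01"] * 6 + ["2018-02"],
        "Subject": ["x", "x", "x", "y", "y", "z", "w"],
    }
)


def top_n(subjects: pd.Series, n: int = 5) -> list:
    """Return the n most frequent subjects, most frequent first."""
    return subjects.value_counts().head(n).index.tolist()


# One list of subjects per month, indexed by Chunk.
top_threads = toy.groupby("Chunk")["Subject"].apply(top_n, n=2)
print(top_threads)
```

The result is a Series of lists, so it can be joined to the monthly counts without first building a dictionary.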
[14]
x = x.join(Subject)
x = x.join(Monthly["Message-ID"])
[15]
x.columns = ["Top_5_Threads", "Number_of_threads", "Number_of_Messages"]
x
Top_5_Threads | Number_of_threads | Number_of_Messages | |
---|---|---|---|
2018-01-01 | [EPEL support in "master" branch (aka speeding... | 331 | 1429 |
2018-02-01 | [[ACTION NEEDED] Missing BuildRequires: gcc/gc... | 262 | 1226 |
2018-03-01 | [Gating packages in Rawhide, Goodbye nvr.rspli... | 356 | 1231 |
2018-04-01 | [Status of OwnCloud/NextCloud, Intent to orpha... | 258 | 678 |
2018-05-01 | [Prioritizing ~/.local/bin over /usr/bin on th... | 207 | 723 |
2018-06-01 | [F29 System Wide Change: Make BootLoaderSpec t... | 240 | 1054 |
2018-07-01 | [[HEADS UP] Removal of GCC from the buildroot,... | 300 | 1031 |
2018-08-01 | [Proposal: Reduce *-devel packages dependencie... | 274 | 802 |
2018-09-01 | [Semi-serious proposal: drop all optional entr... | 259 | 832 |
2018-10-01 | [Fedora should replace mailing lists with Disc... | 255 | 987 |
2018-11-01 | [Fedora Lifecycles: imagine longer-term possib... | 247 | 1086 |
2018-12-01 | [Fedora 30 System-Wide Change proposal: Remove... | 215 | 706 |
2019-01-01 | [F30: System-Wide Change proposal: DNF UUID, F... | 247 | 1003 |
2019-02-01 | [Orphaned packages that will be retired (and e... | 263 | 1008 |
2019-03-01 | [More than 10% of all Fedora spec files are no... | 284 | 1041 |
2019-04-01 | [Can we maybe reduce the set of packages we in... | 294 | 961 |
2019-05-01 | [Upgrade to F30 gone wrong, Fedora 31 System-W... | 267 | 845 |
2019-06-01 | [Modularity vs. libgit, Fedora 31 System-Wide ... | 202 | 817 |
2019-07-01 | [Rolling out Phase I of rawhide package gating... | 271 | 1067 |
2019-08-01 | [Fedora Workstation and disabled by default fi... | 336 | 1546 |
2019-09-01 | [Fedora 31 System-Wide Change proposal (late):... | 403 | 1420 |
2019-10-01 | [Modularity and the system-upgrade path, Defin... | 335 | 1357 |
2019-11-01 | [Modularity: The Official Complaint Thread, Fe... | 292 | 1321 |
2019-12-01 | [Fedora 32 System-Wide Change proposal: Disall... | 264 | 1144 |
2020-01-01 | [Git Forge Requirements: Please see the Commun... | 289 | 1535 |
2020-02-01 | [Ideas and proposal for removing changelog and... | 395 | 1229 |
2020-03-01 | [Fedora 33 System-Wide Change proposal: ELN Bu... | 480 | 1545 |
2020-04-01 | [CPE Git Forge Decision, Fedora 33 System-Wide... | 443 | 1772 |
2020-05-01 | [Is dist-git a good place for work?, Re-Launch... | 400 | 1618 |
2020-06-01 | [Fedora 33 System-Wide Change proposal: Make n... | 307 | 2143 |
2020-07-01 | [The future of legacy BIOS support in Fedora.,... | 298 | 1785 |
2020-08-01 | [EarlyOOM +ZRAM Only, Fedora 33 Mass Rebuild, ... | 360 | 1218 |
2020-09-01 | [This is bad, was Re: Fedora 33 System-Wide Ch... | 426 | 1436 |
2020-10-01 | [[ELN] gcc is going to be updated to gcc11 in ... | 414 | 1188 |
2020-11-01 | [Fedora 34 Change: Route all Audio to PipeWire... | 348 | 1213 |
2020-12-01 | [Fedora 34 Change: GitRepos-master-to-main (Se... | 335 | 1438 |
2021-01-01 | [Proposal to deprecated `fedpkg local`, Fedora... | 344 | 1388 |
[16]
contributor_data_set = x
Upload results to S3
[17]
new_files = (
    (contributor_data_set, f"{BASE_PATH}/processed/contributors.csv"),
)
[18]
Path(f"{BASE_PATH}/processed").mkdir(parents=True, exist_ok=True)
[19]
contributor_data_set.to_csv(new_files[0][1], header=False)
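Because the file is written with `header=False`, it carries no column names, and the list of top threads is stored as its string representation. A consumer therefore has to supply the names and parse the list itself. The sketch below round-trips one made-up row through an in-memory buffer; the name "Chunk" for the index column is an assumption.

```python
import ast
import io

import pandas as pd

summary = pd.DataFrame(
    {
        "Top_5_Threads": [["Gating packages in Rawhide", "Status of OwnCloud/NextCloud"]],
        "Number_of_threads": [331],
        "Number_of_Messages": [1429],
    },
    index=["2018-01-01"],
)

# Write without a header row, as the notebook does.
buffer = io.StringIO()
summary.to_csv(buffer, header=False)
buffer.seek(0)

# Read back, supplying the column names explicitly.
restored = pd.read_csv(
    buffer,
    header=None,
    names=["Chunk", "Top_5_Threads", "Number_of_threads", "Number_of_Messages"],
)

# The list column comes back as a string and needs to be parsed.
threads = ast.literal_eval(restored.loc[0, "Top_5_Threads"])
print(threads)
```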
[20]
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/{Path(f).stem}/contributors.csv") for _, f in new_files
    )
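What `utils.upload_files` does with each pair is not shown in this notebook; presumably the second element is the destination key in remote storage. Under that assumption, the sketch below shows the key that the expression produces for the default local path.

```python
from pathlib import Path

# Default local output path when LOCAL_DATA_PATH is not set.
local_file = "../../data//processed/contributors.csv"

# Path.stem drops the directory and the ".csv" suffix, leaving "contributors".
remote_key = f"processed/{Path(local_file).stem}/contributors.csv"
print(remote_key)  # processed/contributors/contributors.csv
```

The file therefore ends up in a `contributors/` folder named after itself, and the upload only runs when the `RUN_IN_AUTOMATION` environment variable is set to a non-empty value.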