Contributor Analysis

This notebook ingests the preprocessed data in ../interim/metadata, downloaded by download_datasets.ipynb, and quantifies the activity of individual contributors to the mailing list, including how often senders start a new thread and how often they reply to existing threads. Both analyses are computed at monthly intervals.

Finally, the analyses are merged and saved as a single CSV file that is pushed to remote storage for visualization with Superset.
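As a rough illustration of the monthly "starters versus replies" idea described above, here is a minimal sketch on made-up data. It assumes that a subject beginning with "Re:" marks a reply, which is only a proxy, and it counts messages, not senders, because the toy frame mirrors the Message-ID/Date/Subject style of columns used in this notebook. The names toy, starters, replies, merged and out_file are hypothetical, and the CSV is written to a temporary directory, not to the project data path.

```python
import datetime
import tempfile
from pathlib import Path

import pandas as pd

# Made-up messages standing in for the real mailing-list metadata.
toy = pd.DataFrame({
    "Subject": [
        "Mass rebuild", "Re: Mass rebuild", "Re: Mass rebuild",
        "Orphaned packages", "Re: Orphaned packages",
    ],
    "Date": pd.to_datetime([
        "2020-12-30 10:00", "2020-12-31 11:00", "2021-01-01 08:00",
        "2021-01-02 09:00", "2021-01-03 12:00",
    ]),
})

# Bucket every message into the first day of its month.
toy["Chunk"] = toy["Date"].apply(lambda ts: datetime.date(ts.year, ts.month, 1))

# Assumption: a leading "Re:" (any case) marks a reply; everything else starts a thread.
is_reply = toy["Subject"].str.match(r"(?i)^re:")

starters = toy[~is_reply].groupby("Chunk").size().rename("Thread_starters")
replies = toy[is_reply].groupby("Chunk").size().rename("Replies")

# Merge both monthly series on the month index; a month missing from one side becomes 0.
merged = pd.concat([starters, replies], axis=1).fillna(0).astype(int)

# Save the merged table as a single CSV file (temporary location for this sketch).
out_dir = Path(tempfile.mkdtemp()) / "processed"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "contributors.csv"
merged.to_csv(out_file)
```

If the metadata also carried a sender field, grouping by month and sender would turn these message counts into per-contributor counts.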

[1]
import pandas as pd
import os
import datetime
import re
from pathlib import Path
from dotenv import load_dotenv
import matplotlib.pyplot as plt
import seaborn as sns

load_dotenv("../../.env")

import sys

sys.path.append("../..")
from src import utils
[2]
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data/")
[3]
df = utils.load_dataset(f"{BASE_PATH}/interim/metadata/")
[5]
df.head()
Message-ID Date Subject
0 <CAJCQCtQ=Rif-5LcbHPB=CJ3+c7U20yqyiqShVTTaQH7P... Thu, 31 Dec 2020 17:49:41 -0700 Re: Fedora 34 Change: Enable btrfs transparent...
1 <45a1abbe39ccd56d4dd3b62d09214e78ae5fa699.came... Thu, 31 Dec 2020 17:10:44 -0800 Re: Thoughts about packaging a standalone pyth...
2 <alpine.DEB.2.22.394.2012312201030.356797@bear... Thu, 31 Dec 2020 22:02:44 -0500 Re: Thoughts about packaging a standalone pyth...
3 <CAJP_izfLFAVL4CTFzS7=W7T6izt_WQG81Ys=GqdQ3yZG... Thu, 31 Dec 2020 23:03:56 -0500 License change: R-usethis GPLv3 -> MIT
4 <20210101073305.9666D3052DF1@bastion01.iad2.fe... Fri, 01 Jan 2021 07:33:05 +0000 Fedora-Cloud-33-20210101.0 compose check report

Minor preprocessing

Here we do some minor cleaning of the "subject", "text" and "date" fields so that the dataframe can be sorted by date and all messages from the same thread can be grouped together.
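The cleaning steps can be sketched on a few made-up rows. This is an illustrative variant, not the notebook's own code: the hypothetical helper strip_reply_prefix removes repeated and case-insensitive "Re:" prefixes, and the dates are parsed with the standard library's email.utils.parsedate_to_datetime and normalized to UTC, which is an assumption about how mixed time zone offsets should be handled.

```python
import datetime
import re
from email.utils import parsedate_to_datetime

import pandas as pd

# Made-up rows using the RFC 2822 date style seen in mailing-list headers.
toy = pd.DataFrame({
    "Subject": ["Re: Mass rebuild", "Mass rebuild", "Re: Re: Orphaned packages"],
    "Date": [
        "Thu, 31 Dec 2020 17:49:41 -0700",
        "Thu, 31 Dec 2020 09:00:00 +0000",
        "Fri, 01 Jan 2021 07:33:05 +0000",
    ],
})


def strip_reply_prefix(subject):
    """Drop one or more leading "Re:" markers so replies share the thread subject."""
    return re.sub(r"^(?:re:\s*)+", "", subject, flags=re.IGNORECASE)


toy["Subject"] = toy["Subject"].apply(strip_reply_prefix)

# Parse the header dates and convert them all to UTC so they sort consistently.
toy["Date"] = pd.to_datetime(toy["Date"].apply(parsedate_to_datetime), utc=True)

# Month bucket: the first day of the (UTC) month each message falls in.
toy["Chunk"] = toy["Date"].apply(lambda ts: datetime.date(ts.year, ts.month, 1))

toy = toy.sort_values("Date").reset_index(drop=True)
```

Note that the message sent at 17:49 with a -0700 offset lands in January 2021 once converted to UTC, so the choice of time zone handling can move messages across month boundaries.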

[6]
# strip a leading "Re: " so that replies share the subject of their thread
def match_threads(subject):
    return re.sub(r"^Re: ", "", subject)


# remove the trailing email address, e.g. "Name <user@host>" -> "Name"
def remove_email(text):
    return re.sub(r" <.*", "", text)


# convert date string to datetime object
def parse_date(date):
    return pd.to_datetime(date)
[7]
# apply our transformations

df["Subject"] = df["Subject"].apply(match_threads)
df["Date"] = df["Date"].apply(parse_date)
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()
Message-ID Date Subject Chunk
0 <57d76b65c8c848f7e1b83e56ff8f094ce3855479.came... 2018-01-01 02:42:21 Anything we can do to temporarily halt new bug... 2018-01-01
1 <20180101172438.GA2871@flame.pingoured.fr> 2018-01-01 17:24:38 [Bug 1529276] New: findbugs-contrib-7.2.0.sb i... 2018-01-01
2 <20180101220004.0632660A400B@fedocal02.phx2.fe... 2018-01-01 22:00:04 [Fedocal] Reminder meeting : Modularity Office... 2018-01-01
3 <20180101220004.0E97560A400C@fedocal02.phx2.fe... 2018-01-01 22:00:04 [Fedocal] Reminder meeting : Modularity Office... 2018-01-01
4 <20180101221314.GA52721@rawhide-composer.phx2.... 2018-01-01 22:13:15 Fedora rawhide compose report: 20180101.n.0 ch... 2018-01-01

Quantify contributor activity

[8]
# Count messages per month over the entire dataset
Monthly = df.groupby("Chunk").count()
Monthly.head(15)
Message-ID Date Subject
Chunk
2018-01-01 1429 1429 1429
2018-02-01 1226 1226 1226
2018-03-01 1231 1231 1231
2018-04-01 678 678 678
2018-05-01 723 723 723
2018-06-01 1054 1054 1054
2018-07-01 1031 1031 1031
2018-08-01 802 802 802
2018-09-01 832 832 832
2018-10-01 987 987 987
2018-11-01 1086 1086 1086
2018-12-01 706 706 706
2019-01-01 1003 1003 1003
2019-02-01 1008 1008 1008
2019-03-01 1041 1041 1041
[9]
sns.set(rc={"figure.figsize": (20, 10)})
plt.plot(Monthly.Subject)
plt.title("Number of emails per month")
plt.ylabel("Number of emails")
plt.xlabel("Date")
plt.show()
[10]
# count distinct threads (unique subjects) per month

Subject = df.groupby(["Chunk"]).Subject.unique().apply(len)
Subject.head(5)
Chunk
2018-01-01    331
2018-02-01    262
2018-03-01    356
2018-04-01    258
2018-05-01    207
Name: Subject, dtype: int64
[11]
plt.plot(Subject)
plt.title("Number of threads per month")
plt.ylabel("Number of threads")
plt.xlabel("Date")
plt.show()
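A small sketch of how the per-month thread count behaves, using made-up subjects. It shows that unique().apply(len) and the pandas built-in nunique() agree on this data, and that a thread active in two months is counted once in each month. The names toy, by_unique and by_nunique are hypothetical.

```python
import datetime

import pandas as pd

# Made-up data: thread "A" is active in both December and January.
toy = pd.DataFrame({
    "Subject": ["A", "A", "B", "A", "C"],
    "Chunk": [
        datetime.date(2020, 12, 1),
        datetime.date(2020, 12, 1),
        datetime.date(2020, 12, 1),
        datetime.date(2021, 1, 1),
        datetime.date(2021, 1, 1),
    ],
})

# Same approach as the cell above: collect the unique subjects, then count them.
by_unique = toy.groupby("Chunk").Subject.unique().apply(len)

# Equivalent count using the built-in aggregation.
by_nunique = toy.groupby("Chunk")["Subject"].nunique()

# Distinct threads over the whole toy dataset, for comparison with the monthly sum.
total_threads = toy["Subject"].nunique()
```

Here both months report 2 threads while the dataset holds only 3 distinct subjects, so monthly thread counts should not be summed to get a total. The two approaches can differ when subjects are missing, since nunique() skips NaN by default while unique() keeps it.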
[12]
# five most active subjects (threads) per month

top_subjects = {}

for month in df.Chunk.unique():
    top_threads = [
        df[df["Chunk"] == month].Subject.value_counts().head(5).index.tolist()
    ]
    # thread_count = df[df["Chunk"] == month].Subject.value_counts().head(3).tolist()
    # current_month = list(zip(top_threads,thread_count))
    top_subjects[month] = top_threads
[13]
x = pd.DataFrame(top_subjects).T
x
0
2018-01-01 [EPEL support in "master" branch (aka speeding...
2018-02-01 [[ACTION NEEDED] Missing BuildRequires: gcc/gc...
2018-03-01 [Gating packages in Rawhide, Goodbye nvr.rspli...
2018-04-01 [Status of OwnCloud/NextCloud, Intent to orpha...
2018-05-01 [Prioritizing ~/.local/bin over /usr/bin on th...
2018-06-01 [F29 System Wide Change: Make BootLoaderSpec t...
2018-07-01 [[HEADS UP] Removal of GCC from the buildroot,...
2018-08-01 [Proposal: Reduce *-devel packages dependencie...
2018-09-01 [Semi-serious proposal: drop all optional entr...
2018-10-01 [Fedora should replace mailing lists with Disc...
2018-11-01 [Fedora Lifecycles: imagine longer-term possib...
2018-12-01 [Fedora 30 System-Wide Change proposal: Remove...
2019-01-01 [F30: System-Wide Change proposal: DNF UUID, F...
2019-02-01 [Orphaned packages that will be retired (and e...
2019-03-01 [More than 10% of all Fedora spec files are no...
2019-04-01 [Can we maybe reduce the set of packages we in...
2019-05-01 [Upgrade to F30 gone wrong, Fedora 31 System-W...
2019-06-01 [Modularity vs. libgit, Fedora 31 System-Wide ...
2019-07-01 [Rolling out Phase I of rawhide package gating...
2019-08-01 [Fedora Workstation and disabled by default fi...
2019-09-01 [Fedora 31 System-Wide Change proposal (late):...
2019-10-01 [Modularity and the system-upgrade path, Defin...
2019-11-01 [Modularity: The Official Complaint Thread, Fe...
2019-12-01 [Fedora 32 System-Wide Change proposal: Disall...
2020-01-01 [Git Forge Requirements: Please see the Commun...
2020-02-01 [Ideas and proposal for removing changelog and...
2020-03-01 [Fedora 33 System-Wide Change proposal: ELN Bu...
2020-04-01 [CPE Git Forge Decision, Fedora 33 System-Wide...
2020-05-01 [Is dist-git a good place for work?, Re-Launch...
2020-06-01 [Fedora 33 System-Wide Change proposal: Make n...
2020-07-01 [The future of legacy BIOS support in Fedora.,...
2020-08-01 [EarlyOOM +ZRAM Only, Fedora 33 Mass Rebuild, ...
2020-09-01 [This is bad, was Re: Fedora 33 System-Wide Ch...
2020-10-01 [[ELN] gcc is going to be updated to gcc11 in ...
2020-11-01 [Fedora 34 Change: Route all Audio to PipeWire...
2020-12-01 [Fedora 34 Change: GitRepos-master-to-main (Se...
2021-01-01 [Proposal to deprecated `fedpkg local`, Fedora...
[14]
x = x.join(Subject)
x = x.join(Monthly["Message-ID"])
[15]
x.columns = ["Top_5_Threads", "Number_of_threads", "Number_of_Messages"]
x
Top_5_Threads Number_of_threads Number_of_Messages
2018-01-01 [EPEL support in "master" branch (aka speeding... 331 1429
2018-02-01 [[ACTION NEEDED] Missing BuildRequires: gcc/gc... 262 1226
2018-03-01 [Gating packages in Rawhide, Goodbye nvr.rspli... 356 1231
2018-04-01 [Status of OwnCloud/NextCloud, Intent to orpha... 258 678
2018-05-01 [Prioritizing ~/.local/bin over /usr/bin on th... 207 723
2018-06-01 [F29 System Wide Change: Make BootLoaderSpec t... 240 1054
2018-07-01 [[HEADS UP] Removal of GCC from the buildroot,... 300 1031
2018-08-01 [Proposal: Reduce *-devel packages dependencie... 274 802
2018-09-01 [Semi-serious proposal: drop all optional entr... 259 832
2018-10-01 [Fedora should replace mailing lists with Disc... 255 987
2018-11-01 [Fedora Lifecycles: imagine longer-term possib... 247 1086
2018-12-01 [Fedora 30 System-Wide Change proposal: Remove... 215 706
2019-01-01 [F30: System-Wide Change proposal: DNF UUID, F... 247 1003
2019-02-01 [Orphaned packages that will be retired (and e... 263 1008
2019-03-01 [More than 10% of all Fedora spec files are no... 284 1041
2019-04-01 [Can we maybe reduce the set of packages we in... 294 961
2019-05-01 [Upgrade to F30 gone wrong, Fedora 31 System-W... 267 845
2019-06-01 [Modularity vs. libgit, Fedora 31 System-Wide ... 202 817
2019-07-01 [Rolling out Phase I of rawhide package gating... 271 1067
2019-08-01 [Fedora Workstation and disabled by default fi... 336 1546
2019-09-01 [Fedora 31 System-Wide Change proposal (late):... 403 1420
2019-10-01 [Modularity and the system-upgrade path, Defin... 335 1357
2019-11-01 [Modularity: The Official Complaint Thread, Fe... 292 1321
2019-12-01 [Fedora 32 System-Wide Change proposal: Disall... 264 1144
2020-01-01 [Git Forge Requirements: Please see the Commun... 289 1535
2020-02-01 [Ideas and proposal for removing changelog and... 395 1229
2020-03-01 [Fedora 33 System-Wide Change proposal: ELN Bu... 480 1545
2020-04-01 [CPE Git Forge Decision, Fedora 33 System-Wide... 443 1772
2020-05-01 [Is dist-git a good place for work?, Re-Launch... 400 1618
2020-06-01 [Fedora 33 System-Wide Change proposal: Make n... 307 2143
2020-07-01 [The future of legacy BIOS support in Fedora.,... 298 1785
2020-08-01 [EarlyOOM +ZRAM Only, Fedora 33 Mass Rebuild, ... 360 1218
2020-09-01 [This is bad, was Re: Fedora 33 System-Wide Ch... 426 1436
2020-10-01 [[ELN] gcc is going to be updated to gcc11 in ... 414 1188
2020-11-01 [Fedora 34 Change: Route all Audio to PipeWire... 348 1213
2020-12-01 [Fedora 34 Change: GitRepos-master-to-main (Se... 335 1438
2021-01-01 [Proposal to deprecated `fedpkg local`, Fedora... 344 1388
[16]
contributor_data_set = x

Upload results to S3

[17]
new_files = (
    (contributor_data_set, f"{BASE_PATH}/processed/contributors.csv"),
)
[18]
Path(f"{BASE_PATH}/processed").mkdir(parents=True, exist_ok=True)
[19]
contributor_data_set.to_csv(new_files[0][1], header=False)
[20]
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/{Path(f).stem}/contributors.csv") for _, f in new_files
    )