
gzip to raw monthly data

This notebook processes the .gz files downloaded by "collect_data.ipynb". It checks the 'raw/fedora-devel-list' folder in the local data directory for .gz files that have no corresponding *.mbox file and unzips them.

The unzipped files are stored only locally, where the preprocessing notebooks consume them, so the raw datasets do not have to be pushed to and pulled from Ceph.

Note: This notebook should be used in automation.

[15]
import gzip
import shutil
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv("../../.env")
True
[16]
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")
path = Path(BASE_PATH).joinpath("raw/fedora-devel-list")

# find .gz files that don't have a corresponding .mbox file yet
gzs_names = {x.name for x in path.glob("*.gz")}
mbox_names = {x.name + ".gz" for x in path.glob("*.mbox")}
gzs = list(gzs_names - mbox_names)
[]
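
As a small illustration of the matching rule used above, the sketch below applies the same set difference to made-up file names. The names are hypothetical and are not taken from the dataset; the sketch only assumes that each archive is named after its .mbox file with ".gz" appended.

```python
# Hypothetical file names, for illustration only.
gz_names = {"2020-01.mbox.gz", "2020-02.mbox.gz"}
mbox_names = {"2020-01.mbox"}

# An archive is pending when no unzipped file maps back to its name.
pending = sorted(gz_names - {name + ".gz" for name in mbox_names})
print(pending)  # ['2020-02.mbox.gz']
```

An empty result, as in the cell output above, means every archive already has an unzipped counterpart.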
[12]
# unzip each pending file and save it locally, dropping the ".gz" suffix
for mail in gzs:
    mbox_name = mail[:-3]
    print(mbox_name)
    with gzip.open(path.joinpath(mail), "rb") as f_in, open(
        path.joinpath(mbox_name), "wb"
    ) as f_out:
        shutil.copyfileobj(f_in, f_out)
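
Since the check only tests whether an .mbox file exists, a run interrupted mid-copy could leave a truncated file that later runs would skip. The sketch below is one possible way to guard against that, not part of the original notebook: it wraps both steps in a hypothetical helper, `unzip_missing`, that writes to a temporary ".part" file and renames it only after the copy has finished. The helper name and the ".part" suffix are assumptions made for this example.

```python
import gzip
import os
import shutil
from pathlib import Path


def unzip_missing(directory):
    """Decompress every *.gz in `directory` that has no unzipped copy yet.

    Hypothetical helper; returns the names of the files it wrote.
    """
    directory = Path(directory)
    written = []
    for gz_path in sorted(directory.glob("*.gz")):
        target = gz_path.with_suffix("")  # drops only the trailing ".gz"
        if target.exists():
            continue
        # Write to a temporary name first so a partial file is never
        # mistaken for a finished one by a later run.
        tmp = target.with_name(target.name + ".part")
        try:
            with gzip.open(gz_path, "rb") as f_in, open(tmp, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            os.replace(tmp, target)  # rename only after a complete copy
        except BaseException:
            tmp.unlink(missing_ok=True)  # requires Python 3.8+
            raise
        written.append(target.name)
    return written
```

Calling `unzip_missing(path)` with the `path` defined earlier would process the same folder. A second call right after a successful run should return an empty list, which makes the helper safe to run repeatedly in automation.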