Step 0: JSON to pandas DataFrame
At Red Hat, the Backblaze Q4 2018 dataset files (.csv) were downloaded and converted to look like the JSON output of the smartctl
command. These JSON files have a fairly complex, deeply nested structure. The purpose of this notebook is to create a pandas
DataFrame from one such JSON file. The reason for doing this is that data is much easier to analyze and experiment with when it is in a flat, "human friendly" form such as a DataFrame or csv.
NOTE: If you are not using the JSON-files version of the Backblaze dataset at Red Hat, this notebook may not be relevant for you.
In this notebook, the data is first parsed. Then it is flattened, i.e. the deep nesting is removed. Next, any duplicate or redundant columns, as well as columns that will not be available at inference time, are removed. Essentially, the data is made to look exactly like a csv file from the Backblaze dataset.
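To make the flattening step concrete, here is a sketch of how one nested line could be flattened in a single call with `pd.json_normalize`. The record below is a toy example shaped like the files described above; all field values are illustrative:

```python
import json

import pandas as pd

# a toy record shaped like one line of the json files (all values are made up)
line = json.dumps(
    {
        "hints": {"is_backblaze": True, "backblaze_ts": 1538352000000, "backblaze_failure_label": False},
        "smartctl_json": {"model_name": "ST4000DM000", "user_capacity": {"bytes": 4000787030016}},
    }
)

# json_normalize flattens nested dicts into dot-separated column names
flat = pd.json_normalize(json.loads(line))
print(flat.columns.tolist())
```

This notebook instead flattens step by step, which gives finer control over the resulting column names.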
import boto3
import pandas as pd
# Ceph data location
S3_ENDPOINT_URL = "https://s3.upshift.redhat.com/"
BUCKET_NAME = "DH-PLAYPEN"
PREFIX = "nwatkins/smart_data_Q4_2018"
# create an S3 client that will access Ceph
# NOTE: assumes AWS credentials are stored in ~/.aws/credentials; if not, they must be passed as parameters
s3 = boto3.client("s3", "us-east-1", endpoint_url=S3_ENDPOINT_URL)
# keys of the data files in the Ceph bucket
data_s3_keys = [
content["Key"]
for content in s3.list_objects(Bucket=BUCKET_NAME, Prefix=PREFIX)["Contents"]
]
# get the first file
ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])
df = pd.read_json(ret["Body"].read().decode("utf-8"), lines=True)
print(df.shape)
df.head()
(99636, 2)
 | hints | smartctl_json |
---|---|---|
0 | {'is_backblaze': True, 'backblaze_ts': 1538352... | {'model_name': 'ST4000DM000', 'serial_number':... |
1 | {'is_backblaze': True, 'backblaze_ts': 1538352... | {'model_name': 'ST12000NM0007', 'serial_number... |
2 | {'is_backblaze': True, 'backblaze_ts': 1538352... | {'model_name': 'ST12000NM0007', 'serial_number... |
3 | {'is_backblaze': True, 'backblaze_ts': 1538352... | {'model_name': 'HGST HMS5C4040ALE640', 'serial... |
4 | {'is_backblaze': True, 'backblaze_ts': 1538352... | {'model_name': 'ST8000NM0055', 'serial_number'... |
# #################### ALTERNATIVE: READ THE FILE LINE BY LINE #################### #
# import json
#
# # get the first json file
# ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])
# MAX_NUM_LINES = 5
#
# data_rows, label_rows = [], []
# for i, line in enumerate(ret["Body"].iter_lines(), start=1):
#     tmp = json.loads(line)
#     data_rows.append(tmp["smartctl_json"])
#     label_rows.append(tmp["hints"])
#     if i == MAX_NUM_LINES:
#         break
#
# # building each frame once from a list of dicts is more efficient than
# # repeatedly appending rows, which copies the frame on every iteration
# data = pd.DataFrame(data_rows)
# labels = pd.DataFrame(label_rows)
# labels.head()
# data.head()
# get labels (the "Y")
labels = df["hints"].apply(pd.Series)
labels.head()
 | is_backblaze | backblaze_ts | backblaze_failure_label |
---|---|---|---|
0 | True | 1.538352e+12 | False |
1 | True | 1.538352e+12 | False |
2 | True | 1.538352e+12 | False |
3 | True | 1.538352e+12 | False |
4 | True | 1.538352e+12 | False |
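`apply(pd.Series)` is convenient but runs a Python-level call per row; for a frame of ~100k rows, building the new frame from a list of dicts is typically faster. A sketch on toy data (values made up, shaped like the `hints` column):

```python
import pandas as pd

# toy frame mimicking the 'hints' column above (values are made up)
df = pd.DataFrame(
    {
        "hints": [
            {"is_backblaze": True, "backblaze_ts": 1538352000000.0, "backblaze_failure_label": False},
            {"is_backblaze": True, "backblaze_ts": 1538352000000.0, "backblaze_failure_label": False},
        ]
    }
)

# constructing from a list of dicts avoids the per-row apply overhead
labels_fast = pd.DataFrame(df["hints"].tolist(), index=df.index)
labels_slow = df["hints"].apply(pd.Series)
print((labels_fast == labels_slow).all().all())
```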
# get data (the "X")
data = df["smartctl_json"].apply(pd.Series)
data.head()
 | model_name | serial_number | model_family | user_capacity | ata_smart_attributes |
---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | ST4000DM000 | {'bytes': 4000787030016} | {'table': [{'id': 1, 'value': 117, 'raw': {'va... |
1 | ST12000NM0007 | ZJV0XJQ4 | ST12000NM0007 | {'bytes': 12000138625024} | {'table': [{'id': 1, 'value': 68, 'raw': {'val... |
2 | ST12000NM0007 | ZJV0XJQ0 | ST12000NM0007 | {'bytes': 12000138625024} | {'table': [{'id': 1, 'value': 79, 'raw': {'val... |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | {'bytes': 4000787030016} | {'table': [{'id': 1, 'value': 100, 'raw': {'va... |
4 | ST8000NM0055 | ZA16NQJR | ST8000NM0055 | {'bytes': 8001563222016} | {'table': [{'id': 1, 'value': 80, 'raw': {'val... |
# change from dict type {'bytes': 123} to just int64 type 123
data["user_capacity"] = data["user_capacity"].apply(lambda x: x["bytes"])
data.head()
 | model_name | serial_number | model_family | user_capacity | ata_smart_attributes |
---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | ST4000DM000 | 4000787030016 | {'table': [{'id': 1, 'value': 117, 'raw': {'va... |
1 | ST12000NM0007 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | {'table': [{'id': 1, 'value': 68, 'raw': {'val... |
2 | ST12000NM0007 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | {'table': [{'id': 1, 'value': 79, 'raw': {'val... |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | {'table': [{'id': 1, 'value': 100, 'raw': {'va... |
4 | ST8000NM0055 | ZA16NQJR | ST8000NM0055 | 8001563222016 | {'table': [{'id': 1, 'value': 80, 'raw': {'val... |
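As an aside, the same dict unpacking can be done with the `.str` accessor, which works element-wise on dicts as well as strings. A sketch on toy values copied from the table above:

```python
import pandas as pd

# toy column of dicts shaped like 'user_capacity' above
cap = pd.Series([{"bytes": 4000787030016}, {"bytes": 12000138625024}])

# .str.get("bytes") is equivalent to .apply(lambda x: x["bytes"]) for dict elements
print(cap.str.get("bytes").tolist())
```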
# change from dict type {'table': [{}, {}, {}]} to list type [{}, {}, {}]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(lambda x: x["table"])
data.head()
 | model_name | serial_number | model_family | user_capacity | ata_smart_attributes |
---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | ST4000DM000 | 4000787030016 | [{'id': 1, 'value': 117, 'raw': {'value': 1485... |
1 | ST12000NM0007 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | [{'id': 1, 'value': 68, 'raw': {'value': 65293... |
2 | ST12000NM0007 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | [{'id': 1, 'value': 79, 'raw': {'value': 75775... |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | [{'id': 1, 'value': 100, 'raw': {'value': 0, '... |
4 | ST8000NM0055 | ZA16NQJR | ST8000NM0055 | 8001563222016 | [{'id': 1, 'value': 80, 'raw': {'value': 10341... |
# change from list of dict type [{}, {}, {}] to list of tuples type [(), (), ()]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(
lambda x: [tuple(kv.values()) for kv in x]
)
data.head()
 | model_name | serial_number | model_family | user_capacity | ata_smart_attributes |
---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | ST4000DM000 | 4000787030016 | [(1, 117, {'value': 148579464, 'string': 14857... |
1 | ST12000NM0007 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | [(1, 68, {'value': 6529384, 'string': 6529384}... |
2 | ST12000NM0007 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | [(1, 79, {'value': 75775832, 'string': 7577583... |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | [(1, 100, {'value': 0, 'string': 0}), (2, 134.... |
4 | ST8000NM0055 | ZA16NQJR | ST8000NM0055 | 8001563222016 | [(1, 80, {'value': 103416784, 'string': 103416... |
# split the list of tuples [(), (), ()] into columns
# NOTE: each list contains tuples only for those attributes that are non-null for that drive,
# e.g. hdd1's list may look like [(252, 10, {...}), (187, 10, {...})]
# but hdd2's list may look like [(12, 5, {...}), (5, 10, {...})]
for row_idx in range(data.shape[0]):
# get the stats for current drive
stats = data.loc[row_idx]["ata_smart_attributes"]
for smart_id, norm_val, raw_val_dict in stats:
# extract normalized value, and int form of raw value
data.at[row_idx, "smart_" + str(smart_id) + "_normalized"] = norm_val
data.at[row_idx, "smart_" + str(smart_id) + "_raw"] = raw_val_dict["value"]
# drop the original column as we have parsed its information and don't need it any more
data.drop("ata_smart_attributes", axis=1, inplace=True)
data.head()
 | model_name | serial_number | model_family | user_capacity | smart_1_normalized | smart_1_raw | smart_3_normalized | smart_3_raw | smart_4_normalized | smart_4_raw | ... | smart_23_normalized | smart_23_raw | smart_24_normalized | smart_24_raw | smart_250_normalized | smart_250_raw | smart_251_normalized | smart_251_raw | smart_252_normalized | smart_252_raw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | ST4000DM000 | 4000787030016 | 117.0 | 148579464.0 | 91.0 | 0.0 | 100.0 | 12.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ST12000NM0007 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | 68.0 | 6529384.0 | 98.0 | 0.0 | 100.0 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ST12000NM0007 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | 79.0 | 75775832.0 | 99.0 | 0.0 | 100.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | 100.0 | 0.0 | 100.0 | 436.0 | 100.0 | 8.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | ST8000NM0055 | ZA16NQJR | ST8000NM0055 | 8001563222016 | 80.0 | 103416784.0 | 92.0 | 0.0 | 100.0 | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 90 columns
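Assigning cells one at a time with `.at` works, but it grows the frame column by column; building one dict per drive and constructing the frame in a single pass is usually faster at ~100k rows. A sketch on toy attribute lists (values made up):

```python
import pandas as pd

# toy 'ata_smart_attributes' column: one list of (id, normalized, raw-dict) tuples per drive
attrs = pd.Series(
    [
        [(1, 117, {"value": 148579464}), (3, 91, {"value": 0})],
        [(1, 68, {"value": 6529384}), (4, 100, {"value": 2})],
    ]
)

rows = []
for stats in attrs:
    row = {}
    for smart_id, norm_val, raw_val_dict in stats:
        row[f"smart_{smart_id}_normalized"] = norm_val
        row[f"smart_{smart_id}_raw"] = raw_val_dict["value"]
    rows.append(row)

# attributes missing for a drive become NaN, just as with the .at-based loop
smart_df = pd.DataFrame(rows, index=attrs.index)
print(sorted(smart_df.columns))
```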
# are all the timestamps the same within this json file?
(labels["backblaze_ts"] == 1538352000000.0).all()
True
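`backblaze_ts` is a Unix epoch in milliseconds; converting it confirms that this file corresponds to a single day, 2018-10-01:

```python
import pandas as pd

# the constant timestamp from the labels, interpreted as a millisecond epoch
ts = pd.to_datetime(1538352000000.0, unit="ms")
print(ts)  # 2018-10-01 00:00:00
```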
# are these two columns different for any row?
(data["model_name"] != data["model_family"]).any()
False
# drop duplicate columns
data.drop(labels=["model_family"], axis=1, inplace=True)
data.head()
 | model_name | serial_number | user_capacity | smart_1_normalized | smart_1_raw | smart_3_normalized | smart_3_raw | smart_4_normalized | smart_4_raw | smart_5_normalized | ... | smart_23_normalized | smart_23_raw | smart_24_normalized | smart_24_raw | smart_250_normalized | smart_250_raw | smart_251_normalized | smart_251_raw | smart_252_normalized | smart_252_raw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ST4000DM000 | Z305B2QN | 4000787030016 | 117.0 | 148579464.0 | 91.0 | 0.0 | 100.0 | 12.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ST12000NM0007 | ZJV0XJQ4 | 12000138625024 | 68.0 | 6529384.0 | 98.0 | 0.0 | 100.0 | 2.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ST12000NM0007 | ZJV0XJQ0 | 12000138625024 | 79.0 | 75775832.0 | 99.0 | 0.0 | 100.0 | 1.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | HGST HMS5C4040ALE640 | PL1331LAHG1S4H | 4000787030016 | 100.0 | 0.0 | 100.0 | 436.0 | 100.0 | 8.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | ST8000NM0055 | ZA16NQJR | 8001563222016 | 80.0 | 103416784.0 | 92.0 | 0.0 | 100.0 | 6.0 | 100.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 89 columns
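The `model_name`/`model_family` check above was done by hand; duplicate columns can also be detected generically by transposing the frame and using `duplicated`. A sketch on a toy frame (values taken from the rows above):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "model_name": ["ST4000DM000", "ST12000NM0007"],
        "model_family": ["ST4000DM000", "ST12000NM0007"],
        "user_capacity": [4000787030016, 12000138625024],
    }
)

# after a transpose, duplicated() flags columns whose entire contents
# repeat an earlier column
dupes = df.columns[df.T.duplicated()].tolist()
print(dupes)  # ['model_family']
```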
NOTE: The datetime column can optionally be added to data, but it must first be ensured that the SMART stat fetching tool (e.g. smartctl) provides this information.
# load the corresponding csv from the original Backblaze dataset, for comparison
bbdf = pd.read_csv("/home/kachauha/Downloads/data_Q4_2018/2018-10-01.csv")
bbdf.head()
 | date | serial_number | model | capacity_bytes | failure | smart_1_normalized | smart_1_raw | smart_2_normalized | smart_2_raw | smart_3_normalized | ... | smart_250_normalized | smart_250_raw | smart_251_normalized | smart_251_raw | smart_252_normalized | smart_252_raw | smart_254_normalized | smart_254_raw | smart_255_normalized | smart_255_raw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-10-01 | Z305B2QN | ST4000DM000 | 4000787030016 | 0 | 117 | 148579464 | NaN | NaN | 91 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2018-10-01 | ZJV0XJQ4 | ST12000NM0007 | 12000138625024 | 0 | 68 | 6529384 | NaN | NaN | 98 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2018-10-01 | ZJV0XJQ0 | ST12000NM0007 | 12000138625024 | 0 | 79 | 75775832 | NaN | NaN | 99 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2018-10-01 | PL1331LAHG1S4H | HGST HMS5C4040ALE640 | 4000787030016 | 0 | 100 | 0 | 134.0 | 103.0 | 100 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2018-10-01 | ZA16NQJR | ST8000NM0055 | 8001563222016 | 0 | 80 | 103416784 | NaN | NaN | 92 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 129 columns
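Finally, to check how close the flattened frame is to the Backblaze csv schema, the two column sets can be compared. A sketch with abbreviated, hand-picked column lists (the real frames have 89 and 129 columns):

```python
# abbreviated column sets from the flattened json frame and the Backblaze csv
json_cols = {"model_name", "serial_number", "user_capacity", "smart_1_normalized", "smart_1_raw"}
csv_cols = {"date", "serial_number", "model", "capacity_bytes", "failure",
            "smart_1_normalized", "smart_1_raw"}

# differences highlight renamed fields (model vs model_name, capacity_bytes vs
# user_capacity) and fields not recoverable from smartctl output (date, failure)
print(sorted(csv_cols - json_cols))  # ['capacity_bytes', 'date', 'failure', 'model']
print(sorted(json_cols - csv_cols))  # ['model_name', 'user_capacity']
```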