Analysis notebook :
This notebook is a longer and in-depth version of the EDA notebook for cloud-price-data and contains the following results :
- Count of unique instances types
- Type of feature: numerical, categorical, bag of words
- Smple of feature values
- Distrubition of features (unique values / feature)
Importing the required python libraries and the sample dataset from the data/processed
folder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
df = pd.read_csv("../../data/processed/processed.csv")
d = df.copy()
sns.set(style="whitegrid")
/opt/app-root/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (30,61,63,64) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Dataframe for aws-pricing dataset.
df.head()
servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | vcpu | physicalProcessor | storage | networkPerformance | ... | instanceCapacityMetal | elasticGraphicsType | instanceCapacityMedium | productType | instanceCapacity32xlarge | maxIopsBurstPerformance | provisioned | memory_num | clockSpeed_num | gpuMemory_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AmazonEC2 | Asia Pacific (Tokyo) | AWS Region | d2.xlarge | Yes | Storage optimized | 4.0 | Intel Xeon E5-2676 v3 (Haswell) | 3 x 2000 HDD | Moderate | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.5 | 2.4 | NaN |
1 | AmazonEC2 | EU (Frankfurt) | AWS Region | m5d.24xlarge | Yes | General purpose | 96.0 | Intel Xeon Platinum 8175 (Skylake) | 4 x 900 NVMe SSD | 25 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 384.0 | 3.1 | NaN |
2 | AmazonEC2 | US East (N. Virginia) | AWS Region | r5dn.24xlarge | Yes | Memory optimized | 96.0 | Intel Xeon Platinum 8259 (Cascade Lake) | 4 x 900 NVMe SSD | 100 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 768.0 | 3.1 | NaN |
3 | AmazonEC2 | US West (Oregon) | AWS Region | c3.2xlarge | No | Compute optimized | 8.0 | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2 x 80 SSD | High | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.0 | 2.8 | NaN |
4 | AmazonEC2 | Asia Pacific (Singapore) | AWS Region | m5dn.2xlarge | Yes | General purpose | 8.0 | Intel Xeon Platinum 8259 (Cascade Lake) | 1 x 300 NVMe SSD | Up to 25 Gigabit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.0 | 3.1 | NaN |
5 rows × 68 columns
Unique instanceTypes
This dataframe (unique_instanceType_frame) displays all the unique instanceType(s) in this dataset.
unique_instancetype = df["instanceType"].unique()
Length of the list of all unique instanceType(s) gives us total number of unique instanceType(s) in this sample of the dataset.
len(unique_instancetype)
351
Convert the list to a dataframe for better understanding of contents
unique_instancetype_frame = pd.DataFrame(np.array(unique_instancetype))
unique_instancetype_frame.columns = ["unique_instancetype"]
display(unique_instancetype_frame)
unique_instancetype | |
---|---|
0 | d2.xlarge |
1 | m5d.24xlarge |
2 | r5dn.24xlarge |
3 | c3.2xlarge |
4 | m5dn.2xlarge |
... | ... |
346 | h1 |
347 | u-6tb1 |
348 | m3 |
349 | m5n |
350 | x1e |
351 rows × 1 columns
Total number of unique instanceType : 351
df_instanceType : DataFrame with counts of unique elements in each position.
df_instancetype = df.groupby("instanceType").nunique()
df_instancetype
servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | vcpu | physicalProcessor | storage | networkPerformance | ... | instanceCapacityMetal | elasticGraphicsType | instanceCapacityMedium | productType | instanceCapacity32xlarge | maxIopsBurstPerformance | provisioned | memory_num | clockSpeed_num | gpuMemory_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
instanceType | |||||||||||||||||||||
a1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
a1.2xlarge | 1 | 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
a1.4xlarge | 1 | 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
a1.large | 1 | 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
a1.medium | 1 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
z1d.3xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
z1d.6xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
z1d.large | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
z1d.metal | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
z1d.xlarge | 1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
350 rows × 68 columns
Numerical Features
Below shown table shows if the values in the column are numeric(True) or not(False)
is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
df_features = df_features.rename(columns={0: "isNumeric"})
with pd.option_context("display.max_rows", None):
display(df_features)
isNumeric | |
---|---|
servicecode | False |
location | False |
locationType | False |
instanceType | False |
currentGeneration | False |
instanceFamily | False |
vcpu | True |
physicalProcessor | False |
storage | False |
networkPerformance | False |
processorArchitecture | False |
tenancy | False |
operatingSystem | False |
licenseModel | False |
usagetype | False |
operation | False |
capacitystatus | False |
dedicatedEbsThroughput | False |
ecu | False |
enhancedNetworkingSupported | False |
instancesku | False |
intelAvxAvailable | False |
intelAvx2Available | False |
intelTurboAvailable | False |
normalizationSizeFactor | True |
preInstalledSw | False |
processorFeatures | False |
servicename | False |
id | False |
gpu | True |
ebsOptimized | False |
transferType | False |
fromLocation | False |
fromLocationType | False |
toLocation | False |
toLocationType | False |
group | False |
groupDescription | False |
resourceType | False |
instance | False |
instanceCapacity12xlarge | True |
instanceCapacity2xlarge | True |
instanceCapacityLarge | True |
instanceCapacityXlarge | True |
physicalCores | True |
instanceCapacity18xlarge | True |
instanceCapacity4xlarge | True |
instanceCapacity9xlarge | True |
instanceCapacity10xlarge | True |
instanceCapacity16xlarge | True |
instanceCapacity8xlarge | True |
instanceCapacity24xlarge | True |
storageMedia | False |
volumeType | False |
maxVolumeSize | False |
maxIopsvolume | False |
maxThroughputvolume | False |
volumeApiName | False |
instanceCapacityMetal | True |
elasticGraphicsType | False |
instanceCapacityMedium | True |
productType | False |
instanceCapacity32xlarge | True |
maxIopsBurstPerformance | False |
provisioned | False |
memory_num | True |
clockSpeed_num | True |
gpuMemory_num | True |
all_features = df_features.T.columns
From the above dataframe, extracting only the numerical features into the list numerical_fetaures
and further making a dataframe - num_feat
for better undertanding of the contents.
numerical_features = is_numeric_col[is_numeric_col]
n = numerical_features.index
num_feat = pd.DataFrame(n)
num_feat.rename(columns={0: "Numerical Features"})
Numerical Features | |
---|---|
0 | vcpu |
1 | normalizationSizeFactor |
2 | gpu |
3 | instanceCapacity12xlarge |
4 | instanceCapacity2xlarge |
5 | instanceCapacityLarge |
6 | instanceCapacityXlarge |
7 | physicalCores |
8 | instanceCapacity18xlarge |
9 | instanceCapacity4xlarge |
10 | instanceCapacity9xlarge |
11 | instanceCapacity10xlarge |
12 | instanceCapacity16xlarge |
13 | instanceCapacity8xlarge |
14 | instanceCapacity24xlarge |
15 | instanceCapacityMetal |
16 | instanceCapacityMedium |
17 | instanceCapacity32xlarge |
18 | memory_num |
19 | clockSpeed_num |
20 | gpuMemory_num |
From the list of all features, all the features that are not numerical will be categorical features. Adding them to the list categorial
and further making a dataframe - cat_feat
for better understandin of the contents.
categorical = [x for x in all_features if x not in numerical_features]
cat_feat = pd.DataFrame(categorical)
with pd.option_context("display.max_rows", None):
display(cat_feat.rename(columns={0: "Categorical Features"}))
Categorical Features | |
---|---|
0 | servicecode |
1 | location |
2 | locationType |
3 | instanceType |
4 | currentGeneration |
5 | instanceFamily |
6 | physicalProcessor |
7 | storage |
8 | networkPerformance |
9 | processorArchitecture |
10 | tenancy |
11 | operatingSystem |
12 | licenseModel |
13 | usagetype |
14 | operation |
15 | capacitystatus |
16 | dedicatedEbsThroughput |
17 | ecu |
18 | enhancedNetworkingSupported |
19 | instancesku |
20 | intelAvxAvailable |
21 | intelAvx2Available |
22 | intelTurboAvailable |
23 | preInstalledSw |
24 | processorFeatures |
25 | servicename |
26 | id |
27 | ebsOptimized |
28 | transferType |
29 | fromLocation |
30 | fromLocationType |
31 | toLocation |
32 | toLocationType |
33 | group |
34 | groupDescription |
35 | resourceType |
36 | instance |
37 | storageMedia |
38 | volumeType |
39 | maxVolumeSize |
40 | maxIopsvolume |
41 | maxThroughputvolume |
42 | volumeApiName |
43 | elasticGraphicsType |
44 | productType |
45 | maxIopsBurstPerformance |
46 | provisioned |
Displaying the contents of the dataframe that have categorical
data
with pd.option_context("display.max_columns", None):
display(df[categorical].head())
servicecode | location | locationType | instanceType | currentGeneration | instanceFamily | physicalProcessor | storage | networkPerformance | processorArchitecture | tenancy | operatingSystem | licenseModel | usagetype | operation | capacitystatus | dedicatedEbsThroughput | ecu | enhancedNetworkingSupported | instancesku | intelAvxAvailable | intelAvx2Available | intelTurboAvailable | preInstalledSw | processorFeatures | servicename | id | ebsOptimized | transferType | fromLocation | fromLocationType | toLocation | toLocationType | group | groupDescription | resourceType | instance | storageMedia | volumeType | maxVolumeSize | maxIopsvolume | maxThroughputvolume | volumeApiName | elasticGraphicsType | productType | maxIopsBurstPerformance | provisioned | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AmazonEC2 | Asia Pacific (Tokyo) | AWS Region | d2.xlarge | Yes | Storage optimized | Intel Xeon E5-2676 v3 (Haswell) | 3 x 2000 HDD | Moderate | 64-bit | Shared | Windows | No License required | APN1-UnusedBox:d2.xlarge | RunInstances:0002 | UnusedCapacityReservation | 750 Mbps | 14 | Yes | NZHXGSV3KSMEQT45 | Yes | Yes | Yes | NaN | Intel AVX; Intel AVX2; Intel Turbo | Amazon Elastic Compute Cloud | Y9FZ6K8HF7D54PQK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AmazonEC2 | EU (Frankfurt) | AWS Region | m5d.24xlarge | Yes | General purpose | Intel Xeon Platinum 8175 (Skylake) | 4 x 900 NVMe SSD | 25 Gigabit | 64-bit | Dedicated | Linux | No License required | EUC1-DedicatedRes:m5d.24xlarge | RunInstances:0200 | AllocatedCapacityReservation | 12000 Mbps | 337 | Yes | KVTTH3ZWAM6J3RS5 | Yes | Yes | Yes | SQL Web | Intel AVX; Intel AVX2; Intel AVX512; Intel Turbo | Amazon Elastic Compute Cloud | F4MU8VH495AU9238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | AmazonEC2 | US East (N. Virginia) | AWS Region | r5dn.24xlarge | Yes | Memory optimized | Intel Xeon Platinum 8259 (Cascade Lake) | 4 x 900 NVMe SSD | 100 Gigabit | 64-bit | Shared | Windows | No License required | BoxUsage:r5dn.24xlarge | RunInstances:0002 | Used | 12000 Mbps | NaN | No | NaN | No | No | No | NaN | NaN | Amazon Elastic Compute Cloud | KB8CP4XD7SBD3H3F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AmazonEC2 | US West (Oregon) | AWS Region | c3.2xlarge | No | Compute optimized | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2 x 80 SSD | High | 64-bit | Shared | Windows | No License required | USW2-UnusedBox:c3.2xlarge | RunInstances:0102 | UnusedCapacityReservation | NaN | 28 | Yes | GQDUSYAUZVGTEET2 | Yes | No | Yes | SQL Ent | Intel AVX; Intel Turbo | Amazon Elastic Compute Cloud | PTXRHKVQTWX2U9QH | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AmazonEC2 | Asia Pacific (Singapore) | AWS Region | m5dn.2xlarge | Yes | General purpose | Intel Xeon Platinum 8259 (Cascade Lake) | 1 x 300 NVMe SSD | Up to 25 Gigabit | 64-bit | Host | Windows | No License required | APS1-HostBoxUsage:m5dn.2xlarge | RunInstances:0002 | Used | Up to 3500 Mbps | NaN | No | NaN | No | No | No | NaN | NaN | Amazon Elastic Compute Cloud | GZ8NPDY8GZ37SBXN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Displaying the contents of the dataframe that have numerical
data
num_feat = [x for x in all_features if x not in categorical]
with pd.option_context("display.max_columns", None):
display(df[num_feat].head().fillna(0))
vcpu | normalizationSizeFactor | gpu | instanceCapacity12xlarge | instanceCapacity2xlarge | instanceCapacityLarge | instanceCapacityXlarge | physicalCores | instanceCapacity18xlarge | instanceCapacity4xlarge | instanceCapacity9xlarge | instanceCapacity10xlarge | instanceCapacity16xlarge | instanceCapacity8xlarge | instanceCapacity24xlarge | instanceCapacityMetal | instanceCapacityMedium | instanceCapacity32xlarge | memory_num | clockSpeed_num | gpuMemory_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.5 | 2.4 | 0.0 |
1 | 96.0 | 192.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 384.0 | 3.1 | 0.0 |
2 | 96.0 | 192.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 768.0 | 3.1 | 0.0 |
3 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 2.8 | 0.0 |
4 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 | 3.1 | 0.0 |
Graphs for unique value distribution
Compare the distribution of the existing features.
The function plot_bar_subplots
allows the user to make bar graphs that will represent the percentage of a particular value for a feature.
plt.rcParams.update({"figure.max_open_warning": 0})
def plot_bar_subplots(sub_df, width, height, bar_height):
features = sub_df.columns
height = height * len(features)
fig, ax = plt.subplots(len(features), 1, figsize=(width, height))
for i in range(len(features)):
feature_freqs = sub_df[features[i]].value_counts(normalize=True) * 100
feature_freq_df = pd.DataFrame(
feature_freqs.sort_index(inplace=False, ascending=True)
).reset_index()
g = sns.barplot(
x=features[i],
y="index",
data=feature_freq_df,
ax=ax[i],
orient="h",
dodge=False,
)
ax[i].set_ylabel(features[i], fontsize=20, va="center")
ax[i].tick_params(axis="y", which="minor")
ax[i].set_xlim(0, 100)
ax[i].set_xlabel("Normalized Value Counts (%)", fontsize=20)
ax[i].set_title(features[i], fontsize=30)
def change_height(ax, new_value):
for patch in ax.patches:
patch.set_height(new_value)
change_height(g, bar_height)
fig.tight_layout()
def plot_bar(dataframe, n):
features = dataframe.columns
is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
numerical_features = [x for x in features if df_features.T[x][0] is True]
categorical = [x for x in features if x not in numerical_features]
small_features = [
feature for feature in categorical if len(dataframe[feature].unique()) < n
]
large_features = [x for x in categorical if x not in small_features]
dataframe[numerical_features] = (
dataframe[numerical_features].replace(["NA", "Variable"], "0").astype("float")
)
## Plots for features with numerical values
plot_bar_subplots(dataframe[numerical_features], 30, 30, 0.6)
## Plots for features with many values
plot_bar_subplots(dataframe[large_features], 30, 120, 0.6)
## Plots for features with less values
plot_bar_subplots(dataframe[small_features], 40, 7, 0.6)
Implementing the function plot_bar
on the given dataset.
- Removing some "all unique" kind of features to avoid disturbance in the graphs.
df_graph = df.copy() # removing all unique values for simplification purposes
df_graph = df_graph.reset_index()
df_graph.drop(columns=["id", "instancesku", "usagetype", "index"], inplace=True)
df_graph.dropna()
dd = pd.DataFrame(df_graph)
plot_bar(dd, 30)
<Figure size 2160x0 with 0 Axes>
- The above shown bar graphs represent frequency distribution for each feature.
- The x-axis represents the number count or percentage of occurrences in the data for each column and can be used to visualize data distributions.
- The y-axis represents the various values a features may have.