Analysis notebook :

This notebook is a longer and in-depth version of the EDA notebook for cloud-price-data and contains the following results :

Count of unique instances types
Type of feature: numerical, categorical, bag of words
Smple of feature values
Distrubition of features (unique values / feature)

Importing the required python libraries and the sample dataset from the data/processed folder

[1]

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

df = pd.read_csv("../../data/processed/processed.csv")
d = df.copy()
sns.set(style="whitegrid")

/opt/app-root/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (30,61,63,64) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Dataframe for aws-pricing dataset.

[2]

df.head()

	servicecode	location	locationType	instanceType	currentGeneration	instanceFamily	vcpu	physicalProcessor	storage	networkPerformance	...	instanceCapacityMetal	elasticGraphicsType	instanceCapacityMedium	productType	instanceCapacity32xlarge	maxIopsBurstPerformance	provisioned	memory_num	clockSpeed_num	gpuMemory_num
0	AmazonEC2	Asia Pacific (Tokyo)	AWS Region	d2.xlarge	Yes	Storage optimized	4.0	Intel Xeon E5-2676 v3 (Haswell)	3 x 2000 HDD	Moderate	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	30.5	2.4	NaN
1	AmazonEC2	EU (Frankfurt)	AWS Region	m5d.24xlarge	Yes	General purpose	96.0	Intel Xeon Platinum 8175 (Skylake)	4 x 900 NVMe SSD	25 Gigabit	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	384.0	3.1	NaN
2	AmazonEC2	US East (N. Virginia)	AWS Region	r5dn.24xlarge	Yes	Memory optimized	96.0	Intel Xeon Platinum 8259 (Cascade Lake)	4 x 900 NVMe SSD	100 Gigabit	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	768.0	3.1	NaN
3	AmazonEC2	US West (Oregon)	AWS Region	c3.2xlarge	No	Compute optimized	8.0	Intel Xeon E5-2680 v2 (Ivy Bridge)	2 x 80 SSD	High	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15.0	2.8	NaN
4	AmazonEC2	Asia Pacific (Singapore)	AWS Region	m5dn.2xlarge	Yes	General purpose	8.0	Intel Xeon Platinum 8259 (Cascade Lake)	1 x 300 NVMe SSD	Up to 25 Gigabit	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	32.0	3.1	NaN

5 rows × 68 columns

Unique instanceTypes

This dataframe (unique_instanceType_frame) displays all the unique instanceType(s) in this dataset.

[3]

unique_instancetype = df["instanceType"].unique()

Length of the list of all unique instanceType(s) gives us total number of unique instanceType(s) in this sample of the dataset.

[4]

len(unique_instancetype)

Convert the list to a dataframe for better understanding of contents

[5]

unique_instancetype_frame = pd.DataFrame(np.array(unique_instancetype))
unique_instancetype_frame.columns = ["unique_instancetype"]
display(unique_instancetype_frame)

	unique_instancetype
0	d2.xlarge
1	m5d.24xlarge
2	r5dn.24xlarge
3	c3.2xlarge
4	m5dn.2xlarge
...	...
346	h1
347	u-6tb1
348	m3
349	m5n
350	x1e

351 rows × 1 columns

Total number of unique instanceType : 351

df_instanceType : DataFrame with counts of unique elements in each position.

[6]

df_instancetype = df.groupby("instanceType").nunique()
df_instancetype

	servicecode	location	locationType	instanceType	currentGeneration	instanceFamily	vcpu	physicalProcessor	storage	networkPerformance	...	instanceCapacityMetal	elasticGraphicsType	instanceCapacityMedium	productType	instanceCapacity32xlarge	maxIopsBurstPerformance	provisioned	memory_num	clockSpeed_num	gpuMemory_num
instanceType
a1	1	3	1	1	1	1	1	1	0	0	...	0	0	1	0	0	0	0	1	1	0
a1.2xlarge	1	8	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
a1.4xlarge	1	7	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
a1.large	1	6	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
a1.medium	1	9	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
z1d.3xlarge	1	12	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
z1d.6xlarge	1	12	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
z1d.large	1	12	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
z1d.metal	1	12	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0
z1d.xlarge	1	12	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	1	1	0

350 rows × 68 columns

Numerical Features

Below shown table shows if the values in the column are numeric(True) or not(False)

[7]

is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
df_features = df_features.rename(columns={0: "isNumeric"})
with pd.option_context("display.max_rows", None):
    display(df_features)

	isNumeric
servicecode	False
location	False
locationType	False
instanceType	False
currentGeneration	False
instanceFamily	False
vcpu	True
physicalProcessor	False
storage	False
networkPerformance	False
processorArchitecture	False
tenancy	False
operatingSystem	False
licenseModel	False
usagetype	False
operation	False
capacitystatus	False
dedicatedEbsThroughput	False
ecu	False
enhancedNetworkingSupported	False
instancesku	False
intelAvxAvailable	False
intelAvx2Available	False
intelTurboAvailable	False
normalizationSizeFactor	True
preInstalledSw	False
processorFeatures	False
servicename	False
id	False
gpu	True
ebsOptimized	False
transferType	False
fromLocation	False
fromLocationType	False
toLocation	False
toLocationType	False
group	False
groupDescription	False
resourceType	False
instance	False
instanceCapacity12xlarge	True
instanceCapacity2xlarge	True
instanceCapacityLarge	True
instanceCapacityXlarge	True
physicalCores	True
instanceCapacity18xlarge	True
instanceCapacity4xlarge	True
instanceCapacity9xlarge	True
instanceCapacity10xlarge	True
instanceCapacity16xlarge	True
instanceCapacity8xlarge	True
instanceCapacity24xlarge	True
storageMedia	False
volumeType	False
maxVolumeSize	False
maxIopsvolume	False
maxThroughputvolume	False
volumeApiName	False
instanceCapacityMetal	True
elasticGraphicsType	False
instanceCapacityMedium	True
productType	False
instanceCapacity32xlarge	True
maxIopsBurstPerformance	False
provisioned	False
memory_num	True
clockSpeed_num	True
gpuMemory_num	True

[8]

all_features = df_features.T.columns

From the above dataframe, extracting only the numerical features into the list numerical_fetaures and further making a dataframe - num_feat for better undertanding of the contents.

[9]

numerical_features = is_numeric_col[is_numeric_col]
n = numerical_features.index
num_feat = pd.DataFrame(n)
num_feat.rename(columns={0: "Numerical Features"})

	Numerical Features
0	vcpu
1	normalizationSizeFactor
2	gpu
3	instanceCapacity12xlarge
4	instanceCapacity2xlarge
5	instanceCapacityLarge
6	instanceCapacityXlarge
7	physicalCores
8	instanceCapacity18xlarge
9	instanceCapacity4xlarge
10	instanceCapacity9xlarge
11	instanceCapacity10xlarge
12	instanceCapacity16xlarge
13	instanceCapacity8xlarge
14	instanceCapacity24xlarge
15	instanceCapacityMetal
16	instanceCapacityMedium
17	instanceCapacity32xlarge
18	memory_num
19	clockSpeed_num
20	gpuMemory_num

From the list of all features, all the features that are not numerical will be categorical features. Adding them to the list categorial and further making a dataframe - cat_feat for better understandin of the contents.

[10]

categorical = [x for x in all_features if x not in numerical_features]
cat_feat = pd.DataFrame(categorical)
with pd.option_context("display.max_rows", None):
    display(cat_feat.rename(columns={0: "Categorical Features"}))

	Categorical Features
0	servicecode
1	location
2	locationType
3	instanceType
4	currentGeneration
5	instanceFamily
6	physicalProcessor
7	storage
8	networkPerformance
9	processorArchitecture
10	tenancy
11	operatingSystem
12	licenseModel
13	usagetype
14	operation
15	capacitystatus
16	dedicatedEbsThroughput
17	ecu
18	enhancedNetworkingSupported
19	instancesku
20	intelAvxAvailable
21	intelAvx2Available
22	intelTurboAvailable
23	preInstalledSw
24	processorFeatures
25	servicename
26	id
27	ebsOptimized
28	transferType
29	fromLocation
30	fromLocationType
31	toLocation
32	toLocationType
33	group
34	groupDescription
35	resourceType
36	instance
37	storageMedia
38	volumeType
39	maxVolumeSize
40	maxIopsvolume
41	maxThroughputvolume
42	volumeApiName
43	elasticGraphicsType
44	productType
45	maxIopsBurstPerformance
46	provisioned

Displaying the contents of the dataframe that have categorical data

[11]

with pd.option_context("display.max_columns", None):
    display(df[categorical].head())

	servicecode	location	locationType	instanceType	currentGeneration	instanceFamily	physicalProcessor	storage	networkPerformance	processorArchitecture	tenancy	operatingSystem	licenseModel	usagetype	operation	capacitystatus	dedicatedEbsThroughput	ecu	enhancedNetworkingSupported	instancesku	intelAvxAvailable	intelAvx2Available	intelTurboAvailable	preInstalledSw	processorFeatures	servicename	id	ebsOptimized	transferType	fromLocation	fromLocationType	toLocation	toLocationType	group	groupDescription	resourceType	instance	storageMedia	volumeType	maxVolumeSize	maxIopsvolume	maxThroughputvolume	volumeApiName	elasticGraphicsType	productType	maxIopsBurstPerformance	provisioned
0	AmazonEC2	Asia Pacific (Tokyo)	AWS Region	d2.xlarge	Yes	Storage optimized	Intel Xeon E5-2676 v3 (Haswell)	3 x 2000 HDD	Moderate	64-bit	Shared	Windows	No License required	APN1-UnusedBox:d2.xlarge	RunInstances:0002	UnusedCapacityReservation	750 Mbps	14	Yes	NZHXGSV3KSMEQT45	Yes	Yes	Yes	NaN	Intel AVX; Intel AVX2; Intel Turbo	Amazon Elastic Compute Cloud	Y9FZ6K8HF7D54PQK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	AmazonEC2	EU (Frankfurt)	AWS Region	m5d.24xlarge	Yes	General purpose	Intel Xeon Platinum 8175 (Skylake)	4 x 900 NVMe SSD	25 Gigabit	64-bit	Dedicated	Linux	No License required	EUC1-DedicatedRes:m5d.24xlarge	RunInstances:0200	AllocatedCapacityReservation	12000 Mbps	337	Yes	KVTTH3ZWAM6J3RS5	Yes	Yes	Yes	SQL Web	Intel AVX; Intel AVX2; Intel AVX512; Intel Turbo	Amazon Elastic Compute Cloud	F4MU8VH495AU9238	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	AmazonEC2	US East (N. Virginia)	AWS Region	r5dn.24xlarge	Yes	Memory optimized	Intel Xeon Platinum 8259 (Cascade Lake)	4 x 900 NVMe SSD	100 Gigabit	64-bit	Shared	Windows	No License required	BoxUsage:r5dn.24xlarge	RunInstances:0002	Used	12000 Mbps	NaN	No	NaN	No	No	No	NaN	NaN	Amazon Elastic Compute Cloud	KB8CP4XD7SBD3H3F	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	AmazonEC2	US West (Oregon)	AWS Region	c3.2xlarge	No	Compute optimized	Intel Xeon E5-2680 v2 (Ivy Bridge)	2 x 80 SSD	High	64-bit	Shared	Windows	No License required	USW2-UnusedBox:c3.2xlarge	RunInstances:0102	UnusedCapacityReservation	NaN	28	Yes	GQDUSYAUZVGTEET2	Yes	No	Yes	SQL Ent	Intel AVX; Intel Turbo	Amazon Elastic Compute Cloud	PTXRHKVQTWX2U9QH	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	AmazonEC2	Asia Pacific (Singapore)	AWS Region	m5dn.2xlarge	Yes	General purpose	Intel Xeon Platinum 8259 (Cascade Lake)	1 x 300 NVMe SSD	Up to 25 Gigabit	64-bit	Host	Windows	No License required	APS1-HostBoxUsage:m5dn.2xlarge	RunInstances:0002	Used	Up to 3500 Mbps	NaN	No	NaN	No	No	No	NaN	NaN	Amazon Elastic Compute Cloud	GZ8NPDY8GZ37SBXN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Displaying the contents of the dataframe that have numerical data

[12]

num_feat = [x for x in all_features if x not in categorical]
with pd.option_context("display.max_columns", None):
    display(df[num_feat].head().fillna(0))

	vcpu	normalizationSizeFactor	memory_num	clockSpeed_num
0	4.0	8.0	30.5	2.4
1	96.0	192.0	384.0	3.1
2	96.0	192.0	768.0	3.1
3	8.0	16.0	15.0	2.8
4	8.0	16.0	32.0	3.1

Graphs for unique value distribution

Compare the distribution of the existing features.

The function plot_bar_subplots allows the user to make bar graphs that will represent the percentage of a particular value for a feature.

[13]

plt.rcParams.update({"figure.max_open_warning": 0})

[14]

def plot_bar_subplots(sub_df, width, height, bar_height):
    features = sub_df.columns
    height = height * len(features)
    fig, ax = plt.subplots(len(features), 1, figsize=(width, height))
    for i in range(len(features)):
        feature_freqs = sub_df[features[i]].value_counts(normalize=True) * 100
        feature_freq_df = pd.DataFrame(
            feature_freqs.sort_index(inplace=False, ascending=True)
        ).reset_index()
        g = sns.barplot(
            x=features[i],
            y="index",
            data=feature_freq_df,
            ax=ax[i],
            orient="h",
            dodge=False,
        )

        ax[i].set_ylabel(features[i], fontsize=20, va="center")
        ax[i].tick_params(axis="y", which="minor")
        ax[i].set_xlim(0, 100)
        ax[i].set_xlabel("Normalized Value Counts (%)", fontsize=20)
        ax[i].set_title(features[i], fontsize=30)

        def change_height(ax, new_value):
            for patch in ax.patches:
                patch.set_height(new_value)

        change_height(g, bar_height)
    fig.tight_layout()


def plot_bar(dataframe, n):
    features = dataframe.columns
    is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
    df_features = pd.DataFrame(is_numeric_col)
    numerical_features = [x for x in features if df_features.T[x][0] is True]
    categorical = [x for x in features if x not in numerical_features]

    small_features = [
        feature for feature in categorical if len(dataframe[feature].unique()) < n
    ]
    large_features = [x for x in categorical if x not in small_features]
    dataframe[numerical_features] = (
        dataframe[numerical_features].replace(["NA", "Variable"], "0").astype("float")
    )

    ## Plots for features with numerical values
    plot_bar_subplots(dataframe[numerical_features], 30, 30, 0.6)

    ## Plots for features with many values
    plot_bar_subplots(dataframe[large_features], 30, 120, 0.6)

    ## Plots for features with less values
    plot_bar_subplots(dataframe[small_features], 40, 7, 0.6)

Implementing the function plot_bar on the given dataset.

Removing some "all unique" kind of features to avoid disturbance in the graphs.

[15]

df_graph = df.copy()  # removing all unique values for simplification purposes
df_graph = df_graph.reset_index()
df_graph.drop(columns=["id", "instancesku", "usagetype", "index"], inplace=True)
df_graph.dropna()
dd = pd.DataFrame(df_graph)
plot_bar(dd, 30)

<Figure size 2160x0 with 0 Axes>

The above shown bar graphs represent frequency distribution for each feature.
The x-axis represents the number count or percentage of occurrences in the data for each column and can be used to visualize data distributions.
The y-axis represents the various values a features may have.

[ ]