ODH Logo

Analysis notebook :

This notebook is a longer and in-depth version of the EDA notebook for cloud-price-data and contains the following results :

  • Count of unique instances types
  • Type of feature: numerical, categorical, bag of words
  • Smple of feature values
  • Distrubition of features (unique values / feature)

Importing the required python libraries and the sample dataset from the data/processed folder

[1]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

df = pd.read_csv("../../data/processed/processed.csv")
d = df.copy()
sns.set(style="whitegrid")
/opt/app-root/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (30,61,63,64) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)

Dataframe for aws-pricing dataset.

[2]
df.head()
servicecode location locationType instanceType currentGeneration instanceFamily vcpu physicalProcessor storage networkPerformance ... instanceCapacityMetal elasticGraphicsType instanceCapacityMedium productType instanceCapacity32xlarge maxIopsBurstPerformance provisioned memory_num clockSpeed_num gpuMemory_num
0 AmazonEC2 Asia Pacific (Tokyo) AWS Region d2.xlarge Yes Storage optimized 4.0 Intel Xeon E5-2676 v3 (Haswell) 3 x 2000 HDD Moderate ... NaN NaN NaN NaN NaN NaN NaN 30.5 2.4 NaN
1 AmazonEC2 EU (Frankfurt) AWS Region m5d.24xlarge Yes General purpose 96.0 Intel Xeon Platinum 8175 (Skylake) 4 x 900 NVMe SSD 25 Gigabit ... NaN NaN NaN NaN NaN NaN NaN 384.0 3.1 NaN
2 AmazonEC2 US East (N. Virginia) AWS Region r5dn.24xlarge Yes Memory optimized 96.0 Intel Xeon Platinum 8259 (Cascade Lake) 4 x 900 NVMe SSD 100 Gigabit ... NaN NaN NaN NaN NaN NaN NaN 768.0 3.1 NaN
3 AmazonEC2 US West (Oregon) AWS Region c3.2xlarge No Compute optimized 8.0 Intel Xeon E5-2680 v2 (Ivy Bridge) 2 x 80 SSD High ... NaN NaN NaN NaN NaN NaN NaN 15.0 2.8 NaN
4 AmazonEC2 Asia Pacific (Singapore) AWS Region m5dn.2xlarge Yes General purpose 8.0 Intel Xeon Platinum 8259 (Cascade Lake) 1 x 300 NVMe SSD Up to 25 Gigabit ... NaN NaN NaN NaN NaN NaN NaN 32.0 3.1 NaN

5 rows × 68 columns

Unique instanceTypes

This dataframe (unique_instanceType_frame) displays all the unique instanceType(s) in this dataset.

[3]
unique_instancetype = df["instanceType"].unique()

Length of the list of all unique instanceType(s) gives us total number of unique instanceType(s) in this sample of the dataset.

[4]
len(unique_instancetype)
351

Convert the list to a dataframe for better understanding of contents

[5]
unique_instancetype_frame = pd.DataFrame(np.array(unique_instancetype))
unique_instancetype_frame.columns = ["unique_instancetype"]
display(unique_instancetype_frame)
unique_instancetype
0 d2.xlarge
1 m5d.24xlarge
2 r5dn.24xlarge
3 c3.2xlarge
4 m5dn.2xlarge
... ...
346 h1
347 u-6tb1
348 m3
349 m5n
350 x1e

351 rows × 1 columns

Total number of unique instanceType : 351

df_instanceType : DataFrame with counts of unique elements in each position.

[6]
df_instancetype = df.groupby("instanceType").nunique()
df_instancetype
servicecode location locationType instanceType currentGeneration instanceFamily vcpu physicalProcessor storage networkPerformance ... instanceCapacityMetal elasticGraphicsType instanceCapacityMedium productType instanceCapacity32xlarge maxIopsBurstPerformance provisioned memory_num clockSpeed_num gpuMemory_num
instanceType
a1 1 3 1 1 1 1 1 1 0 0 ... 0 0 1 0 0 0 0 1 1 0
a1.2xlarge 1 8 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
a1.4xlarge 1 7 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
a1.large 1 6 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
a1.medium 1 9 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
z1d.3xlarge 1 12 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
z1d.6xlarge 1 12 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
z1d.large 1 12 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
z1d.metal 1 12 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0
z1d.xlarge 1 12 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 1 0

350 rows × 68 columns

Numerical Features

Below shown table shows if the values in the column are numeric(True) or not(False)

[7]
is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
df_features = pd.DataFrame(is_numeric_col)
df_features = df_features.rename(columns={0: "isNumeric"})
with pd.option_context("display.max_rows", None):
    display(df_features)
isNumeric
servicecode False
location False
locationType False
instanceType False
currentGeneration False
instanceFamily False
vcpu True
physicalProcessor False
storage False
networkPerformance False
processorArchitecture False
tenancy False
operatingSystem False
licenseModel False
usagetype False
operation False
capacitystatus False
dedicatedEbsThroughput False
ecu False
enhancedNetworkingSupported False
instancesku False
intelAvxAvailable False
intelAvx2Available False
intelTurboAvailable False
normalizationSizeFactor True
preInstalledSw False
processorFeatures False
servicename False
id False
gpu True
ebsOptimized False
transferType False
fromLocation False
fromLocationType False
toLocation False
toLocationType False
group False
groupDescription False
resourceType False
instance False
instanceCapacity12xlarge True
instanceCapacity2xlarge True
instanceCapacityLarge True
instanceCapacityXlarge True
physicalCores True
instanceCapacity18xlarge True
instanceCapacity4xlarge True
instanceCapacity9xlarge True
instanceCapacity10xlarge True
instanceCapacity16xlarge True
instanceCapacity8xlarge True
instanceCapacity24xlarge True
storageMedia False
volumeType False
maxVolumeSize False
maxIopsvolume False
maxThroughputvolume False
volumeApiName False
instanceCapacityMetal True
elasticGraphicsType False
instanceCapacityMedium True
productType False
instanceCapacity32xlarge True
maxIopsBurstPerformance False
provisioned False
memory_num True
clockSpeed_num True
gpuMemory_num True
[8]
all_features = df_features.T.columns

From the above dataframe, extracting only the numerical features into the list numerical_fetaures and further making a dataframe - num_feat for better undertanding of the contents.

[9]
numerical_features = is_numeric_col[is_numeric_col]
n = numerical_features.index
num_feat = pd.DataFrame(n)
num_feat.rename(columns={0: "Numerical Features"})
Numerical Features
0 vcpu
1 normalizationSizeFactor
2 gpu
3 instanceCapacity12xlarge
4 instanceCapacity2xlarge
5 instanceCapacityLarge
6 instanceCapacityXlarge
7 physicalCores
8 instanceCapacity18xlarge
9 instanceCapacity4xlarge
10 instanceCapacity9xlarge
11 instanceCapacity10xlarge
12 instanceCapacity16xlarge
13 instanceCapacity8xlarge
14 instanceCapacity24xlarge
15 instanceCapacityMetal
16 instanceCapacityMedium
17 instanceCapacity32xlarge
18 memory_num
19 clockSpeed_num
20 gpuMemory_num

From the list of all features, all the features that are not numerical will be categorical features. Adding them to the list categorial and further making a dataframe - cat_feat for better understandin of the contents.

[10]
categorical = [x for x in all_features if x not in numerical_features]
cat_feat = pd.DataFrame(categorical)
with pd.option_context("display.max_rows", None):
    display(cat_feat.rename(columns={0: "Categorical Features"}))
Categorical Features
0 servicecode
1 location
2 locationType
3 instanceType
4 currentGeneration
5 instanceFamily
6 physicalProcessor
7 storage
8 networkPerformance
9 processorArchitecture
10 tenancy
11 operatingSystem
12 licenseModel
13 usagetype
14 operation
15 capacitystatus
16 dedicatedEbsThroughput
17 ecu
18 enhancedNetworkingSupported
19 instancesku
20 intelAvxAvailable
21 intelAvx2Available
22 intelTurboAvailable
23 preInstalledSw
24 processorFeatures
25 servicename
26 id
27 ebsOptimized
28 transferType
29 fromLocation
30 fromLocationType
31 toLocation
32 toLocationType
33 group
34 groupDescription
35 resourceType
36 instance
37 storageMedia
38 volumeType
39 maxVolumeSize
40 maxIopsvolume
41 maxThroughputvolume
42 volumeApiName
43 elasticGraphicsType
44 productType
45 maxIopsBurstPerformance
46 provisioned

Displaying the contents of the dataframe that have categorical data

[11]
with pd.option_context("display.max_columns", None):
    display(df[categorical].head())
servicecode location locationType instanceType currentGeneration instanceFamily physicalProcessor storage networkPerformance processorArchitecture tenancy operatingSystem licenseModel usagetype operation capacitystatus dedicatedEbsThroughput ecu enhancedNetworkingSupported instancesku intelAvxAvailable intelAvx2Available intelTurboAvailable preInstalledSw processorFeatures servicename id ebsOptimized transferType fromLocation fromLocationType toLocation toLocationType group groupDescription resourceType instance storageMedia volumeType maxVolumeSize maxIopsvolume maxThroughputvolume volumeApiName elasticGraphicsType productType maxIopsBurstPerformance provisioned
0 AmazonEC2 Asia Pacific (Tokyo) AWS Region d2.xlarge Yes Storage optimized Intel Xeon E5-2676 v3 (Haswell) 3 x 2000 HDD Moderate 64-bit Shared Windows No License required APN1-UnusedBox:d2.xlarge RunInstances:0002 UnusedCapacityReservation 750 Mbps 14 Yes NZHXGSV3KSMEQT45 Yes Yes Yes NaN Intel AVX; Intel AVX2; Intel Turbo Amazon Elastic Compute Cloud Y9FZ6K8HF7D54PQK NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 AmazonEC2 EU (Frankfurt) AWS Region m5d.24xlarge Yes General purpose Intel Xeon Platinum 8175 (Skylake) 4 x 900 NVMe SSD 25 Gigabit 64-bit Dedicated Linux No License required EUC1-DedicatedRes:m5d.24xlarge RunInstances:0200 AllocatedCapacityReservation 12000 Mbps 337 Yes KVTTH3ZWAM6J3RS5 Yes Yes Yes SQL Web Intel AVX; Intel AVX2; Intel AVX512; Intel Turbo Amazon Elastic Compute Cloud F4MU8VH495AU9238 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 AmazonEC2 US East (N. Virginia) AWS Region r5dn.24xlarge Yes Memory optimized Intel Xeon Platinum 8259 (Cascade Lake) 4 x 900 NVMe SSD 100 Gigabit 64-bit Shared Windows No License required BoxUsage:r5dn.24xlarge RunInstances:0002 Used 12000 Mbps NaN No NaN No No No NaN NaN Amazon Elastic Compute Cloud KB8CP4XD7SBD3H3F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 AmazonEC2 US West (Oregon) AWS Region c3.2xlarge No Compute optimized Intel Xeon E5-2680 v2 (Ivy Bridge) 2 x 80 SSD High 64-bit Shared Windows No License required USW2-UnusedBox:c3.2xlarge RunInstances:0102 UnusedCapacityReservation NaN 28 Yes GQDUSYAUZVGTEET2 Yes No Yes SQL Ent Intel AVX; Intel Turbo Amazon Elastic Compute Cloud PTXRHKVQTWX2U9QH NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 AmazonEC2 Asia Pacific (Singapore) AWS Region m5dn.2xlarge Yes General purpose Intel Xeon Platinum 8259 (Cascade Lake) 1 x 300 NVMe SSD Up to 25 Gigabit 64-bit Host Windows No License required APS1-HostBoxUsage:m5dn.2xlarge RunInstances:0002 Used Up to 3500 Mbps NaN No NaN No No No NaN NaN Amazon Elastic Compute Cloud GZ8NPDY8GZ37SBXN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Displaying the contents of the dataframe that have numerical data

[12]
num_feat = [x for x in all_features if x not in categorical]
with pd.option_context("display.max_columns", None):
    display(df[num_feat].head().fillna(0))
vcpu normalizationSizeFactor gpu instanceCapacity12xlarge instanceCapacity2xlarge instanceCapacityLarge instanceCapacityXlarge physicalCores instanceCapacity18xlarge instanceCapacity4xlarge instanceCapacity9xlarge instanceCapacity10xlarge instanceCapacity16xlarge instanceCapacity8xlarge instanceCapacity24xlarge instanceCapacityMetal instanceCapacityMedium instanceCapacity32xlarge memory_num clockSpeed_num gpuMemory_num
0 4.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 30.5 2.4 0.0
1 96.0 192.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 384.0 3.1 0.0
2 96.0 192.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 768.0 3.1 0.0
3 8.0 16.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 15.0 2.8 0.0
4 8.0 16.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 32.0 3.1 0.0

Graphs for unique value distribution

Compare the distribution of the existing features.

The function plot_bar_subplots allows the user to make bar graphs that will represent the percentage of a particular value for a feature.

[13]
plt.rcParams.update({"figure.max_open_warning": 0})
[14]
def plot_bar_subplots(sub_df, width, height, bar_height):
    features = sub_df.columns
    height = height * len(features)
    fig, ax = plt.subplots(len(features), 1, figsize=(width, height))
    for i in range(len(features)):
        feature_freqs = sub_df[features[i]].value_counts(normalize=True) * 100
        feature_freq_df = pd.DataFrame(
            feature_freqs.sort_index(inplace=False, ascending=True)
        ).reset_index()
        g = sns.barplot(
            x=features[i],
            y="index",
            data=feature_freq_df,
            ax=ax[i],
            orient="h",
            dodge=False,
        )

        ax[i].set_ylabel(features[i], fontsize=20, va="center")
        ax[i].tick_params(axis="y", which="minor")
        ax[i].set_xlim(0, 100)
        ax[i].set_xlabel("Normalized Value Counts (%)", fontsize=20)
        ax[i].set_title(features[i], fontsize=30)

        def change_height(ax, new_value):
            for patch in ax.patches:
                patch.set_height(new_value)

        change_height(g, bar_height)
    fig.tight_layout()


def plot_bar(dataframe, n):
    features = dataframe.columns
    is_numeric_col = d.dtypes.apply(lambda x: np.issubdtype(x, np.number))
    df_features = pd.DataFrame(is_numeric_col)
    numerical_features = [x for x in features if df_features.T[x][0] is True]
    categorical = [x for x in features if x not in numerical_features]

    small_features = [
        feature for feature in categorical if len(dataframe[feature].unique()) < n
    ]
    large_features = [x for x in categorical if x not in small_features]
    dataframe[numerical_features] = (
        dataframe[numerical_features].replace(["NA", "Variable"], "0").astype("float")
    )

    ## Plots for features with numerical values
    plot_bar_subplots(dataframe[numerical_features], 30, 30, 0.6)

    ## Plots for features with many values
    plot_bar_subplots(dataframe[large_features], 30, 120, 0.6)

    ## Plots for features with less values
    plot_bar_subplots(dataframe[small_features], 40, 7, 0.6)

Implementing the function plot_bar on the given dataset.

  • Removing some "all unique" kind of features to avoid disturbance in the graphs.
[15]
df_graph = df.copy()  # removing all unique values for simplification purposes
df_graph = df_graph.reset_index()
df_graph.drop(columns=["id", "instancesku", "usagetype", "index"], inplace=True)
df_graph.dropna()
dd = pd.DataFrame(df_graph)
plot_bar(dd, 30)
<Figure size 2160x0 with 0 Axes>
  • The above shown bar graphs represent frequency distribution for each feature.
  • The x-axis represents the number count or percentage of occurrences in the data for each column and can be used to visualize data distributions.
  • The y-axis represents the various values a features may have.
[ ]