Cloud Price Analysis
Given a dataset describing the cloud needs of Red Hat and the cloud services offered by AWS, build a cost-optimization model that helps the user decide how cloud services should be managed over time. The model should be configurable and automated, and it should display the detailed reasoning behind each prediction.
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
Data Downloading
- To access the public AWS cloud-pricing dataset, download the data manually and save it in the data/raw folder.
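The manual download can also be scripted. A minimal sketch, assuming the standard bulk-offer URL of the public AWS Price List API (the exact URL and the `download_offer_file` helper are assumptions and should be checked against the AWS documentation):

```python
import pathlib
import urllib.request

# Bulk offer file for EC2 from the public AWS Price List API
# (assumed URL, based on the documented bulk-API pattern).
OFFER_URL = (
    "https://pricing.us-east-1.amazonaws.com"
    "/offers/v1.0/aws/AmazonEC2/current/index.json"
)


def download_offer_file(dest="../../data/raw/index.json"):
    """Download the EC2 offer file into the raw-data folder."""
    path = pathlib.Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(OFFER_URL, path)  # network call
    return path
```

Calling `download_offer_file()` once before the notebook run replaces the manual step.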
df = pd.read_json("../../data/raw/index.json")
Data Understanding and Preprocessing
- After downloading the data, let's look at our dataset and see how it is structured.
- We'll start by looking at the first few rows in order to understand the structure.
- Then, we'll study each feature in detail.
features = {
"formatVersion",
"disclaimer",
"offerCode",
"version",
"publicationDate",
"products",
"terms",
}
df.head()
formatVersion | disclaimer | offerCode | version | publicationDate | products | terms | |
---|---|---|---|---|---|---|---|
SZ472S2RBJNMG88E | v1.0 | This pricing list is for informational purpose... | AmazonEC2 | 20200618221809 | 2020-06-18T22:18:09Z | {'sku': 'SZ472S2RBJNMG88E', 'productFamily': '... | NaN |
XA55FCPSHD2ABUC5 | v1.0 | This pricing list is for informational purpose... | AmazonEC2 | 20200618221809 | 2020-06-18T22:18:09Z | {'sku': 'XA55FCPSHD2ABUC5', 'productFamily': '... | NaN |
JF25X344J7WKTCG7 | v1.0 | This pricing list is for informational purpose... | AmazonEC2 | 20200618221809 | 2020-06-18T22:18:09Z | {'sku': 'JF25X344J7WKTCG7', 'productFamily': '... | NaN |
JA7STMZBVD3T59GQ | v1.0 | This pricing list is for informational purpose... | AmazonEC2 | 20200618221809 | 2020-06-18T22:18:09Z | {'sku': 'JA7STMZBVD3T59GQ', 'productFamily': '... | NaN |
SJQU8C5FNVFYMK7X | v1.0 | This pricing list is for informational purpose... | AmazonEC2 | 20200618221809 | 2020-06-18T22:18:09Z | {'sku': 'SJQU8C5FNVFYMK7X', 'productFamily': '... | NaN |
print("Shape of the data\n\n", df.shape)
print("\n\n")
print("Is null?\n\n", df.isnull().sum())
print("\n\n")
print("Columnwise Datatype\n\n", df.dtypes)
Shape of the data
(244351, 7)
Is null?
formatVersion 0
disclaimer 0
offerCode 0
version 0
publicationDate 0
products 2
terms 244349
dtype: int64
Columnwise Datatype
formatVersion object
disclaimer object
offerCode object
version int64
publicationDate object
products object
terms object
dtype: object
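The null counts show that terms is empty in all but two rows, so near-empty columns can be pruned before analysis. A hedged sketch with a hypothetical `drop_sparse_columns` helper, illustrated on toy data:

```python
import pandas as pd


def drop_sparse_columns(frame, max_null_frac=0.99):
    """Drop columns whose fraction of nulls exceeds the threshold."""
    null_frac = frame.isnull().mean()
    keep = null_frac[null_frac <= max_null_frac].index
    return frame[keep]


# Toy frame: 'terms' is mostly null, 'products' is fully populated.
toy = pd.DataFrame({
    "products": [{"a": 1}, {"b": 2}, {"c": 3}],
    "terms": [None, None, {"x": 1}],
})

pruned = drop_sparse_columns(toy, max_null_frac=0.5)  # drops 'terms'
```

With the default 0.99 threshold the real `terms` column (244349 / 244351 nulls) would be dropped while everything else survives.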
Looking at the dataframe, we can infer that the products column carries most of the information. Let's dive deeper to see what it holds:
df_products = df["products"]
df_products = df_products.fillna("NA")
Each product has three fields:
- SKU : id
- productFamily
- attributes
Let's look into productFamily and attributes to see what values they hold.
dfp = df_products.sample(
2400
) # make a subset of 2400 entries for easier and smaller calculations
dfp_normalize = pd.json_normalize(dfp)
(dfp_normalize).fillna("NA")
sku | productFamily | attributes.servicecode | attributes.location | attributes.locationType | attributes.instanceType | attributes.currentGeneration | attributes.instanceFamily | attributes.vcpu | attributes.physicalProcessor | ... | attributes.fromLocation | attributes.fromLocationType | attributes.toLocation | attributes.toLocationType | attributes.instanceCapacity12xlarge | attributes.instanceCapacity16xlarge | attributes.instanceCapacity24xlarge | attributes.instanceCapacity8xlarge | attributes.instanceCapacityMetal | attributes.instanceCapacity32xlarge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FJU52S3ATSH3Y945 | Compute Instance | AmazonEC2 | Asia Pacific (Seoul) | AWS Region | x1e.16xlarge | Yes | Memory optimized | 64 | High Frequency Intel Xeon E7-8880 v3 (Haswell) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
1 | MBBTDY5VEVG5H845 | Compute Instance | AmazonEC2 | EU (Frankfurt) | AWS Region | r5ad.8xlarge | Yes | Memory optimized | 32 | AMD EPYC 7571 | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2 | NT9QA9FJBS94KCPH | Compute Instance | AmazonEC2 | Canada (Central) | AWS Region | m4.large | Yes | General purpose | 2 | Intel Xeon E5-2676 v3 (Haswell) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3 | S5H8KFK35KAENKEY | Compute Instance | AmazonEC2 | South America (Sao Paulo) | AWS Region | r5.2xlarge | Yes | Memory optimized | 8 | Intel Xeon Platinum 8175 (Skylake) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
4 | C28SA47ENKTE9M8H | Compute Instance | AmazonEC2 | Asia Pacific (Hong Kong) | AWS Region | c5d.18xlarge | Yes | Compute optimized | 72 | Intel Xeon Platinum 8124M | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2395 | 38VVR2NU9XNW87J2 | Compute Instance | AmazonEC2 | Asia Pacific (Sydney) | AWS Region | m5d.xlarge | Yes | General purpose | 4 | Intel Xeon Platinum 8175 (Skylake) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2396 | 3F3RZVH8KPMQSXTR | Compute Instance | AmazonEC2 | AWS GovCloud (US-East) | AWS Region | c5d.xlarge | Yes | Compute optimized | 4 | Intel Xeon Platinum 8124M | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2397 | 64ZQWH349DXXHAF4 | Compute Instance | AmazonEC2 | US West (Oregon) | AWS Region | r5d.4xlarge | Yes | Memory optimized | 16 | Intel Xeon Platinum 8175 (Skylake) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2398 | 37EDKZP2P4GWPTMA | Compute Instance | AmazonEC2 | EU (Frankfurt) | AWS Region | r5n.large | Yes | Memory optimized | 2 | Intel Xeon Platinum 8259 (Cascade Lake) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2399 | A6JDG8R4DWNGZGJZ | Compute Instance | AmazonEC2 | Asia Pacific (Singapore) | AWS Region | i3.2xlarge | Yes | Storage optimized | 8 | Intel Xeon E5-2686 v4 (Broadwell) | ... | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2400 rows × 54 columns
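pd.json_normalize flattens nested dictionaries into dotted columns; per the pandas docs it expects a dict or list of dicts, so passing dfp.tolist() rather than the Series itself is the safest input shape. A self-contained toy illustration:

```python
import pandas as pd

# Two toy records shaped like entries of the "products" column.
records = [
    {"sku": "A1", "productFamily": "Compute Instance",
     "attributes": {"vcpu": "4", "memory": "16 GiB"}},
    {"sku": "B2", "productFamily": "Storage Snapshot",
     "attributes": {"storageMedia": "Amazon S3"}},
]

flat = pd.json_normalize(records)
# Nested keys become dotted columns, e.g. 'attributes.vcpu';
# records lacking a key get NaN in that column.
```

This is why the normalized frame above has 54 columns even though each individual product carries only a handful of attributes.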
What productFamily values do the products have?
pd.Series((dfp_normalize["productFamily"])).unique()
array(['Compute Instance', 'Compute Instance (bare metal)',
'Dedicated Host', 'NAT Gateway', 'Storage Snapshot',
'Data Transfer', 'IP Address', 'Load Balancer-Network'],
dtype=object)
What attributes do the products have?
list_k = []
for sku, product in dfp.items():
    for key in product["attributes"]:
        list_k.append(key)
unique_keys = set(list_k)
for key in unique_keys:
    print(key)
locationType
enhancedNetworkingSupported
intelAvx2Available
fromLocationType
toLocationType
storage
gpu
group
instanceCapacity12xlarge
tenancy
instanceCapacity16xlarge
instancesku
ecu
instanceFamily
instanceCapacityMetal
capacitystatus
dedicatedEbsThroughput
preInstalledSw
instanceCapacity18xlarge
servicecode
memory
normalizationSizeFactor
instanceCapacity8xlarge
operatingSystem
vcpu
instanceCapacity24xlarge
location
physicalCores
transferType
usagetype
operation
instanceCapacityXlarge
processorArchitecture
servicename
networkPerformance
licenseModel
instanceCapacity2xlarge
instanceCapacity4xlarge
storageMedia
intelAvxAvailable
instanceType
instanceCapacity9xlarge
toLocation
processorFeatures
currentGeneration
groupDescription
instanceCapacityLarge
intelTurboAvailable
physicalProcessor
instanceCapacity32xlarge
clockSpeed
fromLocation
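The key scan above can be condensed into a single set union; a small sketch on toy records:

```python
# Toy records shaped like entries of the "products" column.
toy_products = [
    {"attributes": {"vcpu": "4", "memory": "16 GiB"}},
    {"attributes": {"vcpu": "8", "location": "EU (Paris)"}},
]

# Union of every product's attribute keys in one expression.
unique_keys_toy = set().union(*(p["attributes"].keys() for p in toy_products))
```

Applied to `dfp`, this reproduces the `unique_keys` set without the explicit nested loops.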
The attributes field holds a dictionary that may contain the following keys:
- instanceFamily
- memory
- networkPerformance
- ecu
- locationType
- enhancedNetworkingSupported
- intelAvx2Available
- intelTurboAvailable
- location
- normalizationSizeFactor
- physicalCores
- tenancy
- instancesku
- licenseModel
- physicalProcessor
- preInstalledSw
- usagetype
- processorFeatures
- operatingSystem
- instanceType
- capacitystatus
- processorArchitecture
- dedicatedEbsThroughput
- servicename
- servicecode
- vcpu
- intelAvxAvailable
- instanceCapacityMetal
- storage
- clockSpeed
- currentGeneration
- gpu
- operation
Attributes
This function finds the unique values of a particular attribute. It lets us explore the dataset we are dealing with, and will help us find correlations later in this project.
def make_attributes_dictionary(dfp):
    new_dict = {}
    for i in range(len(dfp)):
        # use positional .iloc: the sampled Series keeps its original
        # index labels, so integer label lookup would raise a KeyError
        new_dict.update(dfp.iloc[i]["attributes"])
    return new_dict


def attributes(attribute):
    list_v = []
    for sku, product in dfp.items():
        for k, v in product["attributes"].items():
            if k == attribute:
                list_v.append(v)
    print(attribute, ":")
    print(set(list_v))
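As an alternative to scanning the raw dictionaries, the same per-attribute exploration is a one-liner on the normalized frame via value_counts. A toy sketch (the column name mirrors the dotted attributes.vcpu layout):

```python
import pandas as pd

# Toy stand-in for one column of the normalized frame.
flat = pd.DataFrame({"attributes.vcpu": ["4", "8", "4", None]})

# Unique values and their frequencies for one attribute,
# mirroring what attributes() prints from the raw dicts.
counts = flat["attributes.vcpu"].value_counts(dropna=False)
```

On the real data, `dfp_normalize["attributes.vcpu"].value_counts()` yields the same information as `attributes("vcpu")`, plus frequencies.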
Finding the various (unique) locations and location types where these services are used:
pd.Series(dfp_normalize["attributes.location"]).unique()
array(['Asia Pacific (Seoul)', 'EU (Frankfurt)', 'Canada (Central)',
'South America (Sao Paulo)', 'Asia Pacific (Hong Kong)',
'Asia Pacific (Tokyo)', 'US West (Oregon)', 'EU (Ireland)',
'EU (Milan)', 'Africa (Cape Town)', 'Asia Pacific (Mumbai)',
'Asia Pacific (Osaka-Local)', 'Asia Pacific (Singapore)',
'US West (N. California)', 'Asia Pacific (Sydney)',
'AWS GovCloud (US-East)', 'US East (Ohio)',
'AWS GovCloud (US-West)', 'US East (N. Virginia)', 'EU (London)',
'EU (Stockholm)', 'Middle East (Bahrain)', 'EU (Paris)',
'US West (Los Angeles)', nan], dtype=object)
pd.Series(dfp_normalize["attributes.locationType"]).unique()
array(['AWS Region', nan], dtype=object)
Finding whether the services in use belong to the current generation:
pd.Series(dfp_normalize["attributes.currentGeneration"]).unique()
array(['Yes', 'No', nan], dtype=object)
pd.Series(dfp_normalize["attributes.instanceFamily"]).unique()
array(['Memory optimized', 'General purpose', 'Compute optimized',
'GPU instance', 'Storage optimized', nan,
'Machine Learning ASIC Instances'], dtype=object)
Finding the vcpu options available:
pd.Series(dfp_normalize["attributes.vcpu"]).unique()
array(['64', '32', '2', '8', '72', '96', '4', '16', '48', '12', '128',
'1', '36', '24', '448', nan, '40'], dtype=object)
Finding the physical processors offered:
pd.Series(dfp_normalize["attributes.physicalProcessor"]).unique()
array(['High Frequency Intel Xeon E7-8880 v3 (Haswell)', 'AMD EPYC 7571',
'Intel Xeon E5-2676 v3 (Haswell)',
'Intel Xeon Platinum 8175 (Skylake)', 'Intel Xeon Platinum 8124M',
'Intel Xeon Platinum 8259 (Cascade Lake)', 'Intel Xeon Family',
'Intel Skylake E5 2686 v5', 'Intel Xeon E5-2686 v4 (Broadwell)',
'Intel Xeon Platinum 8275L', 'AMD EPYC 7R32',
'AWS Graviton2 Processor', 'Intel Xeon E5-2680 v2 (Ivy Bridge)',
'Intel Xeon E5-2670 v2 (Ivy Bridge/Sandy Bridge)',
'AWS Graviton Processor', 'Intel Xeon E5-2670 v2 (Ivy Bridge)',
'Intel Xeon E5-2666 v3 (Haswell)', 'Intel Xeon Platinum 8151',
'Intel Xeon Scalable (Skylake)', nan,
'Intel Xeon E5-2670 (Sandy Bridge)', 'Intel Xeon E5-2650',
'Intel Xeon Platinum 8275CL (Cascade Lake)'], dtype=object)
pd.Series(dfp_normalize["attributes.clockSpeed"]).unique()
array(['2.3 GHz', '2.5 GHz', '2.4 GHz', '3.1 GHz', '3 GHz', '2.8 GHz',
nan, '2.9 GHz', '4 GHz', 'Up to 3.3 GHz', '2.6 GHz', '2.0 GHz'],
dtype=object)
pd.Series(dfp_normalize["attributes.memory"]).unique()
array(['1952 GiB', '256 GiB', '8 GiB', '64 GiB', '144 GiB', '384 GiB',
'512 GiB', '768 GiB', '16 GiB', '244 GiB', 'NA', '192 GiB',
'61 GiB', '32 GiB', '96 GiB', '42 GiB', '128 GiB', '976 GiB',
'30 GiB', '30.5 GiB', '3904 GiB', '3.75 GiB', '15 GiB', '1 GiB',
'15.25 GiB', '122 GiB', '60 GiB', '4 GiB', '2 GiB', '488 GiB',
'0.5 GiB', '21 GiB', '72 GiB', '9216 GiB', nan, '7.5 GiB',
'5.25 GiB', '17.1 GiB', '160 GiB', '10.5 GiB', '12288 GiB',
'1.7 GiB', '732 GiB', '68.4 GiB', '117 GiB', '48 GiB'],
dtype=object)
pd.Series(dfp_normalize["attributes.operation"]).unique()
array(['RunInstances:0006', 'RunInstances:0002', 'RunInstances:0004',
'RunInstances:0200', 'RunInstances:0100', 'RunInstances:000g',
'RunInstances:0010', 'RunInstances:0202', 'RunInstances',
'RunInstances:0102', 'Hourly', 'RunInstances:0800', 'NatGateway',
'', 'RunInstances:FFP-CM42-1', 'RunInstances:FFP-C42-1',
'RunInstances:FFP-CMR33T22-1', 'RunInstances:FFP-M42T22-1',
'RunInstances:FFP-CMR42T22-1', 'RunInstances:FFP-CM33T22-1',
'LoadBalancing:Network'], dtype=object)
pd.Series(dfp_normalize["attributes.ecu"]).unique()
array(['179', 'NA', '6.5', '37', '281', '347', '16', 'Variable', '128',
'23', '73', '26', '188', '256', '337', '168', '345', '20', '70',
'208', '91', '55', '340', '3', '10', '13', '8', '174.5', '58',
'28', '132', '64', '62', '97', '201', '31', '39', '139', nan, '19',
'47', '349', '375', '27', '52', '12', '235', '53', '271', '104',
'53.5', '14', '116', '56', '124.5', '4', '45', '7', '1', '5', '2',
'108', '35'], dtype=object)
numerical_features = [
    "vcpu",
    "gpu",
    "ecu",
    "normalizationSizeFactor",
    "instanceCapacity2xlarge",
    "instanceCapacity4xlarge",
    "instanceCapacityXlarge",
    "instanceCapacityLarge",
]
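These numerical attributes are stored as strings ('64', 'NA', 'Variable'), so they need coercion before any arithmetic. A hedged sketch using pd.to_numeric with errors="coerce" (the `coerce_numeric` helper is an assumption, not part of the notebook):

```python
import pandas as pd


def coerce_numeric(frame, columns):
    """Convert string-typed attribute columns to floats; non-numeric
    tokens such as 'NA' or 'Variable' become NaN."""
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Toy frame with the token types seen in the real attributes.
toy = pd.DataFrame({
    "vcpu": ["64", "2", "NA"],
    "ecu": ["6.5", "Variable", "37"],
})
clean = coerce_numeric(toy, ["vcpu", "ecu"])
```

After coercion the columns support sorting, plotting, and aggregation without the string-sort pitfalls handled manually in `bar_chart` below.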
The function common_attribute_count
counts how many products share a given value for a given attribute. For example, if the user wants to see how many services are based in Tokyo, they can specify the key as 'location' and the value as 'Asia Pacific (Tokyo)'.
def common_attribute_count(attribute_key, attribute_value):
    counter = 0
    for sku, product in dfp.items():
        # count products whose attribute matches the requested value
        if product["attributes"].get(attribute_key) == attribute_value:
            counter += 1
    return counter
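On the normalized frame, the same count is a boolean-mask sum; a toy sketch:

```python
import pandas as pd

# Toy stand-in for the normalized location column.
flat = pd.DataFrame({"attributes.location": [
    "Asia Pacific (Tokyo)", "EU (Paris)", "Asia Pacific (Tokyo)"]})

# Equivalent of common_attribute_count("location", "Asia Pacific (Tokyo)")
count = int((flat["attributes.location"] == "Asia Pacific (Tokyo)").sum())
```

On the real data this would be `(dfp_normalize["attributes.location"] == value).sum()`, avoiding the explicit dictionary loop.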
Plots
For all the attributes, it is useful to plot the values present in the data and their quantities.
- For each attribute, we can see the unique values offered and in what proportions.
def attributes_plots(attribute):
    list_v = []
    for sku, product in dfp.items():
        for k, v in product["attributes"].items():
            if k == attribute:
                list_v.append(v)
    unique_values = set(list_v)
    return unique_values


def bar_chart(attribute_key):
    a = attributes_plots(attribute_key)
    if attribute_key in numerical_features:
        # drop non-numeric tokens so the numeric sort below succeeds
        a.discard("NA")
        a.discard("Variable")
        print(a)
        a = sorted(a, key=float)
    else:
        a = sorted(a)
    plt.subplots(figsize=(20, 10))
    for i in a:
        y = common_attribute_count(attribute_key, i)
        # relabel empty strings so they still get a readable tick
        label = i if i != "" else "NA"
        plt.bar(label, y)
    plt.xticks(rotation=90)
    plt.title(attribute_key)
    plt.show()
for i in unique_keys:
bar_chart(i)
{'4', '16', '8', '1', '2'}
{'139', '53', '45', '168', '58', '124.5', '64', '345', '132', '104', '1', '26', '55', '2', '347', '23', '337', '3', '8', '28', '174.5', '19', '7', '27', '73', '5', '108', '256', '116', '281', '31', '52', '12', '128', '208', '39', '14', '91', '201', '4', '47', '35', '349', '375', '188', '56', '179', '62', '16', '271', '97', '235', '53.5', '13', '37', '6.5', '20', '340', '70', '10'}
{'4', '128', '16', '144', '8', '96', '48', '0.25', '64', '24', '80', '2', '0.5', '1', '256', '192', '72', '32'}
{'4', '16', '8', '128', '96', '48', '40', '64', '24', '36', '1', '448', '2', '12', '72', '32'}
{'8', '24', '18'}
{'4', '8', '12'}
{'4', '2', '6'}
{'16', '48', '36'}
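The per-attribute bar charts can also come straight from value_counts, which counts and plots in one pass. A toy sketch (Agg backend so it runs headless; the column is a stand-in for dfp_normalize):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless execution
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for one normalized attribute column.
flat = pd.DataFrame({"attributes.instanceFamily": [
    "General purpose", "Memory optimized", "General purpose"]})

counts = flat["attributes.instanceFamily"].value_counts()
ax = counts.plot(kind="bar", figsize=(20, 10), rot=90,
                 title="attributes.instanceFamily")
plt.tight_layout()
```

This avoids one `common_attribute_count` scan per bar, which matters at the full 244k-row scale.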
attribute_key = "location"
attribute_value = "Asia Pacific (Tokyo)"
common_attribute_count(attribute_key, attribute_value)
139
attribute_key = "instanceFamily"
attribute_value = "General purpose"
common_attribute_count(attribute_key, attribute_value)
726
The function common_attribute_finder
lists all products that share a given value for a given attribute. For example, if the user wants to see how many services have 32 GiB of memory, they can specify the key as 'memory' and the value as '32 GiB'.
This gives us a list of services with similar attributes, so for our use case we can easily look up more services and details about them using this feature.
After finding all the products that offer the selected feature, we store them in a file named after the attribute key and value. For example, if the user selects the key 'memory' and the value '32 GiB', the file name will be 'memory_32 GiB.txt'.
def common_attribute_finder(attribute_key, attribute_value):
    filename = attribute_key + "_" + attribute_value + ".txt"
    # the with block closes the file; no explicit close is needed
    with open(filename, "w") as f:
        for sku, product in dfp.items():
            product_attributes = product["attributes"]
            if product_attributes.get(attribute_key) == attribute_value:
                # one matching product's full attribute dict per line
                print({attribute_key: [product_attributes]}, file=f)
def find_second_common_attribute(attribute_key, attribute_value, attribute_two):
counter = 0
filename = attribute_key + "_" + attribute_value + ".txt"
with open(filename, "r") as file:
d = file.read()
d = d.split("\n")
for i in d:
if attribute_two in i:
# print(i)
counter = counter + 1
print("Total values that have the second attribute common : ", counter)
return counter
attribute_key = "memory"
attribute_value = "32 GiB"
common_attribute_finder(attribute_key, attribute_value)
common_attribute_count(attribute_key, attribute_value)
find_second_common_attribute(attribute_key, attribute_value, "EBS")
Total values that have the second attribute common : 124
124
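The file round-trip can equivalently be expressed as a single boolean mask on the normalized frame; a toy sketch:

```python
import pandas as pd

# Toy stand-in for the normalized memory and storage columns.
flat = pd.DataFrame({
    "attributes.memory": ["32 GiB", "32 GiB", "16 GiB"],
    "attributes.storage": ["EBS only", "2 x 300 NVMe SSD", "EBS only"],
})

# Products with 32 GiB memory whose storage mentions EBS.
mask = (flat["attributes.memory"] == "32 GiB") & \
       flat["attributes.storage"].str.contains("EBS")
both = int(mask.sum())
```

On the real data this keeps everything in memory and skips the intermediate text file entirely.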
Conclusion :
- In this notebook, we imported the data from aws-pricing and performed an initial analysis on it:
  - Import the dataset locally.
  - Data preprocessing.
  - Data analysis.
- After this analysis, we observe that "attributes" holds the most information.
- We also plan a detailed analysis of the attribute "instanceType"; please follow the in-depth analysis notebook for that.