
Open Data Hub and Object Storage

The intent of this notebook is to provide examples of how data engineers and scientists can use Open Data Hub and object storage, specifically Ceph object storage, in much the same way they are accustomed to interacting with Amazon Simple Storage Service (S3). This is made possible because Ceph's object storage gateway is highly compatible with the Amazon S3 API.

Working with Boto

Boto is an integrated interface to current and future infrastructural services offered by Amazon Web Services, and among the services it provides interfaces for is Amazon S3. For lightweight analysis of data using Python tools like NumPy or pandas, it is handy to interact with data stored in object storage using pure Python. This is where Boto shines.

[1]
import sys  
[2]
import os
import boto3
import pandas as pd


# Read the object storage endpoint, credentials, and bucket name from the
# environment variables provided to the notebook.
s3_endpoint_url = os.environ['S3_ENDPOINT_URL']
s3_access_key = os.environ['AWS_ACCESS_KEY_ID']
s3_secret_key = os.environ['AWS_SECRET_ACCESS_KEY']
s3_bucket_name = os.environ['JUPYTERHUB_USER']

print(s3_endpoint_url)
print(s3_bucket_name)

# Create a boto3 S3 client pointed at the Ceph object storage gateway.
s3 = boto3.client('s3', 'us-east-1',
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key,
                  aws_secret_access_key=s3_secret_key)
https://s3.upshift.redhat.com
mcliffor
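If you want a quick sanity check that the client can actually reach the gateway, listing the buckets visible to your credentials works well. This is a minimal sketch; it assumes your access key is permitted to list buckets on this gateway.

# Quick connectivity check: print every bucket this access key can see.
for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])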

Interacting with S3

Creating a bucket, uploading an object (put), and listing the bucket.

In the cell below we will use our boto3 connection, s3, to do the following: create an S3 bucket, upload an object, and then display all of the contents of that bucket.

[3]
# Uncomment these to create the bucket and put a small test object into it:
#s3.create_bucket(Bucket=s3_bucket_name)
#s3.put_object(Bucket=s3_bucket_name, Key='object', Body='data')

# List every object currently in the bucket.
for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    print(key['Key'])
forestmnist.1.tgz
kube-metrics/operationinfo.csv/_SUCCESS
kube-metrics/operationinfo.csv/part-00000-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00001-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00002-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00003-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00004-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00005-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
new_data
new_data.csv
object
somefolder/new_data.csv
trip_report.tsv/_SUCCESS
trip_report.tsv/part-00000-3549378a-5714-4808-8ffa-a591faa64ff4-c000.csv
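One thing to keep in mind: list_objects returns at most 1000 keys per call. If your bucket grows past that, a paginator will walk the additional pages for you. The sketch below assumes the kube-metrics/ prefix shown in the listing above.

# list_objects caps each response at 1000 keys; a paginator handles the rest.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=s3_bucket_name, Prefix='kube-metrics/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])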

Exercise #1: Manage Remote Storage

Let's do something slightly more complicated and upload a small file to our new bucket.

Below we have used pandas to generate a small csv file for you. Run the cell below, then upload the resulting file to your S3 bucket, and finally display the contents of your bucket as we did above.

This resource may be helpful: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html

Objective

1) Upload a csv file to your s3 bucket using s3.upload_file()

2) List the objects currently in your bucket using s3.list_objects()

[4]
### Create a small pandas dataframe and save it locally as a .csv file

import pandas as pd

x = [1,2,3,4]
y = [4,5,6,7]

df  = pd.DataFrame([x,y])
df.to_csv('new_data.csv')
[5]
# 1. Upload a csv file to your s3 bucket using s3.upload_file()

s3.upload_file(Filename='new_data.csv', Bucket=s3_bucket_name, Key='somefolder/new_data.csv')
[6]
# 2. List the objects currently in your bucket using s3.list_objects()

for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    print(key['Key'])
forestmnist.1.tgz
kube-metrics/operationinfo.csv/_SUCCESS
kube-metrics/operationinfo.csv/part-00000-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00001-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00002-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00003-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00004-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00005-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
new_data
new_data.csv
object
somefolder/new_data.csv
trip_report.tsv/_SUCCESS
trip_report.tsv/part-00000-3549378a-5714-4808-8ffa-a591faa64ff4-c000.csv

Now let's read our data from Ceph back into our notebook!

[7]

obj = s3.get_object(Bucket=s3_bucket_name, Key='somefolder/new_data.csv')
df = pd.read_csv(obj['Body'])
df
   Unnamed: 0  0  1  2  3
0           0  1  2  3  4
1           1  4  5  6  7
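If you would rather skip the intermediate .csv file on local disk, you can also serialize the dataframe in memory and put it straight into the bucket. This is a minimal sketch; the in-memory/new_data.csv key is just an illustrative name.

# Serialize the dataframe to a CSV string and upload it without a temp file.
csv_body = df.to_csv(index=False)
s3.put_object(Bucket=s3_bucket_name, Key='in-memory/new_data.csv', Body=csv_body)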

Great, now you know how to interact with and manage your data store with simple data types.
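If you want to tidy up afterwards, the same client can pull an object back down to the local filesystem or remove it entirely. A minimal sketch, where new_data_copy.csv is just an example filename:

# Download the uploaded csv back to the local filesystem...
s3.download_file(Bucket=s3_bucket_name, Key='somefolder/new_data.csv',
                 Filename='new_data_copy.csv')

# ...and delete the object once you no longer need it.
s3.delete_object(Bucket=s3_bucket_name, Key='somefolder/new_data.csv')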