Interacting with Parquet on S3 with PyArrow and s3fs
Prerequisites
Create the hidden folder that will contain the AWS credentials (the -p flag avoids an error if it already exists):
In [1]:
!mkdir -p ~/.aws
Write the credentials to the credentials file:
In [2]:
%%file ~/.aws/credentials
[default]
aws_access_key_id=AKIAJAAAAAAAAAJ4ZMIQ
aws_secret_access_key=fVAAAAAAAALuLBvYQZ/5G+zxSe7wwJy+AAA
Alternatively, we can take the key and secret from other locations, such as environment variables, and pass them directly to the S3 filesystem when we create it.
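For example, a minimal sketch that pulls the credentials from the standard AWS environment variables instead of ~/.aws/credentials (boto should also pick these up automatically if no key is passed):
import os
from s3fs import S3FileSystem

# Hypothetical alternative: take the key and secret from the environment
# instead of the ~/.aws/credentials file.
s3 = S3FileSystem(key=os.environ['AWS_ACCESS_KEY_ID'],
                  secret=os.environ['AWS_SECRET_ACCESS_KEY'])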
Write to Parquet on S3
Create the inputdata.csv file:
In [3]:
%%file inputdata.csv
name,description,color,occupation,picture
Luigi,This is Luigi,green,plumber,https://upload.wikimedia.org/wikipedia/en/f/f1/LuigiNSMBW.png
Mario,This is Mario,red,plumber,https://upload.wikimedia.org/wikipedia/en/9/99/MarioSMBW.png
Peach,My name is Peach,pink,princess,https://s-media-cache-ak0.pinimg.com/originals/d2/4d/77/d24d77cfbba789256c9c1afa1f69b385.png
Toad,I like funghi,red,,https://upload.wikimedia.org/wikipedia/en/d/d1/Toad_3D_Land.png
Read the data into a dataframe with Pandas:
In [4]:
import pandas as pd
dataframe = pd.read_csv('inputdata.csv')
dataframe
Out[4]:
Convert to a PyArrow table:
In [5]:
import pyarrow as pa
table = pa.Table.from_pandas(dataframe)
table
Out[5]:
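To double-check what PyArrow inferred from the dataframe, we can also inspect the table's schema; all five columns should come out as strings, with a null for Toad's missing occupation:
table.schema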
Create the output path for S3:
In [6]:
BUCKET_NAME = 'my-game-bucket-for-demo'
CONTAINER_NAME = 'nintendo-container'
TABLE_NAME = 'character-table'
output_file = f"s3://{BUCKET_NAME}/{CONTAINER_NAME}/{TABLE_NAME}.parquet"
output_file
Out[6]:
Set up the connection to S3:
In [7]:
from s3fs import S3FileSystem
s3 = S3FileSystem()  # or S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3
Out[7]:
Create the bucket if it does not exist yet:
In [8]:
BUCKET_EXISTS = False
try:
    s3.ls(BUCKET_NAME)
    BUCKET_EXISTS = True
except FileNotFoundError:  # raised by s3fs when the bucket does not exist
    print("Create bucket first!")
In [9]:
if not BUCKET_EXISTS:
    s3.mkdir(BUCKET_NAME)
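As an aside, s3fs also provides an exists() helper through its filesystem interface, which makes the check a one-liner; a minimal sketch, equivalent to the try/except above on recent s3fs versions:
# Create the bucket only when it is not there yet.
if not s3.exists(BUCKET_NAME):
    s3.mkdir(BUCKET_NAME)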
Write the table to the S3 output:
In [10]:
import pyarrow.parquet as pq
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    filesystem=s3)
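write_to_dataset can also partition the output by column values via its partition_cols parameter; a minimal sketch, partitioning on the color column (the partitioned layout is an illustration, not part of the example above):
# Hypothetical variant: writes one sub-directory per color value,
# e.g. .../color=red/ and .../color=green/, under the same root path.
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['color'],
                    filesystem=s3)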
Check the files:
In [11]:
s3.ls(BUCKET_NAME)
Out[11]:
In [12]:
s3.ls(f"{BUCKET_NAME}/{CONTAINER_NAME}")
Out[12]:
Read the data from the Parquet file
In [13]:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset(output_file, filesystem=s3)
df = dataset.read_pandas().to_pandas()
df
Out[13]:
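Because Parquet is a columnar format, we can also read back only the columns we need instead of the full table; a minimal sketch, selecting name and color:
# Reads just the requested columns from the dataset on S3.
df_subset = dataset.read(columns=['name', 'color']).to_pandas()
df_subset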