Interacting with Parquet on S3 with PyArrow and s3fs
Prerequisites
Create the hidden folder that will contain the AWS credentials (the -p flag avoids an error if it already exists):
In [1]:
!mkdir -p ~/.aws
Write the credentials to the credentials file:
In [2]:
%%file ~/.aws/credentials
[default]
aws_access_key_id=AKIAJAAAAAAAAAJ4ZMIQ
aws_secret_access_key=fVAAAAAAAALuLBvYQZ/5G+zxSe7wwJy+AAA
Alternatively, we can take the key and secret from other locations, such as environment variables, and pass them directly to the S3 filesystem when we create it.
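For example, a minimal sketch that pulls the credentials from the standard AWS environment variables instead of ~/.aws/credentials (boto should also pick these up automatically if no key is passed):
import os
from s3fs import S3FileSystem

# Hypothetical alternative: take the key and secret from the environment
# instead of the ~/.aws/credentials file.
s3 = S3FileSystem(key=os.environ['AWS_ACCESS_KEY_ID'],
                  secret=os.environ['AWS_SECRET_ACCESS_KEY'])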
Write to Parquet on S3
Create the inputdata.csv file:
In [3]:
%%file inputdata.csv
name,description,color,occupation,picture
Luigi,This is Luigi,green,plumber,https://upload.wikimedia.org/wikipedia/en/f/f1/LuigiNSMBW.png
Mario,This is Mario,red,plumber,https://upload.wikimedia.org/wikipedia/en/9/99/MarioSMBW.png
Peach,My name is Peach,pink,princess,https://s-media-cache-ak0.pinimg.com/originals/d2/4d/77/d24d77cfbba789256c9c1afa1f69b385.png
Toad,I like funghi,red,,https://upload.wikimedia.org/wikipedia/en/d/d1/Toad_3D_Land.png
Read the data into a dataframe with Pandas:
In [4]:
import pandas as pd
dataframe = pd.read_csv('inputdata.csv')
dataframe
Out[4]:
Convert to a PyArrow table:
In [5]:
import pyarrow as pa
table = pa.Table.from_pandas(dataframe)
table
Out[5]:
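To double-check what PyArrow inferred from the dataframe, we can also inspect the table's schema; all five columns should come out as strings, with a null for Toad's missing occupation:
table.schema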
Create the output path for S3:
In [6]:
BUCKET_NAME = 'my-game-bucket-for-demo'
CONTAINER_NAME = 'nintendo-container'
TABLE_NAME = 'character-table'
output_file = f"s3://{BUCKET_NAME}/{CONTAINER_NAME}/{TABLE_NAME}.parquet"
output_file
Out[6]:
Set up the connection to S3:
In [7]:
from s3fs import S3FileSystem
s3 = S3FileSystem()  # or S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3
Out[7]:
Create the bucket if it does not exist yet:
In [8]:
BUCKET_EXISTS = False
try:
    s3.ls(BUCKET_NAME)
    BUCKET_EXISTS = True
except FileNotFoundError:  # raised by s3fs when the bucket does not exist
    print("Create bucket first!")
In [9]:
if not BUCKET_EXISTS:
    s3.mkdir(BUCKET_NAME)
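As an aside, s3fs also provides an exists() helper through its filesystem interface, which makes the check a one-liner; a minimal sketch, equivalent to the try/except above on recent s3fs versions:
# Create the bucket only when it is not there yet.
if not s3.exists(BUCKET_NAME):
    s3.mkdir(BUCKET_NAME)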
Write the table to the S3 output:
In [10]:
import pyarrow.parquet as pq
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    filesystem=s3)
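write_to_dataset can also partition the output by column values via its partition_cols parameter; a minimal sketch, partitioning on the color column (the partitioned layout is an illustration, not part of the example above):
# Hypothetical variant: writes one sub-directory per color value,
# e.g. .../color=red/ and .../color=green/, under the same root path.
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['color'],
                    filesystem=s3)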
Check the files:
In [11]:
s3.ls(BUCKET_NAME)
Out[11]:
In [12]:
s3.ls(f"{BUCKET_NAME}/{CONTAINER_NAME}")
Out[12]:
Read the data from the Parquet file
In [13]:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset(output_file, filesystem=s3)
df = dataset.read_pandas().to_pandas()
df
Out[13]:
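Because Parquet is a columnar format, we can also read back only the columns we need instead of the full table; a minimal sketch, selecting name and color:
# Reads just the requested columns from the dataset on S3.
df_subset = dataset.read(columns=['name', 'color']).to_pandas()
df_subset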