None shall PAAS
This article was originally posted on the MarketInvoice blog.
Write a pandas dataframe to a single CSV file on S3.
import boto3
from io import StringIO

DESTINATION = 'my-bucket'

def _write_dataframe_to_csv_on_s3(dataframe, filename):
    """ Write a dataframe to a CSV on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep="|", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

_write_dataframe_to_csv_on_s3(my_df, 'my-folder')
Write a pandas dataframe to a single Parquet file on S3.
# Note: make sure `s3fs` is installed in order to make Pandas use S3.
# Credentials for AWS in the normal location ~/.aws/credentials
DESTINATION = 'my-bucket'

def _write_dataframe_to_parquet_on_s3(dataframe, filename):
    """ Write a dataframe to a Parquet file on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    output_file = f"s3://{DESTINATION}/{filename}/data.parquet"
    dataframe.to_parquet(output_file)

_write_dataframe_to_parquet_on_s3(my_df, 'my-folder')
A simple example of interacting with Parquet files on AWS S3.
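For completeness, reading the file back is a one-liner. This is a minimal sketch that assumes the same bucket and key layout as the write example above, with `s3fs` and a Parquet engine such as `pyarrow` installed; the path is illustrative, not taken from the original post.

import pandas as pd

# Hypothetical path matching the layout used above
input_file = "s3://my-bucket/my-folder/data.parquet"

# pandas delegates the S3 access to s3fs; pyarrow (or fastparquet) handles the Parquet format
df_roundtrip = pd.read_parquet(input_file)
print(df_roundtrip.head())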
After numerous restarts and resets of my VPS and working through over 20 different tutorials, I finally have my recipe for getting my single-node Kubernetes cluster with a proper load balancer working. Please note that the latest version of my Kubernetes experiment can be found in my Git repo. Remember that this repository mainly contains the variety of YAML files that got the system working. What worked for me might not work for anyone else.
I recently had to insert data from a pandas dataframe into an Azure SQL database using pandas.to_sql(). This was performing very poorly and seemed to take ages, but since pyODBC introduced fast_executemany it is easy to improve the performance: simply add an event listener that activates fast_executemany on the cursor. For 2300 records I did a small comparison: 8.67s and 7.22s versus 5min 57s and 5min 26s, so roughly 50 times faster for this small example dataset.
import pandas as pd
import pyodbc
from sqlalchemy import create_engine, event
from sqlalchemy.pool import StaticPool

# `config[name]` is assumed to hold the ODBC connection settings
wh_conn = pyodbc.connect(f"DRIVER={config[name]['driver']};SERVER={config[name]['server']},{config[name]['port']};DATABASE={config[name]['database']};UID={config[name]['username']};PWD={config[name]['password']}")
engine = create_engine("mssql+pyodbc://", poolclass=StaticPool, creator=lambda: wh_conn)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Activate pyODBC's fast_executemany for bulk inserts
    if executemany:
        cursor.fast_executemany = True

df.to_sql(name='Table',
          con=engine,
          schema='Schema',
          index=False,
          if_exists='replace')
The goal of this notebook is to get myself familiar with ButterCMS, a headless CMS that can easily be integrated with any platform in any language. In the past I've experimented with Joomla, Drupal and WordPress (and Blogger a long time ago), where I invested a lot of time in simply getting the CMS in place, extending it with more functionality and migrating data between servers. Since I am investigating different ways of using microservices, becoming less platform dependent and overall learning new tools, I use this notebook to show a simple example of ButterCMS in Jupyter.
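As a minimal sketch of what such an example could look like: ButterCMS exposes a REST API, so fetching blog posts from a notebook comes down to a single authenticated GET request. The endpoint and the `auth_token` parameter below follow the public ButterCMS v2 API as I understand it; the token value and the selected fields are placeholders, not taken from the notebook.

import requests
import pandas as pd

API_TOKEN = "DUMMYTOKEN"  # placeholder read token
BASE_URL = "https://api.buttercms.com/v2"

# Fetch the most recent blog posts
response = requests.get(f"{BASE_URL}/posts/",
                        params={"auth_token": API_TOKEN, "page_size": 10})
response.raise_for_status()
posts = response.json().get("data", [])

# Put a few fields into a dataframe for easy inspection in the notebook
df_posts = pd.DataFrame([{"title": p.get("title"),
                          "published": p.get("published"),
                          "slug": p.get("slug")} for p in posts])
print(df_posts)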
import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DUMMYSTRING"
SOURCE_TABLE = "DUMMYTABLE"

def set_table_service():
    """ Set the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_dataframe_from_table_storage_table(table_service, filter_query):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service,
                                                          filter_query))

def get_data_from_table_storage_table(table_service, filter_query):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(
        SOURCE_TABLE, filter=filter_query
    ):
        yield record

fq = "PartitionKey eq '12345'"
ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts,
                                            filter_query=fq)
The aim of this notebook is to create a machine learning model and transform it into an API which, when given some novel input parameters, returns the model's prediction.
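A minimal sketch of that idea, assuming scikit-learn for the model and Flask for the API layer; the dataset, endpoint name and JSON shape are illustrative choices, not taken from the notebook.

import numpy as np
from flask import Flask, request, jsonify
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a simple model; in the notebook this would be the real training step
iris = load_iris()
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [5.1, 3.5, 1.4, 0.2]}
    features = np.array(request.json["features"]).reshape(1, -1)
    prediction = int(model.predict(features)[0])
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)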
A short description of how I managed to move my blog from my VPS to GitHub Pages, with SSL served via CloudFlare.