Getting started with Great Expectations
Great Expectations
This notebook is an experiment to get to know Great Expectations better. The approach below defines the expectations through the core API rather than through configuration.
Create events with the data generator
I will reuse some code I have written before to generate events.
!pip install mimesis
The next bit defines the EventGenerator class. Each event it creates has five keys: character, world, level, lives and time.
from mimesis.random import Random
from mimesis import Datetime
import json
class EventGenerator:
    """ Defines the EventGenerator """
    MIN_LIVES = 1
    MAX_LIVES = 99
    CHARACTERS = ["Mario", "Luigi", "Peach", "Toad"]

    def __init__(self, start_date, end_date, num_events=10, output_type=None, output_file=None):
        """ Initialize the EventGenerator """
        self.datetime = Datetime()
        self.random = Random()
        self.num_events = num_events
        self.output_type = output_type
        self.output_file = output_file
        self.start_date = start_date
        self.end_date = end_date

    def _get_date_between(self, date_start, date_end):
        """ Get a date between start and end date """
        # Use the arguments passed in rather than always reading the instance attributes
        return self.random.choice(self.datetime.bulk_create_datetimes(date_start, date_end, days=1))

    def _generate_events(self):
        """ Generate the metric data """
        for _ in range(self.num_events):
            yield {
                "character": self.random.choice(self.CHARACTERS),
                "world": self.random.randint(1, 8),
                "level": self.random.randint(1, 4),
                "lives": self.random.randint(self.MIN_LIVES, self.MAX_LIVES),
                "time": str(self._get_date_between(self.start_date, self.end_date)),
            }

    def store_events(self):
        """ Write events to a JSON lines file, or return them as a list or a generator """
        if self.output_type == "jl":
            with open(self.output_file, "w") as outputfile:
                for event in self._generate_events():
                    outputfile.write(f"{json.dumps(event)}\n")
        elif self.output_type == "list":
            return list(self._generate_events())
        else:
            return self._generate_events()
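The "jl" output type writes one JSON document per line. As a minimal sketch of that format, using only the standard library, the snippet below writes a couple of hand-written sample events and reads them back. The sample events, the helper names `write_json_lines` and `read_json_lines`, and the file name `events.jl` are assumptions made for illustration; they are not part of the EventGenerator.

```python
import json
import os
import tempfile

# Hypothetical sample events with the same five keys the generator emits.
sample_events = [
    {"character": "Mario", "world": 1, "level": 2, "lives": 3, "time": "2021-01-01 10:00:00"},
    {"character": "Peach", "world": 8, "level": 4, "lives": 99, "time": "2021-01-02 11:30:00"},
]


def write_json_lines(events, path):
    """Write one JSON document per line, mirroring the "jl" branch."""
    with open(path, "w") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")


def read_json_lines(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Round trip through a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "events.jl")
    write_json_lines(sample_events, path)
    loaded_events = read_json_lines(path)

print(loaded_events == sample_events)
```

Because every line is a complete JSON document, such a file can be processed line by line without loading everything into memory.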
The next step is to create the generator before calling the event generator's main function.
import datetime
from dateutil.relativedelta import relativedelta
DATE_END = datetime.datetime.now()
DATE_START = DATE_END + relativedelta(months=-1)
params = {
    "num_events": 10,
    "start_date": DATE_START,
    "end_date": DATE_END,
}
# Create the event generator
generator = EventGenerator(**params)
Create the dataframe with Pandas.
import pandas as pd
df = pd.DataFrame(generator._generate_events())
df.head(10)
Data validation
To check the static data I will use Great Expectations with a minimal set of tests.
!pip install great_expectations
import great_expectations as ge
To run Great Expectations against your data, you first need to load the data into a GE dataframe, which is simply a Pandas dataframe wrapped with GE functionality.
gedf = ge.from_pandas(df)
gedf
The world column should have values from 1 to 8, and the character column should only contain one of the four known characters.
gedf.expect_column_values_to_be_between(column="world", min_value=1, max_value=8)
gedf.expect_column_values_to_be_in_set(column="character", value_set=["Mario", "Luigi", "Peach", "Toad"])
gedf.get_expectation_suite()
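To make clear what these two expectations test, here is a rough plain-Python sketch of the same checks. It is not how Great Expectations implements them, and the result keys and the helper names `check_between` and `check_in_set` are simplified assumptions; the real result objects carry more detail.

```python
def check_between(values, min_value, max_value):
    """Report values that fall outside the inclusive range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


def check_in_set(values, value_set):
    """Report values that are not members of value_set."""
    allowed = set(value_set)
    unexpected = [v for v in values if v not in allowed]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


# Hypothetical column values, including one bad entry in each column.
worlds = [1, 4, 8, 9]
characters = ["Mario", "Toad", "Bowser"]

world_result = check_between(worlds, min_value=1, max_value=8)
character_result = check_in_set(characters, ["Mario", "Luigi", "Peach", "Toad"])

print(world_result)
print(character_result)
```

Both checks fail here because of the single out-of-range world and the unknown character; with the generated events above, both would be expected to pass.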
Write the final expectations to a file so they can be used later in the pipeline.
import json
with open("ge_expectation_file.json", "w") as fh:
    fh.write(
        json.dumps(gedf.get_expectation_suite().to_json_dict())
    )
We can quickly check the content of the configuration file that has been created. This file can now be used when calling Great Expectations from the command line.
!cat ge_expectation_file.json | python -m json.tool
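Besides pretty-printing the file, a later pipeline step could load it and list which expectations it contains. The sketch below assumes the file holds a top-level "expectations" list whose entries have "expectation_type" and "kwargs" keys; that layout, the hand-written `sample_suite`, and the helper `summarize_suite` are assumptions for illustration, so check them against the file your Great Expectations version actually produced.

```python
import json
import os
import tempfile

# Hand-written stand-in for the generated file; the exact fields are an assumption.
sample_suite = {
    "expectation_suite_name": "default",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "world", "min_value": 1, "max_value": 8},
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "character", "value_set": ["Mario", "Luigi", "Peach", "Toad"]},
        },
    ],
}


def summarize_suite(path):
    """Return (expectation_type, column) pairs for every expectation in the file."""
    with open(path) as fh:
        suite = json.load(fh)
    return [
        (item["expectation_type"], item["kwargs"].get("column"))
        for item in suite.get("expectations", [])
    ]


# Write the stand-in suite to a temporary file and summarize it.
with tempfile.TemporaryDirectory() as tmp_dir:
    suite_path = os.path.join(tmp_dir, "ge_expectation_file.json")
    with open(suite_path, "w") as fh:
        json.dump(sample_suite, fh)
    summary = summarize_suite(suite_path)

for expectation_type, column in summary:
    print(f"{column}: {expectation_type}")
```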