Calculate differences with sparse date/value dataframes
Goal¶
- Find the difference between two dataframes with a date and a value column
- Create a continuous time axis to eliminate gaps in the date column
- Fill the interpolated with the value of the previous date
In [1]:
import pandas as pd
Create a dataframe with two entries:
In [2]:
df_old = pd.DataFrame(
[
{
'date': '2019-01-01',
'value': 0.844
},{
'date': '2019-01-02',
'value': 0.842
}
]
)
Create a second dataframe with additional values for three more days:
In [3]:
df_new = pd.DataFrame(
[
{
'date': '2019-01-01',
'value': 0.844
},{
'date': '2019-01-02',
'value': 0.842
},{
'date': '2019-01-03',
'value': 0.84
},{
'date': '2019-01-04',
'value': 0.84
},{
'date': '2019-01-07',
'value': 0.838
}
]
)
Create a time axis between the minimum and maximum date found in the bigger dataframe:
In [4]:
time_index = pd.date_range(df_new['date'].min(), df_new['date'].max())
Set the index to the date column:
In [5]:
df_new.set_index('date', inplace=True)
Convert the index to a DatetimeIndex
:
In [6]:
df_new.index = pd.DatetimeIndex(df_new.index)
Reindex the existing date index with the continuous time_index
to interpolate missing dates:
In [7]:
df_new = df_new.reindex(time_index, method='ffill')
Convert the date index back to a column:
In [8]:
df_new = df_new.reset_index().rename(columns={'index': 'date'})
Convert the date column to the proper datatype for the smaller dataframe:
In [9]:
df_old['date'] = pd.to_datetime(df_old['date'])
Combine the two dataframes and remove all the duplicate couples to only keep the none duplicate rows:
In [10]:
df_diff = pd.concat([df_new, df_old]).drop_duplicates(keep=False)
In [11]:
df_diff
Out[11]: