Pair Correlation Functions On Dataframes
I want to improve the criteria for identifying a hotel's workers vs. its guests. I was thinking about identifying devices that were seen in the hotel for 3 consecutive days, more than 3 times a month. Is it possible to do that efficiently while staying inside pandas? I was thinking resample/groupby '1D' and 'device_id', but I need an efficient way to calculate consecutive days.
https://stackoverflow.com/questions/52901387/find-group-of-consecutive-dates-in-pandas-dataframe
- hours instead of days
- days instead of weeks
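The trick from that answer, sketched here on made-up sighting data (the device names and dates are illustrative): days belong to the same run when the day-to-day gap within a device is exactly one day, so a cumulative sum over the gap breaks labels each consecutive run.

```python
import pandas as pd

# Hypothetical sightings: one row per (device, day) a device was seen.
sightings = pd.DataFrame({
    "device_id": ["x", "x", "x", "x", "y", "y"],
    "day": pd.to_datetime([
        "2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
        "2000-01-01", "2000-01-03",
    ]),
}).drop_duplicates()

# A new run starts whenever the gap to the previous sighting is not 1 day
# (the first sighting per device compares against NaT, which also starts a run).
sightings = sightings.sort_values(["device_id", "day"])
gap = sightings.groupby("device_id")["day"].diff().ne(pd.Timedelta("1D"))
sightings["run"] = gap.cumsum()

# Length of each consecutive-day run per device; runs of length >= 3,
# counted per month, would feed the worker criterion.
run_lengths = sightings.groupby(["device_id", "run"]).size()
```

Device `x` was seen on Jan 1–3 and again on Jan 5, so it gets one run of length 3 and one of length 1.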
import pandas
df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s')
df.index = df.index.rename('independent')
df.columns = df.columns.rename('dependent')
df
dependent | A | B | C | D |
---|---|---|---|---|
independent | ||||
2000-01-01 00:00:00 | -1.864051 | 0.194508 | -0.158947 | 0.025978 |
2000-01-01 01:00:00 | -0.924803 | -0.663514 | 0.434736 | 0.697357 |
2000-01-01 02:00:00 | -0.270664 | -1.604472 | -0.249273 | 0.812611 |
2000-01-01 03:00:00 | -0.549321 | -0.701722 | -0.958898 | -0.730263 |
2000-01-01 04:00:00 | 0.251478 | -0.073544 | -0.615413 | 0.167477 |
... | ... | ... | ... | ... |
2000-01-04 23:00:00 | -0.329630 | -1.098350 | -2.224822 | -0.371213 |
2000-01-05 00:00:00 | 0.002818 | 0.526695 | 0.434844 | -0.449561 |
2000-01-05 01:00:00 | -0.250249 | 0.609006 | 0.321713 | -0.161337 |
2000-01-05 02:00:00 | -0.745092 | -1.161005 | -0.601730 | -1.112144 |
2000-01-05 03:00:00 | -0.482342 | 1.703145 | -0.876495 | 0.132908 |
100 rows × 4 columns
df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s').stack()
df.index.names = 'time device'.split()
threshold = .6
guests = (df > threshold).astype(int)
__import__('IPython').display.Markdown(F"* One hotel with 4 guests {df.index.get_level_values('device').unique()}")
- One hotel with 4 guests Index(['A', 'B', 'C', 'D'], dtype='object', name='device')
df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s')
df.columns = df.columns.map(str.lower)
df = df.stack()
df.index.names = 'time device'.split()
threshold = -.1
employees = (df > threshold).astype(int)
__import__('IPython').display.Markdown(F"* One hotel with 4 employees {df.index.get_level_values('device').unique()}")
- One hotel with 4 employees Index(['a', 'b', 'c', 'd'], dtype='object', name='device')
Combine the guests and employees into a wide dataframe with hours on the columns. Values of 1 indicate a measurement.
s = pandas.concat([guests, employees])
df = s[s.astype(bool)]
df = df.groupby([pandas.Grouper(level='time', freq='1H'), pandas.Grouper(level='device')]).sum().unstack('time', 0)
df
time | 2000-01-01 00:00:00 | 2000-01-01 01:00:00 | 2000-01-01 02:00:00 | 2000-01-01 03:00:00 | 2000-01-01 04:00:00 | 2000-01-01 05:00:00 | 2000-01-01 06:00:00 | 2000-01-01 07:00:00 | 2000-01-01 08:00:00 | 2000-01-01 09:00:00 | ... | 2000-01-04 18:00:00 | 2000-01-04 19:00:00 | 2000-01-04 20:00:00 | 2000-01-04 21:00:00 | 2000-01-04 22:00:00 | 2000-01-04 23:00:00 | 2000-01-05 00:00:00 | 2000-01-05 01:00:00 | 2000-01-05 02:00:00 | 2000-01-05 03:00:00 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
device | |||||||||||||||||||||
A | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
B | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
C | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
D | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
a | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
b | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
c | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
d | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
8 rows × 100 columns
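The same filter/groupby/unstack pivot on a tiny hand-built series (names and values illustrative), to make the shape transformation explicit:

```python
import pandas as pd

# Stand-in for the stacked guest/employee series: indexed by (time, device),
# 1 meaning "device was measured" in that hour.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2000-01-01", periods=3, freq="h"), list("Aa")],
    names=["time", "device"],
)
s = pd.Series([1, 0, 0, 1, 1, 1], index=idx)

# Keep only the measurements, then pivot to devices x hours,
# filling the dropped (time, device) pairs with 0.
wide = (
    s[s.astype(bool)]
    .groupby([pd.Grouper(level="time", freq="1h"), pd.Grouper(level="device")])
    .sum()
    .unstack("time", 0)
)
```

`wide` comes out as a 2×3 frame of 0/1 values, one row per device, one column per hour.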
Hand-roll our own pair correlation function using the Wiener–Khinchin theorem: the autocorrelation of a signal is the inverse Fourier transform of its power spectrum.
import numpy
# Concatenating df*0 zero-pads each row to twice its length, so the circular
# autocorrelation from the FFT matches the linear (non-wrapping) one.
f = lambda df: pandas.DataFrame(abs(numpy.fft.ifftn(abs(numpy.fft.fftn(pandas.concat([df, df*0], axis=1), axes=(1,)))**2, axes=(1,))), index=df.index)
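A standalone numpy sanity check of the theorem (not part of the pipeline): with zero-padding to twice the length, the FFT route reproduces the direct autocorrelation sum exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=24).astype(float)

# Zero-pad so the circular correlation implied by the FFT does not wrap.
padded = np.concatenate([x, np.zeros_like(x)])

# Wiener–Khinchin: autocorrelation = inverse FFT of the power spectrum.
fft_acf = np.fft.ifft(abs(np.fft.fft(padded)) ** 2).real

# Direct computation for comparison: sum over t of x[t] * x[t + lag].
direct = np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(len(x))])

assert np.allclose(fft_acf[:len(x)], direct)
```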
Apply the pair correlation function to the first day of data (the first 24 hourly columns), keep the lag-3 term, and flag devices where it exceeds 3.
g = lambda df: f(df.iloc[:, :24]).loc[:, 3].gt(3)
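What that flag measures, shown on a crafted 24-hour day (device names hypothetical): a device present every hour has many hour-t / hour-t+3 coincidences, while a sparse visitor has almost none. The helper below re-implements the padded-FFT recipe from `f` in plain numpy.

```python
import numpy as np
import pandas as pd

# One day of hourly presence: 'steady' is seen every hour, 'sparse' twice.
hours = pd.date_range("2000-01-01", periods=24, freq="h")
day = pd.DataFrame(
    [[1] * 24, [1, 0, 0, 1] + [0] * 20],
    index=pd.Index(["steady", "sparse"], name="device"),
    columns=hours,
)

def lag3_flag(df, threshold=3):
    # Zero-pad each row, take |FFT|^2, invert: lag-3 autocorrelation per device.
    padded = np.concatenate([df.to_numpy(float), np.zeros(df.shape)], axis=1)
    acf = np.fft.ifft(abs(np.fft.fft(padded, axis=1)) ** 2, axis=1).real
    return pd.Series(acf[:, 3] > threshold, index=df.index)
```

`steady` overlaps itself at lag 3 in 21 hours (well above the threshold); `sparse` only once.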
g(df)
device
A False
B False
C False
D False
a True
b False
c True
d True
Name: 3, dtype: bool
df.groupby(pandas.Grouper(level='time', freq='24H', axis=1)).agg(g)
g(df)
device
A False
B False
C False
D False
a True
b True
c False
d True
Name: 3, dtype: bool
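On recent pandas, grouping along the column axis is deprecated; a plain loop over 24-column day chunks (a sketch on made-up data, with hypothetical names) applies the same padded-FFT flag per device and per day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = pd.date_range("2000-01-01", periods=48, freq="h")
wide = pd.DataFrame(
    rng.integers(0, 2, size=(4, 48)),
    index=pd.Index(list("ABab"), name="device"),
    columns=hours,
)

def day_flags(df, lag=3, threshold=3):
    """Threshold the lag autocorrelation within each 24-hour chunk of columns."""
    out = {}
    for start in range(0, df.shape[1], 24):
        chunk = df.iloc[:, start:start + 24].to_numpy(float)
        padded = np.concatenate([chunk, np.zeros_like(chunk)], axis=1)
        acf = np.fft.ifft(abs(np.fft.fft(padded, axis=1)) ** 2, axis=1).real
        out[df.columns[start].normalize()] = acf[:, lag] > threshold
    return pd.DataFrame(out, index=df.index)
```

The result is a devices × days boolean frame, one column per calendar day, ready to be summed into a times-per-month count.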