Pair Correlation Functions On Dataframes

I want improve the criteria for identifying workers of a hotel vs guests. I was thinking about identifying devices that were seen in the hotel for 3 consecutive days, more than 3 times a month. Is it possible to do that efficiently and stay inside of pandas? I was thinking resample/groupby ‘1D’ and ‘device_id’, but I need an efficient way to calculate consecutive days

https://stackoverflow.com/questions/52901387/find-group-of-consecutive-dates-in-pandas-dataframe

  • hours instead of days
  • days instead of weeks
    import pandas
df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s')
df.index = df.index.rename('independent')
df.columns = df.columns.rename('dependent')
df
dependent A B C D
independent
2000-01-01 00:00:00 -1.864051 0.194508 -0.158947 0.025978
2000-01-01 01:00:00 -0.924803 -0.663514 0.434736 0.697357
2000-01-01 02:00:00 -0.270664 -1.604472 -0.249273 0.812611
2000-01-01 03:00:00 -0.549321 -0.701722 -0.958898 -0.730263
2000-01-01 04:00:00 0.251478 -0.073544 -0.615413 0.167477
... ... ... ... ...
2000-01-04 23:00:00 -0.329630 -1.098350 -2.224822 -0.371213
2000-01-05 00:00:00 0.002818 0.526695 0.434844 -0.449561
2000-01-05 01:00:00 -0.250249 0.609006 0.321713 -0.161337
2000-01-05 02:00:00 -0.745092 -1.161005 -0.601730 -1.112144
2000-01-05 03:00:00 -0.482342 1.703145 -0.876495 0.132908

100 rows × 4 columns

df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s').stack()
df.index.names = 'time device'.split()
threshold = .6
guests = (df > threshold).astype(int)
__import__('IPython').display.Markdown(F"* One hotel with 4 guest {df.index.get_level_values('device').unique()}")
  • One hotel with 4 guest Index([‘A’, ‘B’, ‘C’, ‘D’], dtype=’object’, name=’device’)
df = pandas.util.testing.makeTimeDataFrame(100, freq='3600s')
df.columns = df.columns.map(str.lower)
df = df.stack()
df.index.names = 'time device'.split()
threshold = -.1
employees = (df > threshold).astype(int)
__import__('IPython').display.Markdown(F"* One hotel with 4 employees {df.index.get_level_values('device').unique()}")
  • One hotel with 4 employees Index([‘a’, ‘b’, ‘c’, ‘d’], dtype=’object’, name=’device’)

Combine the guests and employees into a wide dataframe with hours on the the columns. Values of 1 indicate a measurement.

s = pandas.concat([guests, employees])
df = s[s.astype(bool)]
df = df.groupby([pandas.Grouper(level='time', freq='1H'), pandas.Grouper(level='device')]).sum().unstack('time', 0)
df
time 2000-01-01 00:00:00 2000-01-01 01:00:00 2000-01-01 02:00:00 2000-01-01 03:00:00 2000-01-01 04:00:00 2000-01-01 05:00:00 2000-01-01 06:00:00 2000-01-01 07:00:00 2000-01-01 08:00:00 2000-01-01 09:00:00 ... 2000-01-04 18:00:00 2000-01-04 19:00:00 2000-01-04 20:00:00 2000-01-04 21:00:00 2000-01-04 22:00:00 2000-01-04 23:00:00 2000-01-05 00:00:00 2000-01-05 01:00:00 2000-01-05 02:00:00 2000-01-05 03:00:00
device
A 1 0 0 0 0 0 1 0 1 1 ... 1 1 1 0 0 1 0 0 0 1
B 1 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 1 0 0 0 0 0
C 1 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 1 0 1 0
D 0 0 0 0 0 0 1 0 0 0 ... 0 1 1 0 1 0 0 0 0 1
a 1 1 0 0 1 0 1 1 1 0 ... 1 1 0 1 1 0 1 0 1 1
b 1 0 1 0 0 0 0 1 0 0 ... 0 0 1 0 0 1 1 1 1 0
c 0 0 1 0 0 1 1 1 1 1 ... 0 0 1 0 1 1 1 0 1 1
d 1 0 0 1 1 1 1 1 1 0 ... 1 0 0 0 1 1 0 0 0 0

8 rows × 100 columns

Hand roll our own pair correlation function using the Wiener–Khinchin_theorem.

f = lambda df: pandas.DataFrame(abs(pandas.np.fft.ifftn(abs(pandas.np.fft.fftn(pandas.concat([df, df*0], axis=1), axes=(1,))**2), axes=(1,))), index=df.index)

Apply the pair correlation function to a day of data.

g = lambda df: f(df.iloc[:, :24]).loc[:, 3].gt(3)
df
time 2000-01-01 00:00:00 2000-01-01 01:00:00 2000-01-01 02:00:00 2000-01-01 03:00:00 2000-01-01 04:00:00 2000-01-01 05:00:00 2000-01-01 06:00:00 2000-01-01 07:00:00 2000-01-01 08:00:00 2000-01-01 09:00:00 ... 2000-01-04 18:00:00 2000-01-04 19:00:00 2000-01-04 20:00:00 2000-01-04 21:00:00 2000-01-04 22:00:00 2000-01-04 23:00:00 2000-01-05 00:00:00 2000-01-05 01:00:00 2000-01-05 02:00:00 2000-01-05 03:00:00
device
A 1 0 0 0 0 0 1 0 1 1 ... 1 1 1 0 0 1 0 0 0 1
B 1 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 1 0 0 0 0 0
C 1 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 1 0 1 0
D 0 0 0 0 0 0 1 0 0 0 ... 0 1 1 0 1 0 0 0 0 1
a 1 1 0 0 1 0 1 1 1 0 ... 1 1 0 1 1 0 1 0 1 1
b 1 0 1 0 0 0 0 1 0 0 ... 0 0 1 0 0 1 1 1 1 0
c 0 0 1 0 0 1 1 1 1 1 ... 0 0 1 0 1 1 1 0 1 1
d 1 0 0 1 1 1 1 1 1 0 ... 1 0 0 0 1 1 0 0 0 0

8 rows × 100 columns

g(df)
device
A    False
B    False
C    False
D    False
a     True
b    False
c     True
d     True
Name: 3, dtype: bool
df.groupby(pandas.Grouper(level='time', freq='24H', axis=1)).agg(g)
g(df)
device
A    False
B    False
C    False
D    False
a     True
b     True
c    False
d     True
Name: 3, dtype: bool

Written on October 28, 2019