Best practices for performing multiple window functions in Spark SQL

Keywords: python sql apache-spark

Question: 

I need help tuning a query that uses multiple window functions. With just one window, execution finishes in a few seconds, but as I add more windows, the query runs for hours.

I tried to group the features for each timeframe into a single UDF that takes all the columns and counts/sums them over one shared window, but Spark 2.3 does not allow UDAFs over windows.

SELECT *
, SUM(amount) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 300000 PRECEDING AND CURRENT ROW) sum_all_5m
, SUM(amount) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 900000 PRECEDING AND CURRENT ROW) sum_all_15m
, SUM(amount) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 3600000 PRECEDING AND CURRENT ROW) sum_all_60m
, SUM(amount) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 7200000 PRECEDING AND CURRENT ROW) sum_all_120m
, COUNT(id) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 300000 PRECEDING AND CURRENT ROW) count_all_5m
, COUNT(id) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 900000 PRECEDING AND CURRENT ROW) count_all_15m
, COUNT(id) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 3600000 PRECEDING AND CURRENT ROW) count_all_60m
, COUNT(id) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 7200000 PRECEDING AND CURRENT ROW) count_all_120m
, SUM(CASE WHEN type='1' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 300000 PRECEDING AND CURRENT ROW) count_1_5m
, SUM(CASE WHEN type='1' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 900000 PRECEDING AND CURRENT ROW) count_1_15m
, SUM(CASE WHEN type='1' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 3600000 PRECEDING AND CURRENT ROW) count_1_60m
, SUM(CASE WHEN type='1' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 7200000 PRECEDING AND CURRENT ROW) count_1_120m
, SUM(CASE WHEN type='2' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 300000 PRECEDING AND CURRENT ROW) count_2_5m
, SUM(CASE WHEN type='2' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 900000 PRECEDING AND CURRENT ROW) count_2_15m
, SUM(CASE WHEN type='2' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 3600000 PRECEDING AND CURRENT ROW) count_2_60m
, SUM(CASE WHEN type='2' THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 7200000 PRECEDING AND CURRENT ROW) count_2_120m
, SUM(CASE WHEN type='2' THEN amount ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 300000 PRECEDING AND CURRENT ROW) sum_2_5m
, SUM(CASE WHEN type='2' THEN amount ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 900000 PRECEDING AND CURRENT ROW) sum_2_15m
, SUM(CASE WHEN type='2' THEN amount ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 3600000 PRECEDING AND CURRENT ROW) sum_2_60m
, SUM(CASE WHEN type='2' THEN amount ELSE 0 END) OVER (PARTITION BY id ORDER BY dtime ASC RANGE BETWEEN 7200000 PRECEDING AND CURRENT ROW) sum_2_120m
...
FROM tableX
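For reference, the RANGE frames above are expressed in milliseconds of the dtime column (300000 ms = 5 minutes, 7200000 ms = 120 minutes). A minimal pure-Python sketch of what one such frame computes per row, using made-up rows for a single id partition (the data and function name are illustrative, not from the actual table):

```python
# Toy rows for one id partition, already sorted by dtime (epoch milliseconds).
rows = [
    (0,       10.0),   # (dtime, amount)
    (120_000,  5.0),
    (400_000,  2.5),
    (700_000,  1.0),
]

def sum_over_range(rows, window_ms):
    """For each row, sum `amount` over rows whose dtime falls in
    [dtime - window_ms, dtime], mirroring
    RANGE BETWEEN window_ms PRECEDING AND CURRENT ROW."""
    return [
        sum(amount for ti, amount in rows if t - window_ms <= ti <= t)
        for t, _ in rows
    ]

print(sum_over_range(rows, 300_000))  # 5-minute frame -> [10.0, 15.0, 7.5, 3.5]
```

This is an O(n^2) reference implementation purely to pin down the frame semantics; Spark evaluates each frame with a sliding bound over the sorted partition.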

Answers: