Wednesday, August 31, 2016

Feature Engineering - Get Lag Features, Rolling means in SQL and R

When preparing the training dataset for time series data, it is very often that we need to create new features. In this post, I talk about how to generate lag features, rolling means (standard deviations) within a preceding time window, and rolling aggregations within a following time window.

I have prepared a sample dataset, which can be downloaded at this link .

Sample input data:

Expected output data:




SQL code by running on Azure SQL server:


with t1 as (select  PID as ItemID, cast (VisitYear as int) as Year,  cast (VisitMonth as int) as Month, cast (Cost as float) as Value from PatientCostSample),
 t2 as(
select ItemID, Year, Month, Value,
avg(Value) over(partition by ItemID order by ItemID, Year, Month rows 4 preceding) as RollingMeanPre5M,
 LAG(Value, 1) OVER (partition by ItemID order by ItemID,Year, Month) AS Value_m1M,
 LAG(Value, 2) OVER (partition by ItemID order by ItemID,Year, Month) AS Value_m2M,
 LAG(Value, 3) OVER (partition by ItemID order by ItemID,Year, Month) AS Value_m3M,
max(Value) over(partition by ItemID order by ItemID, Year, Month ROWS 3 preceding) as MaxValuePre4M,
sum(Value) over(partition by ItemID order by ItemID, Year, Month ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) as Next3MAggValue    
       from t1 )

select *  from t2 order by ItemID, Year, Month;



R code by using dplyr and zoo packages (output as shown below)


Below is the R scripts to generate this result. I am attaching its image due to formatting issue with the Blog post.



Another Blog post is also worth reading.