ARIMAX/SARIMAX Models
1 Literature review
Sales forecasting is crucial for companies to plan their operations and make informed decisions about future sales goals. Published case studies of sales forecasting suggest that the important variables vary by industry, business type, and specific product; nevertheless, many factors are common to most sales-forecasting problems.
Broadly, two types of factors need to be considered: external and internal. External factors include population, the market environment, and the economic situation, while internal factors mainly cover the company's historical sales data, marketing and advertising strategies, and product lifecycle.
For this project, which focuses on forecasting Walmart sales, I consider several features that may help predict the sales data. First, fuel price and the local unemployment rate should be taken into account, because both are directly or indirectly tied to the economic environment. Second, the CPI is also important, since it closely tracks consumer prices and reflects the level of inflation. Finally, since many products have seasonal effects that can be related to temperature, I also include temperature as a predictor variable.
2 Variables and visualization
In this project, weekly sales is the response variable, while fuel price, temperature, CPI, and unemployment rate are the predictor variables.
First, here is a quick look at the merged and cleaned data table, ready for model fitting.
Code
# Load the three raw tables and merge them into one data frame
walmart=read.csv("./data/Walmart_sales_data.csv")
type=read.csv("./data/Walmart_stores.csv")
features=read.csv("./data/Walmart_features.csv")
df=merge(walmart,type,by="Store")
df_all=merge(df, features, by=c("Date","Store"))

# Average the weekly observations across stores, separately by store type
A_bytime=group_by(df_all[df_all$Type=='A',],Date)
A_sales=summarise(A_bytime,sales = mean(Weekly_Sales),fuel_price=mean(Fuel_Price),
                  temperature=mean(Temperature),CPI=mean(CPI),unemployment=mean(Unemployment))
B_bytime=group_by(df_all[df_all$Type=='B',],Date)
B_sales=summarise(B_bytime,sales = mean(Weekly_Sales),fuel_price=mean(Fuel_Price),
                  temperature=mean(Temperature),CPI=mean(CPI),unemployment=mean(Unemployment))
C_bytime=group_by(df_all[df_all$Type=='C',],Date)
C_sales=summarise(C_bytime,sales = mean(Weekly_Sales),fuel_price=mean(Fuel_Price),
                  temperature=mean(Temperature),CPI=mean(CPI),unemployment=mean(Unemployment))

knitr::kable(
  head(A_sales)
)
Date | sales | fuel_price | temperature | CPI | unemployment |
---|---|---|---|---|---|
2010-02-05 | 21259.34 | 2.702128 | 34.04677 | 170.5586 | 8.443993 |
2010-02-12 | 20491.12 | 2.683142 | 33.28542 | 170.4538 | 8.439380 |
2010-02-19 | 20408.21 | 2.656316 | 36.61275 | 170.4040 | 8.446582 |
2010-02-26 | 18598.27 | 2.675451 | 38.86468 | 170.7273 | 8.432503 |
2010-03-05 | 19881.15 | 2.721852 | 42.38552 | 170.8344 | 8.438802 |
2010-03-12 | 19532.97 | 2.761870 | 47.78439 | 170.8610 | 8.436977 |
Code
df=A_sales[,2:6]
temp.ts=ts(df, start=decimal_date(as.Date("2010-02-05")), frequency = 365.25/7)
autoplot(temp.ts, facets=TRUE)+
  xlab("Year")+
  ylab("Value")+
  ggtitle("Weekly sales and 4 predictor variables")
The plot shows weekly sales and the four predictor variables from 2010 to 2012. Most of the series exhibit distinct trends over the observation period.
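To complement the visual impression with numbers, the pairwise correlations among the five series can be computed. A minimal sketch, reusing the df data frame created above:
Code
# Pairwise Pearson correlations between weekly sales and the 4 predictors;
# df holds the 5 numeric columns (sales, fuel_price, temperature, CPI, unemployment)
round(cor(df), 2)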
3 Auto.arima fitting
Code
xreg <- cbind(fuel = temp.ts[, "fuel_price"],
              tem = temp.ts[, "temperature"],
              cpi = temp.ts[, "CPI"],
              unemp = temp.ts[, "unemployment"])
fit <- auto.arima(temp.ts[, "sales"], xreg = xreg)
summary(fit)
Series: temp.ts[, "sales"]
Regression with ARIMA(0,0,0)(0,1,0)[52] errors
Coefficients:
fuel tem cpi unemp
-251.0014 45.0867 20.4911 -292.1289
s.e. 223.2053 22.3984 104.5482 623.5498
sigma^2 = 621301: log likelihood = -733.92
AIC=1477.84 AICc=1478.55 BIC=1490.4
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -2.770906 614.1811 345.2601 -0.00187349 1.69265 0.5876295
ACF1
Training set 0.1379488
Code
checkresiduals(fit)
Ljung-Box test
data: Residuals from Regression with ARIMA(0,0,0)(0,1,0)[52] errors
Q* = 22.321, df = 29, p-value = 0.8067
Model df: 0. Total lags used: 29
As we can see, auto.arima() fits a regression with ARIMA(0,0,0)(0,1,0)[52] errors, i.e., a SARIMAX model. The Ljung-Box test (p = 0.81) gives no evidence against independent residuals, so the model appears to fit the data adequately.
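For reference, the same statistic can be reproduced with Box.test() from base R. A minimal sketch, with lag and fitdf chosen to match the checkresiduals() output above:
Code
# Ljung-Box test on the SARIMAX residuals; lag = 29 and fitdf = 0
# match "Model df: 0. Total lags used: 29" from checkresiduals()
Box.test(residuals(fit), lag = 29, fitdf = 0, type = "Ljung-Box")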
4 Manual fitting
First, fit the linear regression model predicting weekly sales using 4 predictor variables.
Code
fit.reg=lm(sales~., data=df)
summary(fit.reg)
Call:
lm(formula = sales ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-3563.8 -1030.1 -153.0 440.2 13763.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -114645.61 71035.82 -1.614 0.1088
fuel_price -1353.55 807.54 -1.676 0.0960 .
temperature -14.40 12.33 -1.168 0.2447
CPI 651.76 330.64 1.971 0.0507 .
unemployment 3396.72 1968.00 1.726 0.0866 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2170 on 138 degrees of freedom
Multiple R-squared: 0.04536, Adjusted R-squared: 0.01769
F-statistic: 1.639 on 4 and 138 DF, p-value: 0.1679
From the p-values, temperature is not significant, while the other three predictor variables are significant at the 0.1 level. Note also the low R-squared (about 0.045): the regressors alone explain little of the variation, so most of the structure remains in the residuals.
Let’s have a look at the residuals:
Code
res.fit=ts(residuals(fit.reg), start=decimal_date(as.Date("2010-02-05")), frequency = 365.25/7)
autoplot(res.fit)
Differencing with lag=52:
Code
res.fit %>% diff(52) %>% ggtsdisplay()
There appears to be a seasonal component in the residuals. After applying a lag-52 difference, they look reasonably stationary; a formal unit-root check is sketched below, after which a SARIMA model is fitted to the residuals.
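To back the visual check with a formal test, an augmented Dickey-Fuller test can be applied to the differenced residuals. A sketch, assuming the tseries package (not otherwise used in this report) is installed:
Code
# ADF test on the lag-52 differenced residuals;
# a small p-value rejects the unit-root (non-stationarity) null
library(tseries)
adf.test(diff(res.fit, lag = 52))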
Code
i=1
ls=matrix(rep(NA,9*30),nrow=30)   # 9 columns: the 6 orders plus AIC/BIC/AICc

# Grid search over small SARIMA orders for the regression residuals,
# with the seasonal difference D fixed at 1
for (p in 0:1){
  for(q in 0:1){
    for(P in 0:1){
      for(Q in 0:1){
        for(d in 0:1){
          model <- try(Arima(res.fit, order=c(p,d,q), seasonal=c(P,1,Q)), silent=TRUE)
          if('try-error' %in% class(model)){
            next   # skip combinations that fail to fit
          }else{
            ls[i,] = c(p,d,q,P,1,Q, model$aic, model$bic, model$aicc)
            i = i+1
          }
        }
      }
    }
  }
}

temp = as.data.frame(ls)
names(temp) = c("p","d","q","P","D","Q","AIC","BIC","AICc")
p d q P D Q AIC BIC AICc
26 1 1 1 0 1 0 1486.177 1493.676 1486.456
SARIMA(1,1,1)(0,1,0)[52] fits the residuals best, achieving the minimum AIC, BIC, and AICc.
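As a quick check, the winning row can be extracted programmatically rather than read off by eye. A minimal sketch using the temp data frame built above (temp_ok is just a local helper name):
Code
# Drop the NA rows left by skipped models, then pick the minimum-AICc candidate
temp_ok <- na.omit(temp)
temp_ok[which.min(temp_ok$AICc), ]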
Code
fit=Arima(res.fit, order=c(1,1,1), seasonal=c(0,1,0))
summary(fit)
Series: res.fit
ARIMA(1,1,1)(0,1,0)[52]
Coefficients:
ar1 ma1
0.3149 -1.0000
s.e. 0.1011 0.0389
sigma^2 = 796548: log likelihood = -740.09
AIC=1486.18 AICc=1486.46 BIC=1493.68
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -16.61097 700.1311 401.0234 -124.8012 248.4826 0.5792466
ACF1
Training set -0.02621394
Code
checkresiduals(fit)
Ljung-Box test
data: Residuals from ARIMA(1,1,1)(0,1,0)[52]
Q* = 25.465, df = 27, p-value = 0.5484
Model df: 2. Total lags used: 29
The diagnostics above also indicate that this SARIMA model fits the residuals well (Ljung-Box p = 0.55).
5 Cross-validation
In this part, I use cross-validation to compare the RMSE of the manually selected model and the model chosen by auto.arima. Both candidate error models are evaluated on the regression residuals res.fit: the initial training window contains k = 100 observations, and in each of 10 iterations I forecast the next four weeks (roughly one month).
Code
n=length(res.fit)
k=100                       # initial training window length (observations)
freq=365.25/7
st=tsp(res.fit)[1]+k/freq   # end time of the initial training window

rmse1=matrix(NA, 10, 1)
rmse2=matrix(NA, 10, 1)

for (i in 1:10){
  # expanding training window; test set is the following 4 weeks
  xtrain=window(res.fit, end=st+(i-1)*4/freq)
  xtest=window(res.fit, start=st+(i-1)*4/freq+1/freq, end=st+i*4/freq)

  # manually selected error model
  fit1 <- Arima(xtrain, order=c(1,1,1), seasonal=list(order=c(0,1,0), period=52),
                method="ML")
  fcast1 <- forecast(fit1, h=4)
  # error model chosen by auto.arima
  fit2 <- Arima(xtrain, order=c(0,0,0), seasonal=list(order=c(0,1,0), period=52),
                method="ML")
  fcast2 <- forecast(fit2, h=4)

  rmse1[i,1] <- sqrt(mean((fcast1$mean-xtest)^2))
  rmse2[i,1] <- sqrt(mean((fcast2$mean-xtest)^2))
}

plot(rmse1, type="l", col=2, xlab="iteration", ylab="RMSE")
lines(rmse2, type="l", col=3)
legend("topleft", legend=c("Manual","Auto"), col=2:3, lty=1)
According to this plot, the test-set RMSEs of the two models are very close. The manually selected model beats the auto-selected one in 6 of the 10 iterations, suggesting it performs slightly better.
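The comparison can be summarized numerically from the two RMSE vectors; a short sketch:
Code
# Mean test RMSE per model across the 10 folds, and the manual model's win count
mean(rmse1)
mean(rmse2)
sum(rmse1 < rmse2)   # 6 of 10 folds, per the discussion above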
Code
fit=Arima(temp.ts[, "sales"], order=c(1,1,1), seasonal=c(0,1,0), xreg = xreg)
summary(fit)
Series: temp.ts[, "sales"]
Regression with ARIMA(1,1,1)(0,1,0)[52] errors
Coefficients:
ar1 ma1 fuel tem cpi unemp
0.1542 -0.9999 237.1454 42.1104 54.6488 -1327.677
s.e. 0.1084 0.0379 434.8001 22.5373 123.4687 1010.681
sigma^2 = 616978: log likelihood = -726.64
AIC=1467.29 AICc=1468.65 BIC=1484.79
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 36.18888 602.0138 341.3975 0.1889309 1.673946 0.5810554 0.00182261
Therefore, the final model is a SARIMAX model: a regression on the four predictor variables (fuel price, temperature, CPI, and unemployment rate) with SARIMA(1,1,1)(0,1,0)[52] errors.
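In equation form, using the standard regression-with-ARIMA-errors notation (B is the backshift operator, and the betas are the regression coefficients reported in the summary above):

$$
\text{sales}_t = \beta_1\,\text{fuel}_t + \beta_2\,\text{temp}_t + \beta_3\,\text{CPI}_t + \beta_4\,\text{unemp}_t + \eta_t,
$$
$$
(1-\phi_1 B)(1-B)(1-B^{52})\,\eta_t = (1+\theta_1 B)\,\varepsilon_t,
$$

where $\varepsilon_t$ is white noise.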
6 Forecasting
First, use auto.arima() to fit and forecast each of the four predictor variables, then feed those forecasts into the final model to forecast the response variable, weekly sales.
Code
# Fit univariate ARIMA models for each predictor
fuel_fit<-auto.arima(df$fuel_price)
temp_fit<-auto.arima(df$temperature)
cpi_fit<-auto.arima(df$CPI)
unemp_fit<-auto.arima(df$unemployment)

# Forecast each predictor over the default horizon
fuel_f<-forecast(fuel_fit)
temp_f<-forecast(temp_fit)
cpi_f<-forecast(cpi_fit)
unemp_f<-forecast(unemp_fit)

# Assemble the future regressor matrix for the SARIMAX forecast
fxreg <- cbind(fuel = fuel_f$mean,
               tem = temp_f$mean,
               cpi = cpi_f$mean,
               unemp = unemp_f$mean)

fcast <- forecast(fit, xreg=fxreg)
autoplot(fcast) + xlab("Time") + ylab("Weekly Sales")
The forecast shows that the SARIMAX model has captured the main pattern in the data, especially its seasonality, and produces reasonable results driven by the four predictor variables. Compared with a SARIMA model built from historical sales alone, SARIMAX treats the sales data more comprehensively, since it also takes the CPI, unemployment rate, fuel price, and temperature into consideration; a quick check of this benefit is sketched below.
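To make the last point concrete, one can refit the same SARIMA specification without the regressors and compare information criteria. A minimal sketch (fit is the final SARIMAX model fitted above):
Code
# Same SARIMA(1,1,1)(0,1,0)[52] error structure, but with no exogenous regressors;
# a lower AICc for the SARIMAX fit would indicate that the regressors add value
fit_noreg <- Arima(temp.ts[, "sales"], order=c(1,1,1), seasonal=c(0,1,0))
c(SARIMAX = fit$aicc, SARIMA = fit_noreg$aicc)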