Lessons Learned Applying Deep Learning Approaches to Forecasting Complex Seasonal Behavior

Computer Science

A. T. Karl, J. Wisnowski, et al.

Andrew T. Karl, James Wisnowski, and Lambros Petropoulos examine the power of recurrent neural networks for accurately forecasting call center volumes. Focusing on complex seasonal patterns and autocorrelation, the research contrasts deep learning techniques with traditional forecasting methods and distills practical strategies for real-world applications.
Introduction

Member contact call centers experience fluctuating call volumes by day of week, time of day, holidays, business conditions, and other factors, and accurate forecasts are needed for staffing. The literature documents weekly seasonality modeled effectively by Winters’ seasonal smoothing and ARIMA, among other methods. To improve on these classic approaches, doubly stochastic linear mixed models have been used to capture additional complexities, and RNNs have been recommended for call volume and related applications. These flexible approaches can incorporate exogenous variables, but at the cost of computational and programming complexity and increased prediction variance. This paper explores practical aspects of managing that complexity, applies the models to actual call volumes from a large financial services company, and compares prediction capability against Winters smoothing and ARIMA. The contributions are twofold: (1) modifying computational aspects of the doubly stochastic approach to improve convergence and stability when modeling inter- and intra-day correlation with trend and seasonality; and (2) evaluating RNNs (Elman/simple, GRU, and LSTM) via a full-factorial designed experiment to identify configurations that minimize both expected error and variability (via the upper 95% prediction interval of testing error). A screening phase selects the most useful (R)NN, followed by a comprehensive performance study against classical approaches over more skills and validation days.

Literature Review

Prior work on call center arrivals includes Winters (1960) and Box & Jenkins (1970) for seasonal smoothing and ARIMA; practitioner-oriented references include Bisgaard & Kulahci and Montgomery et al. Doubly stochastic mixed-effects models have been effective for modeling inter- and intra-day correlations and additional complexities (Ibrahim & L’Ecuyer, 2013; Ibrahim et al., 2016). RNNs have been applied to related forecasting tasks, including call volumes in wireless networks (Bianchi et al., 2017) and ride volumes at Uber (Zhu & Laptev, 2017). Architectural variants considered for time series include the Elman/simple RNN (Elman, 1990), GRU (Cho et al., 2014), and LSTM (Hochreiter & Schmidhuber, 1997), all available via Keras (Allaire & Chollet). Prior studies suggest careful hyperparameter tuning, longer training for RNNs, and the potential benefit of differencing or including seasonal exogenous variables. This study builds on that literature by using a designed experiment to systematically compare architectures, depth, width, regularization, and the use of other model forecasts as covariates, and by contrasting RNNs with doubly stochastic mixed models and classical ARIMA/Winters on real call center data.

Methodology

Doubly stochastic mixed model: Call counts are square-root transformed, y = sqrt(calls + 0.25), to reduce skew and stabilize variance. Fixed effects include a day-of-week by period-of-day interaction (e.g., 5×32=160 levels in a five-day, 32-period setting) and a holiday indicator interacted with grouped periods (a p_group factor binning periods in threes on holidays) to reduce variance when few holiday observations exist. Random effects model inter-day correlation through a day-level random effect vector b ~ N(0, G), with G having an AR(1) structure. Residuals e ~ N(0, R) capture intra-day AR(1) correlation within a block-diagonal daily structure. The full model jointly estimates the fixed effects, G, and R in a single PROC MIXED call. To address convergence challenges common when parameters lie near boundaries, modifications include: judging convergence by the relative change in the log-likelihood rather than the gradient norm; Fisher scoring for improved stability with complex covariance structures; and ddfm=residual for efficiency when only point forecasts are needed (kenwardroger2 when intervals are required). A Poisson mixed model fit via GLIMMIX was also tested but proved computationally infeasible and offered no error-rate improvement.
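Two building blocks of this model, the variance-stabilizing transform and the AR(1) correlation structure, can be sketched briefly. The snippet below is an illustration in Python/NumPy (the paper's implementation uses SAS PROC MIXED); the function names are hypothetical:

```python
import numpy as np

def stabilize(calls):
    # Square-root transform used in the paper: y = sqrt(calls + 0.25)
    return np.sqrt(np.asarray(calls, dtype=float) + 0.25)

def ar1_corr(n, rho):
    # AR(1) correlation matrix: Corr(e_i, e_j) = rho ** |i - j|,
    # used both for the inter-day structure of G and the
    # intra-day blocks of R
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

In the model, one such AR(1) block per day sits on the diagonal of R, while a single AR(1) matrix over days parameterizes G.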

Recurrent neural networks and designed experiment: Neural networks are constructed to predict next-day call volumes using sequences of 30-minute periods within days. Architectures compared: dense (feedforward), simple/Elman RNN, GRU, and LSTM. The 42 input vectors include one-hot day-of-week and period-of-day indicators; lagged call volumes at the same period for the previous day and for the same weekday in the previous week; binary indicators for current-day, previous-day, and previous-week holidays; and a day-number trend term. Three optional inputs (treated jointly as a factor "mixed.cheat") feed the NN with contemporaneous predictions from the doubly stochastic mixed model, Winters smoothing, and a seasonal ARIMA(1,0,1)×(0,1,1)_160 model (seasonal period 160 = 5 days × 32 periods), enabling the NN to learn corrections or a dynamic ensemble.
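One such feature row might be assembled as below. This is an illustrative sketch in Python/NumPy (the paper used the R interface to Keras); the exact column layout and encodings that produce the paper's 42 inputs are assumptions, and `build_row` is a hypothetical helper:

```python
import numpy as np

def one_hot(i, size):
    # Indicator vector with a 1 in position i
    v = np.zeros(size)
    v[i] = 1.0
    return v

def build_row(dow, period, lag_day_vol, lag_week_vol,
              hol_today, hol_yest, hol_lastweek, day_number):
    # Illustrative feature row: one-hot day-of-week and period-of-day,
    # two lagged volumes, three holiday flags, and a trend term.
    return np.concatenate([
        one_hot(dow, 5),             # Mon..Fri
        one_hot(period, 32),         # 30-minute periods within the day
        [lag_day_vol, lag_week_vol], # same period yesterday / last week
        [float(hol_today), float(hol_yest), float(hol_lastweek)],
        [float(day_number)],         # trend
    ])
```

With mixed.cheat enabled, three more columns (mixed, Winters, and ARIMA forecasts for the same period) would be appended.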

Experimental factors (full factorial, 128 runs): model.type {dense, simple RNN, GRU, LSTM}; nlayers {1,2}; nnodes per layer {25,50,75,100}; kernel L2 regularization {0, 0.0001}; mixed.cheat {FALSE, TRUE}. The design is replicated over 3 large-volume skills and 5 validation days (1920 total runs), enabling assessment of both mean error and variability due to random weight initializations.
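The full-factorial structure can be enumerated directly. The sketch below uses the factor names and levels given above; the dictionary layout itself is an illustrative reconstruction, not the paper's code:

```python
from itertools import product

# Factors and levels from the designed experiment
factors = {
    "model.type": ["dense", "simple RNN", "GRU", "LSTM"],
    "nlayers": [1, 2],
    "nnodes": [25, 50, 75, 100],
    "kernel.l2": [0.0, 0.0001],
    "mixed.cheat": [False, True],
}

# 4 x 2 x 4 x 2 x 2 = 128 configurations
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
```

Replicating the 128 configurations over 3 skills and 5 validation days yields the 1,920 total runs used to separate mean error from run-to-run variability.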

Static/training settings: Dense input is a 640×42 (or ×45 with cheat) matrix based on 4 weeks × 5 days × 32 periods, excluding first week due to lagged inputs. RNN input is a 20×32×42 array (20 days × 32 periods × 42 predictors), STATEFUL=FALSE due to convergence issues and negligible benefit relative to including day/week lags. Batch shuffling is enabled for optimization stability. Optimizer: AMSGrad variant of Adam with learning-rate decay 1e-4 (selected from pilot experimentation). Kernel initializer: He normal; activation: ReLU. Epoch tuning: For each configuration, train up to 500 epochs, compute moving-average (window=10) validation WAPE across the last training week, then refit with the epoch count yielding minimum validation WAPE. Recurrent dropout was piloted but degraded performance in this short training horizon.
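The epoch-tuning rule, picking the epoch count that minimizes a window-10 moving average of validation WAPE, can be sketched as follows. This is illustrative Python/NumPy; `best_epoch` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def best_epoch(val_wape, window=10):
    # Smooth the per-epoch validation WAPE with a width-`window` moving
    # average, then return the (1-based) epoch ending the best window.
    w = np.asarray(val_wape, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(w, kernel, mode="valid")  # len(w) - window + 1
    return int(np.argmin(smoothed)) + window
```

The model is then refit from scratch for that many epochs before producing the final forecast.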

Response metric: Weighted Absolute Percentage Error (WAPE) is used, weighting period errors by call counts. Models are trained to minimize MAE (numerator of WAPE). For each configuration, models predict five one-day-ahead forecasts sequentially (each day re-trained on the preceding 5 weeks), and WAPE is recorded for each skill/day.
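WAPE as described reduces to a volume-weighted absolute error: the sum of absolute errors divided by the sum of actual calls. A minimal sketch:

```python
import numpy as np

def wape(actual, forecast):
    # Weighted absolute percentage error: sum|a - f| / sum(a),
    # which weights each period's percentage error by its call volume.
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return np.abs(a - f).sum() / a.sum()
```

For example, over- and under-forecasting two equal-volume periods by 10% each gives a WAPE of 0.10.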

Analysis: Because WAPE is right-skewed, responses are transformed to 1/WAPE for linear modeling. A joint mean–variance analysis (JMP Pro 14.1) models the mean of 1/WAPE as a function of main effects and interactions among the five NN factors, with split, file (forecast day), and split*file as blocks. A loglinear variance model captures variability in 1/WAPE as a function of NN factors and blocking variables, quantifying stochasticity arising from random initialization. Significant factors for variance include model.type, nlayers, mixed.cheat, nnodes, and blocking terms (kernel L2 not significant). For the mean, significant factors include model.type, nlayers, mixed.cheat, nnodes, and blocking (kernel L2 and NN-factor interactions not significant). The optimal configuration is selected by minimizing the upper 95% prediction interval (PI) for WAPE (equivalently maximizing the lower 95% PI of 1/WAPE).

Comprehensive performance study: Using insights from the experiment, all four NN types with 2 layers, 50 nodes/layer, L2=0.0001 are evaluated (mixed.cheat disabled) across 36 skills with rolling 5-week training windows to produce 60 one-day-ahead validations per skill. Comparators include the doubly stochastic mixed model, ARIMA, and Winters. A further analysis evaluates NN performance when enabling mixed.cheat (using forecasts from mixed, ARIMA, Winters as covariates).

Key Findings
  • Doubly stochastic mixed model vs. RNNs: Across 36 skills and 60 one-day-ahead validations, the doubly stochastic mixed model generally achieves the lowest WAPE on high- and medium-volume splits. GRU is the strongest RNN, often competitive and occasionally best, particularly on low-volume splits. LSTM exhibits instability and large errors and is not recommended. Dense NN shows consistently worse forecast error (despite lower variance) and is not recommended.
  • Volume-dependent performance: Mixed model dominates for large and medium call volumes; GRU (and sometimes simple RNN) performs slightly better on low-volume splits. The proportion of days where GRU beats the mixed model decreases approximately linearly with log(median call volume), indicating GRU’s relative advantage at small volumes.
  • Optimal NN configuration from designed experiment: Minimizing the upper 95% PI of WAPE favors a GRU with 2 layers, 50 nodes per layer, L2=0.0001, and mixed.cheat enabled. However, kernel L2 and nlayers effects are relatively flat and L2 is not significant in mean or variance models. Mean performance plots suggest GRU or simple RNN with 1 layer and ~25 nodes maximize 1/WAPE, while variance plots favor dense NN with 2 layers and 50 nodes; prediction-interval analysis clearly favors GRU with mixed.cheat and 50 nodes.
  • Benefit of using other model forecasts as covariates (mixed.cheat): Adding mixed/ARIMA/Winters predictions as inputs improves GRU performance. Across 36×60=2160 paired comparisons, median(WAPE_GRU_no_cheat − WAPE_GRU_cheat)=0.002 with p=0.0005 (Wilcoxon signed-rank), indicating a statistically significant, albeit modest, improvement. Paired differences (mixed.cheat=FALSE) confirm GRU beats LSTM (median 0.0247, p=1e-129), dense NN (median 0.0690, p<1e-185), and simple RNN (median 0.0021, p=1e-04).
  • Training/optimization insights: STATEFUL RNNs did not improve validation error and occasionally failed to converge; AMSGrad with LR decay and He initialization improved stability; RNNs require substantially more epochs than dense NNs; recurrent dropout degraded performance in this short-horizon setting.
  • Mixed model convergence/stability: Joint estimation of fixed effects and covariance components in a single PROC MIXED call was made stable by using log-likelihood change as the convergence criterion and Fisher scoring; GLIMMIX Poisson models were too slow with no error-rate benefit.
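The paired Wilcoxon signed-rank comparisons reported above can be reproduced in outline with synthetic data. The sketch below assumes SciPy is available; the WAPE values are simulated for illustration, not the paper's:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical paired daily WAPE values for two models on the same
# skill/day combinations (model B built to be slightly better)
wape_a = rng.uniform(0.05, 0.15, size=60)
wape_b = wape_a - 0.002 + rng.normal(0.0, 0.0005, size=60)

diffs = wape_a - wape_b
stat, p = wilcoxon(diffs)            # signed-rank test on paired differences
median_diff = float(np.median(diffs))
```

A small positive median difference with a tiny p-value mirrors the paper's finding that mixed.cheat yields a statistically significant but modest improvement.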
Discussion

The study addresses whether deep recurrent models can outperform classical and mixed-effects approaches for short-term call volume forecasting with complex seasonality and autocorrelation. Results show that while GRU is the most robust RNN and can edge out other methods for low-volume skills, the doubly stochastic mixed model remains superior for high- and medium-volume cases within a five-week training horizon and one-day-ahead predictions. This suggests that explicit modeling of inter- and intra-day correlation structures with parsimonious fixed effects is highly effective when sufficient signal (volume) exists, and that the added flexibility of deep RNNs can be advantageous when data are sparse, where mixed models may be less stable or less expressive.

Incorporating forecasts from classical/mixed models as covariates (mixed.cheat) provides a systematic way to combine strengths: RNNs learn nonlinear corrections conditioned on exogenous factors, improving accuracy without complex ensembles. The designed-experiment framework offers a replicable approach to tuning architecture and regularization while accounting for mean performance and variability due to random initialization, leading to choices that minimize the upper tail of error distribution rather than only mean error.

Practically, for short-term, day-ahead forecasting with limited training windows, organizations can prioritize the doubly stochastic mixed model for high-volume queues and deploy GRU (ideally augmented with mixed/ARIMA/Winters predictions) for low-volume queues. Computationally, mixed models are faster and simpler to maintain; RNNs demand more engineering and longer training, so selective deployment is warranted. Findings align with prior industry reports (e.g., Uber) that deep models may yield more benefit with longer histories or multi-horizon tasks.

Conclusion
  • Main contributions: (1) Provide stable, single-pass joint estimation for a doubly stochastic mixed model with AR(1) inter- and intra-day structures via modified convergence criteria and Fisher scoring, yielding reliable convergence and strong short-term forecasts. (2) Conduct a full-factorial designed experiment comparing dense, simple RNN, GRU, and LSTM architectures across depth, width, regularization, and the inclusion of other model forecasts as covariates, using a robust mean–variance analysis to select configurations minimizing the upper 95% PI of WAPE. (3) Comprehensive validation over 36 skills and 60 one-day-ahead forecasts shows the mixed model generally outperforms on higher volumes, while GRU is the preferred RNN and can outperform on low volumes. Including mixed/ARIMA/Winters forecasts as NN inputs significantly improves GRU.
  • Recommendations: Use the doubly stochastic mixed model for high- and medium-volume skills in short-term settings; use GRU for low-volume skills, and enhance with mixed.cheat inputs. Employ designed experiments to tune NN hyperparameters, prioritizing configurations that minimize error variability. Consider AMSGrad with LR decay, He initialization, and sufficient epochs with moving-average validation.
  • Future work: Explore longer training histories and multi-horizon forecasts where RNNs may gain more advantage; evaluate optimizers and learning-rate schedules as experimental factors; revisit stateful RNNs with longer sequences; develop multi-output architectures to jointly model many skills; incorporate uncertainty quantification (e.g., prediction intervals) for NN forecasts; investigate hybrid models that fuse mixed-model structure with RNN components.
Limitations
  • Training horizon limited to five weeks; results may differ with longer histories where RNNs often excel.
  • One-day-ahead focus; multi-step or longer-horizon forecasts were not evaluated and may change relative performance.
  • STATEFUL RNNs were disabled due to convergence issues in this setting; benefits might appear with different batching or longer sequences.
  • Optimizer choice, learning-rate schedules, dropout, and activations were not treated as experimental factors (selected via pilots), potentially missing interactions with architecture.
  • Mixed.cheat improvements, while statistically significant, were modest in median magnitude; utility may vary by skill and volume.
  • GLIMMIX Poisson models were abandoned due to computation; alternative count models (e.g., negative binomial, state space) were not explored.
  • Single organization’s data (weekdays, 30-minute bins) may limit generalizability across industries or different operational calendars.
  • Focused on point forecasts (ddfm=residual); comprehensive interval estimation for all methods was not performed.