Transportation
Bayesian estimation of mixed multinomial logit models: Advances and simulation-based evaluations
P. Bansal, R. Krueger, et al.
The paper studies scalable Bayesian estimation for mixed multinomial logit (MMNL) models, widely used to analyze individual discrete choices. Traditional Bayesian inference via MCMC yields full posterior distributions but faces key scalability bottlenecks: long computation times, storage of many posterior samples, and the need for convergence diagnostics. Variational Bayes (VB) offers a deterministic, optimization-based alternative that approximates the posterior, promising substantial computational gains. However, prior VB work for MMNL has been limited to models with only individual-specific random parameters, without fixed (invariant) parameters, and has not comprehensively assessed finite-sample properties or compared against maximum simulated likelihood estimation (MSLE). This paper asks: (1) Can VB be extended to MMNL specifications that include both fixed and random utility parameters? (2) How do the extended VB variants compare to MCMC and MSLE in estimation time, parameter recovery, and predictive accuracy? Addressing these questions is important because fixed effects and fixed covariate interactions are common in practice, and practitioners need reliable, fast methods for large datasets.
Prior work has established VB as substantially faster than MCMC for MMNL with negligible loss in predictive accuracy (e.g., Braun and McAuliffe, 2010; Depraetere and Vandebroek, 2017; Tan, 2017). These studies considered only random-coefficient specifications and typically evaluated predictive accuracy but not finite-sample parameter recovery. Updating strategies for nonconjugate factors—quasi-Newton (QN) versus nonconjugate variational message passing (NCVMP)—have been studied separately, with no direct comparison. Comparisons to MSLE, the predominant frequentist approach for MMNL, were also lacking. Methodologically, several techniques address the intractable expectation of the log-sum-exp (E-LSE) term in the MNL kernel: analytical approximations (e.g., Delta method), simulation-based approximations (Quasi–Monte Carlo, QMC), and alternative variational lower bounds via modified Jensen’s inequality (MJI). Earlier findings suggest modified Jensen’s bounds can be tighter than classical Jensen’s bounds, but performance relative to other approximations in MMNL remains mixed. Additionally, alternative Bayesian samplers (e.g., NUTS/HMC) can improve mixing but may be computationally prohibitive for large MMNLs. This study situates itself by extending VB methods to incorporate fixed parameters and by benchmarking against MCMC and MSLE on parameter recovery and prediction.
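For context on the bound-based approaches mentioned above, the classical variational treatment of the E-LSE term applies Jensen's inequality to the concave logarithm; under a Gaussian variational factor (with V_ntj having marginal mean μ_ntj and variance σ²_ntj under q) it has the closed form below. This is the standard textbook bound, stated here for orientation; the modified Jensen's inequality studied in the paper tightens it with additional auxiliary parameters.

```latex
\mathbb{E}_q\!\left[\log \sum_{j} \exp(V_{ntj})\right]
\;\le\; \log \sum_{j} \mathbb{E}_q\!\left[\exp(V_{ntj})\right]
\;=\; \log \sum_{j} \exp\!\left(\mu_{ntj} + \tfrac{1}{2}\sigma_{ntj}^{2}\right)
```

Because the E-LSE enters the log-likelihood with a negative sign, an upper bound on it yields a tractable lower bound on the ELBO.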
Model: A fully Bayesian MMNL is specified in which utility is linear-in-parameters with fixed (α) and random (β_n) taste parameters: U_ntj = X_ntj^F α + X_ntj^R β_n + ε_ntj, with ε_ntj ~ Gumbel(0,1). Individual random coefficients follow β_n ~ N(ζ, Ω). Priors: α ~ N(λ_0, Ξ_0), ζ ~ N(μ_0, Σ_0), and Ω follows Huang's half-t prior for covariance matrices via an inverse-Wishart–Gamma construction, which provides weakly informative behavior preferable to the inverse-Wishart prior in MMNL. Posterior inference targets θ = {α, ζ, Ω, a_k, β_1:N}. The joint distribution is intractable because of the multinomial logit kernel, so the posterior is approximated as follows.

MCMC baseline: A blocked Gibbs sampler with Metropolis steps for α and β_1:N (the Allenby–Train procedure) is implemented. ζ and Ω have conjugate updates; α and β_n are updated via random-walk Metropolis with tuned step sizes. Practical considerations include vectorization and thinning; the experiments use two chains of 100k iterations each, with burn-in and thinning.

Variational Bayes: A mean-field variational family factorizes as q(θ) = q(α) q(ζ) q(Ω) Π_k q(a_k) Π_n q(β_n). Conjugate factors have optimal forms: q*(ζ) is Normal, q*(Ω) is inverse-Wishart, and q*(a_k) is Gamma. The nonconjugate factors q(α) and q(β_n) are assumed Normal, with parameters updated by maximizing the ELBO. The main difficulty is the intractable expectation of the log-sum-exp (E-LSE) term in the MNL likelihood.

E-LSE handling: Three strategies are used. (1) The Delta (Δ) method: a second-order Taylor expansion around the current means yields an analytical approximation to E[g_nt(Γ_n)]. (2) QMC simulation: E-LSE is approximated with quasi-random points transformed by the Cholesky factors of Σ_α and Σ_β. (3) An alternative lower bound via the modified Jensen's inequality (MJI), which introduces auxiliary variational parameters α_ntj to bound E-LSE.

Updating strategies for nonconjugate factors: (a) Quasi-Newton (QN) maximizes the ELBO with respect to μ and Σ of α and β_n using supplied gradients, maintaining positive-definiteness via a Cholesky parameterization. (b) NCVMP provides fixed-point updates for μ and Σ of α and β_n using expectations and gradients of the joint log-density within the ELBO; NCVMP is computationally cheaper per iteration but is not guaranteed to monotonically increase the ELBO. Variants evaluated: VB-QN-Δ, VB-QN-QMC, VB-QN-MJI, VB-NCVMP-Δ, VB-NCVMP-MJI. NCVMP-QMC is not pursued owing to numerical instability and the difficulty of ensuring positive-definite covariance updates.

Simulation study: Semi-synthetic data are generated from a real stated preference dataset (German alternative-fuel vehicle choices). Four scenarios vary by whether fixed parameters are included and by the correlation among random coefficients (low vs. high). Sample sizes N ∈ {500, 2000} and T ∈ {5, 10} choice occasions are examined, with 20 replications each. Performance metrics: parameter recovery via RMSE for α (when applicable), ζ, Ω, and β_1:N, and out-of-sample predictive accuracy via total variation distance (TVD) between the true and estimated predictive distributions. MSLE predictions are integrated over an approximate asymptotic distribution of the parameter estimates; MCMC and VB integrate over the posterior/variational distributions. Implementation uses vectorized Python code; MSLE employs MLHS draws; VB-QN-QMC uses 64 quasi-random draws; VB stopping follows Tan (2017) with a relative-change criterion.
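The Δ-method and simulation-based approximations of the E-LSE term can be illustrated with a minimal sketch. This is not the paper's implementation: the utilities, covariance, and draw counts below are hypothetical, and plain pseudo-random draws stand in for the quasi-random (QMC) points the paper uses. For the log-sum-exp function, the gradient at the mean is the softmax p and the Hessian is diag(p) − p pᵀ, which gives the second-order Taylor term in closed form.

```python
import numpy as np

def lse(v):
    """Numerically stable log-sum-exp of a utility vector."""
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def delta_else(mu, Sigma):
    """Delta-method (second-order Taylor) approximation of E[lse(V)], V ~ N(mu, Sigma)."""
    p = np.exp(mu - lse(mu))            # softmax at the mean = gradient of lse
    H = np.diag(p) - np.outer(p, p)     # Hessian of lse at the mean
    return lse(mu) + 0.5 * np.trace(H @ Sigma)

def sim_else(mu, L, z):
    """Simulation approximation of E[lse(V)] using V = mu + L z, L the Cholesky factor."""
    vs = mu + z @ L.T
    return np.mean([lse(v) for v in vs])

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.1])          # hypothetical systematic utilities
A = 0.3 * rng.normal(size=(3, 3))
Sigma = A @ A.T                          # hypothetical utility covariance
L = np.linalg.cholesky(Sigma)
z = rng.normal(size=(4096, 3))           # pseudo-random stand-in for QMC points
print(delta_else(mu, Sigma), sim_else(mu, L, z))
```

With a zero covariance both approximations collapse to lse(μ); as the covariance grows, the two estimates diverge according to the quality of the second-order expansion, which is the trade-off the paper's Δ-versus-QMC comparison probes.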
- VB methods extended to handle both fixed and random parameters in MMNL perform comparably to MCMC and MSLE in parameter recovery and predictive accuracy, except for methods relying on the modified Jensen’s inequality (MJI) lower bound.
- VB-NCVMP-Δ consistently offers the best trade-off between speed and accuracy: in the experiments, it is roughly 1.7× to 16× faster than MCMC and MSLE while achieving similar RMSEs and TVD.
- Recovery of mean random coefficients (ζ) and individual-level parameters (β_1:N) is similar across all methods. For large samples (e.g., N=2000, T=10), RMSE(ζ) typically lies around 0.018–0.029 and RMSE(β_1:N) around 0.695–0.726 across scenarios and methods.
- Recovery of the covariance matrix Ω is acceptable and similar across methods using the Δ or QMC approximations; MJI-based VB variants (VB-QN-MJI, VB-NCVMP-MJI) display noticeably worse Ω recovery (larger RMSE for Ω elements) and consequently worse predictive TVD.
- Predictive accuracy (TVD) is comparable among MSLE, MCMC, and VB variants using Δ or QMC across scenarios. MJI-based VB shows higher TVD, consistent with poorer Ω recovery.
- As N and T increase, all methods improve in RMSE and TVD, numerically supporting consistency claims for mean-field VB.
- QN updates are substantially slower than NCVMP in this application; contrary to some earlier reports, QN-based VB was not faster than the authors' efficient MCMC implementation.
- Illustrative numbers: In scenario 1 (N=2000, T=10), VB-NCVMP-Δ achieves RMSE(ζ)=0.0190, RMSE(β_1:N)=0.7198, and TVD≈0.1367 while being much faster than MCMC (RMSE(ζ)=0.0181, TVD≈0.1345). MJI-based variants in similar settings show higher RMSE(Ω) and higher TVD (e.g., VB-NCVMP-MJI TVD≈0.1494).
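The TVD figures quoted in the findings above can be made concrete with a short sketch. The paper reports TVD between true and estimated predictive choice distributions; the averaging over held-out choice situations shown here is an assumed reporting convention, not taken from the paper.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two discrete choice distributions."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def mean_tvd(P_true, P_est):
    """TVD averaged over choice situations (rows = situations, columns = alternatives);
    the averaging convention is an assumption for illustration."""
    return 0.5 * np.abs(P_true - P_est).sum(axis=1).mean()
```

For example, two identical distributions give a TVD of 0, and two distributions with disjoint support give the maximum of 1, so the reported values around 0.13–0.15 indicate moderately close predictive distributions.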
The study demonstrates that variational Bayes can be effectively extended to MMNL models with both fixed and random parameters, a specification common in practice (e.g., ASCs, socio-demographic covariates, interactions). Among VB variants, using a Delta-method approximation for the intractable E-LSE combined with NCVMP updates delivers substantial computational gains with minimal loss in accuracy relative to gold-standard MCMC and widely-used MSLE. The inferior performance of MJI-based lower bounds suggests that a tighter analytical bound is not necessarily preferable in practice for MMNL; accurate recovery of Ω is critical for predictive performance, and MJI underperforms on this dimension. Across all methods, increased data (larger N and/or T) improves estimation, reinforcing the practical viability and consistency of mean-field VB for these models. Practitioners can therefore consider VB-NCVMP-Δ as a reliable, fast estimator for MMNL, especially in large-scale applications where MCMC and MSLE may be computationally burdensome.
Contributions: (1) Extension of several VB methods to MMNL with both fixed and random utility parameters. (2) Comprehensive simulation-based benchmarking of VB against MCMC and MSLE on estimation time, parameter recovery, and predictive accuracy. Findings indicate VB-NCVMP-Δ provides fast, scalable, and accurate inference, often an order of magnitude faster than MCMC/MSLE, with negligible compromises in accuracy; methods based on MJI underperform. Future research directions: Extend VB to more flexible mixing distributions (parametric, nonparametric, semiparametric) and to willingness-to-pay space; develop VB for extended discrete choice models (e.g., integrated choice and latent variable); explore online and stochastic variational inference for large streaming datasets; investigate alternative divergences (α- and f-divergences) and structured variational families to better capture posterior dependencies; consider Markov chain variational inference to blend MCMC accuracy with VB speed; improve analytical/simulation-based approximations of the E-LSE; and broaden comparisons with frequentist approximations such as MACML.
- The VB methods and experiments focus on normal mixing distributions and preference-space utility; other distributions or WTP-space specifications are not studied here.
- VB under KL and mean-field factorization can underestimate posterior variances and miss dependencies; although performance is strong, variance calibration remains a known limitation.
- MJI-based variational lower bounds underperform in Ω recovery and prediction; while informative, they are not recommended based on results here.
- The evaluation uses semi-synthetic data derived from a specific stated choice context; generalization to other domains, attribute structures, and larger dimensionalities should be further validated.
- Some potentially useful variants (e.g., NCVMP with QMC) were not included due to numerical instability and positive-definiteness issues.
- Comparisons with alternative frequentist approaches like MACML for MMNP are out of scope; Stan/NUTS was deemed too computationally expensive for the studied sample sizes.