The impact of online review helpfulness and word of mouth communication on box office performance predictions

S. Lee and J. Y. Choeh

This research by Sangjae Lee and Joon Yeon Choeh examines how online review helpfulness moderates the ability of electronic word-of-mouth (eWOM) to predict box office revenue. Focusing on the Korean movie market, the study finds that movies with more helpful reviews yield more accurate predictions, an insight relevant to filmmakers and marketers alike.

Introduction
Box office revenue is a key indicator of a movie’s success, and accurate forecasting reduces market risk and improves competitiveness in the film industry. Prior work shows eWOM influences purchase decisions and sales, with variables such as volume and valence often predictive of outcomes. However, the relative importance of these eWOM factors may vary across contexts and outcomes, and there is a paucity of research examining moderators that condition their predictive power. This study investigates whether the helpfulness of reviews and reviewers moderates the relationship between eWOM and box office outcomes. Specifically, the authors compare prediction performance using machine learning methods across subsamples of movies with high versus low review or reviewer helpfulness in the Korean market. The purpose is to assess whether higher-quality eWOM (as proxied by helpfulness) enhances the explanatory and predictive power of standard eWOM measures for week-by-week box office performance.
Literature Review
Prior studies identify both eWOM characteristics (e.g., valence/ratings, volume/number of reviews, review length/depth, sentiment proportions, extremity) and movie characteristics (e.g., star or director power, genre, nationality, release timing, sequels) as determinants of box office revenue. eWOM influences decision-making, repurchase intention, and perceived risk, with social networks amplifying its effect. While volume and valence often relate to sales, their effects can vary by product and context. Review helpfulness, capturing the perceived utility and effectiveness of reviews, links to purchase intentions and sales and is considered a quality signal of eWOM. Helpfulness can therefore moderate the impact of eWOM on performance. Box office prediction methods used in prior research include regression, SVR, neural networks, and Bayesian networks. Building on this literature, the study posits that higher review and reviewer helpfulness will increase the predictive power of eWOM variables. The hypotheses are: (H1) Movies with greater review helpfulness have greater prediction power of eWOM for current box office revenue than those with lower review helpfulness; (H2) Movies with greater reviewer helpfulness have greater prediction power of eWOM for current box office revenue than those with lower reviewer helpfulness.
Methodology
Research design and hypotheses: The study tests whether the helpfulness of reviews and reviewers moderates the predictive power of eWOM for weekly box office revenue (weeks 1–3). Two hypotheses posit higher predictive power in high-helpfulness subsamples, for review helpfulness and reviewer helpfulness respectively.

Data sources and sample: eWOM data were crawled from Naver Movies (Korea’s most visited portal). Box office revenues and star power data were obtained from KOFIC. The dataset includes 1,798 films released between January 2014 and May 2016, with data collected in June 2016. eWOM and box office data were gathered from one week prior to release through three weeks post-release.

Variables:
- Dependent variables: Weekly box office revenue for weeks 1, 2, and 3 post-release. For the machine learning classification tasks, each week’s revenue was discretized into a binary class indicating whether it falls in the top 20% of the distribution (1) or not (0).
- eWOM predictors (time-varying unless noted): average number of reviews (volume), average review rating (valence), average review extremity (absolute deviation from the overall mean rating), average review length (words), average number of emotional reviews, average number of positive reviews, helpfulness of review (time-varying), and helpfulness of reviewer (average per movie; not time-varying).
- Movie-related control variables (non-time-varying): star power (total revenue earned by actors), award (binary for Academy winners/nominees), sequel (binary), timing of release (holiday season: July or December), genre (binary indicator for drama), and nationality (binary indicators for Korea and US).

Subsample formation by helpfulness: Movies were split into high vs. low helpfulness subsamples based on average helpfulness thresholds (e.g., prerelease review helpfulness mean = 0.589; movies above this value classified as high helpfulness). Reviewer helpfulness subsamples are based on average reviewer helpfulness (not time-varying).
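The feature construction and subsample formation described above can be sketched as follows. This is a minimal illustration with synthetic data; all column names are hypothetical, not the authors' actual variable names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per movie (illustrative columns only).
rng = np.random.default_rng(0)
movies = pd.DataFrame({
    "avg_rating": rng.uniform(1, 10, 200),
    "review_helpfulness": rng.uniform(0, 1, 200),
    "week1_revenue": rng.lognormal(10, 1, 200),
})

# Review extremity: absolute deviation of a movie's average rating
# from the overall mean rating across all movies.
movies["extremity"] = (movies["avg_rating"] - movies["avg_rating"].mean()).abs()

# Discretize weekly revenue: 1 if in the top 20% of the distribution, else 0.
cutoff = movies["week1_revenue"].quantile(0.8)
movies["top20"] = (movies["week1_revenue"] > cutoff).astype(int)

# Split into high vs. low helpfulness subsamples around the sample mean
# (the paper reports, e.g., a prerelease review-helpfulness mean of 0.589),
# then match the two groups to equal sizes using the smaller group as the cap.
threshold = movies["review_helpfulness"].mean()
high = movies[movies["review_helpfulness"] > threshold]
low = movies[movies["review_helpfulness"] <= threshold]
n = min(len(high), len(low))
high = high.sample(n, random_state=0)
low = low.sample(n, random_state=0)
```

The equal-size matching step mirrors the paper's approach of capping both subsamples at the size of the smaller group so that prediction errors are compared on comparable samples.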
For fair comparisons, high and low subsamples were matched to equal sizes using the smaller group as the cap.

Modeling approach:
1) Explanatory modeling: Multiple regression analyses assessed the explanatory power (adjusted R^2) of eWOM variables (with controls) for box office revenue at week t using eWOM at t−1 (prerelease for week 1; week 1 for week 2; week 2 for week 3), separately for high vs. low helpfulness subsamples.
2) Predictive modeling: Four machine learning classifiers were used to predict top-20% box office classes: random forests (ensemble of 10 trees; bootstrap sampling; random feature selection), decision trees with boosting (ensemble of 10 boosted learners with sampling weighted toward misclassified/large-error observations), k-nearest neighbor (k=1; class assignment based on the most similar movie in the training set), and discriminant analysis (assignment to the class with the highest discriminant score). A fixed random seed was used for bootstrapping.

Validation protocol: N-fold cross-validation was used within each matched subsample pair. For review helpfulness: week 1 used 30 folds on 881 films per group; week 2 used 34 folds on 505 films; week 3 used 37 folds on 368 films. For reviewer helpfulness: 29 folds on 578 films for each of weeks 1–3. In each fold, one subset served as validation and the remainder as training. Classification error (misclassification rate) was averaged across folds. Paired t-tests compared average prediction errors between high and low helpfulness subsamples for each method and week.
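The validation protocol and paired comparison can be sketched as below. The data are synthetic stand-ins for a matched high/low-helpfulness subsample pair, and the fold count and classifier settings are illustrative (the paper's fold counts and sample sizes differ by week).

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for matched subsamples; extra label noise (flip_y)
# makes the "low helpfulness" group harder to classify.
X_hi, y_hi = make_classification(n_samples=300, weights=[0.8], random_state=0)
X_lo, y_lo = make_classification(n_samples=300, weights=[0.8], flip_y=0.2,
                                 random_state=1)

def fold_errors(X, y, n_folds=30):
    """Per-fold misclassification rates under n-fold cross-validation."""
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    return 1.0 - cross_val_score(clf, X, y, cv=n_folds)

err_hi = fold_errors(X_hi, y_hi)
err_lo = fold_errors(X_lo, y_lo)

# Paired t-test on per-fold errors, as in the high vs. low comparisons.
t, p = stats.ttest_rel(err_hi, err_lo)
print(f"mean error high={err_hi.mean():.3f} low={err_lo.mean():.3f} "
      f"t={t:.2f} p={p:.3f}")
```

A paired test is appropriate here because the two error series are produced under the same fold structure, so fold-level variation is partially shared between groups.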
Key Findings
Regression (explanatory power):
- Review helpfulness subsamples: Adjusted R^2 was higher for high helpfulness at week 1 (0.638 vs. 0.567; F=124.1 vs. 89.3) and week 2 (0.518 vs. 0.431; F=40.3 vs. 70.9), but at week 3 the low-helpfulness group showed higher adjusted R^2 (0.177 vs. 0.246; F=3.912 vs. 26.106).
- Reviewer helpfulness subsamples: Adjusted R^2 was consistently higher for high helpfulness: week 1 (0.724 vs. 0.532; F=117.5 vs. 107.0), week 2 (0.583 vs. 0.421; F=58.5 vs. 64.8), week 3 (0.415 vs. 0.246; F=15.6 vs. 26.1).
- Predictor effects: Review volume (average number of reviews) had a consistently significant positive effect across all subsamples and weeks. Review valence effects were mixed: negative or insignificant for week 1 in several subsamples, turning positive by week 3 in low-helpfulness groups. Review extremity tended to have negative effects in week 1 (e.g., significant negatives in low-helpfulness subsamples), suggesting moderate reviews are more impactful than extreme ones early on.

Machine learning (predictive accuracy; lower error is better):
- Review helpfulness (Table 9): High-helpfulness groups had lower average classification errors in most cases. Week 1: random forest 0.077 vs. 0.120 (t=2.255, p=0.023); boosting 0.089 vs. 0.131 (t=1.969, p=0.054); kNN 0.127 vs. 0.152 (ns); discriminant 0.148 vs. 0.179 (ns). Week 2: all methods significantly better in high helpfulness (e.g., RF 0.111 vs. 0.233; t=4.960, p<0.001). Week 3: RF significant (0.259 vs. 0.345; t=2.430, p=0.018); others mostly marginal or ns; discriminant improved (0.247 vs. 0.309; p=0.076).
- Reviewer helpfulness (Table 10): Weeks 1 and 2 showed significantly lower errors in high helpfulness across all methods (e.g., week 1 RF 0.055 vs. 0.150; t=4.857, p<0.001). Week 3: discriminant analysis significant (0.235 vs. 0.305; t=2.246, p=0.029); other methods showed non-significant differences.
Overall, higher review/reviewer helpfulness generally improved both explanatory power and predictive performance of eWOM-based models, especially in earlier weeks post-release and more strongly for reviewer helpfulness.
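For reference, the adjusted R^2 statistic used in the regression comparisons penalizes plain R^2 for the number of predictors, so subsamples with different sizes and model dimensions remain comparable. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def adjusted_r2(y, y_hat, k):
    """Adjusted R^2 for n observations and k predictors:
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Toy usage: a perfect fit gives adjusted R^2 of exactly 1.0.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(adjusted_r2(y, y, 2))
```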
Discussion
The findings support the central hypothesis that the quality of eWOM—captured by review and reviewer helpfulness—moderates and enhances the ability of eWOM variables to explain and predict box office outcomes. Movies with more helpful reviews or reviewed by more helpful reviewers yielded higher adjusted R^2 and lower classification errors in most comparisons, indicating that audiences and predictive models alike benefit from informative, credible review signals. Volume consistently drove performance, underscoring the potency of aggregate attention. Valence effects evolved over time, with weaker or negative associations immediately upon release but becoming positive later, aligning with the notion that early audiences may be less influenced by ratings or that ratings stabilize as more information accrues. The negative association of extremity, especially in week 1, suggests moderate, balanced reviews are more influential for experience goods like movies. Cross-method consistency (random forests, boosted trees, kNN, discriminant analysis) reinforces robustness, though some week/method combinations showed marginal or no significance, particularly at week 3 for review helpfulness, implying diminishing marginal utility of eWOM signals or shifting reliance on other factors as time passes. The use of the Korean market extends eWOM insights beyond the predominantly US-focused literature, illustrating that eWOM quality considerations generalize internationally.
Conclusion
This study demonstrates that incorporating review and reviewer helpfulness as moderators enhances the explanatory and predictive power of eWOM for movie box office performance. Using Korean market data and multiple machine learning methods, the authors show that high-helpfulness subsamples generally yield higher adjusted R^2 in regression analyses and lower classification errors in predictive models, especially in weeks 1–2 post-release. Review volume is a consistently strong predictor, while valence and extremity exhibit time-dependent and partial effects. Contributions include: (1) establishing helpfulness as a pivotal moderator improving eWOM-based predictions; (2) providing a comparative evaluation of multiple BI methods under differing eWOM quality conditions; and (3) extending box office prediction research to an Asian market.

Future research directions suggested by the authors include:
- testing additional semantic/textual eWOM features (e.g., concepts, writing styles, proportion of negative reviews);
- examining interaction effects between eWOM and helpfulness across different international contexts;
- applying the framework to other product categories;
- exploring alternative box office discretization thresholds (e.g., 10:90, 50:50) to assess sensitivity;
- incorporating additional movie covariates (e.g., MPAA ratings/age restrictions, number of screens, screening days, production budget);
- evaluating models with additional performance metrics (e.g., RMSE) beyond classification accuracy.
Limitations
- Market and platform scope: Analyses are confined to the Korean market and Naver Movies reviews, which may limit generalizability to other countries or platforms.
- Time window: The sample covers releases from January 2014 to May 2016, which may not reflect more recent dynamics in eWOM or box office behavior.
- Dependent variable discretization: The predictive task binarizes revenue into top 20% vs. others; results may be sensitive to this threshold and may differ for continuous revenue prediction.
- Feature set: Some potentially important movie covariates (e.g., MPAA/age ratings, number of screens, screening days, production budget) were not included.
- Modeling/parameter choices: kNN used k=1; tree ensembles used 10 learners with no pruning limits; alternative parameterizations could affect performance. Only classification error was used as the predictive metric.
- Subsample formation: High/low helpfulness splits based on average thresholds may introduce class imbalance or selection effects; although sample sizes were matched for comparisons, threshold choices could influence outcomes.