The diachronic change of research article abstract difficulty across disciplines: a cognitive information-theoretic approach

Linguistics and Languages

X. Zhao, L. Li, et al.

This research delves into the evolving complexity of research article abstracts over 21 years across multiple disciplines. Conducted by Xi Zhao, Li Li, and Wei Xiao, the study reveals intriguing patterns of cognitive difficulty, shedding light on how knowledge is encoded and decoded in various fields.
Introduction

The paper examines how the difficulty of research article (RA) abstracts has changed over time across disciplines, addressing criticisms of traditional readability metrics (e.g., FRE, SMOG) that rely on surface-level features and neglect cognitive processes and task context. The authors propose a cognitively grounded, information-theoretic approach that separates text difficulty into encoding (writer-side information content) and decoding (reader-side processing load), aligning with abstracts’ dual informational and promotional roles. They highlight gaps in longitudinal, cross-disciplinary analyses, noting prior focus on single disciplines or descriptive comparisons without inferential tests. Research questions: (1) Are there diachronic changes in cognitive encoding difficulty of abstracts, and do they vary across natural sciences, social sciences, and humanities? (2) Are there diachronic changes in cognitive decoding difficulty of abstracts, and do they vary across these disciplinary areas?

Literature Review

Prior work shows RA abstracts are generally difficult (e.g., Gazni, 2011; Lei & Yan, 2016; Ante, 2022) and often becoming less readable over time using traditional formulas (FRE, SMOG, NDC; Plavén-Sigray et al., 2017; Wen & Lei, 2022; Wang et al., 2022). The relationship between readability and impact is mixed. Traditional formulas are criticized for using simplistic proxies for complex cognitive reading processes and for overlooking reader-text-task interactions. Cognitive models (e.g., Kintsch & van Dijk; Coh-Metrix) and NLP/ML methods introduce many variables/features, complicating interpretation and use; moreover, gains over classic features can be modest. Disciplinary variations in academic writing are well-attested, but diachronic, cross-disciplinary comparisons of abstract difficulty are scarce; many studies are mono-disciplinary or provide descriptive results without inferential comparison across fields. This motivates a theoretically grounded, parsimonious metric set and a macro-level, cross-disciplinary longitudinal analysis.

Methodology

  • Design: Longitudinal, cross-disciplinary analysis covering 2000–2020.
  • Sampling: Systematic stratified random sampling of 1,890 RA abstracts (316,319 words) from nine fields: natural sciences (physics, chemistry, biology), social sciences (economics, education, psychology), and humanities (philosophy, history, linguistics). For each field, the three top journals by 2021 JCR impact factor were selected, and 10 research articles per field per year were randomly chosen.
  • Measurements: Encoding difficulty was indexed by entropy. Shannon entropy H(T) = −Σ p(w) log2 p(w) was computed over word types, with the Miller–Madow correction H(MM) = H(T) + (V − 1)/(2N) applied to mitigate small-sample bias, where V is the number of word types and N the number of tokens. Decoding difficulty was indexed by mean dependency distance (MDD): for each dependency relation, DD = |PG − PD|, where PG and PD are the sentence positions of the governor and the dependent; MDD is the mean DD over all dependency relations in a sentence.
  • Tools and processing: Entropy was computed in R (v4.0.3) from type probabilities. MDD was computed with the Stanford CoreNLP parser (v3.9.2) in Python (v3.10) by annotating dependencies, computing DDs, and averaging per sentence. FRE was computed in Python for validation: FRE = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words).
  • Analysis: Annual and disciplinary means were plotted with fitted curves; simple linear regressions tested diachronic trends (α = 0.05).
  • Reliability check: Pearson correlations between FRE and entropy/MDD.
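The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the dependency pairs would normally come from a parser such as Stanford CoreNLP, but here they are supplied by hand.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy H(T) = -sum p(w) * log2 p(w) over word types."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def miller_madow_entropy(tokens):
    """Miller-Madow corrected entropy: H(MM) = H(T) + (V - 1) / (2N),
    where V is the number of types and N the number of tokens."""
    v = len(set(tokens))
    n = len(tokens)
    return shannon_entropy(tokens) + (v - 1) / (2 * n)

def mean_dependency_distance(pairs):
    """MDD over (governor_position, dependent_position) pairs;
    each DD = |PG - PD| with positions counted within the sentence."""
    return sum(abs(g, -d) if False else abs(g - d) for g, d in pairs) / len(pairs)

def flesch_reading_ease(words, sentences, syllables):
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Toy sentence; dependency pairs are hand-crafted for illustration only.
tokens = "the cat sat on the mat".split()
h = shannon_entropy(tokens)            # ~2.25 bits
h_mm = miller_madow_entropy(tokens)    # corrected upward by (V-1)/(2N)
mdd = mean_dependency_distance([(2, 1), (5, 3), (5, 4)])
```

Note that the correction term (V − 1)/(2N) shrinks as the sample grows, so for long texts H(MM) is close to H(T); it matters most for short texts such as abstracts.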

Key Findings

Validation: FRE showed a significant downward trend over 2000–2020 (F(1,19) = 51.07, β = −0.271, p = 8.589e−07) and correlated significantly with entropy (r = −0.754, p < .001) and MDD (r = 0.488, p = 0.025), supporting the reliability of the two metrics.

Encoding difficulty (entropy): Overall, a significant upward trend (F(1,19) = 84.744, β = 0.009, p < .001). By area: natural sciences increased (F(1,19) = 21.389, β = 0.010, p = 0.006); social sciences increased (F(1,19) = 114.000, β = 0.014, p < .001); humanities showed no significant change (F(1,19) = 1.585, β = 0.003, p = 0.891).

Decoding difficulty (MDD): Overall, a significant downward trend (F(1,19) = 7.163, β = −0.005, p = 0.003). By area: natural sciences decreased (F(1,19) = 12.852, β = −0.012, p = 0.001); humanities decreased (F(1,19) = 7.526, β = −0.006, p = 0.003); social sciences showed no significant change (F(1,19) = 1.536, β = 0.003, p = 0.521).

Overall interpretation: Across 2000–2020, the encoding difficulty of abstracts increased while their decoding difficulty decreased, with the disciplinary exceptions noted above.
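The trend tests behind these results are simple linear regressions of annual means on year. The following sketch computes an ordinary-least-squares slope and Pearson r with only the standard library; the five-year entropy series is hypothetical, invented for illustration, and is not the study's data.

```python
import math

def linear_trend(years, values):
    """OLS slope and Pearson r for a yearly series: a bare-bones
    version of a simple linear regression trend test."""
    n = len(years)
    mx = sum(years) / n
    my = sum(values) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, values))
    sxx = sum((x - mx) ** 2 for x in years)
    syy = sum((y - my) ** 2 for y in values)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical annual mean entropies drifting slightly upward.
years = [2000, 2001, 2002, 2003, 2004]
entropy = [8.10, 8.12, 8.11, 8.15, 8.16]
slope, r = linear_trend(years, entropy)  # positive slope, strong r
```

A positive slope with a small p-value (obtained from the regression's F test, which this sketch omits) is what the study reports as a significant upward trend.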

Discussion

The findings answer the research questions by showing opposing diachronic trends for encoding and decoding difficulty. Information content (entropy) increased, consistent with abstracts' informative role and the cumulative growth of academic knowledge. Decoding load (MDD) decreased, aligning with abstracts' promotional function, the push toward clearer, more accessible academic communication, and the general tendency toward dependency-distance minimization in human language, which supports efficient processing. Disciplinary divergences reflect domain characteristics. The humanities, with interpretive and less paradigmatically cumulative traditions, showed no significant change in encoding difficulty. The social sciences, which straddle the methods and rhetoric of both the sciences and the humanities, showed no significant change in decoding difficulty, possibly because of cross-disciplinary complexity and a reluctance to simplify that preserves scientific legitimacy. These results extend prior readability-trend research by disentangling difficulty into encoding and decoding components, revealing that while surface readability measures often suggest increasing difficulty, syntactic processing load can diminish over time.

Conclusion

The study introduces a cognitive information-theoretic framework for text difficulty that separates encoding (entropy) and decoding (MDD) aspects. Analyzing 1890 abstracts from 2000–2020 across natural sciences, social sciences, and humanities, the authors find rising encoding difficulty and falling decoding difficulty overall, with stable encoding difficulty in humanities and stable decoding difficulty in social sciences. Contributions include a theoretically grounded, parsimonious measurement approach that captures reader-text-task interactions and clarifies disciplinary variation in abstract writing. Future research directions suggested include expanding datasets via automated retrieval across more fields and time spans, enriching indices (e.g., relative entropy, normalized dependency distance, dependency relations), and extending analyses from abstracts to full research articles.

Limitations
  • Corpus construction required manual, labor-intensive selection; future work could use automated web crawling to broaden coverage of fields, journals, articles, and periods.
  • The chosen indices (entropy and MDD) are not exhaustive; additional measures (e.g., relative entropy, normalized dependency distance, dependency relations) could provide finer-grained insights.
  • Findings are based on abstracts; generalization to full articles should be made with caution, and replication on full texts is encouraged.