Machine translation of Chinese classical poetry: a comparison among ChatGPT, Google Translate, and DeepL Translator

R. Gao, Y. Lin, et al.

This study by Ruiyao Gao, Yumeng Lin, Nan Zhao, and Zhenguang G. Cai evaluates machine translation of Chinese classical poetry. It reports that ChatGPT outperforms Google Translate and DeepL in fidelity, fluency, and language style, and that a prompt asking for rhymed output helps preserve the rhythm and rhyme of the poems.

Introduction

The paper investigates whether ChatGPT, a large language model (LLM), represents an advancement over conventional neural machine translation systems (Google Translate and DeepL Translator) in translating Chinese classical poetry. Machine translation is crucial across domains, but literary translation, especially of poetry, remains challenging because of dense imagery, ambiguity, allusions, rhythm, and style. With recent advances in LLMs, ChatGPT’s general language understanding, contextual handling, and fluent generation raise the possibility of improved literary translation. The study compares ChatGPT against Google Translate and DeepL on fidelity, fluency, language style, and machine translation style, and assesses whether prompt design (including instructions to interpret meaning and preserve rhyme) can enhance ChatGPT’s performance. Research questions: (1) Does ChatGPT outperform Google Translate and DeepL in Chinese classical poetry translation, and in what aspects? (2) What advantages and limitations does ChatGPT show relative to traditional translators? (3) Can prompt variations improve ChatGPT’s performance and reduce machine-like style?

Literature Review

The literature on machine translation has evolved from rule-based and statistical methods to neural approaches, with transformer-based neural machine translation (NMT) becoming dominant due to improved fluency and accuracy. However, scholars have been skeptical about machine translation for literary texts, citing issues with ambiguity resolution, contextual dependence, stylistic fidelity, and discourse-level coherence. Research highlights deficits in verb richness, syntactic diversity, and linking patterns in machine-translated literature compared to human translations. Recent evaluations show GPT models can be competitive in translation, exhibiting strong fluency but potential misalignments and punctuation overuse, and differing performance across language resource levels. ChatGPT’s strengths include conversational ability, in-context learning, chain-of-thought prompting, and creative generation, which may benefit literary translation. Prompt engineering can tailor outputs for domain- and task-specific needs. Despite promising signs (e.g., performance on high-resource language pairs), ChatGPT’s application to Chinese classical poetry—rich in allusions, imagery, and compact syntax—remains underexplored, particularly for translations into English.

Methodology

Design: Comparative evaluation of translations of Chinese classical poems produced by ChatGPT (GPT-3.5) under two prompts, Google Translate, and DeepL Translator. Outcomes rated on fidelity, fluency, language style, machine translation style, and overall performance.

Materials: 21 poems (12 quatrains and 9 Lushi), selected from China Poetry Network’s Poetry and Verse Weekly Selections (issues 14 and 16, published April 10, 2023 and April 24, 2023). These contemporary selections, curated via expert review and popular voting, were chosen to reduce prior exposure in training and ensure randomness.

Systems and prompts: Google Translate (web), DeepL Translator (web), and ChatGPT (GPT-3.5). Two ChatGPT prompts: (Prompt 1) “Please provide the English translation for the following material.” (Prompt 2) “The following are Chinese classical poems, please interpret their meaning first and translate them into English poems with rhymes.” Prompt 2 embeds domain and task info, requests interpretation first, and instructs rhyme preservation, inspired by in-context learning and chain-of-thought.
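As a rough illustration of how the two prompt conditions differ, the sketch below pairs each instruction with a poem. The two instruction strings are quoted from the study; the function name `build_prompt`, the blank-line separator, and the placeholder poem text are assumptions made for this example and are not described in the paper.

```python
# Instruction texts quoted from the study; everything else here
# (names, layout, placeholder poem) is a hypothetical illustration.
PROMPT_1 = "Please provide the English translation for the following material."
PROMPT_2 = (
    "The following are Chinese classical poems, please interpret their "
    "meaning first and translate them into English poems with rhymes."
)


def build_prompt(poem: str, variant: int) -> str:
    """Return the chosen instruction followed by the poem text."""
    if variant == 1:
        instruction = PROMPT_1
    elif variant == 2:
        instruction = PROMPT_2
    else:
        raise ValueError("variant must be 1 or 2")
    # Assumed layout: instruction, blank line, then the poem.
    return f"{instruction}\n\n{poem.strip()}"


if __name__ == "__main__":
    placeholder_poem = "<Chinese poem text goes here>"
    print(build_prompt(placeholder_poem, 2))
```

Keeping the instruction separate from the poem text makes it easy to send the same 21 poems through both conditions without changing anything else.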

Raters: Nine native Chinese speakers with master’s or PhD degrees in English translation (universities in China).

Platform: Qualtrics.

Procedure: For each of the 21 poems, raters evaluated four anonymous translations (Google, DeepL, ChatGPT Prompt 1, ChatGPT Prompt 2) on 5-point Likert scales for fidelity, fluency, language style, machine translation style, and overall performance. Translation order and scoring order were randomized per rater.

Compensation: 80 yuan.
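The per-rater randomization in the rating procedure could be set up along the following lines. This is a minimal sketch, not the study's Qualtrics configuration: it assumes that "scoring order" means the order of the five criteria, that shuffling happens separately for each poem, and that translations are shown under neutral labels. The function name, seed scheme, and labels are all hypothetical.

```python
import random

SYSTEMS = ["Google Translate", "DeepL Translator",
           "ChatGPT Prompt 1", "ChatGPT Prompt 2"]
CRITERIA = ["fidelity", "fluency", "language style",
            "machine translation style", "overall"]


def presentation_plan(rater_id: int, n_poems: int = 21, seed: int = 0):
    """Build one rater's plan: for each poem, a shuffled and anonymized
    order of the four translations and a shuffled order of the criteria."""
    # Separate, reproducible random stream per rater (assumed seed scheme).
    rng = random.Random(f"{seed}-{rater_id}")
    plan = []
    for poem in range(1, n_poems + 1):
        systems = rng.sample(SYSTEMS, k=len(SYSTEMS))
        criteria = rng.sample(CRITERIA, k=len(CRITERIA))
        # Neutral labels hide which system produced which translation.
        labels = [f"Translation {chr(ord('A') + i)}" for i in range(len(systems))]
        plan.append({
            "poem": poem,
            "shown_as": dict(zip(labels, systems)),
            "criteria_order": criteria,
        })
    return plan


if __name__ == "__main__":
    first = presentation_plan(rater_id=1)[0]
    print(first["shown_as"])
    print(first["criteria_order"])
```

Storing the label-to-system mapping alongside each poem lets the ratings be matched back to their systems after collection.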

Scales: Defined 1–5 for each criterion (fidelity, fluency, language style, machine translation style [higher = more machine-like], overall), with detailed anchors provided.

Ethics: Approved by institutional board; written informed consent obtained.

Analysis: Linear mixed-effects models (R lme4; significance via lmerTest). Fixed effect: translator (Google/DeepL/ChatGPT). Random effects: rater and poem. Sum contrasts with ChatGPT as reference. Forward algorithm to determine effect structure (alpha 0.2) with maximal random effects. Separate models per criterion. Subsequent comparison between ChatGPT Prompt 1 and Prompt 2 across criteria.
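For readers less familiar with this model family, a random-intercept version of the model can be written as below. The notation is introduced here for illustration only: y is a rating on one criterion, s indexes the translation system, r the rater, and p the poem. The study's forward selection procedure may have retained additional random slopes, which this simplified form leaves out.

```latex
\[
  y_{spr} = \beta_0 + \beta_s + u_r + v_p + \varepsilon_{spr},
  \qquad
  u_r \sim \mathcal{N}(0, \sigma_r^2), \quad
  v_p \sim \mathcal{N}(0, \sigma_p^2), \quad
  \varepsilon_{spr} \sim \mathcal{N}(0, \sigma^2)
\]
```

Here β_s captures the fixed effect of translator, while u_r and v_p absorb the fact that some raters score more generously and some poems are harder to translate than others.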

Key Findings

ChatGPT vs. DeepL/Google (Prompt 1 used for ChatGPT in the three-way comparison):
• Fidelity: ChatGPT mean 3.83 exceeded DeepL 2.97 (β=0.85, SE=0.25, t=3.38, p=0.001) and Google 2.75 (β=1.07, SE=0.22, t=4.92, p<0.001); no significant difference between Google and DeepL (2.75 vs. 2.97; β=0.22, SE=0.13, t=1.78, df=14.22, p=0.21).
• Fluency: ChatGPT 3.82 exceeded DeepL 2.47 (β=1.35, SE=0.14, t=9.69, df=8, p<0.001) and Google 2.44 (β=1.38, SE=0.12, t=11.30, df=8, p<0.001); Google vs. DeepL not significant (2.44 vs. 2.47; β=0.03, SE=0.09, t=0.35, p=0.93).
• Language style: ChatGPT 3.36 exceeded DeepL 2.06 (β=1.30, SE=0.13, t=10.02, df=8, p<0.001) and Google 2.05 (β=1.31, SE=0.11, t=12.27, df=8, p<0.001); Google vs. DeepL not significant (2.05 vs. 2.06; β=0.0053, SE=0.09, t=0.06, p=0.98).
• Machine translation style (lower is better): ChatGPT 1.69 was lower than DeepL 3.91 (β=−2.22, SE=0.28, t=−7.98, df=8, p<0.001) and Google 3.97 (β=−2.28, SE=0.17, t=−13.23, df=8, p<0.001); Google vs. DeepL not significant (3.97 vs. 3.91; β=−0.058, SE=0.21, t=−0.28, p=0.96).
• Overall: ChatGPT 3.61 exceeded DeepL 2.29 (β=1.33, SE=0.17, t=7.62, df=11.7, p<0.001) and Google 2.13 (β=1.33, SE=0.14, t=10.53, df=10.3, p<0.0001); no significant difference between DeepL and Google (2.29 vs. 2.13; β=0.15, SE=0.10, t=1.52, df=10.6, p=0.32).
Prompt effects within ChatGPT: Relative to Prompt 1, Prompt 2 (interpret first, translate with rhyme) did not significantly change fidelity (Prompt 1 3.83 vs. Prompt 2 3.58; β=0.25, SE=0.20, t=1.22, df=8, p=0.26) or fluency (Prompt 1 3.82 vs. Prompt 2 3.98; β=0.16, SE=0.14, t=1.16, df=8, p=0.28). Prompt 2 significantly improved language style, reduced machine translation style, and raised overall performance (Prompt 2 4.01 vs. Prompt 1 3.61; β=0.40, SE=0.16, t=2.56, df=8, p=0.034).
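When reading these statistics, it can help to remember that each t value is approximately the estimate β divided by its standard error. The short sketch below recomputes that ratio for three of the reported comparisons. The helper `t_ratio` is a hypothetical name, and because β and SE are published to two decimals, the recomputed ratios are expected to differ slightly from the reported t values.

```python
def t_ratio(beta: float, se: float) -> float:
    """Wald-style t statistic: estimate divided by its standard error."""
    return beta / se


# (label, beta, SE, reported t), copied from the findings above.
reported = [
    ("fidelity: ChatGPT vs. DeepL", 0.85, 0.25, 3.38),
    ("fluency: ChatGPT vs. Google", 1.38, 0.12, 11.30),
    ("overall: Prompt 2 vs. Prompt 1", 0.40, 0.16, 2.56),
]

for label, beta, se, t_reported in reported:
    # Small gaps are expected because beta and SE are rounded.
    print(f"{label}: beta/SE = {t_ratio(beta, se):.2f}, "
          f"reported t = {t_reported:.2f}")
```

A large mismatch between the ratio and the reported t would suggest a transcription error in one of the three numbers, which makes this a convenient check when copying results into other documents.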

Discussion

The study shows ChatGPT significantly outperforms Google Translate and DeepL Translator across fidelity, fluency, language style, machine translation style, and overall quality in translating Chinese classical poetry. Its strengths stem from strong language understanding, contextual reasoning, and creative generation, enabling it to interpret rhetorical devices and classical imagery (e.g., correctly mapping symbolic terms like “蟾辉” to moonlight and “杜宇” to cuckoo) and to handle pro-drop phenomena by supplying appropriate subjects and connective elements in English, resulting in more coherent, natural translations. Prompt engineering further enhances stylistic faithfulness and human-likeness: instructing interpretation before translation and requiring rhyme preserves rhythm and rhyme schemes (often AABB), reduces robotic tone, and better aligns with poetic form.

Nevertheless, challenges remain. Despite superiority over conventional NMT, average fidelity scores fall between “moderate” and “good,” indicating room for improvement in accurately conveying nuanced content. ChatGPT sometimes fails to apply specialized knowledge consistently during full-text translation (e.g., transliterating “郭墙” rather than rendering its meaning) and may introduce hallucinated or stylistically motivated additions (e.g., “with grace”) that diverge from source meaning, especially under constraints to rhyme. These findings suggest that human post-editing remains important and that targeted domain knowledge and controls are needed to balance fluency, poeticity, and faithfulness.

Conclusion

ChatGPT surpassed Google Translate and DeepL Translator on all evaluated criteria for translating Chinese classical poetry. Prompting ChatGPT to interpret first and translate into rhymed English poems further improved language style and reduced machine translation style without sacrificing fidelity or fluency. Analyses indicate ChatGPT can recognize common symbols and imagery, fill in omitted subjects, and maintain inter-sentence coherence to produce fluent, less robotic translations. However, fidelity remains below a consistently “good” level, and issues such as misapplication of specialized knowledge and occasional hallucinations persist. Future work should refine and fine-tune models with domain-specific corpora and annotations, optimize prompt strategies, and explore human–AI collaboration workflows to enhance accuracy, stylistic adequacy, and cultural sensitivity in literary translation.

Limitations

Fidelity remains between “moderate” and “good” on average, indicating incomplete preservation of nuanced meaning.
Inconsistent application of specialized cultural or classical knowledge during full translations (e.g., transliteration of terms like “郭墙” instead of semantic rendering).
Occasional hallucinations or stylistically driven insertions, more evident when enforcing rhyme (Prompt 2), which can deviate from source meaning.
Study focused on GPT-3.5 and two commercial NMT systems; results may not generalize to other models or settings.
