Oncology translation report
Executive Summary
This evaluation assessed the performance of three Qwen-series AI models—qwen-plus, qwen-turbo, and qwen3-32b—in translating oncology-related Chinese texts into English. The task required precise rendering of specialized terminology, including disease names, therapeutic regimens, drug nomenclature, and clinical metrics, while preserving the scientific rigor expected in medical literature. The scenario reflects real-world needs of researchers and clinicians preparing manuscripts or case reports for international publication.
The evaluation employed pairwise comparisons across 42 trials derived from 7 representative test queries, covering diverse oncology subdomains. Human evaluators judged translation accuracy, terminological fidelity, and stylistic appropriateness. Results showed qwen-plus significantly outperformed its counterparts, achieving a win rate of 64.3%. Both qwen-turbo and qwen3-32b tied at 42.9%, indicating comparable but notably lower performance.
qwen-plus demonstrated superior handling of complex medical concepts and consistent adherence to standard oncology nomenclature, making it the most reliable choice for high-stakes academic and clinical translation tasks in this domain. These findings support its adoption for professional medical translation workflows requiring precision and credibility.
Ranking Explanation
The ranking of the models—qwen-plus (1st), followed by qwen-turbo and qwen3-32b (tied for 2nd)—is based on a comprehensive evaluation across five key criteria: terminological accuracy, clinical fidelity, grammatical and stylistic appropriateness, consistency/standardization, and completeness. The win matrix and example analyses provide clear evidence for this ordering.
Why qwen-plus Ranks First
qwen-plus consistently outperforms the other two models, winning 64.3% of head-to-head comparisons against both qwen-turbo and qwen3-32b. This indicates it is reliably superior across the evaluated dimensions, particularly in stylistic cohesion and integration of clinical information, without compromising accuracy or completeness.
In Example 1, both qwen-plus and its competitor produced clinically accurate translations with correct terminology (e.g., “EGFR L858R mutation,” “osimertinib,” “MET amplification,” “liquid biopsy”). However, qwen-plus was preferred because it combined related clinical events into a single, fluid sentence, enhancing readability while preserving all clinical meaning. This demonstrates stronger performance in grammatical and stylistic appropriateness—a criterion that values conciseness and formal medical writing conventions.
Similarly, in Example 2, qwen-plus again integrated findings more elegantly (“with MET amplification detected by liquid biopsy”), showcasing superior syntactic structure without omitting details or distorting clinical relationships. While the difference was minor, the consistent edge in stylistic fluency across multiple examples solidifies its lead.
Notably, qwen-plus never lost decisively; even when differences were subtle, it either matched or slightly exceeded competitors in clinical fidelity and terminological precision, while adding value through more natural, publication-ready English prose.
Key Differences Between Top Models
-
qwen-plus vs. qwen-turbo & qwen3-32b: The primary distinction lies not in factual accuracy—both rivals also maintain high terminological accuracy and clinical fidelity—but in stylistic refinement and textual cohesion. qwen-plus structures sentences to reflect how oncology findings are typically reported in English-language literature (e.g., using subordinate clauses or participial phrases to link causally related events).
-
qwen-turbo vs. qwen3-32b: These two models are effectively equivalent in performance, as shown by their 50% win rate against each other and identical overall scores (42.9%). Both occasionally exhibit minor stylistic shortcomings compared to qwen-plus—such as less optimal sentence segmentation or slightly less idiomatic phrasing—but remain clinically reliable.
-
Grammatical Precision: Interestingly, qwen3-32b demonstrated superior article usage in Example 3, correctly inserting “an” before “R0 resection” due to the vowel sound of “R.” This shows that qwen3-32b can excel in fine-grained grammatical details. However, such strengths were not consistent enough across the evaluation set to overcome qwen-plus’s broader advantages in overall fluency and integration.
Conclusion
qwen-plus ranks first because it consistently delivers translations that are not only accurate and complete but also stylistically optimized for professional medical communication. While qwen-turbo and qwen3-32b are competent and largely equivalent—both meeting core requirements for clinical translation—they lack the refined syntactic control and narrative flow that distinguish qwen-plus. The evaluation rewards not just correctness, but how well the output aligns with the conventions of expert oncology writing in English, an area where qwen-plus demonstrates a measurable and repeatable advantage.
Model Analysis
Model: qwen-plus
-
Overall assessment: qwen-plus demonstrates strong overall performance in medical oncology translation, consistently producing accurate, fluent, and stylistically appropriate English output. It outperforms other models with a clear win rate (10 wins vs. 4 losses), indicating superior reliability in handling specialized content.
-
Key strengths:
- Terminological Accuracy & Consistency: Uses internationally accepted nomenclature (e.g., correctly renders terms like “R0 resection” with proper article usage: “an R0 resection”).
- Stylistic Fluency: Frequently produces more concise and fluid sentence structures that align with conventions of formal medical writing (e.g., integrating clauses smoothly without redundancy).
-
Clinical Fidelity: Preserves nuanced clinical relationships and sequencing without distortion, as evidenced by evaluators noting near-identical accuracy but preferring its phrasing for clarity.
-
Key weaknesses:
- Minor stylistic preferences sometimes hinge on negligible differences (e.g., “Following” vs. “After”), suggesting room for even greater precision in tense and transitional phrasing.
-
While rare, the model may occasionally prioritize conciseness over explicitness in complex clinical contexts, though no major omissions were noted in the sample reasons.
-
Improvement suggestions:
- Fine-tune clause integration in multi-fact sentences to ensure no subtle clinical qualifiers are lost during streamlining.
- Incorporate stricter adherence to passive voice conventions where standard in oncology literature (e.g., “was administered” vs. active constructions) to further align with genre norms.
Model: qwen-turbo
-
Overall assessment: qwen-turbo shows acceptable baseline competence but lags behind peers in nuanced medical translation, with a low win rate (4 wins vs. 10 losses). It often produces accurate translations but is consistently edged out by more polished alternatives.
-
Key strengths:
- Basic Terminological Accuracy: Appears to correctly identify and translate core oncology terms (e.g., drug regimens, mutations) in most cases, as losses are rarely due to factual errors.
-
Functional Clarity: Translations are generally understandable and clinically usable, suggesting solid foundational knowledge of medical concepts.
-
Key weaknesses:
- Stylistic Inferiority: Repeatedly criticized for less fluid or less concise phrasing compared to winning responses (e.g., failing to combine related clauses efficiently).
- Grammatical Nuances: Lacks attention to fine points of formal medical English, such as optimal tense usage or article placement, leading to slightly awkward or non-idiomatic constructions.
-
Passive Voice & Syntax: May underutilize passive constructions common in scientific writing, resulting in less conventional academic tone.
-
Improvement suggestions:
- Prioritize training on high-quality oncology literature to internalize syntactic patterns of native medical English.
- Implement post-editing rules for article usage (“a” vs. “an”), tense consistency, and clause compaction in multi-proposition sentences.
- Optimize for concision without sacrificing completeness—especially in reporting diagnostic findings or treatment sequences.
Model: qwen3-32b
-
Overall assessment: qwen3-32b delivers balanced and generally reliable performance, with an even win-loss record (7–7), reflecting consistent quality but inconsistent superiority. It excels in some dimensions while falling short in others depending on the prompt.
-
Key strengths:
- Grammatical Precision: Notably corrects subtle errors like article omission (e.g., adding “an” before “R0 resection”), demonstrating strong command of English grammar in technical contexts.
- Adaptability: Capable of producing both highly concise and clinically faithful translations when successful, showing flexibility in stylistic approach.
-
Terminological Fidelity: Aligns well with standard oncology terminology, as seen in accurate rendering of biomarkers and procedures.
-
Key weaknesses:
- Inconsistency: Performance varies noticeably across samples—sometimes winning for fluency, other times losing for the same criteria, suggesting instability in output quality.
-
Stylistic Judgment: Occasionally produces phrasing that, while accurate, is less streamlined than alternatives (e.g., missing opportunities to integrate related clinical facts into a single cohesive sentence).
-
Improvement suggestions:
- Apply consistency constraints during decoding to ensure uniform handling of recurring terms and syntactic structures.
- Enhance preference tuning using human feedback from oncology professionals to better distinguish between “accurate” and “publication-ready” translations.
- Strengthen integration of clinical logic into sentence planning—e.g., grouping temporally or causally linked events into unified clauses to improve readability and fidelity.
Representative Cases
Case 1
Query: 请将以下内容翻译成英文:III期结肠癌术后标准辅助化疗为FOLFOX方案(奥沙利铂+5-氟尿嘧啶+亚叶酸钙),疗程通常为6个月。
qwen-plus: The standard adjuvant chemotherapy after surgery for stage III colon cancer is the FOLFOX regimen (oxaliplatin + 5-fluorouracil + leucovorin), typically administered over a course of 6 months.
qwen-turbo: The standard adjuvant chemotherapy for stage III colon cancer after surgery is the FOLFOX regimen (oxaliplatin + 5-fluorouracil + leucovorin), typically administered over a period of 6 months.
Winner: qwen-plus
Evaluation Reason: Both responses are highly accurate and meet all evaluation criteria equally well. They correctly use internationally accepted terminology (e.g., 'stage III colon cancer', 'FOLFOX regimen', drug names), preserve clinical meaning, employ appropriate formal medical English syntax and passive voice, maintain consistency in terminology, and fully retain all clinical details including duration ('6 months'). The only difference is minor stylistic phrasing: 'after surgery for stage III...' vs. 'for stage III... after surgery' and 'course of' vs. 'period of'. Neither phrasing is superior in terms of clinical fidelity or standard usage; both are commonly seen in oncology literature. Therefore, neither response is clearly better.
Case 2
Query: 请将以下内容翻译成英文:患者携带EGFR L858R突变,一线使用奥希替尼治疗后疾病进展,液体活检检出MET扩增。
qwen-plus: The patient harbors an EGFR L858R mutation and experienced disease progression after first-line treatment with osimertinib. MET amplification was detected via liquid biopsy.
qwen-turbo: The patient harbors an EGFR L858R mutation. Disease progression occurred after first-line treatment with osimertinib, and MET amplification was detected by liquid biopsy.
Winner: qwen-plus
Evaluation Reason: Both responses are accurate and clear, but Response A is slightly better due to its more concise and fluid sentence structure. It combines the two related clinical events (disease progression after osimertinib and MET amplification detection) into a single coherent sentence, which enhances readability without sacrificing any clinical detail. Response B splits the information into two sentences, which is still correct but less elegant. Both maintain terminological accuracy, clinical fidelity, and grammatical correctness, but A demonstrates superior stylistic cohesion.
Case 3
Query: 请将以下内容翻译成英文:对于携带EGFR L858R突变的晚期非小细胞肺癌患者,一线治疗推荐奥希替尼单药或阿法替尼联合贝伐珠单抗。
qwen-plus: For patients with advanced non-small cell lung cancer (NSCLC) harboring the EGFR L858R mutation, first-line treatment is recommended as either osimertinib monotherapy or afatinib in combination with bevacizumab.
qwen-turbo: For patients with advanced non-small cell lung cancer harboring the EGFR L858R mutation, first-line treatment is recommended to be osimertinib monotherapy or afatinib combined with bevacizumab.
Winner: qwen-plus
Evaluation Reason: Response A uses more precise and natural phrasing ('is recommended as either...') that aligns better with standard medical English conventions. It also correctly includes parentheses around 'NSCLC' after the full term, which is the accepted formatting in biomedical writing. Response B’s phrasing ('is recommended to be...') is grammatically acceptable but less idiomatic in clinical guidelines. Both are accurate, but A demonstrates superior stylistic and terminological appropriateness.