Computer Science
COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization
A. Esteva, A. Kale, et al.
CO-Search, developed by Andre Esteva and colleagues, is a semantic search engine for navigating the rapidly growing COVID-19 literature. It combines semantic and keyword retrieval with question answering and abstractive summarization to surface relevant, reliable results and reduce the risk of misinformation.
~3 min • Beginner • English
Introduction
The study addresses the challenge of retrieving accurate, relevant scientific information from a rapidly expanding COVID-19 literature corpus. With thousands of new papers published weekly and critical decisions made by clinicians, policymakers, and researchers, traditional keyword search can miss semantically relevant content and is vulnerable to misinformation. The authors propose CO-Search, a cascaded semantic search engine that handles complex natural language queries over the CORD-19 corpus. The research goal is to improve retrieval effectiveness by combining deep-learning-based semantic retrieval with keyword-based methods and to enhance rankings using question answering and abstractive summarization, evaluated rigorously on the TREC-COVID benchmark. The work is important for enabling efficient literature exploration and reducing misinformation risk during a public health crisis.
Literature Review
The paper surveys COVID-19 IR and NLP efforts leveraging the CORD-19 corpus. Neural Covidex used transformer models fine-tuned on MS MARCO for query-document relevance. SLEDGE employed SciBERT, fine-tuned on MS MARCO, re-ranking BM25 results. CovidQA provided early QA over CORD-19. CAIRE offered multi-document summarization leveraging general and biomedical pretraining. Additional efforts include Named Entity Recognition (COVID-NER with 75 entity types) and network analyses identifying co-occurrence patterns of medical terms. Broader surveys reviewed AI applications against COVID-19 and analyzed TREC-COVID’s structure and challenges, noting pandemic-specific IR differences: rapidly changing focused collections, topic formulation, evaluation with sparse judgments, and quick deployment constraints. Ensemble approaches (e.g., RRF102) combining 100+ IR runs showed benefits of fusing lexical, embedding, and feedback methods. These insights motivate CO-Search’s hybrid design integrating semantic and lexical retrieval, with re-ranking via QA and summarization.
Methodology
System overview: CO-Search consists of (1) indexing, (2) retrieval, and (3) re-ranking.
Indexing: Documents are split into paragraphs and image captions. Paragraphs and captions are embedded with Siamese-BERT (SBERT), and entire documents are vectorized with TF-IDF and BM25 (Lucene/Anserini). SBERT extends BERT with tied encoders to enable efficient semantic similarity via cosine distance. Because the corpus contains relatively few (albeit long) documents, too few to train a deep retriever directly, the authors construct a large synthetic training dataset by forming a bipartite graph between paragraphs and the titles of the citations referenced in those paragraphs. Positive tuples are (paragraph p, citation title c) where c ∈ p; negatives are sampled where c ∉ p, yielding millions of training pairs that teach SBERT to predict whether a title is cited by a paragraph. SBERT training uses cross-entropy loss, Adam (lr=2e-5), linear warm-up over 10% of the data, and MEAN pooling.
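The training-pair construction can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' code: the record layout (a paragraph text plus the titles it cites) and the 1:1 negative-sampling ratio are assumptions.

```python
# Minimal sketch: build (paragraph, citation-title, label) tuples from the
# paragraph-citation bipartite graph. Field names are hypothetical.
import random

def build_training_pairs(paragraphs, neg_per_pos=1, seed=0):
    """paragraphs: list of dicts like {"text": str, "cited_titles": [str, ...]}.
    Returns tuples (paragraph_text, title, label): label 1 if the paragraph
    cites the title (c in p), 0 for a sampled title it does not cite (c not in p)."""
    rng = random.Random(seed)
    all_titles = sorted({t for p in paragraphs for t in p["cited_titles"]})
    pairs = []
    for p in paragraphs:
        cited = set(p["cited_titles"])
        can_sample_negative = len(all_titles) > len(cited)
        for title in cited:
            pairs.append((p["text"], title, 1))        # positive: c cited by p
            if not can_sample_negative:
                continue
            for _ in range(neg_per_pos):                # negative: c not cited by p
                neg = rng.choice(all_titles)
                while neg in cited:
                    neg = rng.choice(all_titles)
                pairs.append((p["text"], neg, 0))
    return pairs
```

The resulting tuples would then feed SBERT fine-tuning with the cross-entropy objective described above.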
Retrieval: At query time, SBERT embeds the query and retrieves approximate kNN paragraphs (Annoy), scoring via cosine similarity. TF-IDF and BM25 independently score documents (BM25 with Anserini, k1=1.2, b=0.75). TF-IDF uses scikit-learn with vocabulary size 13,000, max_df=0.5, min_df=3, L2 normalization. The SBERT and TF-IDF scores are combined linearly:
C(q,d) = ρ · max_{p∈d} cos(e_p, e_q) + (1 − ρ) · TF-IDF(q,d), with ρ = 0.7, where e_p and e_q are the SBERT embeddings of paragraph p (in document d) and query q.
The document ranking R_C induced by C is fused with the BM25 ranking R_B using Reciprocal Rank Fusion (RRF):
RRF(q,d) = 1/(k + R_C(d)) + 1/(k + R_B(d)), with k = 60.
The fused retrieval returns the top 1000 documents per topic, balancing semantic similarity with keyword importance.
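A minimal sketch of this score combination and rank fusion follows, assuming the per-document SBERT cosine scores, TF-IDF scores, and BM25 scores have already been computed; the helper names are illustrative, not the paper's implementation.

```python
# Minimal sketch of C(q,d) and Reciprocal Rank Fusion as described above.
RHO = 0.7   # weight on the semantic (SBERT) component
RRF_K = 60  # reciprocal-rank-fusion constant

def combined_score(sbert_max_cos, tfidf_score, rho=RHO):
    """C(q,d) = rho * max_p cos(e_p, e_q) + (1 - rho) * TF-IDF(q,d)."""
    return rho * sbert_max_cos + (1.0 - rho) * tfidf_score

def rrf_fuse(rank_c, rank_b, k=RRF_K):
    """RRF(q,d) = 1/(k + R_C(d)) + 1/(k + R_B(d)) for 1-based ranks."""
    return 1.0 / (k + rank_c) + 1.0 / (k + rank_b)

def fuse_rankings(combined_scores, bm25_scores, top_k=1000):
    """combined_scores / bm25_scores: dict doc_id -> score.
    Returns the top_k doc_ids by fused RRF score."""
    rank_c = {d: i + 1 for i, (d, _) in enumerate(
        sorted(combined_scores.items(), key=lambda x: -x[1]))}
    rank_b = {d: i + 1 for i, (d, _) in enumerate(
        sorted(bm25_scores.items(), key=lambda x: -x[1]))}
    docs = set(rank_c) | set(rank_b)
    default = len(docs) + 1  # docs missing from one list get a rank past the end
    fused = {d: rrf_fuse(rank_c.get(d, default), rank_b.get(d, default)) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```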
Re-ranking: Two deep-learning modules refine document ranking:
- Question Answering (QA): A sequential paragraph selector (HotpotQA-style multi-hop) filters retrieved paragraphs that together may answer the query. It is pre-trained on a Wikipedia-derived dataset of 113k QA pairs with supporting facts and fine-tuned on PubMedQA for biomedical specificity. A reading-comprehension model then extracts answer text spans. A QA score Q(q,d) = 1.1^W is computed, where W is the number of answer spans extracted for the query that appear in document d.
- Abstractive Summarization (AS): An encoder-decoder model with a BERT encoder and a modified GPT-2 decoder augmented with cross-attention generates a concise (<65-token) summary from the concatenated sentences of the top retrieved paragraphs (up to 512 tokens). Pre-training uses a gap-sentence objective; fine-tuning uses single-document supervision (the first 512 tokens after the abstract as input; the first 300 abstract tokens as target). A summary match score S(q,d) = (1 + max_p cos(e_p, M(q))) / 2 is computed, where M(q) is the embedding of the summary generated for the query and the maximum is over the document's paragraph embeddings e_p.
Final rank score: R(q,d) = S(q,d) · Q(q,d) · RRF(q,d); higher R indicates better relevance (see the sketch below). In the QA module, beam search is diversified to avoid redundant answer paths, and training negatives include paragraphs with high TF-IDF similarity to the query.
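The two re-ranking signals and their combination into the final score can be sketched as below, using the score definitions above; the answer-span count, summary-paragraph cosine, and RRF value are assumed to be precomputed inputs.

```python
# Minimal sketch of the final re-ranking score R(q,d) = S(q,d) * Q(q,d) * RRF(q,d).
# Inputs are assumed precomputed; illustrative only, not the authors' implementation.

def qa_score(num_answer_spans_in_doc):
    """Q(q,d) = 1.1 ** W, where W counts the query's extracted answer spans found in d."""
    return 1.1 ** num_answer_spans_in_doc

def summary_score(max_cos_paragraph_vs_summary):
    """S(q,d) = (1 + max_p cos(e_p, M(q))) / 2, mapping cosine similarity into [0, 1]."""
    return (1.0 + max_cos_paragraph_vs_summary) / 2.0

def final_score(rrf_value, num_answer_spans, max_cos):
    """Higher R(q,d) indicates better relevance."""
    return summary_score(max_cos) * qa_score(num_answer_spans) * rrf_value
```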
Evaluation protocol: The system is evaluated on TREC-COVID across five rounds, each with 30–50 topics (query, question, narrative). Systems submit up to 1000 documents per topic. Expert judgments rate topic-document pairs as irrelevant, partially relevant, or relevant, using depth-N pooling. Metrics: nDCG@10, P@5, P@10, MAP, and Bpref (robust to missing judgments). Two contexts are analyzed: (1) all submissions on all pairs (judged + non-judged) across manual, feedback, and automatic runs; (2) automatic submissions on judged pairs only, re-scoring after removing non-judged pairs. Team rankings account for multiple submissions and ties.
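As a rough illustration of how the judged-pairs metrics behave under sparse judgments, here is a minimal sketch of P@k and Bpref for a single topic, using the standard definitions rather than the paper's trec_eval runs.

```python
# Minimal sketch of P@k and Bpref for one topic. ranked_ids is the system's
# ranked list; judged_relevant / judged_nonrelevant are the pooled judgments.
# Unjudged documents are ignored by Bpref, which is why it tolerates sparse pools.

def precision_at_k(ranked_ids, judged_relevant, k):
    return sum(1 for d in ranked_ids[:k] if d in judged_relevant) / k

def bpref(ranked_ids, judged_relevant, judged_nonrelevant):
    R, N = len(judged_relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N) if min(R, N) > 0 else 1
    nonrel_seen = 0
    total = 0.0
    for d in ranked_ids:
        if d in judged_nonrelevant:
            nonrel_seen += 1
        elif d in judged_relevant:
            total += 1.0 - min(nonrel_seen, denom) / denom
    return total / R
```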
Key Findings
- Overall TREC-COVID performance: In the all-submissions, all-pairs context, CO-Search ranks within the top 21 systems across rounds, improving from Round 1 to Round 2 as the fraction of judged pairs increases, then plateauing or dropping in Rounds 3–5 as feedback and manual systems benefit more from the accumulating judgments. In the automatic-only, judged-pairs context, CO-Search is consistently strong: top-6 across all metrics and rounds, top-4 in all but four metric-round combinations, and top-1 in about half of them.
- Representative Round 1 (automatic, judged pairs): Bpref 0.5176 (rank 1), MAP 0.4870 (rank 1), P@5 0.8267 (rank 1), P@10 0.7933 (rank 1), nDCG@10 0.7233 (rank 1).
- Representative Round 2 (automatic, judged pairs): Bpref 0.5232 (rank 1), MAP 0.5138 (rank 1), P@5 0.8171 (rank 1), P@10 0.7629 (rank 1), nDCG@10 0.7247 (rank 1).
- Representative Round 5 (automatic, judged pairs): Bpref 0.5253 (rank 3), MAP 0.4884 (rank 3), P@5 0.8760 (rank 3), P@10 0.8420 (rank 3), nDCG@10 0.7567 (rank 4).
- Judgment sparsity: Coverage is very low (Round 1: 0.57%; Round 5: 0.24%), motivating the use of Bpref, which shows stability across contexts.
- Ablation study (Round 5, all-submissions, all-pairs):
• Retrieval components individually: SBERT Bpref 0.3594, MAP 0.1128, P@5 0.4640; TF-IDF Bpref 0.2567, MAP 0.0781; BM25 Bpref 0.4581, MAP 0.1313.
• Combined retrieval (SBERT+TF-IDF fused, then RRF with BM25): Bpref 0.5146, MAP 0.2987, P@5 0.8680, P@10 0.8200, nDCG@10 0.7254.
• Re-ranking boosts: Retrieval+QA Bpref 0.5205, MAP 0.3075; Retrieval+AS Bpref 0.5246, MAP 0.3049; Retrieval+QA+AS Bpref 0.5253, MAP 0.3089, P@5 0.8760, P@10 0.8260, nDCG@10 0.7488. QA and AS each add ~1–2% improvements.
- Comparison to top Round 5 automatic systems (judged documents):
• CO-Search: Bpref 0.5253, MAP 0.4884, P@5 0.8760, P@10 0.8420, nDCG@10 0.7567.
• covidex: Bpref 0.5052, MAP 0.4739, P@5 0.9040, P@10 0.8900, nDCG@10 0.8325.
• uogTr: Bpref 0.4933, MAP 0.4580, P@5 0.9040, P@10 0.8720, nDCG@10 0.7918.
• unique_ptr: Bpref 0.5364, MAP 0.5100, P@5 0.8680, P@10 0.8380, nDCG@10 0.7746.
No single system leads all metrics, suggesting complementary strengths and ensemble potential.
- Topic-wise performance (Round 5): Best on diabetes-related complications of COVID-19 and ACE inhibitor risk topics; worst on triage guidelines and causes of death, reflecting the hybrid system’s reliance on both salient keywords and broader semantic cues.
Discussion
The hybrid semantic-keyword architecture effectively addresses complex natural language queries over a focused, rapidly changing COVID-19 literature. Semantic embeddings (SBERT) capture conceptual similarity, while lexical models (TF-IDF, BM25) ensure attention to critical keywords, producing a strong retrieval foundation. Re-ranking with QA and abstractive summarization further tailors rankings to documents that contain answer spans and that best summarize the retrieved set, yielding consistent gains. Evaluation on TREC-COVID across five rounds demonstrates robustness to corpus growth and sparse annotations, with particularly strong results in the automatic-only judged-pairs setting. Topic-level analysis indicates that the system excels when queries contain distinctive biomedical terms (e.g., diabetes, ACE inhibitors) complemented by semantic patterns, while it is more challenged by broad guideline or causality queries with fewer distinctive keywords. The findings support the utility of combined semantic and lexical signals, and the added value of QA and summarization signals for ranking relevance in pandemic-oriented IR.
Conclusion
CO-Search introduces a cascaded semantic search engine tailored to COVID-19 literature that fuses SBERT-based semantic retrieval with TF-IDF and BM25 keyword methods, and enriches rankings via biomedical QA and abstractive summarization. The system achieves strong performance on TREC-COVID across multiple rounds and metrics, with ablations confirming substantial synergy in the retriever and consistent gains from re-ranking. The work contributes an effective training data augmentation strategy via paragraph–citation bipartite tuples and practical fusion of heterogeneous signals. Future work includes: integrating synthetic query generation to further augment training; incorporating medically fine-tuned transformers (e.g., SciBERT, MedMARCO-finetuned models) via RRF; learning attention over semantic and lexical features conditioned on the query; and exploring ensemble combinations with other high-performing systems to capitalize on complementary strengths.
Limitations
Evaluation relies on sparse and pooled relevance judgments, leading to potential biases and underestimation of systems that retrieve relevant but non-judged documents. Coverage is very low (0.57% in Round 1; 0.24% in Round 5), necessitating metrics like Bpref. The re-ranking modules provide modest gains (approximately 1–2%), indicating room for stronger answer and summary signals. The system depends on the quality and currency of the CORD-19 corpus and may be affected by misinformation or non–peer-reviewed sources in the dataset. Performance may degrade on queries lacking distinctive biomedical keywords or with broad guideline/causality phrasing. Generalizability beyond COVID-19 literature is untested within this study.