BEIR TREC-COVID Benchmark Results
TREC-COVID is a biomedical retrieval benchmark built from the CORD-19 corpus. It is part of the BEIR benchmark [3], which tests zero-shot generalization of retrieval systems across diverse domains.
| Property | Value |
|---|---|
| Corpus size | 171,332 scientific articles |
| Number of queries | 50 COVID-19 research topics |
| Relevant docs per query | 100–500+ (judged by medical experts) |
| Domain | Biomedical — epidemiology, treatment, transmission |
The high number of relevant documents per query is an important characteristic of this dataset. It makes recall-based metrics (Recall@10, MAP@10) inherently low for any system returning only 10–100 results, regardless of result quality. This is discussed further in Section 5.
Three configurations were tested, each building on the previous:
| Component | Config A | Config B | Config C |
|---|---|---|---|
| Index | Apache Solr 9.8.1 with SolrCloud (ZooKeeper) | Apache Solr 9.8.1 with SolrCloud (ZooKeeper) | Apache Solr 9.8.1 with SolrCloud (ZooKeeper) |
| Embedding model | GTE-Large [1] | GTE-Large [1] | Qwen3-Embedding-0.6B [2] |
| Search strategy | Hybrid: KNN vector search + BM25 text search | Hybrid: KNN vector search + BM25 text search | Hybrid: KNN vector search + BM25 text search |
| NLP entity extraction | Yes | Yes | Yes |
| Reranker | — | Qwen3-Reranker-0.6B [2] | Qwen3-Reranker-0.6B [2] |
| Reranking candidates | — | Top 30 | Top 30 |
| WordNet synonym expansion | — | — | WordNet (English) [8] |
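As a sketch of how the hybrid search strategy in the table above can be expressed, the snippet below builds a Solr JSON request that combines an edismax (BM25) clause with a `{!knn}` dense-vector clause via the `bool` query parser. The field names (`text`, `embedding`) and the parameter layout are illustrative assumptions, not SANDI's actual schema:

```python
import json

def build_hybrid_query(query_text, query_vector, top_k=30):
    """Build a Solr JSON request combining BM25 (edismax) with KNN vector
    search. Field names 'text' and 'embedding' are illustrative only."""
    return {
        # bool parser: a document may match either clause; scores are combined
        "query": "{!bool should=$lexical should=$semantic}",
        "params": {
            "lexical": "{!edismax qf=text}" + query_text,
            "semantic": "{!knn f=embedding topK=%d}%s" % (top_k, json.dumps(query_vector)),
        },
        "limit": top_k,
        "fields": "id,score",
    }

payload = build_hybrid_query("coronavirus transmission", [0.12, -0.05, 0.33])
print(payload["params"]["semantic"])
```

Posting the payload to a live SolrCloud collection (e.g. `requests.post(solr_select_url, json=payload)`) would return the combined result list; how BM25 and KNN scores are actually fused in SANDI depends on its query-parser configuration, which the table does not detail.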
NLP entity extraction is applied in all three configurations: at query time, SpaCy analyzes the query to identify and extract named entities and keywords, which are used to focus and refine the search terms passed downstream.
Configuration C additionally replaces GTE-Large with Qwen3-Embedding-0.6B, which shares the same model family as the reranker and is fine-tuned for retrieval with instruction prefixes. The key differentiator of Config C over A and B is WordNet synonym expansion — a comprehensive English lexical database developed at Princeton University [8]. After NLP entity extraction, each extracted term is further expanded with its WordNet synonyms before being passed to both the BM25 and KNN stages, broadening lexical coverage without modifying the index.
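As an illustration of the expansion step, the sketch below substitutes a tiny hand-made synonym map for WordNet; the real pipeline would look up WordNet synsets (e.g. via NLTK) for each term SpaCy extracts. The map entries are hypothetical:

```python
# Toy stand-in for WordNet: maps a term to a few synonyms.
# These entries are illustrative, not actual WordNet output.
TOY_SYNONYMS = {
    "heart": ["cardiac"],
    "disease": ["disorder", "illness"],
}

def expand_terms(terms):
    """Return each extracted term plus its synonyms, deduplicated,
    preserving first-seen order, ready for the BM25 and KNN stages."""
    expanded = []
    for term in terms:
        for candidate in [term] + TOY_SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_terms(["heart", "disease"]))
# -> ['heart', 'cardiac', 'disease', 'disorder', 'illness']
```

Because expansion happens purely at query time, the index itself is untouched, which is what allows Config C to reuse the same Solr collection as Configs A and B.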
| Metric | Config A (GTE-Large, no reranker) | Config B (GTE-Large + Qwen3 reranker) | Config C (Qwen3-Emb + reranker + synonyms) | C vs B |
|---|---|---|---|---|
| NDCG@5 | 0.8544 | 0.8621 | 0.9070 | +0.0449 |
| NDCG@10 | 0.8087 | 0.8411 | 0.8828 | +0.0417 |
| Precision@10 | 0.8420 | 0.8800 | 0.9220 | +0.0420 |
| MRR@10 | 0.9900 | 0.9800 | 0.9800 | 0.0000 |
| Recall@10 | 0.0219 | 0.0226 | 0.0236 | +0.0010 |
| Recall@100 | 0.1420 | 0.1420 | 0.1649 | +0.0229 |
| MAP@10 | 0.0206 | 0.0208 | 0.0227 | +0.0019 |
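For reference, NDCG@k in one common formulation (linear gain, as in trec_eval-style evaluators; some tools use exponential gain 2^rel - 1 instead) can be computed as follows, with toy graded relevance labels:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels,
    with the standard log2(rank + 1) position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal DCG (labels sorted descending)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy ranking: 2 = highly relevant, 1 = partially relevant, 0 = not relevant.
print(round(ndcg_at_k([2, 0, 1], 3), 4))
```

A perfectly ordered ranking scores 1.0; swapping a relevant and a non-relevant result near the top costs more than the same swap further down, which is why NDCG@5 and NDCG@10 are the headline metrics here.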
For context, the table below places these results against published zero-shot NDCG@10 scores on TREC-COVID:
| System | NDCG@10 | Notes |
|---|---|---|
| DPR | 0.3326 | Dense Passage Retrieval, zero-shot |
| TAS-B | 0.4817 | Topic-Aware Sampling BERT |
| ANCE | 0.6543 | Approximate Nearest Neighbor Negative Contrastive Estimation |
| BM25 | 0.6559 | BEIR paper baseline [3] |
| SPLADE-v2 | 0.7057 | Sparse learned representations [5] |
| BGE-large (retrieval only) | ~0.770 | FlagEmbedding, no reranker [7] |
| ColBERT v2 | 0.7854 | Late interaction model [4] |
| MonoT5 reranker (top-100) | ~0.807 | Sequence-to-sequence cross-encoder |
| SANDI Solr — Config A (GTE-Large, no reranker) | 0.8087 | This work — hybrid KNN + BM25 only |
| SANDI Solr — Config B (GTE-Large + Qwen3-Rer-0.6B) | 0.8411 | This work — 30 rerank candidates |
| RankGPT (GPT-4 reranker) | 0.8551 | LLM-based listwise reranking [6] |
| SANDI Solr — Config C (Qwen3-Emb-0.6B + Qwen3-Rer-0.6B + WordNet) | 0.8828 | This work — 30 rerank candidates with query expansion |
Configs A and B operate in a zero-shot setting (the query is processed with NLP entity extraction, but no external knowledge is added) and can be directly compared to the published baselines. Config B (NDCG@10 = 0.8411), using only a 0.6B reranker, places above ColBERT v2, MonoT5, and BGE-large, and comes within 1.4 points of RankGPT (GPT-4, 0.8551) at a fraction of the compute cost. Config C adds WordNet synonym expansion at query time, which constitutes additional lexical knowledge beyond what the original query provides.
Normalized Discounted Cumulative Gain accounts for graded relevance and rank position. Config C improves +4.2 points over Config B and +7.4 points over Config A. The gain from B to C is larger than the gain from A to B (+3.2 points), showing that synonym expansion and Qwen3 embeddings together contribute more than reranking alone. Direct comparison with zero-shot baselines is limited because Config C uses WordNet query expansion, which is an additional retrieval signal not present in the published systems it is compared against.
The NDCG@5 improvement (+0.0449 from B to C) is consistent with the NDCG@10 gain, confirming that Qwen3 embeddings combined with synonym expansion improve ranking quality across the top result positions, not just in the tail.
Config C reaches Precision@10 = 0.9220, meaning on average more than 9 out of 10 returned results are relevant. This is an exceptional result for a corpus of 171,332 documents. The +4.2 point gain over Config B is a strong signal of the benefit of synonym expansion: BM25 now matches relevant documents that would previously have been missed due to surface-form vocabulary differences.
Mean Reciprocal Rank measures the position of the first relevant result. Config C maintains MRR@10 = 0.9800, exactly matching Config B. Despite synonym expansion broadening the candidate set, the reranker consistently places a highly relevant document at rank 1 in the vast majority of queries — a near-perfect first-result experience that is fully preserved in Config C.
Recall@100 increases by +0.0229, a substantial improvement. Configs A and B both plateau at 0.1420 because the first-stage retrieval using GTE-Large embeddings reaches a coverage ceiling — the 30-candidate pool drawn from KNN + BM25 already captures as many relevant documents as that embedding can identify. Config C's improvement comes from two sources: Qwen3-Embedding-0.6B retrieves a different set of relevant documents than GTE-Large, and synonym expansion widens BM25 coverage to documents that share no exact query terms with the original query. Together they raise the effective candidate ceiling before reranking.
Both metrics remain low, as is structurally expected for TREC-COVID. Recall@100 = 0.1649 implies roughly 606 relevant documents per query on average (100 / 0.1649 ≈ 606); this is an upper-bound estimate, since it assumes all 100 retrieved documents are relevant. Retrieving 10 out of ~600 relevant documents gives a theoretical Recall@10 ceiling near 1.6%:
Recall@10 ≈ 10 / avg_relevant_per_query ≈ 10 / 606 ≈ 1.6%
The observed 2.36% exceeds this estimate, which is consistent with the ~606 figure being an upper bound (queries with fewer relevant documents have a higher per-query ceiling) and confirms that the retrieved top-10 is disproportionately relevant. Low recall and MAP are structural properties of the dataset, not of the retrieval system.
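The arithmetic behind the ceiling argument can be checked in a few lines (the relevant-document count is the upper-bound estimate noted above):

```python
# Back-of-envelope check of the Recall@10 ceiling on TREC-COVID.
recall_at_100 = 0.1649
est_relevant = 100 / recall_at_100   # upper-bound estimate: ~606 relevant docs/query
ceiling_at_10 = 10 / est_relevant    # best possible Recall@10 under that estimate
observed = 0.0236                    # Config C's measured Recall@10

print(round(est_relevant), round(ceiling_at_10, 4), observed > ceiling_at_10)
```

Note that `10 / (100 / r)` simplifies to `r / 10`, so the Recall@10 ceiling under this estimate is always one tenth of the measured Recall@100.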
Config C adds English synonym expansion using WordNet [8], a large lexical database developed at Princeton University. At query time, query terms are expanded with their WordNet synonym sets before being submitted to the retrieval pipeline. This is particularly effective on TREC-COVID, where queries use general language (e.g., "heart disease") while documents use clinical terminology (e.g., "cardiac disorder", "myocardial condition"). Synonym expansion bridges this vocabulary gap directly in the BM25 term-matching stage, recovering relevant documents that semantic embeddings alone do not always surface. The improvements in Precision@10 (+4.2 points) and Recall@100 (+2.3 points) are largely attributable to this effect.
Config C replaces GTE-Large with Qwen3-Embedding-0.6B [2], a retrieval-focused embedding model from the same family as the Qwen3-Reranker-0.6B used in the second stage. While GTE-Large shows strong English embedding quality, using matched embedding and reranker models from the same training lineage tends to improve pipeline coherence — the reranker is better calibrated to the score distribution produced by the embedding model. The improvement in Recall@100 (from 0.1420 to 0.1649) is partly attributable to Qwen3-Embedding-0.6B retrieving a different and more complementary set of candidates compared to GTE-Large.
Config C reaches Precision@10 = 0.9220 — on average more than 9 of the top-10 results are relevant. For search interfaces and RAG pipelines where users see only the top few results, this is the most practically relevant metric on this dataset. For use cases requiring high recall — systematic reviews, literature surveys, legal discovery — Recall@100 = 0.1649 is the appropriate target, and increasing the first-stage candidate pool beyond 30 would further improve it at the cost of higher reranking latency.
Config C maintains MRR@10 = 0.9800, identical to Config B. Despite synonym expansion broadening the candidate set, the Qwen3-Reranker consistently places a highly relevant document at rank 1. This confirms that the reranker is well calibrated to the Qwen3-Embedding score distribution and absorbs the broader candidate set without any degradation to first-result quality.
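The retrieve-then-rerank flow can be sketched as below. Both scoring functions are crude stand-ins (the real pipeline uses hybrid BM25 + embedding retrieval and the Qwen3-Reranker-0.6B cross-encoder); only the two-stage structure is the point:

```python
def first_stage_score(query, doc):
    # Stand-in for hybrid BM25 + embedding similarity: term overlap.
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    # Stand-in for a cross-encoder: a slightly different, "finer" signal.
    return first_stage_score(query, doc) + 0.5 * doc.count(query.split()[0])

def retrieve_then_rerank(query, corpus, candidates=30, k=10):
    """Score the whole corpus cheaply, keep the top `candidates`,
    then rescore only that pool with the expensive model."""
    pool = sorted(corpus, key=lambda d: first_stage_score(query, d),
                  reverse=True)[:candidates]
    return sorted(pool, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k]

docs = ["covid transmission in schools",
        "covid covid vaccine trial",
        "weather report"]
print(retrieve_then_rerank("covid vaccine", docs, candidates=3, k=2))
```

Because the cross-encoder only ever sees the `candidates`-sized pool (30 here, as in Configs B and C), reranking cost is independent of corpus size, which is what keeps the 0.6B reranker cheap even over 171K documents.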
GTE-Large and Qwen3-Embedding-0.6B both handle semantic similarity across biomedical synonyms and paraphrases. BM25 handles exact keyword matches for drug names, gene identifiers, and technical terms. Synonym expansion further amplifies the BM25 component's ability to match on semantically equivalent surface forms. The three mechanisms are complementary and together cover the full vocabulary range of TREC-COVID queries.
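One common way to merge complementary lexical and semantic rankings is reciprocal rank fusion (RRF). The source does not specify SANDI's exact fusion method, so the sketch below is purely illustrative of how two ranked lists can be combined without comparable raw scores:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank).
    Documents ranked well in either list rise; k=60 is the usual default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]  # exact matches: drug names, gene IDs
knn_ranking = ["d1", "d5", "d3"]   # semantic matches: synonyms, paraphrase
print(rrf_fuse([bm25_ranking, knn_ranking]))
```

RRF only uses ranks, never raw scores, so it sidesteps the problem that BM25 and cosine-similarity scores live on incompatible scales.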
To reproduce the evaluation, install the required libraries (`pip install ranx requests`), then run the Python evaluation script available at:
https://softcorporation.com/products/sandi/evaluate-beir-trec-covid.txt
SANDI Solr was tested in three configurations on BEIR TREC-COVID. All three include NLP entity extraction: at query time, SpaCy analyzes the query to identify named entities and keywords, which are used to focus the search.

In the zero-shot setting, Config B (GTE-Large + Qwen3-Reranker-0.6B) scores NDCG@10 = 0.8411 and MRR@10 = 0.9800, placing above ColBERT v2, MonoT5, and BGE-large and within 1.4 points of GPT-4-based reranking while using only a 0.6B cross-encoder, an impressive result in its own right.

Config C further adds WordNet synonym expansion, in which each extracted entity is expanded with its synonym sets before retrieval. This additional stage lifts all ranking metrics substantially, reaching NDCG@10 = 0.8828, surpassing GPT-4 listwise reranking, a considerably more expensive approach. The result illustrates the enduring value of classical NLP: combining well-established lexical resources with modern small dense models produces retrieval quality that neither achieves alone. Config C results are not directly comparable to zero-shot baselines, because query expansion provides additional lexical signal beyond the original query, but they make a compelling case for NLP-enriched search as a practical and highly effective deployment strategy.