BEIR TREC-COVID Benchmark Results
TREC-COVID is a biomedical retrieval benchmark built from the CORD-19 corpus. It is part of the BEIR benchmark [3], which tests zero-shot generalization of retrieval systems across diverse domains.
| Property | Value |
|---|---|
| Corpus size | 171,332 scientific articles |
| Number of queries | 50 COVID-19 research topics |
| Relevant docs per query | 100–500+ (judged by medical experts) |
| Domain | Biomedical — epidemiology, treatment, transmission |
The high number of relevant documents per query is an important characteristic of this dataset. It makes recall-based metrics (Recall@10, MAP@10) inherently low for any system returning only 10–100 results, regardless of result quality. This is discussed further in Section 5.
Three configurations were tested, each building on the previous:
| Component | Config A | Config B | Config C |
|---|---|---|---|
| Index | Apache Solr 9.8.1 with SolrCloud (ZooKeeper) | (same) | (same) |
| Embedding model | GTE-Large [1] | GTE-Large [1] | Qwen3-Embedding-0.6B [2] |
| Search strategy | Hybrid: KNN vector search + BM25 text search | (same) | (same) |
| Reranker | — | Qwen3-Reranker-0.6B [2] | Qwen3-Reranker-0.6B [2] |
| Reranking candidates | — | Top 30 | Top 30 |
| Synonym expansion | — | — | WordNet (English) [8] |
Configuration C replaces GTE-Large with Qwen3-Embedding-0.6B, which shares the same model family as the reranker and is fine-tuned for retrieval with instruction prefixes. The key addition is synonym expansion using WordNet — a comprehensive English lexical database developed at Princeton University [8]. At query time, each query term is expanded with its WordNet synonyms before being passed to both the BM25 and KNN stages, broadening lexical coverage without modifying the index.
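To make the query-time flow concrete, here is a minimal sketch of the expansion step feeding a hybrid Solr request. It assumes NLTK's WordNet interface, a Solr collection named `cord19` with a `DenseVectorField` called `vector` and a text field called `text`, and a placeholder `embed()` helper standing in for the embedding model; none of these names come from the SANDI codebase.

```python
# Sketch only: query-time WordNet expansion + hybrid KNN/BM25 request.
# The collection name, field names, and embed() are assumptions.
import requests
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def embed(text: str) -> list[float]:
    """Placeholder for the embedding model (GTE-Large or Qwen3-Emb-0.6B)."""
    raise NotImplementedError  # wire up the actual model here

def expand_term(term: str, max_synonyms: int = 3) -> list[str]:
    """Return the original term plus up to max_synonyms WordNet synonyms."""
    extras: list[str] = []
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ").lower()
            if name != term and name not in extras:
                extras.append(name)
    return [term] + extras[:max_synonyms]

def expand_query(query: str) -> str:
    """'heart disease' -> '("heart" OR ...) ("disease" OR ...)'."""
    groups = []
    for term in query.lower().split():
        alts = " OR ".join(f'"{s}"' for s in expand_term(term))
        groups.append(f"({alts})")
    return " ".join(groups)

query = "coronavirus heart disease"
expanded = expand_query(query)
vector = embed(query)

# One way to express the hybrid query in Solr 9: the {!knn} parser for the
# dense leg, edismax over the expanded terms for the lexical leg, OR-ed
# together via nested queries. The index itself is untouched.
params = {
    "q": '_query_:"{!knn f=vector topK=30 v=$vec}" OR '
         '_query_:"{!edismax qf=text v=$qq}"',
    "vec": str(vector),  # e.g. "[0.12, -0.05, ...]"
    "qq": expanded,
    "rows": 30,
}
resp = requests.get("http://localhost:8983/solr/cord19/select", params=params)
```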
| Metric | Config A<br>(GTE-Large, no reranker) | Config B<br>(GTE-Large + Qwen3 reranker) | Config C<br>(Qwen3-Emb + reranker + synonyms) | C vs B |
|---|---|---|---|---|
| NDCG@5 | 0.8544 | 0.8621 | 0.8866 | +0.0245 |
| NDCG@10 | 0.8087 | 0.8411 | 0.8805 | +0.0394 |
| Precision@10 | 0.8420 | 0.8800 | 0.9300 | +0.0500 |
| MRR@10 | 0.9900 | 0.9800 | 0.9533 | −0.0267 |
| Recall@10 | 0.0219 | 0.0226 | 0.0242 | +0.0016 |
| Recall@100 | 0.1420 | 0.1420 | 0.1649 | +0.0229 |
| MAP@10 | 0.0206 | 0.0208 | 0.0233 | +0.0025 |
For context, the table below places these results among published TREC-COVID baselines:

| System | NDCG@10 | Notes |
|---|---|---|
| DPR | 0.3326 | Dense Passage Retrieval, zero-shot |
| TAS-B | 0.4817 | Topic-Aware Sampling BERT |
| ANCE | 0.6543 | Approximate Nearest Neighbor Negative Contrastive Estimation |
| BM25 | 0.6559 | BEIR paper baseline [3] |
| SPLADE-v2 | 0.7057 | Sparse learned representations [5] |
| BGE-large (retrieval only) | ~0.770 | FlagEmbedding, no reranker [7] |
| ColBERT v2 | 0.7854 | Late interaction model [4] |
| MonoT5 reranker (top-100) | ~0.807 | Sequence-to-sequence cross-encoder |
| SANDI Solr — Config A (GTE-Large, no reranker) | 0.8087 | This work — hybrid KNN + BM25 only |
| SANDI Solr — Config B (GTE-Large + Qwen3-Rer-0.6B) | 0.8411 | This work — 30 rerank candidates |
| RankGPT (GPT-4 reranker) | 0.8551 | LLM-based listwise reranking [6] |
| SANDI Solr — Config C (Qwen3-Emb-0.6B + Qwen3-Rer-0.6B + WordNet) | 0.8805 | This work — 30 rerank candidates with query expansion |
Configs A and B operate in a zero-shot setting (the original query is submitted without modification) and can be directly compared to published baselines. Config B (NDCG@10 = 0.8411), using only a 0.6B reranker, places above ColBERT v2, MonoT5, and BGE-large, and comes within 1.4 points of RankGPT (GPT-4, 0.8551) at a fraction of the compute cost.
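As a sketch of the rerank stage in Configs B and C: the top 30 hybrid candidates are rescored by the cross-encoder and re-sorted. The generic `CrossEncoder` interface from sentence-transformers is used here purely for illustration; the real Qwen3-Reranker-0.6B ships with its own prompt format, so the load call below is an assumption, not the documented usage.

```python
# Illustrative second-stage rerank over the top-30 hybrid candidates.
# Loading Qwen3-Reranker-0.6B through the generic CrossEncoder class is
# an assumption; the model's documented usage may differ.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("Qwen/Qwen3-Reranker-0.6B")

def rerank(query: str, candidates: list[dict], top_k: int = 30) -> list[dict]:
    """Score (query, passage) pairs and re-sort the candidate pool."""
    pool = candidates[:top_k]
    scores = reranker.predict([(query, doc["text"]) for doc in pool])
    ranked = sorted(zip(scores, pool), key=lambda pair: -pair[0])
    return [doc for _, doc in ranked]
```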
Config C adds WordNet synonym expansion at query time, which constitutes additional lexical knowledge beyond what the original query provides. This makes it not directly comparable to zero-shot systems: the score improvement reflects the quality of the synonym expansion as much as the retrieval model itself.
Normalized Discounted Cumulative Gain accounts for graded relevance and rank position. On NDCG@10, Config C improves by 3.9 points over Config B and by 7.2 points over Config A. The B-to-C gain exceeds the A-to-B gain (+3.2 points), showing that synonym expansion and Qwen3 embeddings together contribute more than reranking alone. Direct comparison with zero-shot baselines is limited because Config C uses WordNet query expansion, an additional retrieval signal not present in the published systems it is compared against.
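For reference, a minimal sketch of the NDCG@k computation behind these numbers, using the standard definition with linear gain as in trec_eval-style tooling (BEIR evaluations typically go through pytrec_eval, which implements the same formula):

```python
import math

def dcg(rels: list[float], k: int) -> float:
    """Discounted cumulative gain: rank i (1-based) is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels: list[float], judged_rels: list[float], k: int) -> float:
    """ranked_rels: graded labels of the returned docs, in ranked order.
    judged_rels: labels of all judged docs, used for the ideal ranking."""
    ideal = dcg(sorted(judged_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0

# TREC-COVID uses graded judgments: 2 = highly relevant, 1 = partial, 0 = not.
print(ndcg_at_k([2, 2, 1, 0, 2], [2, 2, 2, 2, 1, 1, 0, 0], k=5))  # ~0.82
```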
The NDCG@5 improvement (+0.0245 from B to C) is consistent with the NDCG@10 gain, confirming that Qwen3 embeddings combined with synonym expansion improve ranking quality across the top result positions, not just in the tail.
Config C reaches Precision@10 = 0.9300, meaning on average more than 9 out of 10 returned results are relevant. This is an exceptional result for a corpus of 171,332 documents. The +5.0 point gain over Config B is the largest single-metric improvement introduced by Config C and is the clearest signal of the benefit of synonym expansion: BM25 now matches relevant documents that would previously have been missed due to surface-form vocabulary differences.
Mean Reciprocal Rank measures the position of the first relevant result. Config C shows a decrease of −0.0267 from Config B. This is not a quality regression but a side effect of synonym expansion: the expanded query matches a broader set of documents, and an expansion-matched but less relevant document occasionally displaces the document that would previously have ranked first. With MRR@10 still at 0.9533, the first relevant result appears at rank 1 or 2 for the vast majority of queries, a result that remains near-perfect in absolute terms.
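A tiny worked illustration of why MRR moves this way; the query counts below are invented for the example, only the mechanism is the point:

```python
# MRR@10 = mean over queries of 1 / rank of the first relevant result.
def mrr_at_k(first_rel_ranks: list[int | None], k: int = 10) -> float:
    """first_rel_ranks: 1-based rank of the first relevant doc per query,
    or None if no relevant doc appears in the top k."""
    return sum(1.0 / r for r in first_rel_ranks
               if r is not None and r <= k) / len(first_rel_ranks)

# Hypothetical 50-query split: five queries slipping from rank 1 to rank 2
# already cost five hundredths of MRR, while every position-aggregating
# metric can still improve.
print(mrr_at_k([1] * 45 + [2] * 5))  # 0.95
```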
Recall@100 increases by +0.0229, a substantial improvement. Configs A and B score an identical 0.1420 by construction: they share the same first-stage retrieval, and reranking only permutes the top 30 candidates, so the membership of the top 100 (and hence Recall@100) cannot change. Config C's improvement comes from two sources: Qwen3-Embedding-0.6B retrieves a different set of relevant documents than GTE-Large, and synonym expansion widens BM25 coverage to documents that share no exact terms with the original query. Together they raise the effective candidate ceiling before reranking.
Recall@10 and MAP@10 remain low, as structurally expected for TREC-COVID. Recall@100 = 0.1649 implies roughly 600 relevant documents per query on average (100 / 0.1649 ≈ 606, an upper-bound estimate that assumes every top-100 result is relevant). Retrieving 10 of ~600 relevant documents gives a theoretical recall ceiling near 1.6%:

Recall@10 ≈ 10 / avg_relevant_per_query ≈ 10 / 606 ≈ 1.6%

The observed Recall@10 of 2.42% exceeds even this estimate, which is possible because the ~606 denominator is overstated (not every top-100 result is relevant, and per-query ratios do not average cleanly); either way, it confirms that the retrieved top-10 is disproportionately relevant. Low recall and MAP are structural properties of the dataset, not of the retrieval system.
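The back-of-envelope arithmetic, spelled out:

```python
# Structural Recall@10 ceiling implied by Recall@100 = 0.1649.
recall_at_100 = 0.1649
est_relevant = 100 / recall_at_100  # ~606 relevant docs per query,
print(round(est_relevant))          # assuming all top-100 are relevant

ceiling = 10 / est_relevant         # even a perfect top-10 could only
print(f"{ceiling:.1%}")             # reach ~1.6% recall
```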
Config C adds English synonym expansion using WordNet [8]. At query time, query terms are expanded with their WordNet synonym sets before being submitted to the retrieval pipeline. This is particularly effective on TREC-COVID, where queries use general language (e.g., "heart disease") while documents use clinical terminology (e.g., "cardiac disorder", "myocardial condition"). Synonym expansion bridges this vocabulary gap directly in the BM25 term-matching stage, recovering relevant documents that semantic embeddings alone do not always surface. The improvement in Precision@10 (+5.0 points) and Recall@100 (+2.3 points) is largely attributable to this effect.
Config C replaces GTE-Large with Qwen3-Embedding-0.6B [2], a retrieval-focused embedding model from the same family as the Qwen3-Reranker-0.6B used in the second stage. GTE-Large showed better English embedding quality in isolation; however, pairing embedding and reranker models from the same multilingual training lineage tends to improve pipeline coherence: the reranker is better calibrated to the score distribution produced by the embedding model. The improvement in Recall@100 (from 0.1420 to 0.1649) is partly attributable to Qwen3-Embedding-0.6B retrieving a different, more complementary set of candidates than GTE-Large.
Config C reaches Precision@10 = 0.9300 — on average more than 9 of the top-10 results are relevant. For search interfaces and RAG pipelines where users see only the top few results, this is the most practically relevant metric on this dataset. For use cases requiring high recall — systematic reviews, literature surveys, legal discovery — Recall@100 = 0.1649 is the appropriate target, and increasing the first-stage candidate pool beyond 30 would further improve it at the cost of higher reranking latency.
The drop in MRR@10 from 0.9800 (Config B) to 0.9533 (Config C) reflects the broader match set introduced by synonym expansion. With more candidate documents in play, the reranker occasionally places an expansion-matched but less relevant document above the one that previously ranked first. MRR is sensitive only to the rank of the first relevant result; every metric that aggregates across positions (NDCG, Precision, Recall) improves. In practical terms, MRR@10 = 0.9533 still represents a near-perfect first-result experience.
GTE-Large and Qwen3-Embedding-0.6B both handle semantic similarity across biomedical synonyms and paraphrases. BM25 handles exact keyword matches for drug names, gene identifiers, and technical terms. Synonym expansion further amplifies the BM25 component's ability to match on semantically equivalent surface forms. The three mechanisms are complementary and together cover the full vocabulary range of TREC-COVID queries.
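The section does not specify how Solr fuses the two result lists; reciprocal rank fusion is one common recipe, sketched below purely to illustrate how complementary rankings can merge. Nothing here is claimed to match the SANDI implementation, and k = 60 follows the original RRF paper:

```python
# Illustrative reciprocal rank fusion (RRF) of two ranked doc-id lists.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) across the input lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # lexical leg (expanded query)
knn_top = ["d1", "d9", "d3"]     # dense leg (query embedding)
print(rrf([bm25_top, knn_top]))  # ['d1', 'd3', 'd9', 'd7']
```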
SANDI Solr was tested in three configurations on BEIR TREC-COVID. In the zero-shot setting, Config B (GTE-Large + Qwen3-Reranker-0.6B) scores NDCG@10 = 0.8411 and MRR@10 = 0.9800, placing above ColBERT v2, MonoT5, and BGE-large, and coming within 1.4 points of GPT-4-based reranking while using only a 0.6B cross-encoder, an impressive result in its own right. Config C adds NLP-driven query processing: the search query is analysed linguistically, keywords and entities are extracted, and each term is expanded with synonym sets sourced from WordNet before retrieval. This single NLP stage substantially lifts every ranking metric except MRR, reaching NDCG@10 = 0.8805, above the far more expensive GPT-4-based listwise reranking. The result is a powerful illustration of the enduring value of NLP: combining well-established lexical resources with modern small dense models produces retrieval quality that neither approach achieves alone. Config C results are not directly comparable to zero-shot baselines, because query expansion provides additional lexical signal beyond the original query, but they make a compelling case for NLP-enriched search as a practical and highly effective deployment strategy.