The Promise and Limits of Pure Vector Search
Semantic search powered by text embeddings has revolutionized information retrieval. By converting queries and documents into high-dimensional vectors, embedding models capture semantic meaning in ways that traditional keyword search never could. A query for "car" can match documents about "automobiles" and "vehicles" without explicit synonyms. Contextual nuances are preserved, and conceptual similarity drives relevance.
Yet as document collections scale from thousands to millions of items, a troubling pattern emerges: pure embedding-based search begins to fail in subtle but critical ways. The very qualities that make embeddings powerful—their ability to find semantic similarity across broad concepts—become liabilities when precision matters.
The Fundamental Problem: Everything is "Somewhat Similar"
Vector embeddings operate in a continuous space where every document has some degree of similarity to every query. In small collections, this works beautifully. The most relevant documents cluster tightly around the query vector, and irrelevant ones sit far away.
But in large document sets, the distribution changes dramatically:
1. The Curse of Dimensionality
As collections grow, distances in high-dimensional space concentrate: the gap between a point's nearest and farthest neighbors shrinks relative to the distances themselves. In a million-document collection embedded in 768 dimensions, the similarity scores of the 100th and the 10,000th most similar documents may be nearly indistinguishable. The embedding space becomes crowded, and meaningful distinctions blur.
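This concentration effect is easy to see in simulation. The sketch below is a minimal illustration using NumPy, with arbitrary corpus size and dimensionality (not measurements from any real index): it draws random unit vectors and compares the cosine similarity of a query to its 100th and 10,000th nearest neighbors.

```python
import numpy as np

# Illustrative simulation: random unit vectors standing in for document embeddings.
rng = np.random.default_rng(42)
dim, n_docs = 768, 100_000  # arbitrary choices for illustration

docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize documents

query = rng.normal(size=dim)
query /= np.linalg.norm(query)

sims = np.sort(docs @ query)[::-1]   # cosine similarities, best match first
print(f"rank 1:     {sims[0]:.4f}")
print(f"rank 100:   {sims[99]:.4f}")
print(f"rank 10000: {sims[9999]:.4f}")
# With random vectors, all similarities concentrate near zero and the absolute
# gap between rank 100 and rank 10,000 is small. Real embeddings carry more
# structure, but the same crowding appears as the collection grows.
```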
2. Semantic Drift and Topic Bleed
Embeddings excel at capturing broad semantic categories but struggle with specificity. A search for "Python programming exceptions" might return documents about "error handling in Java," "debugging techniques," and "software testing best practices"—all semantically related but not what the user wanted. The model cannot distinguish between "related topics" and "exact matches."
3. The Named Entity Problem
Embeddings are notoriously poor at handling proper nouns, product names, technical identifiers, and domain-specific terminology. A query for "Apache Kafka" might match documents about "Apache Spark," "message queues," or "stream processing frameworks." The word "Kafka" gets diluted into its semantic neighborhood, losing its identity as a specific technology.
4. Multi-Field Embedding Collapse
Even when documents are indexed with embeddings from multiple fields (title, description, content), the fundamental issue persists. Averaging or concatenating field embeddings creates a "semantic soup" where precise terms from high-priority fields get averaged away by bulk content. A document with "machine learning" in the title and 10,000 words about database administration might rank equally with a focused machine learning tutorial.
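A toy calculation makes the dilution concrete. The following sketch is a contrived NumPy example rather than output from a real embedding model: the title vector is constructed to match the query closely, but averaging it with many unrelated content-chunk vectors drags the combined document vector away from the query.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768

def unit(v):
    return v / np.linalg.norm(v)

query = unit(rng.normal(size=dim))

# Title embedding: deliberately constructed to align with the query.
title = unit(query + 0.3 * unit(rng.normal(size=dim)))

# Fifty content-chunk embeddings about unrelated topics: random directions.
content = rng.normal(size=(50, dim))
content /= np.linalg.norm(content, axis=1, keepdims=True)

# "Semantic soup": collapse title and content into a single averaged vector.
doc_vector = unit(np.vstack([title[None, :], content]).mean(axis=0))

print(f"query . title      = {query @ title:.3f}")       # high
print(f"query . doc_vector = {query @ doc_vector:.3f}")   # much lower
# The precise title signal is averaged away by bulk content, which is why
# hybrid systems keep per-field vectors instead of one collapsed embedding.
```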
Real-World Failure Modes
Consider these common scenarios where pure vector search breaks down:
Product Search in E-Commerce
A user searches for "iPhone 15 Pro Max 256GB". Pure embedding search might return:
- iPhone 14 Pro (semantically similar, wrong model)
- iPhone 15 (missing storage variant)
- Samsung Galaxy Ultra (conceptually similar high-end phone)
- iPhone accessories (contextually related)
The specific model number and storage capacity—critical to the user—are treated as semantic noise.
Legal and Compliance Search
Searching for "GDPR Article 17" in a legal database needs to return exactly that article, not "GDPR Article 16," "data protection regulations," or "European privacy laws." Embeddings see these as nearly identical.
Technical Documentation
A developer searching for "PostgreSQL connection pooling timeout configuration" needs documentation about PostgreSQL, specifically about connection pooling, specifically about timeouts. Pure semantic search might return MySQL documentation, general database tuning guides, or connection pool library comparisons—all related, none correct.
The Hybrid Solution: Semantic Understanding Meets Lexical Precision
The answer is not to abandon embeddings but to combine them with traditional search techniques in a complementary architecture. This is the approach taken by production systems like SANDI-Solr, which implements a sophisticated hybrid search strategy:
1. Entity Extraction and Keyword Filtering
Before executing the semantic search, the query is analyzed using NLP to extract:
- Named entities (product names, technologies, locations, organizations)
- Quoted phrases (exact match requirements)
- Keywords (important terms that must appear)
These extracted elements are used as hard filters on the vector search results. You can find semantic neighbors all you want, but if the document doesn't contain "PostgreSQL," it's eliminated. This solves the named entity problem.
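A minimal sketch of this filtering step, assuming spaCy for named entity recognition and a plain HTTP request against a Solr core, is shown below. The field names (`title`, `content`), the core name, and the exact filter construction are illustrative assumptions, not SANDI-Solr's actual configuration.

```python
import re
import requests
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model; any NER pipeline works

def extract_filters(query: str) -> list[str]:
    """Collect quoted phrases and named entities that must appear verbatim."""
    quoted = re.findall(r'"([^"]+)"', query)
    entities = [ent.text for ent in nlp(query).ents]
    seen, must_match = set(), []
    for term in quoted + entities:        # de-duplicate, preserving order
        if term.lower() not in seen:
            seen.add(term.lower())
            must_match.append(term)
    return must_match

def filtered_search(query: str, solr_url="http://localhost:8983/solr/docs/select"):
    must_match = extract_filters(query)
    params = {
        "q": query,
        "defType": "edismax",
        "rows": 20,
        # Hard lexical filters: semantic neighbors lacking these terms are dropped.
        "fq": [f'title:"{term}" OR content:"{term}"' for term in must_match],
    }
    return requests.get(solr_url, params=params).json()

# If "PostgreSQL" is recognized as an entity (or quoted by the user), it becomes
# a hard filter, so MySQL tuning guides never reach the ranking stage.
print(extract_filters('PostgreSQL connection pooling "timeout configuration"'))
```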
2. Weighted Field Combination
Rather than collapsing all embeddings into one, hybrid systems maintain separate search strategies:
- Keyword search with field-specific boosting (titles weighted 10x higher than body text)
- Vector search with field-level embeddings (title vectors, description vectors, content vectors)
- Score fusion that combines both signals with configurable weights
This preserves the precision of keyword matches while adding semantic depth.
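One way to wire this together against Solr 9's dense vector support is sketched below. The field names, boosts, and fusion weight are illustrative assumptions rather than SANDI-Solr's actual configuration, and the query vector is assumed to come from whatever embedding model the deployment uses.

```python
import requests

SOLR = "http://localhost:8983/solr/docs/select"  # assumed core name

def keyword_search(query: str, rows: int = 100):
    """BM25 with field-specific boosting: title matches count 10x more than body."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": "title^10 description^3 content^1",
        "fl": "id,score",
        "rows": rows,
    }
    return requests.get(SOLR, params=params).json()["response"]["docs"]

def vector_search(query_vec, field: str = "content_vector", rows: int = 100):
    """Dense vector (KNN) search against a per-field embedding."""
    knn = "{!knn f=%s topK=%d}%s" % (field, rows, list(query_vec))
    params = {"q": knn, "fl": "id,score", "rows": rows}
    return requests.get(SOLR, params=params).json()["response"]["docs"]

def fuse(keyword_docs, vector_docs, alpha: float = 0.6):
    """Configurable score fusion: min-max normalize each list, then blend."""
    def normalize(docs):
        scores = [d["score"] for d in docs] or [0.0]
        lo, span = min(scores), (max(scores) - min(scores)) or 1.0
        return {d["id"]: (d["score"] - lo) / span for d in docs}

    kw, vec = normalize(keyword_docs), normalize(vector_docs)
    fused = {doc_id: alpha * kw.get(doc_id, 0.0) + (1 - alpha) * vec.get(doc_id, 0.0)
             for doc_id in set(kw) | set(vec)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

The `alpha` weight is the operator-facing knob: raising it favors exact lexical matches, lowering it favors semantic recall.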
3. Query Understanding and Expansion
Modern hybrid systems use LLMs to:
- Expand queries with synonyms and related terms (when appropriate)
- Identify the intent behind multi-word queries
- Determine which parts of the query require exact matching vs. semantic similarity
Example: A query like "fix 404 errors in nginx" might be understood as:
- Entity: nginx (must match)
- Intent: troubleshooting/configuration
- Semantic expansion: error handling, HTTP status codes, web server configuration
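In practice this analysis step typically asks an LLM for a structured answer and parses it before the Solr request is built. The sketch below shows the shape of that exchange; `call_llm` is a hypothetical stand-in for whichever model API is used, and the JSON schema is an assumption for illustration, not SANDI-Solr's actual contract.

```python
import json

ANALYSIS_PROMPT = """Analyze the search query below. Return JSON with:
  "entities": terms that must match exactly (products, technologies, identifiers),
  "intent": a short label such as "troubleshooting" or "comparison",
  "expansions": synonyms or related terms that may safely broaden recall.
Query: {query}
JSON:"""

def analyze_query(query: str, call_llm) -> dict:
    """call_llm is a hypothetical callable: prompt string in, completion string out."""
    raw = call_llm(ANALYSIS_PROMPT.format(query=query))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a purely lexical treatment if the LLM output is malformed.
        return {"entities": [], "intent": "unknown", "expansions": []}

# Illustrative (not a real model response) output for "fix 404 errors in nginx":
# {"entities": ["nginx", "404"],
#  "intent": "troubleshooting",
#  "expansions": ["error handling", "HTTP status codes", "web server configuration"]}
```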
4. Re-Ranking with Context
Initial retrieval using hybrid search casts a wide net; a re-ranking model (often a cross-encoder or a small LLM) then re-scores the results by reading each query-document pair directly. This two-stage approach balances recall (finding all potentially relevant documents) with precision (ranking the best ones at the top).
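A minimal second stage can be built with an off-the-shelf cross-encoder. The sketch below uses the sentence-transformers library with a public MS MARCO re-ranking model as one reasonable default; the model choice, candidate field name, and cutoff are assumptions, not a prescribed setup.

```python
from sentence_transformers import CrossEncoder

# A small, widely used cross-encoder trained for passage re-ranking.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Re-score hybrid-search candidates by reading each (query, document)
    pair jointly, then keep only the best top_k."""
    pairs = [(query, doc["text"]) for doc in candidates]   # assumes a "text" field
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Stage 1 (hybrid retrieval) casts a wide net, e.g. 100 candidates;
# stage 2 (this function) reads them closely and keeps the 10 best.
```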
The SANDI-Solr Approach
SANDI-Solr exemplifies this hybrid philosophy by integrating:
- Traditional Solr search with BM25 scoring and field-level boosting
- Dense vector search using state-of-the-art embedding models (Qwen3)
- NLP-powered entity extraction from queries to identify terms requiring exact matches
- Synonym expansion for recognized entities and keywords
- Configurable score fusion allowing operators to tune the balance between semantic and lexical signals
- LLM-based re-ranking for final precision optimization
The result is a search system that combines the semantic understanding of embeddings with the precision of traditional search, delivering both recall and relevance at scale.
Conclusion: Embeddings are Necessary but Not Sufficient
Pure embedding-based semantic search represents a powerful tool in the information retrieval toolkit, but it is not a complete solution for large-scale search systems. The limitations become apparent at scale: loss of precision, difficulty with named entities, and semantic drift.
The future of search is hybrid: leveraging embeddings for semantic understanding while using traditional IR techniques, entity extraction, and lexical filtering to maintain precision. By combining the strengths of both approaches, systems can deliver the "best of both worlds"—finding documents that are semantically relevant while respecting the specific, precise requirements embedded in user queries.
For production search applications, especially those dealing with technical content, product catalogs, or large knowledge bases, the question is not whether to use embeddings, but how to integrate them intelligently with proven traditional search techniques.
Key Takeaway: Modern search platforms like SANDI-Solr demonstrate that the most effective approach combines vector search with traditional keyword matching, entity extraction, and intelligent query understanding to deliver both semantic depth and lexical precision.
Research: Theoretical Limitations of Embedding-Based Retrieval
Extract from: On the Theoretical Limitations of Embedding-Based Retrieval (arXiv:2508.21038v1)
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee (Google DeepMind; Johns Hopkins University)
Abstract Summary
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models.
Key Finding: This work demonstrates that we may encounter these theoretical limitations in realistic settings with extremely simple queries.
Theoretical Framework
The research connects known results in learning theory, showing that the number of top-𝑘 subsets of documents that can be returned as the result of some query is limited by the dimension of the embedding. The study demonstrates empirically that this holds even in the best case, where 𝑘 is restricted to 2 and the embeddings are optimized directly on the test set as free parameters.
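The "free embedding" experiment can be reproduced in miniature. The sketch below is a simplified illustration using PyTorch, with sizes and hyperparameters chosen arbitrarily rather than taken from the paper: it directly optimizes query and document vectors so that each query's designated pair of documents outranks all others, then counts how many of the target top-2 sets are actually realized. As the document count grows relative to the embedding dimension, some combinations remain unreachable no matter how long the optimization runs.

```python
import itertools
import torch

n_docs, dim = 12, 4   # deliberately small: 66 target pairs, few dimensions
pairs = list(itertools.combinations(range(n_docs), 2))   # every possible top-2 set

# Free parameters: one query vector per target pair, plus the document vectors.
queries = torch.nn.Parameter(torch.randn(len(pairs), dim))
docs = torch.nn.Parameter(torch.randn(n_docs, dim))
optimizer = torch.optim.Adam([queries, docs], lr=0.05)

targets = torch.zeros(len(pairs), n_docs)
for i, (a, b) in enumerate(pairs):
    targets[i, a] = targets[i, b] = 1.0

for step in range(2000):
    scores = queries @ docs.T   # (num_pairs, n_docs) similarity matrix
    # Margin loss: each target document should outscore every non-target document.
    pos = scores.masked_fill(targets == 0, float("inf")).min(dim=1).values
    neg = scores.masked_fill(targets == 1, float("-inf")).max(dim=1).values
    loss = torch.relu(1.0 + neg - pos).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    top2 = (queries @ docs.T).topk(2, dim=1).indices.sort(dim=1).values
    realized = (top2 == torch.tensor(pairs)).all(dim=1).float().mean()
print(f"top-2 sets realized: {realized.item():.1%}")  # below 100% once n_docs outgrows dim
```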
The LIMIT Dataset
The researchers created a realistic dataset called LIMIT that stress tests models based on these theoretical results. State-of-the-art models fail on this dataset despite the simple nature of the task.
Key Contributions:
- Introduction of the LIMIT dataset, which highlights the fundamental limitations of embedding models
- Theoretical connection showing that embedding models cannot represent all combinations of top-𝑘 documents until they have a large enough embedding dimension 𝑑
- Empirical validation through best case optimization of the vectors themselves
- Practical connection to existing state-of-the-art models by creating a simple natural language instantiation of the theory
Implications
The research shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation. The results imply that the community should consider how instruction-based retrieval will impact retrievers, as there will be combinations of top-𝑘 documents that current models cannot represent.
Practical Impact: This research validates the need for hybrid search approaches like SANDI-Solr, which combine embeddings with traditional IR techniques to overcome the theoretical limitations of pure vector-based retrieval.
Reference: arXiv:2508.21038v1 - "On the Theoretical Limitations of Embedding-Based Retrieval"
Full paper available at: https://arxiv.org/abs/2508.21038