When we started building the TradeGladiator AI Engine, the obvious choice for retrieval seemed to be RAG (Retrieval-Augmented Generation) with vector embeddings. It is the prevailing approach in 2026 for any system that feeds context into an LLM. But after benchmarking both approaches against our actual signal dataset, we chose BM25 instead. This article explains the reasoning, the trade-offs, and when we would switch.
What Is RAG?
Retrieval-Augmented Generation is a pattern where you embed your documents into a high-dimensional vector space, store them in a vector database (Pinecone, Weaviate, Qdrant, pgvector), and at query time, embed the user's query and retrieve the k nearest neighbors. Those retrieved documents are injected into the LLM's context window to ground its response in real data.
RAG excels when:
- Documents are long-form, unstructured natural language (support articles, research papers, wikis)
- Queries are semantically ambiguous ("what's our refund policy for enterprise clients?")
- You need to capture meaning that transcends exact word matches
The typical RAG stack in 2026 involves an embedding model (OpenAI text-embedding-3-large or Cohere embed-v4), a vector database, and a reranker. The infrastructure cost ranges from $50 to $200+ per month even at modest scale, and latency adds 200-400ms per retrieval.
What Is BM25?
BM25 (Best Match 25) is a probabilistic ranking function from the Okapi information-retrieval work of the 1990s. It scores documents by term frequency, inverse document frequency, and document-length normalization. No embeddings, no neural networks, no external APIs. It runs entirely in-process.
The algorithm is strikingly simple: for a given query, BM25 boosts documents that contain rare query terms (high IDF) and penalizes overly long documents. The math is well understood, deterministic, and fast.
```
score(D, Q) = Σ_{qi ∈ Q}  IDF(qi) · ( f(qi, D) · (k1 + 1) ) / ( f(qi, D) + k1 · (1 − b + b · |D| / avgDL) )
```
Where f(qi,D) is the term frequency, |D| is document length, avgDL is the average document length, and k1 and b are tunable parameters (typically 1.2 and 0.75).
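The formula translates almost line-for-line into code. Here is a minimal sketch in TypeScript, assuming a pre-tokenized corpus; `buildIdf` and `bm25Score` are illustrative names, not our production implementation:

```typescript
type Doc = string[]; // a tokenized document

// Inverse document frequency with the standard BM25 smoothing.
function buildIdf(corpus: Doc[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const doc of corpus) {
    for (const term of Array.from(new Set(doc))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }
  const n = corpus.length;
  const idf = new Map<string, number>();
  df.forEach((freq, term) => {
    idf.set(term, Math.log((n - freq + 0.5) / (freq + 0.5) + 1));
  });
  return idf;
}

// Score one document against a tokenized query.
function bm25Score(
  query: string[],
  doc: Doc,
  idf: Map<string, number>,
  avgDL: number,
  k1 = 1.2,
  b = 0.75
): number {
  let score = 0;
  for (const term of query) {
    const f = doc.filter((t) => t === term).length; // term frequency f(qi, D)
    if (f === 0) continue; // unmatched terms contribute nothing
    const norm = f + k1 * (1 - b + b * (doc.length / avgDL));
    score += (idf.get(term) ?? 0) * ((f * (k1 + 1)) / norm);
  }
  return score;
}
```

On a toy corpus, a query like `["aapl", "long"]` ranks a document containing both terms above one matching only `long`, and a document with neither term scores exactly zero.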
Why Financial Signals Are Different
Here is the key insight that drove our decision: trading signals are not unstructured natural language. They are structured, keyword-rich documents with precise terminology.
A typical signal document in our system looks like this:
```json
{
  "ticker": "AAPL",
  "direction": "LONG",
  "strategy": "DAY_TRADE_AGGREGATOR",
  "confidence": 82,
  "entry": { "price": 198.50, "inZone": true },
  "target1": 203.00,
  "stopLoss": 196.20,
  "mtf": {
    "1d": { "direction": "BULLISH", "strength": "STRONG" },
    "1w": { "direction": "BULLISH", "strength": "MODERATE" }
  },
  "aiAnalysis": {
    "summary": "Strong bullish confluence across timeframes...",
    "entry": "Price sitting at key support with volume confirmation..."
  }
}
```
When our reflection loop queries past signals, the retrieval query is something like "AAPL LONG DAY_TRADE_AGGREGATOR bullish". Every term in that query is an exact keyword match. There is zero semantic ambiguity. "AAPL" means Apple Inc., "LONG" means a buy position, "bullish" means bullish.
Vector embeddings add no value here. In fact, they add noise. An embedding model might position "AAPL" near "MSFT" in vector space (they are both mega-cap tech stocks), which is exactly the wrong behavior when you are looking for past AAPL signals specifically.
BM25, by contrast, gives an exact score boost for every matching keyword. If you search for "AAPL LONG," documents containing both "AAPL" and "LONG" rank highest. Documents about MSFT score zero for the AAPL term. This is exactly what we want.
The Cost Comparison: $0 vs $50-200/mo
This is the part that matters for teams building production AI systems on a budget.
| Factor | BM25 (Our Choice) | RAG Pipeline |
|---|---|---|
| Infrastructure cost | $0 | $50-200/mo |
| Embedding API cost | $0 | $5-20/mo |
| Retrieval latency | <5ms | 200-400ms |
| Cold start time | ~50ms (build index) | 1-3s (connect + query) |
| External dependencies | 0 | 2-3 services |
| Deterministic results | Yes | No (model-dependent) |
| Semantic understanding | No | Yes |
For a startup or indie dev building a fintech product, these numbers matter. Every external dependency is a potential failure point. Every dollar in monthly infrastructure is a dollar that does not go toward product development. And every 200ms of added latency makes the user experience measurably worse.
Our Implementation: Inline BM25 in Cloud Functions
Our BM25 implementation lives entirely inside a Firebase Cloud Function. There is no separate search service, no vector database, no embedding pipeline. The entire retrieval system is approximately 200 lines of TypeScript.
Here is how it works:
- Index build: On the first request (or after cache expiry), we load the user's signal history from Firestore, tokenize each document, and build an in-memory BM25 index. For a typical user with 500-2,000 signals, this takes under 50ms.
- Caching: The index is cached in memory with a 5-minute TTL. Cloud Functions instances persist between invocations, so subsequent requests hit the warm cache.
- Query: When the AI reflection loop needs context, it constructs a query from the current signal's metadata (ticker, direction, strategy, key terms from analysis) and runs BM25 scoring against the cached index.
- Result: The top-k results (typically k=5) are returned as context for the LLM prompt. Total retrieval time: under 5ms from cache.
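The warm-cache pattern in steps 1 and 2 can be sketched as follows. This is illustrative, not our production code; `SignalIndex`, `getIndex`, and the loader callback are hypothetical names:

```typescript
// Sketch of a module-level warm cache inside a Cloud Function.
interface SignalIndex {
  search(query: string, k: number): string[]; // top-k serialized signals
}

const TTL_MS = 5 * 60 * 1000; // 5-minute TTL

// Module scope persists across invocations on a warm Cloud Functions
// instance, so this cache survives between requests.
let cachedIndex: SignalIndex | null = null;
let cachedAt = 0;

async function getIndex(
  loadSignals: () => Promise<string[]>, // e.g. a Firestore history query
  buildIndex: (docs: string[]) => SignalIndex
): Promise<SignalIndex> {
  const now = Date.now();
  if (cachedIndex && now - cachedAt < TTL_MS) {
    return cachedIndex; // warm path: no Firestore read, no rebuild
  }
  const docs = await loadSignals(); // cold path: reload and rebuild
  cachedIndex = buildIndex(docs);
  cachedAt = now;
  return cachedIndex;
}
```

A second request within the TTL never touches Firestore, which is why retrieval from a warm instance stays under 5ms.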
Implementation Detail: Document Serialization
Each signal is serialized into a flat text representation for indexing: "AAPL LONG DAY_TRADE_AGGREGATOR confidence:82 bullish strong_1d moderate_1w target:203 stop:196.2". This preserves the structured nature of the data while making it BM25-searchable. We experimented with field-weighted scoring (boosting ticker matches) but found uniform weighting sufficient for our corpus size.
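A serializer along these lines produces that flat representation. Field names follow the example document shown earlier; the function itself is a sketch, not our exact production code:

```typescript
// Illustrative serializer: flattens a signal into a keyword string.
interface SignalDoc {
  ticker: string;
  direction: string;
  strategy: string;
  confidence: number;
  target1: number;
  stopLoss: number;
  mtf: { [timeframe: string]: { direction: string; strength: string } };
}

function serializeSignal(s: SignalDoc): string {
  const timeframes = Object.keys(s.mtf);
  // Deduplicated MTF directions ("bullish"), then strength_timeframe terms.
  const dirs = Array.from(
    new Set(timeframes.map((tf) => s.mtf[tf].direction.toLowerCase()))
  );
  const strengths = timeframes.map(
    (tf) => `${s.mtf[tf].strength.toLowerCase()}_${tf}`
  );
  return [
    s.ticker,
    s.direction,
    s.strategy,
    `confidence:${s.confidence}`,
    ...dirs,
    ...strengths,
    `target:${s.target1}`,
    `stop:${s.stopLoss}`,
  ].join(" ");
}
```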
Our 5K-document index cap is not a hard wall. BM25 scales linearly with corpus size, and we have tested up to 10K documents with retrieval times under 15ms. For users who exceed the cap (institutional accounts with years of history), we partition by date range and index only the most recent 12 months.
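The date-range partition reduces to a simple filter before indexing. A minimal sketch, assuming signals carry a `createdAt` epoch-milliseconds timestamp (an assumption for illustration):

```typescript
// Hypothetical 12-month window filter applied before the index is built.
const WINDOW_MS = 365 * 24 * 60 * 60 * 1000; // ~12 months

function recentSignals<T extends { createdAt: number }>(
  signals: T[],
  now: number = Date.now()
): T[] {
  return signals.filter((s) => now - s.createdAt <= WINDOW_MS);
}
```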
When to Upgrade to Vector Search
BM25 is not the right answer forever. Here are the conditions under which we would add a vector retrieval layer:
- Natural language journal entries: If we add free-text trade journals where users write paragraphs about their reasoning, BM25 would miss semantic connections between "I was worried about the Fed meeting" and "rate decision uncertainty."
- Cross-asset correlation queries: "Show me signals similar to this setup across all tickers" requires semantic similarity that BM25 cannot provide.
- Corpus size beyond 50K documents: At very large scale, hybrid search (BM25 + vector) with a reranker consistently outperforms either approach alone.
- Multi-language support: If we expand to non-English markets, embedding models handle cross-lingual retrieval natively.
Our architecture is designed for this upgrade path. The retrieval interface is abstracted behind a SignalRetriever contract, so swapping BM25 for a hybrid pipeline requires changing one implementation, not rewiring the entire system.
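The contract looks roughly like this. Only the `SignalRetriever` name comes from our codebase; the method and type names here are illustrative:

```typescript
interface RetrievedSignal {
  id: string;
  score: number;
}

// The abstraction callers depend on.
interface SignalRetriever {
  retrieve(query: string, k: number): Promise<RetrievedSignal[]>;
}

// Today's implementation: score every indexed signal with BM25.
class Bm25Retriever implements SignalRetriever {
  constructor(
    private ids: string[],
    private score: (query: string, id: string) => number
  ) {}

  async retrieve(query: string, k: number): Promise<RetrievedSignal[]> {
    return this.ids
      .map((id) => ({ id, score: this.score(query, id) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}

// A future hybrid implementation (BM25 + vectors + reranker) would
// satisfy the same interface, so callers never change.
```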
Benchmarks and Results
We ran a head-to-head comparison on a test dataset of 3,200 signals across 45 tickers, spanning 8 months of trading data. The evaluation metric was precision@5 (how many of the top 5 retrieved signals were genuinely relevant to the query).
| Metric | BM25 | RAG (text-embedding-3-large) |
|---|---|---|
| Precision@5 | 0.91 | 0.84 |
| Recall@10 | 0.78 | 0.86 |
| Mean retrieval latency | 3.2ms | 287ms |
| P95 retrieval latency | 8ms | 520ms |
| Index build time | 42ms | 12.4s |
The results confirmed our hypothesis. For structured, keyword-rich financial data, BM25 achieves higher precision (0.91 vs 0.84) because it does not confuse semantically similar but factually different entities. RAG wins on recall (0.86 vs 0.78), pulling in more loosely related signals, but for our use case, precision matters more than recall. We would rather give the LLM 5 highly relevant signals than 10 somewhat relevant ones.
The latency difference is dramatic. BM25 retrieval at 3.2ms is effectively free in the context of an LLM call that takes 2-4 seconds. RAG adds 287ms of overhead before the LLM even starts generating.
Key Takeaways for Developers
If you are building a fintech AI system and considering your retrieval architecture:
- Profile your data first. If your documents are structured and keyword-rich, BM25 might be all you need.
- Start simple, measure, then optimize. You can always add vector search later. You cannot easily remove complexity.
- Zero-dependency retrieval is a feature. Fewer external services means fewer failure modes, lower cost, and faster iteration.
- Benchmark on your actual data. Generic benchmarks on Wikipedia or MS MARCO do not predict performance on financial signals.
- Design for the upgrade path. Abstract your retrieval behind an interface so the switch to hybrid search is a one-file change.
The best retrieval system is the one that matches your data characteristics, not the one with the most hype. For TradeGladiator's AI trading signals, BM25 delivers better precision, zero infrastructure cost, and sub-5ms latency. That is a hard combination to beat.
Related Reading
Learn how our AI uses retrieved context to improve over time in How AI Learns From Your Past Trades (Reflection Loop), or explore the full AI Engine architecture.