RAG over corporate data: architecture and common mistakes
Why your internal chatbot answers badly and how to fix it. The bottleneck is almost never the model — it is retrieval quality.
When an internal RAG chatbot answers badly, the cause is usually retrieval rather than the LLM. Modern models (Claude, GPT-4) are good enough that the limiting factor is which documents reach the context and in what shape.
Mistake 1: chunks too big or too small
Chunks of 2000 tokens dilute the signal: the model gets too much noise. Chunks of 200 tokens lose the context needed to interpret them. The sweet spot we see working is 400–800 tokens, with 10–15% overlap between neighbors. More important than size, though, is that chunks respect semantic structure: do not cut in the middle of a section.
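A minimal sketch of section-aware chunking with overlap, under two loud assumptions: the document has already been split into (title, text) sections, and whitespace-separated words stand in for tokens (a real pipeline would count tokens with the tokenizer of its embedding model). The function name and defaults are illustrative.

```python
def chunk_sections(sections, max_tokens=600, overlap_ratio=0.12):
    """Split each section into overlapping chunks without crossing section boundaries.

    sections: list of (title, text) pairs, assumed to come from an earlier
    structure-aware parse of the document (headings, numbered clauses, etc.).
    """
    overlap = round(max_tokens * overlap_ratio)
    step = max(1, max_tokens - overlap)  # how far the window advances each time
    chunks = []
    for title, text in sections:
        tokens = text.split()  # ASSUMPTION: words approximate tokens
        if not tokens:
            continue
        start = 0
        while True:
            window = tokens[start:start + max_tokens]
            chunks.append({"section": title, "text": " ".join(window)})
            if start + max_tokens >= len(tokens):
                break  # this window reached the end of the section
            start += step
    return chunks


demo_sections = [
    ("Bonuses", " ".join(str(i) for i in range(25))),
    ("Vacation", "short text"),
]
demo_chunks = chunk_sections(demo_sections, max_tokens=10, overlap_ratio=0.2)
```

Because the window restarts at every section, a chunk never mixes text from two sections, and short sections simply become one small chunk.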
Mistake 2: embedding search only
Embeddings capture semantic similarity but miss exact terms. If your user asks about "Ministry Resolution 4505" and embedding search returns generic "health sector regulation" content instead of that resolution, retrieval has failed. Always go hybrid: BM25 (keyword) + embeddings (semantic) + re-ranking. It is extra work, but we see quality climb 30–40%.
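One common way to merge the keyword and semantic result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible sketch: the document ids are made up, and k=60 is the constant conventionally used with RRF, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list where it appears;
    documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "Ministry Resolution 4505".
keyword_hits = ["res-4505", "hr-policy", "travel"]        # e.g. from BM25
semantic_hits = ["health-reg", "res-4505", "hr-policy"]   # e.g. from embeddings
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

The fused list would then go to the re-ranker, which only needs to reorder a short candidate set.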
Mistake 3: no metadata
Each chunk should carry its source, last-updated date, author, and section in the original document. This lets the model say "according to the HR policy updated in March 2026" instead of inventing the date. It also lets you filter noise before generation, for example by discarding outdated documents.
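A minimal sketch of what carrying metadata can look like. The field names mirror the list above; the class, the helper functions, and the header layout are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str    # document the chunk came from
    updated: date  # last-updated date of that document
    author: str
    section: str   # section of the original document


def drop_outdated(chunks, cutoff):
    """Keep only chunks whose source was updated on or after the cutoff date."""
    return [c for c in chunks if c.updated >= cutoff]


def render_for_prompt(chunk):
    """Prefix the chunk with its metadata so the model can cite it."""
    header = (
        f"[source: {chunk.source} | section: {chunk.section} | "
        f"author: {chunk.author} | updated: {chunk.updated.isoformat()}]"
    )
    return f"{header}\n{chunk.text}"


# Hypothetical example data.
old = Chunk("Bonus is 5%.", "HR policy", date(2023, 1, 10), "HR", "Bonuses")
new = Chunk("Bonus is 8%.", "HR policy", date(2026, 3, 2), "HR", "Bonuses")
fresh = drop_outdated([old, new], cutoff=date(2025, 1, 1))
```

With the header in the prompt, the date the model quotes comes from the data rather than from its guess.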
Mistake 4: ignoring the user query
Rewrite the query before retrieval. "How much do I get paid in bonus?" is a bad retrieval query: it is too ambiguous. Reformulating it as "service bonus policy first half 2026 employees" improves hits dramatically. It costs one extra LLM call before search.
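The rewrite step can be sketched as a thin wrapper around whatever LLM client you already use. Everything here is an assumption for illustration: the prompt wording is hypothetical, and `complete` is a placeholder for your own function that sends a prompt and returns the model's text. A stub replaces the real call so the example runs offline.

```python
from typing import Callable

# Hypothetical prompt wording; tune it for your own domain and language.
REWRITE_PROMPT = (
    "Rewrite the employee question below as a short, keyword-rich search query "
    "for an internal document index. Today's date is {today}. "
    "Return only the query.\n\nQuestion: {question}"
)


def rewrite_query(question: str, complete: Callable[[str], str], today: str) -> str:
    """Ask the LLM for a retrieval-friendly query; fall back to the original."""
    prompt = REWRITE_PROMPT.format(today=today, question=question)
    rewritten = complete(prompt).strip()
    return rewritten if rewritten else question


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, returning the article's example rewrite."""
    return " service bonus policy first half 2026 employees \n"


search_query = rewrite_query("How much do I get paid in bonus?", fake_llm, "2026-03-01")
```

Passing today's date lets the model resolve phrases like "this semester" into a concrete period, and the fallback keeps retrieval working if the rewrite comes back empty.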
Mistake 5: not evaluating retrieval quality
Before measuring whether the model answers well, measure whether retrieval brings the correct documents. Build a dataset: "for this question, these are the chunks that should come back". Measure precision@k and recall@k. If retrieval does not bring the right chunk, no prompt will save the answer.
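As a sketch, both metrics fit in a few lines once you have, for each question, the ranked chunk ids retrieval returned and the set of chunk ids that should have come back. The ids below are invented for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in set(retrieved[:k]) if chunk_id in relevant)
    return hits / len(relevant)


# One hypothetical row of the evaluation dataset.
retrieved = ["c1", "c7", "c3", "c9"]  # what retrieval returned, best first
relevant = {"c1", "c3", "c5"}         # chunks that should come back
p3 = precision_at_k(retrieved, relevant, 3)  # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, 3)     # 2 of the 3 relevant were found
```

Averaging these values over the whole dataset gives a number to track every time you change chunking, embeddings, or the fusion step.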
Stack we see working in LATAM
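For mid-sized companies: pgvector on Postgres (because you already have it), embeddings with OpenAI or Voyage AI (better for Spanish), generation with Claude Sonnet or GPT-4o-mini, hybrid search that pairs pgvector with keyword search in the same database, and optional re-ranking with Cohere or Voyage. Note that pgvector itself only does vector search: the keyword side comes from Postgres built-in full-text search (ts_rank, which is not BM25) or from a separate BM25 extension. Full stack under 0.001 USD per query at mid scale.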
The next level: contextual retrieval
Anthropic published a simple technique: before embedding a chunk, prepend a short LLM-generated context that situates it in its document ("this chunk comes from document X about Y"). Anthropic reports that this cuts the retrieval failure rate by about 35%, and the cost stays minimal if you use prompt caching.
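The indexing step can be sketched as follows. This is not Anthropic's code: `generate_context` is a placeholder for your own LLM call that receives the whole document plus one chunk and returns a sentence or two of context, and the stub below only mimics that shape so the example runs offline.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Prepend LLM-generated context to each chunk before it is embedded."""
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        # If the model returns nothing, index the chunk unchanged.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_context(document: str, chunk: str) -> str:
    """Stand-in for the real LLM call; uses the first line as the title."""
    title = document.splitlines()[0]
    return f"This chunk comes from the document '{title}'."


doc = "HR Policy 2026\nBonuses are paid in June.\nVacation is 15 days."
pieces = ["Bonuses are paid in June.", "Vacation is 15 days."]
to_embed = contextualize_chunks(doc, pieces, stub_context)
```

Since every call for the same document repeats the full document text, that shared prefix is what prompt caching makes cheap. The contextualized text is what you embed and index; the original chunk can still be what you show to the user.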
How we help at Athrun Data Intelligence
We build production RAG systems on corporate data. Book a 30-minute call to audit your system if you already have something running, or to design the architecture from scratch.
Sources
- Anthropic, Contextual Retrieval: https://www.anthropic.com/news/contextual-retrieval
- OpenAI embeddings documentation: https://platform.openai.com/docs/guides/embeddings
- pgvector, Postgres extension: https://github.com/pgvector/pgvector
- Pinecone documentation: https://docs.pinecone.io/
- LangChain, RAG patterns: https://python.langchain.com/docs/concepts/rag/