RAG over corporate data: architecture and common mistakes

Why your internal chatbot answers badly and how to fix it. The bottleneck is almost never the model — it is retrieval quality.

When an internal RAG chatbot answers badly, the cause is usually retrieval, not the LLM. Modern models (Claude, GPT-4) are good enough that what decides answer quality is which documents reach the context and in what shape.

Mistake 1: chunks too big or too small

2000-token chunks dilute the signal: the model gets too much noise. 200-token chunks lose the context needed to interpret them. The sweet spot we see working is 400–800 tokens, with 10–15% overlap between neighbors. More important than size, though, is that chunks respect semantic structure: do not cut in the middle of a section.
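As a minimal sketch of the idea above, the following splits each section independently, so no chunk ever crosses a section boundary, and slides a window with overlap inside long sections. The function names are illustrative, and whitespace-separated words stand in for tokens, which is an assumption: a real pipeline would count tokens with the tokenizer of its embedding model.

```python
from typing import List


def chunk_section(section: str, max_tokens: int = 600, overlap: float = 0.125) -> List[str]:
    """Split one section into windows of at most max_tokens words.

    Assumption: words approximate tokens. Neighboring windows share
    roughly `overlap` * max_tokens words.
    """
    words = section.split()
    if not words:
        return []
    # Distance between window starts; the remainder of the window overlaps.
    step = max(1, int(max_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the section
    return chunks


def chunk_document(sections: List[str], max_tokens: int = 600, overlap: float = 0.125) -> List[str]:
    """Chunk section by section so that no chunk spans two sections."""
    chunks: List[str] = []
    for section in sections:
        chunks.extend(chunk_section(section, max_tokens, overlap))
    return chunks
```

How the document is split into sections (headings, numbered clauses, pages) depends on the source format and is left out here on purpose.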

Mistake 2: embedding search only

Embeddings capture semantic similarity but miss exact terms. If your user asks about “Ministry Resolution 4505” and embedding search returns generic “health sector regulation” documents instead, retrieval has failed. Always go hybrid: BM25 (keyword) + embeddings (semantic) + re-ranking. It is extra work, but we see quality climb 30–40%.
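One common way to combine the keyword and semantic result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible sketch: it only needs the ranked chunk IDs from each retriever, not comparable scores. The IDs and the constant `k = 60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every ID it contains,
    so IDs ranked high by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for the query "Ministry Resolution 4505".
keyword_hits = ["res-4505", "hr-policy", "health-reg"]    # e.g. from BM25
semantic_hits = ["health-reg", "res-4505", "tax-guide"]   # e.g. from embeddings
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

The fused list is what you would pass to the re-ranker, which then scores only a small candidate set.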

Mistake 3: no metadata

Each chunk should carry its source, last-updated date, author, and section in the original document. This lets the model say “according to the HR policy updated in March 2026” instead of inventing the date. It also lets you filter noise, for example by discarding outdated documents before they reach the context.
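A small sketch of what carrying that metadata can look like. The `Chunk` type, its field names, and the bracketed header format are assumptions made for illustration; the point is that the same fields serve both filtering and citation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str      # file or system the chunk came from
    updated: date    # last-updated date of the source document
    author: str
    section: str     # section in the original document


def drop_outdated(chunks: List[Chunk], not_before: date) -> List[Chunk]:
    """Keep only chunks whose source was updated on or after not_before."""
    return [c for c in chunks if c.updated >= not_before]


def render_for_prompt(chunk: Chunk) -> str:
    """Put the metadata in front of the text so the model can cite it."""
    header = (
        f"[source: {chunk.source} | section: {chunk.section} | "
        f"updated: {chunk.updated.isoformat()} | author: {chunk.author}]"
    )
    return f"{header}\n{chunk.text}"
```

In a vector store the same fields would usually live in filterable columns next to the embedding, so outdated documents can be excluded inside the search query itself.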

Mistake 4: ignoring the user query

Before retrieval, rewrite the query. “How much do I get paid in bonus?” is a bad retrieval query: it is too ambiguous. Reformulating it as “service bonus policy first half 2026 employees” improves hits dramatically. It takes one extra LLM call before search.
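The rewrite step can be sketched as a thin wrapper around whatever LLM client you use. Here `complete` is a hypothetical callable that takes a prompt and returns the model's text; the prompt wording and the `today` parameter are also assumptions, not a tested recipe.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the employee question below as a short keyword-style search query "
    "for an internal document index. Resolve vague references and add the "
    "relevant period if the question implies one. Today is {today}.\n"
    "Question: {question}\n"
    "Search query:"
)


def rewrite_query(question: str, complete: Callable[[str], str], today: str) -> str:
    """Return a retrieval-friendly version of the user question.

    `complete` is a placeholder for one LLM call (prompt in, text out).
    Falls back to the original question if the model returns nothing.
    """
    prompt = REWRITE_PROMPT.format(question=question, today=today)
    rewritten = complete(prompt).strip()
    return rewritten or question
```

Only the rewritten query goes to search; the original question is still what you pass to the generation step, so the answer addresses what the user actually asked.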

Mistake 5: not evaluating retrieval quality

Before measuring whether the model answers well, measure whether retrieval brings back the correct documents. Build a dataset: “for this question, these are the chunks that should come back”. Then measure precision@k (the share of the top k retrieved chunks that are relevant) and recall@k (the share of the relevant chunks that appear in the top k). If retrieval does not bring the right chunk, no prompt will save the answer.
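Both metrics are a few lines of code once the labeled dataset exists. The sketch below uses the standard definitions; the chunk IDs in the example are made up.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant chunk IDs that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # nothing to find, so recall is defined as 0 here
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


# One labeled example: what retrieval returned vs. what should have come back.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
expected_ids = {"c2", "c4", "c8"}
p_at_5 = precision_at_k(retrieved_ids, expected_ids, k=5)  # 2 hits / 5 = 0.4
r_at_5 = recall_at_k(retrieved_ids, expected_ids, k=5)     # 2 hits / 3 expected
```

Averaging these over every question in the dataset gives a single number to track as you change chunking, search, or re-ranking.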

Stack we see working in LATAM

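For mid-sized companies: pgvector on Postgres (because you already have Postgres), embeddings with OpenAI or Voyage AI (better for Spanish), generation with Claude Sonnet or GPT-4o-mini, keyword search in the same database, and optional re-ranking with Cohere or Voyage. Note that pgvector only handles vector search: the keyword side comes from Postgres built-in full-text search (tsvector with ts_rank, which is not BM25) or from a BM25 extension such as ParadeDB's pg_search. The full stack can run under 0.001 USD per query at mid scale, though that depends heavily on the generation model and on how many tokens of context you send.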

The next level: contextual retrieval

Anthropic published a simple technique: before embedding a chunk, prepend a short LLM-generated context (“this chunk comes from document X about Y”) that situates it within the full document. Anthropic reports that this cut failed retrievals by about 35%, at minimal cost if you use prompt caching.
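A rough sketch of the indexing step, assuming a hypothetical `complete` callable that wraps one LLM call (prompt in, text out). The prompt wording is illustrative, not Anthropic's exact prompt. The full document is placed first so that every chunk of the same document shares the same prompt prefix, which is the part prompt caching can reuse.

```python
from typing import Callable, List

CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write one or two sentences that situate this chunk within the document, "
    "to improve search retrieval. Answer with the context only."
)


def contextualize(chunk: str, document: str, complete: Callable[[str], str]) -> str:
    """Return the chunk with an LLM-generated context prepended.

    `complete` is a placeholder for the LLM call. If it returns nothing,
    the chunk is kept unchanged.
    """
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = complete(prompt).strip()
    return f"{context}\n\n{chunk}" if context else chunk


def contextualize_all(chunks: List[str], document: str, complete: Callable[[str], str]) -> List[str]:
    """Contextualize every chunk of one document before embedding."""
    return [contextualize(chunk, document, complete) for chunk in chunks]
```

The contextualized text is what gets embedded (and indexed for keyword search); you can still store the original chunk separately if you prefer to show users the unmodified passage.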

How we help at Athrun Data Intelligence

We build production RAG systems on corporate data. Book a 30-minute call to audit yours if you already have something running, or to design the architecture from scratch.
