Engineering · 9 min read

Why RAG Feels Broken for Code (And What Context Engineering Fixes)

Traditional RAG pipelines fail on codebases: fixed-size chunks destroy code structure, embeddings miss exact function names, and there's no session memory. Learn how context engineering combines hybrid search, structure-aware chunking, and token budgeting for accurate AI-assisted development.


Alex Lopez

Founder, Snipara


You set up a RAG pipeline. Chunked your docs. Embedded them. Built a retrieval chain. And somehow, the answers are still wrong. Here's why traditional RAG falls short for code-aware AI — and what context engineering does differently.

Key Takeaways

  • RAG was designed for documents, not codebases — fixed-size chunks destroy code structure and meaning
  • Embedding-only retrieval misses exact matches — function names, class names, and API paths need keyword search
  • Context engineering combines 5 signals — keywords, semantics, structure, session context, and token budgeting
  • No LLM costs for you — context engineering optimizes what you send to your own LLM

The RAG Promise: Why Everyone Built One

Retrieval-Augmented Generation promised to solve the knowledge cutoff problem. Instead of fine-tuning a model on your data, you retrieve relevant documents at query time and pass them to the LLM as context. Elegant in theory.

The standard RAG pipeline looks like this:

1. Chunk: Split documents into fixed-size pieces (512-1024 tokens)
2. Embed: Convert chunks to vectors using an embedding model
3. Store: Put vectors in a vector database (Pinecone, Weaviate, pgvector)
4. Retrieve: On query, find the top-K most similar chunks by cosine similarity
5. Generate: Pass retrieved chunks + query to an LLM for the answer
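In code, those five steps fit on a page. Here's a minimal sketch, assuming the sentence-transformers package for embeddings; the model name, file path, and word-based "token" chunking are illustrative stand-ins, not recommendations:

```python
# Minimal RAG pipeline sketch: chunk -> embed -> store -> retrieve.
# Assumes the sentence-transformers package; model name and chunk
# size are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small 384-dim model

def chunk(text: str, size: int = 512) -> list[str]:
    # 1. Chunk: fixed-size pieces (words here, tokens in a real pipeline)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

docs = chunk(open("docs.md").read())
vectors = model.encode(docs)                     # 2. Embed
# 3. Store: a NumPy matrix stands in for Pinecone/Weaviate/pgvector

def retrieve(query: str, k: int = 5) -> list[str]:
    # 4. Retrieve: top-K chunks by cosine similarity
    q = model.encode([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

context = "\n---\n".join(retrieve("how does auth work?"))
# 5. Generate: send `context` + the question to your LLM of choice
```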

For a knowledge base of static articles, this works reasonably well. For codebases? It falls apart in predictable ways.

5 Ways RAG Falls Short for Code and Technical Documentation

1. Fixed-Size Chunking Destroys Code Structure

RAG pipelines typically split documents into 512 or 1024 token chunks. Code doesn't respect arbitrary boundaries. A function definition split across two chunks loses its meaning in both.

The problem:

A 600-token function gets split at token 512. Chunk A has the function signature and half the body. Chunk B has the other half and the next function. Neither chunk is useful on its own.
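You can reproduce the failure with a toy chunker. The sketch below splits on a fixed character count (standing in for a token count) and cuts a function in half; the function itself is made up for illustration:

```python
# Toy demo: fixed-size chunking cuts a function in half. Chunks are
# measured in characters to stay dependency-free; token-based
# splitting fails the same way.
source = '''def validate_auth_token(token: str) -> bool:
    """Check signature, expiry, and the revocation list."""
    header, payload, signature = token.split(".")
    if not verify_signature(header, payload, signature):
        return False
    return not is_revoked(payload)
'''

CHUNK_SIZE = 120
chunks = [source[i:i + CHUNK_SIZE] for i in range(0, len(source), CHUNK_SIZE)]

for n, piece in enumerate(chunks):
    print(f"--- chunk {n} ---\n{piece}")
# Chunk 0 ends mid-statement; chunk 1 starts with a dangling half of
# the body. Neither chunk is meaningful on its own.
```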

2. Embeddings Miss Exact Matches

Semantic search is great for finding concepts, but code has specific identifiers. If you search for validateAuthToken, cosine similarity on embeddings might return chunks about "authentication patterns" instead of the actual function.
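A keyword ranker handles this case trivially. Here's a sketch using the rank_bm25 package with a made-up three-document corpus; a regex tokenizer pulls identifiers out of the code so the exact name can match:

```python
# BM25 surfaces the chunk containing the exact identifier; an
# embedding-only ranker may prefer the generic "patterns" chunk.
# Assumes the rank_bm25 package; the corpus is made up.
import re
from rank_bm25 import BM25Okapi

corpus = [
    "def validateAuthToken(token): verify signature and expiry",
    "Overview of common authentication patterns and best practices",
    "def refreshSession(user): rotate the session cookie",
]

def tokenize(text: str) -> list[str]:
    return [t.lower() for t in re.findall(r"\w+", text)]

bm25 = BM25Okapi([tokenize(doc) for doc in corpus])
scores = bm25.get_scores(tokenize("validateAuthToken"))
print(corpus[scores.argmax()])  # the function definition ranks first
```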

3. No Awareness of Document Structure

A README section titled "Authentication" with an H2 header is more relevant than a passing mention in a changelog entry. Standard RAG treats both equally — a flat bag of chunks with no hierarchy.
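One way to encode that hierarchy is a per-level multiplier on the base relevance score. The weights below are illustrative, not Snipara's actual values:

```python
# Illustrative structure weighting; the multipliers are made up.
LEVEL_WEIGHT = {"h1": 3.0, "h2": 2.0, "h3": 1.5, "paragraph": 1.0}

def structure_score(base_relevance: float, level: str, in_title: bool) -> float:
    score = base_relevance * LEVEL_WEIGHT.get(level, 1.0)
    return score * 3.0 if in_title else score  # title hits weighted over body

# An H2 section titled "Authentication" beats a changelog paragraph
print(structure_score(0.5, "h2", in_title=True))          # 3.0
print(structure_score(0.5, "paragraph", in_title=False))  # 0.5
```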

4. No Session Context or Memory

Every query starts from scratch. If you just asked about the database schema and now ask about the API layer, RAG doesn't know these questions are related. It can't boost results from the same architectural area.
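A simple version of session awareness: remember which sections recent queries touched and boost candidates from the same area. The boost factor and path heuristic here are made up for illustration:

```python
# Illustrative session boost: candidates from the same area as recent
# queries score higher. The factor and path heuristic are made up.
recent_sections: list[str] = []

def session_boost(section_path: str, base_score: float) -> float:
    area = section_path.split("/")[0]
    if any(prev.split("/")[0] == area for prev in recent_sections):
        return base_score * 1.25  # you asked about this area recently
    return base_score

recent_sections.append("database/schema")      # from the previous query
print(session_boost("database/indexes", 0.6))  # boosted: 0.75
print(session_boost("api/routes", 0.6))        # unchanged: 0.6
```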

5. No Token Budget Awareness

RAG retrieves top-K chunks regardless of how many tokens they consume. You might get 5 chunks totaling 8,000 tokens when your budget is 4,000. Or you might get 5 tiny chunks that waste the available budget. There's no intelligent allocation.
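Token-aware selection replaces a fixed top-K with a greedy fill against the budget. A sketch, assuming token counts come from your model's tokenizer:

```python
# Greedy budget fill: take chunks in relevance order, but only while
# they fit. Token counts would come from a real tokenizer in practice.
def fill_budget(ranked_chunks: list[tuple[str, int]], budget: int) -> list[str]:
    """ranked_chunks: (text, token_count) pairs, best-first."""
    selected, used = [], 0
    for text, tokens in ranked_chunks:
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected

chunks = [("auth middleware", 1800), ("token refresh", 2500),
          ("rate limiting", 900), ("logging setup", 700)]
print(fill_budget(chunks, budget=4000))
# Skips the 2500-token chunk that would overflow, then keeps filling
# with smaller ones: 3400 of 4000 tokens used instead of stopping at K.
```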

Context Engineering: A Better Approach for AI-Assisted Development

Context engineering isn't "RAG but better." It's a fundamentally different approach that treats context delivery as a first-class engineering problem.

| Dimension | Traditional RAG | Context Engineering |
|---|---|---|
| Chunking | Fixed-size (512 tokens) | Structure-aware (respects headers, code blocks) |
| Search | Embedding similarity only | Hybrid (keyword + semantic + structure scoring) |
| Ranking | Cosine similarity | Multi-factor (relevance, recency, section level, context agreement) |
| Budget | Top-K (fixed count) | Token-aware (fills budget optimally) |
| Memory | None | Session context + persistent decisions |
| Complex queries | Single retrieval pass | Recursive decomposition (rlm_decompose → rlm_multi_query) |

How Snipara's Context Engine Works

Snipara's engine combines five signals to find the right context:

  • Keyword Search: BM25 with length normalization. Finds exact function names, class names, API paths.
  • Semantic Search: 384-dim embeddings with cosine similarity. Finds conceptually related content.
  • Structure Scoring: H1 > H2 > H3 > paragraph. Section titles weighted 3x over body text.
  • Session Context: Previous queries boost related sections. The engine learns what you're working on.
  • Token Budgeting: Fills your budget optimally — no wasted tokens, no overflow.

These signals are fused using Reciprocal Rank Fusion (RRF) — a proven algorithm from information retrieval research that combines multiple ranked lists without requiring score normalization.
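RRF itself is only a few lines: each ranker contributes 1/(k + rank) for every document it returns, so a chunk near the top of several lists outranks one that tops a single list. The k = 60 constant follows the original paper (Cormack et al., 2009); the chunk IDs below are made up:

```python
# Reciprocal Rank Fusion: fuse ranked lists without score
# normalization. k=60 follows Cormack et al. (2009).
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword  = ["chunk_12", "chunk_03", "chunk_44"]  # BM25 order
semantic = ["chunk_12", "chunk_07", "chunk_03"]  # embedding order
print(rrf([keyword, semantic]))
# chunk_12 tops both lists; chunk_03 appears in both, so it beats
# chunk_07 and chunk_44, which each appear only once.
```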

  • Token Reduction: 95% (500K → ~5K tokens)
  • Retrieval Latency: <1s (hybrid search, p95)
  • Answer Accuracy: 3-5x more source citations

When RAG Is Fine (And When You Need Context Engineering)

RAG isn't bad — it's just designed for a different problem. Here's when each approach is the right choice:

RAG is fine for:

  • Static knowledge bases (FAQ, support docs)
  • Uniform document types (articles, PDFs)
  • Simple Q&A with no follow-up context
  • Cases where approximate answers are acceptable

Context engineering is better for:

  • Codebases and technical documentation
  • Mixed content (code, markdown, configs, schemas)
  • Multi-turn development sessions
  • Team settings with shared conventions
  • Cost-sensitive workflows (token optimization)

Try Context Engineering for Free

Snipara's free plan includes 100 queries per month — enough to see the difference on a real project. No credit card required.

Quick Start

```bash
# Claude Code (recommended)
/plugin marketplace add Snipara/snipara-claude
/snipara:quickstart

# VS Code
ext install snipara.snipara
```