
LLM Context Engineering

v1.0.0·by AgentRel Community·Updated 3/20/2026

Overview

Context engineering is the practice of carefully constructing the input to a language model to maximize output quality. As context windows grow (Claude: 200K tokens, GPT-4o: 128K), effectively managing what goes in — and what doesn't — is a core skill for building reliable AI applications.

Context Window Fundamentals

Total context = system prompt + conversation history + retrieved docs + tools/schema + output

Token budget example (200K window):

  • System prompt: ~1K tokens
  • Tools/function schemas: ~2-5K tokens
  • Retrieved context (RAG): ~20-50K tokens
  • Conversation history: ~10-30K tokens
  • Reserved for output: ~4-8K tokens
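
The budget above can be sketched as a simple pre-flight check. This is a minimal sketch that assumes a rough chars/4 token estimate; a real tokenizer (e.g. tiktoken) is more accurate, and the component names are illustrative.

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Heuristic only; use a real tokenizer for precise budgeting.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Check that all prompt components plus reserved output fit the window.
function fitsWindow({ system, tools, retrieved, history },
                    { window = 200_000, reserveOutput = 8_000 } = {}) {
  const used = [system, tools, retrieved, history]
    .reduce((sum, part) => sum + estimateTokens(part), 0);
  return {
    used,
    remaining: window - reserveOutput - used,
    fits: used + reserveOutput <= window,
  };
}
```

Running the check before each request makes overflow a handled case rather than a truncation surprise.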

System Prompt Best Practices

You are [specific role]. Your job is to [specific task].

## Constraints
- Always [hard rule 1]
- Never [hard rule 2]
- When uncertain, [fallback behavior]

## Output Format
Respond in [format]. Example:
[concrete example of desired output]

## Context
[Background information that doesn't change]

Key principles:

  1. Be specific, not vague — "Answer in 2-3 sentences" not "Be concise"
  2. Use examples — Show don't tell; one good example > 100 words of description
  3. Separate instructions from data — Use XML tags or headers to delimit sections
  4. Put critical instructions last — Recency bias means later instructions are followed more reliably
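
Principle 3 can be sketched as a small helper that wraps variable data in delimiter tags, keeping it structurally separate from instructions (the tag names and example strings here are illustrative):

```javascript
// Wrap data in XML-style tags so the model can distinguish
// instructions from data. Tag name is arbitrary but should be consistent.
function delimit(tag, content) {
  return `<${tag}>\n${content}\n</${tag}>`;
}

const prompt = [
  'Summarize the document below in 2-3 sentences.',
  delimit('document', 'Q3 revenue grew 12% year over year...'),
  'Respond with the summary only.',  // critical instruction goes last
].join('\n\n');
```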

RAG vs Fine-Tuning Decision Matrix

| Scenario | Recommendation |
| --- | --- |
| Dynamic / frequently updated data | RAG |
| Proprietary knowledge base | RAG |
| Style/tone/format changes | Fine-tuning |
| Domain-specific reasoning patterns | Fine-tuning |
| Both knowledge + behavior changes | RAG + Fine-tuning |
| Cost-sensitive production | Fine-tuning (smaller model) |

RAG Implementation Pattern

// 1. Chunk documents (overlap to preserve context)
const chunks = splitText(document, { chunkSize: 512, overlap: 50 })

// 2. Embed and store (keep the text alongside each vector so it can be returned at query time)
const embeddings = await embed(chunks)
await vectorDB.upsert(chunks.map((text, i) => ({ id: String(i), vector: embeddings[i], text })))

// 3. Retrieve at query time
const query = userMessage
const relevant = await vectorDB.query(query, { topK: 5, minScore: 0.7 })

// 4. Construct context
const context = relevant.map(r => r.text).join('\n\n---\n\n')

// 5. Build prompt with retrieved context
const prompt = `
<context>
${context}
</context>

User question: ${query}

Answer based only on the context above. If not found, say so.
`
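
The `splitText` helper in step 1 is assumed above; a minimal character-based version with overlap might look like this (production chunkers split on token, sentence, or paragraph boundaries instead):

```javascript
// Naive character-based chunker with overlap. chunkSize and overlap are
// measured in characters here; the pattern above treats them as tokens.
function splitText(text, { chunkSize = 512, overlap = 50 } = {}) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;  // last chunk reached
  }
  return chunks;
}
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk.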

Context Compression Techniques

1. Summarization — Compress old conversation turns:

if (tokenCount(history) > 50_000) {
  const summary = await llm.summarize(history.slice(0, -10))
  history = [{ role: 'system', content: `Previous context: ${summary}` }, ...history.slice(-10)]
}

2. Selective retrieval — Only include relevant chunks, not entire documents

3. Structured extraction — Pre-extract key facts into structured format before adding to context
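
When summarization is too slow or costly, a cheaper fallback is to simply keep the most recent turns that fit a token budget. A minimal sketch, using the same rough chars/4 estimate as above (message shape is illustrative):

```javascript
// Keep the most recent messages that fit within maxTokens, walking
// backward from the newest. Older turns are dropped (or handed to a
// summarizer, as in technique 1).
function trimHistory(messages, maxTokens) {
  const kept = [];
  let budget = maxTokens;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = Math.ceil(messages[i].content.length / 4);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(messages[i]);
  }
  return kept;
}
```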

Prompt Injection Defense

// Never interpolate untrusted user input directly into system prompts
// ❌ Bad
const systemPrompt = `You are a helpful assistant. User info: ${userInput}`

// ✅ Good — separate system from user data
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: userInput },  // untrusted content in user turn
]
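
When untrusted text must still be interpolated into a delimited prompt (as in the RAG pattern above), neutralizing the delimiter itself stops the content from "closing" its own tag and posing as instructions. A minimal sketch; the tag name matches the earlier `<context>` wrapper and the replacement token is illustrative:

```javascript
// Neutralize opening/closing tags so untrusted content can't escape its
// <context> wrapper. Illustrative only; defense in depth (user-turn
// placement, output validation) still applies.
function escapeDelimiter(text, tag = 'context') {
  const pattern = new RegExp(`</?\\s*${tag}\\s*>`, 'gi');
  return text.replace(pattern, '[removed-tag]');
}
```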

Measuring Context Quality

  • Faithfulness: Does the output match the provided context?
  • Answer relevance: Does the output address the actual question?
  • Context recall: Were the relevant chunks actually retrieved?
  • Use RAGAS for automated RAG evaluation
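
Context recall, for example, reduces to a set computation once you have labeled relevance judgments for each test query (RAGAS automates LLM-judged versions of these metrics; this hand-rolled sketch assumes chunk ids as labels):

```javascript
// Context recall: fraction of known-relevant chunk ids that actually
// appeared in the retrieved set. Requires labeled test queries.
function contextRecall(retrievedIds, relevantIds) {
  if (relevantIds.length === 0) return 1;  // nothing to recall
  const retrieved = new Set(retrievedIds);
  const hits = relevantIds.filter((id) => retrieved.has(id)).length;
  return hits / relevantIds.length;
}
```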
