general/llm-context-engineering
LLM Context Engineering
multichainguide🤖 Auto-generatedconfidence mediumhealth 100%
v1.0.0·by AgentRel Community·Updated 3/20/2026
Overview
Context engineering is the practice of carefully constructing the input to a language model to maximize output quality. As context windows grow (Claude: 200K tokens, GPT-4o: 128K), effectively managing what goes in — and what doesn't — is a core skill for building reliable AI applications.
Context Window Fundamentals
Total context = system prompt + conversation history + retrieved docs + tools/schema + output
Token budget example (200K window):
- System prompt: ~1K tokens
- Tools/function schemas: ~2-5K tokens
- Retrieved context (RAG): ~20-50K tokens
- Conversation history: ~10-30K tokens
- Reserved for output: ~4-8K tokens
System Prompt Best Practices
You are [specific role]. Your job is to [specific task].
## Constraints
- Always [hard rule 1]
- Never [hard rule 2]
- When uncertain, [fallback behavior]
## Output Format
Respond in [format]. Example:
[concrete example of desired output]
## Context
[Background information that doesn't change]
Key principles:
- Be specific, not vague — "Answer in 2-3 sentences" not "Be concise"
- Use examples — Show don't tell; one good example > 100 words of description
- Separate instructions from data — Use XML tags or headers to delimit sections
- Put critical instructions last — Recency bias means later instructions are followed more reliably
RAG vs Fine-Tuning Decision Matrix
| Scenario | Recommendation |
|---|---|
| Dynamic / frequently updated data | RAG |
| Proprietary knowledge base | RAG |
| Style/tone/format changes | Fine-tuning |
| Domain-specific reasoning patterns | Fine-tuning |
| Both knowledge + behavior changes | RAG + Fine-tuning |
| Cost-sensitive production | Fine-tuning (smaller model) |
RAG Implementation Pattern
// 1. Chunk documents (overlap to preserve context)
const chunks = splitText(document, { chunkSize: 512, overlap: 50 })
// 2. Embed and store
const embeddings = await embed(chunks)
await vectorDB.upsert(embeddings)
// 3. Retrieve at query time
const query = userMessage
const relevant = await vectorDB.query(query, { topK: 5, minScore: 0.7 })
// 4. Construct context
const context = relevant.map(r => r.text).join('\n\n---\n\n')
// 5. Build prompt with retrieved context
const prompt = `
<context>
${context}
</context>
User question: ${query}
Answer based only on the context above. If not found, say so.
`
Context Compression Techniques
1. Summarization — Compress old conversation turns:
if (tokenCount(history) > 50_000) {
const summary = await llm.summarize(history.slice(0, -10))
history = [{ role: 'system', content: `Previous context: ${summary}` }, ...history.slice(-10)]
}
2. Selective retrieval — Only include relevant chunks, not entire documents
3. Structured extraction — Pre-extract key facts into structured format before adding to context
Prompt Injection Defense
// Never interpolate untrusted user input directly into system prompts
// ❌ Bad
const systemPrompt = `You are a helpful assistant. User info: ${userInput}`
// ✅ Good — separate system from user data
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: userInput }, // untrusted content in user turn
]
Measuring Context Quality
- Faithfulness: Does the output match the provided context?
- Answer relevance: Does the output address the actual question?
- Context recall: Were the relevant chunks actually retrieved?
- Use RAGAS for automated RAG evaluation