This document helps you choose the right answer strategy for your use case.
llama-3.3-70b-default ✅ (current default)
Best for: General-purpose Q&A, complex questions, high-quality answers
Configuration:
- Model: llama-3.3-70b-versatile
- Max context pages: 5
- Temperature: 0.3
- Tokens per page: ~2000
Performance:
- Search: 50-500ms (depends on search strategy)
- Context prep: <50ms
- LLM call: 500-3000ms
- Total: ~1-3.5s
Quality: ⭐⭐⭐⭐⭐ (5/5) Speed: ⭐⭐⭐ (3/5) Cost: ⭐⭐⭐ (3/5)
Pros:
- ✅ Excellent reasoning and comprehension
- ✅ Handles complex multi-part questions
- ✅ Good at code examples and technical details
- ✅ Consistent, reliable answers
Cons:
- ❌ Slower than 8B models
- ❌ More expensive per token
- ❌ May be overkill for simple questions
Use cases:
- Complex how-to questions
- Multi-step implementations
- Comparing features/models
- Debugging and troubleshooting
- Code generation with explanations
llama-3.1-8b-fast (not yet implemented)
Best for: Simple factual questions, quick lookups, high-volume traffic
Configuration:
- Model: llama-3.1-8b-instant
- Max context pages: 2-3
- Temperature: 0.3
- Tokens per page: ~1000
Expected Performance:
- Search: 50-500ms
- Context prep: <30ms
- LLM call: 200-800ms
- Total: ~0.3-1.3s
Quality: ⭐⭐⭐⭐ (4/5) Speed: ⭐⭐⭐⭐⭐ (5/5) Cost: ⭐⭐⭐⭐⭐ (5/5)
Best for:
- "What models are available?"
- "What is X?"
- "How much does Y cost?"
- Simple API lookups
mixtral-8x7b-extended (not yet implemented)
Best for: Very complex questions requiring lots of context
Configuration:
- Model: mixtral-8x7b-32768
- Max context pages: 10
- Temperature: 0.3
- Tokens per page: ~3000
Expected Performance:
- Search: 50-500ms
- Context prep: ~100ms
- LLM call: 1000-4000ms
- Total: ~1.5-5s
Quality: ⭐⭐⭐⭐⭐ (5/5) Speed: ⭐⭐ (2/5) Cost: ⭐⭐ (2/5)
Best for:
- "Compare all available models"
- "How do I implement a complete chat system with X, Y, Z features?"
- Questions requiring synthesis across many docs
llama-3.3-70b-code (not yet implemented)
Best for: Code generation and technical implementation questions
Configuration:
- Model: llama-3.3-70b-versatile
- Max context pages: 5
- Temperature: 0.2 (more deterministic for code)
- Custom system prompt optimized for code (see the sketch after the use cases below)
Expected Performance:
- Similar to llama-3.3-70b-default
- Total: ~1-3.5s
Quality: ⭐⭐⭐⭐⭐ (5/5) for code Speed: ⭐⭐⭐ (3/5) Cost: ⭐⭐⭐ (3/5)
Best for:
- "Show me example code for X"
- "How do I implement Y in Python/JavaScript?"
- API usage examples
- Debugging code issues
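For reference, the "custom system prompt optimized for code" mentioned in the configuration might look roughly like the sketch below; the wording is purely illustrative and not the prompt actually shipped with the strategy.

```typescript
// Illustrative code-focused system prompt and temperature; not the actual values
// used by the strategy.
const CODE_SYSTEM_PROMPT = `You are a technical assistant answering from the provided documentation.
Prefer complete, runnable code examples with short explanations.
State the language and any required imports. If the docs do not cover something, say so.`;

const DEFAULT_TEMPERATURE = 0.2; // more deterministic output for code
```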
citation-mode (not yet implemented)
Best for: Academic/research use, when you need to cite sources
Configuration:
- Model: llama-3.3-70b-versatile
- Max context pages: 5
- Temperature: 0.3
- Custom system prompt for citations
- Post-processing to add inline citations (see the sketch after the use cases below)
Expected Performance:
- Similar to llama-3.3-70b-default
- +50-100ms for citation processing
- Total: ~1.5-4s
Quality: ⭐⭐⭐⭐⭐ (5/5) Speed: ⭐⭐⭐ (3/5) Cost: ⭐⭐⭐ (3/5)
Best for:
- Documentation generation
- Training materials
- When you need exact source references
- Verifying information
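The "post-processing to add inline citations" step referenced in the configuration is not shown in this guide; one simple approach, sketched with an illustrative Source shape, is to append a numbered source list that inline markers like [1] can point at.

```typescript
// Hypothetical citation post-processing; the Source shape is illustrative only.
interface Source {
  title: string;
  url: string;
}

function addCitations(answer: string, sources: Source[]): string {
  const references = sources
    .map((s, i) => `[${i + 1}] ${s.title} (${s.url})`)
    .join("\n");
  return `${answer}\n\nSources:\n${references}`;
}
```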
| Strategy | Speed | Quality | Cost | Context | Best For |
|---|---|---|---|---|---|
| llama-3.3-70b-default ✅ | Medium | Excellent | Medium | 5 pages | General purpose, complex Q&A |
| llama-3.1-8b-fast | Fast | Good | Low | 2-3 pages | Simple lookups, high volume |
| mixtral-8x7b-extended | Slow | Excellent | High | 10 pages | Very complex questions |
| llama-3.3-70b-code | Medium | Excellent | Medium | 5 pages | Code generation |
| citation-mode | Medium | Excellent | Medium | 5 pages | Research, documentation |
1. Is the question simple and factual?
   - Yes → Use llama-3.1-8b-fast (when implemented)
   - No → Continue
2. Does it require lots of documentation context?
   - Yes → Use mixtral-8x7b-extended (when implemented)
   - No → Continue
3. Is it primarily about code/implementation?
   - Yes → Use llama-3.3-70b-code (when implemented)
   - No → Continue
4. Do you need citations?
   - Yes → Use citation-mode (when implemented)
   - No → Use llama-3.3-70b-default ✅
Factual Questions ("What is X?")
- 🥇 llama-3.1-8b-fast (fast + good enough)
- 🥈 llama-3.3-70b-default (better quality)
How-To Questions ("How do I X?")
- 🥇 llama-3.3-70b-default (balanced)
- 🥈 llama-3.3-70b-code (if code-heavy)
Complex Questions ("How do I build X with Y and Z?")
- 🥇 mixtral-8x7b-extended (more context)
- 🥈 llama-3.3-70b-default (usually sufficient)
Code Examples ("Show me code for X")
- 🥇 llama-3.3-70b-code (optimized)
- 🥈 llama-3.3-70b-default (good enough)
Comparison Questions ("What's the difference between X and Y?")
- 🥇 llama-3.3-70b-default (good reasoning)
- 🥈 mixtral-8x7b-extended (if comparing many things)
Need fastest response (<1s goal)
- Use llama-3.1-8b-fast (when implemented)
- Reduce maxContextPages to 2
- Use fastest search strategy
Need best quality (no time constraint)
- Use mixtral-8x7b-extended (when implemented)
- Increase maxContextPages to 10
- Raise minScore to 60+
Balanced (1-3s acceptable)
- Use llama-3.3-70b-default ✅ (current default)
- Default settings work well
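In code, the speed- and quality-focused settings above might be passed along the following lines, assuming the answerQuestion helper and the option names (maxContextPages, minScore, enableTiming) used elsewhere in this guide; the exact signature is not shown here, so treat this as a sketch.

```typescript
// Sketch only: option names come from this guide, but answerQuestion's exact
// signature is an assumption.

// Speed-focused: smaller context, timing enabled for profiling.
const fastAnswer = await answerQuestion(question, {
  maxContextPages: 2,
  enableTiming: true,
});

// Quality-focused: more context and a stricter relevance threshold.
const bestAnswer = await answerQuestion(question, {
  maxContextPages: 10,
  minScore: 60,
});
```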
High volume, many simple questions
- Implement and use llama-3.1-8b-fast
- Consider caching common answers
- Use rate limiting
Low volume, complex questions
- Use llama-3.3-70b-default or mixtral-8x7b-extended
- Quality over speed
Mixed traffic
- Route simple questions to fast strategy
- Route complex questions to powerful strategy
- Consider implementing multi-strategy endpoint
Approximate token costs per answer (based on Groq pricing):
llama-3.3-70b-default:
- Context: ~10k tokens (5 pages × 2k each)
- Answer: ~500 tokens
- Total: ~10.5k tokens per answer
llama-3.1-8b-fast:
- Context: ~3k tokens (3 pages × 1k each)
- Answer: ~300 tokens
- Total: ~3.3k tokens per answer
- ~3x cheaper than 70B
mixtral-8x7b-extended:
- Context: ~30k tokens (10 pages × 3k each)
- Answer: ~700 tokens
- Total: ~30.7k tokens per answer
- ~3x more expensive than default
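The totals above are simple arithmetic over pages, tokens per page, and answer length; a small helper can reproduce them and turn them into a rough cost once you plug in a real per-million-token price (the price used below is a placeholder, not actual Groq pricing).

```typescript
// Rough per-answer token and cost estimate. The price argument is a placeholder;
// look up current Groq pricing for the model you use.
function estimateAnswerCost(
  contextPages: number,
  tokensPerPage: number,
  answerTokens: number,
  pricePerMillionTokens: number,
): { tokens: number; costUsd: number } {
  const tokens = contextPages * tokensPerPage + answerTokens;
  return { tokens, costUsd: (tokens / 1_000_000) * pricePerMillionTokens };
}

// llama-3.3-70b-default: 5 pages × 2000 tokens + 500 answer tokens ≈ 10.5k tokens.
console.log(estimateAnswerCost(5, 2000, 500, 1.0)); // 1.0 USD per M tokens is illustrative
```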
Cost reduction tips:
- Use smaller models for simple questions
- Reduce maxContextPages
- Increase minScore (fewer pages in context)
- Cache common questions/answers
- Implement tiered routing
1. Copy template:

   ```bash
   cp answer/llama-3.3-70b-default.ts answer/my-strategy.ts
   ```

2. Modify configuration:

   ```typescript
   const DEFAULT_MODEL = "your-model";
   const DEFAULT_MAX_CONTEXT_PAGES = 3;
   const DEFAULT_TEMPERATURE = 0.5;
   ```

3. Update strategy metadata:

   ```typescript
   export const answerStrategy: AnswerStrategy = {
     name: "my-strategy",
     description: "What makes it special",
     answer: async (query, options) => { /* ... */ }
   };
   ```

4. Activate:

   ```typescript
   // answer/index.ts
   import { answerStrategy } from "./my-strategy.ts";
   ```

5. Test:

   ```bash
   curl "http://localhost:8000/answer/info"
   curl "http://localhost:8000/answer?q=test"
   ```
Create a comparison test script:
```typescript
// testing/answer-comparison.ts
import { questions } from "./questions.ts";

const strategies = [
  "llama-3.3-70b-default",
  "llama-3.1-8b-fast",
  "mixtral-8x7b-extended"
];

for (const strategy of strategies) {
  for (const question of questions) {
    // Switch strategy
    // Run question
    // Collect metrics
  }
}

// Compare results
```
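One way to fill in the "run question" and "collect metrics" steps is to time each request against the running server; the sketch below records only latency and HTTP status, so it makes no assumptions about the response body.

```typescript
// Times a single question against the local /answer endpoint shown earlier.
// Only latency and status are recorded.
async function timeQuestion(question: string) {
  const start = performance.now();
  const res = await fetch(
    `http://localhost:8000/answer?q=${encodeURIComponent(question)}`,
  );
  await res.text(); // drain the body so the measurement covers the full response
  return {
    question,
    status: res.status,
    ms: Math.round(performance.now() - start),
  };
}
```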
- Response time breakdown:
  - Search duration
  - Context prep duration
  - LLM duration
  - Total duration
- Quality metrics:
  - User feedback/ratings
  - Answer correctness
  - Relevance of sources
- Cost metrics:
  - Tokens per answer
  - Cost per answer
  - Monthly spend
- Usage patterns:
  - Question types
  - Model distribution
  - Cache hit rates
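If you log these per answer, a record type along these lines keeps the fields consistent; the shape is illustrative, not an existing type in the codebase.

```typescript
// Illustrative per-answer metrics record; field names mirror the list above.
interface AnswerMetrics {
  // Response time breakdown (ms)
  searchMs: number;
  contextPrepMs: number;
  llmMs: number;
  totalMs: number;
  // Cost
  tokensPerAnswer: number;
  costPerAnswer: number;
  // Usage patterns
  strategy: string;
  questionType?: string;
  cacheHit?: boolean;
  // Quality (e.g. user rating or thumbs up/down)
  rating?: number;
}
```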
- Speed:
  - Profile with enableTiming: true
  - Optimize slowest component
  - Consider caching
- Quality:
  - Test with real questions
  - Tune system prompts
  - Adjust context size
- Cost:
  - Route by question complexity
  - Cache common answers
  - Use smaller models when possible
Automatically select strategy based on question analysis:
```typescript
async function smartRoute(question: string) {
  const complexity = analyzeComplexity(question);
  const type = classifyQuestion(question);

  if (complexity === "simple" && type === "factual") {
    return "llama-3.1-8b-fast";
  } else if (type === "code") {
    return "llama-3.3-70b-code";
  } else if (complexity === "complex") {
    return "mixtral-8x7b-extended";
  }
  return "llama-3.3-70b-default";
}
```
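analyzeComplexity and classifyQuestion are left undefined above; one simple, keyword-based way to sketch them (purely illustrative, and cruder than a real classifier) is:

```typescript
// Naive heuristics for the helpers used by smartRoute (illustrative only).
function classifyQuestion(question: string): "factual" | "code" | "other" {
  const q = question.toLowerCase();
  if (/\b(code|example|implement|snippet|function)\b/.test(q)) return "code";
  if (/^(what is|what are|which|how much)\b/.test(q)) return "factual";
  return "other";
}

function analyzeComplexity(question: string): "simple" | "complex" {
  // Long questions, or ones chaining several requirements, tend to need more context.
  const clauses = question.split(/\band\b|,|;/).length;
  return question.length > 120 || clauses >= 3 ? "complex" : "simple";
}
```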
Cache answers for common questions:
```typescript
const cachedAnswer = await cache.get(questionHash);
if (cachedAnswer) return cachedAnswer;

const answer = await generateAnswer(question);
await cache.set(questionHash, answer, ttl);
```
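The cache and questionHash used above are not defined in this guide; a minimal in-process stand-in (SHA-256 of the normalized question plus a Map with per-entry TTL) could look like this sketch. Swap it for Redis or similar in production.

```typescript
// Minimal in-process cache sketch; illustrative, not the project's actual cache.
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();

  get(key: string): T | undefined {
    const hit = this.entries.get(key);
    if (!hit || Date.now() > hit.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}

// Stable key for "the same question": SHA-256 of the normalized text.
async function questionHash(question: string): Promise<string> {
  const bytes = new TextEncoder().encode(question.trim().toLowerCase());
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```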
Support conversation context:
```typescript
await answerQuestion(question, {
  conversationHistory: previousQA,
  maxContextPages: 3 // Reduced to fit history
});
```
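How conversationHistory ends up in the model call is not specified here; a common pattern, sketched below under the assumption of an OpenAI-style chat message array (which Groq's chat completions endpoint accepts), is to replay prior turns before the new question.

```typescript
// Builds a chat message array from prior Q&A turns plus the new question.
// Assumes OpenAI-style messages; the actual strategy code may structure this differently.
interface QA {
  question: string;
  answer: string;
}

function buildMessages(systemPrompt: string, history: QA[], question: string) {
  return [
    { role: "system", content: systemPrompt },
    ...history.flatMap((turn) => [
      { role: "user", content: turn.question },
      { role: "assistant", content: turn.answer },
    ]),
    { role: "user", content: question },
  ];
}
```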
- Groq Models: https://console.groq.com/docs/models
- Groq Pricing: https://groq.com/pricing/
- Model Benchmarks: https://console.groq.com/docs/benchmarks
