This document helps you choose the right answer strategy for your use case.
Best for: General-purpose Q&A, complex questions, high-quality answers

Configuration: `llama-3.3-70b-versatile`

Performance:
- Quality: ⭐⭐⭐⭐⭐ (5/5)
- Speed: ⭐⭐⭐ (3/5)
- Cost: ⭐⭐⭐ (3/5)

Pros:
Cons:
Use cases:
Best for: Simple factual questions, quick lookups, high-volume traffic
Configuration: `llama-3.1-8b-instant`

Expected Performance:
- Quality: ⭐⭐⭐⭐ (4/5)
- Speed: ⭐⭐⭐⭐⭐ (5/5)
- Cost: ⭐⭐⭐⭐⭐ (5/5)
Best for:
Best for: Very complex questions requiring lots of context
Configuration: `mixtral-8x7b-32768`

Expected Performance:
- Quality: ⭐⭐⭐⭐⭐ (5/5)
- Speed: ⭐⭐ (2/5)
- Cost: ⭐⭐ (2/5)
Best for:
Best for: Code generation and technical implementation questions
Configuration: `llama-3.3-70b-versatile`

Expected Performance:
- Quality: ⭐⭐⭐⭐⭐ (5/5) for code
- Speed: ⭐⭐⭐ (3/5)
- Cost: ⭐⭐⭐ (3/5)
Best for:
Best for: Academic/research use, when you need to cite sources
Configuration: `llama-3.3-70b-versatile`

Expected Performance:
- Quality: ⭐⭐⭐⭐⭐ (5/5)
- Speed: ⭐⭐⭐ (3/5)
- Cost: ⭐⭐⭐ (3/5)
Best for:
| Strategy | Speed | Quality | Cost | Context | Best For |
|---|---|---|---|---|---|
| llama-3.3-70b-default ✅ | Medium | Excellent | Medium | 5 pages | General purpose, complex Q&A |
| llama-3.1-8b-fast | Fast | Good | Low | 2-3 pages | Simple lookups, high volume |
| mixtral-8x7b-extended | Slow | Excellent | High | 10 pages | Very complex questions |
| llama-3.3-70b-code | Medium | Excellent | Medium | 5 pages | Code generation |
| citation-mode | Medium | Excellent | Medium | 5 pages | Research, documentation |
Is the question simple and factual?
→ llama-3.1-8b-fast (when implemented)

Does it require lots of documentation context?
→ mixtral-8x7b-extended (when implemented)

Is it primarily about code/implementation?
→ llama-3.3-70b-code (when implemented)

Do you need citations?
→ citation-mode (when implemented)

Otherwise → llama-3.3-70b-default ✅

Factual Questions ("What is X?")
How-To Questions ("How do I X?")
Complex Questions ("How do I build X with Y and Z?")
Code Examples ("Show me code for X")
Comparison Questions ("What's the difference between X and Y?")
Need fastest response (<1s goal):
- llama-3.1-8b-fast (when implemented)
- Set `maxContextPages` to 2

Need best quality (no time constraint):
- mixtral-8x7b-extended (when implemented)
- Set `maxContextPages` to 10 and `minScore` to 60+

Balanced (1-3s acceptable):
- llama-3.3-70b-default ✅ (current default)

High volume, many simple questions:
- llama-3.1-8b-fast

Low volume, complex questions:
- llama-3.3-70b-default or mixtral-8x7b-extended

Mixed traffic:
Approximate token costs per answer (based on Groq pricing):
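Per-token prices change over time, so a minimal sketch of estimating cost per answer is shown below; the prices are placeholder assumptions, not current Groq rates:

```typescript
// Hypothetical per-million-token prices — check current Groq rates before use.
const PRICE_PER_MTOKEN = { input: 0.59, output: 0.79 };

// Estimate the dollar cost of one answer from its token usage.
function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * PRICE_PER_MTOKEN.input +
      outputTokens * PRICE_PER_MTOKEN.output) / 1_000_000
  );
}
```

Token usage per answer is dominated by the context pages sent as input, which is why the tuning knobs below target context size.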
To reduce cost, lower `maxContextPages` or raise `minScore` (fewer pages in context).

Copy template:
```shell
cp answer/llama-3.3-70b-default.ts answer/my-strategy.ts
```
Modify configuration:
```typescript
const DEFAULT_MODEL = "your-model";
const DEFAULT_MAX_CONTEXT_PAGES = 3;
const DEFAULT_TEMPERATURE = 0.5;
```
Update strategy metadata:
```typescript
export const answerStrategy: AnswerStrategy = {
  name: "my-strategy",
  description: "What makes it special",
  answer: async (query, options) => { /* ... */ }
};
```
Activate:
```typescript
// answer/index.ts
import { answerStrategy } from "./my-strategy.ts";
```
Test:
```shell
curl "http://localhost:8000/answer/info"
curl "http://localhost:8000/answer?q=test"
```
Create a comparison test script:
```typescript
// testing/answer-comparison.ts
import { questions } from "./questions.ts";

const strategies = [
  "llama-3.3-70b-default",
  "llama-3.1-8b-fast",
  "mixtral-8x7b-extended"
];

for (const strategy of strategies) {
  for (const question of questions) {
    // Switch strategy
    // Run question
    // Collect metrics
  }
}

// Compare results
```
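One way to fill in the "collect metrics" step is to time each question end-to-end against the local `/answer` endpoint; the `answerUrl` helper and base URL below are assumptions for this sketch, not part of the existing API:

```typescript
// Hypothetical helper: build the query URL for the local answer endpoint.
function answerUrl(base: string, q: string): string {
  return `${base}/answer?q=${encodeURIComponent(q)}`;
}

// Time a single question end-to-end, in milliseconds.
async function timeQuestion(base: string, q: string): Promise<number> {
  const start = performance.now();
  await fetch(answerUrl(base, q));
  return performance.now() - start;
}
```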
Response time breakdown:
Quality metrics:
Cost metrics:
Usage patterns:
Speed: set `enableTiming: true`
Quality:
Cost:
Automatically select strategy based on question analysis:
```typescript
async function smartRoute(question: string) {
  const complexity = analyzeComplexity(question);
  const type = classifyQuestion(question);

  if (complexity === "simple" && type === "factual") {
    return "llama-3.1-8b-fast";
  } else if (type === "code") {
    return "llama-3.3-70b-code";
  } else if (complexity === "complex") {
    return "mixtral-8x7b-extended";
  }
  return "llama-3.3-70b-default";
}
```
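`analyzeComplexity` and `classifyQuestion` are left undefined above; a minimal keyword-based sketch might look like this (the heuristics and thresholds are illustrative assumptions, not the project's actual logic):

```typescript
// Illustrative heuristic: long or multi-part questions count as complex.
function analyzeComplexity(question: string): "simple" | "complex" {
  const clauses = question.split(/\band\b|,/).length;
  return question.length > 120 || clauses > 2 ? "complex" : "simple";
}

// Illustrative heuristic: classify by keywords and question openers.
function classifyQuestion(question: string): "factual" | "code" | "other" {
  const q = question.toLowerCase();
  if (/\b(code|implement|function|snippet)\b/.test(q)) return "code";
  if (q.startsWith("what is") || q.startsWith("who")) return "factual";
  return "other";
}
```

In practice these heuristics could be replaced by a cheap classifier model, trading a small extra latency for better routing accuracy.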
Cache answers for common questions:
```typescript
const cachedAnswer = await cache.get(questionHash);
if (cachedAnswer) return cachedAnswer;

const answer = await generateAnswer(question);
await cache.set(questionHash, answer, ttl);
```
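`questionHash` is not defined in the snippet above; one way to derive a stable cache key is to normalize the question text and hash it (the normalization and 32-bit FNV-1a hash here are an assumption for illustration):

```typescript
// Normalize the question, then hash with 32-bit FNV-1a for a stable cache key.
function questionKey(q: string): string {
  const norm = q.trim().toLowerCase().replace(/\s+/g, " ");
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < norm.length; i++) {
    h ^= norm.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return h.toString(16).padStart(8, "0");
}
```

Normalizing first means trivially different phrasings ("What is X?" vs. "what is x?") share a cache entry.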
Support conversation context:
```typescript
await answerQuestion(question, {
  conversationHistory: previousQA,
  maxContextPages: 3 // Reduced to fit history
});
```
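As history grows it has to be trimmed to fit the reduced context budget; a hedged sketch that keeps the most recent turns under a rough character budget (the `QAPair` shape and the budget value are assumptions):

```typescript
interface QAPair { question: string; answer: string; }

// Keep the most recent turns whose combined size fits a rough character budget.
function trimHistory(history: QAPair[], maxChars = 4000): QAPair[] {
  const kept: QAPair[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const size = history[i].question.length + history[i].answer.length;
    if (used + size > maxChars) break;
    kept.unshift(history[i]); // prepend to preserve chronological order
    used += size;
  }
  return kept;
}
```

A token-based budget (via the model's tokenizer) would be more precise; characters are used here only to keep the sketch dependency-free.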