Answer strategies implement Retrieval-Augmented Generation (RAG) to answer user questions using documentation search and LLM inference.
Each answer strategy follows this general pattern:
- Search: Use the active search strategy to find relevant documentation pages
- Retrieve: Get full content from the top N search results
- Format: Format the documentation as context for the LLM
- Generate: Call the Groq LLM with the context and user question to generate an answer
The active strategy is selected in answer/index.ts by uncommenting the desired import.
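The selection file might look roughly like this sketch (the contents are an assumption for illustration, not copied from the repo):

```ts
// answer/index.ts — leave exactly one strategy uncommented (illustrative sketch).
export { answerStrategy } from "./llama-3.3-70b-default.ts";
// export { answerStrategy } from "./llama-3.1-8b-fast.ts";
// export { answerStrategy } from "./citation-mode.ts";
```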
llama-3.3-70b-default - Default strategy using Llama 3.3 70B
- Uses up to 5 documentation pages in context
- Truncates each page to ~2000 tokens to manage the context window
- Temperature: 0.3 (relatively focused/deterministic)
- Model: `llama-3.3-70b-versatile`
- Best for: General purpose Q&A with a good balance of quality and speed
- Model: `llama-3.3-70b-versatile`
- Context: Up to 5 pages, 2000 tokens each
- Temperature: 0.3 (default)
- Performance: ~1-3s depending on search and LLM response time
Configuration options:
```ts
{
  limit: 10,                        // Search results to consider
  minScore: 0,                      // Minimum search score
  maxContextPages: 5,               // Pages to include in context
  temperature: 0.3,                 // LLM temperature
  model: "llama-3.3-70b-versatile"  // Override model
}
```
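As a usage sketch, a caller could override any of these per request (the import path is assumed; the option names are the ones listed above):

```ts
import { answerStrategy } from "./answer/index.ts";

// Ask for a tighter, more deterministic answer by overriding a few options.
const result = await answerStrategy.answer("How do I stream chat completions?", {
  limit: 15,          // consider more search results...
  minScore: 50,       // ...but only keep highly relevant pages
  maxContextPages: 3, // tighter context
  temperature: 0.2,
});

console.log(result.answer);
```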
To create a new answer strategy:
- Create a new file in `answer/` (e.g., `llama-3.1-8b-fast.ts`)
- Implement the `AnswerStrategy` interface:
```ts
import type { AnswerStrategy, AnswerResult, AnswerOptions } from "./types.ts";

export const answerStrategy: AnswerStrategy = {
  name: "your-strategy-name",
  description: "Brief description of your strategy",
  answer: async (query: string, options: AnswerOptions = {}): Promise<AnswerResult> => {
    // Your implementation here
    // 1. Search for relevant pages
    // 2. Get full content
    // 3. Format as context
    // 4. Call LLM
    // 5. Return AnswerResult
  },
};
```
- Add the import to `answer/index.ts` and comment out other strategies
- Test your strategy
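Putting the steps together, a complete strategy might look roughly like the sketch below. The search and content helpers (`getSearchStrategy`, `getPageContent`), the import paths, and the exact `AnswerResult` fields are assumptions inferred from this document rather than the project's actual code; the Groq call uses the `groq-sdk` chat-completions client.

```ts
// answer/example-strategy.ts — illustrative sketch only; helper names and paths are assumed.
import Groq from "groq-sdk";
import type { AnswerStrategy, AnswerResult, AnswerOptions } from "./types.ts";
import { getSearchStrategy } from "../search/index.ts"; // hypothetical search entry point
import { getPageContent } from "../docs/content.ts";    // hypothetical content fetcher
import { cleanupMarkdownLinks } from "./utils.ts";      // import path assumed

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY }); // runtime/env handling assumed

export const answerStrategy: AnswerStrategy = {
  name: "example-strategy",
  description: "Sketch of the search → retrieve → format → generate pipeline",
  answer: async (query: string, options: AnswerOptions = {}): Promise<AnswerResult> => {
    const { limit = 10, minScore = 0, maxContextPages = 5, temperature = 0.3,
            model = "llama-3.3-70b-versatile" } = options;

    // 1. Search for relevant pages, keeping only sufficiently relevant ones
    const results = (await getSearchStrategy().search(query, { limit }))
      .filter((r) => r.score >= minScore)
      .slice(0, maxContextPages);

    // 2–3. Get full content and format it as context
    const context = (await Promise.all(results.map(async (r) =>
      `# ${r.title}\nURL: ${r.url}\n\n${await getPageContent(r.path)}`
    ))).join("\n\n---\n\n");

    // 4. Call the LLM with the documentation context and the user question
    const completion = await groq.chat.completions.create({
      model,
      temperature,
      messages: [
        { role: "system", content: "Answer questions about the Groq API using only the provided documentation. Use markdown. Do not include .md extensions in links." },
        { role: "user", content: `Documentation:\n${context}\n\nQuestion: ${query}` },
      ],
    });

    // 5. Return the result (field names are inferred from the example response below)
    return {
      answer: cleanupMarkdownLinks(completion.choices[0]?.message?.content ?? ""),
      query,
      searchResults: results,
      contextUsed: results.length,
      metadata: { strategy: "example-strategy", model, temperature },
    };
  },
};
```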
Here are some ideas for additional strategies you might want to implement:
- llama-3.1-8b-fast: Faster responses with smaller model, fewer pages
- mixtral-8x7b-extended: More context pages with Mixtral's larger context window
- llama-3.3-70b-code-focused: Specialized prompts for code examples and API usage
- multi-step-reasoning: Break down complex questions into sub-questions
- citation-mode: Include specific citations/references in the answer
- conversational: Support follow-up questions with conversation history
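Some of these could likely be thin wrappers around an existing strategy. For example, a hypothetical llama-3.1-8b-fast variant might just delegate to the default strategy with different defaults (the import path and model name are assumptions):

```ts
// answer/llama-3.1-8b-fast.ts — sketch of a wrapper strategy, not actual project code.
import type { AnswerStrategy, AnswerOptions } from "./types.ts";
import { answerStrategy as defaultStrategy } from "./llama-3.3-70b-default.ts";

export const answerStrategy: AnswerStrategy = {
  name: "llama-3.1-8b-fast",
  description: "Faster answers: smaller model, fewer context pages",
  answer: (query: string, options: AnswerOptions = {}) =>
    defaultStrategy.answer(query, {
      model: "llama-3.1-8b-instant", // smaller Groq-hosted model
      maxContextPages: 2,
      ...options, // caller overrides still win
    }),
};
```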
```
GET /answer?q=How+do+I+use+the+Groq+API
GET /answer?q=What+models+are+available&maxContextPages=3&temperature=0.5
GET /answer/info
```
{ "answer": "markdown formatted answer", "query": "user's original question", "searchResults": [ { "path": "api/reference", "url": "https://...", "title": "API Reference", "score": 95.2 } ], "contextUsed": 5, "totalTokens": 8500, "metadata": { "strategy": "llama-3.3-70b-default", "model": "llama-3.3-70b-versatile", "temperature": 0.3, "searchResultsCount": 10, "timings": { "search": 45.2, "contextPrep": 5.1, "llm": 1250.3, "total": 1300.6 } } }
- More pages = more comprehensive but slower and more expensive
- Fewer pages = faster but might miss relevant info
- Recommended: 3-5 pages for most questions
- Per page: 1000-3000 tokens is a good range
- Total context: Stay under 32k tokens for most models
- Llama 3.3 70B supports up to 128k context, but quality may degrade
- 0.0-0.3: Focused, deterministic answers (good for facts)
- 0.4-0.7: Balanced creativity and accuracy
- 0.8-1.0: More creative but less consistent (avoid for docs Q&A)
- Higher limit: More pages to choose from (10-20 recommended)
- Higher minScore: Only use highly relevant pages (50-70 threshold)
- Balance precision (fewer, highly relevant pages) against recall (more pages, at the cost of some noise)
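One way to apply these budgets inside a strategy, sketched with a rough 4-characters-per-token heuristic (the heuristic, helper names, and defaults are illustrative, not project code):

```ts
// Rough token budgeting: ~4 characters per token is a common approximation.
const approxTokens = (text: string) => Math.ceil(text.length / 4);

const truncateToTokens = (text: string, maxTokens = 2000) =>
  approxTokens(text) <= maxTokens ? text : text.slice(0, maxTokens * 4) + "\n\n[truncated]";

// Keep only sufficiently relevant pages, then cap how many go into context.
const pickContextPages = <T extends { score: number }>(
  results: T[],
  { minScore = 50, maxContextPages = 5 } = {},
) => results.filter((r) => r.score >= minScore).slice(0, maxContextPages);
```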
- Match model to task:
  - Simple factual questions: Smaller/faster models
  - Complex reasoning: Larger models (70B+)
  - Code generation: Models fine-tuned for code
- System prompts matter:
  - Be specific about the documentation domain (Groq API)
  - Set expectations for answer format (markdown, citations, etc.)
  - Provide guidelines for handling missing information
  - Include URL formatting instructions (e.g., "remove .md extensions")
- Manage context window:
  - Don't just dump all pages - be selective
  - Truncate long pages intelligently
  - Consider summarizing very long pages first
- Error handling:
  - Gracefully handle search failures
  - Provide fallback responses when LLM fails
  - Include relevant metadata for debugging
- Post-processing:
  - Use `cleanupMarkdownLinks()` to remove .md extensions from generated links
  - Add custom post-processing for your specific needs
  - Validate links and citations if needed
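The system-prompt guidance above might translate into something like this sketch (the wording is illustrative, not the project's actual prompt):

```ts
// Illustrative system prompt; adjust to your documentation domain and format rules.
const SYSTEM_PROMPT = `You are a documentation assistant for the Groq API.
Answer only from the provided documentation pages; if the answer is not covered, say so
and suggest where the user might look instead.
Format answers in markdown, with code examples where helpful.
When linking to documentation, use the page URLs as given and remove any .md extension.`;
```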
Answers automatically clean up markdown links to remove .md extensions:
```
// Before cleanup
[Compound](https://console.groq.com/docs/agentic-tooling/compound-beta.md)

// After cleanup
[Compound](https://console.groq.com/docs/agentic-tooling/compound-beta)
```
This happens through:
- System prompt: Instructs the LLM to avoid .md extensions
- Post-processing: The `cleanupMarkdownLinks()` function removes any remaining .md extensions
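The post-processing step amounts to rewriting markdown link targets. A minimal sketch of that idea (not the actual `cleanupMarkdownLinks()` implementation) could be a single regex pass:

```ts
// Strip a trailing ".md" from markdown link targets,
// e.g. [X](.../page.md) -> [X](.../page) and [X](.../page.md#section) -> [X](.../page#section).
const stripMdExtensions = (markdown: string): string =>
  markdown.replace(/(\]\([^)\s]+?)\.md(\)|#)/g, "$1$2");
```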
To disable cleanup (not recommended):
```ts
// In your custom strategy, skip the cleanup:
return {
  answer: llmResponse, // Don't call cleanupMarkdownLinks()
  // ... rest of result
};
```
Test your strategy with various question types:
- Factual: "What models does Groq support?"
- How-to: "How do I authenticate with the API?"
- Comparison: "What's the difference between model X and Y?"
- Code examples: "Show me an example of streaming responses"
- Complex: "How do I implement rate limiting with retries?"
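A quick way to exercise these categories is a small script that hits the endpoint for each one and prints the timing breakdown (the base URL is an assumption; the fields read are from the example response above):

```ts
// Smoke-test each question type against a running answer endpoint (URL assumed).
const questions = [
  "What models does Groq support?",
  "How do I authenticate with the API?",
  "What's the difference between model X and Y?",
  "Show me an example of streaming responses",
  "How do I implement rate limiting with retries?",
];

for (const q of questions) {
  const res = await fetch("http://localhost:8000/answer?q=" + encodeURIComponent(q));
  const { metadata } = await res.json();
  console.log(`${q} -> ${metadata.timings.total.toFixed(0)}ms (llm ${metadata.timings.llm.toFixed(0)}ms)`);
}
```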
Target performance for answer strategies:
- Search: 50-500ms (depends on active search strategy)
- Context prep: <50ms
- LLM call: 500-3000ms (depends on model and response length)
- Total: 1-4s for most queries
Strategies taking >5s should be optimized or marked as "detailed" mode.
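One hedged way to check a strategy against these targets locally, calling it directly rather than over HTTP (the import path and metadata shape are assumptions):

```ts
import { answerStrategy } from "./answer/index.ts";

// Time a direct strategy call and compare against the 5s threshold above.
const start = performance.now();
const result = await answerStrategy.answer("What models does Groq support?");
const elapsed = performance.now() - start;

console.log(`${answerStrategy.name}: ${elapsed.toFixed(0)}ms`, result.metadata?.timings);
if (elapsed > 5000) console.warn("Over 5s — optimize or label this strategy as a 'detailed' mode.");
```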
