A Hono API server that fetches, caches, and processes Groq documentation pages with token counting and AI-generated metadata.
- Fetches documentation pages from Groq's console
- Caches page content, metadata, token counts, and embeddings in SQLite
- Token counting using tiktoken (GPT-4 encoding)
- AI-generated metadata (categories, tags, use cases, sample questions)
- Content embeddings generation with multiple strategies (local ONNX, Transformers.js, API-based)
- Semantic search with configurable strategies (embeddings + cosine similarity)
- RAG-based question answering with configurable answer strategies (search + LLM)
- Hash-based change detection to skip unchanged pages during recalculation
- Rate limiting with async-sema to avoid WAF blocking
- RESTful API endpoints for accessing pages, search, and Q&A
- Modular code structure with pluggable strategies
On first run, the cache will be empty. You should populate it by running:
GET /cache/recalculate
This will:
- Fetch all pages from the URLs list
- Calculate token counts for each page
- Generate AI metadata (categories, tags, use cases, questions)
- Generate embeddings for each page
- Calculate content hashes for change detection
- Store everything in the SQLite cache
- Return a summary of what was cached
Important: This will take some time as it processes all pages, generates metadata, and calculates tokens for each. Be patient!
Note: On subsequent runs, unchanged pages (detected by content hash) will be automatically skipped unless you use force mode.
Check that the cache was populated:
GET /cache/stats
This returns:
{ "cachedPages": 121, "totalTokens": 1234567 }
You should run /cache/recalculate in these scenarios:
- First time setup - Cache is empty
- URL list changes - You've added or removed URLs from the `urls` array
- Content updates - Documentation pages have been updated and you want fresh data
- Token count needed - You need accurate token counts for new content
- Metadata refresh - You want to regenerate AI metadata or embeddings
By default, /cache/recalculate uses hash-based change detection:
GET /cache/recalculate
Behavior:
- Fetches each page and calculates its content hash (SHA-256)
- Compares hash with cached version
- Skips pages with unchanged content (saves time and API calls)
- Only processes pages that have changed
- Still generates embeddings and metadata for changed pages
Response includes:
- `processed` - Number of pages actually processed
- `skipped` - Number of pages skipped (unchanged)
- `force` - Always `false` in default mode
To force recalculation of all pages (ignoring hash checks):
GET /cache/recalculate?force=true
Use cases:
- Regenerating all metadata/embeddings even if content unchanged
- After updating metadata generation prompts
- When you want to ensure everything is fresh
For single page updates, you can use:
GET /cache/clear/:path
This clears the cache for a specific page. The next time that page is requested via /page/:path, it will be fetched fresh and recached.
- Weekly: Run recalculate (default mode) to catch any documentation updates efficiently
- After major docs changes: Use force mode to regenerate everything
- When adding new pages: Update the `urls` array, then run recalculate
Get the root docs page (cached if available).
Get a specific page by path. Examples:
- `/page/api-reference`
- `/page/agentic-tooling/compound-beta`
- `/page/model/llama-3.1-8b-instant`
Response includes:
- `url` - The source URL
- `content` - Full page content with frontmatter
- `charCount` - Character count
- `tokenCount` - Token count (calculated with tiktoken)
- All frontmatter fields flattened (title, description, image, etc.)
Caching: The first request fetches the page and caches it; subsequent requests are served instantly from the cache.
Get a list of all available page paths.
Response:
[ "docs", "agentic-tooling", "api-reference", ... ]
Search pages by query string.
Query Parameters:
- `q` (required) - Search query string
- `limit` (optional) - Maximum number of results (default: 10)
- `minScore` (optional) - Minimum score threshold (default: 0)
Example:
GET /search?q=authentication&limit=5
Response:
{ "query": "authentication", "results": [ { "path": "api-reference", "url": "https://console.groq.com/docs/api-reference.md", "title": "API Reference", "score": 45, "snippet": "...authentication tokens are required for all API requests..." }, { "path": "quickstart", "url": "https://console.groq.com/docs/quickstart.md", "title": "Quick Start", "score": 32, "snippet": "...get your API key for authentication..." } ], "totalResults": 2, "totalPages": 121 }
Search Features:
- Keyword matching in titles and content
- Metadata boost (tags, categories, use cases)
- Score-based ranking
- Content snippets around matches
- Uses cached pages when available for faster results
Note: Currently uses embeddings-based semantic search. Multiple strategies available (see Search section).
Answer questions using RAG (Retrieval-Augmented Generation).
Query Parameters:
- `q` (required) - Question to answer
- `limit` (optional) - Max search results to consider (default: 10)
- `minScore` (optional) - Minimum search score threshold (default: 0)
- `maxContextPages` (optional) - Max pages to include in LLM context (default: 5)
- `temperature` (optional) - LLM temperature 0-1 (default: 0.3)
- `model` (optional) - Override LLM model (default: llama-3.3-70b-versatile)
Example:
GET /answer?q=How+do+I+authenticate+with+the+API&maxContextPages=5
Response:
{ "answer": "To authenticate with the Groq API, you need to...", "query": "How do I authenticate with the API?", "searchResults": [ { "path": "api-reference", "url": "https://console.groq.com/docs/api-reference.md", "title": "API Reference", "score": 92.5 } ], "contextUsed": 5, "totalTokens": 8500, "metadata": { "strategy": "llama-3.3-70b-default", "model": "llama-3.3-70b-versatile", "temperature": 0.3, "searchResultsCount": 10, "timings": { "search": 45.2, "contextPrep": 5.1, "llm": 1250.3, "total": 1300.6 } } }
How it works:
- Uses active search strategy to find relevant documentation
- Retrieves full content from top N results
- Formats documentation as context for the LLM
- Calls Groq API (Llama 3.3 70B) to generate an answer
- Returns markdown-formatted answer with sources
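The flow above can be sketched in a few lines. This is an illustrative sketch only: the `Doc` shape, prompt text, and option handling are assumptions, while the Groq endpoint URL, model name, and default temperature come from this README.

```ts
// Minimal RAG sketch (not the project's exact implementation).
// Assumes search results of shape { path, url, title, score, content }
// and a GROQ_API_KEY environment variable.
type Doc = { path: string; url: string; title: string; score: number; content: string };

async function answerQuestion(query: string, docs: Doc[], maxContextPages = 5) {
  // Keep only the top-N pages and format them as context for the LLM
  const context = docs
    .slice(0, maxContextPages)
    .map((d) => `# ${d.title} (${d.url})\n${d.content}`)
    .join("\n\n---\n\n");

  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${Deno.env.get("GROQ_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile",
      temperature: 0.3,
      messages: [
        { role: "system", content: "Answer using only the provided documentation. Cite page URLs." },
        { role: "user", content: `Documentation:\n${context}\n\nQuestion: ${query}` },
      ],
    }),
  });

  const data = await res.json();
  return data.choices[0].message.content as string;
}
```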
See /answer/ folder for available strategies and documentation.
Get information about the active answer strategy.
Response:
{ "strategy": { "name": "llama-3.3-70b-default", "description": "RAG using active search strategy + Llama 3.3 70B with up to 5 doc pages in context" }, "defaultOptions": { "model": "llama-3.3-70b-versatile", "temperature": 0.3, "maxContextPages": 5 }, "availableParams": { "q": "Query string (required)", "limit": "Max search results to consider (default: 10)", "minScore": "Minimum search score threshold (default: 0)", "maxContextPages": "Max pages to include in LLM context (default: 5)", "temperature": "LLM temperature (default: 0.3)", "model": "Override LLM model (optional)" } }
Run test queries against the active answer strategy.
Response:
{ "strategy": { "name": "llama-3.3-70b-default", "description": "..." }, "totalQueries": 1, "tests": [ { "query": "What is Compound and how does it work?", "answer": "markdown formatted answer...", "searchResults": [ { "path": "agentic-tooling/compound-beta", "url": "https://...", "title": "Compound", "score": 95.2 } ], "contextUsed": 5, "totalTokens": 8500, "durationMs": 1250.5, "timings": { "search": 45.2, "contextPrep": 5.1, "llm": 1200.2, "total": 1250.5 } } ], "summary": { "totalDurationMs": 1250.5, "avgDurationMs": 1250.5, "avgSearchMs": 45.2, "avgContextPrepMs": 5.1, "avgLlmMs": 1200.2, "avgTotalMs": 1250.5, "totalContextUsed": 5, "totalTokens": 8500, "errors": 0 } }
Get metadata for all pages (does not use cache - fetches fresh).
Response:
{ "pages": [ { "url": "...", "charCount": 1234, "frontmatter": {...} } ], "contents": [...], "totalPages": 121, "totalChars": 1234567 }
Get cache statistics.
Response:
{ "cachedPages": 121, "totalTokens": 1234567 }
Clear the entire cache.
Response:
{ "message": "Cache cleared", "success": true }
Clear cache for a specific page.
Example:
GET /cache/clear/api-reference
Response:
{ "message": "Cache cleared for api-reference", "success": true }
Recalculate pages with AI metadata and embeddings generation.
Query Parameters:
- `force` (optional) - Set to `true` to force recalculation of all pages, ignoring hash checks
Default Mode (no query params):
GET /cache/recalculate
Force Mode:
GET /cache/recalculate?force=true
Response (Default Mode):
{ "message": "Recalculated 5 pages, skipped 116 unchanged pages", "results": [ { "path": "api-reference", "url": "https://console.groq.com/docs/api-reference.md", "charCount": 1234, "tokenCount": 567, "title": "API Reference", "metadata": { "categories": ["API", "Reference"], "tags": ["api", "endpoints", "rest"], "useCases": ["Integrating with Groq API"], "questions": ["How do I authenticate?", "What endpoints are available?"] } }, { "path": "docs", "skipped": true, "reason": "Content unchanged (hash matches)" } ], "totalPages": 121, "processed": 5, "skipped": 116, "withMetadata": 5, "withoutMetadata": 0, "cached": true, "force": false }
Response (Force Mode):
{ "message": "Recalculated 121 pages with AI metadata (force mode)", "results": [...], "totalPages": 121, "processed": 121, "skipped": 0, "force": true }
What it does:
- Fetches all pages (or skips unchanged ones in default mode)
- Calculates token counts
- Generates AI metadata (categories, tags, use cases, questions)
- Generates embeddings using the active search strategy
- Calculates content hashes for change detection
- Stores everything in cache
Important: This can take several minutes depending on:
- Number of pages to process (skipped pages are fast)
- Network speed
- Token calculation time
- AI metadata generation time (uses Groq API)
1. First Request:
   - Check cache → Not found
   - Fetch from URL
   - Calculate tokens
   - Store in cache
   - Return data
2. Subsequent Requests:
   - Check cache → Found
   - Return cached data immediately
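The cache-or-fetch pattern above looks roughly like this. The helper names come from `utils.ts` as described later in this README, but their exact signatures are assumptions; treat this as a sketch, not the project's code.

```ts
// Assumed simplified signatures for helpers named in utils.ts.
import { getFromCache, setCache, getTextFromUrl, calculateTokenCount } from "./utils.ts";

async function getPage(url: string) {
  const cached = await getFromCache(url);
  if (cached) return cached; // subsequent requests: served from cache

  // First request: fetch, count tokens, cache, return
  const content = await getTextFromUrl(url);
  const tokenCount = calculateTokenCount(content);
  const page = { url, content, charCount: content.length, tokenCount, cachedAt: Date.now() };
  await setCache(url, page);
  return page;
}
```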
Cache is stored in SQLite with the following schema:
CREATE TABLE groq_docs_cache_v3 (
url TEXT PRIMARY KEY,
content TEXT NOT NULL,
charCount INTEGER NOT NULL,
tokenCount INTEGER,
frontmatter TEXT NOT NULL,
metadata TEXT,
contentHash TEXT,
embeddings TEXT,
cachedAt INTEGER NOT NULL
)
Fields:
- `url` - Source URL (primary key)
- `content` - Full page content with frontmatter
- `charCount` - Character count
- `tokenCount` - Token count (calculated with tiktoken)
- `frontmatter` - Parsed frontmatter (JSON)
- `metadata` - AI-generated metadata (categories, tags, use cases, questions)
- `contentHash` - SHA-256 hash of content (for change detection)
- `embeddings` - Content embeddings vector (JSON array)
- `cachedAt` - Timestamp when cached
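For reference, a row of this table can be expressed as a TypeScript type. This is illustrative: the field types follow the schema above, with the JSON columns shown in their parsed form rather than as stored text.

```ts
// Illustrative shape of a parsed groq_docs_cache_v3 row (not the project's own type).
interface CachedPage {
  url: string;                          // source URL (primary key)
  content: string;                      // full page content with frontmatter
  charCount: number;
  tokenCount: number | null;
  frontmatter: Record<string, unknown>; // stored as JSON text, parsed on read
  metadata: {
    categories: string[];
    tags: string[];
    useCases: string[];
    questions: string[];
  } | null;
  contentHash: string | null;           // SHA-256 digest of content
  embeddings: number[] | null;          // stored as a JSON array
  cachedAt: number;                     // timestamp when cached
}
```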
Cache is invalidated when:
- You manually clear it via `/cache/clear`
- You recalculate via `/cache/recalculate`
- Cache is cleared for a specific page via `/cache/clear/:path`
Note: Cache does NOT automatically expire. If documentation changes, you must manually recalculate.
1. Add the URL to the `urls` array in `main.tsx`:
   const urls = [
     // ... existing URLs
     "https://console.groq.com/docs/new-page.md",
   ];
2. Run recalculate: `GET /cache/recalculate`
3. Verify: `GET /cache/stats` and `GET /list` (the list should include your new page)
Token counts are calculated using tiktoken with the gpt-4 encoding (cl100k_base). This is the same encoding used by:
- GPT-4
- GPT-3.5-turbo
- Many other OpenAI models
Token counts are:
- Calculated on first fetch
- Stored in cache
- Returned in API responses
- Expensive to compute (which is why caching is important)
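A token count with the cl100k_base encoding can be computed as below. The js-tiktoken binding and npm specifier are assumptions for illustration; the project may use a different tiktoken package, but the encoding name matches the one described above.

```ts
// Token counting sketch with js-tiktoken and the cl100k_base encoding.
import { getEncoding } from "npm:js-tiktoken";

const enc = getEncoding("cl100k_base");

function calculateTokenCount(text: string): number {
  return enc.encode(text).length;
}

console.log(calculateTokenCount("How do I authenticate with the Groq API?"));
```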
Each page can have AI-generated metadata using Groq's chat completions API:
- Categories: 2-4 broad categories (e.g., "API", "Authentication", "Models")
- Tags: 5-10 specific tags/keywords
- Use Cases: 2-4 practical use cases or scenarios
- Questions: 5-10 questions users might ask
Metadata is generated during /cache/recalculate and stored in the cache.
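A metadata-generation call might look like the sketch below. The prompt wording, truncation, and use of `response_format` are assumptions; only the Groq endpoint, model name, and the four metadata fields come from this README.

```ts
// Illustrative metadata generation via Groq's OpenAI-compatible API.
async function generatePageMetadata(content: string) {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${Deno.env.get("GROQ_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile",
      temperature: 0.3,
      response_format: { type: "json_object" },
      messages: [
        {
          role: "system",
          content:
            "Return JSON with keys categories (2-4), tags (5-10), useCases (2-4), and questions (5-10) describing this documentation page.",
        },
        { role: "user", content: content.slice(0, 8000) }, // truncation limit is arbitrary here
      ],
    }),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content);
}
```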
The API includes a search endpoint (/search) that allows you to search across all documentation pages using various semantic search strategies.
The search system supports multiple strategies that can be switched by commenting/uncommenting imports in search/index.ts. Each strategy has different trade-offs in terms of speed, accuracy, and infrastructure requirements.
File: search/transformers-local-onnx.ts
Pre-downloaded ONNX models for the fastest embedding generation with zero network overhead.
Performance: ~10-30ms per query (after initial ~50ms model load)
Advantages:
- ✅ No network calls - works completely offline
- ✅ No downloads on first run - instant startup
- ✅ No isolate loading delays - perfect for serverless
- ✅ Same accuracy as the cached version
- ✅ Perfect for production - predictable performance
Setup:
- Download the model: `cd search/models && ./download-model.sh`
- Activate it in `search/index.ts`: `import { searchStrategy, generateEmbeddings } from "./transformers-local-onnx.ts";`
Requirements: ~23MB disk space for model files
See search/models/SETUP.md for detailed setup instructions.
File: search/transformers-cosine.ts
Uses Transformers.js with automatic model downloading from Hugging Face.
Performance:
- First run: ~3-5s (downloads ~23MB model)
- Cached: ~150ms model load + ~10-30ms per query
Advantages:
- ✅ No API keys needed
- ✅ Works in browser and Deno
- ✅ Automatic caching
Disadvantages:
- ❌ Slow first run (downloads model)
- ❌ Isolate loading delays in serverless environments
- ❌ May not work in some restricted environments
All require API keys but offer different trade-offs:
| Strategy | File | Speed | Cost | Pros |
|---|---|---|---|---|
| Mixedbread | mixedbread-embeddings-cosine.ts | ~50-100ms | Free tier | High quality, 1024 dims |
| OpenAI | openai-cosine.ts | ~100-200ms | Paid | High quality, reliable |
| HuggingFace | hf-inference-qwen3-cosine.ts | ~150-300ms | Free tier | Qwen3-8B model |
| Cloudflare | cloudflare-bge-cosine.ts | ~50-150ms | Free tier | Works on CF Workers |
| JigsawStack | jigsawstack-orama.ts | ~550ms | Free tier | Managed search |
Edit search/index.ts and comment/uncomment the desired strategy:
// Comment out current strategy
// import { searchStrategy, generateEmbeddings } from "./transformers-cosine.ts";
// Uncomment desired strategy
import { searchStrategy, generateEmbeddings } from "./transformers-local-onnx.ts";
The search system uses semantic embeddings for intelligent search:
- Understands meaning, not just keywords
- Finds relevant results even with different wording
- Returns ranked results with similarity scores
- Includes content snippets with highlighted matches
- Uses cosine similarity for fast comparison
- Embedding Generation: Content is converted to 384-dimensional vectors
- Cosine Similarity: Query embeddings compared against page embeddings
- Ranking: Results sorted by similarity score
- Snippet Generation: Context-aware snippets around relevant content
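The similarity and ranking steps above reduce to a standard cosine-similarity computation over the stored vectors. This is the textbook formulation, not necessarily the project's exact code in `search/utils.ts`.

```ts
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank cached pages against a query embedding (highest similarity first).
function rank(queryEmbedding: number[], pages: { path: string; embeddings: number[] }[]) {
  return pages
    .map((p) => ({ path: p.path, score: cosineSimilarity(queryEmbedding, p.embeddings) }))
    .sort((a, b) => b.score - a.score);
}
```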
The API includes answer generation using Retrieval-Augmented Generation (RAG) - combining semantic search with LLM inference to answer questions about the documentation.
- Search: Use active search strategy to find relevant documentation pages
- Retrieve: Get full content from top N search results
- Format: Package documentation as context for the LLM
- Generate: Call Groq LLM to generate an answer based on the context
Answer strategies are located in the /answer/ folder and can be switched by editing answer/index.ts.
File: answer/llama-3.3-70b-default.ts
Uses Groq's Llama 3.3 70B model with up to 5 documentation pages in context.
Performance: ~1-3s total (depends on search + LLM response time)
Configuration:
- Model: `llama-3.3-70b-versatile`
- Max context pages: 5
- Tokens per page: ~2000
- Temperature: 0.3 (focused/deterministic)
Advantages:
- ✅ High-quality answers with good reasoning
- ✅ Handles complex questions well
- ✅ Includes relevant citations
- ✅ Markdown-formatted responses
Usage:
- Basic question: `GET /answer?q=How+do+I+use+streaming`
- With options: `GET /answer?q=What+models+are+available&maxContextPages=3&temperature=0.5`
- Different model: `GET /answer?q=Quick+question&model=llama-3.1-8b-instant`
You can create custom strategies for different use cases:
Ideas for new strategies:
- llama-3.1-8b-fast: Faster responses with smaller model (good for simple questions)
- mixtral-8x7b-extended: More context pages with Mixtral's larger context window
- llama-3.3-70b-code: Specialized prompts for code examples and API usage
- citation-mode: Include specific citations/references in answers
- multi-step: Break down complex questions into sub-questions
See answer/README.md and answer/QUICK-START.md for detailed documentation and guides.
Each strategy implements:
- Search integration: Uses active search strategy from `/search/`
- Context management: Formats pages to fit context window
- LLM configuration: Model selection, temperature, prompts
- Response formatting: Structures answer with metadata
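A custom strategy along those lines might take the following shape. The interface is hypothetical (the real one lives in `answer/types.ts` and may differ), and the example is a skeleton rather than a working strategy.

```ts
// Hypothetical strategy interface; see answer/types.ts for the real definitions.
interface AnswerStrategy {
  name: string;
  description: string;
  answer(query: string, options?: {
    limit?: number;
    minScore?: number;
    maxContextPages?: number;
    temperature?: number;
    model?: string;
  }): Promise<{ answer: string; searchResults: unknown[]; totalTokens: number }>;
}

// Example skeleton: a faster variant built on a smaller model (one of the ideas above).
export const answerStrategy: AnswerStrategy = {
  name: "llama-3.1-8b-fast",
  description: "Smaller model, fewer context pages, quicker answers",
  async answer(query, options = {}) {
    // ...search, build context, call the LLM with model "llama-3.1-8b-instant"...
    throw new Error("sketch only");
  },
};
```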
For better answers:
- Increase `maxContextPages` (more documentation context)
- Raise `minScore` (only use highly relevant pages)
- Use larger models (70B+ for complex reasoning)
- Lower temperature (0.2-0.3 for factual accuracy)
For faster responses:
- Decrease `maxContextPages` (less context to process)
- Use smaller models (8B for simple questions)
- Use faster search strategies
For creative responses:
- Increase temperature (0.6-0.8)
- Use models good at creative writing
- Adjust system prompts
Content embeddings are generated for each page using the active search strategy (see Search section above).
Current Default: Local ONNX models (transformers-local-onnx.ts)
- Model: all-MiniLM-L6-v2
- Dimensions: 384
- Generation: ~10-30ms per page
- Storage: Cached as JSON arrays in SQLite
Embeddings are:
- Generated during `/cache/recalculate`
- Stored in cache for fast retrieval
- Used for semantic search and similarity matching
- Portable across different strategies (same dimensions)
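Embedding generation with this model can be sketched with Transformers.js as below. The npm specifier and usage follow the Transformers.js feature-extraction API; the local-ONNX strategy loads the same model from disk instead of downloading it, so treat this as illustrative rather than the project's exact code.

```ts
// Embedding sketch with Transformers.js and all-MiniLM-L6-v2 (384 dimensions).
import { pipeline } from "npm:@xenova/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function generateEmbeddings(text: string): Promise<number[]> {
  // Mean-pool and normalize token embeddings into a single 384-dim vector
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

const vector = await generateEmbeddings("How do I authenticate with the Groq API?");
console.log(vector.length); // 384
```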
Content hashes (SHA-256) are calculated and stored for each page. This enables:
- Smart recalculation: Skip unchanged pages automatically
- Efficient updates: Only process pages that have actually changed
- Performance: Significantly faster recalculation when most content is unchanged
Hashes are compared during /cache/recalculate (default mode) to determine if a page needs reprocessing.
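The hash check described above amounts to a SHA-256 digest plus a comparison, for example via the Web Crypto API available in Deno. The project's `calculateContentHash` may differ in detail; this is a sketch.

```ts
// SHA-256 content hash as a hex string, using the Web Crypto API.
async function calculateContentHash(content: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(content));
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// During recalculation: reprocess only if there is no cached hash or the hash changed.
async function needsReprocessing(content: string, cachedHash: string | null): Promise<boolean> {
  return cachedHash === null || (await calculateContentHash(content)) !== cachedHash;
}
```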
Run /cache/recalculate to refresh everything.
- Check `/list` to see if the path exists
- Verify the URL is in the `urls` array
- Ensure the path matches the URL structure (e.g., `api-reference` for `/docs/api-reference.md`)
- Clear cache for that page: `GET /cache/clear/:path`
- Request the page again: `GET /page/:path`
- Or recalculate everything: `GET /cache/recalculate`
- Use `/page/:path` endpoints (cached) instead of `/data` (uncached)
- Check cache stats: `GET /cache/stats`
- Ensure cache is populated before production use
The codebase is organized into modular files:
- `main.tsx` - Main Hono app, routes, and URL definitions
- `utils.ts` - Utility functions:
  - Cache management (getFromCache, setCache, clearCache, getCacheStats)
  - Content fetching (getTextFromUrl)
  - Frontmatter parsing (parseFrontmatter, addUrlSourceToFrontmatter)
  - Token counting (calculateTokenCount)
  - Hash calculation (calculateContentHash)
  - Rate limiting for fetches
- `groq.ts` - Groq API functions:
  - Chat completions (groqChatCompletion)
  - Metadata generation (generatePageMetadata)
- `search/` - Search strategies with pluggable implementations:
  - `index.ts` - Main entry point, switches between strategies
  - `types.ts` - Type definitions for search
  - `utils.ts` - Shared utilities (cosine similarity, snippets)
  - Multiple strategy files (transformers-local-onnx, mixedbread, openai, etc.)
- `answer/` - Answer strategies with pluggable RAG implementations:
  - `index.ts` - Main entry point, switches between strategies
  - `types.ts` - Type definitions for answers
  - `utils.ts` - Shared utilities (context formatting, token estimation)
  - Multiple strategy files (llama-3.3-70b-default, etc.)
- Start the server: `deno task serve`
- Or run it manually: `deno run --allow-net --allow-env main.tsx`
Note: SQLite caching is automatically disabled when running locally (detected via valtown environment variable). The app will work without caching, but cache-related endpoints will return appropriate messages.
The project includes several convenience tasks defined in deno.json:
- Start the development server: `deno task serve`
- `deno task recalc` - Recalculate with the active search strategy (smart mode, skips unchanged pages)
- `deno task recalc-f` - Force recalculation (recalculates all pages)
- `deno task recalc-mxbai` - Recalculate with the Mixedbread embeddings strategy
- `deno task recalc-mxbai-f` - Force recalculation with Mixedbread embeddings
- `deno task search` - Test the search strategy with a detailed timing breakdown
- `deno task answer` - Test the answer strategy with a detailed timing breakdown and search results
Test Output Features:
- ⏱️ Comprehensive timing breakdown (search, context prep, LLM call, total)
- 📊 Shows search results used for context
- 💬 Displays generated answers (truncated for readability)
- 📈 Summary statistics (averages, totals)
- 🎯 Strategy information
Example test output:
⏱️ Timing breakdown:
Search: 45.2ms
Context prep: 5.1ms
LLM call: 1200.2ms
Total: 1250.5ms
📚 Search results used (top 5):
✓ 1. Compound
Path: agentic-tooling/compound-beta
Score: 95.20
💬 Generated Answer:
──────────────────────────────────────────────────────────────────────────────
Compound is a beta feature...
(answer continues)
──────────────────────────────────────────────────────────────────────────────
The app is configured to work with Val Town. Export uses:
export default (typeof Deno !== "undefined" && Deno.env.get("valtown")) ? app.fetch : app;
SQLite caching is automatically enabled when running in Val Town (detected via valtown environment variable).
- `GROQ_API_KEY` - Required for AI metadata generation (if not set, metadata generation is disabled)
- `valtown` - Automatically set by Val Town (used to detect the environment)
- Use default recalculate mode - Automatically skips unchanged pages
- Cache is your friend - Always populate cache before production use
- Rate limiting - Built-in rate limiting prevents WAF blocking (1 request per 3 seconds for docs, 2 requests per second for Groq API)
- Hash checking - Default recalculation mode is much faster when most content is unchanged
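The built-in rate limiting can be sketched with async-sema's `RateLimit` helper. The limit below (1 request per 3 seconds for docs fetches) mirrors the numbers stated above, but the project's actual configuration and wiring may differ.

```ts
// Rate-limited docs fetch using async-sema.
import { RateLimit } from "npm:async-sema";

// Allow 1 call per 3000 ms time unit
const docsLimit = RateLimit(1, { timeUnit: 3000 });

async function fetchDoc(url: string): Promise<string> {
  await docsLimit(); // wait for a slot before hitting the docs site
  const res = await fetch(url);
  return await res.text();
}
```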
