Groq Docs API

A Hono API server that fetches, caches, and processes Groq documentation pages with token counting and AI-generated metadata.

Features

Fetches documentation pages from Groq's console
Caches page content, metadata, token counts, and embeddings in SQLite
Token counting using tiktoken (GPT-4 encoding)
AI-generated metadata (categories, tags, use cases, sample questions)
Content embeddings generation with multiple strategies (local ONNX, Transformers.js, API-based)
Semantic search with configurable strategies (embeddings + cosine similarity)
RAG-based question answering with configurable answer strategies (search + LLM)
Hash-based change detection to skip unchanged pages during recalculation
Rate limiting with async-sema to avoid WAF blocking
RESTful API endpoints for accessing pages, search, and Q&A
Modular code structure with pluggable strategies

First-Time Setup

1. Initial Cache Population

On first run, the cache will be empty. You should populate it by running:

GET /cache/recalculate

This will:

Fetch all pages from the URLs list
Calculate token counts for each page
Generate AI metadata (categories, tags, use cases, questions)
Generate embeddings for each page
Calculate content hashes for change detection
Store everything in the SQLite cache
Return a summary of what was cached

Important: This will take some time as it processes all pages, generates metadata, and calculates tokens for each. Be patient!

Note: On subsequent runs, unchanged pages (detected by content hash) will be automatically skipped unless you use force mode.

2. Verify Cache

Check that the cache was populated:

GET /cache/stats

This returns:

{
  "cachedPages": 121,
  "totalTokens": 1234567
}

When to Recalculate

You should run /cache/recalculate in these scenarios:

✅ Required Recalculations

First time setup - Cache is empty
URL list changes - You've added or removed URLs from the urls array
Content updates - Documentation pages have been updated and you want fresh data
Token count needed - You need accurate token counts for new content
Metadata refresh - You want to regenerate AI metadata or embeddings

🔄 Default Mode (Smart Recalculation)

By default, /cache/recalculate uses hash-based change detection:

GET /cache/recalculate

Behavior:

Fetches each page and calculates its content hash (SHA-256)
Compares hash with cached version
Skips pages with unchanged content (saves time and API calls)
Only processes pages that have changed
Still generates embeddings and metadata for changed pages

Response includes:

processed - Number of pages actually processed
skipped - Number of pages skipped (unchanged)
force - Always false in default mode

⚡ Force Mode (Recalculate Everything)

To force recalculation of all pages (ignoring hash checks):

GET /cache/recalculate?force=true

Use cases:

Regenerating all metadata/embeddings even if content unchanged
After updating metadata generation prompts
When you want to ensure everything is fresh

⚠️ Partial Updates

For single page updates, you can use:

GET /cache/clear/:path

This clears the cache for a specific page. The next time that page is requested via /page/:path, it will be fetched fresh and recached.

🔄 Routine Maintenance

Weekly: Run recalculate (default mode) to catch any documentation updates efficiently
After major docs changes: Use force mode to regenerate everything
When adding new pages: Update the urls array, then run recalculate

API Endpoints

Page Endpoints

`GET /page/docs`

Get the root docs page (cached if available).

`GET /page/:path`

Get a specific page by path. Examples:

/page/api-reference
/page/agentic-tooling/compound-beta
/page/model/llama-3.1-8b-instant

Response includes:

url - The source URL
content - Full page content with frontmatter
charCount - Character count
tokenCount - Token count (calculated with tiktoken)
All frontmatter fields flattened (title, description, image, etc.)

Caching: Responses are cached. First request fetches and caches, subsequent requests are instant.

`GET /list`

Get a list of all available page paths.

Response:

[
  "docs",
  "agentic-tooling",
  "api-reference",
  ...
]

`GET /search`

Search pages by query string.

Query Parameters:

q (required) - Search query string
limit (optional) - Maximum number of results (default: 10)
minScore (optional) - Minimum score threshold (default: 0)

Example:

GET /search?q=authentication&limit=5

Response:

{
  "query": "authentication",
  "results": [
    {
      "path": "api-reference",
      "url": "https://console.groq.com/docs/api-reference.md",
      "title": "API Reference",
      "score": 45,
      "snippet": "...authentication tokens are required for all API requests..."
    },
    {
      "path": "quickstart",
      "url": "https://console.groq.com/docs/quickstart.md",
      "title": "Quick Start",
      "score": 32,
      "snippet": "...get your API key for authentication..."
    }
  ],
  "totalResults": 2,
  "totalPages": 121
}

Search Features:

Keyword matching in titles and content
Metadata boost (tags, categories, use cases)
Score-based ranking
Content snippets around matches
Uses cached pages when available for faster results

Note: Currently uses embeddings-based semantic search. Multiple strategies available (see Search section).

`GET /answer`

Answer questions using RAG (Retrieval-Augmented Generation).

Query Parameters:

q (required) - Question to answer
limit (optional) - Max search results to consider (default: 10)
minScore (optional) - Minimum search score threshold (default: 0)
maxContextPages (optional) - Max pages to include in LLM context (default: 5)
temperature (optional) - LLM temperature 0-1 (default: 0.3)
model (optional) - Override LLM model (default: llama-3.3-70b-versatile)

Example:

GET /answer?q=How+do+I+authenticate+with+the+API&maxContextPages=5

Response:

{
  "answer": "To authenticate with the Groq API, you need to...",
  "query": "How do I authenticate with the API?",
  "searchResults": [
    {
      "path": "api-reference",
      "url": "https://console.groq.com/docs/api-reference.md",
      "title": "API Reference",
      "score": 92.5
    }
  ],
  "contextUsed": 5,
  "totalTokens": 8500,
  "metadata": {
    "strategy": "llama-3.3-70b-default",
    "model": "llama-3.3-70b-versatile",
    "temperature": 0.3,
    "searchResultsCount": 10,
    "timings": {
      "search": 45.2,
      "contextPrep": 5.1,
      "llm": 1250.3,
      "total": 1300.6
    }
  }
}

How it works:

Uses active search strategy to find relevant documentation
Retrieves full content from top N results
Formats documentation as context for the LLM
Calls Groq API (Llama 3.3 70B) to generate an answer
Returns markdown-formatted answer with sources

See /answer/ folder for available strategies and documentation.

`GET /answer/info`

Get information about the active answer strategy.

Response:

{
  "strategy": {
    "name": "llama-3.3-70b-default",
    "description": "RAG using active search strategy + Llama 3.3 70B with up to 5 doc pages in context"
  },
  "defaultOptions": {
    "model": "llama-3.3-70b-versatile",
    "temperature": 0.3,
    "maxContextPages": 5
  },
  "availableParams": {
    "q": "Query string (required)",
    "limit": "Max search results to consider (default: 10)",
    "minScore": "Minimum search score threshold (default: 0)",
    "maxContextPages": "Max pages to include in LLM context (default: 5)",
    "temperature": "LLM temperature (default: 0.3)",
    "model": "Override LLM model (optional)"
  }
}

`GET /answer/test`

Run test queries against the active answer strategy.

Response:

{
  "strategy": {
    "name": "llama-3.3-70b-default",
    "description": "..."
  },
  "totalQueries": 1,
  "tests": [
    {
      "query": "What is Compound and how does it work?",
      "answer": "markdown formatted answer...",
      "searchResults": [
        {
          "path": "agentic-tooling/compound-beta",
          "url": "https://...",
          "title": "Compound",
          "score": 95.2
        }
      ],
      "contextUsed": 5,
      "totalTokens": 8500,
      "durationMs": 1250.5,
      "timings": {
        "search": 45.2,
        "contextPrep": 5.1,
        "llm": 1200.2,
        "total": 1250.5
      }
    }
  ],
  "summary": {
    "totalDurationMs": 1250.5,
    "avgDurationMs": 1250.5,
    "avgSearchMs": 45.2,
    "avgContextPrepMs": 5.1,
    "avgLlmMs": 1200.2,
    "avgTotalMs": 1250.5,
    "totalContextUsed": 5,
    "totalTokens": 8500,
    "errors": 0
  }
}

`GET /data`

Get metadata for all pages (does not use cache - fetches fresh).

Response:

{
  "pages": [
    {
      "url": "...",
      "charCount": 1234,
      "frontmatter": {...}
    }
  ],
  "contents": [...],
  "totalPages": 121,
  "totalChars": 1234567
}

Cache Management Endpoints

`GET /cache/stats`

Get cache statistics.

Response:

{
  "cachedPages": 121,
  "totalTokens": 1234567
}

`GET /cache/clear`

Clear the entire cache.

Response:

{
  "message": "Cache cleared",
  "success": true
}

`GET /cache/clear/:path`

Clear cache for a specific page.

Example:

GET /cache/clear/api-reference

Response:

{
  "message": "Cache cleared for api-reference",
  "success": true
}

`GET /cache/recalculate`

Recalculate pages with AI metadata and embeddings generation.

Query Parameters:

force (optional): Set to true to force recalculation of all pages, ignoring hash checks

Default Mode (no query params):

GET /cache/recalculate

Force Mode:

GET /cache/recalculate?force=true

Response (Default Mode):

{
  "message": "Recalculated 5 pages, skipped 116 unchanged pages",
  "results": [
    {
      "path": "api-reference",
      "url": "https://console.groq.com/docs/api-reference.md",
      "charCount": 1234,
      "tokenCount": 567,
      "title": "API Reference",
      "metadata": {
        "categories": ["API", "Reference"],
        "tags": ["api", "endpoints", "rest"],
        "useCases": ["Integrating with Groq API"],
        "questions": ["How do I authenticate?", "What endpoints are available?"]
      }
    },
    {
      "path": "docs",
      "skipped": true,
      "reason": "Content unchanged (hash matches)"
    }
  ],
  "totalPages": 121,
  "processed": 5,
  "skipped": 116,
  "withMetadata": 5,
  "withoutMetadata": 0,
  "cached": true,
  "force": false
}

Response (Force Mode):

{
  "message": "Recalculated 121 pages with AI metadata (force mode)",
  "results": [...],
  "totalPages": 121,
  "processed": 121,
  "skipped": 0,
  "force": true
}

What it does:

Fetches all pages (or skips unchanged ones in default mode)
Calculates token counts
Generates AI metadata (categories, tags, use cases, questions)
Generates embeddings (currently fake, ready for Groq API)
Calculates content hashes for change detection
Stores everything in cache

Important: This can take several minutes depending on:

Number of pages to process (skipped pages are fast)
Network speed
Token calculation time
AI metadata generation time (uses Groq API)

Cache Behavior

How Caching Works

First Request:
- Check cache → Not found
- Fetch from URL
- Calculate tokens
- Store in cache
- Return data
Subsequent Requests:
- Check cache → Found
- Return cached data immediately

Cache Storage

Cache is stored in SQLite with the following schema:

CREATE TABLE groq_docs_cache_v3 (
  url TEXT PRIMARY KEY,
  content TEXT NOT NULL,
  charCount INTEGER NOT NULL,
  tokenCount INTEGER,
  frontmatter TEXT NOT NULL,
  metadata TEXT,
  contentHash TEXT,
  embeddings TEXT,
  cachedAt INTEGER NOT NULL
)

Fields:

url - Source URL (primary key)
content - Full page content with frontmatter
charCount - Character count
tokenCount - Token count (calculated with tiktoken)
frontmatter - Parsed frontmatter (JSON)
metadata - AI-generated metadata (categories, tags, use cases, questions)
contentHash - SHA-256 hash of content (for change detection)
embeddings - Content embeddings vector (JSON array)
cachedAt - Timestamp when cached

Cache Invalidation

Cache is invalidated when:

You manually clear it via /cache/clear
You recalculate via /cache/recalculate
Cache is cleared for a specific page via /cache/clear/:path

Note: Cache does NOT automatically expire. If documentation changes, you must manually recalculate.

Adding New Pages

Add URL to the urls array in main.tsx:

const urls = [
  // ... existing URLs
  "https://console.groq.com/docs/new-page.md",
];

Run recalculate:
```
POST /cache/recalculate
```

Verify:

GET /cache/stats
GET /list  # Should include your new page

Token Counting

Token counts are calculated using tiktoken with the gpt-4 encoding (cl100k_base). This is the same encoding used by:

GPT-4
GPT-3.5-turbo
Many other OpenAI models

Token counts are:

Calculated on first fetch
Stored in cache
Returned in API responses
Expensive to compute (which is why caching is important)

AI Metadata Generation

Each page can have AI-generated metadata using Groq's chat completions API:

Categories: 2-4 broad categories (e.g., "API", "Authentication", "Models")
Tags: 5-10 specific tags/keywords
Use Cases: 2-4 practical use cases or scenarios
Questions: 5-10 questions users might ask

Metadata is generated during /cache/recalculate and stored in the cache.

Search

The API includes a search endpoint (/search) that allows you to search across all documentation pages using various semantic search strategies.

Available Search Strategies

The search system supports multiple strategies that can be switched by commenting/uncommenting imports in search/index.ts. Each strategy has different trade-offs in terms of speed, accuracy, and infrastructure requirements.

🏆 Recommended: Local ONNX Models (Fastest)

File: search/transformers-local-onnx.ts

Pre-downloaded ONNX models for the fastest embedding generation with zero network overhead.

Performance: ~10-30ms per query (after initial ~50ms model load)

Advantages:

✅ No network calls - works completely offline
✅ No downloads on first run - instant startup
✅ No isolate loading delays - perfect for serverless
✅ Same accuracy as the cached version
✅ Perfect for production - predictable performance

Setup:

Download the model:
```
cd search/models
./download-model.sh
```

Activate in search/index.ts:

import { searchStrategy, generateEmbeddings } from "./transformers-local-onnx.ts";

Requirements: ~23MB disk space for model files

See search/models/SETUP.md for detailed setup instructions.

Alternative: Transformers.js with Auto-Download

File: search/transformers-cosine.ts

Uses Transformers.js with automatic model downloading from Hugging Face.

Performance:

First run: ~3-5s (downloads ~23MB model)
Cached: ~150ms model load + ~10-30ms per query

Advantages:

✅ No API keys needed
✅ Works in browser and Deno
✅ Automatic caching

Disadvantages:

❌ Slow first run (downloads model)
❌ Isolate loading delays in serverless environments
❌ May not work in some restricted environments

Other Strategies (API-Based)

All require API keys but offer different trade-offs:

Strategy	File	Speed	Cost	Pros
Mixedbread	`mixedbread-embeddings-cosine.ts`	~50-100ms	Free tier	High quality, 1024 dims
OpenAI	`openai-cosine.ts`	~100-200ms	Paid	High quality, reliable
HuggingFace	`hf-inference-qwen3-cosine.ts`	~150-300ms	Free tier	Qwen3-8B model
Cloudflare	`cloudflare-bge-cosine.ts`	~50-150ms	Free tier	Works on CF Workers
JigsawStack	`jigsawstack-orama.ts`	~550ms	Free tier	Managed search

Switching Strategies

Edit search/index.ts and comment/uncomment the desired strategy:

// Comment out current strategy
// import { searchStrategy, generateEmbeddings } from "./transformers-cosine.ts";

// Uncomment desired strategy
import { searchStrategy, generateEmbeddings } from "./transformers-local-onnx.ts";

Current Implementation (Semantic Search)

The search system uses semantic embeddings for intelligent search:

Understands meaning, not just keywords
Finds relevant results even with different wording
Returns ranked results with similarity scores
Includes content snippets with highlighted matches
Uses cosine similarity for fast comparison

Search Architecture

Embedding Generation: Content is converted to 384-dimensional vectors
Cosine Similarity: Query embeddings compared against page embeddings
Ranking: Results sorted by similarity score
Snippet Generation: Context-aware snippets around relevant content

Answer Strategies (RAG)

The API includes answer generation using Retrieval-Augmented Generation (RAG) - combining semantic search with LLM inference to answer questions about the documentation.

How RAG Works

Search: Use active search strategy to find relevant documentation pages
Retrieve: Get full content from top N search results
Format: Package documentation as context for the LLM
Generate: Call Groq LLM to generate an answer based on the context

Available Answer Strategies

Answer strategies are located in the /answer/ folder and can be switched by editing answer/index.ts.

🏆 Current Default: llama-3.3-70b-default

File: answer/llama-3.3-70b-default.ts

Uses Groq's Llama 3.3 70B model with up to 5 documentation pages in context.

Performance: ~1-3s total (depends on search + LLM response time)

Configuration:

Model: llama-3.3-70b-versatile
Max context pages: 5
Tokens per page: ~2000
Temperature: 0.3 (focused/deterministic)

Advantages:

✅ High-quality answers with good reasoning
✅ Handles complex questions well
✅ Includes relevant citations
✅ Markdown-formatted responses

Usage:

# Basic question
GET /answer?q=How+do+I+use+streaming

# With options
GET /answer?q=What+models+are+available&maxContextPages=3&temperature=0.5

# Different model
GET /answer?q=Quick+question&model=llama-3.1-8b-instant

Creating New Answer Strategies

You can create custom strategies for different use cases:

Ideas for new strategies:

llama-3.1-8b-fast: Faster responses with smaller model (good for simple questions)
mixtral-8x7b-extended: More context pages with Mixtral's larger context window
llama-3.3-70b-code: Specialized prompts for code examples and API usage
citation-mode: Include specific citations/references in answers
multi-step: Break down complex questions into sub-questions

See answer/README.md and answer/QUICK-START.md for detailed documentation and guides.

Answer Strategy Architecture

Each strategy implements:

Search integration: Uses active search strategy from /search/
Context management: Formats pages to fit context window
LLM configuration: Model selection, temperature, prompts
Response formatting: Structures answer with metadata

Tuning Answer Quality

For better answers:

Increase maxContextPages (more documentation context)
Raise minScore (only use highly relevant pages)
Use larger models (70B+ for complex reasoning)
Lower temperature (0.2-0.3 for factual accuracy)

For faster responses:

Decrease maxContextPages (less context to process)
Use smaller models (8B for simple questions)
Use faster search strategies

For creative responses:

Increase temperature (0.6-0.8)
Use models good at creative writing
Adjust system prompts

Embeddings

Content embeddings are generated for each page using the active search strategy (see Search section above).

Current Default: Local ONNX models (transformers-local-onnx.ts)

Model: all-MiniLM-L6-v2
Dimensions: 384
Generation: ~10-30ms per page
Storage: Cached as JSON arrays in SQLite

Embeddings are:

Generated during /cache/recalculate
Stored in cache for fast retrieval
Used for semantic search and similarity matching
Portable across different strategies (same dimensions)

Hash-Based Change Detection

Content hashes (SHA-256) are calculated and stored for each page. This enables:

Smart recalculation: Skip unchanged pages automatically
Efficient updates: Only process pages that have actually changed
Performance: Significantly faster recalculation when most content is unchanged

Hashes are compared during /cache/recalculate (default mode) to determine if a page needs reprocessing.

Troubleshooting

Cache seems stale

Run /cache/recalculate to refresh everything.

Page not found

Check /list to see if the path exists
Verify the URL is in the urls array
Ensure the path matches the URL structure (e.g., api-reference for /docs/api-reference.md)

Token counts seem wrong

Clear cache for that page: POST /cache/clear/:path
Request the page again: GET /page/:path
Or recalculate everything: POST /cache/recalculate

Performance issues

Use /page/:path endpoints (cached) instead of /data (uncached)
Check cache stats: GET /cache/stats
Ensure cache is populated before production use

Code Structure

The codebase is organized into modular files:

main.tsx - Main Hono app, routes, and URL definitions
utils.ts - Utility functions:
- Cache management (getFromCache, setCache, clearCache, getCacheStats)
- Content fetching (getTextFromUrl)
- Frontmatter parsing (parseFrontmatter, addUrlSourceToFrontmatter)
- Token counting (calculateTokenCount)
- Hash calculation (calculateContentHash)
- Rate limiting for fetches
groq.ts - Groq API functions:
- Chat completions (groqChatCompletion)
- Metadata generation (generatePageMetadata)
search/ - Search strategies with pluggable implementations:
- index.ts - Main entry point, switches between strategies
- types.ts - Type definitions for search
- utils.ts - Shared utilities (cosine similarity, snippets)
- Multiple strategy files (transformers-local-onnx, mixedbread, openai, etc.)
answer/ - Answer strategies with pluggable RAG implementations:
- index.ts - Main entry point, switches between strategies
- types.ts - Type definitions for answers
- utils.ts - Shared utilities (context formatting, token estimation)
- Multiple strategy files (llama-3.3-70b-default, etc.)

Development

Local Development

# Start the server
deno task serve

# Or manually
deno run --allow-net --allow-env main.tsx

Note: SQLite caching is automatically disabled when running locally (detected via valtown environment variable). The app will work without caching, but cache-related endpoints will return appropriate messages.

Deno Tasks

The project includes several convenience tasks defined in deno.json:

Server Tasks

# Start the development server
deno task serve

Recalculation Tasks

# Recalculate with active search strategy (smart mode, skips unchanged pages)
deno task recalc

# Force recalculation (recalculates all pages)
deno task recalc-f

# Recalculate with Mixedbread embeddings strategy
deno task recalc-mxbai

# Force recalculation with Mixedbread embeddings
deno task recalc-mxbai-f

Testing Tasks

# Test search strategy with detailed timing breakdown
deno task search

# Test answer strategy with detailed timing breakdown and search results
deno task answer

Test Output Features:

⏱️ Comprehensive timing breakdown (search, context prep, LLM call, total)
📊 Shows search results used for context
💬 Displays generated answers (truncated for readability)
📈 Summary statistics (averages, totals)
🎯 Strategy information

Example test output:

⏱️  Timing breakdown:
   Search: 45.2ms
   Context prep: 5.1ms
   LLM call: 1200.2ms
   Total: 1250.5ms

📚 Search results used (top 5):
✓ 1. Compound
     Path: agentic-tooling/compound-beta
     Score: 95.20

💬 Generated Answer:
──────────────────────────────────────────────────────────────────────────────
Compound is a beta feature...
(answer continues)
──────────────────────────────────────────────────────────────────────────────

Val Town

The app is configured to work with Val Town. Export uses:

export default (typeof Deno !== "undefined" && Deno.env.get("valtown")) ? app.fetch : app;

SQLite caching is automatically enabled when running in Val Town (detected via valtown environment variable).

Environment Variables

GROQ_API_KEY - Required for AI metadata generation (optional, will disable metadata if not set)
valtown - Automatically set by Val Town (detects environment)

Performance Tips

Use default recalculate mode - Automatically skips unchanged pages
Cache is your friend - Always populate cache before production use
Rate limiting - Built-in rate limiting prevents WAF blocking (1 request per 3 seconds for docs, 2 requests per second for Groq API)
Hash checking - Default recalculation mode is much faster when most content is unchanged