Hello-Transcription - OpenAI Realtime API Transcription Demo

🎯 Project Overview

Hello-Transcription demonstrates the transcription-only mode of OpenAI's Realtime API. Unlike the conversational mode, this implementation focuses purely on speech-to-text conversion without generating AI responses, making it ideal for subtitles, live captions, meeting transcriptions, and other transcription-focused use cases.

Created: September 2, 2025
Platform: Val Town
API: OpenAI Realtime API (Transcription Mode)
Key Feature: Real-time streaming transcription with multiple model support

๐Ÿ—๏ธ Technical Stack

  • Runtime: Deno (Val Town platform)
  • Framework: Hono (lightweight web framework)
  • Transcription: OpenAI Realtime API in transcription mode
  • Connection: WebRTC with data channel for events
  • Frontend: Vanilla JavaScript with split-view interface
  • Models: GPT-4o Transcribe, GPT-4o Mini Transcribe, Whisper-1

๐Ÿ“ Project Structure

hello-transcription/
├── frontend/
│   └── index.html      # Split-view transcription interface
├── routes/
│   ├── rtc.ts          # WebRTC session setup for transcription
│   ├── observer.ts     # WebSocket observer for transcription events
│   └── utils.ts        # Transcription session configuration
├── main.tsx            # Main entry point
├── deno.json           # Deno configuration
├── README.md           # User documentation
└── CLAUDE.md           # This file - technical documentation

🔑 Core Concepts

Transcription vs Conversation Mode

The Realtime API supports two distinct modes:

  1. Conversation Mode (type: "realtime"):

    • Two-way interaction with AI
    • User speaks → AI responds
    • Used in hello-realtime and hello-mcp
  2. Transcription Mode (type: "transcription"):

    • One-way speech-to-text only
    • User speaks → Text output
    • No AI responses generated
    • Lower latency, lower cost
    • This demo uses transcription mode

Transcription Session Object

{ type: "transcription", input_audio_format: "pcm16", input_audio_transcription: { model: "gpt-4o-transcribe", // or "gpt-4o-mini-transcribe", "whisper-1" prompt: "", // Optional context hint language: "en" // ISO-639-1 language code }, turn_detection: { type: "server_vad", threshold: 0.5, prefix_padding_ms: 300, silence_duration_ms: 500 }, input_audio_noise_reduction: { type: "near_field" // or "far_field", null }, include: ["item.input_audio_transcription.logprobs"] // Optional }

🛠️ Key Components

1. Utils Configuration (/routes/utils.ts)

Handles transcription session configuration with sensible defaults:

export function makeTranscriptionSession(config: TranscriptionConfig = {}) {
  const {
    model = "gpt-4o-transcribe",
    language = "en",
    prompt = "",
    enableVAD = true,
    noiseReduction = "near_field",
    includeLogprobs = false
  } = config;
  // Build session object...
}

Key Configuration Options:

  • model: Transcription model selection
  • language: Primary language for better accuracy
  • prompt: Context hints (e.g., "Expect medical terminology")
  • enableVAD: Automatic voice activity detection
  • noiseReduction: Audio preprocessing type
  • includeLogprobs: Confidence scores for words
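
The // Build session object... step is elided above. Assuming these options map directly onto the transcription session object from the Core Concepts section, the full function could look roughly like this (a sketch, not the val's exact code):

// Sketch only: field handling below is inferred from the session object shown
// earlier, not copied from routes/utils.ts.
interface TranscriptionConfig {
  model?: string;
  language?: string;
  prompt?: string;
  enableVAD?: boolean;
  noiseReduction?: "near_field" | "far_field" | null;
  includeLogprobs?: boolean;
}

export function makeTranscriptionSession(config: TranscriptionConfig = {}) {
  const {
    model = "gpt-4o-transcribe",
    language = "en",
    prompt = "",
    enableVAD = true,
    noiseReduction = "near_field",
    includeLogprobs = false
  } = config;

  return {
    type: "transcription",
    input_audio_format: "pcm16",
    input_audio_transcription: { model, prompt, language },
    turn_detection: enableVAD
      ? { type: "server_vad", threshold: 0.5, prefix_padding_ms: 300, silence_duration_ms: 500 }
      : null,
    input_audio_noise_reduction: noiseReduction ? { type: noiseReduction } : null,
    ...(includeLogprobs ? { include: ["item.input_audio_transcription.logprobs"] } : {})
  };
}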

2. RTC Route Handler (/routes/rtc.ts)

Creates WebRTC sessions specifically for transcription:

// Get config from query params
const model = c.req.query("model") || "gpt-4o-transcribe";
const language = c.req.query("language") || "en";
const vad = c.req.query("vad") !== "false";
const logprobs = c.req.query("logprobs") === "true";

// Create transcription session
const sessionConfig = makeTranscriptionSession({
  model,
  language,
  enableVAD: vad,
  includeLogprobs: logprobs
});

Important: Uses type: "transcription" not type: "realtime"
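
For context, the frontend only needs to pass these options as query parameters. A hedged client-side sketch of calling this route (the /rtc path and the response handling are assumptions based on the file name, not verified against the val):

// Hypothetical client call; adjust the path and response handling to match
// the actual route.
const params = new URLSearchParams({
  model: "gpt-4o-mini-transcribe",
  language: "en",
  vad: "true",
  logprobs: "false"
});

const res = await fetch(`/rtc?${params}`);
if (!res.ok) {
  throw new Error(`Transcription session setup failed: ${res.status}`);
}
// The response is whatever the WebRTC handshake needs (e.g., an SDP answer).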

3. Observer WebSocket (/routes/observer.ts)

Monitors transcription events via server-side WebSocket:

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "conversation.item.input_audio_transcription.delta") {
    // Streaming partial transcription
    console.log(`📝 Transcription delta: "${data.delta}"`);
  } else if (data.type === "conversation.item.input_audio_transcription.completed") {
    // Final transcription for segment
    console.log(`✅ Transcription completed: "${data.transcript}"`);
  }
};

4. Frontend Interface (/frontend/index.html)

Split-view interface with real-time transcription display:

Left Panel - Transcriptions

  • Shows live transcription stream
  • Partial transcriptions update in real-time
  • Final transcriptions marked with green border
  • Each segment timestamped

Right Panel - Event Logs

  • Technical event stream
  • Debug information
  • Connection status

Data Channel Handling

dataChannel.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "conversation.item.input_audio_transcription.delta") {
    // Update partial transcription
    addTranscript(data.item_id, data.delta, false);
  } else if (data.type === "conversation.item.input_audio_transcription.completed") {
    // Mark transcription as final
    addTranscript(data.item_id, data.transcript, true);
  }
};
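
addTranscript is referenced above but not reproduced in this file. A hypothetical version matching the described behavior (per-segment accumulation, timestamps, green border on final segments); element IDs and class names are illustrative, not the actual markup of index.html:

// Hypothetical helper: accumulate deltas per item_id and mark final segments.
const segmentEls = new Map<string, HTMLElement>();

function addTranscript(itemId: string, text: string, isFinal: boolean) {
  let el = segmentEls.get(itemId);
  if (!el) {
    el = document.createElement("div");
    el.className = "transcript-segment";
    el.dataset.timestamp = new Date().toLocaleTimeString();
    document.getElementById("transcripts")?.appendChild(el);
    segmentEls.set(itemId, el);
  }
  if (isFinal) {
    el.textContent = text;        // completed event carries the full transcript
    el.classList.add("final");    // e.g., styled with a green border
  } else {
    el.textContent = (el.textContent ?? "") + text;  // append streaming delta
  }
}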

📊 Model Comparison

GPT-4o Transcribe

  • Streaming: Yes - incremental updates via delta events
  • Latency: Low
  • Accuracy: High
  • Use Case: Live subtitles, real-time captions

GPT-4o Mini Transcribe

  • Streaming: Yes - incremental updates
  • Latency: Very low
  • Accuracy: Good
  • Use Case: Quick transcriptions, lower cost

Whisper-1

  • Streaming: No - complete segments only
  • Latency: Higher (waits for complete utterance)
  • Accuracy: Very high
  • Use Case: High-accuracy transcriptions, post-processing

🔄 Event Flow

Transcription Event Sequence

  1. Audio Input

    User speaks → Microphone → WebRTC → OpenAI
    
  2. VAD Processing (if enabled)

    Voice detected → Buffer audio → Silence detected → Commit buffer
    
  3. Transcription Events

    input_audio_buffer.committed
    ↓
    conversation.item.input_audio_transcription.delta (streaming models)
    ↓
    conversation.item.input_audio_transcription.completed
    

Event Types

Delta Event (Streaming)

{ "type": "conversation.item.input_audio_transcription.delta", "item_id": "item_003", "content_index": 0, "delta": "Hello, how" }

Completed Event

{ "type": "conversation.item.input_audio_transcription.completed", "item_id": "item_003", "content_index": 0, "transcript": "Hello, how are you today?" }

⚙️ Configuration Details

Voice Activity Detection (VAD)

VAD automatically detects speech segments:

turn_detection: {
  type: "server_vad",
  threshold: 0.5,           // Sensitivity (0-1)
  prefix_padding_ms: 300,   // Audio before speech
  silence_duration_ms: 500  // Silence to end segment
}

VAD Disabled:

turn_detection: null // Manual control required
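
With VAD disabled, the client decides when a segment ends. This demo does not implement manual control, but a sketch of what it could look like, assuming the Realtime API's input_audio_buffer.commit client event is sent over the data channel:

// Illustrative only: signal end-of-segment yourself, e.g. when the user
// releases a push-to-talk button.
function commitAudioSegment(dataChannel: RTCDataChannel) {
  dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
}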

Noise Reduction

Three noise reduction modes:

  1. near_field: Optimized for close microphones (default)
  2. far_field: For distant microphones/speakers
  3. null: No noise reduction

Language Configuration

ISO-639-1 language codes improve accuracy:

  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "zh" - Chinese
  • "ja" - Japanese

Logprobs (Confidence Scores)

When enabled, provides word-level confidence:

include: ["item.input_audio_transcription.logprobs"]

Returns probability scores for each transcribed word, useful for:

  • Highlighting uncertain words
  • Quality assessment
  • Post-processing decisions
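
The exact logprobs payload is not shown in this demo. Assuming each entry carries a token and its log probability, a per-token confidence check could look like this (field names are an assumption):

// Assumed shape for entries delivered when logprobs are requested.
interface TokenLogprob {
  token: string;
  logprob: number;  // natural log of the token probability
}

// Flag tokens whose probability falls below a confidence threshold.
function uncertainTokens(logprobs: TokenLogprob[], threshold = 0.6): string[] {
  return logprobs
    .filter((t) => Math.exp(t.logprob) < threshold)
    .map((t) => t.token);
}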

🧪 Testing Guide

Local Development

  1. Setup Environment

    # Create .env file
    echo "OPENAI_API_KEY=sk-..." > .env

    # Install Deno
    curl -fsSL https://deno.land/install.sh | sh
  2. Run Development Server

    # With auto-reload
    deno run --watch --allow-all main.tsx

    # Or standard
    deno run --allow-all main.tsx
  3. Test Transcription

    • Open http://localhost:8000
    • Select model and language
    • Click "Start"
    • Speak clearly
    • Watch transcriptions appear

Testing Different Models

  1. Test Streaming (GPT-4o)

    • Select "GPT-4o Transcribe"
    • Speak continuously
    • Notice incremental updates
  2. Test Non-Streaming (Whisper-1)

    • Select "Whisper-1"
    • Speak, then pause
    • Notice complete segments only
  3. Test VAD

    • Enable VAD
    • Speak with pauses
    • Notice automatic segmentation
  4. Test Without VAD

    • Disable VAD
    • Requires manual commit (not implemented in this demo)

๐Ÿ› Common Issues & Solutions

Issue: No transcriptions appearing

Solutions:

  • Check microphone permissions
  • Verify OPENAI_API_KEY is set
  • Check browser console for errors
  • Ensure WebRTC connection established

Issue: Transcriptions cut off mid-sentence

Solutions:

  • Adjust VAD silence_duration_ms (longer = fewer cuts)
  • Try different VAD threshold
  • Consider using different model
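
For example, a longer silence window and a slightly higher threshold reduce premature cuts; the values below are illustrative starting points, not recommendations from the API docs:

// Illustrative VAD settings that tolerate longer pauses before ending a segment.
const relaxedTurnDetection = {
  type: "server_vad",
  threshold: 0.6,            // slightly less sensitive to background noise
  prefix_padding_ms: 300,
  silence_duration_ms: 800   // wait longer before closing a segment
};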

Issue: Poor transcription accuracy

Solutions:

  • Set correct language parameter
  • Use appropriate noise reduction setting
  • Provide context via prompt parameter
  • Try higher accuracy model (Whisper-1)

Issue: High latency

Solutions:

  • Use GPT-4o Mini for lower latency
  • Check network connection
  • Consider disabling logprobs

📈 Performance Characteristics

Latency Comparison

  • GPT-4o Mini: ~100-200ms first token
  • GPT-4o: ~150-250ms first token
  • Whisper-1: ~500-1000ms (full segment)

Throughput

  • Handles real-time audio (16kHz PCM16)
  • Multiple concurrent sessions supported
  • No buffering required for streaming models

Cost Optimization

  • Transcription-only mode is cheaper than conversation mode
  • No AI inference costs
  • GPT-4o Mini most cost-effective
  • Whisper-1 for batch/accuracy needs

🔒 Security Considerations

Current Implementation

  • API key stored in environment variable
  • No authentication on endpoints
  • No rate limiting
  • Single-tenant design

Production Recommendations

  1. Authentication: Add user authentication
  2. Rate Limiting: Implement per-user limits
  3. CORS: Configure appropriate origins
  4. Monitoring: Track usage and errors
  5. Encryption: Ensure HTTPS only
  6. Token Security: Use short-lived tokens
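
As one illustration of items 1 and 2, a Hono middleware sketch with a static bearer token and a naive in-memory rate limit; this is not part of the val, and a production system would use real user auth and a shared store. The /rtc path and DEMO_ACCESS_TOKEN variable are assumptions:

import { Hono } from "npm:hono";

const app = new Hono();
const requests = new Map<string, { count: number; windowStart: number }>();

app.use("/rtc", async (c, next) => {
  // Simple bearer-token check; replace with real user authentication.
  const auth = c.req.header("Authorization") ?? "";
  if (auth !== `Bearer ${Deno.env.get("DEMO_ACCESS_TOKEN")}`) {
    return c.text("Unauthorized", 401);
  }

  // Naive per-token rate limit: 30 requests per minute, kept in memory.
  const now = Date.now();
  const entry = requests.get(auth) ?? { count: 0, windowStart: now };
  if (now - entry.windowStart > 60_000) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count++;
  requests.set(auth, entry);
  if (entry.count > 30) {
    return c.text("Too Many Requests", 429);
  }

  await next();
});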

🚀 Deployment

Val Town Deployment

  1. Create/Remix Val

    vt remix emcho/hello-transcription my-transcription
  2. Set Environment

    • Add OPENAI_API_KEY in Val Town secrets
  3. Deploy

    vt push
  4. Access

    • URL: https://[your-val-name].val.run

Environment Variables

  • OPENAI_API_KEY - Required for OpenAI API access
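
A minimal startup check, assuming the val reads the key via Deno.env (the usual pattern on Val Town):

// Fail fast if the key is missing rather than erroring on the first API call.
const OPENAI_API_KEY = Deno.env.get("OPENAI_API_KEY");
if (!OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is not set; add it in Val Town's environment variables.");
}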

๐Ÿ“ Future Enhancements

Potential Features

  1. Recording & Export

    • Save transcriptions to file
    • Export as SRT/VTT subtitles (see the SRT sketch after this list)
    • Download audio recordings
  2. Advanced Controls

    • Manual VAD control
    • Custom VAD parameters UI
    • Prompt templates
  3. Multi-Stream

    • Multiple speaker support
    • Speaker diarization
    • Parallel transcriptions
  4. Post-Processing

    • Punctuation enhancement
    • Grammar correction
    • Translation
  5. Visualization

    • Audio waveform display
    • VAD activity indicator
    • Confidence heat map
  6. Integration

    • Webhook support
    • Real-time API streaming
    • Database storage
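
For the SRT/VTT export idea above, a hedged sketch of turning timestamped final segments into SRT text; the segment shape is hypothetical, since this demo does not currently record start/end times relative to the audio:

// Hypothetical segment shape for a future export feature.
interface CaptionSegment {
  startMs: number;
  endMs: number;
  text: string;
}

function toSrtTimestamp(ms: number): string {
  const h = String(Math.floor(ms / 3_600_000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3_600_000) / 60_000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60_000) / 1000)).padStart(2, "0");
  return `${h}:${m}:${s},${String(ms % 1000).padStart(3, "0")}`;
}

function toSrt(segments: CaptionSegment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.startMs)} --> ${toSrtTimestamp(seg.endMs)}\n${seg.text}\n`
    )
    .join("\n");
}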

🔗 References

Documentation

  • OpenAI Realtime Transcription Guide
  • Realtime API Reference
  • Voice Activity Detection Guide
  • Val Town Documentation

Related Projects

  • hello-realtime - Conversation mode demo
  • hello-mcp - MCP tool execution demo

Key Differences from Conversation Mode

  1. Session type: transcription vs realtime
  2. No AI responses generated
  3. Different event types
  4. Lower latency and cost
  5. Focused on speech-to-text only

💡 Implementation Notes

Critical Discoveries

  1. Session Type: Must use type: "transcription" not type: "realtime"
  2. Event Names: Different from conversation mode events
  3. Model Behavior: Whisper doesn't stream, GPT-4o models do
  4. VAD Impact: Significantly affects transcription segmentation
  5. Language Setting: Dramatically improves accuracy for non-English

Best Practices

  1. Always specify language for non-English content
  2. Use near_field noise reduction for headset mics
  3. Enable VAD for natural speech segmentation
  4. Choose model based on latency vs accuracy needs
  5. Monitor item_id for proper segment ordering

🎯 Summary

Hello-Transcription successfully demonstrates the transcription-only capabilities of OpenAI's Realtime API. Key achievements:

  1. Pure Transcription: No AI responses, focused solely on speech-to-text
  2. Model Flexibility: Support for three different transcription models
  3. Real-time Streaming: Live transcription updates for supported models
  4. Configuration Options: VAD, noise reduction, language, logprobs
  5. Clean Interface: Split-view design for transcriptions and logs

This implementation serves as a foundation for building transcription-focused applications like live captioning, meeting transcription, subtitle generation, and accessibility tools.
