Hello-Transcription demonstrates the transcription-only mode of OpenAI's Realtime API. Unlike the conversational mode, this implementation focuses purely on speech-to-text conversion without generating AI responses, making it ideal for subtitles, live captions, meeting transcriptions, and other transcription-focused use cases.
Created: September 2, 2025
Platform: Val Town
API: OpenAI Realtime API (Transcription Mode)
Key Feature: Real-time streaming transcription with multiple model support
- Runtime: Deno (Val Town platform)
- Framework: Hono (lightweight web framework)
- Transcription: OpenAI Realtime API in transcription mode
- Connection: WebRTC with data channel for events
- Frontend: Vanilla JavaScript with split-view interface
- Models: GPT-4o Transcribe, GPT-4o Mini Transcribe, Whisper-1
hello-transcription/
├── frontend/
│   └── index.html      # Split-view transcription interface
├── routes/
│   ├── rtc.ts          # WebRTC session setup for transcription
│   ├── observer.ts     # WebSocket observer for transcription events
│   └── utils.ts        # Transcription session configuration
├── main.tsx            # Main entry point
├── deno.json           # Deno configuration
├── README.md           # User documentation
└── CLAUDE.md           # This file - technical documentation
The Realtime API supports two distinct modes:
- Conversation Mode (`type: "realtime"`):
  - Two-way interaction with AI
  - User speaks → AI responds
  - Used in hello-realtime and hello-mcp
- Transcription Mode (`type: "transcription"`):
  - One-way speech-to-text only
  - User speaks → Text output
  - No AI responses generated
  - Lower latency, lower cost
  - This demo uses transcription mode
{
type: "transcription",
input_audio_format: "pcm16",
input_audio_transcription: {
model: "gpt-4o-transcribe", // or "gpt-4o-mini-transcribe", "whisper-1"
prompt: "", // Optional context hint
language: "en" // ISO-639-1 language code
},
turn_detection: {
type: "server_vad",
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500
},
input_audio_noise_reduction: {
type: "near_field" // or "far_field", null
},
include: ["item.input_audio_transcription.logprobs"] // Optional
}
Handles transcription session configuration with sensible defaults:
export function makeTranscriptionSession(config: TranscriptionConfig = {}) {
const {
model = "gpt-4o-transcribe",
language = "en",
prompt = "",
enableVAD = true,
noiseReduction = "near_field",
includeLogprobs = false
} = config;
// Build session object...
}
Key Configuration Options:
- model: Transcription model selection
- language: Primary language for better accuracy
- prompt: Context hints (e.g., "Expect medical terminology")
- enableVAD: Automatic voice activity detection
- noiseReduction: Audio preprocessing type
- includeLogprobs: Confidence scores for words
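Based on the session shape and options shown above, the following is a hedged sketch of how the full function might assemble the session object. The `TranscriptionConfig` interface and exact field mapping are assumptions for illustration, not the repo's actual `utils.ts`:

```typescript
// Sketch only: field names mirror the session config shown above;
// the real utils.ts may differ in details.
interface TranscriptionConfig {
  model?: string;
  language?: string;
  prompt?: string;
  enableVAD?: boolean;
  noiseReduction?: "near_field" | "far_field" | null;
  includeLogprobs?: boolean;
}

export function makeTranscriptionSession(config: TranscriptionConfig = {}) {
  const {
    model = "gpt-4o-transcribe",
    language = "en",
    prompt = "",
    enableVAD = true,
    noiseReduction = "near_field",
    includeLogprobs = false,
  } = config;

  return {
    type: "transcription",
    input_audio_format: "pcm16",
    input_audio_transcription: { model, prompt, language },
    // VAD on: server-side segmentation; VAD off: the client must commit manually
    turn_detection: enableVAD
      ? { type: "server_vad", threshold: 0.5, prefix_padding_ms: 300, silence_duration_ms: 500 }
      : null,
    input_audio_noise_reduction: noiseReduction ? { type: noiseReduction } : null,
    // Only request logprobs when the caller asks for them
    ...(includeLogprobs ? { include: ["item.input_audio_transcription.logprobs"] } : {}),
  };
}
```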
Creates WebRTC sessions specifically for transcription:
// Get config from query params
const model = c.req.query("model") || "gpt-4o-transcribe";
const language = c.req.query("language") || "en";
const vad = c.req.query("vad") !== "false";
const logprobs = c.req.query("logprobs") === "true";
// Create transcription session
const sessionConfig = makeTranscriptionSession({
model,
language,
enableVAD: vad,
includeLogprobs: logprobs
});
Important: Uses `type: "transcription"`, not `type: "realtime"`.
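For illustration, here is a hedged sketch of what the session-creation call might look like from a Hono route. The route path, the beta endpoint `https://api.openai.com/v1/realtime/transcription_sessions`, and the response shape are assumptions; the repo's actual `rtc.ts` may instead proxy the WebRTC SDP offer directly:

```typescript
// Sketch: mint an ephemeral client token for a transcription session.
// Endpoint and response shape are assumptions; check the current API docs.
import { Hono } from "npm:hono";
import { makeTranscriptionSession } from "./utils.ts";

const app = new Hono();

app.post("/rtc/token", async (c) => {
  const sessionConfig = makeTranscriptionSession({
    model: c.req.query("model") || "gpt-4o-transcribe",
    language: c.req.query("language") || "en",
    enableVAD: c.req.query("vad") !== "false",
    includeLogprobs: c.req.query("logprobs") === "true",
  });

  const resp = await fetch("https://api.openai.com/v1/realtime/transcription_sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${Deno.env.get("OPENAI_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(sessionConfig),
  });

  if (!resp.ok) return c.text(await resp.text(), 500);
  return c.json(await resp.json()); // includes a client secret the browser can use
});
```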
Monitors transcription events via server-side WebSocket:
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "conversation.item.input_audio_transcription.delta") {
// Streaming partial transcription
console.log(`Transcription delta: "${data.delta}"`);
} else if (data.type === "conversation.item.input_audio_transcription.completed") {
// Final transcription for segment
console.log(`Transcription completed: "${data.transcript}"`);
}
};
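One possible extension of this handler (assuming the same `ws` WebSocket as above) accumulates deltas per `item_id` so each segment can be logged or stored as a whole. The Map-based buffer is illustrative, not part of the repo:

```typescript
// Illustrative: collect streaming deltas per segment, keyed by item_id.
const partials = new Map<string, string>();

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === "conversation.item.input_audio_transcription.delta") {
    // Append the partial text for this segment
    partials.set(data.item_id, (partials.get(data.item_id) ?? "") + data.delta);
  } else if (data.type === "conversation.item.input_audio_transcription.completed") {
    // The completed event carries the full transcript; prefer it over accumulated deltas
    partials.delete(data.item_id);
    console.log(`[${data.item_id}] ${data.transcript}`);
  }
};
```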
Split-view interface with real-time transcription display:
- Transcription panel:
  - Shows the live transcription stream
  - Partial transcriptions update in real-time
  - Final transcriptions marked with a green border
  - Each segment is timestamped
- Log panel:
  - Technical event stream
  - Debug information
  - Connection status
dataChannel.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "conversation.item.input_audio_transcription.delta") {
// Update partial transcription
addTranscript(data.item_id, data.delta, false);
} else if (data.type === "conversation.item.input_audio_transcription.completed") {
// Mark transcription as final
addTranscript(data.item_id, data.transcript, true);
}
};
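The `addTranscript` helper referenced above isn't shown in this doc. A hedged sketch of one possible DOM implementation matching the described behavior (timestamps, green border on final segments; the `transcripts` container id is assumed):

```typescript
// Illustrative addTranscript: one <div> per item_id, updated in place for
// partials and styled when the segment is final.
function addTranscript(itemId: string, text: string, isFinal: boolean): void {
  const container = document.getElementById("transcripts")!; // assumed container id
  let el = document.getElementById(`seg-${itemId}`);

  if (!el) {
    el = document.createElement("div");
    el.id = `seg-${itemId}`;
    el.dataset.text = "";
    el.dataset.time = new Date().toLocaleTimeString(); // timestamp the segment once
    container.appendChild(el);
  }

  // Delta events are incremental (append); the completed event carries the
  // full transcript, so replace any accumulated partial text.
  el.dataset.text = isFinal ? text : (el.dataset.text ?? "") + text;
  el.textContent = `[${el.dataset.time ?? ""}] ${el.dataset.text}`;

  if (isFinal) {
    el.style.borderLeft = "4px solid green"; // mark final segments
  }
}
```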
GPT-4o Transcribe:
- Streaming: Yes - incremental updates via delta events
- Latency: Low
- Accuracy: High
- Use Case: Live subtitles, real-time captions

GPT-4o Mini Transcribe:
- Streaming: Yes - incremental updates
- Latency: Very low
- Accuracy: Good
- Use Case: Quick transcriptions, lower cost

Whisper-1:
- Streaming: No - complete segments only
- Latency: Higher (waits for complete utterance)
- Accuracy: Very high
- Use Case: High-accuracy transcriptions, post-processing
- Audio Input: User speaks → Microphone → WebRTC → OpenAI
- VAD Processing (if enabled): Voice detected → Buffer audio → Silence detected → Commit buffer
- Transcription Events: input_audio_buffer.committed → conversation.item.input_audio_transcription.delta (streaming models) → conversation.item.input_audio_transcription.completed
{ "type": "conversation.item.input_audio_transcription.delta", "item_id": "item_003", "content_index": 0, "delta": "Hello, how" }
{ "type": "conversation.item.input_audio_transcription.completed", "item_id": "item_003", "content_index": 0, "transcript": "Hello, how are you today?" }
VAD automatically detects speech segments:
turn_detection: {
type: "server_vad",
threshold: 0.5, // Sensitivity (0-1)
prefix_padding_ms: 300, // Audio before speech
silence_duration_ms: 500 // Silence to end segment
}
VAD Disabled:
turn_detection: null // Manual control required
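When `turn_detection` is null, the client has to decide when a segment ends. Below is a hedged sketch of manual buffer control over the data channel; this demo doesn't implement it, the event names follow the Realtime API's `input_audio_buffer` client events, and whether buffered track audio is committed exactly this way in WebRTC transcription mode is an assumption:

```typescript
// Illustrative manual control when VAD is disabled: audio still flows over the
// WebRTC audio track; the client signals segment boundaries over the data channel.
function commitSegment(dataChannel: RTCDataChannel): void {
  // Ask the server to transcribe whatever audio has been buffered so far
  dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
}

function discardSegment(dataChannel: RTCDataChannel): void {
  // Drop the buffered audio without transcribing it
  dataChannel.send(JSON.stringify({ type: "input_audio_buffer.clear" }));
}
```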
Three noise reduction modes:
- near_field: Optimized for close microphones (default)
- far_field: For distant microphones/speakers
- null: No noise reduction
ISO-639-1 language codes improve accuracy:
- "en" - English
- "es" - Spanish
- "fr" - French
- "de" - German
- "zh" - Chinese
- "ja" - Japanese
When enabled, provides word-level confidence:
include: ["item.input_audio_transcription.logprobs"]
Returns probability scores for each transcribed word, useful for:
- Highlighting uncertain words
- Quality assessment
- Post-processing decisions
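A hedged sketch of consuming logprobs from the completed event to flag low-confidence words. The exact payload shape (an array of token/logprob entries) is an assumption based on typical OpenAI logprobs output:

```typescript
// Illustrative: flag tokens whose probability falls below a threshold.
// Assumes the completed event carries `logprobs: { token: string; logprob: number }[]`.
interface TokenLogprob {
  token: string;
  logprob: number; // natural log of the token probability
}

function flagUncertainTokens(logprobs: TokenLogprob[], minProb = 0.5): string[] {
  return logprobs
    .filter((lp) => Math.exp(lp.logprob) < minProb) // convert logprob back to a probability
    .map((lp) => lp.token);
}
```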
- Setup Environment
  # Create .env file
  echo "OPENAI_API_KEY=sk-..." > .env
  # Install Deno
  curl -fsSL https://deno.land/install.sh | sh
- Run Development Server
  # With auto-reload
  deno run --watch --allow-all main.tsx
  # Or standard
  deno run --allow-all main.tsx
- Test Transcription
  - Open http://localhost:8000
  - Select model and language
  - Click "Start"
  - Speak clearly
  - Watch transcriptions appear
- Test Streaming (GPT-4o)
  - Select "GPT-4o Transcribe"
  - Speak continuously
  - Notice incremental updates
- Test Non-Streaming (Whisper-1)
  - Select "Whisper-1"
  - Speak, then pause
  - Notice complete segments only
- Test VAD
  - Enable VAD
  - Speak with pauses
  - Notice automatic segmentation
- Test Without VAD
  - Disable VAD
  - Requires manual commit (not implemented in this demo)
Problem: No transcriptions appearing. Solutions:
- Check microphone permissions
- Verify OPENAI_API_KEY is set
- Check browser console for errors
- Ensure WebRTC connection established
Problem: Transcriptions cut off too early. Solutions:
- Adjust VAD silence_duration_ms (longer = fewer cuts)
- Try different VAD threshold
- Consider using different model
Problem: Poor accuracy. Solutions:
- Set correct language parameter
- Use appropriate noise reduction setting
- Provide context via prompt parameter
- Try higher accuracy model (Whisper-1)
Problem: High latency. Solutions:
- Use GPT-4o Mini for lower latency
- Check network connection
- Consider disabling logprobs
- GPT-4o Mini: ~100-200ms first token
- GPT-4o: ~150-250ms first token
- Whisper-1: ~500-1000ms (full segment)
- Handles real-time audio (24kHz PCM16)
- Multiple concurrent sessions supported
- No buffering required for streaming models
- Transcription-only mode is cheaper than conversation mode
- No AI inference costs
- GPT-4o Mini most cost-effective
- Whisper-1 for batch/accuracy needs
- API key stored in environment variable
- No authentication on endpoints
- No rate limiting
- Single-tenant design
- Authentication: Add user authentication
- Rate Limiting: Implement per-user limits (see the middleware sketch after this list)
- CORS: Configure appropriate origins
- Monitoring: Track usage and errors
- Encryption: Ensure HTTPS only
- Token Security: Use short-lived tokens
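A minimal Hono sketch of the authentication and rate-limiting items above. The header check, the `APP_ACCESS_TOKEN` environment variable, and the in-memory counter are all illustrative; a real deployment would use proper user auth and a shared store:

```typescript
import { Hono } from "npm:hono";

const app = new Hono();

// Illustrative bearer-token check; replace with real user authentication.
app.use("*", async (c, next) => {
  const token = c.req.header("Authorization")?.replace("Bearer ", "");
  if (!token || token !== Deno.env.get("APP_ACCESS_TOKEN")) {
    return c.text("Unauthorized", 401);
  }
  await next();
});

// Illustrative in-memory per-token rate limit (requests per minute).
const hits = new Map<string, { count: number; windowStart: number }>();
app.use("*", async (c, next) => {
  const key = c.req.header("Authorization") ?? "anonymous";
  const now = Date.now();
  const entry = hits.get(key);
  if (!entry || now - entry.windowStart > 60_000) {
    hits.set(key, { count: 1, windowStart: now });
  } else if (++entry.count > 60) {
    return c.text("Too Many Requests", 429);
  }
  await next();
});
```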
- Create/Remix Val: `vt remix emcho/hello-transcription my-transcription`
- Set Environment: add `OPENAI_API_KEY` in Val Town secrets
- Deploy: `vt push`
- Access: `https://[your-val-name].val.run`

Environment variables:
- `OPENAI_API_KEY` - Required for OpenAI API access
- Recording & Export (see the SRT sketch after this list)
  - Save transcriptions to file
  - Export as SRT/VTT subtitles
  - Download audio recordings
- Advanced Controls
  - Manual VAD control
  - Custom VAD parameters UI
  - Prompt templates
- Multi-Stream
  - Multiple speaker support
  - Speaker diarization
  - Parallel transcriptions
- Post-Processing
  - Punctuation enhancement
  - Grammar correction
  - Translation
- Visualization
  - Audio waveform display
  - VAD activity indicator
  - Confidence heat map
- Integration
  - Webhook support
  - Real-time API streaming
  - Database storage
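As a starting point for the SRT/VTT export idea, a small formatter that turns timestamped segments into SRT text. The `TranscriptSegment` shape with start/end seconds is hypothetical; the demo currently only timestamps segments client-side:

```typescript
// Illustrative SRT formatter for exported transcripts.
interface TranscriptSegment {
  start: number; // seconds from the start of the recording
  end: number;   // seconds from the start of the recording
  text: string;
}

function toSrtTimestamp(seconds: number): string {
  const ms = Math.floor((seconds % 1) * 1000);
  const s = Math.floor(seconds) % 60;
  const m = Math.floor(seconds / 60) % 60;
  const h = Math.floor(seconds / 3600);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

function toSrt(segments: TranscriptSegment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```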
- OpenAI Realtime Transcription Guide
- Realtime API Reference
- Voice Activity Detection Guide
- Val Town Documentation
- hello-realtime - Conversation mode demo
- hello-mcp - MCP tool execution demo
- Session type: `transcription` vs `realtime`
- No AI responses generated
- Different event types
- Lower latency and cost
- Focused on speech-to-text only
- Session Type: Must use `type: "transcription"`, not `type: "realtime"`
- Event Names: Different from conversation mode events
- Model Behavior: Whisper doesn't stream, GPT-4o models do
- VAD Impact: Significantly affects transcription segmentation
- Language Setting: Dramatically improves accuracy for non-English
- Always specify language for non-English content
- Use near_field noise reduction for headset mics
- Enable VAD for natural speech segmentation
- Choose model based on latency vs accuracy needs
- Monitor item_id for proper segment ordering
Hello-Transcription successfully demonstrates the transcription-only capabilities of OpenAI's Realtime API. Key achievements:
- Pure Transcription: No AI responses, focused solely on speech-to-text
- Model Flexibility: Support for three different transcription models
- Real-time Streaming: Live transcription updates for supported models
- Configuration Options: VAD, noise reduction, language, logprobs
- Clean Interface: Split-view design for transcriptions and logs
This implementation serves as a foundation for building transcription-focused applications like live captioning, meeting transcription, subtitle generation, and accessibility tools.