Hello-Transcription demonstrates the transcription-only mode of OpenAI's Realtime API. Unlike the conversational mode, this implementation focuses purely on speech-to-text conversion without generating AI responses, making it ideal for subtitles, live captions, meeting transcriptions, and other transcription-focused use cases.
Created: September 2, 2025
Platform: Val Town
API: OpenAI Realtime API (Transcription Mode)
Key Feature: Real-time streaming transcription with multiple model support
hello-transcription/
├── frontend/
│   └── index.html      # Split-view transcription interface
├── routes/
│   ├── rtc.ts          # WebRTC session setup for transcription
│   ├── observer.ts     # WebSocket observer for transcription events
│   └── utils.ts        # Transcription session configuration
├── main.tsx            # Main entry point
├── deno.json           # Deno configuration
├── README.md           # User documentation
└── CLAUDE.md           # This file - technical documentation
The Realtime API supports two distinct modes:
- Conversation Mode (type: "realtime"): the model listens and generates AI audio/text responses.
- Transcription Mode (type: "transcription"): audio is converted to text only; no AI responses are generated.
{
type: "transcription",
input_audio_format: "pcm16",
input_audio_transcription: {
model: "gpt-4o-transcribe", // or "gpt-4o-mini-transcribe", "whisper-1"
prompt: "", // Optional context hint
language: "en" // ISO-639-1 language code
},
turn_detection: {
type: "server_vad",
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500
},
input_audio_noise_reduction: {
type: "near_field" // or "far_field", null
},
include: ["item.input_audio_transcription.logprobs"] // Optional
}
routes/utils.ts handles transcription session configuration with sensible defaults:
export function makeTranscriptionSession(config: TranscriptionConfig = {}) {
const {
model = "gpt-4o-transcribe",
language = "en",
prompt = "",
enableVAD = true,
noiseReduction = "near_field",
includeLogprobs = false
} = config;
// Build session object...
}
Key configuration options: model, language, prompt, enableVAD, noiseReduction, and includeLogprobs.
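The TranscriptionConfig type itself is not reproduced in this document; the interface below is a plausible reconstruction inferred from the destructured defaults above, so treat the optionality and union types as assumptions rather than the project's actual definition.

```typescript
// Hypothetical reconstruction of TranscriptionConfig, inferred from the
// defaults destructured in makeTranscriptionSession above.
export interface TranscriptionConfig {
  model?: "gpt-4o-transcribe" | "gpt-4o-mini-transcribe" | "whisper-1";
  language?: string;                              // ISO-639-1 code, e.g. "en"
  prompt?: string;                                // optional context hint
  enableVAD?: boolean;                            // toggle server_vad turn detection
  noiseReduction?: "near_field" | "far_field" | null;
  includeLogprobs?: boolean;                      // request word-level logprobs
}
```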
routes/rtc.ts creates WebRTC sessions specifically for transcription:
// Get config from query params
const model = c.req.query("model") || "gpt-4o-transcribe";
const language = c.req.query("language") || "en";
const vad = c.req.query("vad") !== "false";
const logprobs = c.req.query("logprobs") === "true";
// Create transcription session
const sessionConfig = makeTranscriptionSession({
model,
language,
enableVAD: vad,
includeLogprobs: logprobs
});
Important: Uses type: "transcription", not type: "realtime".
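The snippet above only shows how the query parameters are read; the rest of the handler is not reproduced in this document. A minimal sketch of how such a route might mint an ephemeral transcription session is shown below, assuming a Hono handler (suggested by c.req.query) and assuming OpenAI's transcription-session REST endpoint; the endpoint path and response shape are assumptions, not confirmed by this document.

```typescript
// Hypothetical sketch of a Hono route creating a transcription session.
// The OpenAI endpoint path and response shape are assumptions.
import { Hono } from "npm:hono";
import { makeTranscriptionSession } from "./utils.ts";

const app = new Hono();

app.post("/rtc", async (c) => {
  const sessionConfig = makeTranscriptionSession({
    model: c.req.query("model") || "gpt-4o-transcribe",
    language: c.req.query("language") || "en",
    enableVAD: c.req.query("vad") !== "false",
    includeLogprobs: c.req.query("logprobs") === "true",
  });

  // Assumed endpoint for minting an ephemeral client secret for the browser.
  const resp = await fetch("https://api.openai.com/v1/realtime/transcription_sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${Deno.env.get("OPENAI_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(sessionConfig),
  });

  // The frontend uses the returned credentials to complete the WebRTC setup.
  return c.json(await resp.json());
});

export default app;
```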
routes/observer.ts monitors transcription events via a server-side WebSocket:
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "conversation.item.input_audio_transcription.delta") {
// Streaming partial transcription
console.log(`Transcription delta: "${data.delta}"`);
} else if (data.type === "conversation.item.input_audio_transcription.completed") {
// Final transcription for segment
console.log(`Transcription completed: "${data.transcript}"`);
}
};
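On the observer side, the same two event types can be folded into a running transcript keyed by item_id, for example from inside the ws.onmessage handler above. The sketch below is illustrative only; the helper names and data structure are hypothetical.

```typescript
// Hypothetical accumulator for observer-side transcripts, keyed by item_id.
const segments = new Map<string, { text: string; final: boolean }>();

function handleTranscriptionEvent(data: any) {
  if (data.type === "conversation.item.input_audio_transcription.delta") {
    // Append streaming partial text for this item.
    const seg = segments.get(data.item_id) ?? { text: "", final: false };
    seg.text += data.delta;
    segments.set(data.item_id, seg);
  } else if (data.type === "conversation.item.input_audio_transcription.completed") {
    // Replace partials with the final transcript for the segment.
    segments.set(data.item_id, { text: data.transcript, final: true });
  }
}

// Full transcript in item arrival order.
const fullTranscript = () =>
  [...segments.values()].map((s) => s.text).join(" ");
```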
frontend/index.html provides a split-view interface with real-time transcription display:
dataChannel.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "conversation.item.input_audio_transcription.delta") {
// Update partial transcription
addTranscript(data.item_id, data.delta, false);
} else if (data.type === "conversation.item.input_audio_transcription.completed") {
// Mark transcription as final
addTranscript(data.item_id, data.transcript, true);
}
};
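addTranscript is referenced above but not shown in this document. One plausible DOM-side implementation keeps one element per item_id, appends delta text as it streams, and restyles the element once the segment is final; the element ID and class name below are made up.

```typescript
// Hypothetical frontend helper: one <div> per transcription item.
function addTranscript(itemId: string, text: string, isFinal: boolean) {
  const container = document.getElementById("transcripts")!; // assumed container element
  let el = document.getElementById(`transcript-${itemId}`);

  if (!el) {
    el = document.createElement("div");
    el.id = `transcript-${itemId}`;
    container.appendChild(el);
  }

  if (isFinal) {
    el.textContent = text;                              // completed event carries full transcript
    el.classList.add("final");                          // assumed CSS hook for final segments
  } else {
    el.textContent = (el.textContent ?? "") + text;     // delta events stream partial text
  }
}
```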
1. Audio Input
   User speaks → Microphone → WebRTC → OpenAI
2. VAD Processing (if enabled)
   Voice detected → Buffer audio → Silence detected → Commit buffer
3. Transcription Events
   input_audio_buffer.committed
   ↓
   conversation.item.input_audio_transcription.delta (streaming models)
   ↓
   conversation.item.input_audio_transcription.completed
{ "type": "conversation.item.input_audio_transcription.delta", "item_id": "item_003", "content_index": 0, "delta": "Hello, how" }
{ "type": "conversation.item.input_audio_transcription.completed", "item_id": "item_003", "content_index": 0, "transcript": "Hello, how are you today?" }
VAD automatically detects speech segments:
turn_detection: {
type: "server_vad",
threshold: 0.5, // Sensitivity (0-1)
prefix_padding_ms: 300, // Audio before speech
silence_duration_ms: 500 // Silence to end segment
}
VAD Disabled:
turn_detection: null // Manual control required
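With turn_detection set to null, the client decides when a segment ends. The sketch below shows one way to do that over the existing WebRTC data channel; input_audio_buffer.commit and input_audio_buffer.clear are standard Realtime API client events, but their use here is an illustration rather than code from this project.

```typescript
// Hypothetical manual-commit helpers for when server VAD is disabled.
// Client events are sent as JSON over the existing WebRTC data channel.
function commitAudioSegment(dataChannel: RTCDataChannel) {
  // Tell the API the buffered audio is a complete segment to transcribe.
  dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
}

function discardAudioSegment(dataChannel: RTCDataChannel) {
  // Drop the buffered audio without transcribing it.
  dataChannel.send(JSON.stringify({ type: "input_audio_buffer.clear" }));
}
```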
Three noise reduction modes are available: near_field (close-talking microphones, the default here), far_field (distant microphones), and null (noise reduction disabled).
Supplying an ISO-639-1 language code (e.g. "en" for English, "es" for Spanish) improves accuracy.
When enabled, provides word-level confidence:
include: ["item.input_audio_transcription.logprobs"]
Returns probability scores for each transcribed word, which is useful for estimating confidence and flagging low-confidence segments.
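A sketch of how those scores might be consumed is below. The exact field layout on the completed event is an assumption (modeled here as an array of { token, logprob } entries), so treat it as illustrative.

```typescript
// Assumed shape: logprobs delivered as an array of { token, logprob } entries
// on the completed transcription event. This layout is an assumption.
interface TokenLogprob {
  token: string;
  logprob: number; // natural-log probability
}

// Rough per-segment confidence: mean token probability.
function segmentConfidence(logprobs: TokenLogprob[]): number {
  if (logprobs.length === 0) return 0;
  const sum = logprobs.reduce((acc, lp) => acc + Math.exp(lp.logprob), 0);
  return sum / logprobs.length;
}
```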
Setup Environment
# Create .env file
echo "OPENAI_API_KEY=sk-..." > .env

# Install Deno
curl -fsSL https://deno.land/install.sh | sh
Run Development Server
# With auto-reload
deno run --watch --allow-all main.tsx

# Or standard
deno run --allow-all main.tsx
Test Transcription (example query strings for each case are sketched after this list):
- Test Streaming (GPT-4o)
- Test Non-Streaming (Whisper-1)
- Test VAD
- Test Without VAD
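The cases above map onto the query parameters read in routes/rtc.ts. The snippet below shows illustrative query strings; the base URL and the /rtc path are assumptions about how the frontend calls the backend route, not values taken from this document.

```typescript
// Illustrative query strings based on the parameters read in routes/rtc.ts.
// The base URL and /rtc path are assumptions.
const base = "https://my-transcription.val.run/rtc";

const streaming    = `${base}?model=gpt-4o-transcribe`; // streaming delta events
const nonStreaming = `${base}?model=whisper-1`;         // completed events only
const withVAD      = `${base}?vad=true`;                // server VAD (default)
const withoutVAD   = `${base}?vad=false`;               // requires manual commits
const withLogprobs = `${base}?logprobs=true`;           // word-level confidence
```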
1. Create/Remix Val
   vt remix emcho/hello-transcription my-transcription
2. Set Environment
   OPENAI_API_KEY in Val Town secrets
3. Deploy
   vt push
4. Access
   https://[your-val-name].val.run
OPENAI_API_KEY - Required for OpenAI API access

- Recording & Export
- Advanced Controls
- Multi-Stream
- Post-Processing
- Visualization
- Integration
transcription vs realtime: the session must use type: "transcription", not type: "realtime".
Hello-Transcription successfully demonstrates the transcription-only capabilities of OpenAI's Realtime API, including real-time streaming transcription with multiple model support.
This implementation serves as a foundation for building transcription-focused applications like live captioning, meeting transcription, subtitle generation, and accessibility tools.