• Blog
  • Docs
  • Pricing
  • We’re hiring!
Log inSign up
Prada

Prada

Locals-AI

A local, truly private, AI experiment in your browser
Public
Like
Locals-AI
Home
Code
14
api
1
components
7
docs
1
hooks
2
lib
5
screens
3
.vtignore
AGENTS.md
App.tsx
README.md
deno.json
H
main.ts
tailwind.config.js
types.ts
Environment variables
2
Branches
12
Pull requests
Remixes
History
Val Town is a collaborative website to build and scale JavaScript apps.
Deploy APIs, crons, & store data – all from the browser, and deployed in milliseconds.
Sign up now
Code
/
docs
/
extended-thinking-research.md
Code
/
docs
/
extended-thinking-research.md
Search
…
Viewing readonly version of main branch: v1192
View latest version
extended-thinking-research.md

Extended Thinking Feature - Research & Implementation Documentation

Project Overview

Project: Locals-AI - Browser-based LLM chat application Tech Stack: React, Deno/Val.town, WebLLM (MLC-AI), Tailwind CSS Models Tested: Llama 3.1 8B, Llama 3.2 3B, Phi-3 Mini Date: February 2026


Goal

Implement a Claude-like "Extended Thinking" feature that:

  1. Shows the model's reasoning process in an expandable UI
  2. Expands while thinking, collapses after completion
  3. Improves answer quality for complex problems
  4. Works with browser-based open source LLMs

Implementation Phases

Phase 1: Basic Thinking Feature

Approach: Single call with <thinking></thinking> tags parsed from streaming response.

Initial System Prompt:

You are a helpful, friendly AI assistant. When answering complex questions,
first reason through your thinking inside <thinking></thinking> tags,
then provide your final answer.

Results:

  • Streaming worked well
  • UI expand/collapse worked
  • Model showed reasoning but quality was inconsistent

Phase 2: Testing Reasoning Quality

Test Questions Used:

QuestionPurpose
"If a shirt costs $25 and is 20% off, and I have a $5 coupon after the discount, what do I pay?"Multi-step math
"I have 3 apples. I buy 2 bags with 6 apples each. I give away half. How many do I have?"Edge case (7.5 apples)
"A bat and ball cost $1.10. The bat costs $1 more than the ball. How much is the ball?"Classic logic trap (answer: $0.05)
"Which is heavier: a pound of feathers or a pound of bricks?"Trick question
"If you pass the person in 2nd place, what place are you in?"Logic trap
"I need to cook pasta (10 min), sauce (20 min), bread (15 min). When to start each so they finish together?"Scheduling - became primary test

Key Finding: The scheduling question consistently failed across multiple prompt iterations.


Phase 3: Prompt Engineering Iterations

Attempt 1: Structured UNDERSTAND/PLAN/EXECUTE/VERIFY

<thinking>
UNDERSTAND: What exactly is being asked?
PLAN: What approach will solve this?
EXECUTE: Work through step-by-step.
VERIFY: Check your answer with actual numbers.
</thinking>

Result: Model attempted structure but made arithmetic errors (10+20=30, not 20) and didn't catch contradictions in VERIFY step.

Attempt 2: Added Explicit Rules

CRITICAL RULES:
1. For TIMING/SCHEDULING: Work BACKWARDS from the goal
   - Find longest task duration
   - Calculate: start_time = end_time - task_duration

Result: Model read instructions but didn't follow them. Still started with "shortest task first" intuition.

Attempt 3: Added Worked Example

Example: If tasks take 10, 15, 20 min and must finish at minute 20:
  * 20-min task starts at minute 0
  * 15-min task starts at minute 5
  * 10-min task starts at minute 10

Result: Closer but still made calculation errors. Model understood concept but couldn't execute arithmetic reliably.


Phase 4: Research on Llama 3.1 8B Limitations

Web Search Findings:

"Llama-3.1-8B showed poor ability to solve school-level math problems. The key problem lies in poor reasoning and a tendency to generate incoherent text."

"This limitation is attributed to the autoregressive nature of LLMs, which generate tokens sequentially and can easily propagate errors in reasoning across multiple steps."

Benchmark Data:

  • MATH benchmark: GPT-4o-mini (70.2) vs Llama 3.1 8B (~51.9)
  • Multi-step reasoning accuracy significantly lower than larger models

What Helps (from research):

  1. Few-shot examples - More effective than zero-shot CoT
  2. Self-consistency - Run multiple times, pick most common answer
  3. Fine-tuned reasoning models - e.g., Llama-3.1-Intuitive-Thinker
  4. Separate arithmetic from reasoning - Model does better on pure calculation

Sources:

  • Artificial Analysis - Llama 3.1 8B
  • Math Capabilities of Llama Models
  • Llama 3.1 Intuitive Thinker

Phase 5: Dual Mode Solution

Based on research, implemented two thinking modes:

Think Mode (Lean)

  • Simple, natural reasoning
  • Single inference call with streaming
  • No rigid structure

Prompt:

When answering questions that benefit from reasoning, think through your
approach inside <thinking></thinking> tags, then provide your answer.

For simple questions, respond directly without thinking tags.

Deep Think Mode (Robust)

  • Structured UNDERSTAND/PLAN/EXECUTE/VERIFY
  • Few-shot examples included in prompt
  • Self-consistency: 3 runs, vote on best answer
  • Combined thinking from all runs shown in UI

Prompt includes worked examples:

EXAMPLE - Scheduling Problem:
Q: Tasks A (10 min), B (15 min), C (20 min) must finish together. When to start each?
<thinking>
UNDERSTAND: 3 tasks, need simultaneous completion.
PLAN: Longest task (C, 20 min) sets total time. Work backwards.
EXECUTE:
- End time = 20 min (set by longest task)
- C: 20-20=0, start at minute 0
- B: 20-15=5, start at minute 5
- A: 20-10=10, start at minute 10
VERIFY: 0+20=20, 5+15=20, 10+10=20. All done at minute 20.
</thinking>
Start C at 0, B at 5, A at 10.

Final Test Results

Scheduling Problem: "Pasta (10), Sauce (20), Bread (15) - finish together"

ModeResultNotes
Off (no thinking)WRONGStarted pasta first
Think (lean)CORRECTNatural reasoning worked
Deep ThinkCORRECTAll 3 runs got it right

Key Discovery

Simple Think mode succeeded where complex structured prompts failed because:

  1. Model reasoned naturally without rigid format
  2. Less cognitive load from complex instructions
  3. "Start longest first, then next longest" intuition worked

This validates the dual-mode approach:

  • Think = Default for most problems
  • Deep = Reserve for known-difficult problems or when Think fails

Technical Implementation

Files Modified

FileChanges
types.tsAdded ThinkingMode, StreamCallback, updated Message with thinking?: string
hooks/useWebLLM.tsTwo prompts, streaming parser, self-consistency logic
components/ThinkingBubble.tsxExpandable UI with auto-expand/collapse
components/MessageBubble.tsxIntegrated ThinkingBubble, detects deep mode
screens/MainScreen.tsx3-state cycle button, progress indicator

Key Code: Thinking Tag Parser

function createThinkingParser() { let state = { phase: "scanning", thinkingBuffer: "", responseBuffer: "", tagBuffer: "" }; return { feed(chunk: string) { for (const char of chunk) { if (state.phase === "scanning") { state.tagBuffer += char; if (state.tagBuffer.endsWith("<thinking>")) { state.phase = "in_thinking"; state.tagBuffer = ""; } } else if (state.phase === "in_thinking") { if (state.tagBuffer.endsWith("</thinking>")) { state.thinkingBuffer += state.tagBuffer.slice(0, -11); state.phase = "in_response"; } // ... buffer management } } return { thinking: state.thinkingBuffer, response: state.responseBuffer, phase }; }, finalize() { /* handle incomplete tags */ } }; }

Key Code: Self-Consistency Voting

const pickBestResult = (results: SendMessageResult[]): SendMessageResult => { // Group by first line of answer (key similarity) const votes = new Map(); for (const result of results) { const key = result.content.split('\n')[0].toLowerCase().slice(0, 100); // ... voting logic } // Return answer with most votes + longest thinking // Combine all thinking for transparency: "[Run 1]\n...\n[Run 2]\n..." };

UI/UX Design

Mode Selector

  • Cycle button: Off (gray) → Think (blue) → Deep (purple)
  • Disabled during generation
  • Tooltip explains each mode

Thinking Bubble

  • Auto-expands while streaming
  • Auto-collapses 1 second after completion
  • Click to toggle expand/collapse
  • Collapsed preview: first 50 chars ellipsed
  • Labels: "Thought process" vs "Deep analysis (3 runs)"

Progress Indicator

  • Deep mode shows: "Deep thinking (2/3)..."
  • Updates as each run completes

Conclusions

What Works for Llama 3.1 8B

  1. Lean prompting - Simple CoT often better than complex structure
  2. Few-shot examples - Critical for pattern-following tasks
  3. Self-consistency - Multiple runs catch errors
  4. Streaming - Good UX, shows progress
  5. Transparent reasoning - Users can verify logic

What Doesn't Work

  1. Complex structured prompts - Model gets confused
  2. Expecting reliable arithmetic - Error propagation is real
  3. Single-shot for hard problems - Need multiple attempts
  4. Trusting VERIFY step - Model doesn't actually catch errors

Recommendations for Similar Projects

  1. Start with simple prompts, add complexity only if needed
  2. Implement multiple thinking modes for different use cases
  3. Use self-consistency for high-stakes answers
  4. Show reasoning to users for transparency and debugging
  5. Accept model limitations - some problems need bigger models

Future Improvements

  • Configurable number of runs for deep mode
  • Show all 3 answers in expandable comparison view
  • Confidence indicator based on agreement level
  • Fine-tuned reasoning model option (Llama-3.1-Intuitive-Thinker)
  • Automatic mode selection based on question complexity

Appendix: Test Prompts for Validation

Scheduling (Primary Test)

I need to cook pasta, make sauce, and bake garlic bread.
The pasta takes 10 min, sauce takes 20 min, bread takes 15 min.
In what order should I start things so everything is ready at the same time?

Correct: Sauce at 0, Bread at 5, Pasta at 10

Alternative Scheduling Tests

I need to wash the car (25 min), vacuum the interior (10 min),
and wax the exterior (20 min). What time should I start each
so they all finish together?

Correct: Wash at 0, Wax at 5, Vacuum at 15

Math with Verification

A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
How much does the ball cost?

Correct: $0.05 (common wrong answer: $0.10)

Edge Case Detection

I have 3 apples. I buy 2 bags with 6 apples each.
I give away half of all my apples. How many do I have?

Correct: 7 or 8 (model should note you can't give away 7.5 apples)

FeaturesVersion controlCode intelligenceCLIMCP
Use cases
TeamsAI agentsSlackGTM
DocsShowcaseTemplatesNewestTrendingAPI examplesNPM packages
PricingNewsletterBlogAboutCareers
We’re hiring!
Brandhi@val.townStatus
X (Twitter)
Discord community
GitHub discussions
YouTube channel
Bluesky
Open Source Pledge
Terms of usePrivacy policyAbuse contact
© 2026 Val Town, Inc.