Extended Thinking Feature - Research & Implementation Documentation

Project Overview

Project: Locals-AI - Browser-based LLM chat application Tech Stack: React, Deno/Val.town, WebLLM (MLC-AI), Tailwind CSS Models Tested: Llama 3.1 8B, Llama 3.2 3B, Phi-3 Mini Date: February 2026

Goal

Implement a Claude-like "Extended Thinking" feature that:

Shows the model's reasoning process in an expandable UI
Expands while thinking, collapses after completion
Improves answer quality for complex problems
Works with browser-based open source LLMs

Implementation Phases

Phase 1: Basic Thinking Feature

Approach: Single call with <thinking></thinking> tags parsed from streaming response.

Initial System Prompt:

You are a helpful, friendly AI assistant. When answering complex questions,
first reason through your thinking inside <thinking></thinking> tags,
then provide your final answer.

Results:

Streaming worked well
UI expand/collapse worked
Model showed reasoning but quality was inconsistent

Phase 2: Testing Reasoning Quality

Test Questions Used:

Question	Purpose
"If a shirt costs $25 and is 20% off, and I have a $5 coupon after the discount, what do I pay?"	Multi-step math
"I have 3 apples. I buy 2 bags with 6 apples each. I give away half. How many do I have?"	Edge case (7.5 apples)
"A bat and ball cost $1.10. The bat costs $1 more than the ball. How much is the ball?"	Classic logic trap (answer: $0.05)
"Which is heavier: a pound of feathers or a pound of bricks?"	Trick question
"If you pass the person in 2nd place, what place are you in?"	Logic trap
"I need to cook pasta (10 min), sauce (20 min), bread (15 min). When to start each so they finish together?"	Scheduling - became primary test

Key Finding: The scheduling question consistently failed across multiple prompt iterations.

Phase 3: Prompt Engineering Iterations

Attempt 1: Structured UNDERSTAND/PLAN/EXECUTE/VERIFY

<thinking>
UNDERSTAND: What exactly is being asked?
PLAN: What approach will solve this?
EXECUTE: Work through step-by-step.
VERIFY: Check your answer with actual numbers.
</thinking>

Result: Model attempted structure but made arithmetic errors (10+20=30, not 20) and didn't catch contradictions in VERIFY step.

Attempt 2: Added Explicit Rules

CRITICAL RULES:
1. For TIMING/SCHEDULING: Work BACKWARDS from the goal
   - Find longest task duration
   - Calculate: start_time = end_time - task_duration

Result: Model read instructions but didn't follow them. Still started with "shortest task first" intuition.

Attempt 3: Added Worked Example

Example: If tasks take 10, 15, 20 min and must finish at minute 20:
  * 20-min task starts at minute 0
  * 15-min task starts at minute 5
  * 10-min task starts at minute 10

Result: Closer but still made calculation errors. Model understood concept but couldn't execute arithmetic reliably.

Phase 4: Research on Llama 3.1 8B Limitations

Web Search Findings:

"Llama-3.1-8B showed poor ability to solve school-level math problems. The key problem lies in poor reasoning and a tendency to generate incoherent text."

"This limitation is attributed to the autoregressive nature of LLMs, which generate tokens sequentially and can easily propagate errors in reasoning across multiple steps."

Benchmark Data:

MATH benchmark: GPT-4o-mini (70.2) vs Llama 3.1 8B (~51.9)
Multi-step reasoning accuracy significantly lower than larger models

What Helps (from research):

Few-shot examples - More effective than zero-shot CoT
Self-consistency - Run multiple times, pick most common answer
Fine-tuned reasoning models - e.g., Llama-3.1-Intuitive-Thinker
Separate arithmetic from reasoning - Model does better on pure calculation

Sources:

Phase 5: Dual Mode Solution

Based on research, implemented two thinking modes:

Think Mode (Lean)

Simple, natural reasoning
Single inference call with streaming
No rigid structure

Prompt:

When answering questions that benefit from reasoning, think through your
approach inside <thinking></thinking> tags, then provide your answer.

For simple questions, respond directly without thinking tags.

Deep Think Mode (Robust)

Structured UNDERSTAND/PLAN/EXECUTE/VERIFY
Few-shot examples included in prompt
Self-consistency: 3 runs, vote on best answer
Combined thinking from all runs shown in UI

Prompt includes worked examples:

EXAMPLE - Scheduling Problem:
Q: Tasks A (10 min), B (15 min), C (20 min) must finish together. When to start each?
<thinking>
UNDERSTAND: 3 tasks, need simultaneous completion.
PLAN: Longest task (C, 20 min) sets total time. Work backwards.
EXECUTE:
- End time = 20 min (set by longest task)
- C: 20-20=0, start at minute 0
- B: 20-15=5, start at minute 5
- A: 20-10=10, start at minute 10
VERIFY: 0+20=20, 5+15=20, 10+10=20. All done at minute 20.
</thinking>
Start C at 0, B at 5, A at 10.

Final Test Results

Scheduling Problem: "Pasta (10), Sauce (20), Bread (15) - finish together"

Mode	Result	Notes
Off (no thinking)	WRONG	Started pasta first
Think (lean)	CORRECT	Natural reasoning worked
Deep Think	CORRECT	All 3 runs got it right

Key Discovery

Simple Think mode succeeded where complex structured prompts failed because:

Model reasoned naturally without rigid format
Less cognitive load from complex instructions
"Start longest first, then next longest" intuition worked

This validates the dual-mode approach:

Think = Default for most problems
Deep = Reserve for known-difficult problems or when Think fails

Technical Implementation

Files Modified

File	Changes
`types.ts`	Added `ThinkingMode`, `StreamCallback`, updated `Message` with `thinking?: string`
`hooks/useWebLLM.ts`	Two prompts, streaming parser, self-consistency logic
`components/ThinkingBubble.tsx`	Expandable UI with auto-expand/collapse
`components/MessageBubble.tsx`	Integrated ThinkingBubble, detects deep mode
`screens/MainScreen.tsx`	3-state cycle button, progress indicator

Key Code: Thinking Tag Parser

function createThinkingParser() {
  let state = { phase: "scanning", thinkingBuffer: "", responseBuffer: "", tagBuffer: "" };

  return {
    feed(chunk: string) {
      for (const char of chunk) {
        if (state.phase === "scanning") {
          state.tagBuffer += char;
          if (state.tagBuffer.endsWith("<thinking>")) {
            state.phase = "in_thinking";
            state.tagBuffer = "";
          }
        } else if (state.phase === "in_thinking") {
          if (state.tagBuffer.endsWith("</thinking>")) {
            state.thinkingBuffer += state.tagBuffer.slice(0, -11);
            state.phase = "in_response";
          }
          // ... buffer management
        }
      }
      return { thinking: state.thinkingBuffer, response: state.responseBuffer, phase };
    },
    finalize() { /* handle incomplete tags */ }
  };
}

Key Code: Self-Consistency Voting

const pickBestResult = (results: SendMessageResult[]): SendMessageResult => {
  // Group by first line of answer (key similarity)
  const votes = new Map();
  for (const result of results) {
    const key = result.content.split('\n')[0].toLowerCase().slice(0, 100);
    // ... voting logic
  }

  // Return answer with most votes + longest thinking
  // Combine all thinking for transparency: "[Run 1]\n...\n[Run 2]\n..."
};

UI/UX Design

Mode Selector

Cycle button: Off (gray) → Think (blue) → Deep (purple)
Disabled during generation
Tooltip explains each mode

Thinking Bubble

Auto-expands while streaming
Auto-collapses 1 second after completion
Click to toggle expand/collapse
Collapsed preview: first 50 chars ellipsed
Labels: "Thought process" vs "Deep analysis (3 runs)"

Progress Indicator

Deep mode shows: "Deep thinking (2/3)..."
Updates as each run completes

Conclusions

What Works for Llama 3.1 8B

Lean prompting - Simple CoT often better than complex structure
Few-shot examples - Critical for pattern-following tasks
Self-consistency - Multiple runs catch errors
Streaming - Good UX, shows progress
Transparent reasoning - Users can verify logic

What Doesn't Work

Complex structured prompts - Model gets confused
Expecting reliable arithmetic - Error propagation is real
Single-shot for hard problems - Need multiple attempts
Trusting VERIFY step - Model doesn't actually catch errors

Recommendations for Similar Projects

Start with simple prompts, add complexity only if needed
Implement multiple thinking modes for different use cases
Use self-consistency for high-stakes answers
Show reasoning to users for transparency and debugging
Accept model limitations - some problems need bigger models

Future Improvements

Configurable number of runs for deep mode
Show all 3 answers in expandable comparison view
Confidence indicator based on agreement level
Fine-tuned reasoning model option (Llama-3.1-Intuitive-Thinker)
Automatic mode selection based on question complexity

Appendix: Test Prompts for Validation

Scheduling (Primary Test)

I need to cook pasta, make sauce, and bake garlic bread.
The pasta takes 10 min, sauce takes 20 min, bread takes 15 min.
In what order should I start things so everything is ready at the same time?

Correct: Sauce at 0, Bread at 5, Pasta at 10

Alternative Scheduling Tests

I need to wash the car (25 min), vacuum the interior (10 min),
and wax the exterior (20 min). What time should I start each
so they all finish together?

Correct: Wash at 0, Wax at 5, Vacuum at 15

Math with Verification

A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
How much does the ball cost?

Correct: $0.05 (common wrong answer: $0.10)

Edge Case Detection

I have 3 apples. I buy 2 bags with 6 apples each.
I give away half of all my apples. How many do I have?

Correct: 7 or 8 (model should note you can't give away 7.5 apples)

Prada

Locals-AI