Project: Locals-AI - Browser-based LLM chat application Tech Stack: React, Deno/Val.town, WebLLM (MLC-AI), Tailwind CSS Models Tested: Llama 3.1 8B, Llama 3.2 3B, Phi-3 Mini Date: February 2026
Implement a Claude-like "Extended Thinking" feature that:
- Shows the model's reasoning process in an expandable UI
- Expands while thinking, collapses after completion
- Improves answer quality for complex problems
- Works with browser-based open source LLMs
Approach: Single call with <thinking></thinking> tags parsed from streaming response.
Initial System Prompt:
You are a helpful, friendly AI assistant. When answering complex questions,
first reason through your thinking inside <thinking></thinking> tags,
then provide your final answer.
Results:
- Streaming worked well
- UI expand/collapse worked
- Model showed reasoning but quality was inconsistent
Test Questions Used:
| Question | Purpose |
|---|---|
| "If a shirt costs $25 and is 20% off, and I have a $5 coupon after the discount, what do I pay?" | Multi-step math |
| "I have 3 apples. I buy 2 bags with 6 apples each. I give away half. How many do I have?" | Edge case (7.5 apples) |
| "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much is the ball?" | Classic logic trap (answer: $0.05) |
| "Which is heavier: a pound of feathers or a pound of bricks?" | Trick question |
| "If you pass the person in 2nd place, what place are you in?" | Logic trap |
| "I need to cook pasta (10 min), sauce (20 min), bread (15 min). When to start each so they finish together?" | Scheduling - became primary test |
Key Finding: The scheduling question consistently failed across multiple prompt iterations.
<thinking>
UNDERSTAND: What exactly is being asked?
PLAN: What approach will solve this?
EXECUTE: Work through step-by-step.
VERIFY: Check your answer with actual numbers.
</thinking>
Result: Model attempted structure but made arithmetic errors (10+20=30, not 20) and didn't catch contradictions in VERIFY step.
CRITICAL RULES:
1. For TIMING/SCHEDULING: Work BACKWARDS from the goal
- Find longest task duration
- Calculate: start_time = end_time - task_duration
Result: Model read instructions but didn't follow them. Still started with "shortest task first" intuition.
Example: If tasks take 10, 15, 20 min and must finish at minute 20:
* 20-min task starts at minute 0
* 15-min task starts at minute 5
* 10-min task starts at minute 10
Result: Closer but still made calculation errors. Model understood concept but couldn't execute arithmetic reliably.
Web Search Findings:
"Llama-3.1-8B showed poor ability to solve school-level math problems. The key problem lies in poor reasoning and a tendency to generate incoherent text."
"This limitation is attributed to the autoregressive nature of LLMs, which generate tokens sequentially and can easily propagate errors in reasoning across multiple steps."
Benchmark Data:
- MATH benchmark: GPT-4o-mini (70.2) vs Llama 3.1 8B (~51.9)
- Multi-step reasoning accuracy significantly lower than larger models
What Helps (from research):
- Few-shot examples - More effective than zero-shot CoT
- Self-consistency - Run multiple times, pick most common answer
- Fine-tuned reasoning models - e.g., Llama-3.1-Intuitive-Thinker
- Separate arithmetic from reasoning - Model does better on pure calculation
Sources:
Based on research, implemented two thinking modes:
- Simple, natural reasoning
- Single inference call with streaming
- No rigid structure
Prompt:
When answering questions that benefit from reasoning, think through your
approach inside <thinking></thinking> tags, then provide your answer.
For simple questions, respond directly without thinking tags.
- Structured UNDERSTAND/PLAN/EXECUTE/VERIFY
- Few-shot examples included in prompt
- Self-consistency: 3 runs, vote on best answer
- Combined thinking from all runs shown in UI
Prompt includes worked examples:
EXAMPLE - Scheduling Problem:
Q: Tasks A (10 min), B (15 min), C (20 min) must finish together. When to start each?
<thinking>
UNDERSTAND: 3 tasks, need simultaneous completion.
PLAN: Longest task (C, 20 min) sets total time. Work backwards.
EXECUTE:
- End time = 20 min (set by longest task)
- C: 20-20=0, start at minute 0
- B: 20-15=5, start at minute 5
- A: 20-10=10, start at minute 10
VERIFY: 0+20=20, 5+15=20, 10+10=20. All done at minute 20.
</thinking>
Start C at 0, B at 5, A at 10.
| Mode | Result | Notes |
|---|---|---|
| Off (no thinking) | WRONG | Started pasta first |
| Think (lean) | CORRECT | Natural reasoning worked |
| Deep Think | CORRECT | All 3 runs got it right |
Simple Think mode succeeded where complex structured prompts failed because:
- Model reasoned naturally without rigid format
- Less cognitive load from complex instructions
- "Start longest first, then next longest" intuition worked
This validates the dual-mode approach:
- Think = Default for most problems
- Deep = Reserve for known-difficult problems or when Think fails
| File | Changes |
|---|---|
types.ts | Added ThinkingMode, StreamCallback, updated Message with thinking?: string |
hooks/useWebLLM.ts | Two prompts, streaming parser, self-consistency logic |
components/ThinkingBubble.tsx | Expandable UI with auto-expand/collapse |
components/MessageBubble.tsx | Integrated ThinkingBubble, detects deep mode |
screens/MainScreen.tsx | 3-state cycle button, progress indicator |
function createThinkingParser() {
let state = { phase: "scanning", thinkingBuffer: "", responseBuffer: "", tagBuffer: "" };
return {
feed(chunk: string) {
for (const char of chunk) {
if (state.phase === "scanning") {
state.tagBuffer += char;
if (state.tagBuffer.endsWith("<thinking>")) {
state.phase = "in_thinking";
state.tagBuffer = "";
}
} else if (state.phase === "in_thinking") {
if (state.tagBuffer.endsWith("</thinking>")) {
state.thinkingBuffer += state.tagBuffer.slice(0, -11);
state.phase = "in_response";
}
// ... buffer management
}
}
return { thinking: state.thinkingBuffer, response: state.responseBuffer, phase };
},
finalize() { /* handle incomplete tags */ }
};
}
const pickBestResult = (results: SendMessageResult[]): SendMessageResult => {
// Group by first line of answer (key similarity)
const votes = new Map();
for (const result of results) {
const key = result.content.split('\n')[0].toLowerCase().slice(0, 100);
// ... voting logic
}
// Return answer with most votes + longest thinking
// Combine all thinking for transparency: "[Run 1]\n...\n[Run 2]\n..."
};
- Cycle button: Off (gray) → Think (blue) → Deep (purple)
- Disabled during generation
- Tooltip explains each mode
- Auto-expands while streaming
- Auto-collapses 1 second after completion
- Click to toggle expand/collapse
- Collapsed preview: first 50 chars ellipsed
- Labels: "Thought process" vs "Deep analysis (3 runs)"
- Deep mode shows: "Deep thinking (2/3)..."
- Updates as each run completes
- Lean prompting - Simple CoT often better than complex structure
- Few-shot examples - Critical for pattern-following tasks
- Self-consistency - Multiple runs catch errors
- Streaming - Good UX, shows progress
- Transparent reasoning - Users can verify logic
- Complex structured prompts - Model gets confused
- Expecting reliable arithmetic - Error propagation is real
- Single-shot for hard problems - Need multiple attempts
- Trusting VERIFY step - Model doesn't actually catch errors
- Start with simple prompts, add complexity only if needed
- Implement multiple thinking modes for different use cases
- Use self-consistency for high-stakes answers
- Show reasoning to users for transparency and debugging
- Accept model limitations - some problems need bigger models
- Configurable number of runs for deep mode
- Show all 3 answers in expandable comparison view
- Confidence indicator based on agreement level
- Fine-tuned reasoning model option (Llama-3.1-Intuitive-Thinker)
- Automatic mode selection based on question complexity
I need to cook pasta, make sauce, and bake garlic bread.
The pasta takes 10 min, sauce takes 20 min, bread takes 15 min.
In what order should I start things so everything is ready at the same time?
Correct: Sauce at 0, Bread at 5, Pasta at 10
I need to wash the car (25 min), vacuum the interior (10 min),
and wax the exterior (20 min). What time should I start each
so they all finish together?
Correct: Wash at 0, Wax at 5, Vacuum at 15
A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
How much does the ball cost?
Correct: $0.05 (common wrong answer: $0.10)
I have 3 apples. I buy 2 bags with 6 apples each.
I give away half of all my apples. How many do I have?
Correct: 7 or 8 (model should note you can't give away 7.5 apples)