The Thought That Counts: Exploring Chain of Thought Prompting
We reproduce chain-of-thought experiments across base, instruction-tuned, and reasoning models. The results suggest that reported CoT gains on modern models are largely artifacts of suppression rather than genuine reasoning improvements.
1 Chain of Thought?
Chain-of-thought (CoT) prompting is the elegant idea of supplying models with examples that demonstrate step-by-step reasoning [1]. This not only induces the model to reason, but also produces more accurate outputs. Building on CoT, subsequent work found surprising results: showing wrong reasoning could still elicit similar accuracy gains [2]. Alternatively, simply adding the phrase “Let’s think step by step” worked without any examples at all [3].
Curiously, some studies found that the quality of the generated reasoning did not correlate with getting correct answers. [2] showed that prompting with invalid reasoning steps can achieve 80-90% of the performance obtained using valid CoT, while the model still generates coherent-looking lines of reasoning during inference. If accurate reasoning isn’t necessary for correctness, then what is CoT actually doing?
Another line of research showed that allowing a larger generation budget at inference time provides the model with additional computation, boosting performance [4]. This idea has been made explicit by training models to emit “pause tokens”, meaningless filler tokens that provide extra computation steps [5]. Therefore, one hypothesis is that part of the CoT advantage comes from the computational benefit of generating more tokens; more output tokens means more “thinking time.”
This explanation of increased “thinking time” feels incomplete. If CoT’s benefit came purely from additional computation, we’d expect any filler tokens to help, yet [5] found that pause tokens only work when the model is trained to use them. On the other hand, if CoT worked by teaching models how to reason through demonstrations, we’d expect wrong reasoning to hurt, yet it doesn’t. We ran a series of experiments to build intuition and understanding.
2 Reproducing Chain of Thought Prompting
To better understand CoT, we began by reproducing experiments from several papers across a wide range of model types: base models, instruction-tuned models, and reasoning models. We experiment with the GSM8K dataset and the following models:
| Model | Type | Provider |
|---|---|---|
| Llama-3.1-8B | Base | HuggingFace |
| Llama-3.1-8B-Instruct | Instruction-tuned | HuggingFace |
| Llama-3.1-70B-Instruct | Instruction-tuned (large) | Fireworks AI |
| GPT-3.5-turbo | Instruction-tuned | OpenAI |
| GPT-4o | Frontier | OpenAI |
| Qwen3-8B | Reasoning model | Fireworks AI |
With each of these models, we explored several prompting conditions, described below. For few-shot prompting, we tested standard_cot, the original [1] format with 8 exemplars containing step-by-step reasoning, and no_cot, which uses the same exemplars but with direct answers only (no reasoning shown). For zero-shot prompting, we tested zero_shot_none, a straightforward question-answer format with no trigger, and zero_shot_cot, which appends “Let’s think step by step” to the prompt following [3]. Finally, we include two ablations to test what actually drives CoT effects: wrong_reasoning (exemplars with incorrect reasoning but correct final answers) and logic_reasoning (exemplars using formal logic notation instead of math steps).
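To make the conditions concrete, here is a sketch of how such prompts can be assembled. The exemplar content is an illustrative placeholder (not an actual GSM8K exemplar from [1]), and `build_prompt` is our naming, not code from any of the papers:

```python
# Illustrative sketch of the six prompting conditions.
# In the real setup there are 8 exemplars; one placeholder is shown here.
EXEMPLARS = [
    {
        "question": "Ali has 3 apples and buys 2 more. How many apples does he have?",
        "reasoning": "Ali starts with 3 apples. He buys 2 more. 3 + 2 = 5.",
        "answer": "5",
    },
]

def build_prompt(condition: str, question: str) -> str:
    """Assemble the prompt text for one of the six conditions."""
    if condition == "zero_shot_none":
        return f"Q: {question}\nA:"
    if condition == "zero_shot_cot":
        return f"Q: {question}\nA: Let's think step by step."
    blocks = []
    for ex in EXEMPLARS:
        if condition == "no_cot":
            # Same exemplars, but direct answers with no reasoning shown.
            blocks.append(f"Q: {ex['question']}\nA: The answer is {ex['answer']}.")
        else:
            # standard_cot, wrong_reasoning, and logic_reasoning differ only
            # in which reasoning text the exemplars carry.
            blocks.append(
                f"Q: {ex['question']}\nA: {ex['reasoning']} "
                f"The answer is {ex['answer']}."
            )
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

The ablations reuse the same scaffold: wrong_reasoning swaps in incorrect reasoning text while keeping the final answers, and logic_reasoning swaps in formal-logic notation.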
Note
Answer Extraction: A critical detail in CoT evaluation is how you extract the final answer from the model’s output. Base models and instruction-tuned models produce very different formats. For example, some models use GSM8K’s native #### 42, others use LaTeX \boxed{42}, and many just bury the answer in prose. We implemented a multi-stage extraction pipeline: first attempting fast regex matching for known formats, then falling back to the two-stage LLM extraction method from [3] (appending “Therefore, the answer (arabic numerals) is” and having the model complete it), with a final fallback to GPT-3.5-turbo for stubborn cases. This matters more than it might seem: naive extraction that assumes a particular format can systematically bias results toward models that happen to match that format.
In the following sections, we present the results in stages: first reproducing CoT on base models (Section 2.1), then examining instruction-tuned models, where something unexpected happens (Sections 2.2–2.3), and finally characterizing the different model behaviors we observe.
2.1 Reproducing CoT on a Base Model
We start with Llama-3.1-8B, a base model with no instruction tuning. This is the setting closest to the original CoT papers. The table below shows the results of prompting Llama-3.1-8B-Base with the 6 CoT conditions.
| Condition | Llama-3.1-8B-Base |
|---|---|
| zero_shot_none | 24.6% |
| zero_shot_cot | 49.7% |
| standard_cot | 58.2% |
| no_cot | 14.6% |
| wrong_reasoning | 53.8% |
| logic_reasoning | 15.7% |
The results replicate the findings of [1]. The base model struggles zero-shot (24.6%) but improves dramatically with CoT prompting, whether via a zero-shot trigger (49.7%) or few-shot exemplars (58.2%). The real CoT gain here is +33.6 percentage points. The model doesn’t reason by default; the prompting strategy helps.
Interestingly, wrong_reasoning, where the exemplars contain incorrect math but correct final answers, still achieves 53.8%. The model seems to benefit from the format of showing work, not from the correctness of the reasoning. Finally, logic_reasoning (15.7%) yields no gain over the no_cot baseline (14.6%).
2.2 What About Instruction-Tuned Models?
Now we run the same experiment on instruction-tuned models, the kind most people actually use.
| Condition | L-8B-Base | L-8B-Instruct | GPT-3.5 |
|---|---|---|---|
| zero_shot_none | 24.6% | 63.1% | 73.8% |
| zero_shot_cot | 49.7% | 72.8% | 70.1% |
| standard_cot | 58.2% | 77.6% | 76.5% |
| no_cot | 14.6% | 17.4% | 31.5% |
| wrong_reasoning | 53.8% | 74.5% | 79.3% |
| logic_reasoning | 15.7% | 16.4% | 12.4% |
Something unexpected emerges. First, look at the zero_shot_none row: instruction-tuned models already achieve 63-74% accuracy with no prompting at all. These models reason by default. Second, look at no_cot: when we show these models few-shot exemplars with direct answers (no reasoning), performance collapses, from 63% to 17% for Llama-Instruct and from 74% to 31% for GPT-3.5. The few-shot direct-answer format doesn’t establish a neutral baseline; it actively suppresses the model’s natural reasoning.
2.3 The Suppression Effect
To quantify what’s happening, we compute three metrics:
- Suppression Effect = zero_shot_none − no_cot (how much direct-answer exemplars hurt)
- Apparent CoT Gain = standard_cot − no_cot (the gain papers typically report)
- Real CoT Gain = standard_cot − zero_shot_none (actual improvement over doing nothing)
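All three metrics are simple differences of per-condition accuracies. A minimal helper (the `cot_metrics` name is ours):

```python
def cot_metrics(acc: dict[str, float]) -> dict[str, float]:
    """Compute the three diagnostics from per-condition accuracies (in %)."""
    return {
        "suppression": acc["zero_shot_none"] - acc["no_cot"],
        "apparent_gain": acc["standard_cot"] - acc["no_cot"],
        "real_gain": acc["standard_cot"] - acc["zero_shot_none"],
    }

# GPT-3.5 accuracies from the table in Section 2.2.
gpt35 = {"zero_shot_none": 73.8, "standard_cot": 76.5, "no_cot": 31.5}
```

Running `cot_metrics(gpt35)` reproduces the GPT-3.5 row of the table below: roughly +42 points of suppression, a +45-point apparent gain, but only a +3-point real gain.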
| Model | Zero-shot | Suppression | Apparent Gain | Real Gain |
|---|---|---|---|---|
| L-8B-Base | 24.6% | +10 pts | +44 pts | +34 pts |
| L-8B-Instruct | 63.1% | +46 pts | +60 pts | +14 pts |
| GPT-3.5 | 73.8% | +42 pts | +45 pts | +3 pts |
For the base model, most of the apparent gain is real: it genuinely benefits from CoT. For instruction-tuned models, however, most of the apparent gain is suppression release. GPT-3.5’s “45-point CoT gain” is really a 3-point gain over just asking the question directly.
2.4 Three Types of Models
We extended the experiment to two more models, GPT-4o and Llama-3.1-70B-Instruct. A clear taxonomy emerges:
| Condition | L-8B-Base | L-8B-Inst | GPT-3.5 | GPT-4o | L-70B-Inst |
|---|---|---|---|---|---|
| zero_shot_none | 24.6% | 63.1% | 73.8% | 86.8% | 92.5% |
| standard_cot | 58.2% | 77.6% | 76.5% | 95.6% | 93.6% |
| no_cot | 14.6% | 17.4% | 31.5% | 87.9% | 90.2% |
| Suppression | +10 pts | +46 pts | +42 pts | −1 pt | +2 pts |
Type A: Suppressible models like GPT-3.5 and Llama-8B-Instruct have high zero-shot baselines (63-74%) but suffer massive suppression from direct-answer exemplars (42-46 pts). Their real CoT gain is small — most of the apparent improvement is just recovering from suppression.
Type B: Exemplar-Dependent models like Llama-8B-Base have low zero-shot baselines (around 25%) and genuinely benefit from CoT (+34 pts). This is what the original papers studied — models that need prompting to reason.
Type C: Unsuppressible models like GPT-4o and Llama-70B-Instruct have very high zero-shot baselines (87-93%) and are immune to format manipulation. Nothing helps or hurts much; they reason robustly regardless of prompting strategy.
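The taxonomy can be summarized as a small decision rule over the two diagnostics. The thresholds below are illustrative choices that happen to separate the models in our tables, not values from any fitted model:

```python
def classify_model(zero_shot: float, suppression: float) -> str:
    """Rough Type A/B/C classifier from zero-shot accuracy (%) and
    suppression effect (points). Thresholds are illustrative only."""
    if zero_shot < 50:
        return "B"  # exemplar-dependent: low baseline, CoT genuinely helps
    if suppression > 20:
        return "A"  # suppressible: high baseline, large format sensitivity
    return "C"      # unsuppressible: high baseline, robust to format
```

For example, Llama-8B-Base (24.6%, +10 pts) lands in Type B, GPT-3.5 (73.8%, +42 pts) in Type A, and GPT-4o (86.8%, −1 pt) in Type C.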
Notably, this taxonomy tracks reasoning capacity, not training procedure. Qwen3-8B with thinking disabled is instruction-tuned but behaves like Type B: low zero-shot accuracy (33%), with exemplars helping rather than hurting (+50 pts over no_cot). The model has instruction-following machinery but lacks the reasoning capacity that creates suppressibility; Type A behavior requires both.
2.5 What Creates Unsuppressibility?
What separates Type C models (unsuppressible) from Type A (suppressible)? Our experiments point to reasoning capacity, whether from scale or explicit reasoning mechanisms (e.g., thinking).
Scale: Comparing Llama at 8B vs 70B parameters:
| Condition | 8B-Instruct | 70B-Instruct |
|---|---|---|
| zero_shot_none | 63.1% | 92.5% |
| no_cot | 17.4% | 90.2% |
| Suppression | +46 pts | +2 pts |
The 8B model loses 46 points to suppression; the 70B model barely moves. Scale alone creates robustness.
Thinking Mode: Qwen3-8B has a toggleable reasoning parameter:
| Condition | Thinking=None | Thinking=Low | Thinking=High |
|---|---|---|---|
| zero_shot_none | 33.0% | 93.7% | 95.2% |
| no_cot | 82.8% | 93.8% | 92.8% |
The jump from None to Low is 60.7 percentage points, while increasing the thinking level further barely matters: thinking is effectively binary. Same weights, same architecture, one parameter. Explicit reasoning creates unsuppressibility.
3 Summary
Our experiments replicate the findings of [1] for base models: CoT helps, with gains of +34 percentage points. Instruction-tuned models, on the other hand, tell a different story. They already reason by default, achieving 63-74% accuracy with no prompting at all. The apparent “CoT gains” on these models are mostly suppression release; the real improvement is just 3-14 points, not the 45-60 points you’d conclude from comparing standard_cot to no_cot. What matters is format, not content: wrong reasoning works just as well as correct reasoning. And for sufficiently capable models, none of this matters at all: scale and explicit reasoning mechanisms create unsuppressibility, making large models and reasoning models immune to prompting tricks entirely.
Limitations: Our experiments focus on GSM8K, a math word problem benchmark. Whether the suppression effect generalizes to other reasoning tasks (commonsense, symbolic, multi-hop) remains an open question. We also tested a limited set of models; the boundaries of the Type A/B/C taxonomy across model families and sizes deserve further investigation.
References
- [1] Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-thought prompting elicits reasoning in large language models. In: Advances in neural information processing systems. 2022. p. 24824–37.
- [2] Wang B, Min S, Deng X, Shen J, Wu Y, Zettlemoyer L, et al. Towards understanding chain-of-thought prompting: an empirical study of what matters. In: Proceedings of the 61st annual meeting of the association for computational linguistics. 2023. p. 2717–39.
- [3] Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. In: Advances in neural information processing systems. 2022. p. 22199–213.
- [4] Snell C, Lee J, Xu K, Kumar A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. 2024.
- [5] Goyal S, Ji Z, Rawat AS, Menon AK, Kumar S, Nagarajan V. Think before you speak: training language models with pause tokens. In: International conference on learning representations. 2024.