A 4B model at 86% HumanEval.

Neural-Physical Chain of Thought (NPCoT) plus a verifier-gated compounding stack lifts Qwen3.5-4B from 58.5% to 85.98% pass@1 on HumanEval, ahead of the Qwen3.5-9B baseline, for $0.39 of GPU time. Every solve persists into a prompt cache, so the next matching request returns at zero cost. No extra training. No larger model.

+27.5 pt · Qwen3.5-4B pass@1
130 KB · WASM runtime
475 KB · native binary
$0.39 · autoresearch cost
552 · tests passing

One thesis, five subsystems.

A computer can be built out of learned components — and once the whole execution stack is differentiable, programs stop being things you write and become things you search for by gradient descent. NPCoT is one pillar of nCPU, not the whole project.

1 · The neural computer

Every ALU operation is a trained network — 100% exact 32-bit integer arithmetic, exhaustively verified. Neural multiplication is 12× faster than neural addition.

2 · The GPU computer

A complete UNIX machine on a single GPU: 25-command shell, a self-hosting C compiler, real BusyBox and Alpine Linux v3.20, ~1.9M IPS with zero timing variance.

3 · Differentiable synthesis

Programs discovered by gradient descent through a differentiable CPU: Mog at 315/315 and nSynth at 105/105 benchmark coverage. Try it live, or watch it learn Pong.

4 · The coprocessor

The neural ALU injected into a transformer forward pass. Qwen3.5-2B arithmetic: 14.5% → 71.0%. Measured on real HumanEval, not extrapolated.

5 · JEPA machine dynamics

A predictive world model of the computer itself — latent speculation and anomaly detection over an exact execution substrate with unlimited free ground truth.

The headline above — 86% HumanEval from a 4B model — is what happens when pillar 3 is pointed at code. Pillars 1, 2, and 5 are the computer it runs on.


What actually happens.

Conventional chain-of-thought asks a language model to emit reasoning as tokens and trusts it to follow them. NPCoT compiles reasoning into a discrete program inside the forward pass, then caches the program so the next invocation runs without a single gradient op.

1. Train once

A transformer's hidden state drives a differentiable array-reduction head. Gradient descent finds the program that matches the target.

2. Crystallize

When soft-path and discrete-path outputs agree within threshold, the 5-tuple program is cached in the library with a signed fingerprint.

3. Reuse everywhere

Consult the library on new hidden states: at a 100% hit rate, consult plus execute takes ~4 ns. Runs on CPU, GPU, WASM, or a 475 KB standalone binary.
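The crystallize and reuse steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the nCPU implementation: the op set, the function names (`crystallize`, `consult`), the thresholds, and the idea of storing only an op name instead of the full 5-tuple are all hypothetical. It shows the agreement gate (soft path vs. discrete path) and a cosine-similarity lookup on a new hidden state.

```python
import math

# Hypothetical op set; the text does not list the real reduction ops.
OPS = {"sum": sum, "max": max, "min": min}

def soft_output(weights, xs):
    """Soft path: softmax-weighted blend of every op's result."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum(e / total * op(xs) for e, op in zip(exps, OPS.values()))

def discrete_output(weights, xs):
    """Discrete path: run only the highest-weighted op."""
    names = list(OPS)
    best = max(range(len(weights)), key=lambda i: weights[i])
    return names[best], OPS[names[best]](xs)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def crystallize(library, hidden, weights, xs, threshold=1e-3):
    """Cache the discrete program only if both paths agree within threshold."""
    name, hard = discrete_output(weights, xs)
    if abs(soft_output(weights, xs) - hard) <= threshold:
        library.append((hidden, name))  # assumed entry shape: (key vector, op name)
        return True
    return False

def consult(library, hidden, min_sim=0.95):
    """Return the cached op name for the nearest key, or None on a miss."""
    best = max(library, key=lambda entry: cosine(entry[0], hidden), default=None)
    if best is not None and cosine(best[0], hidden) >= min_sim:
        return best[1]
    return None
```

A sharply peaked weight vector crystallizes because the soft blend collapses onto the single winning op; a nearly flat one does not, so nothing is cached until training has committed to a program.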

Verified on real hardware.

No synthetic benchmarks. Every number below comes from an actual run on an NVIDIA RTX A4000 rented on vast.ai (April 18, 2026). Total cost: $0.16.

Library self-consistency check

200 array-reduction problems — NOT a real LLM comparison

ground truth reference: 100.0%
synthetic noise floor: 22.0%
NPCoT library consult: 60.5%

This is a regression test for the library itself. For real LLM numbers see the HumanEval / MBPP runs below (pending vast.ai sweep).

Qwen3.5-4B on HumanEval: +27.5 pt full stack

164 problems, greedy decode, RTX 3090 — same weights, layered wrapper

baseline (no NPCoT): 58.5%
+ compounding retry: 67.68%
+ autoresearch (16-sample): 85.98%
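As a sanity check on these percentages, the short script below converts them back into problem counts out of 164. The counts (96, 111, 141) are inferred from the reported rates, not taken from run logs, so treat them as a consistency check only. The 30-problem difference between the last two stages matches the 30 rescued hard-fails reported for this run.

```python
TOTAL = 164  # HumanEval problem count, as stated for this run

# Reported pass@1 percentages for each stage of the stack.
reported = {"baseline": 58.5, "retry": 67.68, "autoresearch": 85.98}

# Inferred solved counts: nearest whole number of problems for each rate.
solved = {stage: round(pct / 100 * TOTAL) for stage, pct in reported.items()}

# Problems gained by the final stage over the retry stage.
rescued = solved["autoresearch"] - solved["retry"]

# Headline gain in percentage points, from the reported rates.
gain_pt = reported["autoresearch"] - reported["baseline"]

for stage, count in solved.items():
    print(f"{stage}: {count}/{TOTAL} = {100 * count / TOTAL:.2f}%")
print(f"rescued by autoresearch: {rescued}")
print(f"gain: {gain_pt:.2f} pt")
```

The gain works out to 27.48 points, which the page rounds to +27.5.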

30 hard-fails rescued by autoresearch at $0.39 GPU. Beats Qwen3.5-9B baseline (71.3%). Every solve persists into a compounding store so the next run short-circuits matching prompts at zero cost.
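The verifier-gated retry and the compounding store can be pictured roughly as follows. This is a minimal sketch, assuming exact prompt matching by SHA-256 hash and a plain dict as the store; `solve_with_store`, `generate`, and `verify` are hypothetical names, and the real matching rule and verifier are not described in the text. Only the 16-sample budget comes from the figures above.

```python
import hashlib

def solve_with_store(prompt, generate, verify, store, max_samples=16):
    """Return (solution, samples_used). A cached prompt costs zero samples.

    generate(prompt, attempt) -> candidate   (hypothetical model call)
    verify(candidate) -> bool                (hypothetical check, e.g. unit tests)
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in store:
        # Short-circuit: a previously verified solve is returned as-is.
        return store[key], 0
    for attempt in range(1, max_samples + 1):
        candidate = generate(prompt, attempt)
        if verify(candidate):
            # Only verified candidates are persisted, so the store never
            # accumulates failed attempts.
            store[key] = candidate
            return candidate, attempt
    return None, max_samples
```

Gating persistence on the verifier is the design point: the store can only grow with solutions that passed, so a later cache hit is as trustworthy as the original solve.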

Cross-platform reproducibility

Same library, same MAE

macOS MPS: 0.560
Linux CUDA: 0.560
Linux CPU: 0.560
bit-for-bit identical

Shipping artifacts

standalone binary: 475 KB
WASM runtime: 130 KB
release tarball: 224 KB
library on disk: 2.2 KB

The stack.

Differentiable training

ArrayExecutableThoughtHead learns programs by gradient descent. Coprocessor wraps any HF transformer layer behind a max_gate safety cap.

Discrete library

Cosine-similarity keyed cache of DiscreteArrayProgram 5-tuples. LRU eviction, HMAC signing, (ε,δ)-DP perturbation, fingerprint IDs.
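To make the signing and eviction ideas concrete, here is a small sketch built on Python's standard `hmac`, `hashlib`, and `collections.OrderedDict`. It assumes a JSON-serializable program tuple, a SHA-256 content hash as the fingerprint ID, and a shared secret key; the class name `SignedLRU` and the tuple contents are hypothetical, and the cosine-similarity keying is left out to keep the example short.

```python
import hashlib
import hmac
import json
from collections import OrderedDict

class SignedLRU:
    """Toy program store: fingerprint IDs, HMAC signatures, LRU eviction."""

    def __init__(self, secret: bytes, capacity: int = 2):
        self.secret = secret
        self.capacity = capacity
        self.entries = OrderedDict()  # fingerprint -> (program, signature)

    def _payload(self, program) -> bytes:
        return json.dumps(program).encode("utf-8")

    def _sign(self, program) -> str:
        return hmac.new(self.secret, self._payload(program), hashlib.sha256).hexdigest()

    def put(self, program) -> str:
        fingerprint = hashlib.sha256(self._payload(program)).hexdigest()[:16]
        self.entries[fingerprint] = (program, self._sign(program))
        self.entries.move_to_end(fingerprint)  # mark as most recently used
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return fingerprint

    def get(self, fingerprint: str):
        item = self.entries.get(fingerprint)
        if item is None:
            return None
        program, signature = item
        # Constant-time comparison against a freshly computed signature.
        if not hmac.compare_digest(signature, self._sign(program)):
            raise ValueError("signature mismatch: entry was modified")
        self.entries.move_to_end(fingerprint)
        return program
```

Verifying the signature on every read means a tampered entry fails loudly at lookup time instead of silently executing.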

Native runtime

Pure-Rust executor, Metal compute shader, 130 KB WASM. Zero Python, zero PyTorch on the inference path.

Compliance pipeline

A static verifier proves termination, division safety, and overflow bounds. The compliance report emits an aggregate safe/warn/high rating for regulated deployments.
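One way to picture such a verifier is interval analysis over a straight-line program. The sketch below is an assumption-heavy illustration: the `(op, constant)` program format, the step budget, and the mapping of findings onto safe/warn/high are all hypothetical, and the real DiscreteArrayProgram format is not shown in the text. Termination is trivial here because the toy program has no loops and a bounded length.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def verify(program, lo, hi, max_steps=64):
    """Check a loop-free program of (op, constant) steps for inputs in [lo, hi].

    Returns (level, findings) where level is "safe", "warn", or "high".
    """
    findings = []
    if len(program) > max_steps:
        findings.append(("high", "step budget exceeded"))
    for op, c in program:
        if op == "add":
            lo, hi = lo + c, hi + c
        elif op == "mul":
            lo, hi = min(lo * c, hi * c), max(lo * c, hi * c)
        elif op == "div":
            if c == 0:
                findings.append(("high", "division by zero"))
                continue
            # Floor division by a fixed constant is monotonic in the
            # dividend, so the endpoints bound the result.
            a, b = lo // c, hi // c
            lo, hi = min(a, b), max(a, b)
        else:
            findings.append(("high", f"unknown op {op!r}"))
            continue
        if lo < INT32_MIN or hi > INT32_MAX:
            findings.append(("warn", f"possible 32-bit overflow after {op}"))
    if any(level == "high" for level, _ in findings):
        return "high", findings
    return ("warn" if findings else "safe"), findings
```

For example, `verify([("add", 1), ("div", 2)], 0, 100)` tracks the interval from [0, 100] to [1, 101] to [0, 50] and reports no findings.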

Federation

merge_libraries across organizations with conflict resolution. Teacher→student distillation via least-squares projection fit on paired hiddens.
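The least-squares projection fit can be illustrated without any dependencies. Given paired teacher and student hidden states, the sketch solves the normal equations (TᵀT) W = TᵀS by Gauss-Jordan elimination, so that a teacher hidden state times W approximates the matching student state. The function names are hypothetical, and a production version would more likely call a numerical library; this is only meant to show what is being fitted.

```python
def lstsq_projection(teacher, student):
    """Return W (d_t x d_s) minimising sum ||t_i W - s_i||^2 over paired rows."""
    d_t, d_s = len(teacher[0]), len(student[0])
    # Augmented matrix [T^T T | T^T S], one row per teacher dimension.
    aug = [
        [sum(t[i] * t[j] for t in teacher) for j in range(d_t)]
        + [sum(t[i] * s[k] for t, s in zip(teacher, student)) for k in range(d_s)]
        for i in range(d_t)
    ]
    for col in range(d_t):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, d_t), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("teacher hidden states are rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d_t):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The left block is now the identity; the right block is W.
    return [row[d_t:] for row in aug]

def project(hidden, W):
    """Map one teacher hidden state into student space."""
    return [sum(h * W[i][k] for i, h in enumerate(hidden)) for k in range(len(W[0]))]
```

With the projection in hand, a library keyed on teacher hidden states can in principle be consulted from a student model by projecting keys into the student's space.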

Session lifecycle

ProgramLibrarySession handles load/save plus snapshot/diff. Every task produces an audit trail of which skills changed.
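The snapshot/diff idea behind the audit trail can be sketched with plain dictionaries. This is not the ProgramLibrarySession API, whose methods are not shown in the text; the functions `snapshot` and `diff` and the fingerprint-to-program dict layout are assumptions made for illustration.

```python
import copy

def snapshot(library):
    """Freeze the current library state (fingerprint -> program)."""
    return copy.deepcopy(library)

def diff(before, after):
    """Report which fingerprints were added, removed, or changed."""
    return {
        "added": sorted(k for k in after if k not in before),
        "removed": sorted(k for k in before if k not in after),
        "changed": sorted(k for k in after if k in before and after[k] != before[k]),
    }
```

Taking a snapshot before a task and diffing afterwards yields a compact record that can be logged per task.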

Run it on your laptop in 30 seconds.

git clone https://github.com/robertcprice/nCPU
cd nCPU
python3 -m pytest tests/self_optimizing/ -q
python3 -m demos.npcot_scale_practicality

458 tests, Apple Silicon native, real benchmarks. Takes about one minute total.