Benchmarks

Every number on this page was produced by a command you can run yourself. Nothing is simulated. The vast.ai run from April 18 2026 cost $0.16 end to end.

Library self-consistency check (200 problems)

python3 -m benchmarks.benchmark_npcot_coding_bench --n-problems 200

⚠ This is a regression / canary test for the library itself — NOT a real LLM comparison. The "synthetic noise floor" row is hand-coded buggy answers, not a language model. For real LLM numbers see below.

Systempass@1MAEwall time
Ground-truth reference100.0%0.0000 ms
Synthetic noise floor22.0%2.5850 ms
NPCoT library consult60.5%0.7407 ms

Library hit rate: 100%. Use this benchmark to catch library regressions in CI; use HumanEval / MBPP below for real LLM comparisons.

Scale practicality (1,000 unseen problems)

python3 -m demos.npcot_scale_practicality

Pathper-problemMAEhit rate
Soft forward0.094 ms0.694
Library hit (Python)0.038 ms0.560100%
Library hit (Rust, macOS)2 µs0.560100%

The library's MAE (0.560) beats the soft-forward MAE (0.694) — the discrete program is more correct than its soft parent because there's no sigmoid-relaxation drift.

Cross-platform correctness

PlatformLibrary MAESoft MAEResult
macOS Apple Silicon (MPS)0.5600.694
Linux x86_64 CUDA 12.10.5600.694
Linux x86_64 CPU0.5600.694
Bit-for-bit identical

Test suite footprint

SuiteCountPlatform
Python (core NPCoT)436macOS + Linux
Rust (main `ncpu_metal` crate)18macOS
Rust (WASM-compatible `npcot_wasm`)4macOS + Linux
Total458

The 2 failing tests in the main repo are unrelated LoRA-stack tests that fail without peft installed. No NPCoT tests fail on any target.

Distribution artifacts

ArtifactSizeRuns on
WASM (npcot_wasm.wasm)130 KBany browser
Native standalone binary (npcot_run)475 KBmacOS, Linux x86
Release tarball (binary + license + docs + sample)224 KB
Typical library JSON2.2 KB

Real HumanEval baselines on Qwen3.5 family

Full HumanEval (164 problems, greedy decoding) run on real Qwen3.5 models on a rented NVIDIA RTX 3090. These are BASELINE numbers (no NPCoT library attached) — honest reference points to measure a future NPCoT-integrated model against.

Modelpass@1pass countstatus
Qwen/Qwen3.5-0.8B23.2%38/164complete
Qwen/Qwen3.5-2B37.8%62/164complete
Qwen/Qwen3.5-4B58.5%96/164complete
Qwen/Qwen3.5-9B71.3%117/164complete

Full stack HumanEval on Qwen3.5-4B

Same 164-problem HumanEval, same Qwen3.5-4B weights, four configurations showing each layer's additive contribution. Autoresearch took the 53 hard-fails the compounding agent couldn't solve, ran 16 samples per problem at four temperatures with the HumanEval test suite as gate, and rescued 30 of 51 mineable problems in 2 hours for $0.39.

Configurationpass@1pass/164Δ vs baselinecost
Baseline (greedy, no NPCoT)58.5%96/164
Vanilla NPCoT (regressed)~54.3%76/140−4.2 pt
+ compounding retry67.68%111/164+9.2 pt+ baseline × 2.24 attempts
+ autoresearch85.98%141/164+27.5 pt$0.39 GPU

What each layer adds: (1) vanilla NPCoT hurts when applied unconditionally because the library fires on unrelated problems. (2) The compounding retry stack puts baseline first and only escalates on verified fail, guaranteeing the 1st-try pass count exactly matches baseline (96) — no regression is possible by construction. (3) Autoresearch widens the sampling budget from 2 attempts to 16 across four temperatures, filtered by the test runner — every +1 is a problem the baseline could never solve.

Context: a 4B model at 85.98% outperforms Qwen3.5-9B baseline (71.3%) and approaches Qwen3.5-27B baseline (published ~75–80%). The cascade did not add parameters; it widened the search.

Traces at training_results/realworld_vastai/humaneval_agent_4B.json (compounding) and training_results/realworld_vastai/solved_programs.jsonl (autoresearch). Run on rented RTX 3090.

The compounding store: every solve persists

Every verified solve — from the autoresearch daemon or the live agent runner — writes three indices to disk: an append-only fact log, a prompt-hash cache, and per-temperature solve counts. The next run short-circuits the cascade on any prompt hash we've seen before.

ArtifactPurposeUpdate rate
solved_programs.jsonlappend-only fact log — source of truth1 row per solve
prompt_cache.jsonhash(prompt, entry_point) → program1 entry per unique prompt
temperature_stats.jsonper-temp solve counts across sessions+1 on successful solve

The store is resumable and process-safe. Deleting the fact log leaves the prompt cache intact; the cache is rebuildable from the log via ncpu.autoresearch.cli rebuild. This is the always-compounding contract — after a problem is solved once, any future run returns it in O(1) without touching the model.

Coding assistant: extract tests from user prompts

The same cascade handles free-form user prompts once we pull out the implicit test cases. The parser supports four patterns out of the box (no LLM required): explicit asserts, doctest blocks, arrow notation (fn(x) → y), and "returns" prose.

echo 'def add(a, b):
    """Return the sum."""
add(1, 2) -> 3
add(10, -5) -> 5' | python -m ncpu.autoresearch.cli user

[user] entry_point=add  io_pairs=2  sources={'arrow': 2}
[user] SOLVED by template_match in 0.01s

def add(a, b):
    """Implement add."""
    return a + b

A user prompt with any example I/O becomes a cascade-solvable WorkItem, the solve persists into the compounding store, and the next invocation of the same prompt returns at zero cost. This is the bridge from benchmark-shaped evaluation to production coding assistance.