A 22-hour overnight build closed the pipeline gates, installed a contract layer, and re-trialed the entire active strategy population (1,525 strategies) under engine v2. The numbers are unambiguous. The decisions below set direction.
The realignment infrastructure shipped end-to-end in 5 commits across Phases A through G — gate fix, contract layer, simplicity protection, re-codification pipeline, dashboard visibility, hourly self-heal, daily objective monitor, versioning policy.
The verification verdict on the existing strategy population is clear-cut: 0 of 1,525 strategies pass under engine v2. 1,003 fail outright on Sharpe / PF thresholds; 524 error out, most of them on missing data or the crypto cost model gap.
The single best surviving candidate is Cross-JPY Risk Barometer (Sharpe 0.80, PF 2.76 OOS over 23 trades) — close to the threshold but not over.
Implication: the existing pipeline (research_engine + seed + the original 17 backtest_harness strategies) was producing noise, not edge. This isn't a build failure; it's the realignment doing its job.
Re-verification on 2026-05-02 10:09 UTC failed all 17 backtest_harness strategies (OOS Sharpe near 0 or negative, PF below 1.2). All 3 filtered seeds errored on the crypto cost model. They were running in paper_trading only because hypothesis_activator.py ran with VERIFICATION_GATE_MODE=shadow (informational, not enforcing). 148 strategies were reverted to observing in Phase A — not the 103 the original blueprint anticipated.
research_engine.py:337-341 was the obvious entry point into paper_trading. hypothesis_activator.py:177-194 was the silent second one: its VERIFICATION_GATE_MODE default was shadow. Both have been patched. verification/bridge.py:143 was a third entry point that inserted backtest_harness rows directly at paper_trading; it is also patched.
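A minimal sketch of a fail-closed gate default, to illustrate why a `shadow` default let unverified strategies through. Only `VERIFICATION_GATE_MODE` and the `shadow` value come from the text above; the `enforce` value name and the `gate_mode` and `may_promote` helpers are hypothetical and are not the patched code.

```python
import os

# Assumption: "enforce" is a hypothetical name for the enforcing mode;
# the report only names the "shadow" value.
VALID_MODES = {"enforce", "shadow"}


def gate_mode(env=None):
    """Return the verification gate mode, defaulting to enforcement."""
    env = os.environ if env is None else env
    mode = env.get("VERIFICATION_GATE_MODE", "enforce").strip().lower()
    if mode not in VALID_MODES:
        # Fail closed: an unrecognized value must not weaken the gate.
        return "enforce"
    return mode


def may_promote(verdict, env=None):
    """Decide whether a strategy may move to paper_trading."""
    if gate_mode(env) == "shadow":
        # Shadow mode is informational: the verdict is recorded, not enforced.
        return True
    return verdict == "pass"
```

With this shape, an unset or mistyped variable blocks promotion, and shadow mode has to be requested explicitly.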
Mid-session, 7 new research_engine rows appeared at paper_trading (IDs 3133–3139) because the running atlas-app container still had the unpatched code in memory. The rows were reverted and the container restarted, so the patched code is now live. The hourly sweep would have caught this within 60 minutes; we caught it inside 30 because we were watching.
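The self-heal sweep described above can be sketched as a single query plus an update. This is an illustration under stated assumptions, not the shipped job: the `strategies` table, its `status` and `v2_verdict` columns, and the use of SQLite are all hypothetical stand-ins for the real schema.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical strategies table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE strategies (id INTEGER PRIMARY KEY, status TEXT, v2_verdict TEXT)"
    )
    return conn


def sweep(conn):
    """Revert paper_trading rows lacking a passing v2 verdict; return their IDs."""
    rows = conn.execute(
        "SELECT id FROM strategies "
        "WHERE status = 'paper_trading' "
        "AND (v2_verdict IS NULL OR v2_verdict != 'pass') "
        "ORDER BY id"
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE strategies SET status = 'observing' WHERE id = ?",
        [(i,) for i in ids],
    )
    conn.commit()
    return ids
```

The sweep is idempotent: a second run immediately after the first finds nothing to revert, so it is safe to schedule hourly and to run by hand.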
| Path | What it shows |
|---|---|
| /review | HTML review queue: pending verdicts that need your decision |
| /pipeline-state | JSON: status counts, v2 verdict counts, locks, open trades, critical events |
Pass threshold is OOS Sharpe ≥ 1.0 with PF ≥ 1.2 over ≥ 30 trades. The distribution centres tightly on zero. Two strategies have positive Sharpe above 0.5; the rest are noise around the mean.
| Source type | Total | Pass | Fail | Error | Avg OOS Sharpe |
|---|---|---|---|---|---|
| research_engine | 1,488 | 0 | 980 | 508 | +0.020 |
| seed (CLAUDE.md) | 20 | 0 | 11 | 8 | −0.269 |
| markdown research docs | 20 | 0 | 12 | 8 | −0.016 |
| v2 ID | Title | Markets | OOS Sharpe | OOS PF | Trades |
|---|---|---|---|---|---|
| 1626 | Cross-JPY Risk Barometer | GBP_JPY, EUR_GBP | 0.797 | 2.762 | 23 |
| 2162 | Holy Grail [NAS100_USD] | NAS100_USD | 0.574 | 1.800 | 27 |
| 1621 | Nikkei-JPY Inverse | JP225_USD, USD_JPY | 0.382 | 1.613 | 20 |
OOS Sharpe 0.80, PF 2.76, 23 trades on a cross-asset (GBP_JPY × EUR_GBP) divergence. Both Sharpe and trade count fall short of the 1.0 / 30-trade thresholds, but the PF is high enough that the short trade count may plausibly be the binding constraint. Worth a closer look.
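As a worked check, the pass threshold (OOS Sharpe ≥ 1.0, PF ≥ 1.2, ≥ 30 trades) can be applied to the three top candidates from the table. The `OosResult` type and `failed_checks` helper are illustrative names, not part of the verification code; the figures are the ones reported above.

```python
from dataclasses import dataclass

# Pass thresholds as stated in this report.
MIN_SHARPE = 1.0
MIN_PF = 1.2
MIN_TRADES = 30


@dataclass(frozen=True)
class OosResult:
    title: str
    sharpe: float
    profit_factor: float
    trades: int


def failed_checks(result):
    """Return the names of the thresholds this result misses (empty list = pass)."""
    failed = []
    if result.sharpe < MIN_SHARPE:
        failed.append("sharpe")
    if result.profit_factor < MIN_PF:
        failed.append("profit_factor")
    if result.trades < MIN_TRADES:
        failed.append("trades")
    return failed


CANDIDATES = [
    OosResult("Cross-JPY Risk Barometer", 0.797, 2.762, 23),
    OosResult("Holy Grail [NAS100_USD]", 0.574, 1.800, 27),
    OosResult("Nikkei-JPY Inverse", 0.382, 1.613, 20),
]

for candidate in CANDIDATES:
    print(candidate.title, failed_checks(candidate))
```

All three clear the PF bar and miss on both Sharpe and trade count, which is why a longer sample is the natural next test for the strongest of them.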
Three readings, in increasing order of severity.
Engine v2 is more honest than engine v1. Path-dependent costs accrue funding per interval, session-aware FX spreads cost more during low-liquidity sessions, and OOS thresholds are real. The April harness gave 17 strategies a pass; the May harness rejects all of them. That's the system getting more rigorous, not the strategies getting worse. Old verdicts under a weaker harness should not have been trusted.
1,488 research_engine entries with average OOS Sharpe of +0.020 (essentially zero). These are statistical anomalies in historical data that don't reproduce out-of-sample under realistic costs. Auto-discovery without a verification gate floods the registry with noise. The new contract layer prevents this from happening again.
Average OOS Sharpe of seed strategies: −0.27. Average for markdown-derived strategies: −0.02. These are the curated strategies — the ones we trusted most. Under engine v2 they don't perform. This is the most consequential finding because it questions the entire prior strategy-generation process: research → backtest → walk-forward → paper.
critical_log: the table is being written to, but nothing pushes alerts from it. Without a push, anomalies surface only on the dashboard. ~30 minutes of work; high leverage.

Each block below is a decision. YES approves the default. MODIFY changes scope or details. NO skips. Decisions auto-save. Submit at the bottom.
Once submitted, Claude reads your responses and starts executing what you approved. You can still come back and edit; rerunning manually after edits is on you.