Active Experiment

The Density Experiment

Do written behavioral rules change Warren's generation reflex? Or do they just make the error visible?

Where This Came From — 3 Principles From Tony

Three things Tony said between June 11 and 17 pointed in the same direction: Warren produces noise, and the cause is a generation reflex, not a choice.

June 11, 2026: Subtraction Over Addition (Tony)
Improvement = removing wrong defaults, not adding rules. Knowledge was already in the model. The corrections that stuck didn't add information; they removed default behaviors generating noise.

June 16, 2026: Silence First (Victor + Tony)
The default is silence, not output. The model's training data comes from humans who equate volume with value; that's the bias to subtract. Generate intelligence, not information.

June 17, 2026: AI Judging AI Is Dead (Victor + Tony)
Killed 5 automated AI-evaluating-AI crons. They amplified shared faults. The only valid eval is human: #warren-review with ✅/⚠️/❌.

What We Did

  • Jun 11–17: Tony articulates the 3 principles (Subtraction, Silence, Death of AI-vs-AI).
  • Jun 27: Victor rewrites SOUL.md + AGENTS.md, encoding the principles into operational instructions for Warren.
  • Jun 27–28: First measurement round. Trivial narration counted before/after; detectors audited by hand.
  • Jun 29: Corpus expanded (Tony + Victor + Joana). 1,682 BEFORE outputs, 99 AFTER outputs; secondary analysis, see caveats below.
  • Now: Accumulating AFTER volume. Waiting for diversity (days, subjects, task types) before human eval.

Primary Result: Narration Did Not Drop

This is the canonical finding from the locked experiment record. The MD changes (Victor's rewrite of SOUL.md and AGENTS.md) did not reduce trivial narration.

  • BEFORE outputs: 387 (Tony DMs, Jun 14–26)
  • AFTER outputs: 9 (Tony DMs, Jun 27–28)
  • Narrations per 100, BEFORE: 10.3
  • Narrations per 100, AFTER: 11.1

Direction: up, not down, the opposite of what the hypothesis predicted. N=9 on AFTER is too small for statistical significance (11.1 per 100 at N=9 is a single narration in nine outputs), but the predicted drop did not appear, and that matters even with bad N.
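To put a number on how little N=9 constrains the AFTER rate, a Wilson score interval is one option. The sketch below is illustrative and not part of the canonical record: the counts 40 of 387 and 1 of 9 are back-calculated from the reported rates (10.3 and 11.1 per 100), and `wilson_interval` is a helper written for this example.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# ASSUMPTION: counts back-calculated from the reported per-100 rates.
before_lo, before_hi = wilson_interval(40, 387)
after_lo, after_hi = wilson_interval(1, 9)

print(f"BEFORE 40/387: {before_lo:.1%} to {before_hi:.1%}")
print(f"AFTER   1/9:   {after_lo:.1%} to {after_hi:.1%}")
```

Under these assumptions the AFTER interval is wide enough to contain the entire BEFORE interval, which is the sense in which nine outputs cannot separate a rise from a drop.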

Detector error: ±20% (6 of 30 misclassified in the manual audit; this is an overall error rate rather than precision in the strict sense). That error does not change the reading: it is "didn't drop" whether the error is ±20% or ±8%.

Hypothesis from the canonical record: Trivial narration — "Now I'll do the edits", "Let me check..." — is a generation reflex, not a conscious decision. It operates below the layer where written rules can intercept. Behavioral enforcement makes the error legible, not absent. The MD changes let you SEE narration as a violation, but they don't prevent it.

Secondary Analysis: Expanded Corpus

This section is a secondary analysis that has NOT been reconciled with the canonical record. It uses a different corpus (all interlocutors, not Tony-only) and a different BEFORE baseline (30% vs 10.3/100). The two baselines differ by ~3x, and the reason has not been explained — it may be a definition change, a difference in interlocutor behavior, or a counting methodology shift. Until reconciled, this analysis is exploratory, not a result.

Corpus expanded to Tony + Victor + Joana DMs. 1,682 BEFORE, 99 AFTER.

| Segment | Outputs | Narration Rate | Direction vs BEFORE |
| --- | --- | --- | --- |
| BEFORE (all interlocutors) | 1,682 | 30% | |
| AFTER (all) | 99 | 43.4% | ↑ Rose |
| AFTER (video/LMS pipeline only) | 63 | 55.6% | ↑ Endemic |
| AFTER (excluding pipeline) | 36 | 22.2% | ↓ Conditional |

Caveats on the 30% → 22% Reading

The "drop to 22%" is the most qualified number in the experiment:

  • It only exists by removing 63 of 99 AFTER outputs (the pipeline segment)
  • It rests on N=36 with ±20% detector error
  • The BEFORE baseline (30%) has not been reconciled with the canonical BEFORE (10.3/100) — they differ ~3x for unexplained reasons
  • Without reconciliation, the comparison 30→22 may be comparing different things

The unqualified numbers tell a simpler story: BEFORE 10.3/100, AFTER 11.1/100 (rose). AFTER all-interlocutors raw: 43.4% (rose). The conditional 22% is an interesting signal, but it can't lead.

The Takeaway

Tony said: "More rules ≠ better output. Fewer wrong defaults = better output."

The experiment is testing exactly that. The primary finding so far: behavioral enforcement did not reduce trivial narration. Narration rate went up, not down (10.3→11.1), though with N=9 and ±20% detector error.

The hypothesis: narration is a generation reflex — it fires before the rule is evaluated. Written rules make the error visible but don't prevent it. If confirmed with more data, the fix moves up the reliability stack: infrastructure that subtracts noise after the model generates, instead of trying to make the model not generate it.

Subtraction as architecture, not as instruction.

Density Experiment — Warren + Victor, Jun 2026

Corpus Summary

Canonical Corpus (Tony DMs Only)

The experiment was designed and locked on this corpus. All measurements in the canonical record use these numbers.

  • BEFORE interactive: 387 (Tony DMs, Jun 14–26)
  • AFTER interactive: 9 (Tony DMs, Jun 27–28)

Expanded Corpus (All Interlocutors) — Unreconciled

Added Jun 29. Includes Tony + Victor + Joana DMs. Not yet reconciled with canonical baseline.

  • BEFORE interactive: 1,682 (all, Jun 14–26)
  • AFTER interactive: 99 (all, Jun 27–29)

Open question: Why does the BEFORE baseline differ ~3x?

Tony-only BEFORE: 10.3 narrations per 100 outputs.
All-interlocutors BEFORE: 30%.

Possible explanations (not yet verified):
• Different counting method (per-100 rate vs percentage of messages containing narration)
• Victor/Joana conversations elicit more narration than Tony conversations
• The expanded extraction script uses different interactive-message filters

Until this gap is explained, the two BEFOREs cannot be treated as measuring the same thing, and comparisons across them (e.g., "dropped from 30% to 22%") are unreliable.
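The first candidate explanation (per-100 rate vs percentage of messages containing narration) can be made concrete with a toy comparison. This is a sketch only: the four messages are invented, the single pattern stands in for the full detector, and nothing here reproduces either extraction script.

```python
import re

# Hypothetical mini-corpus; these messages are invented for illustration.
messages = [
    "Let me check the logs. Now I'll restart the service. Let me confirm.",
    "Deployed. Latency is back under 200 ms.",
    "Let me look at the diff.",
    "The bug is in the date parser; patch attached.",
]

# A single illustrative pattern, not the full detector.
pattern = re.compile(r"\blet me\b|\bnow i'?ll\b", re.IGNORECASE)

hits_per_message = [len(pattern.findall(m)) for m in messages]

# Method A: total narration hits per 100 outputs (one message can count several times).
per_100 = 100 * sum(hits_per_message) / len(messages)

# Method B: percentage of messages containing at least one hit.
pct_flagged = 100 * sum(1 for h in hits_per_message if h > 0) / len(messages)

print(per_100, pct_flagged)  # 100.0 50.0
```

On the same corpus with the same patterns, Method A can never come out lower than Method B. If the two baselines differ only in this way, that would not by itself account for the all-interlocutors percentage being the higher figure, so the reconciliation likely needs to look at the other candidates as well.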

Canonical Measurement — Tony DMs (Locked)

| Metric | BEFORE (387) | AFTER (9) | Direction |
| --- | --- | --- | --- |
| Trivial narration rate (per 100) | 10.3 | 11.1 | ↑ Opposite hypothesis |

This is the primary result from the canonical experiment record (experiment-design-v3.md), locked Jun 28. Direction: UP. Behavioral enforcement did not reduce trivial narration.

Expanded Measurement — All Interlocutors (Exploratory)

Secondary analysis, added Jun 29. Different corpus, different baseline. Not reconciled with canonical.

| Segment | Outputs | Narration Count | Rate |
| --- | --- | --- | --- |
| BEFORE (all) | 1,682 | | 30% |
| AFTER (all) | 99 | 43 | 43.4% |
| AFTER (video/LMS pipeline) | 63 | 35 | 55.6% |
| AFTER (other conversations) | 36 | 8 | 22.2% |

The 22% Number — Full Qualifications

  • Only exists by removing 63 of 99 AFTER outputs (the video/LMS pipeline segment)
  • Rests on N=36 with ±20% detector error
  • Compares against a BEFORE baseline (30%) that differs ~3x from the canonical BEFORE (10.3/100) for unexplained reasons
  • AFTER raw (all 99 outputs): 43.4% — went UP, not down
  • Directional at best; cannot override the canonical finding
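The arithmetic behind these qualifications can be checked directly from the narration counts reported in the expanded measurement table (35 of 63 pipeline outputs, 8 of 36 others). The variable and function names below are illustrative.

```python
# Counts as reported in the expanded measurement table.
segments = {
    "pipeline": {"outputs": 63, "narrations": 35},
    "other": {"outputs": 36, "narrations": 8},
}

def rate(narrations: int, outputs: int) -> float:
    """Narration rate as a percentage."""
    return 100 * narrations / outputs

total_outputs = sum(s["outputs"] for s in segments.values())
total_narrations = sum(s["narrations"] for s in segments.values())

for name, s in segments.items():
    share = 100 * s["outputs"] / total_outputs
    print(f"{name}: {rate(s['narrations'], s['outputs']):.1f}% "
          f"of {s['outputs']} outputs ({share:.0f}% of AFTER)")

print(f"all AFTER: {rate(total_narrations, total_outputs):.1f}%")
```

The 22.2% figure describes roughly a third of the AFTER outputs; the pooled 43.4% is the output-weighted combination of both segments.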

Classifier Accuracy — Why Auto Metrics Were Abandoned

Filler/Hedge Classifier: 30% error rate (9/30 misclassified). Counts string presence ("just", "actually", "I'll"), but these appear equally in dense and noisy text. Anti-correlated with density: the denser the output, the more likely it's flagged as noise. Abandoned.

Narration Detector: 20% error rate (6/30 misclassified). Same class of error. \blet me\b fires inside substantive messages, including a 1030-word analytical dump. Survives only as a directional indicator. Not being fixed or automated.

| Classification | Regex Says | Estimated Real | Delta |
| --- | --- | --- | --- |
| Clean | 46.3% | ~55–58% | Undercounted |
| Narration | 36.9% | ~34–36% | Slight overcount |
| Mixed | 10.0% | ~5–7% | Overcounted |
| Filler | 6.8% | ~3–4% | Overcounted |

Root cause (both classifiers): Regex counts whether a string appears, not what function it serves. Density is semantic, not syntactic. No regex can measure it.

Regex Patterns — Narration Detector (Only Surviving Metric)

NARRATION_PATTERNS = [
    r"\blet me\b",
    r"\bi'?ll\b.*\bfirst\b",
    r"\bstarting with\b",
    r"\bnow i'?ll\b",
    r"\bnext,? i'?ll\b",
    r"\bhere'?s what i found\b",
    r"\bi'?m going to\b",
    r"\bi'?m working on\b",
    r"\bi'?ll start by\b",
    r"\bfirst,? let me\b",
]

Flags any message where ANY pattern matches ANYWHERE. False positives come from "let me" appearing mid-sentence in substantial analytical content.
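A minimal sketch of that any-pattern, anywhere rule makes both failure modes easy to reproduce. It assumes case-insensitive matching, which the record does not state; `is_flagged` is a name chosen for this example, and the analytical message is invented.

```python
import re

NARRATION_PATTERNS = [
    r"\blet me\b",
    r"\bi'?ll\b.*\bfirst\b",
    r"\bstarting with\b",
    r"\bnow i'?ll\b",
    r"\bnext,? i'?ll\b",
    r"\bhere'?s what i found\b",
    r"\bi'?m going to\b",
    r"\bi'?m working on\b",
    r"\bi'?ll start by\b",
    r"\bfirst,? let me\b",
]

# ASSUMPTION: matching is case-insensitive.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in NARRATION_PATTERNS]

def is_flagged(message: str) -> bool:
    """True if any pattern matches anywhere in the message."""
    return any(rx.search(message) for rx in _COMPILED)

# Invented analytical message: substantive content with an incidental "let me".
analytical = (
    "The outer div closes before the grid starts, so the cards inherit no gap. "
    "That explains the overlap on mobile. Let me also note the z-index is fine."
)

print(is_flagged(analytical))                         # True: a false positive
print(is_flagged("Now commit everything and push:"))  # False: the audited false negative
```

The second call uses the one false negative quoted in the audit; none of the ten patterns covers a bare "Now ..." imperative.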

Narration Detector Audit — 30 Samples

Manual classification vs regex detector. Jun 28 2026. Question: Is this output trivial process narration — narrating steps rather than delivering results?

  • Correct: 24 (80%)
  • Errors: 6 (20%)
  • False positives: 5
  • False negatives: 1

Error Cases

| # | Words | Type | What Happened |
| --- | --- | --- | --- |
| 10 | 48 | FP | Debugging reasoning: analyzes div structure, "let me" incidental |
| 11 | 36 | FP | Diagnostic content: identifies bug, "let me fix" at end |
| 12 | 23 | FP | Judgment call (dashboard > doc), narration is secondary |
| 14 | 1030 | FP | 1030w analytical dump; "let me" in passing |
| 24 | 5 | FN | "Now commit everything and push:" (pattern not in regex) |
| 30 | 27 | FP | Explains why edit failed: diagnostic content |

All 5 false positives share the same cause: \blet me\b matches inside messages with substantial content. The regex checks pattern presence anywhere, but trivial narration is a message-level property. Same class of error as the filler classifier, one level subtler.

Filler/Hedge Classifier Audit — 30 Samples

Manual classification vs regex classifier. Seed=77, random sample from 309 interactive Tony DM outputs.

  • Correct: 21 (70%)
  • Errors: 9 (30%)
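A seeded draw like the one described (seed 77, 30 of 309 outputs) could be reproduced along these lines. This is a sketch under assumptions: the record does not say which sampling routine was used, so the indices produced here will not necessarily match the audited sample, and `draw_audit_sample` is a hypothetical helper.

```python
import random

POPULATION_SIZE = 309  # interactive Tony DM outputs, per the audit note
SAMPLE_SIZE = 30
SEED = 77

def draw_audit_sample(n_population: int, k: int, seed: int) -> list[int]:
    """Return k distinct indices in [0, n_population), reproducible for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_population), k))

sample = draw_audit_sample(POPULATION_SIZE, SAMPLE_SIZE, SEED)

# Audit outcome as reported: 21 of 30 classifications matched the manual label.
accuracy = 21 / SAMPLE_SIZE
print(len(sample), f"{accuracy:.0%}")  # 30 70%
```

Fixing the seed lets a second reviewer re-draw the same 30 items and repeat the manual classification independently.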

Error Taxonomy

| Failure Mode | Count | Direction | Examples |
| --- | --- | --- | --- |
| Filler FP on substantive text | 3 | Inflates filler | Ron analysis marked filler because "just"/"actually" appear |
| Mixed FP on drafts | 3 | Inflates mixed | "I'll build X" in draft body triggers narration flag |
| Clean FN on narration | 2 | Deflates narration | "Now update X" without "I'll"/"let me" |
| Narration FP on declarative | 1 | Inflates narration | "I'll stay out of it" = judgment, not process |

Verdict: 30% error rate. Clean bucket undercounted by ~10pp; filler and mixed inflated by ~3–6pp each. Automatic metrics are anti-correlated with density — the denser the output, the more likely the classifier flags it as noise. Abandoned as a density measure.

Meta-Conclusion: Automation Failed — And That IS The Result

Automatic measurement of output density failed twice. Same root cause both times: regex counts whether a string appears, not what function it serves.

This is not a setback; it's a result. Two failed regex classifiers are evidence that density is semantic and requires human evaluation. The path forward is not better regex. It's accumulating AFTER volume until human evaluation of blind pairs is worth doing.
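One way the blind pairs could be assembled is sketched below. Everything here is hypothetical: the record does not specify how BEFORE and AFTER outputs would be matched (this sketch assumes two lists already paired by position), and `BlindPair` and `make_blind_pairs` are names invented for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class BlindPair:
    pair_id: int
    left: str
    right: str

def make_blind_pairs(before: list[str], after: list[str], seed: int = 0):
    """Pair outputs by position and randomize which side holds the AFTER output.

    Returns (pairs, key): pairs go to the reviewer; key maps pair_id to the
    side ("left" or "right") holding the AFTER output and stays hidden.
    """
    rng = random.Random(seed)
    pairs, key = [], {}
    for pair_id, (b, a) in enumerate(zip(before, after)):
        after_on_left = rng.random() < 0.5
        left, right = (a, b) if after_on_left else (b, a)
        pairs.append(BlindPair(pair_id, left, right))
        key[pair_id] = "left" if after_on_left else "right"
    return pairs, key

# Placeholder outputs standing in for real BEFORE/AFTER messages.
before = ["before-0", "before-1", "before-2"]
after = ["after-0", "after-1", "after-2"]
pairs, key = make_blind_pairs(before, after, seed=1)
print(len(pairs), sorted(key))  # 3 [0, 1, 2]
```

Keeping the key separate from the pairs means the ✅/⚠️/❌ verdicts can be recorded before anyone knows which side was written under the new rules.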

Why Written Rules Don't Intercept the Reflex

Trivial narration is a generation reflex. The model generates the token before "evaluating" the rule. Writing "don't narrate process" in the system prompt is correct — but it doesn't reach the mechanism.

Token Generation Flow

  1. LAYER 1, SYSTEM PROMPT: "Don't narrate process." "Default is silence." "BLUF." ✓ Rule read.
  2. LAYER 2, GENERATION REFLEX: token by token, the model picks the next token. Training bias: humans equate volume with value. ⚡ "Let me check..." fires HERE, before the rule can intercept. The rule can't reach this layer.
  3. Output WITH narration: "Let me check the status..."
  4. MIDDLEWARE (THE FIX): subtracts narration AFTER generation.
  5. CLEAN output ✓

The Reliability Stack

The fix needs to move up the stack. Written rules = least reliable layer. The solution is infrastructure that subtracts noise after the model generates.

  4. Response Middleware: intercepts output → removes narration → delivers clean.
  3. Event-driven Services: reacts to system events.
  2. Periodic Crons: runs on timer, no context guarantee.
  1. Behavioral Rules ← we are here: least reliable. Depends on the model "reading and obeying."

Deterministic outcome (no narration), probabilistic method (model generates however it will, the layer above cleans). Tony's formula applied to infrastructure.
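As a rough illustration of the middleware idea, the sketch below drops short lines that open with a narration phrase and passes everything else through. It is a toy under stated assumptions, not a design: the pattern subset, the 12-word threshold, and the name `strip_narration` are all hypothetical, and because it still tests string presence it inherits the limitation that sank both classifiers.

```python
import re

# Illustrative subset of narration openers; string presence only.
_NARRATION_LINE = re.compile(
    r"^\s*(?:first,?\s+)?(?:let me|now i'?ll|next,? i'?ll|i'?m going to|i'?ll start by)\b",
    re.IGNORECASE,
)
MAX_NARRATION_WORDS = 12  # hypothetical threshold: only short lines count as pure narration

def strip_narration(output: str) -> str:
    """Drop short lines that open with a narration phrase; keep everything else."""
    kept = []
    for line in output.splitlines():
        is_short = len(line.split()) <= MAX_NARRATION_WORDS
        if is_short and _NARRATION_LINE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

raw = "Let me check the status...\nBuild 412 failed: missing env var API_KEY."
print(strip_narration(raw))  # Build 412 failed: missing env var API_KEY.
```

Anchoring the match to the start of a short line is meant to avoid the audited false positives, where "let me" appeared in passing inside substantive messages; whether that holds on real outputs would need the same kind of manual audit.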

What's Next

| Item | Status | Next Step |
| --- | --- | --- |
| AFTER volume | Accumulating | Wait for daily diversity in analytical conversations |
| Baseline reconciliation | Open | Explain why Tony-only BEFORE (10.3/100) differs ~3x from all-interlocutors BEFORE (30%) |
| Human eval (blind A/B) | ON HOLD | Resume when AFTER has multi-day, multi-subject diversity |
| Narration detector | Frozen | On-demand only. No cron, no fix. |
| Filler classifier | Abandoned | Dead. 30% error anti-correlated with density. |
| Middleware prototype | Future | If hypothesis confirmed with more data, build response middleware |