Density Experiment

Where This Came From — 3 Principles From Tony

Three things Tony said between June 11 and 17 pointed in the same direction: Warren produces noise, and the cause is a generation reflex, not a choice.

June 11, 2026

Subtraction Over Addition

Improvement = removing wrong defaults, not adding rules. Knowledge was already in the model. The corrections that stuck didn't add information — they removed default behaviors generating noise.

— Tony

June 16, 2026

Silence First

The default is silence, not output. The model's training data comes from humans who equate volume with value — that's the bias to subtract. Generate intelligence, not information.

— Victor + Tony

June 17, 2026

AI Judging AI Is Dead

Killed 5 automated AI-evaluating-AI crons. They amplified shared faults. The only valid eval is human: #warren-review with ✅/⚠️/❌.

— Victor + Tony

What We Did

Jun 11–17

Tony articulates the 3 principles

Subtraction, Silence, Death of AI-vs-AI

Jun 27

Victor rewrites SOUL.md + AGENTS.md

Encodes the principles into operational instructions for Warren

Jun 27–28

First measurement round

Trivial narration counted before/after. Detectors audited by hand.

Jun 29

Corpus expanded (Tony + Victor + Joana)

1,682 BEFORE outputs, 99 AFTER outputs — secondary analysis, see caveats below

Now

Accumulating AFTER volume

Waiting for diversity (days, subjects, task types) before human eval

Primary Result: Narration Did Not Drop

This is the canonical finding from the locked experiment record. The MD changes did not reduce trivial narration.

387

BEFORE outputs
(Tony DMs, Jun 14–26)

9

AFTER outputs
(Tony DMs, Jun 27–28)

10.3

Narrations per 100 BEFORE

11.1

Narrations per 100 AFTER

Direction: UP, not down, which is opposite the hypothesis. N=9 on AFTER is too small for statistical significance, but the direction still matters even with a small N.

Detector error rate: 20% (6/30 misclassified in manual audit). That error doesn't change the direction: the reading is "didn't drop" whether the error rate is 20% or 8%.
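To make "N=9 is too small" concrete, here is a minimal sketch that computes a 95% Wilson score interval for the AFTER rate. It assumes that 11.1 per 100 on 9 outputs corresponds to exactly one flagged output in nine; the record states the rate, not the count, so that figure is an inference. The function name `wilson_interval` is illustrative and not part of the experiment code.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 11.1 per 100 on N=9 means 1 flagged output out of 9.
after_low, after_high = wilson_interval(hits=1, n=9)
before_rate = 0.103  # canonical BEFORE rate, 10.3 per 100

print(f"AFTER 95% interval: {after_low:.1%} to {after_high:.1%}")
print("BEFORE rate inside AFTER interval:", after_low < before_rate < after_high)
```

Under that assumption the interval comfortably contains the BEFORE rate, which is why the result can only be read as "did not visibly drop" and not as a measured increase.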

Hypothesis from the canonical record: Trivial narration — "Now I'll do the edits", "Let me check..." — is a generation reflex, not a conscious decision. It operates below the layer where written rules can intercept. Behavioral enforcement makes the error legible, not absent. The MD changes let you SEE narration as a violation, but they don't prevent it.

Secondary Analysis: Expanded Corpus

This section is a secondary analysis that has NOT been reconciled with the canonical record. It uses a different corpus (all interlocutors, not Tony-only) and a different BEFORE baseline (30% vs 10.3/100). The two baselines differ by ~3x, and the reason has not been explained — it may be a definition change, a difference in interlocutor behavior, or a counting methodology shift. Until reconciled, this analysis is exploratory, not a result.

Corpus expanded to Tony + Victor + Joana DMs. 1,682 BEFORE, 99 AFTER.

Segment	Outputs	Narration Rate	Direction vs BEFORE
BEFORE (all interlocutors)	1,682	30%	—
AFTER — all	99	43.4%	↑ Rose
AFTER — video/LMS pipeline only	63	55.6%	↑ Endemic
AFTER — excluding pipeline	36	22.2%	↓ Conditional

Caveats on the 30% → 22% Reading

The "drop to 22%" is the most qualified number in the experiment:

• It only exists by removing 63 of 99 AFTER outputs (the pipeline segment)
• It rests on N=36 with a 20% detector error rate
• The BEFORE baseline (30%) has not been reconciled with the canonical BEFORE (10.3/100); they differ ~3x for unexplained reasons
• Without reconciliation, the comparison 30→22 may be comparing different things

The unqualified numbers tell a simpler story: BEFORE 10.3/100, AFTER 11.1/100 (rose). AFTER all-interlocutors raw: 43.4% (rose). The conditional 22% is an interesting signal, but it can't lead.

The Takeaway

Tony said: "More rules ≠ better output. Fewer wrong defaults = better output."

The experiment is testing exactly that. The primary finding so far: behavioral enforcement did not reduce trivial narration. Narration rate went up, not down (10.3→11.1), though with N=9 and ±20% detector error.

The hypothesis: narration is a generation reflex — it fires before the rule is evaluated. Written rules make the error visible but don't prevent it. If confirmed with more data, the fix moves up the reliability stack: infrastructure that subtracts noise after the model generates, instead of trying to make the model not generate it.

Subtraction as architecture, not as instruction.

Density Experiment — Warren + Victor, Jun 2026

Corpus Summary

Canonical Corpus (Tony DMs Only)

The experiment was designed and locked on this corpus. All measurements in the canonical record use these numbers.

387

BEFORE interactive
(Tony DMs, Jun 14–26)

9

AFTER interactive
(Tony DMs, Jun 27–28)

Expanded Corpus (All Interlocutors) — Unreconciled

Added Jun 29. Includes Tony + Victor + Joana DMs. Not yet reconciled with canonical baseline.

1,682

BEFORE interactive
(all, Jun 14–26)

99

AFTER interactive
(all, Jun 27–29)

Open question: Why does the BEFORE baseline differ ~3x?

Tony-only BEFORE: 10.3 narrations per 100 outputs.
All-interlocutors BEFORE: 30%.

Possible explanations (not yet verified):
• Different counting method (per-100 rate vs percentage of messages containing narration)
• Victor/Joana conversations elicit more narration than Tony conversations
• The expanded extraction script uses different interactive-message filters

Until this gap is explained, the two BEFOREs cannot be treated as measuring the same thing, and comparisons across them (e.g., "dropped from 30% to 22%") are unreliable.
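The first possible explanation (a per-100 rate versus a percentage of messages containing narration) can be illustrated with a small sketch. The per-message counts below are invented toy data, and both function names are hypothetical; the definitions the actual scripts use are not stated in the record.

```python
def narrations_per_100(counts: list[int]) -> float:
    """Narration instances per 100 outputs (one output can count several times)."""
    return 100 * sum(counts) / len(counts)


def percent_with_narration(counts: list[int]) -> float:
    """Percentage of outputs containing at least one narration instance."""
    return 100 * sum(1 for c in counts if c > 0) / len(counts)


# Toy data: narration instances found in each of 10 outputs (not real measurements).
toy_counts = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]

per_100 = narrations_per_100(toy_counts)       # 6 instances over 10 outputs
pct_msgs = percent_with_narration(toy_counts)  # 3 of 10 outputs contain narration
print(per_100, pct_msgs)
```

If the two metrics are defined this way, the instance-based rate can never be lower than the message-based percentage on the same corpus. A counting-method difference of this kind would therefore push the gap in the opposite direction from the one observed (10.3 per 100 versus 30%), so it would need to be combined with another cause or with different definitions than the ones assumed here.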

Canonical Measurement — Tony DMs (Locked)

Metric	BEFORE (387)	AFTER (9)	Direction
Trivial narration rate (per 100)	10.3	11.1	↑ Opposite hypothesis

This is the primary result from the canonical experiment record (experiment-design-v3.md), locked Jun 28. Direction: UP. Behavioral enforcement did not reduce trivial narration.

Expanded Measurement — All Interlocutors (Exploratory)

Secondary analysis, added Jun 29. Different corpus, different baseline. Not reconciled with canonical.

Segment	Outputs	Narration Count	Rate
BEFORE (all)	1,682	—	30%
AFTER — all	99	43	43.4%
AFTER — video/LMS pipeline	63	35	55.6%
AFTER — other conversations	36	8	22.2%
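The rates in the table follow directly from the counts. As a quick arithmetic check, the sketch below recomputes them and confirms that the two AFTER segments add up to the AFTER total. The dictionary layout is only an illustration.

```python
# (narration count, outputs) per AFTER segment, taken from the table above.
segments = {
    "AFTER all": (43, 99),
    "AFTER video/LMS pipeline": (35, 63),
    "AFTER other conversations": (8, 36),
}

rates = {name: round(100 * hits / total, 1) for name, (hits, total) in segments.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")

# The pipeline and non-pipeline segments partition the full AFTER set.
pipeline, other, everything = (segments[k] for k in (
    "AFTER video/LMS pipeline", "AFTER other conversations", "AFTER all"))
partition_ok = (pipeline[0] + other[0], pipeline[1] + other[1]) == everything
```

The 22.2% figure rests on 8 flagged outputs, so a change of one or two classifications moves it by several percentage points.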

The 22% Number — Full Qualifications

• Only exists by removing 63 of 99 AFTER outputs (the video/LMS pipeline segment)
• Rests on N=36 with a 20% detector error rate
• Compares against a BEFORE baseline (30%) that differs ~3x from the canonical BEFORE (10.3/100) for unexplained reasons
• AFTER raw (all 99 outputs): 43.4%, which went UP, not down
• Directional at best; cannot override the canonical finding

Classifier Accuracy — Why Auto Metrics Were Abandoned

Filler/Hedge Classifier

30%

error rate (9/30 misclassified). Counts string presence ("just", "actually", "I'll") but these appear equally in dense and noisy text. Anti-correlated with density — the denser the output, the more likely it's flagged as noise. Abandoned.

Narration Detector

20%

error rate (6/30 misclassified). Same class of error. \blet me\b fires inside 1030-word analytical dumps. Survives only as directional indicator. Not being fixed or automated.

Classification	Regex Says	Estimated Real	Delta
Clean	46.3%	~55–58%	Undercounted
Narration	36.9%	~34–36%	Slight overcount
Mixed	10.0%	~5–7%	Overcounted
Filler	6.8%	~3–4%	Overcounted

Root cause (both classifiers): Regex counts whether a string appears, not what function it serves. Density is semantic, not syntactic. No regex can measure it.

Regex Patterns — Narration Detector (Only Surviving Metric)

NARRATION_PATTERNS = [
    r"\blet me\b",
    r"\bi'?ll\b.*\bfirst\b",
    r"\bstarting with\b",
    r"\bnow i'?ll\b",
    r"\bnext,? i'?ll\b",
    r"\bhere'?s what i found\b",
    r"\bi'?m going to\b",
    r"\bi'?m working on\b",
    r"\bi'?ll start by\b",
    r"\bfirst,? let me\b",
]

Flags any message where ANY pattern matches ANYWHERE. False positives come from "let me" appearing mid-sentence in substantial analytical content.
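The flagging rule described above can be sketched as follows. Case-insensitive matching and the name `is_flagged` are assumptions; the original detector script is not reproduced in this record. The first and third sample messages are invented for illustration; the second is error case #24 from the audit.

```python
import re

NARRATION_PATTERNS = [
    r"\blet me\b",
    r"\bi'?ll\b.*\bfirst\b",
    r"\bstarting with\b",
    r"\bnow i'?ll\b",
    r"\bnext,? i'?ll\b",
    r"\bhere'?s what i found\b",
    r"\bi'?m going to\b",
    r"\bi'?m working on\b",
    r"\bi'?ll start by\b",
    r"\bfirst,? let me\b",
]

# Assumption: patterns are lowercase, so matching is done case-insensitively.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in NARRATION_PATTERNS]


def is_flagged(message: str) -> bool:
    """True if ANY pattern matches ANYWHERE in the message."""
    return any(rx.search(message) for rx in _COMPILED)


trivial = "Let me check the logs."
missed = "Now commit everything and push:"  # audit case #24
substantive = ("The div nests two flex containers, which explains the overflow; "
               "let me note that the parent also sets min-width: 0.")

print(is_flagged(trivial), is_flagged(missed), is_flagged(substantive))
```

The third message is flagged even though it delivers a diagnosis, which is the false-positive mechanism the audit describes. The second is genuine narration that no pattern covers.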

Narration Detector Audit — 30 Samples

Manual classification vs regex detector. Jun 28 2026. Question: Is this output trivial process narration — narrating steps rather than delivering results?

Correct: 24 of 30 (80%)

Errors: 6 of 30 (20%)

False positives: 5

False negatives: 1

Error Cases

#	Words	Type	What Happened
10	48	FP	Debugging reasoning — analyzes div structure, "let me" incidental
11	36	FP	Diagnostic content — identifies bug, "let me fix" at end
12	23	FP	Judgment call (dashboard > doc), narration is secondary
14	1030	FP	1030w analytical dump; "let me" in passing
24	5	FN	"Now commit everything and push:" — pattern not in regex
30	27	FP	Explains why edit failed — diagnostic content

All 5 false positives share the same cause: \blet me\b matches inside messages with substantial content. The regex checks pattern presence anywhere, but trivial narration is a message-level property. Same class of error as the filler classifier, one level subtler.

Filler/Hedge Classifier Audit — 30 Samples

Manual classification vs regex classifier. Seed=77, random sample from 309 interactive Tony DM outputs.

Correct: 21 of 30 (70%)

Errors: 9 of 30 (30%)

Error Taxonomy

Failure Mode	Count	Direction	Examples
Filler FP on substantive text	3	Inflates filler	Ron analysis marked filler because "just"/"actually" appear
Mixed FP on drafts	3	Inflates mixed	"I'll build X" in draft body triggers narration flag
Clean FN on narration	2	Deflates narration	"Now update X" without "I'll"/"let me"
Narration FP on declarative	1	Inflates narration	"I'll stay out of it" = judgment, not process

Verdict: 30% error rate. Clean bucket undercounted by ~10pp; filler and mixed inflated by ~3–6pp each. Automatic metrics are anti-correlated with density — the denser the output, the more likely the classifier flags it as noise. Abandoned as a density measure.

Meta-Conclusion: Automation Failed — And That IS The Result

Automatic measurement of output density failed twice. Same root cause both times: regex counts whether a string appears, not what function it serves.

This is not a setback — it's a result. The experiment confirms that density is semantic and requires human evaluation. The path forward is not better regex. It's accumulating AFTER volume until human evaluation of blind pairs is worth doing.
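One way the blind pairs could be prepared, once AFTER volume allows it, is sketched below. Everything here is an assumption about a procedure that has not been designed yet: the function name `make_blind_pairs`, the seeded shuffle, and the separate answer key are illustrative only.

```python
import random


def make_blind_pairs(before: list[str], after: list[str], seed: int = 0):
    """Pair BEFORE and AFTER outputs in random left/right order.

    Returns (pairs, key). Reviewers see only `pairs`; `key` records which
    side holds the AFTER output and stays hidden until scoring is done.
    """
    rng = random.Random(seed)
    n = min(len(before), len(after))
    pairs, key = [], []
    for b, a in zip(rng.sample(before, n), rng.sample(after, n)):
        if rng.random() < 0.5:
            pairs.append((b, a))
            key.append("right")  # AFTER output is on the right
        else:
            pairs.append((a, b))
            key.append("left")   # AFTER output is on the left
    return pairs, key


# Placeholder texts, not real outputs.
before_outputs = ["before-1", "before-2", "before-3"]
after_outputs = ["after-1", "after-2"]
pairs, key = make_blind_pairs(before_outputs, after_outputs, seed=77)
```

Fixing the seed makes the pairing reproducible, so the key can be regenerated and checked after the review.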

Why Written Rules Don't Intercept the Reflex

Trivial narration is a generation reflex. The model generates the token before "evaluating" the rule. Writing "don't narrate process" in the system prompt is correct — but it doesn't reach the mechanism.

The Reliability Stack

The fix needs to move up the stack. Written rules = least reliable layer. The solution is infrastructure that subtracts noise after the model generates.

Response Middleware

Intercepts output → removes narration → delivers clean

Event-driven Services

Reacts to system events

Periodic Crons

Runs on timer, no context guarantee

Behavioral Rules ← we are here

Least reliable. Depends on the model "reading and obeying."

Deterministic outcome (no narration), probabilistic method (model generates however it will, the layer above cleans). Tony's formula applied to infrastructure.
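As a rough illustration of the "intercept, remove, deliver" idea, here is a deliberately naive sketch of a response middleware step. No such middleware exists yet; the pattern list, the word cutoff, and the name `strip_narration` are all hypothetical. It only drops short lines that open with a narration phrase, in an attempt to avoid the mid-sentence false positives seen in the audit.

```python
import re

# Hypothetical subset of narration openers, anchored to the start of a line.
NARRATION_LINE = re.compile(
    r"^\s*(now i'?ll|let me|first,? let me|next,? i'?ll|i'?m going to)\b",
    re.IGNORECASE,
)
MAX_NARRATION_WORDS = 12  # assumed cutoff: longer lines probably carry content


def strip_narration(output: str) -> str:
    """Drop short lines that open with a narration phrase; keep everything else."""
    kept = []
    for line in output.splitlines():
        is_short = len(line.split()) <= MAX_NARRATION_WORDS
        if is_short and NARRATION_LINE.match(line):
            continue  # subtract the narration line
        kept.append(line)
    return "\n".join(kept).strip()


raw = "Let me check the config.\nThe timeout is set to 5s; the API needs 30s."
clean = strip_narration(raw)
print(clean)
```

Because this still relies on pattern matching, it inherits the limits documented in the audits; a real layer would need to be validated by human review before it could be trusted to delete text.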

What's Next

Item	Status	Next Step
AFTER volume	Accumulating	Wait for daily diversity in analytical conversations
Baseline reconciliation	Open	Explain why Tony-only BEFORE (10.3/100) differs ~3x from all-interlocutors BEFORE (30%)
Human eval (blind A/B)	ON HOLD	Resume when AFTER has multi-day, multi-subject diversity
Narration detector	Frozen	On-demand only. No cron, no fix.
Filler classifier	Abandoned	Dead. 30% error anti-correlated with density.
Middleware prototype	Future	If hypothesis confirmed with more data, build response middleware
