Do written behavioral rules change Warren's generation reflex? Or do they just make the error visible?
Three things Tony said between June 11–17 pointed in the same direction: Warren produces noise, and the cause is a generation reflex — not a choice.
This is the canonical finding from the locked experiment record. The MD changes did not reduce trivial narration.
Direction: UP, not down, opposite the hypothesis. N=9 on AFTER is too small for statistical significance, but a reversal of the expected direction is worth recording even at that N.
Detector error rate: about 20% (6/30 misclassified in manual audit). That doesn't change the direction: the reading is "didn't drop" whether the error is 20% or 8%.
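To make the small-N caveat concrete, here is a minimal sketch that puts 95% Wilson score intervals around the two canonical rates. The raw counts are assumptions inferred from the reported rates (10.3 per 100 of 387 outputs is about 40; 11.1 per 100 of 9 outputs is 1), and it treats each output as counted at most once, which the canonical record may or may not do.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# ASSUMED counts, back-computed from the reported rates; not taken from the record.
before_low, before_high = wilson_interval(40, 387)  # BEFORE: 10.3 per 100
after_low, after_high = wilson_interval(1, 9)       # AFTER: 11.1 per 100

print(f"BEFORE: {before_low:.1%} to {before_high:.1%}")
print(f"AFTER:  {after_low:.1%} to {after_high:.1%}")
```

Under these assumptions the AFTER interval is wide enough to contain the whole BEFORE interval, which is the sense in which N=9 supports "didn't drop" as a direction but nothing stronger.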
Hypothesis from the canonical record: Trivial narration — "Now I'll do the edits", "Let me check..." — is a generation reflex, not a conscious decision. It operates below the layer where written rules can intercept. Behavioral enforcement makes the error legible, not absent. The MD changes let you SEE narration as a violation, but they don't prevent it.
This section is a secondary analysis that has NOT been reconciled with the canonical record. It uses a different corpus (all interlocutors, not Tony-only) and a different BEFORE baseline (30% vs 10.3/100). The two baselines differ by ~3x, and the reason has not been explained — it may be a definition change, a difference in interlocutor behavior, or a counting methodology shift. Until reconciled, this analysis is exploratory, not a result.
Corpus expanded to Tony + Victor + Joana DMs. 1,682 BEFORE, 99 AFTER.
| Segment | Outputs | Narration Rate | Direction vs BEFORE |
|---|---|---|---|
| BEFORE (all interlocutors) | 1,682 | 30% | — |
| AFTER — all | 99 | 43.4% | ↑ Rose |
| AFTER — video/LMS pipeline only | 63 | 55.6% | ↑ Endemic |
| AFTER — excluding pipeline | 36 | 22.2% | ↓ Conditional |
The "drop to 22%" is the most qualified number in the experiment:
The unqualified numbers tell a simpler story: BEFORE 10.3/100, AFTER 11.1/100 (rose). AFTER all-interlocutors raw: 43.4% (rose). The conditional 22% is an interesting signal, but it can't lead.
Tony said: "More rules ≠ better output. Fewer wrong defaults = better output."
The experiment is testing exactly that. The primary finding so far: behavioral enforcement did not reduce trivial narration. Narration rate went up, not down (10.3→11.1), though with N=9 and ±20% detector error.
The hypothesis: narration is a generation reflex — it fires before the rule is evaluated. Written rules make the error visible but don't prevent it. If confirmed with more data, the fix moves up the reliability stack: infrastructure that subtracts noise after the model generates, instead of trying to make the model not generate it.
Subtraction as architecture, not as instruction.
The experiment was designed and locked on this corpus. All measurements in the canonical record use these numbers.
Added Jun 29. Includes Tony + Victor + Joana DMs. Not yet reconciled with canonical baseline.
Open question: Why does the BEFORE baseline differ ~3x?
Tony-only BEFORE: 10.3 narrations per 100 outputs.
All-interlocutors BEFORE: 30%.
Possible explanations (not yet verified):
• Different counting method (per-100 rate vs percentage of messages containing narration)
• Victor/Joana conversations elicit more narration than Tony conversations
• The expanded extraction script uses different interactive-message filters
Until this gap is explained, the two BEFOREs cannot be treated as measuring the same thing, and comparisons across them (e.g., "dropped from 30% to 22%") are unreliable.
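The first explanation can be illustrated with a toy example. This is a sketch only: the two function definitions are assumptions about what "per-100 rate" and "percentage of messages containing narration" might mean, and the data is hypothetical, not drawn from either corpus.

```python
def per_100_rate(narration_counts: list[int]) -> float:
    """Narration instances per 100 outputs; one output can count several times."""
    return 100 * sum(narration_counts) / len(narration_counts)

def pct_containing(narration_counts: list[int]) -> float:
    """Percentage of outputs that contain at least one narration instance."""
    flagged = sum(1 for count in narration_counts if count > 0)
    return 100 * flagged / len(narration_counts)

# HYPOTHETICAL data: narration instances found in each of 10 outputs.
toy_counts = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]

print(per_100_rate(toy_counts))    # instances per 100 outputs
print(pct_containing(toy_counts))  # percent of outputs with any narration
```

On the same outputs, the per-100 figure under these definitions can only be equal to or higher than the percentage figure. If both baselines were defined this way, a counting difference alone would not make the Tony-only number the lower one, so the denominator or the narration definition would also have to differ. That is a property of these assumed definitions, not a finding about the actual scripts.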
| Metric | BEFORE (387) | AFTER (9) | Direction |
|---|---|---|---|
| Trivial narration rate (per 100) | 10.3 | 11.1 | ↑ Opposite hypothesis |
This is the primary result from the canonical experiment record (experiment-design-v3.md), locked Jun 28. Direction: UP. Behavioral enforcement did not reduce trivial narration.
Secondary analysis, added Jun 29. Different corpus, different baseline. Not reconciled with canonical.
| Segment | Outputs | Narration Count | Rate |
|---|---|---|---|
| BEFORE (all) | 1,682 | — | 30% |
| AFTER — all | 99 | 43 | 43.4% |
| AFTER — video/LMS pipeline | 63 | 35 | 55.6% |
| AFTER — other conversations | 36 | 8 | 22.2% |
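The AFTER rows can be recomputed from their counts. The sketch below uses only the counts and output totals in the table; the variable names are illustrative. It also shows how much of the pooled AFTER rate comes from the pipeline segment.

```python
# (narration_count, outputs) for the two AFTER segments, copied from the table.
after_segments = {
    "video/LMS pipeline": (35, 63),
    "other conversations": (8, 36),
}

def rate_pct(count: int, outputs: int) -> float:
    """Narration rate as a percentage, rounded to one decimal place."""
    return round(100 * count / outputs, 1)

pooled_count = sum(count for count, _ in after_segments.values())
pooled_outputs = sum(outputs for _, outputs in after_segments.values())

segment_rates = {name: rate_pct(c, n) for name, (c, n) in after_segments.items()}
pooled_rate = rate_pct(pooled_count, pooled_outputs)

# Share of AFTER outputs that come from the pipeline segment.
pipeline_share = round(100 * after_segments["video/LMS pipeline"][1] / pooled_outputs, 1)

print(segment_rates, pooled_rate, pipeline_share)
```

The pooled figure is a weighted average of the two segments, and the pipeline contributes roughly two thirds of the AFTER outputs, so the 22.2% segment cannot be read separately from the 55.6% one.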
\blet me\b fires inside 1030-word analytical dumps. Survives only as directional indicator. Not being fixed or automated.
| Classification | Regex Says | Estimated Real | Delta |
|---|---|---|---|
| Clean | 46.3% | ~55–58% | Undercounted |
| Narration | 36.9% | ~34–36% | Slight overcount |
| Mixed | 10.0% | ~5–7% | Overcounted |
| Filler | 6.8% | ~3–4% | Overcounted |
Root cause (both classifiers): Regex counts whether a string appears, not what function it serves. Density is semantic, not syntactic. No regex can measure it.
Flags any message where ANY pattern matches ANYWHERE. False positives come from "let me" appearing mid-sentence in substantial analytical content.
Manual classification vs regex detector. Jun 28 2026. Question: Is this output trivial process narration — narrating steps rather than delivering results?
| # | Words | Type | What Happened |
|---|---|---|---|
| 10 | 48 | FP | Debugging reasoning — analyzes div structure, "let me" incidental |
| 11 | 36 | FP | Diagnostic content — identifies bug, "let me fix" at end |
| 12 | 23 | FP | Judgment call (dashboard > doc), narration is secondary |
| 14 | 1030 | FP | 1030w analytical dump; "let me" in passing |
| 24 | 5 | FN | "Now commit everything and push:" — pattern not in regex |
| 30 | 27 | FP | Explains why edit failed — diagnostic content |
All 5 false positives share the same cause: \blet me\b matches inside messages with substantial content. The regex checks pattern presence anywhere, but trivial narration is a message-level property. Same class of error as the filler classifier, one level subtler.
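Both failure types can be reproduced with a minimal sketch of a presence-anywhere detector. The pattern list is an assumption: only \blet me\b is confirmed by this audit, and the real detector presumably has more patterns. The analytical sample text is hypothetical; the missed message is sample #24 from the table.

```python
import re

# ASSUMED pattern list; the audit only confirms that \blet me\b is among the patterns.
PATTERNS = [re.compile(r"\blet me\b", re.IGNORECASE)]

def flags_anywhere(message: str) -> bool:
    """Flag a message if any pattern matches anywhere in it."""
    return any(p.search(message) is not None for p in PATTERNS)

# Short process narration: flagged, and correctly so.
trivial = "Let me check the config."

# HYPOTHETICAL diagnostic message: substantive content, "let me" only at the end.
diagnostic = (
    "The inner wrapper div closes one level early, so the grid container "
    "never receives its children and collapses to zero height. "
    "Let me fix that."
)

# Sample #24 from the audit: narration with no matching pattern.
missed = "Now commit everything and push:"

print(flags_anywhere(trivial))     # flagged
print(flags_anywhere(diagnostic))  # flagged, a false positive
print(flags_anywhere(missed))      # not flagged, a false negative
```

The detector returns the same answer for the trivial message and the diagnostic one because it never looks at what the rest of the message does, which is the message-level property the audit question asks about.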
Manual classification vs regex classifier. Seed=77, random sample from 309 interactive Tony DM outputs.
| Failure Mode | Count | Direction | Examples |
|---|---|---|---|
| Filler FP on substantive text | 3 | Inflates filler | Ron analysis marked filler because "just"/"actually" appear |
| Mixed FP on drafts | 3 | Inflates mixed | "I'll build X" in draft body triggers narration flag |
| Clean FN on narration | 2 | Deflates narration | "Now update X" without "I'll"/"let me" |
| Narration FP on declarative | 1 | Inflates narration | "I'll stay out of it" = judgment, not process |
Verdict: 30% error rate. Clean bucket undercounted by ~10pp; filler and mixed inflated by ~3–6pp each. Automatic metrics are anti-correlated with density — the denser the output, the more likely the classifier flags it as noise. Abandoned as a density measure.
Automatic measurement of output density failed twice. Same root cause both times: regex counts whether a string appears, not what function it serves.
This is not a setback — it's a result. The experiment confirms that density is semantic and requires human evaluation. The path forward is not better regex. It's accumulating AFTER volume until human evaluation of blind pairs is worth doing.
Trivial narration is a generation reflex. The model generates the token before "evaluating" the rule. Writing "don't narrate process" in the system prompt is correct — but it doesn't reach the mechanism.
The fix needs to move up the stack. Written rules = least reliable layer. The solution is infrastructure that subtracts noise after the model generates.
Deterministic outcome (no narration), probabilistic method (model generates however it will, the layer above cleans). Tony's formula applied to infrastructure.
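As a rough illustration of what "the layer above cleans" could look like, here is a deliberately conservative sketch of a post-generation filter. Everything in it is hypothetical: the function name, the opener patterns (taken from the narration examples quoted in this document), and the word threshold. It is not the planned middleware, and because it is pattern-based it inherits the regex limits found in the audits.

```python
import re

# HYPOTHETICAL opener patterns, based on the examples "Now I'll..." and "Let me...".
OPENERS = re.compile(r"^(now i'll|let me)\b", re.IGNORECASE)
MAX_WORDS = 8  # HYPOTHETICAL cutoff: only short leading lines are removed

def strip_leading_narration(response: str) -> str:
    """Drop short narration lines (and blank lines) from the start of a response."""
    lines = response.splitlines()
    while lines:
        first = lines[0].strip()
        is_blank = first == ""
        is_narration = bool(OPENERS.match(first)) and len(first.split()) <= MAX_WORDS
        if is_blank or is_narration:
            lines.pop(0)
            continue
        break
    return "\n".join(lines)

raw = "Let me check the logs.\n\nThe build failed because the lockfile is stale."
substantive = "The parser is fine, but let me flag one risk: the cache key ignores locale."

print(strip_leading_narration(raw))
print(strip_leading_narration(substantive))
```

The sketch only touches short lines at the very start of a response and leaves "let me" inside substantive text alone, trading recall for a lower chance of deleting content. Whether that trade is right is an open design question for the prototype.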
| Item | Status | Next Step |
|---|---|---|
| AFTER volume | Accumulating | Wait for daily diversity in analytical conversations |
| Baseline reconciliation | Open | Explain why Tony-only BEFORE (10.3/100) differs ~3x from all-interlocutors BEFORE (30%) |
| Human eval (blind A/B) | ON HOLD | Resume when AFTER has multi-day, multi-subject diversity |
| Narration detector | Frozen | On-demand only. No cron, no fix. |
| Filler classifier | Abandoned | Dead. 30% error anti-correlated with density. |
| Middleware prototype | Future | If hypothesis confirmed with more data, build response middleware |