DARK FACTORY — First 5 Posts

§2

First 5 Posts

arc · editable · English

Arc: 1 what Dark Factory is (levels) → 2 bit-exact = the floor, not the win → 3 the oracle that only says "no" (the main storyline) → 4 what we do NOT claim (honesty as a moat) → 5 proof + provenance + climb live. Each post comes in three voices: X (thread/roar), Threads (more conversational), Reddit (substance, scope-honest). Fields are editable and saved in the browser.

POST 1 · What Dark Factory is · levels L0–L5

▸ Intro. The hero is Dark Factory and the ladder of levels (NOT the ffmpeg repro). We honestly place ourselves at L4. X/Threads carry the framing; Reddit goes to r/artificial (it tolerates framing).

X / thread

Two builds. Every test green. Neither was correct. One passed 1,207/1,207 conformance tests — then we found 22 invariants it never shipped (conformance is feature-equivalence, a proper subset of correctness; an invariant is a defensive check the suite never asked for — a feature the agent forgot to ship). The other is bit-exact against ffmpeg, which proves the LEAST of all — every conformant decoder matches by definition. That gap is the whole story. The ladder we measure against (Dan Shapiro's L0-L5 for software autopilot, extended to L6-L8): L0 you write it, AI autocompletes. L1 AI writes micro-pieces, you approve each. L2 AI works, you stay in flow. L3 AI writes routine code, you review the diffs. L4 you write the spec and audit; the factory makes the implementation choices and carries a multi-day build. <- us, today. L5 you supply nothing but a verdict — you're the validation point. L6 the review/QA/security/compliance function itself becomes policy-as-code (the supervision tax — review and approvals, industry-anecdotal at a third to half of an eng org — goes toward zero). L7 you hand it a KPI; it picks the sub-goals. L8 an operator-less entity. L0-L5 is autonomy of motion; L6-L8 is autonomy of intent. We're at L4; the climb is L4 -> L5 (motion) -> L6 (intent). Where it still needed a human: twice — which is exactly why it's L4 and not yet L5. Receipts all week. kultrun + openfga-rust. And this week ends with the factory attempting one of those rungs on camera, unedited. 🧵

Threads

We ran an agent factory for six days against two targets where passing the tests proves the least: a clean-room H.264 decoder whose only oracle says "no" and nothing else, and a 50k-line authorization-server port (openfga-rust) that passed all 1,207 conformance tests yet was missing 22 invariants. Both went green. Green didn't mean correct in either — and that gap is the whole story. Conformance is feature-equivalence, a proper subset of correctness; an invariant is a defensive check the agent forgot to port, present in the Go original and lost in translation. There are autonomy levels to "AI builds software" — Dan Shapiro's L0-L5, which I extend to L6-L8 (L0-L5 = autonomy of motion, L6-L8 = autonomy of intent). We run at L4 today: a human writes the spec and audits, the factory makes the implementation choices and carries a multi-day build from it. The climb is L4 -> L5 (you stop supplying anything but a verdict) -> L6 (the review/QA/security/compliance function itself becomes policy-as-code — the supervision tax, review and approvals at roughly a third to half of an eng org on our read, goes toward zero). This week I'll show what both builds made — and the two moments the decoder still needed a human, which is exactly why it's L4 and not yet L5. It ends with the factory attempting one of those rungs on camera, unedited.

Reddit · r/artificial

Title: Where's the L4/L5 line for agent-built software? Two builds that passed every test and still weren't correct

Two agent-built artifacts, same lesson from opposite directions, both ran the tests green and neither was correct:

1) A clean-room Rust port of OpenFGA (an authorization server), ~50,000 lines (github.com/sahajamoth/openfga-rust). It passed all 1,207 conformance tests - and a self-audit then found 22 invariants present in the Go original that we lost in translation. (A separate reverse-audit of the Go found invariants it lacks too - different scope, not a scoreboard.) Conformance is feature-equivalence, a proper subset of correctness; an invariant is a defensive check the conformance suite never asked for. Green was the floor, not the proof.

2) A clean-room H.264 decoder, bit-exact on the subset we claim (CAVLC Constrained-Baseline I+P, CABAC I-slice + intra-in-P). Same trap, harder oracle: H.264 fixes the decoded samples to the bit, so every conformant decoder matches by definition - matching is the definition of "decoder," not an achievement.

So before the discussion, a ladder I've found useful - Dan Shapiro's L0-L5 for software autopilot, extended with L6-L8. The three rungs that matter here:

- L4 you write the spec and audit; the system makes the implementation choices and carries a long-horizon, multi-day build. <- where both builds are, today.
- L5 you supply nothing but a verdict - you're the validation point.
- L6 the review/QA/security/compliance function itself becomes policy-as-code; the supervision tax goes away.

L0-L5 is autonomy of motion; L6-L8 is autonomy of intent. The climb we care about now is L4 -> L5 -> L6. Both builds I'll claim at L4, not L5 - because for the decoder two decisive unlocks came from a human (the per-bin execution-trace method that cracked a CABAC drift, and forcing serial execution). At L5 the human supplies nothing but a verdict; here the human still supplied the method, which is exactly the L4/L5 line.

Where would you draw the L4/L5 line - and given green tests demonstrably aren't correctness, what evidence would convince you a build was actually agent-driven rather than a fork?

DEEP DIVE → intro · deep dive on the ladder

Substack: «The autonomy ladder of software delivery — a complete map»

alekseidereviankin.substack.com/p/the-autonomy-ladder-of-software-delivery · publication: not set

POST 2 · Bit-exact is the floor, not the win

▸ Inversion (article A). A byte-for-byte match is the DEFINITION of a decoder; the spec forces it. The differentiator is autonomy, and that is not in the bytes.

X / thread

We built a bit-exact H.264 decoder. To anyone who's actually written a codec, that's the single most worthless thing I could tell you — and it's the headline everyone else would have led with. So we're burying it. Here's why. Same trap, the receipt you can check today: our 50k-line clean-room OpenFGA port, openfga-rust, passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped. Conformance is feature-equivalence, a proper subset of correctness; an invariant is a feature the agent forgot to ship, present in the Go original, lost in translation. Same trap, opposite reason: there the spec never asked for the invariant; here the spec fixes every bit, so matching IS the definition. Why bit-exact is worthless as a flex: H.264 locks the entire reconstruction loop to exact integer arithmetic — inverse quantization, the integer inverse transform, prediction addition, sub-pel interpolation, in-loop deblocking, the final clip. No tolerance band. Given a conforming stream, the standard determines the output to the bit. So every correct decoder on Earth produces identical samples. Matching them is the definition of the word "decoder," not a hard-won result. So the floor isn't the story. The story is whether an agent built the decoder, kultrun, from the spec — and that claim isn't in the output bytes at all. The bytes are blind to how they were made; a fork produces the same pixels. It lives in the build record, and we cap it honestly at L4, not L5 — two human unlocks were decisive. Why climb at all: review and approvals run, on our read, a third to half of an eng org — the supervision tax — and L6 (review as policy-as-code) is where it goes toward zero. Tomorrow: the bug that stalled the factory ~12 waves, and why its only oracle could ever say "no."

Threads

Here's the part that's awkward to admit: the more you know about codecs, the LESS impressed you should be that ours matches ffmpeg byte-for-byte. Given a conforming stream, H.264 fixes the decoded samples to the bit — every correct decoder produces identical pixels by definition. So "bit-exact" only proves you have *a decoder*. The output is blind to its own origin; a clean-room build and a literal fork make the same bytes. A senior codec engineer hears "byte-for-byte" and quietly stops listening. Same trap from the other side: our 50k-line OpenFGA port passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped. Conformance is feature-equivalence, a proper subset of correctness; an invariant is a defensive check the agent forgot to port — present in the Go original, lost in translation. Green is the floor in both domains, for opposite reasons. The claim we actually care about — an agent built these from the spec, capped honestly at L4 not L5 because two human unlocks were decisive — isn't in the bytes at all. It lives in the build record, to be weighed and attacked, never positively proven. We're at L4 today; the climb that matters is L4 -> L5 (still autonomy of motion) -> L6 (the crossover into autonomy of intent, where the review function itself becomes policy-as-code and the supervision tax — review and approvals, a third to half of an eng org on our read — goes toward zero). So here's what actually keeps me up: the output can't witness who made it. A fork makes the same pixels. What would convince YOU a build was agent-driven, if not the bytes?

Reddit · r/programming

Title: Bit-exact vs ffmpeg is the definition of a decoder, not a result - and "all conformance tests pass" isn't correctness either A correctness note that gets muddled in a lot of "AI wrote X" claims, including ones I'll make this week, so I want to put it plainly first. The thesis: green tests are not correctness, and I can show it in two domains that fail the same way for different reasons. Domain 1, a codec. H.264 (unlike MPEG-2's tolerance-band IDCT) locks the whole reconstruction loop to exact integer arithmetic: inverse quantization, the integer inverse transform, prediction addition, final clip. No tolerance. Given a conforming bitstream the standard determines the reconstructed samples to the bit. So every conformant decoder produces identical output - matching a reference byte-for-byte is the definition of conformance, not evidence of anything hard. (And matching our own fixtures byte-for-byte is not a conformance certification; we never ran the JVT/ITU suite or FATE.) Here green is the floor because the spec fixes every bit. Domain 2, an authorization server. We did a clean-room Rust port of OpenFGA, ~50,000 lines (github.com/sahajamoth/openfga-rust). It passed all 1,207 conformance tests - and a self-audit then found 22 defensive invariants present in the Go original that were lost in translation. (A reverse-audit of the Go found 63 invariants the Go itself never had.) Here green is also the floor, for the opposite reason: conformance is feature-equivalence, a proper subset of correctness. An invariant is a feature the agent forgot to ship - and the conformance suite never asked for it. One footnote on the codec side: "byte-identical to a file" also depends on the output harness - planar vs packed, stride/padding, crop, bit depth. The spec fixes the samples; file identity needs the dump format pinned too. Which means: if the differentiated claim is "an agent built these from the spec, mostly unsupervised," the passing tests can't witness that. 
Output is process-blind - a clean-room build and a literal fork produce the same pixels and pass the same suite. So the real engineering question moves off "did the tests pass" and onto (1) can an agent build a conformant decoder across the subset we actually claim (CAVLC Constrained-Baseline I+P and CABAC Main I-slice / intra-in-P, not arbitrary streams), and (2) how do you debug it when the only oracle says "no" and nothing else. That second one is the next post. And I want to be straight about the process ceiling too: the decoder was a supervised build, L4 not L5 - two of the decisive unlocks came from a human, not the agent crews - so I'm not claiming push-button autonomy. Given the tests can't witness origin, what would convince you a build was actually agent-driven?
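
A minimal sketch of the output-harness footnote above, assuming 8-bit planar samples and a hypothetical `dump_planar` helper (not part of any shipped harness): two buffers hold the same reconstructed samples but differ in stride and padding, so a raw byte compare fails until the dump format is pinned.

```python
# Toy illustration: same samples, different memory layout.
# All names and values here are hypothetical, chosen only for the example.

def with_stride(samples: bytes, width: int, height: int, stride: int, pad: int) -> bytes:
    """Lay out a width x height plane in rows of `stride` bytes, padding with `pad`."""
    rows = []
    for row in range(height):
        visible = samples[row * width:(row + 1) * width]
        rows.append(visible + bytes([pad]) * (stride - width))
    return b"".join(rows)

def dump_planar(plane: bytes, width: int, height: int, stride: int) -> bytes:
    """Pinned dump format: visible samples only, row by row, padding dropped."""
    return b"".join(plane[row * stride:row * stride + width] for row in range(height))

WIDTH, HEIGHT = 4, 2
samples = bytes([10, 20, 30, 40, 50, 60, 70, 80])  # the "decoded" 4x2 plane

buf_a = with_stride(samples, WIDTH, HEIGHT, stride=4, pad=0x00)  # tightly packed rows
buf_b = with_stride(samples, WIDTH, HEIGHT, stride=8, pad=0xAA)  # padded rows

raw_equal = buf_a == buf_b  # compares padding too
pinned_equal = dump_planar(buf_a, WIDTH, HEIGHT, 4) == dump_planar(buf_b, WIDTH, HEIGHT, 8)

print("raw buffers equal:", raw_equal)
print("pinned dumps equal:", pinned_equal)
```

The same idea extends to crop, bit depth, and planar vs packed: the comparison is only meaningful once both sides are normalized to one declared dump format.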

ARTICLE A → article for a narrow audience

Bit-Exact Is the Floor; Autonomy Is the Claim

corvin-sh/marketing · …/expert-articles/…-A-floor-and-autonomy.md · publication: Wk 1 · Wed ~12:00 GMT+3 (before P2)

POST 3 · The oracle that only says "no"

▸ The MAIN storyline (articles A/C). Total verdict, zero diagnostics; CABAC drift at MB 85/99; the breakthrough = building an informative oracle (per-bin trace). Plus the phantom (harness bug = deblock signature). The strongest viral + credible post. Reddit → r/programming.

X / thread

Same wall, ~12 times. Macroblock 85 of 99. codIRange=354, codIOffset=352, every single time (engine state, not pixels). And the only test we had could say one word: no. The hardest part of building a decoder isn't the decoder. It's that the oracle only ever says "no." A conformance check is a total verdict with zero diagnostics: the frame matches or it doesn't. In CABAC — H.264's arithmetic coder — one wrong bin (a single arithmetic-decode decision, e.g. a bad context selection) desyncs the engine state for the rest of the slice. Output turns to garbage from that macroblock on, and the oracle just says no. Spec-grounded guess after guess ruled out, because pass/fail can't point at the culprit. The gut-punch: for ~5 of those waves the bug didn't exist. The harness compared post-deblock output against pre-deblock — we were debugging a ghost; the <=6/channel delta was just the deblocking filter's signature, two correct outputs at different pipeline stages. What finally broke it wasn't a smarter guess, and it wasn't the factory finding its own way out. A human handed it a method it didn't devise: a per-bin execution trace from our decoder and from a patched reference, diffed bin-by-bin to the first divergent decision. That located a chain of real bugs. Then CABAC I-slice went bit-exact, and the same trace carried the crews through intra-in-P (CABAC B-slice we do NOT claim). (Trace scripts committed; the shipped decoder has zero ffmpeg dependency, but debugging used libav internals as a behavioral oracle — so we say that, not "zero reference to libav.") The openfga port proves the identical point from the opposite end — that clean-room authorization server passed all 1,207 conformance tests, then an audit found 22 invariants the agent never shipped (present in the Go original, lost in translation). Green is the floor in both, for opposite reasons: there the spec never asked for the invariant; here the spec fixes every bit, so matching IS the definition. 
Bit-exact is the floor; an invariant is a feature the agent forgot to ship. That a human had to supply the oracle is exactly why we say L4, not L5. At L5 the human supplies nothing but a verdict; here the human still supplied the method. That's the part nobody films.

Threads

Our agents hit the same wall ~12 times in a row. Same macroblock every time, 85 of 99. And the only test we had could say exactly one word: no. Here's the part that actually makes agentic codec work hard — and it isn't the codec. It's the oracle. A conformance check is a total verdict with zero diagnostics: the frame matches or it doesn't. In CABAC — H.264's arithmetic coder — one wrong bin (a single arithmetic-decode decision, e.g. a bad context selection) desyncs the engine state for the rest of the slice. Everything downstream turns to garbage and the oracle still just says no. The receipt: drift at macroblock 85 of 99, codIRange=354 / codIOffset=352 every time (engine state, not pixels; the exact registers land in the committed trace scripts when the repo ships). Spec-grounded guess after spec-grounded guess ruled out, because pass/fail can't point at the culprit. The deadlock didn't break with a smarter guess. A human stepped in and supplied the oracle the spec refuses to give you: a per-bin execution trace, ours vs a patched reference, diffed to the first divergent decision. The crews then fixed a chain of real bugs against that trace, CABAC I-slice went bit-exact — then the same trace carried the crews through intra-in-P (CABAC B-slice we do NOT claim). (The trace scripts are committed; the shipped decoder has zero ffmpeg dependency, but the debugging used libav internals as a behavioral oracle — we say so, not "zero reference to libav.") And the gut-punch: for ~5 of those waves the bug didn't exist. The harness was comparing post-deblock output against pre-deblock — we were debugging a ghost. The <=6/channel delta that looked like a decode bug was just the deblocking filter's signature: two correct outputs at different pipeline stages. A correct oracle with zero diagnostics will happily manufacture a failure. Green was always the floor here — every conformant decoder matches by definition. 
Same lesson the openfga port taught from the other side: it passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped. Green is the floor in both, for opposite reasons — there the spec never asked for the invariant; here the spec fixes every bit, so matching IS the definition. An invariant is a feature the agent forgot to ship. That human unlock is exactly why we say L4, not L5: at L5 the human supplies nothing but a verdict; here the human still supplied the method. The unglamorous truth — the hard part wasn't writing the decoder, it was building the microscope to see why it was wrong. Anyone else ever debugged a thing whose only feedback was "no"?

Reddit · r/programming

Title: The real difficulty in agentic codec work isn't the codec - it's that the only oracle you have says "no" and nothing else

This is the most interesting thing we hit building an H.264 decoder with an agent system, and it generalizes well beyond codecs. One honesty note up front: the decoder repo isn't public yet, so the trace and register details below are narrative for now - they'll land verbatim in the committed trace scripts when the repo ships. What's inspectable today is the sibling build, openfga-rust (a clean-room Rust port of an authorization server). So read this as a war story with receipts to follow, not a runnable artifact.

A conformance test is a *correct but uninformative* oracle: it gives a total verdict (the frame matches or it doesn't) and zero diagnostic information about where or why it diverged. CABAC (the arithmetic entropy coder, defined in H.264 §9.3) makes this brutal - it's serial within a slice, every bin's probability model depends on the bins before it, so a single wrong bin desynchronizes the engine state for the rest of the slice. Everything downstream is garbage and the oracle still just says "no."

Our build sat against a *stable* failure fingerprint for ~12 waves: drift at the 85th of 99 macroblocks, with the CABAC arithmetic-decoder registers reading codIRange=354 / codIOffset=352 (those are engine state, not pixel coordinates). Spec-grounded hypothesis after hypothesis got ruled out, because pass/fail can't localize.

The deadlock broke only when a human operator handed the crews a method they hadn't found themselves: stop guessing, replace the oracle. Emit a per-bin execution trace from our decoder and from a *patched reference decoder*, and diff them bin-by-bin to the first divergent decision. That localized a chain of real bugs in the CABAC I-slice context-modeling path, each fixed against the trace, not a hunch.

Then CABAC I-slice went bit-exact - and the same per-bin trace harness then carried the crews through intra-in-P (the harder path: P-slice CABAC context, but intra-coded prediction), which is the row we actually claim alongside I-slice. The B-slice CABAC path we do NOT claim bit-exact.

The honest framing of that result: green was the floor it was always going to reach once localized - every conformant decoder matches by definition. This is the same shape as the other build - openfga-rust passed all 1,207 conformance tests yet was missing 22 invariants (defensive checks present in the Go original, lost in translation, and the conformance suite never asked for them); there the floor was feature-equivalence, here it is bit-exactness, and in both the green is where correctness work starts, not where it ends.

Building the informative oracle the standard refuses to give you was the achievement. And that the crews couldn't build it unaided - a human supplied the trace methodology - is exactly the L4/L5 line. "Builds its own informative oracle unaided" is the L5 boundary, and we didn't cross it. (The second decisive human call, for the record, was forcing serial execution when parallel crews thrashed.)

Two honest footnotes, because they're the actual lesson:

- For ~5 of those waves we were debugging a *phantom*: the harness compared a post-deblock output against a pre-deblock output. The ≤6/channel delta that looked like a decode bug was the deblocking filter's signature - two correct outputs at different pipeline stages. A correct oracle with zero diagnostics will happily manufacture a failure.
- Using a patched reference decoder as a *behavioral oracle* (diffing traces) is not copying its code - the shipped binary has zero ffmpeg dependency and the trace scripts will ship committed. But it does mean "clean-room with zero reference to libav" would be too strong, so we say it plainly.

The durable takeaway: localization is a *second* engineering effort layered on top of building the thing - you have to build the informative oracle the standard refuses to give you, and here a human is what got us over that wall. Curious how others have attacked diagnostic-free oracles (differential tracing, bisection on state, etc).
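
For readers who want the shape of the bin-by-bin diff described above, here is a minimal sketch. It is not the committed trace tooling: the one-record-per-line text format, the field names, and the toy values are all assumptions made up for illustration.

```python
# Hypothetical per-bin trace diff: find the first decision where two decoders disagree.
# Assumed line format (invented for this sketch): mb ctx_idx bin cod_i_range cod_i_offset
from itertools import zip_longest
from typing import NamedTuple, Optional

class BinRecord(NamedTuple):
    mb: int            # macroblock index
    ctx_idx: int       # context model selected for this bin
    bin_val: int       # decoded bin value (0 or 1)
    cod_i_range: int   # arithmetic-decoder range after the bin
    cod_i_offset: int  # arithmetic-decoder offset after the bin

def parse_trace(text: str) -> list:
    """Parse whitespace-separated integer records, skipping blanks and '#' comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        records.append(BinRecord(*(int(field) for field in line.split())))
    return records

def first_divergence(ours: list, reference: list) -> Optional[tuple]:
    """Return (bin_number, our_record, reference_record) at the first mismatch, else None.

    A trace that ends early also counts as a divergence (the missing side is None).
    """
    for i, (a, b) in enumerate(zip_longest(ours, reference)):
        if a != b:
            return i, a, b
    return None

# Toy traces with invented values; the third bin picks a different context.
OURS = """
# mb ctx_idx bin cod_i_range cod_i_offset
0 5 0 400 120
0 6 1 380 96
1 5 1 300 40
"""
REFERENCE = """
0 5 0 400 120
0 6 1 380 96
1 7 1 300 40
"""

hit = first_divergence(parse_trace(OURS), parse_trace(REFERENCE))
print(hit)
```

The point of the design is that the output is a location (which bin, which context), which is exactly the information a pass/fail frame compare withholds.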

ARTICLE C → article for a narrow audience

The Hardest Parts: CABAC, DPB, Interlace, Deblock

…-C-hardest-parts.md · publication: Wk 1 · Thu ~12:00 GMT+3 (before P3)

POST 4 · What we do NOT claim

▸ Honesty as a moat (articles C/B/D). A decoder, not an encoder; portable Rust, no asm → no checkasm; not run against JVT/FATE; no "first". The not-claimed rows make the claimed rows credible. Reddit → r/programming or a codec sub.

X / thread

The most persuasive thing about this launch is the list of things we refuse to claim. Start with the one that should scare you: it is NOT L5. Two of the moves that cracked the hardest bug were human, not the factory. Here's the whole not-claimed list — it's the only reason to believe the rest. 🧵 — Not fully autonomous, not L5. Two decisive interventions were human: the per-bin execution-trace method that cracked the CABAC drift, and forcing serial execution. Supervised L4, not push-button. — A decoder, not an encoder. Decode is normative and deterministic; encode is the real frontier (non-normative, no byte oracle). We didn't ship it. — No "first." Independent from-spec decoders exist (JM, openh264). And the rest in one breath: portable Rust so no checkasm, our own x264 fixtures not the JVT suite so not a certification, clean-room = copyright not patents. Green != correct, the other artifact: openfga-rust, a clean-room authorization server, passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped — a feature the agent forgot to ship, present in the Go original, lost in translation. Green is the floor in both — opposite reasons: there the spec never asked for the invariant; here the spec fixes every bit, so matching IS the definition. What we DO claim splits in two: the bytes match (process-blind — a fork makes the same pixels) and an agent built them (a process claim, argued from the build record, survives-attack not proven, capped at L4). Why the rung matters: review/QA/approvals run a third to half of an eng org on our read (industry-anecdotal — no hard denominator) — the supervision tax. We're on L4 today; the climb is L4 -> L5 (motion: you supply only a verdict) -> L6 (intent: review itself becomes policy-as-code), and L6 is where that tax goes toward zero. Next: the factory attempts L4 -> L5 LIVE — 7 days on camera, test- and cost-counters running, failures included, nothing cut. Plus the provenance bundle, built to be attacked.

Threads

The most load-bearing thing I can tell you about this launch is what it is NOT. It's L4, not L5. Agent crews built it under light steering, but two human unlocks were decisive — the per-bin execution-trace method that cracked the CABAC drift, and forcing serial execution when the crews thrashed. Not push-button. We cap at L4 on purpose: the rung that pays off is L6 — the leap from autonomy of motion into autonomy of intent, where review becomes policy-as-code and the supervision tax goes toward zero — and it only counts if we're honest about standing on L4. We built a decoder, not an encoder (encode is the non-normative frontier, no byte oracle). Portable Rust, zero assembly — so no checkasm. Bit-exact on our own x264 fixtures, not the official JVT suite or FATE — so not a conformance certification. And no "first": independent from-spec decoders already exist (JM, openh264). Same fence on the other artifact, and it's the cleanest version of our whole thesis: openfga-rust passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped. An invariant is a feature the agent forgot to ship — present in the Go original, lost in translation. Green is the floor in both, for opposite reasons: there the spec never asked for the invariant; for the decoder the spec fixes every bit, so matching IS the definition. And even the claimed bit-exact rows — CAVLC and CABAC (I-slice Main + intra-in-P) — are the spec-mandated floor: every conformant decoder matches by definition, so they attest a decoder exists, not who or what built it. Which is exactly why the L4 process claim, not the bytes, is the thing we're actually defending. Tomorrow: the provenance bundle, published to be attacked — and the public L4->L5 attempt, 7 days, unedited, failures included.

Reddit · r/programming

Title: Conformance pass != correctness: every conformant H.264 decoder is bit-exact by definition, and an OpenFGA port passed 1,207 tests with 22 invariants still missing

A correctness boundary that gets muddled whenever someone says "the tests pass, so it's done." I can show it failing the same way in two very different domains, for opposite reasons. Two artifacts make the point; treat the names as case studies, not a pitch.

Domain 1, a codec. H.264 (unlike MPEG-2's tolerance-band IDCT) locks the whole reconstruction loop to exact integer arithmetic: inverse quantization, the integer inverse transform, prediction addition, in-loop deblocking, final clip. No tolerance. Given a conforming bitstream the standard determines the reconstructed samples to the bit. So every conformant decoder produces identical output - matching a reference byte-for-byte is the *definition* of conformance, not evidence of anything hard. Here green is the floor because the spec fixes every bit.

Domain 2, an authorization server. A clean-room Rust port of OpenFGA, ~50,000 lines (github.com/sahajamoth/openfga-rust). It passed all 1,207 conformance tests - and a self-audit then found 22 defensive invariants present in the Go original that were lost in translation. Here green is also the floor, for the opposite reason: conformance is feature-equivalence, a proper subset of correctness. An invariant is a defensive check the agent forgot to port - present in the Go original, lost in translation - and the conformance suite never asked for it. (For honesty about state: the 22 are tracked in-tree with fixes in flight, not a standing pile of known-broken auth.)

Same thesis, two failure modes: where the spec pins every bit, matching is trivially mandated; where the spec only fixes observable behavior, everything it doesn't observe (defensive invariants) can quietly go missing while every test stays green.

The scoping that goes with each, because conformance claims are only as good as their declared subset:

- The decoder is bit-exact on CAVLC Constrained-Baseline I+P (skip, integer/sub-pel, partitions, deblock) and CABAC I-slice (Main) + intra-in-P - on self-generated x264 fixtures under a pinned raw-YUV harness, NOT the JVT/ITU-T suite, NOT FATE. Passing those fixtures is necessary, not sufficient; it is not a conformance certification. CABAC B-slice and DPB/POC/MMCO are partial and not claimed.
- It's a decoder, not an encoder (encode is non-normative - motion estimation, rate-distortion - with no byte oracle). Portable Rust, no asm/SIMD, so no checkasm differential exists by construction. Clean-room covers copyright, not H.264 patents (MPEG-LA / Via LA). No "first" claim - JM and openh264 are independent from-spec decoders.

One thing the bytes and tests cannot witness either way: how the code was produced. Output is process-blind - a clean-room build and a literal fork produce identical pixels and pass the same suite - so any "who built this" claim has to be argued from a build record and provenance bundle, never read off a diff. And I want to be straight about the process ceiling: this was a supervised build, L4 not L5 (a human writes the spec and audits; the factory carries the multi-day build). Two of the decisive unlocks came from a human, not the agent crews - the per-bin execution-trace method that cracked a CABAC drift the factory couldn't localize, and forcing serial execution when parallel crews thrashed - which is exactly the L4/L5 line, so this is not push-button autonomy.

The question I'm actually curious about: how do you scope-bound a conformance claim so it stays honest? Where the spec fixes every bit, "passes" means almost nothing; where it doesn't, "passes" hides the invariants. What's your rule for stating what a green suite does and doesn't prove?

One motivation note, since it's the actual point. Where the spec fixes every bit, the residual cost of building isn't the code - it's the supervision: review, QA, and approvals run, on our read, something like a third to half of an eng org. The rung past L5 (L6) is where review itself becomes policy-as-code and that cost goes toward zero - that's why the L4/L5 line matters and not just as a correctness puzzle. Next week we attempt the L4->L5 step in public, unedited, with test- and cost-counters on screen and failures visible. A stream isn't proof - you could stage one off camera - it's just the loudest place to fail if we cheat.
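
To make the Domain 2 failure mode concrete, here is a toy sketch of how a defensive invariant can go missing while every conformance case stays green. Everything in it is hypothetical: the tuple model, the `check` function, the `MAX_DEPTH` limit, and the test data are invented for illustration and are not taken from OpenFGA, openfga-rust, or the 22 audited invariants.

```python
# Toy relationship check. guard=False models a port that dropped a defensive depth limit.
MAX_DEPTH = 25  # hypothetical resolution-depth limit

class DepthExceeded(Exception):
    """Raised when resolution goes deeper than MAX_DEPTH (fail closed)."""

def check(tuples, user, relation, obj, depth=0, guard=True):
    """Return True if `user` has `relation` on `obj`, directly or via a userset."""
    if guard and depth > MAX_DEPTH:
        raise DepthExceeded(f"resolution deeper than {MAX_DEPTH}")
    for subject, rel, target in tuples:
        if rel != relation or target != obj:
            continue
        if subject == user:
            return True
        if "#" in subject:  # userset such as "group:eng#member"
            sub_obj, sub_rel = subject.split("#", 1)
            if check(tuples, user, sub_rel, sub_obj, depth + 1, guard):
                return True
    return False

# "Conformance suite": only acyclic data, so it never exercises the guard.
ACYCLIC = [
    ("user:anne", "member", "group:eng"),
    ("group:eng#member", "viewer", "doc:1"),
]
CASES = [
    ("user:anne", "viewer", "doc:1", True),
    ("user:bob", "viewer", "doc:1", False),
]
suite_green = {
    g: all(check(ACYCLIC, u, r, o, guard=g) == want for u, r, o, want in CASES)
    for g in (True, False)
}

# Data the suite never asked about: two groups that include each other.
CYCLIC = [
    ("group:a#member", "member", "group:b"),
    ("group:b#member", "member", "group:a"),
]

def outcome(guard):
    try:
        return check(CYCLIC, "user:anne", "member", "group:a", guard=guard)
    except DepthExceeded:
        return "DepthExceeded"
    except RecursionError:
        return "RecursionError"

print("suite green:", suite_green)
print("guarded:", outcome(True), "| unguarded:", outcome(False))
```

Both variants pass the same suite; only input outside the suite tells them apart, which is why an audit, not the green run, is what surfaces this class of gap.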

ARTICLE B → article for a narrow audience

Decode Is Deterministic; Encode Is the Frontier

…-B-decode-deterministic-encode-frontier.md · publication: Wk 1 · Fri ~12:00 GMT+3 (before P4)

POST 5 · Proof + the line we won't cross + climb live

▸ Proof drop (articles D/E). CAVLC+CABAC bit-exact at a pinned commit, diff it yourself. "Result, not how" = the disclosure boundary (output is process-blind) → the provenance bundle makes autonomy checkable. Bridge to the 7-day live stream (L4→L5). Reddit → r/programming + HN.

X / thread

Next week an AI agent writes ffmpeg from scratch, live on Twitch — 24/7. Test-counter and cost-counter on screen. Every failure visible. Nothing cut. It can fail in front of everyone — that's the entire point. First, the receipts that earn the right to try it: kultrun — a from-scratch H.264 decoder, bit-exact on CAVLC Constrained-Baseline I+P and CABAC I-slice (Main) + intra-in-P at a pinned commit. The bytes match — but bytes can't tell you who wrote them. Every conformant decoder matches by definition; that's the floor. openfga-rust — a 50k-line clean-room authorization server that passed all 1,207 conformance tests; an audit then found 22 invariants it never shipped. Not broken auth — conformance is feature-equivalence, a proper subset of correctness, and an invariant is a feature the agent forgot to ship: present in the Go original, lost in translation. Tracked in-tree, fixes in flight. github.com/sahajamoth/openfga-rust Two domains, one lesson: green is where correctness starts, never where it ends. So we don't say "diff it yourself." We published the build record FOR attack: no-GPL dependency manifest, clause-cited table origins (two honest nuances), the committed patched-reference trace scripts, a similarity scan vs JM / openh264 / libavcodec. That bundle defends non-fork only — survives-attack, not proven; clean-room covers copyright, not H.264 patents. Autonomy is in no artifact; we argue it from the record and cap it honestly at L4, not L5: two human unlocks were decisive (the per-bin trace method that cracked the CABAC drift, and forcing serial execution). Why bother: review/QA/approvals run a third to half of an eng org's payroll on our read (industry-anecdotal — no hard denominator) — the supervision tax. L5 is the top of autonomy of motion (you supply only a verdict); L6 is its removal — review itself as policy-as-code — and that's the org restructuring the whole climb is for, not just a cost cut. 
The horizon past it (L7-L8) is an operator-less build org — my bet, not a forecast: one L6 leader per region by ~2027. So the stream is the L4 -> L5 attempt specifically: when it hits the next diagnostic-free wall, does the factory build its OWN microscope this time — the per-bin trace a human had to hand it before — or stall at the same MB-85 drift like kultrun did? A stream isn't proof; you could stage one off camera. It just raises the cost of faking, because the counters and the failures are public in real time. openfga-rust is public now — come attack it. The decoder's bundle ships with its repo, and the climb is where you watch it live.

Threads

Proof's up: an agent-built H.264 decoder, bit-exact on CAVLC Constrained-Baseline I+P and CABAC I-slice (Main) + intra-in-P at a pinned commit. And a diff against ffmpeg proves almost nothing - here's the honest order of what it does and doesn't prove. The diff only proves one thing: it's a real decoder. H.264 fixes the decoded samples to the bit, so matching byte-for-byte is the floor every conformant decoder meets by definition - not the win. Same floor from the other side: openfga-rust, our 50k-line clean-room port, passed all 1,207 conformance tests, then an audit found 22 invariants it never shipped. An invariant is a defensive check the agent forgot to port - present in the Go original, lost in translation. (openfga-rust is public now; kultrun's provenance bundle ships with its repo.) Two domains, one lesson: green is where correctness starts, never where it ends. So we don't say "diff it yourself." Two separate claims live above that floor, on different rails: 1. Non-fork. We didn't copy a reference. That can't be positively proven, only attacked - so we publish a provenance bundle built for attack (no-GPL dependency manifest, clause-cited table origins with the two honest nuances, the disclosed debugging-oracle trace scripts, a similarity scan vs JM/openh264/libavcodec) and invite you to break it. Survives-attack, not proven. 2. Agent-built. This one is in NO artifact at all - not the bytes, not even the provenance bundle. It's a process claim you weigh from the build record. And we cap it honestly: L4, not L5. Two decisive unlocks were human - the per-bin trace method that cracked the CABAC drift, and forcing serial execution. And here's why any of this matters past a cool demo: on our read, review/QA/approvals run something like a third to half of an eng org's payroll - the supervision tax. 
The next rungs are L5 (you supply only a verdict - the top of autonomy of motion), then L6 (review itself as policy-as-code - the first rung of autonomy of intent), where that tax goes toward zero. That's the thing I'm actually building toward. Next: we ATTEMPT the L4->L5 step in public - 7 days, unedited, counters live. Open question, and it's the literal L4->L5 line: does the factory build its OWN informative oracle this time (the per-bin trace a human had to hand it before), or stall at the same MB-85 drift? It may fail on camera. We won't cut that.

Reddit · r/programming + Hacker News

Title: A clean-room Rust port of OpenFGA passed all 1,207 conformance tests - then an audit found 22 invariants it never ported

Posting the artifact that's public and inspectable right now. The point it makes cleanest: passing tests is not the same as being correct, and I can show it in one repo you can pull today.

--- openfga-rust ---

Repo: github.com/sahajamoth/openfga-rust. A clean-room Rust port of OpenFGA (an authorization server), ~50,000 lines. It passed all 1,207 conformance tests. A self-audit then found 22 invariants: defensive checks present in the Go original, lost in translation. (A separate reverse-audit of the Go found invariants it lacks too - two different audits, different scopes, not comparable, not a scoreboard.)

Conformance is feature-equivalence, a proper subset of correctness. An invariant is a feature the agent forgot to ship, and the conformance suite never asked for it, so all 1,207 stayed green while it was absent.

Status, because an authorization server with "missing" invariants deserves a straight answer: the 22 are tracked in the repo and being added; the audit is in-tree. So this isn't a standing pile of known-broken auth - it's a snapshot of what green did and didn't witness, with the fixes in flight. 1,207/1,207 was the floor, not the proof. Green is where the correctness work started, not where it ended.

--- sibling build, repo not yet public ---

There's a second build: a from-scratch H.264 decoder, bit-exact on CAVLC Constrained-Baseline I+P and CABAC I-slice (Main) + intra-in-P; CABAC B-slice and DPB/POC/MMCO are partial and NOT claimed. Even unshipped, it makes the identical point from the opposite direction: H.264 fixes the decoded samples to the bit, so every conformant decoder matches by definition - bit-exact is the spec-mandated floor, not the achievement. openfga's floor is feature-equivalence; the decoder's floor is bit-exactness; in both, green is where correctness work starts, not where it ends.

That repo isn't up yet, so I won't ask you to attack a bundle you can't download. The full provenance bundle (a no-GPL dependency manifest, a clause-cited table-origin manifest, the committed patched-reference trace scripts, a similarity scan vs JM/openh264/libavcodec at named SHAs) ships *with* the repo. Full stop on that one until it lands.

--- three claims, kept apart on purpose ---

1) FLOOR (decisively checkable, and for openfga checkable today): conformant behaviour over the named scope. For the decoder this is bit-exact ONLY on the rows above, on about a dozen self-generated x264 fixtures under a pinned raw-YUV harness, NOT the JVT/ITU suite, NOT FATE - necessary, not sufficient, not a certification. The bytes/tests prove exactly this and nothing more: every correct decoder matches by definition over that scope, and 1,207/1,207 is feature-equivalence. The floor, not the achievement.

2) NON-FORK (survives-attack only, never proven): carried by the provenance bundle, not a diff. For openfga-rust the repo and audit are inspectable now; for the decoder the bundle ships with the repo. One honesty note that travels with it: the decoder's debugging used a patched libavcodec as a behavioral oracle (trace scripts committed), so "zero reference to libav" would be too strong - we say so. Non-derivation can't be positively proven from artifacts; the posture is survives-attack. Clean-room covers copyright, not H.264 patents (MPEG-LA/Via LA).

3) PROCESS (agent-built): lives in no artifact - not the bytes, not the bundle. Output is process-blind, so "result, not how" is a disclosure boundary, not evidence about process. The autonomy case is argued from the build record and capped at L4, not L5: two unlocks were decisively human - the per-bin execution-trace method that cracked a CABAC drift the factory could not localize, and forcing serial execution when parallel crews thrashed.

A motivation note, tagged as vision not measured result, since it's the actual point of the climb. Where the spec fixes every bit, the residual cost of building isn't the code, it's the supervision: review, QA, and approvals run, on our read, something like a third to half of an eng org (industry-anecdotal, I have no hard denominator for this). L5 is the top of autonomy of motion (you supply only a verdict); L6 is the first rung of autonomy of intent, where review itself becomes policy-as-code and that cost goes toward zero - and the horizon past it (L7-L8) is an operator-less build org.

Next week we attempt the L4->L5 step in public: 7 days, unedited, with test- and cost-counters on screen and failures visible. A stream isn't proof; you could stage one off camera. It's just the loudest place to fail if we cheat.

If you can name an invariant still missing from openfga-rust, or break a specific claim against the public repo, link it and I'll correct the post. The decoder attack I'll explicitly defer to publication - asking you to break a bundle you can't see would be theatre, not rigor.

What's your rule for stating what a green suite does and doesn't prove?

ARTICLE E → article for a narrow audience

The Proof, and the Withheld How (+ D — Falsifiable Definition + the Provenance Bar)

…-E-proof-and-withheld-how.md (+ …-D-falsifiable-provenance.md) · publication: Week 2 · Tue ~10:00 GMT+3 (before the P5 / HN drop)
