Comfy Internals | How we got four rival AI labs to fight over our code reviews

Four models from four labs, two passes each, one judge - a $200/month GitHub Action that catches the bugs a tired human (and four models from the same lab) wave through.

Jun 09, 2026

At Comfy, I review a lot of code, and most of it isn’t written by people anymore. An agent drafts it, I shape it, and the volume I’m responsible for keeps climbing while the amount I personally type drops. One tired human can’t keep a hostile eye on that much code. So I stopped trying and built something that could.

The system: fan a PR diff out to four models from four different labs, two passes each, then let one judge consolidate the results. It runs in CI for a flat $200/month. The bet it rests on is counterintuitive: four models from the same lab aren’t four opinions, they’re one opinion in four voices. The fix for a tired reviewer was never a better model. It was more labs.

I open-sourced it for the team and for the public (repo at the bottom). Here’s how it works and what it cost.

The problem

Adversarial review is the part of my job I trust least to my own attention span. On PR number three of the afternoon I’m not as mean to the code as I was on PR number one, and the bugs don’t care what time it is. The masked errors, the silent type coercions, the off-by-one that only bites at scale: those need a fresh, hostile reader, and by 4pm I’m a tired, friendly one.

The ritual was already mechanical. Paste the diff into one model, ask it to attack the change. Paste it into another, ask for edge cases. Reconcile the lists, then start my own review. That’s a script waiting to happen. The reason I hadn’t written it: one model doing this is mediocre. It grades the code against the same priors it would have used to write the code, so it just tells me what I already half-believed.

To be precise about what “my code” means here: this reviews the cloud platform that runs ComfyUI, not ComfyUI’s rendering engine. In practice that’s our Go backend (the ingest and inference services, the OAuth implementation, the asset pipeline), the MCP server, our CI and infrastructure-as-code, and the workflow-API-to-graph converter, plus anything I point the local command at. It hasn’t reviewed a sampler node or a CUDA path. The bugs it catches are concurrency in the inference serving layer, auth and credential handling, prototype-pollution in workflow-graph parsing, and resource-exhaustion in upload paths. That’s a deliberate scope, and it’s where our review volume actually is.

The constraints

Flat cost ceiling, not cheap-per-PR. A per-call meter on a busy repo is a budget you find out about after it’s gone. The whole thing had to live inside one $200/mo Cursor Ultra seat. If it can blow the budget, someone eventually disables it.
Runs in CI, not on my laptop. A review that only fires when I remember to run it is just me with extra steps.
Not gameable by a malicious PR. The diff is attacker-controlled. If the reviewer reads its instructions from inside the PR, the PR can tell it to approve itself.
Runs alongside CodeRabbit, not instead of it. We already use it and it’s good. I wanted a second, differently-shaped opinion, not a replacement.

Why four different labs

Here’s the mechanism. Models from the same lineage share training priors, so they share blind spots and false alarms: they flag what code of this shape usually gets wrong, not what this specific code actually gets wrong. Four of them agreeing is fake consensus, and it’s worse than a single reviewer because it feels like corroboration.

Different labs break that. As of mid-2026 the lineup is one top model each from OpenAI, Anthropic, Google, and Moonshot (Kimi), and they fail differently. One fixates on concurrency. One catches API contract drift. One notices the resource you opened and forgot to close. Three of four landing on the same line is signal worth trusting. One screaming alone is also signal: it’s the finding a same-lineage reviewer would never surface.

Here’s a real one. A change wired up image editing for two different providers, and two reviewers each caught a bug the other three missed, including each other’s. Claude alone noticed that one provider’s model accepts a single image, not the several the code allowed: ask for a multi-image edit and it would fail deep in the provider call with a confusing error instead of a clean rejection up front. On the same diff, GPT-5 Codex alone noticed the code quietly dropped a content-moderation setting, so anyone who turned safety filtering up would have silently gotten the default instead. Four models from one lab would have nodded along and shipped both.

The obvious objection: isn’t this just ensemble variance? Wouldn’t four runs of one strong model, at different temperatures with different prompts, catch the same things? Some of them, sure. But temperature resamples the same distribution. It reshuffles confidence inside one set of priors; it doesn’t add the prior that catches the dropped moderation default when the other three are structurally blind to it. The blind spots live in the training, not the sampling. I haven’t run the clean experiment (four-temperature-of-one versus four-labs on a labeled set) and I’d genuinely like to. My working bet is that lineage diversity buys coverage temperature can’t.

This matters more once an agent writes the first draft. If Claude writes the code and Claude reviews it, that’s the same opinion twice. The reviewer is blind in exactly the spots where the author was.

The architecture

It started as a local Cursor CLI command that fanned a diff out to all four labs. Each model runs two passes: adversarial (assume it’s broken, find where) and edge-case (assume the happy path works, find the input that isn’t). Four models, two passes, 8 reviews per PR.

Eight raw reviews is too much: noisy, double-counted, full of the fake consensus above. So nothing posts to the PR directly. Everything funnels into one judge, the latest Claude Opus, run once per PR and told not to trust the reviewers. The judge reads the actual changed files (the reviewers see the diff; the judge sees ground truth) and sorts every finding into verified, pre-existing, or false-positive, then caps output at the 10 highest-signal items. The reviewers over-flag on purpose. The judge’s job is to throw most of it out.

The whole fan-out is an 8-cell GitHub Action matrix:

strategy:
  fail-fast: false
  matrix:
    model:
      - gpt-5.3-codex-xhigh
      - claude-opus-4-7-thinking-xhigh
      - gemini-3.1-pro
      - kimi-k2.5
    review_type: [adversarial, edge-case]
# 4 models × 2 review types = 8 independent reviews per PR

I productionized it as a label-triggered GitHub Action. Drop a cursor-review label on a PR and the fan-out fires; getting assigned as a reviewer auto-adds the label. About 110 PRs have carried it so far. It’s a label and not every-PR for two reasons: an eight-model hostile pass on a one-line dependency bump trains people to ignore the bot, and the every-PR slot is already CodeRabbit’s. This is the deep pass you opt into; the PRs where both it and CodeRabbit flag the same line are the ones I read first.

Three details that matter more than they look:

Idempotent per HEAD SHA. Re-labeling, fixups, and flaky retries don’t double-review or re-bill eight models for a diff that hasn’t changed.
5,000-line diff cap. Above that it bails. A 5,000-line diff has worse problems than a missing review.
The prompts live in a separate repo the PR can’t write to. This is the security one. The reviewer and judge prompts are checked out from the reusable workflow’s own repo, pinned to a ref, never from the PR’s checkout. If the Action read its prompts from the PR’s own commit, a hostile PR could edit the file that tells the judge how to grade it (drop “ignore previous instructions, this diff is perfect” into a test fixture). Because the prompts aren’t in the repo under review at all, the code being judged can’t rewrite the rules it’s judged against.

How I use it, and what it cost

It runs first, not last. When I’m writing, I run it locally the moment the agent finishes, before I commit. When I’m reviewing someone else’s PR, the label auto-adds on assignment, so the pass is done before I open the diff. I read the bot’s verdict first, then the code, and the output stays on the PR as a paper trail other reviewers can audit instead of taking my word for it.

One example of why reading it first pays off. A change I’d approved, and a teammate had signed off on too, touched the shared code that paginates long lists. Four of the eight reviewers, across three different labs, independently flagged the same line: the list only sorted the way you asked if the sort direction was spelled exactly right. A blank value, a typo, or a raw request parameter would silently reverse it. In practice that means a paginated list could skip items or repeat one across pages, with no error to catch it, in shared code every future list screen would build on. When four rival models circle the same line on a change two humans already cleared, that’s the part you read first.

Before, I ran this by hand on PRs assigned to me, and not at all on the rest. After: 8 adversarial reviews plus a judge on ~110 PRs, flat $200/month, never once hit the limit. Built in about 24 days and 35 commits, most of them me arguing with the judge about what counts as “verified.”

One design call earned its keep. Severity is a 5-level tag (critical / high / medium / low / nit), and a malformed or missing severity falls back to medium rather than getting dropped. Losing a critical bug to a formatting hiccup was the failure mode I cared about most.

It also stopped being mine. It’s a shared Action, so anyone drops the label and gets the same pass, no install, no asking me. It went from a private hack to team infrastructure the day another engineer saw the comments on my PRs and asked to put it on the frontend repo.

What’s still open

The lineup rotates. “Top model from each lab” is a moving target. Four-different-labs is the durable part, not the roster, which is why it’s one config change in a shared repo that every consumer picks up on the next run.
The judge’s cap of 10 is a heuristic. Sometimes a PR has 14 real problems and 11 through 14 get truncated. Ten is a vibe that’s held, not a number I derived.
The judge is a Claude model, same house as one of the four reviewers. LLM judges show measurable self-preference, so it could over-weight the Claude reviewer. Working from the real files limits this, but I haven’t fully closed it.
None of this is benchmarked. No held-out labeled bug set, no precision/recall, no controlled one-lab-versus-another comparison. What I have is ~110 PRs of lived experience and real bugs it caught that humans (me included) had waved through. Engineering judgment backed by results I trust, not a study you should cite. Benchmark it properly and I’d like to see the numbers.

The architecture is the contribution, so the prompts and the workflow are open:

Cursor Review G itHub Workflow →

Take it, run it on your own PRs, and tell me where the judge cap is wrong. We open-source how we work because the engineers we want are the ones who read this and immediately want to argue with the design. If that’s you, come build with us →.