ComfyUI Newsletter: Engineering

I built a native Comfy Cloud mobile app on nothing but the public API

Matt Miller — Mon, 22 Jun 2026 16:42:19 GMT

This is a Comfy Vibe — vibe-coded, lightly tested apps we ship to show, not just tell, what you can build on Comfy. They're not hardened products, and we're not building them to gather data or prove a point — just things we wanted to exist, shipped fast and thrown into the world, rough edges and all. The whole point isn't any one app — it's the invitation. Pick an idea, vibe-code it on Comfy, and put it in the world. Here's one of ours.

Open Comfy Go on your phone. Type a prompt. Pick a model from the list. Tap generate; the job queues, runs, and an image — give it a moment, a video — comes back into the same gallery, on the phone. No laptop open somewhere. No browser tab. No node graph to wire up. The thing on your phone is a native app, and the picture came back from the same Comfy Cloud that runs the web product.

That's the whole pitch, and I want you to feel it before I explain anything: generating on Comfy Cloud now fits in your pocket — camera roll in, Photos out, the model picker a list you tap instead of a string you paste.

Comfy Go is a native SwiftUI app — Comfy Cloud's mobile edition. It's not the full node-graph editor; it's four curated generation flows — the ones you'd reach for from a phone. It shipped to TestFlight on June 12. There are two reasons I'm writing this up, and they're the same idea pointed in two directions: how accessible this is to use, and how accessible the API underneath is to build on. Both turn out to be true, and the second is why the first exists at all.

What's at your fingertips

Four generation flows — the full image/video by text/image matrix:

Text to image — type a prompt, pick from any of the 18 models, get a picture.
Image to image — bring a photo from your camera roll, restyle or edit it against the same catalog.
Text to video — a prompt becomes a clip, no keyframes to hand-place.
Image to video — hand it a still from your Photos and it starts moving.

Across those flows you pick from 18 models, the same catalog the web product draws on. You sign in with Sign in with Comfy — one tap, your own account, and you're in. Results save straight to your Photos, and everything you've made lives in an in-app gallery you can scroll back through. It's phone-native end to end: the camera roll is the input, the camera roll is the output, and the model menu is a list you tap, not a string you copy.

That's the user-facing half. Here's the half that makes it interesting to anyone who builds things.

The spine: it's all the public API

Comfy Go does not have a private back door into Comfy Cloud. It rides the public Comfy Cloud API — the same one anyone can call. There is no internal endpoint, no special-cased mobile shortcut, no privileged handshake. Everything the app does, you can do.

And "everything the app does" is smaller than it looks, because the client SDK underneath — a thin Swift layer called ComfySwiftSDK — comes down to two methods and one event you care about:

The whole contract: submit a workflow, stream its events, get your outputs.

Submit a workflow.
Stream the job's events as it runs (queued, running, progress, done).
The stream hands you the outputs when it completes — the finished images or clip arrive as the last event, not a separate call you have to make.

Submit, stream, and the stream gives you the result. That's the contract. The SwiftUI layer on top never learned what an HTTP status code is; it knows "this job is queued," "this one is 40% through," "this one produced two images." Every screen in the app — all four flows, the gallery, the live progress bars — is built out of that one small surface applied in different shapes.
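To make that shape concrete, here is a minimal Python sketch of a submit-and-stream consumer. It is not ComfySwiftSDK's API: `FakeCloud`, `submit`, `stream`, and the event dictionaries are hypothetical stand-ins, and the job is simulated in memory rather than sent to Comfy Cloud.

```python
from typing import Iterator


class FakeCloud:
    """Hypothetical in-memory stand-in for a submit/stream client."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def submit(self, workflow: dict) -> str:
        """Accept a workflow and return a job id (made up here)."""
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = workflow
        return job_id

    def stream(self, job_id: str) -> Iterator[dict]:
        """Yield the job's events in order; the last one carries the outputs."""
        yield {"type": "queued"}
        yield {"type": "running"}
        for pct in (40, 80):
            yield {"type": "progress", "percent": pct}
        yield {"type": "done", "outputs": ["image-1.png", "image-2.png"]}


def run(client: FakeCloud, workflow: dict) -> tuple[list[str], list[str]]:
    """Submit, stream, and return (status lines for the UI, outputs)."""
    job_id = client.submit(workflow)
    status: list[str] = []
    outputs: list[str] = []
    for event in client.stream(job_id):
        kind = event["type"]
        if kind == "progress":
            status.append(f"{event['percent']}% through")
        elif kind == "done":
            outputs = event["outputs"]  # results arrive on the final event
            status.append(f"produced {len(outputs)} images")
        else:
            status.append(kind)
    return status, outputs


status, outputs = run(FakeCloud(), {"prompt": "a lighthouse at dusk"})
```

The UI layer in this sketch only ever sees status strings and output names, which is the separation the paragraph above describes.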

In Swift, that's the whole client — two calls and a switch:

The entire client surface — sign in, submit, stream, outputs.

Grab the copy-pasteable version from the SDK's Quick Start.

Add it to your Package.swift:

.package(url: "https://github.com/Comfy-Org/ComfySwiftSDK.git", from: "0.1.0")

It's 0.1.0, Apache-2.0, and dependency-free — Foundation and CryptoKit only, iOS 17+ / macOS 14+.

Here's the part that makes "accessible to build on" a fact and not a slogan: one person built this whole app, solo, in roughly a week. The repo opened April 6 with the BMAD plan; the first Swift scaffold landed the next day, and I shared it with the team within the week. The work was driven agentically through Claude Code on that plan — 8 epics, 57 stories — with one human steering. The output was around 17,200 lines of app plus SDK code: a native iOS app, four working pipelines, a model picker, a gallery, real sign-in.

The reason that's reachable, and not heroic, is the size of the surface you have to learn. If the API were forty endpoints with subtle ordering rules, a solo week-long native app would be a tall story. It's submit, stream, done. So the credible claim is just arithmetic: if a solo developer can ship a real native app on this API in about a week, your idea — the bot, the plugin, the side project — is reachable on the same surface.

The hard part of building on Comfy Cloud isn't the API. It's deciding what to make.

Try it / what's open

The beta. Comfy Go is an open TestFlight beta. It's a real app you can sign in to and generate from. If you want in, that's the whole step: join the beta on TestFlight.

The API. This is the part that needs no waiting list. The Comfy Cloud API that Comfy Go is built on is public — submit, monitor, retrieve. The full OpenAPI spec is the source of truth if the docs ever lag. The "submit, stream, and the stream gives you the result" shape is exactly as small in practice as it is on the slide. If a native iOS app fits in that surface, your thing probably does too. Go build.

The repo. The SDK is open source — ComfySwiftSDK on GitHub. That submit-and-stream boundary is the part most worth handing to other people, so it's what we opened up first: if you want to build on Comfy Cloud from Swift, start there. The app is following. Star it, file issues, send PRs — it's pre-1.0, so feedback on the API surface genuinely steers where it goes.

The numbers

Generation surface — 18 models across 4 flows
SDK surface — submit, stream the events, and the stream hands back the outputs
Built by — one person, solo
Timeline — ~1 week of build (first Swift scaffold Apr 7; shared with the team within the week)
App + SDK code — ~17,200 lines
How it was built — agentically, via Claude Code + BMAD — 8 epics, 57 stories
The API — public Comfy Cloud API — the same one the web client uses

Comfy Internals | How we got four rival AI labs to fight over our code reviews

Matt Miller — Tue, 09 Jun 2026 21:09:45 GMT

At Comfy, I review a lot of code, and most of it isn’t written by people anymore. An agent drafts it, I shape it, and the volume I’m responsible for keeps climbing while the amount I personally type drops. One tired human can’t keep a hostile eye on that much code. So I stopped trying and built something that could.

The system: fan a PR diff out to four models from four different labs, two passes each, then let one judge consolidate the results. It runs in CI for a flat $200/month. The bet it rests on is counterintuitive: four models from the same lab aren’t four opinions, they’re one opinion in four voices. The fix for a tired reviewer was never a better model. It was more labs.

I open-sourced it for the team and for the public (repo at the bottom). Here’s how it works and what it cost.

The problem

Adversarial review is the part of my job I trust least to my own attention span. On PR number three of the afternoon I’m not as mean to the code as I was on PR number one, and the bugs don’t care what time it is. The masked errors, the silent type coercions, the off-by-one that only bites at scale: those need a fresh, hostile reader, and by 4pm I’m a tired, friendly one.

The ritual was already mechanical. Paste the diff into one model, ask it to attack the change. Paste it into another, ask for edge cases. Reconcile the lists, then start my own review. That’s a script waiting to happen. The reason I hadn’t written it: one model doing this is mediocre. It grades the code against the same priors it would have used to write the code, so it just tells me what I already half-believed.

To be precise about what “my code” means here: this reviews the cloud platform that runs ComfyUI, not ComfyUI’s rendering engine. In practice that’s our Go backend (the ingest and inference services, the OAuth implementation, the asset pipeline), the MCP server, our CI and infrastructure-as-code, and the workflow-API-to-graph converter, plus anything I point the local command at. It hasn’t reviewed a sampler node or a CUDA path. The bugs it catches are concurrency in the inference serving layer, auth and credential handling, prototype-pollution in workflow-graph parsing, and resource-exhaustion in upload paths. That’s a deliberate scope, and it’s where our review volume actually is.

The constraints

Flat cost ceiling, not cheap-per-PR. A per-call meter on a busy repo is a budget you find out about after it’s gone. The whole thing had to live inside one $200/mo Cursor Ultra seat. If it can blow the budget, someone eventually disables it.
Runs in CI, not on my laptop. A review that only fires when I remember to run it is just me with extra steps.
Not gameable by a malicious PR. The diff is attacker-controlled. If the reviewer reads its instructions from inside the PR, the PR can tell it to approve itself.
Runs alongside CodeRabbit, not instead of it. We already use it and it’s good. I wanted a second, differently-shaped opinion, not a replacement.

Why four different labs

Here’s the mechanism. Models from the same lineage share training priors, so they share blind spots and false alarms: they flag what code of this shape usually gets wrong, not what this specific code actually gets wrong. Four of them agreeing is fake consensus, and it’s worse than a single reviewer because it feels like corroboration.

Different labs break that. As of mid-2026 the lineup is one top model each from OpenAI, Anthropic, Google, and Moonshot (Kimi), and they fail differently. One fixates on concurrency. One catches API contract drift. One notices the resource you opened and forgot to close. Three of four landing on the same line is signal worth trusting. One screaming alone is also signal: it’s the finding a same-lineage reviewer would never surface.
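One way to read "three of four" and "one screaming alone" is to count distinct labs per flagged location rather than raw findings. The sketch below assumes each finding can be reduced to a (lab, file, line) triple; the data and the `labs_per_location` helper are hypothetical and not taken from the actual workflow.

```python
from collections import defaultdict

# Hypothetical findings: (lab, file, line). Values are illustrative only.
findings = [
    ("openai", "upload.go", 88),
    ("anthropic", "upload.go", 88),
    ("google", "upload.go", 88),
    ("anthropic", "upload.go", 88),  # second pass from the same lab
    ("moonshot", "auth.go", 12),
]


def labs_per_location(items):
    """Map each (file, line) to the set of distinct labs that flagged it."""
    grouped = defaultdict(set)
    for lab, path, line in items:
        grouped[(path, line)].add(lab)
    return dict(grouped)


by_location = labs_per_location(findings)
# Agreement across labs, not across passes, is what counts as consensus.
consensus = {loc for loc, labs in by_location.items() if len(labs) >= 3}
# A single lab flagging a line is kept as its own signal, not discarded.
lone_voices = {loc for loc, labs in by_location.items() if len(labs) == 1}
```

Using a set per location means two passes from the same lab count once, so same-lineage agreement cannot inflate the tally.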

Here’s a real one. A change wired up image editing for two different providers, and two reviewers each caught a bug the other three missed, including each other’s. Claude alone noticed that one provider’s model accepts a single image, not the several the code allowed: ask for a multi-image edit and it would fail deep in the provider call with a confusing error instead of a clean rejection up front. On the same diff, GPT-5 Codex alone noticed the code quietly dropped a content-moderation setting, so anyone who turned safety filtering up would have silently gotten the default instead. Four models from one lab would have nodded along and shipped both.

The obvious objection: isn’t this just ensemble variance? Wouldn’t four runs of one strong model, at different temperatures with different prompts, catch the same things? Some of them, sure. But temperature resamples the same distribution. It reshuffles confidence inside one set of priors; it doesn’t add the prior that catches the dropped moderation default when the other three are structurally blind to it. The blind spots live in the training, not the sampling. I haven’t run the clean experiment (four-temperature-of-one versus four-labs on a labeled set) and I’d genuinely like to. My working bet is that lineage diversity buys coverage temperature can’t.

This matters more once an agent writes the first draft. If Claude writes the code and Claude reviews it, that’s the same opinion twice. The reviewer is blind in exactly the spots where the author was.

The architecture

It started as a local Cursor CLI command that fanned a diff out to all four labs. Each model runs two passes: adversarial (assume it’s broken, find where) and edge-case (assume the happy path works, find the input that isn’t). Four models, two passes, 8 reviews per PR.

Eight raw reviews is too much: noisy, double-counted, full of the fake consensus above. So nothing posts to the PR directly. Everything funnels into one judge, the latest Claude Opus, run once per PR and told not to trust the reviewers. The judge reads the actual changed files (the reviewers see the diff; the judge sees ground truth) and sorts every finding into verified, pre-existing, or false-positive, then caps output at the 10 highest-signal items. The reviewers over-flag on purpose. The judge’s job is to throw most of it out.
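The consolidation step can be sketched in a few lines of Python. This is an illustration, not the judge's real logic: the `Finding` type is hypothetical, and ranking "highest-signal" by severity tag is an assumption, since the post does not say how the judge orders its output.

```python
from dataclasses import dataclass

CAP = 10  # the output cap described above
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "nit": 4}


@dataclass
class Finding:
    title: str
    severity: str
    verdict: str  # "verified", "pre-existing", or "false-positive"


def consolidate(findings, cap=CAP):
    """Keep verified findings, most severe first, capped at `cap` items."""
    verified = [f for f in findings if f.verdict == "verified"]
    # Assumption: severity stands in for "signal". list.sort is stable,
    # so findings with equal severity keep their original order.
    verified.sort(key=lambda f: SEVERITY_RANK[f.severity])
    return verified[:cap]


raw = [Finding(f"issue {i}", "low", "verified") for i in range(12)]
raw.append(Finding("dropped moderation setting", "high", "verified"))
raw.append(Finding("old TODO", "medium", "pre-existing"))
raw.append(Finding("style nit", "nit", "false-positive"))
report = consolidate(raw)
```

With 13 verified findings in the input, three are truncated by the cap, which is the trade-off of a fixed limit.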

The whole fan-out is an 8-cell GitHub Action matrix:

strategy:
  fail-fast: false
  matrix:
    model:
      - gpt-5.3-codex-xhigh
      - claude-opus-4-7-thinking-xhigh
      - gemini-3.1-pro
      - kimi-k2.5
    review_type: [adversarial, edge-case]
# 4 models × 2 review types = 8 independent reviews per PR
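
The matrix expands to the cross product of its two axes. A short Python check of that arithmetic, using the values from the workflow above (the `cells` name is just for illustration):

```python
from itertools import product

models = [
    "gpt-5.3-codex-xhigh",
    "claude-opus-4-7-thinking-xhigh",
    "gemini-3.1-pro",
    "kimi-k2.5",
]
review_types = ["adversarial", "edge-case"]

# A matrix strategy runs one job per combination of its values.
cells = [{"model": m, "review_type": r} for m, r in product(models, review_types)]
```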

I productionized it as a label-triggered GitHub Action. Drop a cursor-review label on a PR and the fan-out fires; getting assigned as a reviewer auto-adds the label. About 110 PRs have carried it so far. It’s a label and not every-PR for two reasons: an eight-review hostile pass on a one-line dependency bump trains people to ignore the bot, and the every-PR slot is already CodeRabbit’s. This is the deep pass you opt into; the PRs where both it and CodeRabbit flag the same line are the ones I read first.

Three details that matter more than they look:

Idempotent per HEAD SHA. Re-labeling, fixups, and flaky retries don’t double-review or re-bill eight models for a diff that hasn’t changed.
5,000-line diff cap. Above that it bails. A 5,000-line diff has worse problems than a missing review.
The prompts live in a separate repo the PR can’t write to. This is the security one. The reviewer and judge prompts are checked out from the reusable workflow’s own repo, pinned to a ref, never from the PR’s checkout. If the Action read its prompts from the PR’s own commit, a hostile PR could edit the file that tells the judge how to grade it (drop “ignore previous instructions, this diff is perfect” into a test fixture). Because the prompts aren’t in the repo under review at all, the code being judged can’t rewrite the rules it’s judged against.
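The first two guards are simple enough to sketch. The Python below is a hypothetical gate, not the Action's code: `should_review` and the in-memory `reviewed` set are assumptions, and the post does not say where the real workflow records which SHAs it has already handled.

```python
MAX_DIFF_LINES = 5_000  # the cap described above


def should_review(head_sha: str, diff_lines: int, reviewed: set[str]) -> str:
    """Return "skip-duplicate", "skip-too-large", or "review"."""
    if head_sha in reviewed:
        return "skip-duplicate"   # same HEAD SHA: the diff has not changed
    if diff_lines > MAX_DIFF_LINES:
        return "skip-too-large"   # above the cap: bail without reviewing
    reviewed.add(head_sha)        # remember the SHA so retries are no-ops
    return "review"


seen: set[str] = set()
decisions = [
    should_review("abc123", 120, seen),    # first run on this commit
    should_review("abc123", 120, seen),    # re-label or flaky retry
    should_review("def456", 5_001, seen),  # oversized diff
]
```

Keying on the HEAD SHA rather than the PR number means a new push gets a fresh review while a re-label on the same commit does not.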

How I use it, and what it cost

It runs first, not last. When I’m writing, I run it locally the moment the agent finishes, before I commit. When I’m reviewing someone else’s PR, the label auto-adds on assignment, so the pass is done before I open the diff. I read the bot’s verdict first, then the code, and the output stays on the PR as a paper trail other reviewers can audit instead of taking my word for it.

One example of why reading it first pays off. A change I’d approved, and a teammate had signed off on too, touched the shared code that paginates long lists. Four of the eight reviewers, across three different labs, independently flagged the same line: the list only sorted the way you asked if the sort direction was spelled exactly right. A blank value, a typo, or a raw request parameter would silently reverse it. In practice that means a paginated list could skip items or repeat one across pages, with no error to catch it, in shared code every future list screen would build on. When four rival models circle the same line on a change two humans already cleared, that’s the part you read first.

Before, I ran this by hand on PRs assigned to me, and not at all on the rest. After: 8 reviews (four adversarial, four edge-case) plus a judge on ~110 PRs, flat $200/month, never once hit the limit. Built in about 24 days and 35 commits, most of them me arguing with the judge about what counts as “verified.”

One design call earned its keep. Severity is a 5-level tag (critical / high / medium / low / nit), and a malformed or missing severity falls back to medium rather than getting dropped. Losing a critical bug to a formatting hiccup was the failure mode I cared about most.
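That fallback is a small piece of parsing. A minimal Python sketch, assuming the severity arrives as free text from a model; the `normalize_severity` helper and its trimming and lowercasing are illustrative choices, not the workflow's actual code.

```python
SEVERITIES = ("critical", "high", "medium", "low", "nit")
FALLBACK = "medium"


def normalize_severity(raw):
    """Return a valid severity tag, falling back to medium when malformed."""
    if not isinstance(raw, str):
        return FALLBACK               # missing or non-text value
    tag = raw.strip().lower()
    return tag if tag in SEVERITIES else FALLBACK


examples = [normalize_severity(v) for v in (" Critical ", None, "sev-1", "")]
```

The finding is never dropped: a bad tag only changes where it sorts, which is the failure mode the design is guarding against.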

It also stopped being mine. It’s a shared Action, so anyone drops the label and gets the same pass, no install, no asking me. It went from a private hack to team infrastructure the day another engineer saw the comments on my PRs and asked to put it on the frontend repo.

What’s still open

The lineup rotates. “Top model from each lab” is a moving target. Four-different-labs is the durable part, not the roster, which is why it’s one config change in a shared repo that every consumer picks up on the next run.
The judge’s cap of 10 is a heuristic. Sometimes a PR has 14 real problems and 11 through 14 get truncated. Ten is a vibe that’s held, not a number I derived.
The judge is a Claude model, same house as one of the four reviewers. LLM judges show measurable self-preference, so it could over-weight the Claude reviewer. Working from the real files limits this, but I haven’t fully closed it.
None of this is benchmarked. No held-out labeled bug set, no precision/recall, no controlled one-lab-versus-another comparison. What I have is ~110 PRs of lived experience and real bugs it caught that humans (me included) had waved through. Engineering judgment backed by results I trust, not a study you should cite. Benchmark it properly and I’d like to see the numbers.

The architecture is the contribution, so the prompts and the workflow are open:

Cursor Review GitHub Workflow →

Take it, run it on your own PRs, and tell me where the judge cap is wrong. We open-source how we work because the engineers we want are the ones who read this and immediately want to argue with the design. If that’s you, come build with us →.

Dynamic VRAM in ComfyUI: Saving Local Models from RAMmageddon

Comfy — Wed, 25 Mar 2026 16:13:12 GMT

The recent increase in hardware RAM prices has been a pain for everyone. To help alleviate this, we are introducing a new ComfyUI memory optimization system: Dynamic VRAM.

ComfyUI has always been the most efficient way of running diffusion models, and we just made it significantly better. Our goal is to make even the largest open models more accessible to everyone.

Dynamic VRAM has been available in ComfyUI stable for the past three weeks for Nvidia hardware on Windows and Linux (WSL support is currently not planned). It is designed to drastically reduce system RAM usage while accelerating overall workflow execution.

Dynamic VRAM fundamentally changes how ComfyUI handles model weights, making the experience much smoother for users on memory-constrained hardware. Key improvements include:

Lower System RAM Usage: A noticeable reduction in the amount of traditional RAM required to run complex workflows.
Elimination of OOM Errors: Out-Of-Memory crashes caused by insufficient weight offloading should be fully resolved.
Faster Loading Times: Initial model loads and LoRA applications are significantly faster in some cases.
Paging Prevention: You can now run models that exceed your physical RAM capacity without relying on your operating system’s slow page file.
Increased VRAM Utilization: You may notice your GPU’s VRAM usage is higher than before. This is completely normal and indicates the system is utilizing your fastest available memory much more effectively.
Simplified Development: The previous memory system depended on predicting how much memory a model would take before running inference, and on keeping enough memory free for the operations to complete without an OOM. With Dynamic VRAM we no longer need to do any of this.

A Note on Windows Task Manager: If you check Task Manager, it may not immediately reflect a drop in system RAM usage. If you have plenty of available memory, ComfyUI will smartly keep weights cached in your RAM to maintain high speeds. However, unlike previous iterations, these cached weights will never be pushed to your page file. The moment another application needs that memory, ComfyUI will instantly unload the weights to make room.

Performance Benchmarks

ComfyUI was already the most memory-efficient way to run these models on consumer hardware, but the new optimization yields substantial speedups. Here are some quick benchmarks we ran:

Video Workloads (WAN2.2 (2x14B fp16 and fp8 models), 320x320x81f): Tested on Windows, RTX 5060, 32GB and 64GB RAM

Note that the total diffusion model size is 2x28GB for the fp16 weights so 56GB total.
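
The 56GB figure follows from parameter count times bytes per weight. A quick Python check, counting diffusion weights only and using decimal gigabytes; the fp8 total is derived here from one byte per weight and is not a number quoted in the post.

```python
def weights_gb(params_billion: int, bytes_per_param: int, copies: int = 1) -> int:
    """Weight size in decimal GB: (params_billion * 1e9 * bytes) / 1e9."""
    return params_billion * bytes_per_param * copies


fp16_total = weights_gb(14, 2, copies=2)  # two 14B models at 2 bytes per weight
fp8_total = weights_gb(14, 1, copies=2)   # the same models at 1 byte per weight
```

Either total is large next to the 32GB RAM configuration in the test, which is why offloading behavior dominates these benchmarks.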

Flux 2 Dev, default workflow, bf16 text encoder and diffusion model: Tested on Linux, Blackwell 6000 Pro

Deep Dive: The Mechanics of the AI Model Dynamic Offloader (aimdo)

Dynamic VRAM isn’t just a tweak to existing settings; it is a custom PyTorch VRAM allocator specifically designed to handle on-demand offloading of model weights when the primary PyTorch allocator comes under pressure.

Here is exactly how it manages your memory pipeline:

1. The Virtual Base Address Register (VBAR) When you load a model, the application creates a VBAR for it. The brilliant part here is that creating a VBAR costs absolutely zero physical VRAM; it only consumes GPU virtual address space (which is essentially free and unlimited). ComfyUI then allocates the tensors for the model weights inside this VBAR. Initially, these tensors are completely un-allocated. If the system tried to touch them normally at this stage, it would trigger a segfault.

2. The fault() API (Just-in-Time Allocation) Instead of loading everything upfront, the application “faults in” the tensors using a custom fault() API at the precise millisecond the tensor is needed for a calculation. This is the exact moment physical VRAM is actually consumed.

3. Success vs. Pressure Scenarios When a layer requests a weight via fault(), two things can happen depending on your available memory:

If successful (sufficient VRAM): The system has allocated VRAM for this weight, and ComfyUI populates it with the weight data the first time. On subsequent successful faults (e.g. on the next sampler step), the weight can be used immediately. This means the weight stays in VRAM for speed, but can be instantly freed later if the system comes under memory pressure. If such a free happens in the middle of sampling, the fault() API detects it efficiently on any step.
If unsuccessful (insufficient VRAM / offloaded weight): ComfyUI doesn’t crash with an OOM. Instead, it allocates a temporary, regular GPU tensor, copies the required weight data over just for that specific layer, and uses it to execute the layer’s operation. The temporary regular tensor is then freed or reused for other offloaded layers after the layer executes.
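
The two outcomes can be modeled with a toy simulation in Python. This is not aimdo's API or implementation: `ToyVbar`, its integer "sizes", and the `run_layer` helper are hypothetical, and real VRAM accounting is far more involved.

```python
class ToyVbar:
    """Toy model of just-in-time weight allocation (not aimdo's real API)."""

    def __init__(self, weight_sizes: dict[str, int], vram_budget: int) -> None:
        self.sizes = weight_sizes
        self.free = vram_budget
        self.resident: set[str] = set()   # weights that own VRAM right now

    def fault(self, name: str) -> bool:
        """Try to make `name` resident; True on success, False under pressure."""
        if name in self.resident:
            return True                   # already in VRAM, usable immediately
        if self.sizes[name] <= self.free:
            self.free -= self.sizes[name]
            self.resident.add(name)       # caller populates the data once
            return True
        return False                      # caller falls back to a temporary copy


def run_layer(vbar: ToyVbar, name: str) -> str:
    """Report which path one layer takes on this step."""
    was_resident = name in vbar.resident
    if vbar.fault(name):
        return "reuse" if was_resident else "load-and-keep"
    return "temporary-copy"


layers = ("layer0", "layer1", "layer2")
vbar = ToyVbar({"layer0": 4, "layer1": 4, "layer2": 4}, vram_budget=8)
first_step = [run_layer(vbar, n) for n in layers]
second_step = [run_layer(vbar, n) for n in layers]
```

In this toy run the third layer never fits, so it takes the temporary-copy path on every step while the first two stay resident.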

4. Priorities and the “Watermark” System To prevent the engine from violently thrashing—where it constantly tries and fails to fault in every single weight on every single iteration—the allocator uses a strict hierarchy and watermark system.

The most recently loaded VBARs (your current active model) are given the highest priority.
If a high-priority weight requires space, it will forcefully evict lower-priority weights.
When a weight gets evicted from a VBAR, the system sets a watermark at that weight’s level. Any weights in that same VBAR above the watermark will automatically fail the fault() API moving forward. This allows the application to smoothly check for space without wasting compute cycles constantly attempting to load weights into a full GPU.
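
A toy Python model of the priority and watermark rules is sketched below. It is an illustration under stated assumptions, not aimdo's code: the class names are hypothetical, weights are evicted from the highest index down, and the evicted weight itself is treated as blocked along with everything above it.

```python
class ToyVbar:
    """One model's weights, in order (toy model, not aimdo's real API)."""

    def __init__(self, name: str, sizes: list[int]) -> None:
        self.name = name
        self.sizes = sizes
        self.resident: set[int] = set()
        self.watermark = len(sizes)   # no evictions yet: nothing is blocked


class ToyAllocator:
    def __init__(self, budget: int) -> None:
        self.free = budget
        self.vbars: list[ToyVbar] = []   # later entries have higher priority

    def load(self, vbar: ToyVbar) -> None:
        self.vbars.append(vbar)

    def fault(self, vbar: ToyVbar, index: int) -> bool:
        if index in vbar.resident:
            return True
        if index >= vbar.watermark:
            return False                 # fast fail, no allocation attempt
        need = vbar.sizes[index]
        # Evict from lower-priority VBARs until the request fits.
        for victim in self.vbars[: self.vbars.index(vbar)]:
            while self.free < need and victim.resident:
                evicted = max(victim.resident)   # assumption: top index first
                victim.resident.remove(evicted)
                victim.watermark = min(victim.watermark, evicted)
                self.free += victim.sizes[evicted]
        if self.free < need:
            return False
        self.free -= need
        vbar.resident.add(index)
        return True


alloc = ToyAllocator(budget=8)
old, new = ToyVbar("old", [4, 4]), ToyVbar("new", [4, 4])
alloc.load(old)
alloc.load(new)                          # most recently loaded: top priority
results = [
    alloc.fault(old, 0),                 # fits
    alloc.fault(old, 1),                 # fits, VRAM now full
    alloc.fault(new, 0),                 # evicts old[1], sets old's watermark
    alloc.fault(old, 1),                 # fails fast at the watermark
    alloc.fault(old, 0),                 # still resident below the watermark
]
```

The fourth call returns without touching the free-space accounting at all, which is the thrash-avoidance the watermark is there to provide.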

Because of this architecture, there is no need to manually manage VRAM quotas or limits anymore. The allocator continuously polls and automatically balances the pinned and unpinned tensors natively.

New RAM Behavior

ComfyUI now has its own safetensors loader, which uses a more efficient file-opening mode to avoid committed memory allocations. Files are opened and mapped to uncommitted file-backed memory, and instead of being deep-copied into the PyTorch model, the weights are assigned by pointer to that uncommitted memory. This is why the model loader nodes now execute almost instantly in Dynamic VRAM mode. Because the memory is uncommitted, the operating system is free to reclaim it at any time to keep your system stable. Windows users will often observe high RAM usage because we keep what we can, but as soon as Windows needs RAM for anything, it’s able to take it back from ComfyUI. When ComfyUI needs those model weights again, the OS re-reads them from disk and brings them back into RAM automatically. NOTE: In Linux system monitors, this looks like very low RAM usage with the rest of RAM dedicated to disk cache, because Linux counts uncommitted RAM as “cache” rather than as usage.
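
The general idea of file-backed mapping can be seen with Python's standard `mmap` module. This sketch shows the operating-system mechanism only; it is not ComfyUI's loader, and the temporary file is a stand-in for a real weights file.

```python
import mmap
import os
import tempfile

# Write a small stand-in "weights" file (a real one would be gigabytes).
payload = bytes(range(256)) * 4096  # 1 MiB
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

with open(path, "rb") as f:
    # Mapping reserves address space; pages are read from disk on first touch.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(mapped)              # no copy of the file contents
tensor_bytes = view[1024:1024 + 16]    # zero-copy window, like one tensor's bytes
first_values = [tensor_bytes[i] for i in range(4)]

# Release the views before closing the mapping, then clean up the file.
tensor_bytes.release()
view.release()
mapped.close()
os.remove(path)
```

Because the pages are backed by the file, the OS can drop them under pressure and read them again later, which matches the reclaim behavior described above.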

ComfyUI no longer unloads models from VRAM back to RAM at all. Instead, the uncommitted memory allocations described above are held for the lifetime of the model (including across workflow runs). This saves PCIe and DDR bus traffic and also avoids the previously very common case of RAM exhaustion when unloading models in multi-model workflows. For many users this led to the page file being used to hold these unloaded models. That doesn’t happen anymore: the VRAM is simply freed, and the model is instantly restored to the “uncommitted” load state described above.

Next Steps in Development

We are continuously working to improve this system. Our immediate roadmap includes:

Addressing any reported performance bugs or regressions.
Implementing AMD and other hardware support.
Further reducing the overall RAM footprint by freeing intermediate values between nodes in a smart way, making them smaller (the --fp16-intermediates flag, still experimental), and other more advanced tricks.
Faster disk loading. If your NVMe SSD is fast enough we may be able to optimize things to eventually achieve full disk offloading without any slowdowns depending on the model and your hardware configuration.

If you encounter any issues related to Dynamic VRAM, please open an issue on GitHub with a detailed report (including your full logs, the workflow, your hardware, and your operating system) so we can fix it. For performance troubleshooting, please ensure you are comparing the total workflow execution time and not just the iterations per second (it/s).