I Tested Claude, GPT-5, Gemini & Grok on the Same Portfolio Brief

Four frontier models. One identical brief. Six themes to choose from. Here's who won which round.

The setup

The "which AI is best for building websites" question shows up in my inbox a few times a week. The honest answer used to be "doesn't matter much — they all generate bloated HTML." But once you switch to an AI-native protocol like Mold, the answer changes. Now the model isn't doing layout, CSS, or markup. It's doing two things: picking the right theme and writing the copy. Both are taste problems, not engineering problems. And taste is exactly where these models diverge.

So I gave four frontier models (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Flash, and Grok 4) the exact same prompt and the exact same bio. Then I measured five things on each output: theme fit, copy voice, JSON validity, token cost, and edit responsiveness. Here's how they actually performed.

The brief

Each model got the Mold schema cheatsheet plus this bio (lightly disguised — real person, kept her name out of it):

About me: I'm a documentary photographer based in Lisbon, focused on
small fishing communities along the Atlantic coast. My work has been
in National Geographic and TIME. I sell limited-edition prints and
take on one commission a quarter. I want a site that feels like a
gallery, not a CV.

Each model picked a theme and generated a JSON page, which I pasted into the Mold renderer without editing the content.

Round 1 — Theme fit

Of the six themes (Golden Hour, Sterling, Aurora, Bloom, Ink, Neon), only two are plausible for a documentary photographer: Golden Hour (warm, editorial) and Ink (minimal, gallery-coded).

Model	Picked	Verdict
Claude 4.6	Ink	Correct. Read "gallery, not CV" as the signal.
GPT-5	Golden Hour	Correct. Leaned warm/editorial on "National Geographic."
Gemini 2.5	Aurora	Wrong. Aurora is gradients — too playful for documentary work.
Grok 4	Ink	Correct, with a sharp justification in its preamble.

Three out of four nailed it. Gemini's miss isn't a small one — Aurora's gradient hero would make a fishing-village photo look like an EDM festival flyer. Round goes to Claude, GPT-5, and Grok.

Round 2 — Copy voice

I asked all four to write the hero headline. Here's what came back:

Model	Hero headline
Claude 4.6	"Quiet light, working coast."
GPT-5	"Portraits of a vanishing Atlantic."
Gemini 2.5	"Documentary Photography by [Name]"
Grok 4	"Salt, light, and the people who stay."

Claude wrote like a curator. GPT-5 wrote like a publicist. Grok wrote like a Substack essayist. Gemini wrote like a 2008 WordPress theme.

The pattern holds across body copy too. Gemini is the only model that consistently defaults to "passionate about" filler — the phrase nobody who actually does the work would ever use about themselves. Claude and Grok both produce copy with verbs and specifics; GPT-5 sits in the middle, slightly more LinkedIn-leaning than the other two but never embarrassing.

Round 3 — JSON validity

Did the output parse cleanly into a Mold page on first paste?

Model	First-paste success	Notes
Claude 4.6	Yes	Zero markdown fences. Zero invented fields.
GPT-5	After cleanup	Wrapped in a ```json fence; strip it and the JSON is clean.
Gemini 2.5	Partial	Added a `"description"` field at top level that doesn't exist in schema. Renderer ignored it.
Grok 4	After cleanup	Added inline comments inside JSON (invalid). Had to strip before paste.

Claude is the only model that produced paste-and-go output with zero cleanup. The other three each had a small friction step. Nothing fatal, but if you're doing this with someone non-technical, that friction matters.
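The three friction steps in the table (a Markdown fence, inline comments, an extra top-level field) can all be scripted away. Below is a minimal Python sketch of that cleanup, not anything Mold ships. The function names are made up for illustration, the comment styles handled (`//` and `/* */`) are a guess at what a model might emit, and the allowed top-level keys (`theme`, `sections`) are placeholders rather than the real Mold schema.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def strip_fence(text: str) -> str:
    """Drop a surrounding Markdown code fence, if there is one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


def strip_comments(text: str) -> str:
    """Remove // and /* */ comments that sit outside JSON strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to (but keep) the newline.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing marker.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def clean_model_output(raw: str, allowed_keys: set[str]) -> dict:
    """Parse model output and keep only the allowed top-level keys."""
    page = json.loads(strip_comments(strip_fence(raw)))
    return {k: v for k, v in page.items() if k in allowed_keys}


# Hypothetical model output showing all three problems at once.
raw = "\n".join([
    FENCE + "json",
    "{",
    "  // hero section first",
    '  "theme": "ink",',
    '  "description": "not in the schema",',
    '  "sections": [{"id": "hero", "headline": "Quiet light, working coast."}]',
    "}",
    FENCE,
])
page = clean_model_output(raw, allowed_keys={"theme", "sections"})
print(page)
```

The comment stripper tracks whether it is inside a string, so a URL such as `"https://example.com"` in the copy survives intact. A plain regex on `//` would cut it in half.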

Round 4 — Token cost

I measured input plus output tokens for the same prompt (Mold cheatsheet ~560 tokens + bio ~80 tokens + schema, roughly 700 input tokens in total).

Model	Output tokens	Total cost (est.)
Claude 4.6	320	$0.0014
GPT-5	410	$0.0019
Gemini 2.5	280	$0.0007
Grok 4	380	$0.0015

All four come in under two-tenths of a cent. The cost question barely matters at this scale — when you're talking about a fraction of a cent per full site generation, model choice is about taste, not budget. For comparison, generating the same site as raw HTML costs $0.03–$0.08 with the same models. Mold is roughly 30–50× cheaper to drive.
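To see how that multiple varies by model, here is a small Python sketch that divides the quoted raw-HTML range by each model's estimated Mold cost. It uses only the figures in this post. The raw-HTML range isn't broken out per model, so applying the same $0.03 to $0.08 range to all four is an assumption.

```python
# Estimated cost per Mold generation, from the table above (USD).
mold_cost = {
    "Claude 4.6": 0.0014,
    "GPT-5": 0.0019,
    "Gemini 2.5": 0.0007,
    "Grok 4": 0.0015,
}

# Raw-HTML cost range quoted in the text (USD per generation).
HTML_LOW, HTML_HIGH = 0.03, 0.08


def savings_range(cost: float) -> tuple[int, int]:
    """Return the (low, high) multiple of raw-HTML cost over Mold cost."""
    return round(HTML_LOW / cost), round(HTML_HIGH / cost)


ratios = {model: savings_range(cost) for model, cost in mold_cost.items()}
for model, (low, high) in ratios.items():
    print(f"{model}: {low}x to {high}x cheaper")

# Sanity check on the "under two-tenths of a cent" claim.
all_under = all(cost < 0.002 for cost in mold_cost.values())
```

With the table's numbers the multiples run from about 16x (GPT-5 against the low end of the HTML range) to about 114x (Gemini against the high end), so 30 to 50× is best read as a middle-of-the-range figure rather than a per-model guarantee.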

Round 5 — Edit responsiveness

The real test isn't generation. It's the second prompt — the "change the hero subhead to mention prints are sold in editions of 12" follow-up.

Model	Patch quality
Claude 4.6	Returned only the changed section. ~60 tokens.
GPT-5	Returned only the changed section. ~70 tokens.
Gemini 2.5	Returned the whole page again. ~280 tokens.
Grok 4	Returned only the changed section + a brief explanation. ~90 tokens.

Gemini's behavior is the legacy "regenerate everything" pattern that AI-native protocols are supposed to kill. It still works, since the renderer accepts the full replacement, but it costs roughly 4× the output tokens per edit (about 280 versus 60 to 70). Claude and GPT-5 understood "patch only" as a first-class operation; Grok did too, with a small bonus prose layer some people will like and some will find chatty.
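The difference between a patch and a full regeneration is easier to see in code. This is a minimal Python sketch under stated assumptions: the page shape (a `sections` list keyed by `id`), the patch shape, and the `apply_patch` helper are all hypothetical, since the post doesn't show Mold's actual patch format, and the section copy is placeholder text.

```python
import copy


def apply_patch(page: dict, patch: dict) -> dict:
    """Return a new page with the section matching patch["id"] updated.

    Only the keys present in the patch are overwritten; every other
    section, and the input page itself, is left untouched.
    """
    updated = copy.deepcopy(page)
    for section in updated["sections"]:
        if section["id"] == patch["id"]:
            section.update(patch)
            return updated
    raise KeyError(f"no section with id {patch['id']!r}")


# Hypothetical page, standing in for the first generation.
page = {
    "theme": "ink",
    "sections": [
        {"id": "hero", "headline": "Quiet light, working coast.", "subhead": ""},
        {"id": "prints", "body": "Limited-edition prints."},
    ],
}

# What a "patch only" reply might carry: one section, one changed field.
patch = {"id": "hero", "subhead": "Prints are sold in editions of 12."}

patched = apply_patch(page, patch)
print(patched["sections"][0]["subhead"])
```

The patch stays the same size however long the page gets, while a full regeneration grows with every section added. That is why the gap in the table would widen on a bigger site.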

The verdict

If you have to pick one:

Claude 4.6 wins overall. A correct theme pick, the cleanest JSON, the sharpest copy voice, and the smallest patches. It's the "designer's model": it feels like it has actual taste about what a portfolio should be.

GPT-5 is the safe second pick. The longest output of the four, a slightly more conventional voice, perfectly competent. The model most people already have open.

Grok 4 punches above its weight on copy. If you want a portfolio that reads like an essay rather than a CV, Grok's voice is the most distinctive of the four.

Gemini 2.5 is the weakest fit for this task. It works, but it made the one wrong theme pick and it returns full-page rewrites instead of patches, which costs more per iteration. Use it for the bio writing phase if you want, then switch models for generation.

What the test doesn't measure

Three caveats before you change your default model:

One brief isn't a benchmark. The "documentary photographer in Lisbon" prompt is taste-coded. A more functional brief — "SaaS founder, B2B sales site, three product tiers" — might shuffle the rankings.
Model versions change weekly. Gemini 2.5 was tested mid-May 2026. By the time you read this, Gemini 3.0 may exist and the verdict may flip.
Cost differences here are tiny. If you have a free-tier subscription to any of these, pick whichever's open. The model difference is design taste, not capability gap.

The bigger point is that on an AI-native protocol, all four models work well enough that the choice becomes about voice, not capability. That's not how it felt in 2024 with raw HTML. It's a sign the surface is finally narrowing to the right shape.

Try it yourself

Run the same brief through whichever model you trust. The full guide is here, or jump straight to themes and clone to start. If your model picks something the four above didn't — Aurora for the photographer, Sterling for a poet — I want to hear about it.

— Gerald Yang Toronto, May 2026