Gemini, Veo, Runway, Pika: An Honest Comparison of the Four Models We Tested for Ad Generation

26 May 2026

We get the same question from every prospect: "which AI model is actually the best for ad creative?" The honest answer is that there is no single best model — there is the right model for the job. Here is the unvarnished comparison after running each of the four leading models through the same eight creative briefs.

We tested Google's Gemini 2.5 Flash Image and Veo 3.0, Runway Gen-3 Alpha, and Pika 1.5 across two video briefs and six banner briefs designed to stress different capabilities: photorealism, brand consistency, text rendering, motion coherence, and aspect ratio flexibility. Every brief was run three times per model. What follows is the take that we wish someone had published before we spent four months testing these ourselves.

The eight briefs we tested

To make the comparison meaningful, the briefs had to span the realistic range of what marketing teams ask for. We settled on these:

  1. Pack shot — a single product (a serum bottle) on a clean background with directional lighting.
  2. Lifestyle — a person actively using the product in a real environment, candid framing.
  3. Hero collage — three products together with a graphic-design feel, suitable for a hero banner.
  4. Text-heavy banner — a sale graphic with legible typography ("50% off this weekend only").
  5. Brand-consistent variant — generated from a reference product photo, preserving exact bottle proportions and label.
  6. Multi-aspect-ratio set — same concept output to 1:1, 4:3, 9:16, and 16:9 in one batch.
  7. 8-second product video — slow camera dolly around a product on a turntable.
  8. 8-second narrative video — a person opening the product, smiling, and using it once.
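For a sense of scale, here is a minimal sketch of the test matrix, taking the description above at face value (every model on every brief, three runs each). The short brief labels and the `build_test_matrix` helper are illustrative names, not part of any tool mentioned in this post.

```python
from itertools import product

# Short labels for the eight briefs listed above (labels are illustrative).
BRIEFS = [
    "Pack shot",
    "Lifestyle",
    "Hero collage",
    "Text-heavy banner",
    "Brand-consistent variant",
    "Multi-aspect-ratio set",
    "8-second product video",
    "8-second narrative video",
]

MODELS = ["Gemini 2.5 Flash Image", "Veo 3.0", "Runway Gen-3 Alpha", "Pika 1.5"]
RUNS_PER_MODEL = 3  # "Every brief was run three times per model."


def build_test_matrix(briefs, models, runs):
    """Return one (brief, model, run_number) tuple per planned generation."""
    return list(product(briefs, models, range(1, runs + 1)))


matrix = build_test_matrix(BRIEFS, MODELS, RUNS_PER_MODEL)
print(len(matrix))  # 96
```

Under those assumptions the plan comes to 96 generations before any reruns, which is worth knowing before you budget a bake-off of your own.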

Gemini 2.5 Flash Image — the workhorse for banners

Gemini is the model we run by default for banner work, and the testing reinforced why. It absolutely dominated the first six briefs.

On the pack shot, Gemini produced photorealistic, well-lit images on the first try in nine out of nine attempts. Lighting was directional and consistent across variants. No weird artifacts on glass or chrome surfaces, which is where lesser models fall apart. On the lifestyle brief it was less consistent — about six out of nine usable — but the failures were obvious composition problems rather than uncanny-valley horror, so they were easy to spot and skip.

The genuinely impressive result was on brief 5: brand-consistent variants from a reference image. When given a clean reference photo of the serum bottle, Gemini preserved bottle proportions, label position, and even the readable text on the label across eight variants. This is the use case that makes a generative model viable for real brand work. Three years ago you could not have asked any model to do this.

Where Gemini struggles: text. Brief 4 was the text-rendering test, and Gemini failed it consistently. The model can render approximate text shapes, but the characters are wrong, kerning is off, and any number longer than two digits comes out garbled. For text-heavy work, generate the visual in Gemini and overlay the text in a real graphics tool. Do not fight this: it is a widely known limitation of image diffusion models, not a Gemini-specific problem.

Cost and speed: Gemini is fast and cheap. Most generations finish in two to four seconds, and the API pricing is by far the friendliest of the four models. This is why we use it as the default: the iteration loop stays tight.

Veo 3.0 — the only video model that produces watchable output today

Veo is what we route every video brief to. After the testing we are not even close to neutral on this — Veo is in a different league from the alternatives for ad video work right now.

The 8-second product video (brief 7) came back from Veo with the camera move we requested (slow dolly, 360 degrees), the product staying in frame the entire time, accurate reflections of the studio environment, and zero physics violations. Two of the three runs were directly usable. The third had a small artifact in the final second.

The narrative video (brief 8) was harder. Veo produced a coherent 8-second story in two of three runs: hand reaches in, picks up product, opens it, takes a sip (we tested with a beverage prop). The character's hand looked like a real hand. The product label stayed readable. The lighting was consistent across the cut. The one failure had a momentary flicker where the product label swapped to a different design halfway through, which we caught only because we were looking carefully.

The catch: Veo is slow and expensive. An 8-second video takes 90-180 seconds to generate, and the per-generation cost is roughly 4x what a banner costs. For most marketing teams this is fine — you are not generating hundreds of videos per day. But it does mean you cannot use video the way you use banner variants. You write a tighter brief, you generate fewer takes, and you live with what comes back.
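The trade-off above can be put into rough numbers. This is a back-of-envelope sketch that uses only the two figures quoted in this section (about 4x the cost of a banner, and 90 to 180 seconds per 8-second clip) and assumes takes are generated one after another. Costs are expressed in "banner units" because no absolute prices are given here, and the function name is hypothetical.

```python
VIDEO_COST_MULTIPLIER = 4        # one video take costs roughly 4 banner generations
VIDEO_SECONDS_RANGE = (90, 180)  # quoted generation time per 8-second clip


def video_batch_estimate(takes):
    """Return (cost in banner units, min seconds, max seconds) for sequential takes."""
    if takes < 0:
        raise ValueError("takes must be non-negative")
    fastest, slowest = VIDEO_SECONDS_RANGE
    return takes * VIDEO_COST_MULTIPLIER, takes * fastest, takes * slowest


cost_units, min_seconds, max_seconds = video_batch_estimate(3)
print(cost_units, min_seconds, max_seconds)  # 12 270 540
```

Three takes of one video brief cost about as much as a dozen banners and take between four and a half and nine minutes of waiting, which is why a tighter brief pays off.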

Runway Gen-3 Alpha — strong second for video, distinctive aesthetic

Runway has been the public-facing leader in generative video for years, and the maturity shows. Gen-3 produced beautiful, cinematically lit footage on brief 7, arguably more beautiful than Veo's in pure aesthetic terms. The lighting felt like a real cinematographer had been involved.

Where Runway lost the comparison: brand fidelity. Asked to render the same product across three variants, Runway tended to drift. The bottle shape would shift slightly, the label color would warm or cool, and on the narrative brief the product's silhouette changed between cuts. For art-direction-heavy work this is wonderful. For brand work where the product has to be recognizable, it is a problem.

The other issue is the aspect-ratio handling. Runway has historically been optimized for 16:9 and the 9:16 outputs we tested still felt like cropped 16:9 frames rather than purpose-composed vertical shots. If your campaign lives on Reels, this matters.
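A quick calculation shows why a cropped landscape frame reads as cramped. The sketch below assumes the vertical output is a straight full-height crop of a 16:9 frame. That is an assumption for illustration only, not a statement about how Runway composes its outputs.

```python
from fractions import Fraction


def retained_fraction(source_ratio, target_ratio):
    """Fraction of the source frame's area kept by the largest centered crop.

    Ratios are width / height, e.g. Fraction(16, 9) for landscape video.
    """
    if target_ratio <= source_ratio:
        # Target is narrower: keep the full height, lose width.
        return target_ratio / source_ratio
    # Target is wider: keep the full width, lose height.
    return source_ratio / target_ratio


kept = retained_fraction(Fraction(16, 9), Fraction(9, 16))
print(kept)  # 81/256, a little under a third of the original frame
```

A purpose-composed vertical shot gets to use the whole canvas. A crop inherits less than a third of a composition that was designed for a different shape.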

Verdict: we keep Runway in our toolkit for brand-storytelling videos where the aesthetic carries the work. We do not use it for product-recognition-critical campaigns.

Pika 1.5 — best for short, punchy social loops

Pika is the youngest of the four and it shows in the most basic creative tasks. Brief 7 (product video) came back wobbly in all three runs — the camera motion stuttered and the reflections did not track the implied light source. Pika is not where you go for cinematography.

What Pika is excellent at: short, weird, attention-grabbing loops. The kind of three-second clip that pattern-interrupts in a feed. We ran an unscientific bonus brief — "a sneaker bouncing on a trampoline made of clouds" — and Pika produced the most engaging result of the four models by a wide margin. The other three made something more polished but less interesting.

If your brand voice is playful, irreverent, or experimental, Pika earns its place in your stack. If you are selling enterprise software, skip it.

The decision framework we use

After enough testing, we landed on a simple routing rule that we now build into the platform:

  • Banner work → Gemini. Always. The cost and speed advantage is too large to give up unless you have a specific reason.
  • Product-recognition video → Veo. The brand fidelity is unmatched right now.
  • Brand-story video → Runway. When aesthetic matters more than recognition, Runway is worth the premium.
  • Pattern-interrupt loops → Pika. Weird beats polished in the scroll feed.
  • Anything with legible text → none of them. Generate the visual in any model, overlay text in Figma.
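The routing rule above is simple enough to write down as a lookup table. This is a minimal sketch, not the platform's actual code: the job-type keys, the `Route` type, and the `route` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str                         # model that generates the visual
    overlay_text_in_design_tool: bool  # True when legible text is added afterwards


# One entry per bullet in the routing rule (keys are illustrative).
ROUTES = {
    "banner": "Gemini 2.5 Flash Image",
    "product_video": "Veo 3.0",
    "brand_story_video": "Runway Gen-3 Alpha",
    "pattern_interrupt_loop": "Pika 1.5",
}


def route(job_type, needs_legible_text=False):
    """Pick a model for the visual; legible text is never left to the model."""
    if job_type not in ROUTES:
        raise ValueError(f"unknown job type: {job_type!r}")
    return Route(ROUTES[job_type], overlay_text_in_design_tool=needs_legible_text)


print(route("banner", needs_legible_text=True))
```

Keeping the table as plain data means that swapping a model later is a one-line change rather than a rewrite of the workflow.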

What none of these models do well yet

Three honest limitations to flag before you over-invest:

Hands and fingers. All four models still occasionally produce broken hands, especially when fingers are interacting with small objects. If your product is held in close-up, expect a 20-30% rejection rate on that specific frame. Plan extra generations.
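To turn "plan extra generations" into a number, divide the frames you need by the expected acceptance rate and round up. The sketch below does that with integer arithmetic. It gives an average-case planning figure, not a guarantee, and the function name is illustrative.

```python
def generations_needed(usable_target, rejection_pct):
    """Generations to request so the expected number of keepers meets the target.

    rejection_pct is a whole-number percentage, e.g. 20 for a 20% rejection rate.
    """
    if not 0 <= rejection_pct < 100:
        raise ValueError("rejection_pct must be in the range 0-99")
    accepted_pct = 100 - rejection_pct
    # Ceiling division without floats: ceil(target * 100 / accepted_pct).
    return -(-usable_target * 100 // accepted_pct)


print(generations_needed(10, 20))  # 13
print(generations_needed(10, 30))  # 15
```

For ten usable close-ups, that means budgeting 13 to 15 generations rather than 10.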

Brand color exactness. Models will produce a color "in the family" of your brand palette, but they will not match a specific hex code reliably. If brand-compliance demands exact #4361EE, plan to color-correct in post.
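A simple check can flag which outputs need that color correction. The sketch below compares a sampled color against the brand hex using plain Euclidean distance in RGB. That is a crude stand-in for a perceptual metric, and the tolerance is a made-up parameter you would set from your own brand guidelines. How you sample the color from the image is outside this sketch.

```python
import math


def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints in 0-255."""
    value = hex_code.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_code!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_distance(hex_a, hex_b):
    """Euclidean distance between two hex colors in RGB space."""
    return math.dist(hex_to_rgb(hex_a), hex_to_rgb(hex_b))


def needs_color_correction(sampled_hex, brand_hex="#4361EE", tolerance=0.0):
    """True when the sampled color is further from the brand color than allowed."""
    return rgb_distance(sampled_hex, brand_hex) > tolerance


print(hex_to_rgb("#4361EE"))              # (67, 97, 238)
print(needs_color_correction("#4664EE"))  # True
```

With the default tolerance of zero, anything other than an exact match is flagged, which matches the strict brand-compliance case described above.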

Multi-character coherence. Generating two specific people consistently across a video is still essentially impossible. Single-character work is solid. With two characters, one of them will drift.

What we expect to change in the next twelve months

The pace of improvement is faster than any of us are mentally pricing in. Twelve months ago Veo did not exist publicly and Gemini's text rendering produced gibberish. Today we are arguing about brand fidelity in 8-second narratives. Twelve months from now we expect:

  • Text rendering that is finally usable for sale graphics directly out of the model.
  • Multi-character coherence in short videos, opening up actual scripted ad work.
  • Video lengths past the current 8-10 second ceiling.
  • Brand-palette exactness via in-model conditioning rather than post-processing.

That trajectory is why we build the platform to be model-agnostic. The right model for your job in 2026 is not necessarily the right one for the same job in 2027, and you should not have to rebuild your workflow every time a new model ships.

How to test this for yourself

If you want to run your own bake-off, use the same brief structure we did: a pack shot, a lifestyle shot, a brand-consistent variant, and an 8-second video. Run each through three models, generate three times per model, and rank the outputs blind (cover the model name with a piece of paper if you have to). The blind rank will surprise you: at least one of your team's strong opinions about a model will not survive contact with actual outputs.
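If a piece of paper feels too low-tech, the blinding step can be scripted. This is a minimal sketch that assumes each output is a (model name, file path) pair. The `blind_outputs` helper and the file names are hypothetical. File names that contain the model name would defeat the blinding, so rename the files first.

```python
import random


def blind_outputs(outputs, seed=None):
    """Shuffle outputs and replace model names with neutral labels.

    Returns (blinded, key): blinded is a list of (label, path) pairs to hand to
    reviewers, and key maps each label back to its model for the reveal.
    """
    rng = random.Random(seed)
    shuffled = list(outputs)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for index, (model, path) in enumerate(shuffled, start=1):
        label = f"Output {index:02d}"
        blinded.append((label, path))
        key[label] = model
    return blinded, key


outputs = [
    ("Model A", "take_001.png"),
    ("Model B", "take_002.png"),
    ("Model C", "take_003.png"),
]
blinded, key = blind_outputs(outputs, seed=7)
for label, path in blinded:
    print(label, path)
```

Keep the key with someone who is not ranking, and only reveal it after the scores are written down.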

That is the real value of running a structured test. Not picking a winner once, but building the institutional reflex of testing before committing. The models will keep changing. The discipline of structured comparison is the durable part.
