So during my Nano Banana Pro experiments I wrote a very fun prompt that tests th...

MrManatee · 2026-04-22T07:59:35 1776844775

Prompts like this feel like they're using the wrong abstraction. The "obvious" thing to do with something like this would be to generate some code that generates the image and then run that code.

Inspired by this, I tried something much simpler. I asked it to draw 12 concentric circles. With three tries it always drew 10 instead. https://chatgpt.com/share/69e87d08-5a14-83eb-9a3b-3a8eb14692...
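For illustration, here is a minimal sketch of the "generate code that generates the image" route for the concentric-circles case. It is not what any of the models in this thread actually do; the function name, canvas size, and stroke styling are all assumptions made up for the example. The point is only that the count becomes a loop bound rather than something the image model has to get right by eye.

```python
def concentric_circles_svg(count: int = 12, size: int = 480, margin: int = 10) -> str:
    """Return an SVG document containing exactly `count` concentric circles."""
    center = size / 2
    # Radius increment between rings, chosen so the outermost ring fits the canvas.
    step = (center - margin) / count
    circles = [
        f'<circle cx="{center}" cy="{center}" r="{step * i:.2f}" '
        'fill="none" stroke="black" stroke-width="2"/>'
        for i in range(1, count + 1)
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        + "".join(circles)
        + "</svg>"
    )


svg = concentric_circles_svg(12)
print(svg.count("<circle"))  # prints 12
```

Writing `svg` to a file with an `.svg` extension gives something a browser can open directly.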

LeifCarrotson · 2026-04-22T11:36:51 1776857811

I think prompts like this are where agentic workflows come into play. If you asked it to generate the first 64 prime numbers, AI tools could do that. If you asked it to draw a charcoal image of Pokemon 13, it could do that. If you asked it to add a white Menlo 13 on a black background to the top left corner of that image, it could do that. If you asked it to do that 63 more times, it could do those things, and if you asked it to assemble those into a grid, it could.

It can't get that in a one-shot. Perhaps, though, it could figure out when it needs to break a problem into individual tasks to delegate to itself and assemble them at the end.
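A rough sketch of the deterministic half of that decomposition, assuming an 8x8 grid as quoted elsewhere in the thread. The helper names `first_primes` and `grid_cells` are hypothetical, and the per-cell image generation and final compositing steps are deliberately left out, since those would depend on whichever image API the harness is driving.

```python
from itertools import count


def first_primes(n: int) -> list[int]:
    """Return the first `n` primes using trial division by smaller primes."""
    primes: list[int] = []
    for candidate in count(2):
        if len(primes) == n:
            break
        # Only primes up to sqrt(candidate) need to be checked.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
    return primes


def grid_cells(values: list[int], columns: int = 8) -> list[tuple[int, int, int]]:
    """Assign each value a (row, column, value) slot, filling rows left to right."""
    return [(i // columns, i % columns, v) for i, v in enumerate(values)]


cells = grid_cells(first_primes(64))
print(cells[0])   # (0, 0, 2)
print(cells[-1])  # (7, 7, 311)
```

Each tuple is one delegated sub-task: draw the subject for that number, label it, and place it at that row and column.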

wahnfrieden · 2026-04-22T15:40:00 1776872400

That's what makes it a fair evaluation of its limits

fennecfoxy · 2026-04-22T16:45:38 1776876338

I mean asking these transformers to do maths has always been the wrong task. It's like we're now considering "it doesn't have x tools built with traditional code built in".

Though I suppose we're testing their model + agent harness here as well. It really _should_ have all of those tools/reasoning available to accomplish a task like the above without issue.

wahnfrieden · 2026-04-22T17:26:33 1776878793

It's only been the wrong task because they've been deficient at it and expensive to use, so we had workarounds. They are getting better at these tasks and cheaper (sometimes). It's fair to evaluate even if there are more economical and accurate alternatives available.

fjdjshsh · 2026-04-23T13:10:19 1776949819

You can evaluate the limits of a spoon by trying to cut meat with it.

The point is what are the typical use cases for the tool / what are the agreed upon areas of application?

Making the LLM do math with large numbers, I would argue, is not in its typical use case, though it's at the border.

Asking an image generator model to calculate numbers before running an image sounds definitely NOT like a reasonable use case (do people need it? Will people try using it for this purpose?)

dvt · 2026-04-22T01:23:36 1776821016

This is an amazing test and it's kinda' funny how terrible gpt-2-image is. I'd take "plagiarized" images (e.g. Google search & copy-paste) any day over how awful the OpenAI result is. Doesn't even seem like they have a sanity checker/post-processing "did I follow the instructions correctly?" step, because the digit-style constraint violation should be easily caught. It's also expensive as shit to just get an image that's essentially unusable.

the_arun · 2026-04-22T02:55:09 1776826509

This is from Gemini - https://lens.usercontent.google.com/banana?agsi=CmdnbG9iYWw6...

fblp · 2026-04-22T04:13:56 1776831236

Did it correctly follow the instructions? Don't know my pokemon well enough.

minimaxir · 2026-04-22T04:49:46 1776833386

Essentially yes (bottom got distorted), but Gemini uses Nano Banana Pro or Nano Banana 2 so it's not a surprising result. The image I linked uses the raw API.

thih9 · 2026-04-22T07:37:11 1776843431

Note that the styles are different; there are two digit images rendered in color.

Color charcoal drawings do exist, but it’s not what’s usually meant by “charcoal drawing”.

podgietaru · 2026-04-22T15:27:59 1776871679

Plusle and Minun sit next to each other in the Pokedex, 311 and 312. There's two 307s.

tsukurimashou · 2026-04-22T23:47:32 1776901652

> Create a 8x8 contiguous grid

It failed at the very first instruction

anshumankmr · 2026-04-22T02:42:04 1776825724

that is interesting cause I feel gpt-image-1 did have that feature.

(source: https://chatgpt.com/share/69e83569-b334-8320-9fbf-01404d18df...)

weird-eye-issue · 2026-04-22T04:12:34 1776831154

You are comparing ChatGPT to a raw image model. These are two completely different things. ChatGPT takes your input, modifies the prompt, passes it to the image model, and then may read the image back and provide output. The image model, when called through the API, just takes the prompt verbatim and generates an image.

minimaxir · 2026-04-22T04:42:06 1776832926

Nano Banana Pro and ChatGPT Images 2.0 also tweak the prompt because they can think.

weird-eye-issue · 2026-04-22T04:48:04 1776833284

Yes exactly, "ChatGPT Images 2.0" is in ChatGPT. That is not a model.

hyperadvanced · 2026-04-22T02:46:00 1776825960

I wouldn’t say it’s terrible. I wouldn’t say it’s a huge step forward in terms of quality compared to what I’ve seen before from AI.

podgietaru · 2026-04-22T15:26:09 1776871569

How is it that a model can produce what must be near 1:1 images ripped straight out of Pokemon Fire Red (The first ones) for profit and not be infringing copyright?

I know that's the game, but it seems CRAZY to me that they can do this.

dragonwriter · 2026-04-22T16:46:58 1776876418

Training a model on a corpus which includes copyrighted images but which is not focussed primarily or exclusively on applications which violate copyright might be fair use in the US (so far, it seems that way.)

But that doesn't mean that producing outputs using the model so trained which are based on copyright-protected ones in ways which would violate copyright if produced by any other means doesn't still violate copyright. DMCA safe harbor might apply to the system owner (IIRC, the exact boundaries are fuzzy with UGC generated on the site by the provider’s systems rather than generated elsewhere and posted), so Google may not be liable for the infringement (though if it is actively searching for references online at generation and not relying on what is trained into the model, that would seem to weaken the case for that), but it's still an infringement.

minimaxir · 2026-04-22T20:12:06 1776888726

So the sprites aren't what I considered plagiarism since to my surprise they are sufficiently different even though it's a similar design to the FR/LG sprites.

For Charmeleon, the sprite is closest to the B/W sprite, but not exact: https://bulbapedia.bulbagarden.net/wiki/Charmeleon_(Pokémon)...

For Squirtle, the sprite is much closer to the FR/LG sprite but still some differences: https://bulbapedia.bulbagarden.net/wiki/Squirtle_(Pokémon)#S...

The other images, however, crib from official artworks a bit too close for comfort.

In my original analysis I hypothesized this is due to token scarcity that reduces the ability for the model to be creative: I believe that NBP images used 1.5k tokens for that image while the gpt-2-image used 7k tokens, but this is hard to test.

dd8601fn · 2026-04-22T16:24:32 1776875072

The funny thing is the main complaint I’ve heard so far is that it repeatedly refused to operate on original content… because it might violate copyright.

minimaxir · 2026-04-22T20:15:00 1776888900

In testing the ChatGPT interface it appears to be looser on copyright than expected given the legal trouble.

DrewADesign · 2026-04-22T16:37:23 1776875843

Yeah, the CSAM generated by grok proves the guardrails are only really good for stymieing benign uses.

pixel_popping · 2026-04-23T16:34:22 1776962062

Well, Anthropic did download torrents massively right, they clearly knew this was completely illegal and decided to not care.

DrewADesign · 2026-04-22T16:34:34 1776875674

It can’t. It violates copyright. The big players are the only ones with the money to pursue these things, but they’re interested in replacing artists with AI trained on their models so they settle and set up some sort of agreement. The little guys have no presidential case law to help them along, and nowhere close to the resources to push it that far, so they get steamrolled. I know artists famous enough for people— even commercial entities — to regularly blatantly rip them off by name with “in the style of” prompts, but there’s no realistic path to pursue it. Fame doesn’t pay legal bills.

DrewADesign · 2026-04-23T13:54:43 1776952483

Precedential. Stupid autocorrect.

Jensson · 2026-04-22T16:03:23 1776873803

Gemini uses google search to find references when making images, so it probably found the pokemon images online to do this.

> I know that's the game, but it seems CRAZY to me that they can do this.

It's not crazy that a search can find existing pokemon images. Maybe google should show which images it used as references to be more transparent here.

AussieWog93 · 2026-04-22T08:30:52 1776846652

For what it's worth, NBP made some mistakes too.

Artistic oddities aside (why are the 8-bit sprites 16-bit, why do the charcoal drawings have colour, why does the art of specifically the Gen 1 Pokemon look so off?), 271 is Lombre, not Lotad.

Razengan · 2026-04-22T04:37:30 1776832650

Even a few months ago, ChatGPT/Sora's image generation performed better than Gemini/Nano Banana for certain weird prompts:

Try things like: "A white capybara with black spots, on a tricycle, with 7 tentacles instead of legs, each tentacle is a different color of the rainbow" (paraphrased, not the literal exact prompt I used)

Gemini just globbed a whole mass of tentacles without any regards to the count

vincentbuilds · 2026-04-22T06:18:38 1776838718

banana Pro gets the logic and punts on the art; gpt-2-image gets the art and punts on the logic. Feels like instruction-following and creativity sit on opposite ends of the same slider.

sdwr · 2026-04-22T18:20:04 1776882004

Yeah it's fascinating to have an alternate source for intelligence, feels like a mental Rosetta Stone

dieortin · 2026-04-22T09:29:35 1776850175

This feels incredibly AI generated

doginasuit · 2026-04-22T12:13:58 1776860038

The random accusations of AI generated comments are the most annoying part of the unfolding AI dystopia.

fennecfoxy · 2026-04-22T16:47:45 1776876465

Dogs don't wear suits! You must be AI too ;)

doginasuit · 2026-04-22T17:24:59 1776878699

Damn, busted!

rrr_oh_man · 2026-04-22T00:37:08 1776818228

Why would you consider this a good prompt?

minimaxir · 2026-04-22T00:40:35 1776818435

Because both Nano Banana Pro and ChatGPT Images 2.0 have touted strong reasoning capabilities, and this particular prompt has more objective, easy-to-validate criteria as opposed to the subjective nature of images.

I have more subjective prompts to test reasoning but they're your-mileage-may-vary (however, gpt-2-image has surprisingly been doing much better on more objective criteria in my test cases)

o10449366 · 2026-04-22T01:33:00 1776821580

[flagged]

minimaxir · 2026-04-22T01:44:59 1776822299

"Quirky and obscure" has the functional benefit of ensuring the source question is not in the training data/outside the median user prompt, and therefore making the model less likely to cheat.

We have enough people complaining about Simon Willison's pelican test.

o10449366 · 2026-04-22T06:44:06 1776840246

When you program, do you consider using your prior knowledge of programming cheating?

Bjartr · 2026-04-22T02:45:49 1776825949

What would make the prompt a better actual evaluation in your judgement?

leptons · 2026-04-22T06:18:59 1776838739

Not focusing on pokemon for a start. Maybe use something more people can recognize and evaluate. I have zero knowledge of pokemon, I see it as a niche thing for ultra-nerdy people, and not something everyone is familiar with. Nothing about that test can be evaluated by anyone but a pokemon expert. Sorry, but pokemon isn't as mainstream as some people might think it is.

Bjartr · 2026-04-22T16:37:07 1776875827

I think you underestimate how popular Pokemon is.

By most objective measures it's the largest entertainment franchise in all of history.

Would you also object to any other pop-culture reference for the same reason?

leptons · 2026-04-22T22:52:58 1776898378

>I think you underestimate how popular Pokemon is.

No, I think you are overestimating how popular pokemon is.

>By most objective measures it's the largest entertainment franchise in all of history.

I don't care? Only a small set of pokemon fans would be able to gain anything from this "test".

>Would you also object to any other pop-culture reference for the same reason?

Yes.

tailscaler2026 · 2026-04-22T02:32:47 1776825167

still #opentowork huh

beepbooptheory · 2026-04-22T03:51:03 1776829863

Where does one even use that hashtag?

minimaxir · 2026-04-22T05:04:05 1776834245

It's a LinkedIn joke.

codemog · 2026-04-22T02:08:36 1776823716

Ah yes, also known as C++ enjoyers.

Palmik · 2026-04-22T08:47:54 1776847674

I do not think this is a good prompt or useful benchmark, but nonetheless, it seems to work better for me: https://chatgpt.com/share/69e88a94-ded8-8395-b5dc-abceb2f44d...

minimaxir · 2026-04-22T15:32:50 1776871970

Huh, that is indeed better. If ChatGPT Images 2.0/gpt-2-image is more nondeterministic than usual, then that is in itself a useful data point.

Palmik · 2026-04-22T16:38:52 1776875932

Did you enable thinking for your experiment? Are you sure you were on the 2.0 rather than 1.5 version?

minimaxir · 2026-04-22T21:18:29 1776892709

That experiment image was directly through the API on high. (no Thinking parameter like the Web UI)

Melatonic · 2026-04-23T16:00:50 1776960050

But where is the one for Missing No ?!

pfortuny · 2026-04-22T08:43:04 1776847384

Just try a 23-sided plane convex polygon.
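For reference, a small sketch of what a programmatic answer to this one could look like, assuming a regular 23-gon is an acceptable reading of "23-sided plane convex polygon". The function names and default radius are made up for the example, and the convexity check assumes the polygon does not self-intersect.

```python
import math


def regular_polygon(sides: int = 23, radius: float = 100.0) -> list[tuple[float, float]]:
    """Return the vertices of a regular polygon centered on the origin."""
    return [
        (
            radius * math.cos(2 * math.pi * k / sides),
            radius * math.sin(2 * math.pi * k / sides),
        )
        for k in range(sides)
    ]


def is_convex(points: list[tuple[float, float]]) -> bool:
    """True if every consecutive turn bends the same way (simple polygons only)."""
    n = len(points)
    signs = set()
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = points[i], points[(i + 1) % n], points[(i + 2) % n]
        # z-component of the cross product of the two consecutive edges.
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if abs(cross) > 1e-12:
            signs.add(cross > 0)
    return len(signs) == 1


vertices = regular_polygon(23)
print(len(vertices), is_convex(vertices))  # prints: 23 True
```

Counting the vertices of the returned list is the objective check that an image model's output makes hard to do by eye.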

razorbeamz · 2026-04-22T08:21:44 1776846104

Neither of them drew them in an 8-bit style either. It's way too many colors.

podgietaru · 2026-04-22T15:28:58 1776871738

They made the same mistake a lot of people do, 8-bit meaning Retro style. But they're from the 16bit(?) GBA games.

dodslaser · 2026-04-22T08:41:55 1776847315

Maybe they're so advanced they learned to write to the palette registers mid-scanline.

m3kw9 · 2026-04-22T05:11:35 1776834695

Prob a very unscientific way to test an image model. This would be likely because they have the reasoning turned down and let its instant output take over

minimaxir · 2026-04-22T05:32:35 1776835955

There's no good scientific way to test a closed-source model with both nondeterministic and subjective output.

This example image was generated using the API on high, not the low reasoning version. (it is slow and takes 2 minutes lol)

crustaceansoup · 2026-04-22T05:36:04 1776836164

If the results are quantifiable/objective and repeatable it's scientific, how is it not scientific?

The reasoning amount is part of the evaluation isn't it?

TeMPOraL · 2026-04-22T06:40:54 1776840054

This is the best kind of science there is: direct, empirical test.