
Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me

May 2026 8 min read

An evaluation of the Gemini 2.5 Flash and Claude 3.7 Sonnet models for an agent engine.

I had a simple rule when choosing an LLM for Ozigi: don't choose based on benchmark leaderboards. After I released v2, a user suggested in feedback that I switch to the Claude models because they were better at content generation than Gemini. The suggestion was tempting, but I had to choose a model based on the four constraints my production pipeline could not work around.

Most "Gemini vs Claude" comparisons evaluate general-purpose abilities like coding, reasoning, and creative writing. That is useful if you are building a general-purpose product.

I wasn't.

Ozigi is a content engine. You give it a URL, a PDF, or raw notes. It returns a structured 3-day social media campaign as a JSON payload that the front end maps directly to UI cards.

That specificity made the evaluation easier than I expected: two models, four constraints, and a clear winner on three of them.

This is the third post in the Ozigi Changelog series. If you want the backstory of why Ozigi exists, start with how I vibe-coded the internal tool that became it and the v2 changelog that introduced the modular architecture this decision was built on.

Here is the complete Architecture Decision Record.

The Setup: What the Pipeline Really Does

The main API route in Ozigi does this:

- Accepts a multipart/form-data payload containing a URL, plain text, and/or a file (PDF or image)

- Constructs a prompt with strict editorial constraints injected at the system level

- Sends everything to the LLM via the Vertex AI Node.js SDK

- Returns the response as plain text directly to the client

The front end then does this:

const parsed = JSON.parse(responseText);
setCampaign(parsed.campaign);

No middleware. No schema validation. No error recovery on the happy path. A raw parse, straight into React state.

That single JSON.parse() call is why model selection mattered.

Constraint 1: Comparison of Gemini and Claude models for JSON output stability

The requirement: The model must return a valid JSON object, every time, without wrapping it in markdown fences, without adding a conversational preamble, and without hallucinating a trailing comma that breaks JSON.parse().
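To make those three failure modes concrete, here is a minimal sketch showing how each one trips JSON.parse(). The sample strings are invented for illustration and are not real model output; the helper name parsesCleanly is hypothetical.

```typescript
// Three backticks, built at runtime so this example does not contain a literal fence.
const fence = "`".repeat(3);

// Hypothetical sample responses, invented for illustration only.
const samples: Record<string, string> = {
  clean: '{"campaign": []}',
  fenced: fence + 'json\n{"campaign": []}\n' + fence,
  preamble: 'Here is your campaign:\n{"campaign": []}',
  trailingComma: '{"campaign": [],}',
};

// Returns true when JSON.parse accepts the text without throwing.
function parsesCleanly(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Record which samples a bare JSON.parse would accept.
const results: Record<string, boolean> = {};
for (const name of Object.keys(samples)) {
  results[name] = parsesCleanly(samples[name]);
}

console.log(results);
```

Only the clean sample parses. With no middleware between the model and React state, any of the other three would surface as a runtime exception in the client.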

The target schema looks like this:

{
  "campaign": [
    { "day": 1, "x": "...", "linkedin": "...", "discord": "..." },
    { "day": 2, "x": "...", "linkedin": "...", "discord": "..." },
    { "day": 3, "x": "...", "linkedin": "...", "discord": "..." }
  ]
}

That is nine posts: three platforms across three days, with every field required.

The user interface presents each field on a separate card with edit, copy and publish actions. A missing key does not throw a visible error: it silently shows an empty card.
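The silent empty card is the kind of failure a small completeness check would catch. The sketch below is an assumption about what such a guard could look like, not part of Ozigi's code: the function name findCampaignProblems is hypothetical, and the field names come from the target schema.

```typescript
// Platform keys expected on every day, per the target schema.
const PLATFORMS = ["x", "linkedin", "discord"] as const;

// Returns a list of human-readable problems; an empty list means all nine posts are present.
function findCampaignProblems(payload: unknown): string[] {
  const campaign = (payload as { campaign?: unknown } | null)?.campaign;
  if (!Array.isArray(campaign)) {
    return ["campaign is missing or not an array"];
  }

  const problems: string[] = [];
  if (campaign.length !== 3) {
    problems.push(`expected 3 days, got ${campaign.length}`);
  }

  campaign.forEach((entry, index) => {
    const day = (entry ?? {}) as Record<string, unknown>;
    if (typeof day["day"] !== "number") {
      problems.push(`entry ${index}: "day" is missing or not a number`);
    }
    for (const platform of PLATFORMS) {
      const value = day[platform];
      // A missing key and an empty string both render as an empty card.
      if (typeof value !== "string" || value.trim() === "") {
        problems.push(`entry ${index}: "${platform}" is missing or empty`);
      }
    }
  });

  return problems;
}
```

Calling it on the parsed response before setCampaign() would turn an invisible empty card into an explicit list of missing fields.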

I ran 500 automated test generations against both models targeting this schema, measuring the percentage of responses that JSON.parse() accepted without throwing.
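The metric itself is simple to compute. As a rough sketch, assuming the raw response strings from a test run have already been collected into an array, the acceptance rate could be calculated like this. The function name parseSuccessRate and the sample strings are illustrative, not the actual test harness or its data.

```typescript
// Percentage of responses that JSON.parse accepts without throwing.
function parseSuccessRate(responses: string[]): number {
  if (responses.length === 0) {
    return 0;
  }
  let accepted = 0;
  for (const text of responses) {
    try {
      JSON.parse(text);
      accepted++;
    } catch {
      // Any parse error counts as a failed generation.
    }
  }
  return (accepted / responses.length) * 100;
}

// Made-up responses for illustration: two valid, two broken.
const sampleRun = [
  '{"campaign": []}',
  '{"campaign": [],}',
  'Sure! {"campaign": []}',
  '{"campaign": [{"day": 1}]}',
];

console.log(parseSuccessRate(sampleRun));
```

This only measures syntactic validity. A response can pass JSON.parse() and still be missing a required key, so a parse rate is an upper bound on usable responses.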

The 11.5% gap maps directly to broken UI states for real users. For a core feature, that was not acceptable to me.

Using Gemini's responseSchema closes this gap completely. According to Google's controlled generation documentation, the feature prevents the model from returning output that does not conform to the schema. It is not prompt-level guidance; it is enforced at the decoding layer. This is what the production implementation for Ozigi looks like. The schema is defined once, up front, and attached directly to the model configuration:

const distributionSchema = {
  type: "OBJECT" as const,
  properties: {
    campaign: {
      type: "ARRAY" as const,
      description: "A list of 3 daily social media posts.",
      items: {
        type: "OBJECT" as const,
        properties: {
          day: { type: "INTEGER" as const, description: "Day number (1, 2, or 3)" },
          x: { type: "STRING" as const, description: "Content for X/Twitter." },

Lin
