The Open vs Closed AI Gap Is Closing. 6 Charts.

The two-year head start is gone

For years there was a simple rule about AI models. The good ones, GPT, Claude, Gemini, were closed. You rented them through an API, your data went to someone else’s servers, and the free open models you could download and run yourself were a noticeable step behind. Maybe two years behind.

That rule is breaking. Open-weight models, the ones anyone can download, run on their own hardware, and use for free, have closed most of the gap. On some tasks the gap is now under a point. On the hardest, it is down to roughly a dozen points and still shrinking.

This matters for anyone, but it matters especially for a school, a college, or anyone who cares where their data goes. Because “almost as good as the frontier, free, and running on your own machine” is a genuinely different proposition from “rent it and send your data away.”

Here is where things actually stand, in six charts. A note first: exact scores move every month as new models launch, so treat the numbers as a snapshot, not gospel. The trend is the point, and the trend is not in doubt.

Chart 1: The overall gap has closed dramatically

The cleanest single measure of convergence is the Chatbot Arena leaderboard, where real users blind-vote between models. It captures overall preference, not one narrow skill.

Chart 1. The overall gap, closing fast. Chatbot Arena: best closed minus best open (percentage points). A line chart over time: the gap between the best closed model and the best open model on the Chatbot Arena leaderboard fell from 8.0 points in January 2024, to roughly 4.2 points by mid-2024, to just 1.7 points by February 2025. Source: Stanford AI Index 2025 (Chatbot Arena).

In January 2024, the best closed model led the best open model by 8 percentage points on the Chatbot Arena leaderboard. By February 2025 that lead had shrunk to 1.7 points. The gap did not narrow slowly; it collapsed in about thirteen months. On broad knowledge tests the convergence is now so complete that older benchmarks like MMLU have been retired from serious leaderboards, because nearly every capable model scores above 90 and the test no longer separates anyone.
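To put the size of that collapse in one number, here is a minimal sketch of the arithmetic, using only the three Chatbot Arena figures quoted above. The variable names are illustrative, not from any published dataset.

```python
# Chatbot Arena gap figures quoted in the text
# (percentage points, best closed minus best open).
gap_points = {"2024-01": 8.0, "2024-mid": 4.2, "2025-02": 1.7}

start = gap_points["2024-01"]
end = gap_points["2025-02"]

absolute_drop = start - end            # points of lead that disappeared
relative_drop = absolute_drop / start  # share of the original gap that closed

print(f"Gap closed by {absolute_drop:.1f} points "
      f"({relative_drop:.0%} of the January 2024 gap)")
```

Roughly four fifths of the lead was gone in a little over a year, which is why "collapsed" is the right word.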

Chart 2: On hard science, a real gap remains

It would be dishonest to stop at the benchmark everyone aces. On harder tests, the closed models still lead.

Chart 2. Where things stand now, by task. A grouped bar chart comparing the best closed model (cream) against the best open model (orange) across four benchmarks, % score. Math (AIME 2025): closed 100 vs open 99.1. Science (GPQA Diamond): closed 94.2 vs open 87.6. Coding (SWE-bench): closed 88.6 vs open 76.8. Hardest exam (Humanity’s Last Exam): closed 57.9 vs open 44.9. Source: Vellum LLM and open-LLM leaderboards, 2026.

On GPQA Diamond, a test of graduate-level physics, chemistry and biology reasoning, the best closed model scores 94.2 and the best open model 87.6. That 6.6-point gap is real, and it is the honest counterweight to “open has caught up.” On the genuinely hard reasoning, it hasn’t, quite. But “a few points behind the best in the world” is a very different sentence from “two years behind.”

Chart 3: The gap, task by task

The single-number gap hides the real story, which is that it varies enormously by task. Software engineering is the benchmark most worth watching, because it is practical and hard to fake.

Chart 3. The gap, benchmark by benchmark. A horizontal bar chart showing how many percentage points open trails closed on each task. Math: 0.9 points. Science: 6.6 points. Coding: 11.8 points. Hardest exam: 13.0 points. The bars are colour-graded from green (small gap) to red (large gap). Source: derived from chart 2, Vellum leaderboards, 2026.
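The per-task gaps are simple subtraction on the Chart 2 scores. A minimal sketch that reproduces them (the dictionary layout is just one convenient way to hold the numbers):

```python
# Best closed vs best open score per benchmark, as listed for Chart 2.
scores = {
    "Math (AIME 2025)": (100.0, 99.1),
    "Science (GPQA Diamond)": (94.2, 87.6),
    "Coding (SWE-bench)": (88.6, 76.8),
    "Hardest exam (Humanity's Last Exam)": (57.9, 44.9),
}

# Gap = closed score minus open score, rounded to one decimal place.
gaps = {task: round(closed - open_, 1) for task, (closed, open_) in scores.items()}

# Print from the smallest gap to the largest.
for task, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{task:<38} open trails by {gap:>4.1f} points")
```

Sorting by gap makes the pattern plain: the easier a benchmark has become for everyone, the smaller the difference between open and closed.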

On SWE-bench Verified, which tests whether a model can resolve real software bugs end-to-end, the strongest open models now land within roughly eight to twelve points of the best closed ones, depending on the snapshot (11.8 points in the figures charted here), and one March 2026 analysis found the very top models clustered within about a single point of each other. For everyday coding work, the open-vs-closed distinction has stopped being the thing that decides quality.

Chart 4: The cost difference is enormous

Here is the chart that changes the decision for most schools and budget-conscious teams.

Chart 4. The cost difference is enormous. A horizontal bar chart on a log scale of price in US$ per 1M output tokens. Open: Gemma 3 27b $0.07, GPT-oss-20b $0.35, DeepSeek V3 $1.10. Closed: Claude Opus 4.8 $25, GPT-5.5 $30. A capable open model can cost roughly 25 to 400 times less per token, and runs free on your own hardware. Source: Vellum pricing, 2026.

A leading open model can deliver something close to frontier quality at a fraction of the price, and in some comparisons at a tiny fraction. And that is when you pay someone else to host it. Download it and run it yourself and the per-token fee disappears. You pay for the hardware and the electricity, nothing else.
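For anyone who wants to check the "how many times cheaper" claim, here is a small sketch that divides every closed price by every open price from Chart 4. The prices are the ones quoted in this article; everything else is illustrative.

```python
# US$ per 1M output tokens, as quoted for Chart 4.
open_prices = {"Gemma 3 27b": 0.07, "GPT-oss-20b": 0.35, "DeepSeek V3": 1.10}
closed_prices = {"Claude Opus 4.8": 25.0, "GPT-5.5": 30.0}

# Price ratio for every closed/open pairing.
ratios = {
    (closed, open_): closed_price / open_price
    for closed, closed_price in closed_prices.items()
    for open_, open_price in open_prices.items()
}

for (closed, open_), ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{closed} costs about {ratio:,.0f}x as much as {open_}")

smallest, largest = min(ratios.values()), max(ratios.values())
print(f"Range: about {smallest:.0f}x to {largest:.0f}x")
```

The pairings span roughly 23x to 430x, which is where the "25 to 400 times" shorthand comes from.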

Chart 5: Cheap and capable at the same time

The quieter revolution is size. Models small enough to run on a single ordinary GPU now score what only giant models could a year ago.

Chart 5. Cost vs capability. A scatter plot of science-reasoning score (GPQA %, vertical) against output cost in US$ per 1M tokens (log scale, horizontal). Closed models (cream diamonds) score 92–94% but sit far to the right (expensive). Open models (orange circles) score 74–88% but sit far to the left (cheap). Up-and-left is the best-value corner, where the open models cluster. Source: Vellum, 2026.

A current open model can activate as few as three billion parameters per token through an efficient mixture-of-experts design, run on a single consumer graphics card, and still post mid-80s science-reasoning scores. The thing that needed a data centre last year fits under a desk this year. For a school computer lab, that is the whole game.
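Whether a model "fits under a desk" mostly comes down to how much memory its weights need. A rough sketch of that arithmetic follows, assuming that "27b" in a name like Gemma 3 27b means about 27 billion parameters, and counting the weights only. Real deployments also need room for the context cache and runtime overhead, so treat these as lower bounds. For a mixture-of-experts model, it is typically the total parameter count, not the active count, that has to be stored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "27b" means roughly 27 billion parameters.
PARAMS_27B = 27e9

# Common storage precisions: 16-bit, then 8-bit and 4-bit quantised weights.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(PARAMS_27B, bits):.1f} GB")
```

The point of the exercise: lower-precision weights are what move a model of this size from server territory toward a single graphics card.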

Chart 6: The lag is shrinking each cycle

Pull back and the pattern across every benchmark is the same shape.

Chart 6. How long open trails closed. A line chart showing the release lag of open models behind the closed frontier, in months: roughly 24 months in 2023, falling to about 18 in 2024, 12 in 2025, and 6–12 months by 2026. Plotted as a dashed line to signal it is a directional estimate, not a single published figure. Source: trend synthesis from Stanford AI Index 2025 and llm-stats, 2026.

Open-weight releases still tend to lag the absolute frontier, but the lag keeps shrinking, from around two years down to somewhere between six and twelve months, and it has narrowed with each release cycle. Extrapolating is risky, but the direction has been consistent for three years.

What this actually means for a school or college

Strip away the benchmark numbers and the practical takeaway is this.

If your institution has been holding off on AI because the good models are expensive and send your students’ data to a company’s servers, that trade-off has changed. You can now run a genuinely capable model on your own hardware, for the cost of the hardware, with your data never leaving the building. For a school thinking about student privacy, or a college that wants AI in its labs without a per-seat bill that scales forever, the open option is no longer the compromise it was a year ago.
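A back-of-the-envelope way to frame "for the cost of the hardware" is a break-even count: how many output tokens you would have to buy from a closed API before the bill matches a one-off hardware purchase. The sketch below uses the $25 per 1M output tokens figure from Chart 4. The $2,000 hardware cost is a purely hypothetical placeholder, not a quote, and the calculation ignores electricity, staff time and input tokens.

```python
def break_even_output_tokens(hardware_cost_usd: float,
                             api_price_per_million_usd: float) -> float:
    """Output tokens at which cumulative API spend equals the hardware cost."""
    return hardware_cost_usd / api_price_per_million_usd * 1_000_000


# Hypothetical one-off hardware budget (an assumption for illustration only).
HARDWARE_COST_USD = 2_000.0
# Closed-model price per 1M output tokens, as quoted for Chart 4.
API_PRICE_PER_MILLION_USD = 25.0

tokens = break_even_output_tokens(HARDWARE_COST_USD, API_PRICE_PER_MILLION_USD)
print(f"Break-even after about {tokens:,.0f} output tokens")
```

Swap in your own hardware quote and expected usage; the shape of the calculation is the useful part, not these particular numbers.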

The honest caveat: for the very hardest reasoning, the frontier closed models still lead, and if you need the absolute best, you still rent it. But “good enough to be genuinely useful, free, private, and running on a machine you own” describes the open models of 2026 in a way it simply did not describe the open models of 2024.

The one thing to remember

The question used to be “can we afford the good AI?” Increasingly the question is “do we even need to rent it?”

For a growing share of real work, the answer is no. The best AI is no longer only behind a paywall and someone else’s servers. A large and fast-improving slice of it is free, private, and small enough to run yourself. The gap that defined the field is closing, and the people who notice first get the advantage.