Google AI Overview’s Accuracy Gap: Analysis Flags Hundreds of Millions of Incorrect or Ungrounded Summaries Daily
An analysis suggests that, at Google's scale, AI Overview could be producing hundreds of millions of incorrect or ungrounded summaries every day, challenging the reliability of search.
FACTUAL ACCURACY
What Oumi’s analysis measured and why it matters
Google AI Overview, the feature that returns an AI-generated summary in response to search queries, is the subject of a fresh, high-profile audit that raises questions about how often those summaries are correct and whether they accurately cite supporting sources. The AI startup Oumi ran 4,326 Google searches drawn from the SimpleQA benchmark on behalf of The New York Times and evaluated the resulting AI Overviews for factual correctness and source grounding. The analysis found a difference in raw answer accuracy between the two Gemini models that have powered the feature: 85% with Gemini 2 and 91% with Gemini 3.
How that accuracy translates at scale
Percentages in a lab are one thing; scale changes the stakes. Google handles more than 5 trillion searches each year, and Oumi's reporting notes that if AI Overviews were produced for half of those searches, a 9% inaccuracy rate would translate to roughly 225 billion false or misleading summaries annually, or about 616.4 million per day. Those headline figures are not Oumi's direct measurements; they are extrapolations from the search-volume and production-share assumptions noted in the reporting, used to frame the study's scale.
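The arithmetic behind those headline numbers is straightforward to reproduce. The sketch below recomputes them from the stated assumptions; the inputs are the article's figures, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the scale figures cited in the reporting.
# All inputs are stated assumptions from the article, not measured values.

ANNUAL_SEARCHES = 5_000_000_000_000  # "more than 5 trillion searches each year"
OVERVIEW_SHARE = 0.5                 # assumption: half of searches yield an AI Overview
INACCURACY_RATE = 0.09               # 9% inaccuracy (Gemini 3's reported 91% accuracy)

annual_bad = ANNUAL_SEARCHES * OVERVIEW_SHARE * INACCURACY_RATE
daily_bad = annual_bad / 365

print(f"Annual: {annual_bad / 1e9:.0f} billion")  # -> 225 billion
print(f"Daily:  {daily_bad / 1e6:.1f} million")   # -> 616.4 million
```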
Grounding: when correct answers still lack proper sources
Oumi's evaluation separated two related but distinct problems: whether a summary's claims were correct, and whether the summary cited sources that actually supported those claims. More than half of the answers Oumi judged correct were nonetheless "ungrounded," meaning the overview cited sources that did not substantiate the statement being made: overall, 56% of accurate answers were classified as ungrounded, and when AI Overview was running on Gemini 2, the figure was 37% of accurate instances. The report characterizes grounding as a persistent issue and indicates that Oumi observed grounding behavior shift as the underlying Gemini models changed.
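To make the two-axis distinction concrete, here is a minimal, hypothetical sketch of scoring answers separately for accuracy and grounding. It is not Oumi's evaluation code; the boolean judgments stand in for whatever human or model adjudication the study actually used.

```python
from dataclasses import dataclass

@dataclass
class OverviewJudgment:
    """One evaluated AI Overview answer, scored on two independent axes."""
    accurate: bool  # does the claim match the benchmark's gold answer?
    grounded: bool  # do the cited sources actually support the claim?

def summarize(judgments: list[OverviewJudgment]) -> dict[str, float]:
    """Report overall accuracy and the ungrounded rate among accurate answers."""
    accurate = [j for j in judgments if j.accurate]
    ungrounded = [j for j in accurate if not j.grounded]
    return {
        "accuracy": len(accurate) / len(judgments),
        "ungrounded_among_accurate": len(ungrounded) / len(accurate),
    }

# Toy data: an answer can be factually right yet cite sources that do not back it up.
sample = [
    OverviewJudgment(accurate=True, grounded=True),
    OverviewJudgment(accurate=True, grounded=False),   # correct but ungrounded
    OverviewJudgment(accurate=False, grounded=False),
]
print(summarize(sample))  # {'accuracy': 0.666..., 'ungrounded_among_accurate': 0.5}
```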
Source mix and citation patterns in AI Overviews
The dataset behind the AI Overviews referenced thousands of individual sources: Oumi’s tally identified 5,380 distinct sources cited across the runs it analyzed. Among those, social platforms figured prominently: Facebook and Reddit were the second- and fourth-most-cited sources in the AI summaries Oumi examined. Oumi’s breakdown shows Facebook was used as a source in 5% of cases where a summary was accurate and in 7% of cases where a summary was inaccurate. Those frequencies raise questions about how the feature weighs different domains when assembling a short, synthesized answer.
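Computing those citation shares reduces to counting which domains appear in accurate versus inaccurate summaries. The sketch below is an illustrative tally, not Oumi's pipeline; the input shape and function name are assumptions.

```python
from collections import Counter

def citation_rates(runs: list[dict]) -> dict[str, Counter]:
    """runs: [{'accurate': bool, 'sources': ['facebook.com', ...]}, ...]
    Returns, per outcome, the fraction of summaries citing each source."""
    buckets = {"accurate": [], "inaccurate": []}
    for run in runs:
        buckets["accurate" if run["accurate"] else "inaccurate"].append(run)
    rates: dict[str, Counter] = {}
    for outcome, group in buckets.items():
        counts = Counter()
        for run in group:
            counts.update(set(run["sources"]))  # count each source once per summary
        rates[outcome] = Counter(
            {src: n / max(len(group), 1) for src, n in counts.items()}
        )
    return rates

runs = [
    {"accurate": True,  "sources": ["facebook.com", "nytimes.com"]},
    {"accurate": False, "sources": ["facebook.com"]},
]
print(citation_rates(runs)["inaccurate"]["facebook.com"])  # 1.0
```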
Google’s public response and its own testing data
Google acknowledged that its AI models can make mistakes and pushed back on Oumi's methodology. A company spokesperson argued that the SimpleQA benchmark used for the analysis, which was developed by OpenAI, has flaws of its own, and that the Oumi study "doesn't reflect what people are actually searching on Google." Google's own test data, cited in the same reporting, shows a different error signal: in the company's internal evaluations, Gemini 3 produced incorrect information on 28% of queries, though those evaluations also suggest that pairing the model with Google's search index improves accuracy over the model operating alone.
Human behavior and verification gaps
Independent survey data cited in the analysis underscores a behavioral factor that compounds the technical problems: people are less likely to click through to source pages when an AI-generated summary is present. A Pew Research Center survey from July 2025 found that users who saw an AI Overview clicked on a traditional search result link in only 8% of visits, versus 15% for users who did not see an AI Overview. The same survey found that when an AI Overview included a link, users clicked that link in only 1% of instances. Those figures suggest that even when summaries are incorrect or ungrounded, a substantial share of users may not follow up by checking primary sources.
Reach and deployment of AI Overviews
Google’s own figures, cited in the reporting, place AI Overviews at scale: the feature had about 2 billion monthly users as of July 2025 and was available across more than 200 jurisdictions and in 40 languages. That breadth of deployment helps explain why even modest error rates become a large-volume concern for misinformation and user trust.
Implications for search, publishers, and developer ecosystems
The Oumi analysis and the surrounding data illuminate tensions at the intersection of large language models, search experiences, and content ecosystems. For publishers and site owners, increased reliance on synthesized answers that do not reliably ground claims could reduce direct traffic: if users accept an AI Overview as authoritative and do not click through, publishers lose the engagement that fuels ad, subscription, and analytics models. For developers and product teams building on search and AI tools, the findings underscore the importance of transparency about source provenance and conservative design choices that encourage verification rather than passive consumption. For marketers and CRM platforms that depend on accurate, discoverable content, the mix of aggregated answers and reduced click-through rates complicates measurement and attribution.
How Google AI Overview works in practice and who it reaches
At the user level, AI Overview is designed to synthesize an immediate response to a query rather than present a list of links. The feature is powered by Google’s Gemini family of large language models; Oumi tested outputs produced when the feature used Gemini 2 and when it used Gemini 3. The analysis shows that while raw answer accuracy improved between those model generations in the test sample, issues around grounding and source selection persisted. Google’s deployment metrics indicate broad availability to users worldwide by mid-2025, implying the feature is not experimental for a small group but a default behavior encountered by many searchers.
What the findings mean for verification and product design
Two practical patterns follow from the data presented. First, the combination of imperfect model accuracy and low rates of link-clicking suggests product design must nudge users toward verification. Features that surface clear provenance, encourage source inspection, or make uncertainty explicit would address the behavioral gap Pew documented. Second, quality signals for source selection need scrutiny: the fact that social platforms such as Facebook and Reddit appear frequently among top-cited sources—sometimes in inaccurate summaries—raises questions about how relevance and credibility are being weighted in the summary-generation pipeline.
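One way to make that first pattern concrete is a summary payload that carries per-claim provenance and an explicit flag the UI can use to prompt verification when evidence is thin. This is a hypothetical sketch of such a structure, not any shipped Google interface; every name in it is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    source_urls: list[str]  # provenance: where this claim is said to come from
    supported: bool         # did a grounding check confirm the sources back it?

@dataclass
class SummaryPayload:
    answer: str
    claims: list[Claim] = field(default_factory=list)

    def needs_verification_prompt(self) -> bool:
        """True when any claim lacks a citation or failed its grounding check,
        so the UI can surface a 'check the sources' nudge instead of
        presenting the answer as uniformly authoritative."""
        return any(not c.source_urls or not c.supported for c in self.claims)
```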
Potential policy and editorial responses
Publishers, platforms, and regulators each have levers that this problem touches. Editorial teams can prioritize structured metadata and clearer labeling to help automated systems identify authoritative content. Platform product teams can revise ranking and citation heuristics to favor primary, verifiable sources for factual claims. Policymakers concerned with misinformation could focus on transparency requirements that make it easier for independent auditors to evaluate how generative features select and present sources. Each of these responses would alter the incentives driving both model behavior and downstream consumption.
Developer considerations and integration with broader AI ecosystems
For engineers and product managers building AI features, the Oumi analysis is a reminder to treat synthesized answers as components within a broader information architecture. Integrations with verification tools, fact-checking services, and systems that expose model confidence or provenance are practical steps consistent with responsible deployment practices. The findings are also relevant to adjacent software categories—search APIs, knowledge-management tools, developer tools, and automation platforms—that may either consume or produce summarized content; each must consider how grounding and citation quality affect downstream trust and utility.
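As one illustration of such an integration, the sketch below implements a deliberately naive grounding check: a claim counts as grounded only if a cited page's text actually contains the claim string. Production systems would use entailment models or dedicated fact-checking services; the function names, the substring rule, and the rough HTML handling here are all simplifying assumptions.

```python
import urllib.request
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text nodes from an HTML page (very rough)."""
    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)

def page_text(url: str) -> str:
    """Fetch a page and return its text content, lowercased."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks).lower()

def naive_grounding_check(claim: str, cited_urls: list[str]) -> bool:
    """Grounded only if at least one cited page mentions the claim verbatim.
    Substring matching is a crude stand-in for an entailment model."""
    needle = claim.lower()
    return any(needle in page_text(url) for url in cited_urls)
```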
Business risk and user trust
From a commercial perspective, trust is a fragile asset. The reporting shows that even as model accuracy metrics can improve between versions, other dimensions—like grounding and source selection—can undermine confidence. Businesses that surface AI-generated content to customers must weigh short-term engagement gains against longer-term reputational risk, especially when incorrect or ungrounded statements can affect decisions in high-stakes domains.
Auditing models: benchmarks, methodology, and pushback
The exchange between independent auditors and Google in this episode also highlights a recurring debate over benchmarking and methodology. Google criticized Oumi's reliance on a specific benchmark, arguing it did not reflect real-world query patterns; Oumi, by contrast, treated SimpleQA as an industry-standard tool for testing question-answering precision and citation behavior. The two perspectives point to a larger methodological question for the industry: how best to construct public, reproducible evaluations that meaningfully reflect everyday search behavior and the range of factual claims users make.
Practical takeaways for everyday users
The data summarized in the analysis suggests simple user practices remain valuable: when a search returns an AI-generated summary, users who need reliable information should click through to the original sources rather than treat the summary as definitive. The Pew survey numbers indicate many users do not do so, which increases the need for product-level signals that encourage browsing and verification.
Looking ahead, the arrival and rapid rollout of AI-summarization features in mainstream search has shifted attention from whether language models can generate plausible answers to how those answers are sourced and presented. The Oumi findings and Google’s responses make clear that incremental improvements in model accuracy do not by themselves solve issues of provenance and user verification. As Google continues to evolve the Gemini models and refine how those models interact with its search index, observers inside and outside the company are likely to press for clearer provenance, more conservative sourcing when the evidence is thin, and product affordances that steer users toward primary material rather than a single synthesized interpretation. These choices will shape whether AI Overviews ultimately become a trusted research assistant or a convenient-but-risky front door to information online.