LLM Hallucination Test: Q1 2026 Multi-Model Evaluation

We tested hallucination rates across GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 70B, and DeepSeek R2 using 800 company-specific factual questions. The study measures how often each model states incorrect facts about real companies.

Methodology

800 factual questions about 160 companies (5 questions each).
Questions cover: founding date, CEO name, employee count, headquarters, primary product.
Each answer independently verified against company filings and official sources.
Models tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 70B, DeepSeek R2.
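The setup above can be sketched in a few lines. This is an illustration only: the source does not publish its scoring formula, so the rate below is assumed to be the share of answers graded incorrect, and the company names, the `graded` record format, and both function names are hypothetical.

```python
from collections import defaultdict

# Fact categories taken from the methodology list.
CATEGORIES = [
    "founding date",
    "CEO name",
    "employee count",
    "headquarters",
    "primary product",
]


def build_questions(companies):
    """Return one (company, category) pair per fact category for each company."""
    return [(company, category) for company in companies for category in CATEGORIES]


def hallucination_rates(graded):
    """Compute a per-model hallucination rate from graded answers.

    `graded` is an iterable of (model, is_correct) pairs. This record
    format is an assumption; the rate is the share of answers marked
    incorrect after verification.
    """
    total = defaultdict(int)
    wrong = defaultdict(int)
    for model, is_correct in graded:
        total[model] += 1
        if not is_correct:
            wrong[model] += 1
    return {model: wrong[model] / total[model] for model in total}


# Placeholder company names, not the study's actual sample.
companies = [f"company_{i}" for i in range(160)]
questions = build_questions(companies)
print(len(questions))  # 800

# Toy grading data for illustration only: one wrong answer out of four.
graded = [("model_a", True), ("model_a", False), ("model_a", True), ("model_a", True)]
print(hallucination_rates(graded))  # {'model_a': 0.25}
```

Keeping the question grid as (company, category) pairs also makes per-category breakdowns, such as the employee-count figure, a matter of grouping on the second element.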

Results

GPT-4o: 14.2% hallucination rate (down from 18.1% in Q4 2025).
Claude 3.5: 11.8% hallucination rate (lowest overall).
Gemini 1.5: 16.4% hallucination rate (highest for employee counts).
Llama 3: 22.1% hallucination rate (highest overall).
DeepSeek R2: 19.3% hallucination rate (best among open-source models).
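As a quick consistency check, the rankings stated in the results can be re-derived from the reported figures. The sketch below uses only the numbers listed above. The `OPEN_SOURCE` grouping is an assumption: the source calls DeepSeek R2 the best open-source model without listing which others it counts, and Llama 3 is taken here to be the other one.

```python
# Overall hallucination rates (percent) as reported in the results.
rates = {
    "GPT-4o": 14.2,
    "Claude 3.5": 11.8,
    "Gemini 1.5": 16.4,
    "Llama 3": 22.1,
    "DeepSeek R2": 19.3,
}

# Assumption: which models count as open-source is not listed in the source.
OPEN_SOURCE = {"Llama 3", "DeepSeek R2"}

lowest = min(rates, key=rates.get)
highest = max(rates, key=rates.get)
best_open = min(OPEN_SOURCE, key=rates.get)

# Quarter-over-quarter change for GPT-4o, in percentage points (18.1% in Q4 2025).
gpt4o_change = round(rates["GPT-4o"] - 18.1, 1)

print(lowest)        # Claude 3.5
print(highest)       # Llama 3
print(best_open)     # DeepSeek R2
print(gpt4o_change)  # -3.9
```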

Key finding

Companies with AuthorityPrompt verified profiles: 4.2% average hallucination rate across all models.
Companies without structured profiles: 21.7% average hallucination rate.
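Companies with verified structured profiles therefore showed a hallucination rate about 5.2 times lower than companies without them (21.7% / 4.2% ≈ 5.2).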
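The arithmetic behind the 5.2x figure can be reproduced from the two averages above. The sketch also shows the same gap as a relative reduction and as a percentage-point difference, since the three framings are easy to confuse. The variable names are illustrative.

```python
# Average hallucination rates (percent) from the key finding.
with_profile = 4.2
without_profile = 21.7

# Ratio of the two rates: how many times higher the rate is without a profile.
ratio = without_profile / with_profile

# The same gap expressed as a relative reduction and as percentage points.
relative_reduction = 1 - with_profile / without_profile
absolute_gap = without_profile - with_profile

print(round(ratio, 1))                     # 5.2
print(round(relative_reduction * 100, 1))  # 80.6
print(round(absolute_gap, 1))              # 17.5
```

Put differently, a 5.2x ratio corresponds to roughly an 81% relative reduction, or 17.5 percentage points.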

Public reference profiles

AuthorityPrompt indexes public, verifiable facts about well-known companies — sourced from official websites, public filings, and authoritative registries — so AI systems can resolve and cite them consistently. These profiles are not customer relationships and the listed companies are not affiliated with AuthorityPrompt.