LLM Hallucination Test: Q1 2026 Multi-Model Evaluation
We tested hallucination rates across GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 70B, and DeepSeek R2 using 800 company-specific factual questions. The study measures how often each model states incorrect facts about real companies.
Methodology
- 800 factual questions about 160 companies (5 questions each).
- Questions cover: founding date, CEO name, employee count, headquarters, primary product.
- Each answer independently verified against company filings and official sources.
- Models tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 70B, DeepSeek R2.
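The scoring step can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the record layout, the `normalize` matching rule, and the sample rows are all assumptions made for the example, since the study does not publish its schema or comparison rules.

```python
from collections import defaultdict

# Hypothetical record layout; field names are assumptions for this sketch.
QUESTIONS_PER_COMPANY = 5
COMPANIES = 160
TOTAL_QUESTIONS = QUESTIONS_PER_COMPANY * COMPANIES  # 800, as in the methodology

def normalize(value):
    """Assumed matching rule: case- and whitespace-insensitive string equality."""
    return " ".join(str(value).lower().split())

def hallucination_rates(records):
    """Return {model: share of answers that disagree with the verified value}."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["model"]] += 1
        if normalize(rec["answer"]) != normalize(rec["verified"]):
            wrong[rec["model"]] += 1
    return {model: wrong[model] / total[model] for model in total}

# Made-up sample rows, only to show the shape of the input.
sample = [
    {"model": "model-a", "company": "ExampleCo", "field": "ceo_name",
     "answer": "Jane Doe", "verified": "Jane Doe"},
    {"model": "model-a", "company": "ExampleCo", "field": "headquarters",
     "answer": "Paris", "verified": "Berlin"},
    {"model": "model-b", "company": "ExampleCo", "field": "ceo_name",
     "answer": "jane  doe", "verified": "Jane Doe"},
]
rates = hallucination_rates(sample)
print(rates)
```

A real comparison would likely need field-specific rules (for example, a tolerance band for employee counts), which this sketch deliberately leaves out.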
Results
- GPT-4o: 14.2% hallucination rate (down from 18.1% in Q4 2025).
- Claude 3.5: 11.8% hallucination rate (lowest overall).
- Gemini 1.5: 16.4% hallucination rate (highest for employee counts).
- Llama 3: 22.1% hallucination rate (highest overall).
- DeepSeek R2: 19.3% hallucination rate (best among open-source models).
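As a quick way to read the figures above, the following sketch ranks the reported rates and works out the quarter-over-quarter change for GPT-4o. It uses only the numbers listed in the results; the variable names are illustrative.

```python
# Rates as reported in the results list (percent of answers judged incorrect).
q1_2026 = {
    "GPT-4o": 14.2,
    "Claude 3.5": 11.8,
    "Gemini 1.5": 16.4,
    "Llama 3": 22.1,
    "DeepSeek R2": 19.3,
}

# Rank models from lowest to highest hallucination rate.
ranked = sorted(q1_2026.items(), key=lambda item: item[1])
for name, rate in ranked:
    print(f"{name:12s} {rate:5.1f}%")

# GPT-4o change versus the Q4 2025 figure quoted in the results.
gpt4o_q4_2025 = 18.1
point_change = round(gpt4o_q4_2025 - q1_2026["GPT-4o"], 1)
relative_change = point_change / gpt4o_q4_2025
print(f"GPT-4o: {point_change} points lower, {relative_change:.1%} relative drop")
```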
Key finding
- Companies with AuthorityPrompt verified profiles: 4.2% average hallucination rate across all models.
- Companies without structured profiles: 21.7% average hallucination rate.
- Companies with structured profiles showed a hallucination rate roughly 5.2x lower than companies without them (4.2% vs. 21.7%).
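The 5.2x figure follows from the two averages above. A short check of the arithmetic, which also expresses the same gap as a relative reduction:

```python
# Average hallucination rates from the key finding (percent).
with_profile = 4.2
without_profile = 21.7

# Ratio of the two rates: how many times higher the rate is without a profile.
ratio = without_profile / with_profile

# The same gap expressed as a relative reduction.
relative_reduction = 1 - with_profile / without_profile

print(f"{ratio:.1f}x")              # 5.2x
print(f"{relative_reduction:.0%}")  # 81%
```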
Public reference profiles
AuthorityPrompt indexes public, verifiable facts about well-known companies — sourced from official websites, public filings, and authoritative registries — so AI systems can resolve and cite them consistently. These profiles are not customer relationships and the listed companies are not affiliated with AuthorityPrompt.