Why o3-mini-high jumped from 0.8% to 4.8% on Vectara’s benchmark, and what it means for document-length evaluations
Which specific questions about o3-mini-high, Vectara benchmark versions, and document length will I answer, and why do they matter?

Here is a quick list of the questions I’ll answer and why each matters to engineers, evaluation teams, and procurement: