Wow. I've suspected for a while that Bard's performance has been limited mostly by cost. Google isn't charging for Bard and they didn't want to run a gigantic model for everyone for free forever. Maybe they made a breakthrough in inference cost for their better models? Or maybe they got tired of everyone clowning on them for being behind and decided to eat the cost for a while.
I still think they ought to launch a subscription so we can see their absolute best model running in public.
The trick is to access the "bard-jan-24-gemini-pro" model, available in direct chat mode here: https://chat.lmsys.org/. It's significantly better than the prior model.
New information from a Google employee: this new leaderboard entry (Bard - Gemini Pro) is a different fine-tune than the previous one (Gemini Pro - Dev API), but more importantly it "has access to the Internet" which I assume means it uses Google Search when generating answers. I bet this is responsible for the boost!
Does anyone know if the GPT-4 Turbo version used on the leaderboard has access to web search? I always assumed it did not, but now it doesn't seem like an apples-to-apples comparison.
Edit: I used the "Direct Chat" feature on lmsys to ask Bard and GPT-4 Turbo "What is the current price of Bitcoin?". Sure enough, GPT-4 Turbo said it can't browse the Internet, and Bard gave a real-time answer from Google Search. This means GPT-4 Turbo outperforms Bard overall even without the ability to browse the web at all. Pretty impressive.
These seem like different categories; one is a model and one is a system with a model plus tools. I think it is useful to compare them, since there is a real difference in user experience. However, they ought to be prominently marked as different categories. And the lmsys guys ought to put a ChatGPT model on the leaderboard with its own search integration enabled, for a fairer comparison. And it would be cool to have other LLM+tools entries like Perplexity, Phind, etc.