Wow. I've suspected for a while that Bard's performance has been limited mostly ...

j-b · on Jan 26, 2024

The trick is to access the "bard-jan-24-gemini-pro" model, available in direct chat mode here: https://chat.lmsys.org/. Significantly better than the prior model.

jxy · on Jan 26, 2024

how odd! What exactly is lmsys using? Some hidden API that google give them so they can have a better ranking there?

j-b · on Jan 26, 2024

Most likely through this platform: https://console.cloud.google.com/vertex-ai

jxy · on Jan 26, 2024

Thanks. I managed to google and get two different API endpoints.

From the vertex ai:

    API_ENDPOINT="us-central1-aiplatform.googleapis.com"
    PROJECT_ID="test00"
    MODEL_ID="gemini-pro"
    LOCATION_ID="us-central1"
    
    curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${LOCATION_ID}/publishers/google/models/${MODEL_ID}:streamGenerateContent" -d '@request.json'

and from the makersuite:

    curl \
      -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=${API_KEY} \
      -H 'Content-Type: application/json' \
      -d '@request.json'

j-b · on Jan 26, 2024

Created a simple app to test Gemini here:

https://github.com/dssjon/gemini/blob/main/app.py

declaredapple · on Jan 26, 2024

> Some hidden API that google give them so they can have a better ranking there?

I don't know about that second part - but it would make sense that google (and others) may want to use lmsys's arena to benchmark their models.

After all, Human A/B tests are far better then the current automated benchmarks.

I would like more info from lmsys as to how they're accessing these though.

bazmattaz · on Jan 27, 2024

Thanks for sharing. Is this a free way to access GPT4-turbo then or are there some limitations?

modeless · on Jan 26, 2024

New information from a Google employee: this new leaderboard entry (Bard - Gemini Pro) is a different fine-tune than the previous one (Gemini Pro - Dev API), but more importantly it "has access to the Internet" which I assume means it uses Google Search when generating answers. I bet this is responsible for the boost!

Does anyone know if the GPT-4 Turbo version used on the leaderboard has access to web search? I always assumed it did not, but now it doesn't seem like an apples-to-apples comparison.

https://x.com/asadovsky/status/1750983142041911412?s=20

Edit: I used the "Direct Chat" feature on lmsys to ask Bard and GPT-4 Turbo "What is the current price of Bitcoin?". Sure enough GPT-4 Turbo said it can't browse the Internet and Bard gave a real time answer from Google Search. This means GPT-4 outperforms Bard overall even without the ability to browse the web at all. Pretty impressive.

These seem like different categories; one is a model and one is a system with a model plus tools. I think it is useful to compare them, since there is a real difference in user experience. However, they ought to be prominently marked as different categories. And the lmsys guys ought to put a ChatGPT model on the leaderboard with its own search integration enabled, for a fairer comparison. And it would be cool to have other LLM+tools entries like Perplexity, Phind, etc.

ysofunny · on Jan 26, 2024

I think their play is the same as always

it's better to let more people interact with it because this will help training the model (get more data) so it must be free to use.

cavisne · on Jan 26, 2024

Google do have an inference advantage with TPUs.

Everyone else needs to pay nvidia margins.

Training is murkier as it’s more about the total performance and scalability of the system.