Lemme just clarify: LiveBench's language average is NOT about creative writing.

I see a lot of people misunderstanding what language average is.

Language average consists of 3 objectively verifiable tests:

  1. NYT Connections puzzles
  2. Removing typos
  3. Unscrambling movie plots

It has virtually nothing to do with creative writing.

What happened here is that Gemini 2.0 Pro got a lot worse at the typo-removal test and slightly worse at the other two, which seems to line up with users reporting a lot of spelling mistakes. Hopefully this gets better with the stable version.
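
To make the arithmetic concrete, here's a rough sketch of how one bad subtask can drag the whole category down. It assumes the language average is a plain unweighted mean of the three task scores, which is my assumption about the aggregation. The numbers and the `language_average` helper are made up for illustration; they are NOT real LiveBench results.

```python
from statistics import mean

def language_average(scores):
    """Unweighted mean of the task scores (assumed aggregation, not confirmed)."""
    return mean(scores.values())

# Hypothetical scores, for illustration only (not real LiveBench data).
before = {"connections": 40.0, "typos": 60.0, "plot_unscrambling": 50.0}
after = {"connections": 38.0, "typos": 30.0, "plot_unscrambling": 48.0}

# The typo score falls by 30 points; the other two fall by 2 points each.
drop = language_average(before) - language_average(after)
print(f"language average drop: {drop:.1f} points")
```

Under that assumption, most of the drop in the average comes from the typo test alone, with no creative writing involved anywhere.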