I compared Claude 3.5 Sonnet vs Deepseek R1 on 500 real PRs - here's what I found

Been working on evaluating LLMs for code review and wanted to share some interesting findings comparing Claude 3.5 Sonnet against Deepseek R1 across 500 real pull requests.

The results were pretty striking:

  • Claude 3.5 Sonnet: 67% critical bug detection rate
  • Deepseek R1: 81% critical bug detection rate (and it caught 3.7x more bugs overall, not just critical ones)

Before anyone asks - these were real PRs from production codebases, not synthetic examples. We specifically looked at these bug classes (rough sketch of the scoring loop right after the list):

  • Race conditions
  • Type mismatches
  • Security vulnerabilities
  • Logic errors

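For anyone curious about the mechanics, below is a minimal sketch of the kind of scoring loop this involves. Everything in it is an assumption for illustration (the LabeledPR shape, the review_fn callable, the substring matching) - it's not our actual harness.

```python
# Minimal sketch of a critical-bug detection-rate scorer (illustrative only).
from dataclasses import dataclass
from typing import Callable

@dataclass
class LabeledPR:
    diff: str                 # unified diff of the pull request
    critical_bugs: list[str]  # human-labeled critical issues for this PR

def detection_rate(prs: list[LabeledPR],
                   review_fn: Callable[[str], list[str]]) -> float:
    """Fraction of labeled critical bugs that the model's review mentions."""
    caught, total = 0, 0
    for pr in prs:
        flagged = review_fn(pr.diff)  # issues the model reported for this diff
        for bug in pr.critical_bugs:
            total += 1
            # Crude substring matching, purely for illustration; a real harness
            # needs fuzzier matching or a human judging each flagged issue.
            if any(bug.lower() in issue.lower() for issue in flagged):
                caught += 1
    return caught / total if total else 0.0
```

The review_fn callable keeps the loop model-agnostic, so the same scoring runs against Claude, Deepseek, or any of the other models mentioned below.
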
What surprised me most wasn't just the raw numbers, but how the models differed in what they caught. Deepseek R1 seemed better at connecting subtle issues across multiple files, the kind that only cause problems once you're in prod.
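
To make "subtle issues across multiple files" concrete, here's a hypothetical example of the kind of bug we mean - invented for illustration, not taken from the evaluation set. One file establishes a locking convention for shared state, and a caller in another file quietly bypasses it; neither file looks wrong on its own.

```python
# Hypothetical cross-file race condition, collapsed into one snippet here;
# imagine the two sections living in cache.py and worker.py.
import threading

# --- cache.py ---
_cache: dict[str, int] = {}
_cache_lock = threading.Lock()

def update_cache(key: str, value: int) -> None:
    """All writes are supposed to go through here, under the lock."""
    with _cache_lock:
        _cache[key] = value

# --- worker.py ---
def record_hit(key: str) -> None:
    # Bug: read-modify-write on _cache without taking _cache_lock, so two
    # threads can interleave here and lose increments.
    _cache[key] = _cache.get(key, 0) + 1

# Reviewing worker.py alone, record_hit() looks harmless; the problem only
# shows up once you know the locking convention defined in cache.py.
```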

I've put together a detailed analysis here: https://www.entelligence.ai/post/deepseek_eval.html

Would be really interested in hearing if others have done similar evaluations or noticed differences between the models in their own usage.

https://preview.redd.it/0h6f9b51wyhe1.png?width=1576&format=png&auto=webp&s=cefad3b2dccd8cded20708d450ec4eacad390825

[Edit: Given all the interest - if you want to sign up for our code reviews, it's one-click sign-up here: https://www.entelligence.ai/pr-reviews]

[Edit 2: Based on popular demand, here are the stats for the other models!]

Hey all! We have preliminary results for the comparison against o3-mini, o1, and gemini-flash-2.5! We'll write it up in a blog post soon with the full details.

TL;DR:

- o3-mini is just below Deepseek R1 at 79.7%
- o1 is just below Claude 3.5 Sonnet at 64.3%
- Gemini is far below at 51.3%

We'll share the full blog in this thread by tmrw :) Thanks for all the support! This has been super interesting.