DEV Community

# benchmarks

Posts

SurrealDB 3.x by the numbers

8 min read
Single-Prompt Safety Scores Are Measuring the Wrong Thing

3 min read
Four Chinese Labs Rewrote the Open-Weights Leaderboard in 18 Days

3 min read
The cheapest and fastest way to generate an image

1 min read
Beyond Scores: A Critical Review of Benchmark Reports for Evaluating Large Language Models

7 min read
What you measure depends on where you draw the boundary

9 min read
A Startup Claims to Have Broken the Transformer's Core Bottleneck

3 min read
I benchmarked 10 LLMs on slopsquatting — up to 87% installed fake packages

9 min read
DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

6 min read
GPT-5.5 Released: First Fully Retrained Base Model Since GPT-4.5, 1M Context, $5/$30 Pricing

6 min read
GPT-5.5 Is Out — What the Numbers Actually Say

4 min read
How I took LongMemEval oracle from 62% to 82.8% without touching the retriever

3 min read
What Is Agent Evaluation? How EClaw Arena Benchmarks AI Agents Across 12 Dimensions

3 min read
Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks

3 min read
Why Merged LoRA Barely Changes Inference Time

6 min read