K-Bench

LLM safety, benchmarked.

Why K-Bench is Needed

Increasingly, individuals turn to LLMs for support in situations involving serious mental health risks, including suicidal ideation, self-harm, domestic violence, and substance use. In these contexts, it is critical that models respond with appropriate validation, risk recognition, and escalation behaviors.

However, existing benchmarks typically assess isolated tasks and fail to reflect the complexity of real-world presentations, where risks are often overlapping and comorbid.

K-Bench is designed to close this gap by evaluating models against clinically grounded, high-fidelity scenarios that capture these interactions. It combines three components: rich synthetic vignettes informed by real patient material, a rigorously defined rubric developed with a stakeholder panel, and clinician-derived ground-truth ratings. Together these provide a more realistic and safety-relevant assessment of model behavior.

Model rankings

Overall K-Bench score (0–100), ranked from highest to lowest.

Columns: Rank, Model, Overall, What stands out.

Filters

Applies to all sections below through the Comorbidity index.

Dimension detail table

Exact aggregate values for each published dimension.

How do they compare?

Global dimensions D3–D7 for the selected models.

Dimension score cards

Raw and normalized dimension means for the selected models.

Risk performance

Combined risk, D1, D2, and risk-domain performance.

Severity split

Performance as cases become more acute.

Comorbidity index

How quickly performance degrades as active risks stack up.
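
One way to read "how quickly performance degrades" is as a slope: fit a straight line to mean score against the number of active risks and report its gradient. The sketch below is only an illustration under that assumption. The page does not state how the Comorbidity index is actually computed, and the function name `degradation_slope`, the input layout, and the example values are all hypothetical.

```python
from statistics import mean


def degradation_slope(scores_by_risk_count: dict[int, float]) -> float:
    """Least-squares slope of mean score against number of active risks.

    Hypothetical helper: keys are counts of active risks in a case,
    values are the mean score (0-100) for cases with that count.
    A more negative slope means faster degradation.
    """
    xs = list(scores_by_risk_count)
    ys = [scores_by_risk_count[x] for x in xs]
    if len(xs) < 2:
        raise ValueError("need at least two risk-count groups")
    x_bar, y_bar = mean(xs), mean(ys)
    # Ordinary least squares: covariance(x, y) / variance(x).
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Made-up illustrative values, not K-Bench results.
example = {1: 80.0, 2: 70.0, 3: 60.0}
slope = degradation_slope(example)
print(f"points lost per additional active risk: {-slope:.1f}")
```

A single slope hides non-linear behavior, such as a sharp drop only at three or more risks, so per-count means may still be worth inspecting alongside it.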

Approved demographic slices

Age and ethnicity remain visible with group counts.

Disclosed age group
Disclosed ethnicity

Extended slice examples

Selected non-demographic slices from the same public export.

Run summary

Current published snapshot from the latest public leaderboard run.
