Why K-Bench is Needed
Increasingly, individuals turn to LLMs for support in situations involving serious mental health risks, including suicidal ideation, self-harm, domestic violence, and substance use. In these contexts, it is critical that models respond with appropriate validation, risk recognition, and escalation behaviors.
However, existing benchmarks typically assess isolated tasks and fail to reflect the complexity of real-world presentations, where risks are often overlapping and comorbid.
K-Bench is designed to close this gap by evaluating models against clinically grounded, high-fidelity scenarios that capture these interactions. It combines rich synthetic vignettes informed by real patient material, a rigorously defined rubric developed with a stakeholder panel, and clinician-derived ground truth ratings, providing a more realistic and safety-relevant assessment of model behavior.
Model rankings
Overall K-Bench score (0–100), ranked from highest to lowest.
| Rank | Model | Overall | What stands out |
|---|---|---|---|
Dimension detail table
Exact aggregate values for each published dimension.
How do they compare?
Global dimensions D3–D7 for the selected models.
Dimension score cards
Raw and normalized dimension means for the selected models.
Risk performance
Combined risk, D1, D2, and risk-domain performance.
Severity split
Performance as cases become more acute.
Comorbidity index
How quickly performance degrades as active risks stack up.
Approved demographic slices
Age and ethnicity remain visible with group counts.
Extended slice examples
Selected non-demographic slices from the same public export.
Run summary
Current published snapshot from the latest public leaderboard run.