Methodology

How K-Bench Is Built

Why this benchmark is different

K-Bench is designed to evaluate safety-relevant clinical conversation behaviour across broader, more realistic patient journeys than narrower single-risk benchmarks.

Dimension | K-Bench | Alternative benchmarks
Risk domains | 4: suicide, self-harm, domestic violence, substance misuse, including co-occurrence | 1: suicide only
Age scope | Not restricted in the draft benchmark design | Adults 18+ only; youth excluded
Vignette design | Factorial design: 16 domain configurations x 3 coherence levels = 48 cells; 14,400 vignettes generated; 122-variable schema | 10 hand-built profiles
Severity granularity | Per-domain Low / High / Imminent severity levels | Single profile-level risk label
Evaluation dimensions | 7 rubric dimensions: D1-D7 | 5 dimensions
Realism validation | Quantitative embedding overlap: nearest-neighbour cosine, MMD, Frechet distance, plus UMAP review | Clinician Likert ratings: presentation median 4, communication median 3
Rated dataset | 151 curated transcripts | 90 conversations
Model leaderboard | Detailed breakdown across risk, dimensions, severity, comorbidity, and subgroup slices | Top-level score only
Prompt and reasoning variation | Default vs therapeutic prompt variants, with reasoning variants where supported | Not varied
Transparency | Private benchmark internals | Open rubric and code

1. Build rich patient journey profiles

We start by generating a large library of synthetic patient profiles grounded in real patient risk profiles collected in our previous project (publication link to be added once live).

Real-world grounding
  • Each profile combines context, risk history, and day-to-day circumstances.
  • Profiles vary disclosure strategy, conversation style, cooperativeness, and tone.
  • Combining these variables yields more than 10^60 possible patient profiles; from that space, we generated 14,400 vignettes to sample from.
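The factorial layout from the comparison table (16 domain configurations x 3 coherence levels = 48 cells) can be illustrated with a short sketch. It assumes, without confirmation from the benchmark authors, that the 16 configurations are every on/off combination of the four risk domains (including the no-risk case) and that the 14,400 vignettes are spread evenly across cells. The coherence level labels are placeholders, since only their count is given.

```python
from itertools import product

# The four risk domains named in the comparison table.
DOMAINS = ("suicide", "self_harm", "domestic_violence", "substance_misuse")

# Hypothetical labels: the source states there are 3 coherence levels, not their names.
COHERENCE_LEVELS = ("coherence_1", "coherence_2", "coherence_3")

TOTAL_VIGNETTES = 14_400


def domain_configurations(domains=DOMAINS):
    """Return every on/off combination of the domains as a tuple of active domains.

    Assumption: the benchmark's 16 configurations are all 2^4 subsets.
    """
    configs = []
    for flags in product((False, True), repeat=len(domains)):
        configs.append(tuple(d for d, on in zip(domains, flags) if on))
    return configs


def design_cells():
    """Cross domain configurations with coherence levels to get the design cells."""
    return list(product(domain_configurations(), COHERENCE_LEVELS))


cells = design_cells()
# Assumption: a balanced design, i.e. the same number of vignettes in every cell.
vignettes_per_cell = TOTAL_VIGNETTES // len(cells)
print(len(domain_configurations()), len(cells), vignettes_per_cell)  # 16 48 300
```

Under these assumptions each cell would hold 300 vignettes; the real allocation may differ.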

2. Generate and calibrate conversations

We use those profiles to generate synthetic patient-provider conversations, then keep the generation setup whose output most closely resembles real human conversations.

Human-likeness calibration
  • Patient profiles are turned into full multi-turn synthetic dialogues.
  • Each generation setting is compared against a dataset of real human conversations using embedding-overlap metrics (nearest-neighbour cosine, MMD, Frechet distance) plus UMAP review.
  • We select the models and prompts that produce the most realistic interaction patterns.
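As an illustration of one of the realism metrics listed in the comparison table, the sketch below scores candidate generation setups by mean nearest-neighbour cosine similarity against real-conversation embeddings and keeps the best one. The function names, the toy 2-D embeddings, and the "pick the maximum" selection rule are assumptions made for this example; the benchmark also uses MMD, Frechet distance, and UMAP review, which are not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_nearest_neighbour_cosine(synthetic, real):
    """Average, over synthetic embeddings, of the best cosine match in the real set."""
    best = [max(cosine(s, r) for r in real) for s in synthetic]
    return sum(best) / len(best)


def pick_most_human_like(candidates, real):
    """Return the name of the setup whose embeddings sit closest to the real data."""
    return max(
        candidates,
        key=lambda name: mean_nearest_neighbour_cosine(candidates[name], real),
    )


# Toy 2-D embeddings, made up for illustration only.
real = [(1.0, 0.0), (0.8, 0.6)]
candidates = {
    "setup_a": [(1.0, 0.1), (0.7, 0.7)],   # points near the real embeddings
    "setup_b": [(0.0, 1.0), (-1.0, 0.0)],  # points far from the real embeddings
}
print(pick_most_human_like(candidates, real))  # setup_a
```

In practice a single metric is easy to game, which is presumably why several overlap measures are combined with visual review.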

3. Set clinician ground truth and tune the judge

We establish expert ground truth ratings first, then tune an LLM judge to behave like those clinicians at scale.

Clinician ground truth
  • Six qualified clinicians rate 151 generated transcripts using the shared rubric.
  • The judge model is iteratively improved to better match clinician decisions.
  • Judge tuning is validated on holdout data before broader use.
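The separation between tuning and holdout data can be sketched as follows. This is a minimal example, not the benchmark's actual procedure: the 30% holdout fraction, the fixed seed, the use of exact agreement as the metric, and the idea of one consensus clinician label per transcript are all assumptions introduced here.

```python
import random


def split_tuning_holdout(transcript_ids, holdout_fraction=0.3, seed=0):
    """Shuffle ids with a fixed seed and split them into disjoint tuning and holdout lists.

    holdout_fraction and seed are hypothetical defaults, not values from the benchmark.
    """
    ids = sorted(transcript_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]


def exact_agreement(judge, clinician, ids):
    """Fraction of the given transcripts where the judge and clinician labels match."""
    matches = sum(1 for i in ids if judge[i] == clinician[i])
    return matches / len(ids)


transcript_ids = list(range(151))  # 151 clinician-rated transcripts
tuning_ids, holdout_ids = split_tuning_holdout(transcript_ids)

# The judge is adjusted only against tuning_ids; agreement is then reported on
# holdout_ids, which the tuning loop never sees.
print(len(tuning_ids), len(holdout_ids))
```

Fixing the seed keeps the split reproducible, so later judge revisions are always validated against the same unseen transcripts.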

4. Evaluate models and publish the leaderboard

We test provider LLMs on the synthetic patients, score them against the rubric, then publish the aggregate results.

Two prompt modes
  • Each model is tested in two modes: a base/system-prompt mode (default in-the-wild behavior) and a guided therapeutic/risk-assessor mode (behavior nudged toward safer practice).
  • Models are evaluated on risk assessment, safety behavior, support quality, and boundaries.
  • Comorbidity-index reporting shows how performance changes as the number of co-occurring active risks increases.
  • All aggregate outputs are bundled and published to the public leaderboard.
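One way to read the comorbidity breakdown is as a group-by over the number of active risk domains in each scored conversation. The sketch below assumes that interpretation; the row fields (`active_risks`, `score`) and the example scores are hypothetical and do not reflect the benchmark's internal schema or results.

```python
from collections import defaultdict


def mean_score_by_comorbidity(rows):
    """Group rows by number of active risk domains and average the score in each group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[len(row["active_risks"])].append(row["score"])
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Made-up rows for illustration; field names and scores are hypothetical.
rows = [
    {"active_risks": ["suicide"], "score": 1.0},
    {"active_risks": ["self_harm"], "score": 0.5},
    {"active_risks": ["suicide", "substance_misuse"], "score": 0.5},
]
print(mean_score_by_comorbidity(rows))  # {1: 0.75, 2: 0.5}
```

A falling mean as the key increases would indicate that a model handles single-risk conversations better than conversations where several risks co-occur.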

What are we checking for?

What the rubric checks (examples)

  • Whether the provider detects potential risk and identifies risk type correctly.
  • How well the provider explores seriousness, frequency, context, and supports.
  • Whether the provider prioritises safety and nudges professional support when needed.
  • How well the provider adapts to user context, culture, and personal details.
  • Whether the provider supports autonomy with options while maintaining boundaries.

How we keep this trustworthy

  • Once a ratings row is produced, scoring rules are fixed and deterministic: the same row always yields the same score.
  • Judge tuning and validation are separated so we do not grade on the same data used for tuning.
  • Aggregate-only publication keeps internal run artifacts and raw transcripts private.

Benchmark risk categories

Suicide

This category evaluates how AI systems respond to individuals experiencing suicidal thoughts, suicide planning, suicide attempts, and related mental health difficulties. Scenarios range from individuals experiencing passive or occasional thoughts of death to those facing immediate and life-threatening suicide risk. Cases include diverse presentations involving depression, anxiety, trauma, substance use, relationship difficulties, financial stress, bereavement, chronic illness, and social isolation. The benchmark also evaluates whether models identify protective factors such as family support, help-seeking behaviour, personal coping strategies, and engagement with healthcare services, while appropriately assessing suicide risk severity and responding with empathy, safety, and clinical judgement.

Domestic Violence

This category evaluates how AI systems respond to individuals experiencing intimate partner violence and coercive control. Scenarios include physical violence, emotional abuse, psychological manipulation, isolation from friends and family, threats, intimidation, sexual coercion, and financial abuse. Cases vary in relationship duration, severity, frequency, help-seeking behaviour, and access to protective resources. The benchmark assesses whether models can recognise indicators of abuse, identify escalating risk, understand contextual and demographic factors, and provide responses that prioritise safety, validation, and appropriate support while avoiding victim-blaming or minimisation.

Substance Misuse

This category evaluates how AI systems respond to individuals experiencing problematic alcohol or drug use. Presentations include alcohol misuse, prescription medication misuse, cannabis use, stimulant use, opioid use, and polysubstance use. Cases vary in severity, frequency of use, impacts on work and relationships, physical and mental health consequences, previous treatment attempts, and recovery status. The benchmark examines whether models can identify substance-related risks, recognise co-occurring mental health difficulties, explore protective and risk factors, and provide supportive and evidence-informed guidance while maintaining an appropriate level of concern.

Self-Harm

Although closely related to suicidality, self-harm is evaluated as a distinct presentation because individuals may engage in self-injurious behaviours without suicidal intent. Scenarios include cutting, burning, hitting, scratching, and other forms of self-injury across a range of frequencies and severities. Cases explore underlying emotional distress, coping difficulties, trauma histories, interpersonal stressors, and co-occurring mental health conditions. The benchmark assesses whether models can distinguish between self-harm and suicide risk, explore motivations and triggers, recognise escalating danger, and respond with empathy, curiosity, and appropriate safeguarding.

What is open vs closed

Open (public)

  • High-level methodology and process descriptions.
  • Aggregate leaderboard outputs and interpretation guidance.
  • Examples of rubric dimensions at a descriptive level.

Closed (not public)

  • Detailed benchmark internals and exact evaluation setups.
  • Transcript-level artifacts, sensitive prompt internals, and operational scoring details.
  • Other implementation details that could be reverse-engineered to target the metric.

Why this boundary exists: As computer scientists, we strongly value openness, and keeping parts of this benchmark private does not sit easily with us. However, if benchmark internals are fully exposed, providers can optimize for the test itself rather than genuine safety behavior. Keeping key internals closed helps reduce metric gaming and protects benchmark validity over time.

How this site gets updated

Updates happen by generating a detached leaderboard bundle locally, exporting it into the static site directory, and then publishing that static output.

  • Internal operator workflows remain local-only.
  • The public website consumes only exported site data.
  • The same static output can be published to Vercel and mirrored to Hugging Face.
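The export step can be pictured as a filter that copies only aggregate fields from the local bundle into the static site directory. The sketch below is an illustration under stated assumptions: the field allow-list, the `leaderboard.json` file name, and the bundle row layout are all hypothetical, and the publishing step to Vercel or Hugging Face is not shown.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical allow-list of aggregate fields that may reach the public site.
PUBLIC_FIELDS = ("model", "prompt_mode", "overall_score")


def export_bundle(bundle_rows, site_dir):
    """Write only allow-listed aggregate fields to <site_dir>/leaderboard.json."""
    public_rows = [
        {k: row[k] for k in PUBLIC_FIELDS if k in row} for row in bundle_rows
    ]
    out_path = Path(site_dir) / "leaderboard.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(public_rows, indent=2, sort_keys=True), encoding="utf-8"
    )
    return out_path


# Made-up bundle row; the transcript field stands in for private run artifacts.
bundle = [
    {
        "model": "example-model",
        "prompt_mode": "default",
        "overall_score": 0.5,
        "transcript": "private text that must not be exported",
    }
]
with tempfile.TemporaryDirectory() as tmp:
    path = export_bundle(bundle, Path(tmp) / "site" / "data")
    exported = json.loads(path.read_text(encoding="utf-8"))
print(exported)
```

An allow-list, rather than a block-list, means any new internal field stays private by default until it is explicitly marked public.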