Methodology

How K-Bench Is Built

1. Build rich patient journey profiles

We start by generating a large library of synthetic patient profiles. These are grounded in real patient risk profiles collected in our previous project (publication link to be added once live).

Real-world grounding
  • Each profile combines context, risk history, and day-to-day circumstances.
  • Profiles vary disclosure strategy, conversation style, cooperativeness, and tone.
  • This yields around 14,000 possible profile combinations to sample from.
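As a rough illustration of how a combinatorial profile space can be enumerated and sampled, here is a minimal Python sketch. The dimension names follow the bullets above, but every value listed is a hypothetical placeholder. The count it prints is therefore illustrative only and is not the roughly 14,000 combinations of the real library.

```python
import itertools
import random

# Hypothetical dimension values; the real K-Bench dimensions and values are not public.
PROFILE_DIMENSIONS = {
    "disclosure_strategy": ["direct", "gradual", "indirect"],
    "conversation_style": ["terse", "narrative"],
    "cooperativeness": ["cooperative", "guarded", "resistant"],
    "tone": ["calm", "anxious"],
}


def enumerate_profiles(dimensions):
    """Return every combination of dimension values as a list of dicts."""
    names = list(dimensions)
    return [
        dict(zip(names, values))
        for values in itertools.product(*dimensions.values())
    ]


def sample_profiles(dimensions, k, seed=0):
    """Draw k distinct profiles reproducibly using a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(enumerate_profiles(dimensions), k)


all_profiles = enumerate_profiles(PROFILE_DIMENSIONS)
print(len(all_profiles))  # 3 * 2 * 3 * 2 = 36 with these placeholder values
```

Seeding the sampler keeps a drawn subset reproducible across runs, which matters when the same profiles need to be replayed against several models.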

2. Generate and calibrate conversations

We use those profiles to generate synthetic patient-provider conversations, then keep the generation setup whose output most closely resembles real human conversations.

Human-likeness calibration
  • Patient profiles are turned into full multi-turn synthetic dialogues.
  • Each generation setup is compared against a dataset of real human conversations to measure how human-like its output is.
  • We select the model and prompt combinations that produce the most realistic interaction patterns.
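One way to picture this comparison is to compute a simple statistic on the real conversations and keep the setup whose synthetic dialogues come closest to it. The sketch below uses mean words per turn purely as an assumed stand-in. The actual calibration metrics are not described here, and the function names, setup names, and example dialogues are all hypothetical.

```python
from statistics import mean


def mean_turn_length(dialogues):
    """Average words per turn across dialogues (each a list of turn strings)."""
    return mean(len(turn.split()) for dialogue in dialogues for turn in dialogue)


def pick_closest_setup(candidates, reference_dialogues):
    """Return the setup name whose mean turn length is closest to the reference."""
    target = mean_turn_length(reference_dialogues)
    return min(
        candidates,
        key=lambda name: abs(mean_turn_length(candidates[name]) - target),
    )


# Toy data, invented for illustration only.
reference = [["i have not been sleeping well", "maybe a few weeks"]]
candidates = {
    "setup_a": [[
        "I am writing to report persistent insomnia over an extended period of time",
        "Approximately three weeks in total duration",
    ]],
    "setup_b": [["not sleeping much lately", "couple of weeks i guess"]],
}

best_setup = pick_closest_setup(candidates, reference)
print(best_setup)  # setup_b: its turns average 4.5 words against 5 in the reference
```

A real calibration would likely combine several signals rather than a single length statistic; the selection logic stays the same.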

3. Set clinician ground truth and tune the judge

We establish expert ground truth ratings first, then tune an LLM judge to behave like those clinicians at scale.

Clinician ground truth
  • Six qualified clinicians rate 140 generated transcripts using the shared rubric.
  • The judge model is iteratively improved to better match clinician decisions.
  • The tuned judge is validated on held-out data before it is used more broadly.
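To make "matching clinician decisions" concrete, the sketch below computes Cohen's kappa between clinician labels and judge labels. Kappa is an assumption chosen for illustration. This page does not state which agreement measure is used or how ratings from the six clinicians are combined, and the labels shown are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented held-out labels: 1 = risk flagged, 0 = not flagged.
clinician_labels = [1, 1, 0, 0]
judge_labels = [1, 1, 0, 1]

kappa = cohen_kappa(clinician_labels, judge_labels)
print(kappa)  # 0.5: observed agreement 0.75, chance agreement 0.5
```

Computing the statistic only on held-out items keeps the reported agreement from being inflated by the examples used during tuning.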

4. Evaluate models and publish the leaderboard

We test provider LLMs on the synthetic patients, score them against the rubric, then publish the aggregate results.

Two prompt modes
  • Each model is tested in two modes: a base/system-prompt mode (default in-the-wild behavior) and a guided therapeutic/risk-assessor mode (behavior nudged toward safer practice).
  • Models are evaluated on risk assessment, safety behavior, support quality, and boundaries.
  • Comorbidity index reporting shows how performance changes as active risks stack up.
  • All aggregate outputs are bundled and published to the public leaderboard.
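The comorbidity view can be thought of as grouping scores by how many risks are active in a profile. The sketch below assumes the index is a simple count of active risks and that each result carries a single numeric score. Both are assumptions made for illustration, and the field names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_risk_count(results):
    """Average score for each number of active risks, ordered by risk count."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["active_risks"]].append(row["score"])
    return {count: mean(scores) for count, scores in sorted(grouped.items())}


# Invented example rows; real K-Bench outputs are not shown here.
results = [
    {"active_risks": 1, "score": 0.9},
    {"active_risks": 1, "score": 0.7},
    {"active_risks": 2, "score": 0.6},
    {"active_risks": 3, "score": 0.4},
]

summary = mean_score_by_risk_count(results)
for count, score in summary.items():
    print(f"{count} active risk(s): mean score {score:.2f}")
```

Reading the averages in order of risk count shows whether a model's performance holds up or degrades as risks accumulate.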

What are we checking for?

What the rubric checks (examples)

  • Whether the provider detects potential risk and identifies risk type correctly.
  • How well the provider explores seriousness, frequency, context, and supports.
  • Whether the provider prioritizes safety and nudges the user toward professional support when needed.
  • How well the provider adapts to user context, culture, and personal details.
  • Whether the provider supports autonomy with options while maintaining boundaries.

How we keep this trustworthy

  • Once a ratings row is produced, the scoring rules applied to it are fixed and deterministic.
  • Judge tuning and validation are separated so we do not grade on the same data used for tuning.
  • Aggregate-only publication keeps internal run artifacts and raw transcripts private.
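One common way to keep tuning and validation data apart is to assign each transcript to a split deterministically from its identifier, so that the assignment never changes between runs. The sketch below is an assumed approach, not a description of how K-Bench partitions its data. The identifier format and the 20% held-out fraction are hypothetical.

```python
import hashlib


def assign_split(transcript_id, holdout_fraction=0.2):
    """Map a transcript ID to 'holdout' or 'tuning' using a stable hash."""
    digest = hashlib.sha256(transcript_id.encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "holdout" if bucket < holdout_fraction else "tuning"


# Hypothetical transcript IDs.
transcript_ids = [f"transcript-{i:03d}" for i in range(140)]
splits = {tid: assign_split(tid) for tid in transcript_ids}
holdout_ids = [tid for tid, split in splits.items() if split == "holdout"]
print(len(holdout_ids), "of", len(transcript_ids), "transcripts held out")
```

Because the split depends only on the identifier, a transcript used for tuning cannot later drift into the validation set.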

What is open vs closed

Open (public)

  • High-level methodology and process descriptions.
  • Aggregate leaderboard outputs and interpretation guidance.
  • Examples of rubric dimensions at a descriptive level.

Closed (not public)

  • Detailed benchmark internals and exact evaluation setups.
  • Transcript-level artifacts, sensitive prompt internals, and operational scoring details.
  • Other implementation details that could be reverse-engineered to target the metric.

Why this boundary exists: As computer scientists, we strongly value openness, and keeping parts of this benchmark private does not sit easily with us. However, if benchmark internals are fully exposed, providers can optimize for the test itself rather than for genuine safety behavior. Keeping key internals closed helps reduce metric gaming and protects benchmark validity over time.

How this site gets updated

To update the site, we generate a detached leaderboard bundle locally, export it into the static site directory, and then publish that static output.

  • Internal operator workflows remain local-only.
  • The public website consumes only exported site data.
  • The same static output can be published to Vercel and mirrored to Hugging Face.
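The export step can be sketched as writing only aggregate fields into the static site's data directory. Everything specific below is a hypothetical placeholder: the bundle shape, the `leaderboard.json` file name, and the directory layout are assumptions, not the actual K-Bench tooling.

```python
import json
import tempfile
from pathlib import Path


def export_bundle(bundle, site_data_dir):
    """Write the aggregate part of a bundle to the static site data directory."""
    site_data_dir = Path(site_data_dir)
    site_data_dir.mkdir(parents=True, exist_ok=True)
    # Copy only the aggregate rows; anything else in the bundle stays local.
    public = {"models": bundle["models"]}
    out_path = site_data_dir / "leaderboard.json"
    out_path.write_text(json.dumps(public, indent=2, sort_keys=True), encoding="utf-8")
    return out_path


# Invented bundle with one aggregate row and one private field.
bundle = {
    "models": [{"model": "model-a", "mode": "base", "score": 0.72}],
    "raw_transcripts": ["kept local, never exported"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = export_bundle(bundle, Path(tmp) / "site" / "data")
    exported = json.loads(path.read_text(encoding="utf-8"))

print(sorted(exported))  # ['models']
```

Selecting the public fields explicitly, instead of deleting private ones, means a newly added internal field is not published by accident.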