Test Your Model with K-Bench

What is a K-Bench evaluation?

A K-Bench evaluation is a confidential assessment of your AI product against the K-Bench benchmark.
It examines how the system behaves inside realistic mental-health conversations across four safety-critical risk domains:
- Suicide
- Self-harm
- Domestic violence
- Substance misuse
Results are not published by default. Nothing about your system appears on the public leaderboard unless you later choose to opt in.

Why run a K-Bench evaluation?

Independent safety evaluation is stronger than internal self-assessment.
K-Bench was designed by independent academic researchers and clinicians, bringing external research and clinical scrutiny into the evaluation.
Investor diligence often asks for concrete clinical-safety evidence.
Partners and customers may want reassurance before deployment.
Your team gets a clearer view of risk-handling strengths and gaps before launch.

Deliverable: K-Bench Safety Report

Scores across all four risk domains, with per-domain severity grading.
Anonymised positioning against the leaderboard distribution, so you can see where you stand without identifying other systems or being identified yourself.
Prioritised remediation guidance with actionable areas to improve.
A re-test option after material changes, to confirm whether improvements worked.

Ready to discuss a run?

Contact the K-Bench team with your endpoint shape, model or system description, expected rate limits, and any governance constraints. We will confirm whether the setup is appropriate before requesting a temporary token.

Technical Details

Give this section to your engineers once a private K-Bench run is being scoped. It explains how K-Bench connects to a raw model or deployed system during transcript generation.

Choose your integration path

The two modes use the same patient simulator and scoring pipeline. They differ only in how K-Bench obtains each visible provider reply during transcript generation.

For a raw model, K-Bench does this

Turn 1: send full history so far -> get reply
Turn 2: send full history so far -> get reply
Turn 3: send full history so far -> get reply
...

For a stateful system, K-Bench does this

Create session with system under test

Turn 1: send patient message 1 to that session -> get system reply 1
Turn 2: send patient message 2 to that same session -> get system reply 2
Turn 3: send patient message 3 to that same session -> get system reply 3
...

Close session
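
The difference between the two paths can be sketched in a few lines of Python. This is an illustration only, not K-Bench's harness: the function names are hypothetical, the patient turns are fixed in advance (in a real run the patient simulator reacts to each reply), and the system instruction is left out for brevity.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def raw_model_payloads(
    patient_turns: List[str],
    reply_fn: Callable[[List[Message]], str],
) -> List[List[Message]]:
    """Raw mode: every provider call carries the full visible history so far."""
    history: List[Message] = []
    sent: List[List[Message]] = []
    for text in patient_turns:
        history.append({"role": "user", "content": text})
        sent.append(list(history))  # snapshot of what is sent on this turn
        history.append({"role": "assistant", "content": reply_fn(history)})
    return sent


def stateful_payloads(patient_turns: List[str]) -> List[Message]:
    """Stateful mode: every call carries only the latest patient message."""
    return [{"role": "user", "content": text} for text in patient_turns]


if __name__ == "__main__":
    turns = ["first message", "second message", "third message"]
    raw = raw_model_payloads(turns, lambda history: "placeholder reply")
    print([len(payload) for payload in raw])  # raw payloads grow each turn
    print(len(stateful_payloads(turns)))      # one message per stateful call
```

In raw mode the request grows with each turn, so context window and latency limits matter more there. In stateful mode the system under test is responsible for remembering earlier turns.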

Raw model integration

Full visible history in, one assistant reply out

Use this path for raw LLMs, OpenAI-compatible endpoints, stateless wrappers, and simple safety-wrapped models. K-Bench controls the synthetic patient and sends the full visible conversation history on each therapist/provider turn.

Operationally, K-Bench can run this path through OpenRouter when the target model is available there, or through a temporary OpenAI-compatible chat completions endpoint when a provider supplies a direct URL and token.

Endpoint and token

A base URL or full chat completions URL, plus a temporary bearer token accepted in the Authorization header. Tell us the expiry time and whether IP allowlisting is required.

Model identity

The exact value we should send in the model field, the public display label you expect, and whether the endpoint represents a raw model or a safety-wrapped model.

Operational limits

Your maximum concurrent requests, requests per minute, tokens per minute, request timeout preference, context window, output cap, and any unsupported request fields.

Parameter map

Tell us which controls are accepted, fixed, or unavailable. The direct OpenAI-compatible path can omit standard temperature and max_tokens fields, and can send a configured reasoning field when agreed before the run.

Headers

Authorization: Bearer <temporary-token>, Content-Type: application/json, and Accept: application/json. We may also send an identifying title header.

Messages

Roles use the standard system, user, and assistant shape. Patient turns arrive as user; previous therapist/model replies arrive as assistant.

Optional fields

Negotiated controls may include standard temperature, standard max_tokens, and a reasoning-effort field such as reasoning.effort. Unsupported controls can be omitted or fixed server-side.

Minimum viable contract: K-Bench needs to send messages and receive one visible assistant reply. Everything else, including temperature, token caps, reasoning controls, safety policy names, and tracing fields, is an agreed parameter map for that run.
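
One way to picture the parameter map is as a filter applied before each request is sent. The sketch below is an assumption about how such a map could be applied, not K-Bench's adapter code: build_request_body is a hypothetical helper, and the status labels "accepted", "fixed", and "unavailable" simply mirror the wording used above.

```python
from typing import Any, Dict, List


def build_request_body(
    model: str,
    messages: List[Dict[str, str]],
    parameter_map: Dict[str, str],
    requested: Dict[str, Any],
) -> Dict[str, Any]:
    """Keep the required fields and add only the controls marked 'accepted'."""
    body: Dict[str, Any] = {"model": model, "messages": messages}
    for name, value in requested.items():
        # Controls that are fixed server-side, unavailable, or not listed
        # in the agreed map are omitted from the request.
        if parameter_map.get(name) == "accepted":
            body[name] = value
    return body


if __name__ == "__main__":
    agreed = {"temperature": "accepted", "max_tokens": "fixed"}
    wanted = {"temperature": 0.7, "max_tokens": 2400, "reasoning": {"effort": "high"}}
    messages = [{"role": "user", "content": "Hello"}]
    print(build_request_body("your-organization/your-model-or-system", messages, agreed, wanted))
```

With this map, only temperature is sent; max_tokens is left to the server-side default and the reasoning field is dropped because it was never agreed.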

Why the example has multiple messages: this is a mid-conversation provider request. Earlier user and assistant messages are context from previous turns. The final user message is the current patient turn, and your endpoint returns the next therapist/provider reply. On the first provider turn, the message list is usually just the system instruction and the patient's opening message.

Mid-conversation provider request: POST /chat/completions

curl -X POST "https://your-domain.example/v1/chat/completions" \
  -H "Authorization: Bearer $TEMP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "your-organization/your-model-or-system",
    "messages": [
      {
        "role": "system",
        "content": "<K-Bench provider instructions supplied at run time>"
      },
      {
        "role": "user",
        "content": "I do not really know why I agreed to chat today."
      },
      {
        "role": "assistant",
        "content": "I am glad you did. What has been feeling hardest today?"
      },
      {
        "role": "user",
        "content": "It has just been getting worse at home."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 2400,
    "reasoning": {
      "effort": "high"
    }
  }'

Return a single JSON object with the model's visible reply at the default path choices[0].message.content. If your endpoint uses another stable JSON path for visible content, tell us before the run so the adapter can extract it. Usage fields are welcome but optional. Do not put hidden reasoning or private chain-of-thought in the visible content.

Example response: HTTP 200

{
  "id": "chatcmpl_kbench_example",
  "object": "chat.completion",
  "created": 1781800000,
  "model": "your-organization/your-model-or-system",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "That sounds frightening, and I am glad you said it here. When you say things are worse at home, are you safe right now?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1210,
    "completion_tokens": 38,
    "total_tokens": 1248
  }
}
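
To check that a response exposes the visible reply where it is expected, a small path-walking helper is enough. The following is a minimal sketch, assuming the content path is expressed as a sequence of dictionary keys and list indexes; extract_visible_reply and DEFAULT_PATH are hypothetical names, and the real adapter may work differently.

```python
from typing import Any, Sequence, Union

PathStep = Union[str, int]

# Default location of the visible reply in a chat completions response.
DEFAULT_PATH: Sequence[PathStep] = ("choices", 0, "message", "content")


def extract_visible_reply(response: Any, path: Sequence[PathStep] = DEFAULT_PATH) -> str:
    """Walk a JSON-like structure and return the visible reply text."""
    node = response
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError) as exc:
            raise ValueError(f"visible content not found at path {list(path)!r}") from exc
    if not isinstance(node, str) or not node.strip():
        raise ValueError("visible content must be a non-empty string")
    return node


if __name__ == "__main__":
    example = {
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": "Are you safe right now?"}}
        ]
    }
    print(extract_visible_reply(example))
    # A flat response shape would use a different agreed path:
    print(extract_visible_reply({"content": "Are you safe right now?"}, path=("content",)))
```

Running a check like this against your own endpoint before the evaluation helps surface a missing or empty content field as a setup issue rather than a mid-run failure.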

Embedded system integration

One clean session per transcript, latest patient message in

Use this path for deployed therapeutic chatbots and product systems with their own memory, risk monitoring, safety orchestration, retrieval, tools, internal agents, or dynamic prompt steering.

Session API

Expose create-session, send-message, and close-session endpoints. K-Bench creates a fresh session for each benchmark transcript and stores the returned session ID as run metadata.

System identity

Provide the product label, version, public display name, and whether memory, risk monitoring, retrieval, tools, or multiple internal agents are active during the run.

State isolation

Every transcript must start from a clean state. Provider sessions must not reuse memory, user profiles, tool traces, or safety state across benchmark cases.

Data handling

Declare whether benchmark traffic is logged, retained, inspected, or used for training. Private benchmark content must not be used to tune, memorize, or reverse-engineer K-Bench items.

Minimum viable contract: K-Bench creates a session, sends only the latest synthetic patient message by default, receives one visible assistant reply, and closes the session after the transcript completes.

Create session

POST /kbench/sessions creates one clean remote session for one benchmark transcript.

Send message

POST /kbench/sessions/{session_id}/messages receives the latest patient message only, not the full dialogue history. The turn_index is the patient-message ordinal within that session.

Close session

POST /kbench/sessions/{session_id}/close lets your system reset or release session resources after completion or failure.

Create session request: POST /kbench/sessions

{
  "run_id": "kb_run_2026_001",
  "transcript_id": "kb_tx_000123",
  "system_label": "company/product/version",
  "metadata": {
    "benchmark": "kbench",
    "mode": "stateful_system",
    "locale": "en",
    "turn_limit": 20
  }
}

Create session response: HTTP 200

{
  "session_id": "sess_abc123",
  "status": "created"
}

Latest patient message request: POST /kbench/sessions/{session_id}/messages

{
  "turn_index": 3,
  "message_id": "kb_msg_000003",
  "role": "user",
  "content": "It has just been getting worse at home."
}

Visible system response: HTTP 200

{
  "message_id": "sys_msg_000003",
  "role": "assistant",
  "content": "That sounds frightening, and I am glad you said it here. When you say things are worse at home, are you safe right now?",
  "finish_reason": "stop",
  "status": "ok",
  "metadata": {
    "risk_flag": "high",
    "internal_route": "risk_monitor_intervened"
  }
}

The visible response must include content or an agreed configured content path. message_id, finish_reason, status, and metadata are useful diagnostics, but they are optional.

Close session request: POST /kbench/sessions/{session_id}/close

{
  "session_id": "sess_abc123",
  "transcript_id": "kb_tx_000123",
  "reason": "complete"
}

Important: stateful mode must not depend on K-Bench resending the full transcript. Your system should use the session ID and its own internal state to produce the next visible patient-facing reply.
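
On the provider side, the three session operations can be prototyped with an in-memory store before any HTTP layer is added. The sketch below is a hedged illustration, not a reference implementation: SessionStore and reply_fn are hypothetical names, the session ID format is arbitrary, and a production system would add authentication, persistence, and timeouts.

```python
import uuid
from typing import Callable, Dict, List

Message = Dict[str, object]


class SessionStore:
    """Keeps one isolated history per session ID (illustrative only)."""

    def __init__(self, reply_fn: Callable[[List[Message]], str]) -> None:
        self._reply_fn = reply_fn
        self._sessions: Dict[str, List[Message]] = {}

    def create(self) -> Dict[str, str]:
        session_id = f"sess_{uuid.uuid4().hex[:12]}"
        self._sessions[session_id] = []  # every session starts from a clean state
        return {"session_id": session_id, "status": "created"}

    def send(self, session_id: str, turn_index: int, content: str) -> Dict[str, str]:
        history = self._sessions[session_id]  # KeyError for unknown or closed sessions
        history.append({"role": "user", "content": content, "turn_index": turn_index})
        # The reply is produced from the session's own state,
        # not from a transcript resent by the caller.
        reply = self._reply_fn(history)
        history.append({"role": "assistant", "content": reply})
        return {"role": "assistant", "content": reply}

    def close(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)  # release state; closing twice is harmless


if __name__ == "__main__":
    def count_patient_turns(history: List[Message]) -> str:
        return f"patient turns seen: {sum(m['role'] == 'user' for m in history)}"

    store = SessionStore(count_patient_turns)
    first = store.create()["session_id"]
    second = store.create()["session_id"]
    print(store.send(first, 1, "hello"))
    print(store.send(first, 2, "it is getting worse"))
    print(store.send(second, 1, "hello"))  # the second session has no memory of the first
    store.close(first)
    store.close(second)
```

Keying all state by session ID is what makes the clean-state requirement easy to verify: nothing from one transcript is reachable from another session.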

How K-Bench uses the endpoint

K-Bench evaluates behavior inside multi-turn mental-health conversations. The third-party endpoint is called only for the model or embedded system being tested; the simulated patient and scoring pipeline are handled separately.

Conversation length: 20 turns by default.

Provider calls: about 10 per successful transcript.

Default concurrency: up to 50 transcripts in parallel. Stateful systems can set lower session limits.

Request timeout: 60 seconds unless agreed otherwise.

1. We start a synthetic patient conversation

The patient simulator opens the conversation. The endpoint being tested does not see scenario labels or hidden scoring details.

2. We send the provider turn request

Raw models receive the full visible dialogue history. Embedded systems receive the latest patient message in their active session by default.

3. We continue turn by turn

Turns remain sequential within each transcript, but multiple transcripts can run at the same time if your rate limits allow it.

4. We score completed transcripts

Both integration modes produce the same canonical visible transcript shape for the existing scoring pipeline.

Rate limits and failure handling

We can lower transcript concurrency to fit your service limits. Embedded systems should also provide maximum concurrent sessions, session create and close timeouts, and requests per minute. Please give us conservative limits rather than best-case burst numbers.

Tell us your limits

Maximum concurrent requests.
Requests per minute and tokens per minute.
Expected p50 and p95 latency.
Whether long contexts are slower or have a lower output cap.
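
A concurrency cap of this kind is commonly enforced with a semaphore. The sketch below shows the general technique using Python's asyncio; it is not K-Bench's scheduler, and run_transcripts, run_one, and max_concurrent are hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_transcripts(
    transcript_ids: List[str],
    run_one: Callable[[str], Awaitable[str]],
    max_concurrent: int,
) -> List[str]:
    """Run every transcript, never exceeding max_concurrent at the same time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(transcript_id: str) -> str:
        async with gate:  # wait here while the cap is already reached
            return await run_one(transcript_id)

    # gather returns results in the same order as the input IDs
    return await asyncio.gather(*(guarded(t) for t in transcript_ids))


if __name__ == "__main__":
    async def fake_transcript(transcript_id: str) -> str:
        await asyncio.sleep(0.01)  # stands in for a sequence of provider calls
        return f"{transcript_id}: done"

    ids = [f"kb_tx_{n:06d}" for n in range(5)]
    print(asyncio.run(run_transcripts(ids, fake_transcript, max_concurrent=2)))
```

Because turns inside a transcript are sequential, the number of in-flight requests against your endpoint is bounded by the number of transcripts running at once, which is why a conservative concurrency limit is the main control.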

Return clear errors

401 or 403 for auth problems.
400 or 422 for unsupported request shape.
429 for rate limiting.
5xx or 408 for transient infrastructure failures.

K-Bench backs off and retries rate-limit and transient infrastructure failures. Authentication and request-shape failures are treated as setup issues and should be fixed before the evaluation run.
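
The split between retryable and setup failures can be expressed as a small status classifier. This is a sketch of the policy described above under stated assumptions: classify_status and backoff_delay are hypothetical helpers, and the exponential schedule with a 1 second base and 30 second cap is an illustrative choice, since the actual retry timing is not specified here.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to the handling described in this section."""
    if 200 <= status <= 299:
        return "ok"
    if status in (408, 429) or 500 <= status <= 599:
        return "retry"        # rate limiting or transient infrastructure failure
    if status in (400, 401, 403, 422):
        return "setup_issue"  # auth or request-shape problem, fix before the run
    return "unclassified"     # not covered by the list above


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff in seconds; base and cap are assumed values."""
    return min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    for code in (200, 401, 422, 429, 503):
        print(code, classify_status(code))
    print([backoff_delay(n) for n in range(6)])
```

Returning the specific codes listed above matters because a 401 reported as a 500, for example, would be retried instead of being flagged as a setup problem.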

Benchmark integrity

The longevity of K-Bench depends on making the benchmark hard to game. We therefore discuss each third-party test before accepting an endpoint.

Allowed

Your normal model, safety layer, routing layer, or product system.
A fixed system prompt or policy stack that would be used in deployment.
Operational logging needed to keep the endpoint reliable, if agreed in advance.

Not allowed

Tuning against private K-Bench prompts, vignettes, transcripts, or scoring artifacts.
Changing the submitted system during the run without disclosure.
Human intervention, manual answer editing, or delayed review inside the endpoint.
Using benchmark traffic to train, memorize, or infer private test items.
