Model evaluation access

Test Your Model

K-Bench can evaluate third-party models and deployed systems when there is a clear benchmark-validity agreement and a temporary API endpoint we can call during the run.

The easiest integration is an OpenAI-compatible chat completions endpoint. If your API uses a different path, parameter names, or fixed server-side settings, we can use a small provider-specific adapter as long as the same core contract is preserved: message history in, visible assistant reply out.

Ready to discuss a run?

Contact the K-Bench team with your endpoint shape, model or system description, expected rate limits, and any governance constraints. We will confirm whether the setup is appropriate before requesting a temporary token.

What you provide

The quickest path is to expose one HTTPS endpoint that behaves like a chat completions API. If your submission is a full product system rather than a raw model, we evaluate the behavior of that fixed system as submitted.

Endpoint and token

A base URL or full chat completions URL, plus a temporary bearer token accepted in the Authorization header. Tell us the expiry time and whether IP allowlisting is required.

Model or system identity

The exact value we should send in the model field, the public display label you expect, and whether the endpoint represents a raw model, a safety-wrapped model, or a larger orchestration layer.

Operational limits

Your maximum concurrent requests, requests per minute, tokens per minute, request timeout preference, context window, output cap, and any unsupported request fields.

Parameter map

Tell us which controls are accepted, fixed, renamed, or unavailable. For example, if temperature is fixed by your system, or appears under a company-specific field name, the adapter can omit it or map it explicitly.

Data handling

Whether prompts and responses are logged, retained, inspected, or used for training. Private benchmark content must not be used to tune, memorize, or reverse-engineer the benchmark.

Request format

K-Bench's internal provider call is non-streaming and text-only. For each therapist/provider turn, the patient simulator produces the latest patient message, then K-Bench sends that message plus the prior dialogue context to the model being tested. The preferred network format is a JSON POST to /chat/completions, but a company adapter can translate to another endpoint shape when needed.

Headers

Authorization: Bearer <temporary-token>, Content-Type: application/json, and Accept: application/json. We may also send an identifying title header.

Messages

Roles use the standard system, user, and assistant shape. Patient turns arrive as user; previous therapist/model replies arrive as assistant.

Optional fields

Negotiated controls may include temperature, max_tokens, and reasoning.effort. They can be omitted, fixed server-side, or mapped to vendor-specific names.

Minimum viable contract: K-Bench needs to send messages and receive one visible assistant reply. Everything else, including temperature, token caps, reasoning controls, safety policy names, and tracing fields, is governed by the parameter map agreed for that run.

Why the example has multiple messages: this is a mid-conversation provider request. Earlier user and assistant messages are context from previous turns. The final user message is the current patient turn, and your endpoint returns the next therapist/provider reply. On the first provider turn, the message list is usually just the system instruction and the patient's opening message.

Mid-conversation provider request: POST /chat/completions
curl -X POST "https://your-domain.example/v1/chat/completions" \
  -H "Authorization: Bearer $TEMP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "your-organization/your-model-or-system",
    "messages": [
      {
        "role": "system",
        "content": "<K-Bench provider instructions supplied at run time>"
      },
      {
        "role": "user",
        "content": "I do not really know why I agreed to chat today."
      },
      {
        "role": "assistant",
        "content": "I am glad you did. What has been feeling hardest today?"
      },
      {
        "role": "user",
        "content": "It has just been getting worse at home."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 2400,
    "reasoning": {
      "effort": "high"
    }
  }'
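
As a rough illustration of how a request body like the one above could be assembled from a running transcript, the Python sketch below maps patient turns to user and therapist turns to assistant. The function name build_request, the (speaker, text) transcript shape, and the speaker labels are assumptions made for this example; they are not K-Bench internals.

```python
import json

# Hypothetical speaker labels used only in this sketch.
ROLE_FOR_SPEAKER = {"patient": "user", "therapist": "assistant"}


def build_request(model, system_prompt, transcript, **controls):
    """Build a chat completions request body from (speaker, text) pairs.

    `controls` holds whatever optional fields were negotiated for the run,
    such as temperature or max_tokens; none of them are required.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for speaker, text in transcript:
        messages.append({"role": ROLE_FOR_SPEAKER[speaker], "content": text})
    # The provider is always asked to answer the latest patient message.
    if messages[-1]["role"] != "user":
        raise ValueError("the final message must be the current patient turn")
    return {"model": model, "messages": messages, **controls}


body = build_request(
    "your-organization/your-model-or-system",
    "<K-Bench provider instructions supplied at run time>",
    [
        ("patient", "I do not really know why I agreed to chat today."),
        ("therapist", "I am glad you did. What has been feeling hardest today?"),
        ("patient", "It has just been getting worse at home."),
    ],
    temperature=0.7,
    max_tokens=2400,
)
print(json.dumps(body, indent=2))
```

On the first provider turn the transcript would hold only the patient's opening message, so the body would carry just the system and user entries.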

Response format

Return a single JSON object with the model's visible reply at choices[0].message.content. Usage fields are welcome but optional. Do not put hidden reasoning or private chain-of-thought in the visible content.

Example response: HTTP 200
{
  "id": "chatcmpl_kbench_example",
  "object": "chat.completion",
  "created": 1781800000,
  "model": "your-organization/your-model-or-system",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "That sounds frightening, and I am glad you said it here. When you say things are worse at home, are you safe right now?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1210,
    "completion_tokens": 38,
    "total_tokens": 1248
  }
}

Important: an empty completion, malformed JSON, or finish_reason: "length" can cause retries or a failed transcript. If your system refuses a request, return a concise refusal as the visible assistant reply rather than an empty message. If your API returns a different response shape, the provider adapter must normalize it to this visible-reply contract before scoring.
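
The checks in the note above can be sketched as a small validator. This is a minimal example, not the K-Bench implementation: the names ReplyError and extract_visible_reply are hypothetical, and whether a given failure leads to a retry or a failed transcript is left to the caller.

```python
import json


class ReplyError(Exception):
    """Raised when a response does not satisfy the visible-reply contract."""


def extract_visible_reply(raw_body):
    """Return choices[0].message.content from a chat completions JSON string."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ReplyError("malformed JSON") from exc
    try:
        choice = data["choices"][0]
        content = choice["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ReplyError("missing choices[0].message.content") from exc
    if not isinstance(content, str) or not content.strip():
        raise ReplyError("empty completion")
    if choice.get("finish_reason") == "length":
        raise ReplyError("reply truncated (finish_reason: length)")
    return content


example = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Are you safe right now?"},
        "finish_reason": "stop",
    }]
})
print(extract_visible_reply(example))
```

A refusal passes this check like any other reply, because it is a non-empty assistant message.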

Provider adapter contract

If a company endpoint is not exactly OpenAI-compatible, we can use a provider-specific adapter. The adapter is a mechanical translation layer; it should not rewrite benchmark prompts, alter patient messages, or optimize responses for K-Bench.

Translate inputs

  • Map the agreed URL, auth headers, and model or system identifier.
  • Preserve the message order and role semantics.
  • Keep each request stateless unless explicitly agreed otherwise.

Handle parameters

  • Omit unsupported controls such as temperature when needed.
  • Map vendor-specific fields such as namespaced generation settings.
  • Record fixed server-side settings so the run remains interpretable.
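
One way to picture the parameter handling above is a table-driven translation step. The sketch below is illustrative only: apply_parameter_map, the three action names, and the acme.generation.max_output_tokens field are invented for the example, and a real adapter would follow whatever map is agreed for the run.

```python
def apply_parameter_map(controls, parameter_map):
    """Translate negotiated controls into vendor request fields.

    Each map entry is (action, target), where action is one of:
      "pass"    send the control under its original (dotted) name
      "rename"  send it under the dotted vendor path given in target
      "omit"    drop it, e.g. unsupported or fixed server-side
    A control missing from the map raises KeyError, so nothing is
    dropped silently.
    """
    fields, omitted = {}, []
    for name, value in controls.items():
        action, target = parameter_map[name]
        if action == "omit":
            omitted.append(name)  # recorded so the run stays interpretable
            continue
        if action not in ("pass", "rename"):
            raise ValueError(f"unknown action {action!r} for {name!r}")
        path = (target if action == "rename" else name).split(".")
        node = fields
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return fields, omitted


vendor_fields, omitted = apply_parameter_map(
    {"temperature": 0.7, "max_tokens": 2400, "reasoning.effort": "high"},
    {
        "temperature": ("omit", None),  # fixed server-side in this example
        "max_tokens": ("rename", "acme.generation.max_output_tokens"),
        "reasoning.effort": ("pass", None),
    },
)
print(vendor_fields)
print(omitted)
```

Returning the omitted names alongside the translated fields keeps a record of which controls were not sent, which is useful when fixed server-side settings need to be documented.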

Normalize outputs

  • Return the visible assistant reply as plain text.
  • Preserve refusals or safety interventions as the model's actual answer.
  • Do not expose hidden reasoning as user-visible content.

Keep the run fixed

  • Use the same model, policy stack, prompts, and generation settings throughout.
  • Do not deploy mid-run changes without disclosure.
  • Log enough metadata to diagnose failures without leaking benchmark internals.

How K-Bench uses the endpoint

K-Bench evaluates behavior inside multi-turn mental-health conversations. The third-party endpoint is called only for the model or system being tested; the simulated patient and scoring pipeline are handled separately.

  • Conversation length: 20 turns by default
  • Provider calls: about 10 per successful transcript
  • Default concurrency: up to 50 transcripts in parallel
  • Request timeout: 60 seconds unless agreed otherwise

1. We start a synthetic patient conversation

The patient simulator opens the conversation. The endpoint being tested does not see scenario labels or hidden scoring details.

2. We send the provider turn request

Your endpoint receives the provider system prompt plus the full prior dialogue as message history. It should respond as the assistant.

3. We continue turn by turn

Turns remain sequential within each transcript, but multiple transcripts can run at the same time if your rate limits allow it.

4. We score completed transcripts

Results are judged through the benchmark pipeline and published only according to the prior agreement.
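
The turn-by-turn flow described above can be sketched with stub functions standing in for the patient simulator and the tested endpoint. The sketch assumes that a turn means one message from either speaker, so a 20-turn conversation yields 10 provider calls; run_transcript and both stubs are hypothetical names, and scoring is left out.

```python
def run_transcript(patient, provider, system_prompt, max_turns=20):
    """Alternate patient and provider turns; return (messages, provider_calls)."""
    messages = [{"role": "system", "content": system_prompt}]
    provider_calls = 0
    for turn in range(max_turns):
        if turn % 2 == 0:
            # The patient simulator speaks; the tested endpoint is not called.
            messages.append({"role": "user", "content": patient(messages)})
        else:
            # Stateless call: a copy of the full history is sent every time.
            reply = provider(list(messages))
            provider_calls += 1
            messages.append({"role": "assistant", "content": reply})
    return messages, provider_calls


def stub_patient(history):
    return f"patient message after {len(history)} prior messages"


def stub_provider(history):
    return f"provider reply to: {history[-1]['content']}"


messages, calls = run_transcript(stub_patient, stub_provider, "<instructions>")
print(calls)          # 10
print(len(messages))  # 21: one system message plus 20 turns
```

Each call to provider depends on the reply before it, which is why turns stay sequential inside a transcript while separate transcripts can run in parallel.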

Rate limits and failure handling

We can lower transcript concurrency to fit your service limits. Please give us conservative limits rather than best-case burst numbers, especially if your endpoint fronts a larger product system.

Tell us your limits

  • Maximum concurrent requests.
  • Requests per minute and tokens per minute.
  • Expected p50 and p95 latency.
  • Whether long contexts are slower or have a lower output cap.
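
To show how limits like these could translate into a transcript concurrency, here is a back-of-the-envelope sketch. It assumes each running transcript has at most one request in flight and sends roughly 60 / latency requests per minute; transcript_concurrency and its parameters are hypothetical, and token-per-minute limits are ignored for brevity.

```python
import math


def transcript_concurrency(max_concurrent, requests_per_minute,
                           p50_latency_s, default=50):
    """Pick a transcript concurrency that stays within the stated limits.

    Ignoring the time spent in the patient simulator overestimates the
    request rate, so the result errs on the conservative side.
    """
    if p50_latency_s <= 0:
        raise ValueError("latency must be positive")
    per_transcript_rpm = 60.0 / p50_latency_s
    rpm_bound = math.floor(requests_per_minute / per_transcript_rpm)
    return max(1, min(default, max_concurrent, rpm_bound))


# 4 s latency means about 15 requests per minute per transcript,
# so a 120 requests-per-minute limit supports 8 transcripts.
print(transcript_concurrency(max_concurrent=20,
                             requests_per_minute=120,
                             p50_latency_s=4.0))  # 8
```

In this example the requests-per-minute limit, not the concurrent-request limit, is the binding constraint, which is one reason conservative numbers for both are helpful.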

Return clear errors

  • 401 or 403 for auth problems.
  • 400 or 422 for unsupported request shape.
  • 429 for rate limiting.
  • 5xx or 408 for transient infrastructure failures.

K-Bench backs off and retries rate-limit and transient infrastructure failures. Authentication and request-shape failures are treated as setup issues and should be fixed before the evaluation run.
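
The split between retryable and setup failures can be sketched as follows. The status groupings come from the list above; the function names, the "unexpected" bucket for unlisted codes, and the exponential schedule with jitter are assumptions, since the actual K-Bench retry policy is not specified here.

```python
import random

RETRYABLE = {408, 429}
SETUP_ERRORS = {400, 401, 403, 422}


def classify_status(status):
    """Map an HTTP status code to 'ok', 'retry', 'setup_error', or 'unexpected'."""
    if 200 <= status < 300:
        return "ok"
    if status in RETRYABLE or 500 <= status < 600:
        return "retry"
    if status in SETUP_ERRORS:
        return "setup_error"
    return "unexpected"


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Exponential backoff with jitter: delay n is rng() * min(cap, base * 2**n)."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


for code in (200, 401, 422, 429, 503):
    print(code, classify_status(code))
print(backoff_delays(4))
```

Passing rng=lambda: 1.0 removes the jitter and shows the upper bound of each delay, which is convenient when checking the schedule against a rate limit.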

Benchmark integrity

The longevity of K-Bench depends on making the benchmark hard to game. We therefore discuss each third-party test before accepting an endpoint.

Allowed

  • Your normal model, safety layer, routing layer, or product system.
  • A fixed system prompt or policy stack that would be used in deployment.
  • Operational logging needed to keep the endpoint reliable, if agreed in advance.

Not allowed

  • Tuning against private K-Bench prompts, vignettes, transcripts, or scoring artifacts.
  • Changing the submitted system during the run without disclosure.
  • Human intervention, manual answer editing, or delayed review inside the endpoint.
  • Using benchmark traffic to train, memorize, or infer private test items.
