Model evaluation access
K-Bench can evaluate third-party models and deployed systems when there is a clear benchmark-validity agreement and a temporary API endpoint we can call during the run.
The easiest integration is an OpenAI-compatible chat completions endpoint. If your API uses a different path, parameter names, or fixed server-side settings, we can use a small provider-specific adapter as long as the same core contract is preserved: message history in, visible assistant reply out.
Contact the K-Bench team with your endpoint shape, model or system description, expected rate limits, and any governance constraints. We will confirm whether the setup is appropriate before requesting a temporary token.
The quickest path is to expose one HTTPS endpoint that behaves like a chat completions API. If your submission is a full product system rather than a raw model, we evaluate the behavior of that fixed system as submitted.
A base URL or full chat completions URL, plus a temporary bearer token accepted in the Authorization header. Tell us the expiry time and whether IP allowlisting is required.
The exact value we should send in the model field, the public display label you expect, and whether the endpoint represents a raw model, a safety-wrapped model, or a larger orchestration layer.
Your maximum concurrent requests, requests per minute, tokens per minute, request timeout preference, context window, output cap, and any unsupported request fields.
Tell us which controls are accepted, fixed, renamed, or unavailable. For example, if temperature is fixed by your system, or appears under a company-specific field name, the adapter can omit it or map it explicitly.
Whether prompts and responses are logged, retained, inspected, or used for training. Private benchmark content must not be used to tune, memorize, or reverse-engineer the benchmark.
K-Bench's internal provider call is non-streaming and text-only. For each therapist/provider turn, the patient simulator produces the latest patient message, then K-Bench sends that message plus the prior dialogue context to the model being tested. The preferred network format is a JSON POST to /chat/completions, but a company adapter can translate to another endpoint shape when needed.
K-Bench sends the request headers Authorization: Bearer <temporary-token>, Content-Type: application/json, and Accept: application/json. We may also send an identifying title header.
Roles use the standard system, user, and assistant shape. Patient turns arrive as user; previous therapist/model replies arrive as assistant.
Negotiated controls may include temperature, max_tokens, and reasoning.effort. They can be omitted, fixed server-side, or mapped to vendor-specific names.
Minimum viable contract: K-Bench needs to send messages and receive one visible assistant reply. Everything else, including temperature, token caps, reasoning controls, safety policy names, and tracing fields, is an agreed parameter map for that run.
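The agreed parameter map can be pictured as a small lookup table that decides, per control, whether to pass it through, rename it, or leave it out. The following is a minimal sketch under stated assumptions: the map format, the action names, and the vendor field max_output_tokens are hypothetical and are not part of K-Bench itself.

```python
from typing import Any, Dict

# Hypothetical parameter map for one run (illustrative only).
# "pass"   -> send the control under its original name
# "rename" -> send the control under a vendor-specific name
# "omit"   -> do not send it, e.g. because the value is fixed server-side
EXAMPLE_PARAM_MAP: Dict[str, Dict[str, str]] = {
    "temperature": {"action": "omit"},
    "max_tokens": {"action": "rename", "to": "max_output_tokens"},
    "reasoning.effort": {"action": "pass"},
}


def apply_param_map(
    controls: Dict[str, Any], param_map: Dict[str, Dict[str, str]]
) -> Dict[str, Any]:
    """Translate negotiated controls into the fields actually sent."""
    translated: Dict[str, Any] = {}
    for name, value in controls.items():
        rule = param_map.get(name)
        if rule is None:
            # Anything outside the agreement is rejected rather than guessed.
            raise KeyError(f"control {name!r} was not agreed for this run")
        action = rule["action"]
        if action == "omit":
            continue
        if action == "rename":
            translated[rule["to"]] = value
        elif action == "pass":
            translated[name] = value
        else:
            raise ValueError(f"unknown action {action!r} for {name!r}")
    return translated
```

Controls are keyed by flat names here (including the dotted reasoning.effort) to keep the sketch short; a real adapter would decide how nested fields are laid out in the request body.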
Why the example has multiple messages: this is a mid-conversation provider request. Earlier user and assistant messages are context from previous turns. The final user message is the current patient turn, and your endpoint returns the next therapist/provider reply. On the first provider turn, the message list is usually just the system instruction and the patient's opening message.
curl -X POST "https://your-domain.example/v1/chat/completions" \
  -H "Authorization: Bearer $TEMP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "your-organization/your-model-or-system",
    "messages": [
      {
        "role": "system",
        "content": "<K-Bench provider instructions supplied at run time>"
      },
      {
        "role": "user",
        "content": "I do not really know why I agreed to chat today."
      },
      {
        "role": "assistant",
        "content": "I am glad you did. What has been feeling hardest today?"
      },
      {
        "role": "user",
        "content": "It has just been getting worse at home."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 2400,
    "reasoning": {
      "effort": "high"
    }
  }'
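To make the role mapping concrete, the message list in a request like this one can be assembled by interleaving patient and provider turns. This is an illustrative sketch only: build_messages and its arguments are hypothetical names, and it assumes the patient always speaks first and has exactly one more turn than the provider.

```python
from typing import Dict, List


def build_messages(
    system_prompt: str, patient_turns: List[str], provider_replies: List[str]
) -> List[Dict[str, str]]:
    """Build the message history for the next provider turn.

    Patient turns become "user" messages and earlier provider replies
    become "assistant" messages, in conversation order.
    """
    if len(patient_turns) != len(provider_replies) + 1:
        raise ValueError("expected one more patient turn than provider replies")

    messages = [{"role": "system", "content": system_prompt}]
    for i, patient_text in enumerate(patient_turns):
        messages.append({"role": "user", "content": patient_text})
        # The latest patient turn has no reply yet, so nothing follows it.
        if i < len(provider_replies):
            messages.append({"role": "assistant", "content": provider_replies[i]})
    return messages
```

On the first provider turn this yields only the system instruction and the opening patient message; on later turns the list always ends with the current patient message.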
Return a single JSON object with the model's visible reply at choices[0].message.content. Usage fields are welcome but optional. Do not put hidden reasoning or private chain-of-thought in the visible content.
{
  "id": "chatcmpl_kbench_example",
  "object": "chat.completion",
  "created": 1781800000,
  "model": "your-organization/your-model-or-system",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "That sounds frightening, and I am glad you said it here. When you say things are worse at home, are you safe right now?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1210,
    "completion_tokens": 38,
    "total_tokens": 1248
  }
}
Important: an empty completion, malformed JSON, or finish_reason: "length" can cause retries or a failed transcript. If your system refuses a request, return a concise refusal as the visible assistant message rather than an empty message. If your API returns a different response shape, the provider adapter must normalize it to this visible-reply contract before scoring.
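The failure cases above can be checked on your side before an evaluation run. The sketch below is a self-check under assumptions, not K-Bench's actual validation code: extract_visible_reply and InvalidCompletion are hypothetical names, and the real pipeline may retry rather than raise.

```python
import json


class InvalidCompletion(Exception):
    """Raised when a response does not satisfy the visible-reply contract."""


def extract_visible_reply(raw_body: str) -> str:
    """Return choices[0].message.content, or raise InvalidCompletion."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise InvalidCompletion("malformed JSON") from exc

    try:
        choice = payload["choices"][0]
        content = choice["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise InvalidCompletion("missing choices[0].message.content") from exc

    # Whitespace-only content is treated as empty.
    if not isinstance(content, str) or not content.strip():
        raise InvalidCompletion("empty completion")

    # A truncated reply is flagged instead of being passed on.
    if choice.get("finish_reason") == "length":
        raise InvalidCompletion("truncated completion (finish_reason: length)")

    return content
```

Running a check like this against a few sample requests is a quick way to confirm that refusals come back as non-empty assistant messages.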
If a company endpoint is not exactly OpenAI-compatible, we can use a provider-specific adapter. The adapter is a mechanical translation layer; it should not rewrite benchmark prompts, alter patient messages, or optimize responses for K-Bench.
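As an illustration of what "mechanical translation" means, the sketch below converts a chat-completions-style request into an invented vendor shape and normalizes the vendor response back. Every vendor-side field name here (model_id, conversation, speaker, text, output, truncated) is hypothetical; the point is only that message text is copied verbatim in both directions.

```python
from typing import Any, Dict


def to_vendor_request(kbench_request: Dict[str, Any]) -> Dict[str, Any]:
    """Rename fields for a hypothetical vendor API without touching content."""
    return {
        "model_id": kbench_request["model"],
        "conversation": [
            {"speaker": message["role"], "text": message["content"]}
            for message in kbench_request["messages"]
        ],
    }


def from_vendor_response(vendor_response: Dict[str, Any], model: str) -> Dict[str, Any]:
    """Normalize a hypothetical vendor response to the visible-reply shape."""
    output = vendor_response["output"]
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": output["text"]},
                # Assumed vendor flag; mapped so truncation stays visible.
                "finish_reason": "length" if output.get("truncated") else "stop",
            }
        ],
    }
```

Neither function inspects, filters, or rewrites the message text, which is the property that keeps an adapter from changing what is being evaluated.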
K-Bench evaluates behavior inside multi-turn mental-health conversations. The third-party endpoint is called only for the model or system being tested; the simulated patient and scoring pipeline are handled separately.
The patient simulator opens the conversation. The endpoint being tested does not see scenario labels or hidden scoring details.
Your endpoint receives the provider system prompt plus the full prior dialogue as message history. It should respond as the assistant.
Turns remain sequential within each transcript, but multiple transcripts can run at the same time if your rate limits allow it.
Results are judged through the benchmark pipeline and published only according to the prior agreement.
We can lower transcript concurrency to fit your service limits. Please give us conservative limits rather than best-case burst numbers, especially if your endpoint fronts a larger product system.
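The relationship between transcript concurrency and your request limits can be sketched as follows. This is an assumed scheduling model for illustration, not K-Bench's actual runner: run_transcript, run_all, and call_endpoint are hypothetical names. Because turns inside a transcript are awaited one at a time, the number of transcripts in progress is also the maximum number of requests in flight.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature: (transcript_id, turn_index) -> provider reply.
EndpointCall = Callable[[int, int], Awaitable[str]]


async def run_transcript(
    transcript_id: int, num_turns: int, call_endpoint: EndpointCall
) -> List[str]:
    """Run one transcript; each turn waits for the previous reply."""
    replies: List[str] = []
    for turn in range(num_turns):
        replies.append(await call_endpoint(transcript_id, turn))
    return replies


async def run_all(
    num_transcripts: int,
    num_turns: int,
    max_concurrent: int,
    call_endpoint: EndpointCall,
) -> List[List[str]]:
    """Run transcripts in parallel, capped at max_concurrent at a time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(transcript_id: int) -> List[str]:
        async with gate:
            return await run_transcript(transcript_id, num_turns, call_endpoint)

    tasks = [guarded(t) for t in range(num_transcripts)]
    return list(await asyncio.gather(*tasks))
```

Under this model, a stated limit of N concurrent requests translates directly into at most N transcripts running at once.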
401 or 403 for auth problems.
400 or 422 for unsupported request shape.
429 for rate limiting.
5xx or 408 for transient infrastructure failures.
K-Bench backs off and retries rate-limit and transient infrastructure failures. Authentication and request-shape failures are treated as setup issues and should be fixed before the evaluation run.
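The status-code handling described here can be summarized as a small classifier plus a backoff delay. This is a sketch under assumptions: the function names are hypothetical, and the exponential schedule with jitter is a common choice used for illustration, since the actual backoff policy is not specified.

```python
import random

RETRYABLE_STATUSES = {408, 429}           # timeout and rate limiting
SETUP_STATUSES = {400, 401, 403, 422}     # auth and request-shape problems


def classify_status(status: int) -> str:
    """Map an HTTP status code to 'ok', 'retry', 'setup_issue', or 'unexpected'."""
    if 200 <= status <= 299:
        return "ok"
    if status in RETRYABLE_STATUSES or 500 <= status <= 599:
        return "retry"
    if status in SETUP_STATUSES:
        return "setup_issue"
    return "unexpected"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Assumed policy: exponential growth capped at `cap`, with full jitter.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Returning the documented codes consistently lets the caller distinguish failures worth retrying from setup problems that need a fix on your side.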
The longevity of K-Bench depends on making the benchmark hard to game. We therefore discuss each third-party test before accepting an endpoint.