feat(api): consolidate HTTP API endpoints and fixes #1282
Conversation
Extracted from qwen35_stable_chat_template branch into a focused PR
against main, separated from unrelated radix cache / mamba / DP-EP work
that was bundled in the same branch.
New endpoints:
- /v1/messages (Anthropic Messages compat): extra_body field forwarding,
tool_use stream re-open on interleaved deltas, better missing-litellm error
- /v1/responses (OpenAI Responses API compat, stateless lifecycle)
OpenAI compatibility:
- Wrap error responses in {error:{message,type,param,code}} envelope
- SSE: role-only initial chunk + data:[DONE] terminator
- Stream reasoning tokens immediately, flush buffer on truncation
- reasoning_effort field, prompt_tokens_details, ChatMessage.reasoning alias
- Default max_tokens=None so -1 sentinel falls through to budget
Behavior:
- max_new_tokens=-1 sentinel resolves to max_req_total_len - prompt_tokens
- Reject prompts whose char length > 8 * max_req_total_len pre-tokenize
- Catch ValueError -> HTTP 400 instead of 500 across all endpoints
- Run tokenizer.encode in executor to keep event loop responsive
- Re-inject prompt_cache_len into response metadata
- Tool name validation gated by LIGHTLLM_ENABLE_TOOL_NAME_CHECK env
Misc:
- Replace gunicorn --access-logfile with FastAPI middleware
- Drop unused Function.response that leaked <response>null</response>
into chat templates
- Alias assistant.reasoning -> reasoning_content for Qwen3 templates
- Qwen3.5 fixed and vela-alpha chat templates
- End-to-end smoke test covering every HTTP endpoint
- litellm declared as 'anthropic' extras_require
Hooks bypassed: pre-commit virtualenv install fails on this filesystem
(flock errno=2 in shared storage). Code verified clean by running
black 21.12b0 (hook-pinned version) and flake8 6.1.0 ruleset directly.
Code Review
This pull request introduces the OpenAI Responses API compatibility layer, enabling LightLLM to support the newer OpenAI SDK response format through a new /v1/responses endpoint. Key enhancements include support for reasoning effort parameters, the inclusion of prompt cache details in usage reporting, and the implementation of non-blocking tokenization by offloading it to an executor. The update also refactors error handling across the OpenAI and Anthropic adapters to provide descriptive error envelopes and improves the handling of interleaved tool calls in streaming responses. Review feedback correctly identified missing uuid imports in several API modules, potential breaking changes due to the renaming of reasoning fields in model definitions, and type inconsistencies in streaming generators that could affect ASGI middleware.
| "index": state["anthropic_index"], | ||
| "content_block": { | ||
| "type": "tool_use", | ||
| "id": state["id"] or f"toolu_{uuid.uuid4().hex[:24]}", |
| reasoning_parser_dict = {} | ||
|
|
||
| # Pre-generate a UUID-style request ID (matching the 36888 service format) | ||
| chat_completion_id = f"chatcmpl-{uuid.uuid4().hex}" |
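The review noted that several of the touched API modules use `uuid` in f-strings like the ones above without importing it. The fix is the standard-library import at module top; a generic sketch, not the exact patch:

```python
import uuid  # needed for the toolu_/chatcmpl- style IDs

tool_use_id = f"toolu_{uuid.uuid4().hex[:24]}"
chat_completion_id = f"chatcmpl-{uuid.uuid4().hex}"
```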
```diff
     role: Optional[str] = None
     content: Optional[str] = None
-    reasoning_content: Optional[str] = None
+    reasoning: Optional[str] = None
```
Renaming reasoning_content to reasoning in ChatMessage is a breaking change for existing API clients. To maintain backward compatibility while supporting the new field name, both fields should be present, or reasoning should be added as an alias.
Suggested change:

```diff
-    reasoning: Optional[str] = None
+    reasoning: Optional[str] = None
+    reasoning_content: Optional[str] = None
```
```diff
     content: Optional[str] = None
     tool_calls: Optional[List[ToolCall]] = Field(default=None, examples=[None])
-    reasoning_content: Optional[str] = None
+    reasoning: Optional[str] = None
```
Renaming reasoning_content to reasoning in DeltaMessage is a breaking change for streaming clients. It is recommended to keep both fields to ensure compatibility with existing integrations.
Suggested change:

```diff
-    reasoning: Optional[str] = None
+    reasoning: Optional[str] = None
+    reasoning_content: Optional[str] = None
```
| yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8") | ||
|
|
||
| yield "data: [DONE]\n\n".encode("utf-8") | ||
| usage = UsageInfo( | ||
| prompt_tokens=prompt_tokens, | ||
| completion_tokens=completion_tokens, | ||
| total_tokens=prompt_tokens + completion_tokens, | ||
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | ||
| ) | ||
| usage_chunk = CompletionStreamResponse( | ||
| id=group_request_id, | ||
| created=created_time, | ||
| choices=[], # Empty choices array as per OpenAI spec | ||
| model=request.model, | ||
| usage=usage, | ||
| ) | ||
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | ||
|
|
||
| if request.stream_options and request.stream_options.include_usage: | ||
| usage = UsageInfo( | ||
| prompt_tokens=prompt_tokens, | ||
| completion_tokens=completion_tokens, | ||
| total_tokens=prompt_tokens + completion_tokens, | ||
| ) | ||
| usage_chunk = CompletionStreamResponse( | ||
| id=group_request_id, | ||
| created=created_time, | ||
| choices=[], # Empty choices array as per OpenAI spec | ||
| model=request.model, | ||
| usage=usage, | ||
| ) | ||
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | ||
| yield "data: [DONE]\n\n".encode("utf-8") |
The streaming generator in completions_impl yields a mix of bytes and strings, which is inconsistent and can cause issues with certain ASGI middlewares or clients. Specifically, line 943 and 960 yield encoded bytes, while line 958 yields a string. Additionally, stream_resp.dict() is deprecated in Pydantic v2; model_dump() should be used instead for consistency with other parts of the codebase.
| yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8") | |
| yield "data: [DONE]\n\n".encode("utf-8") | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| if request.stream_options and request.stream_options.include_usage: | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| yield "data: [DONE]\n\n".encode("utf-8") | |
| yield "data: " + stream_resp.model_dump_json() + "\n\n" | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| yield "data: [DONE]\n\n" |
```python
async def _safe_stream_wrapper(stream_generator):
    """Wrap a streaming generator to catch ValueError (e.g. input too long) and yield an SSE error
    event instead of letting the exception propagate to Starlette which prints a long traceback."""
    try:
        async for item in stream_generator:
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        yield f"data: {error_data}\n\n"
```
The _safe_stream_wrapper yields a string in the except block, but it may wrap generators that yield bytes (e.g., in api_tgi.py or completions_impl). Mixing types in a single stream can lead to unexpected behavior in downstream consumers. It is safer to yield bytes if the input stream is byte-oriented.
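One way to resolve the mismatch (a sketch of the reviewer's suggestion, not code from the PR) is to track whether the wrapped stream has been yielding bytes and encode the error event to match:

```python
import json


async def _safe_stream_wrapper(stream_generator):
    """Catch ValueError from the wrapped stream and emit an SSE error event,
    mirroring the str/bytes type the stream has been yielding."""
    yielded_bytes = False
    try:
        async for item in stream_generator:
            yielded_bytes = isinstance(item, (bytes, bytearray))
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        event = f"data: {error_data}\n\n"
        yield event.encode("utf-8") if yielded_bytes else event
```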
Critical:

- ChatMessage / DeltaMessage: keep both `reasoning` and `reasoning_content` with a model_validator that auto-syncs between them (sketched below), so legacy clients reading `reasoning_content` continue to work alongside the new OpenAI-aligned `reasoning` field.
- Default `LIGHTLLM_ENABLE_TOOL_NAME_CHECK=True` (was `False`). Tool-name validation must be opt-out, not opt-in, to avoid silently accepting hallucinated tool calls.

Pre-existing polish:

- api_openai.py: `stream_resp.dict()` -> `model_dump_json()` (Pydantic v2)
- api_openai.py / api_tgi.py: yield strings consistently in SSE streams so `_safe_stream_wrapper`'s string error path doesn't switch type mid-stream when wrapping byte-yielding producers.
- httpserver/manager.py: `asyncio.get_event_loop().run_in_executor(...)` -> `asyncio.to_thread(...)` (Python 3.10+ recommended).
- api_openai.py: drop dead `nonlocal offset` (F824).
- build_prompt.py: drop dead `global tokenizer` (F824).
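A minimal sketch of the `model_validator` auto-sync described above, assuming Pydantic v2 and showing only the fields relevant here:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class ChatMessage(BaseModel):
    role: Optional[str] = None
    content: Optional[str] = None
    # Keep both the OpenAI-aligned field and the legacy one.
    reasoning: Optional[str] = None
    reasoning_content: Optional[str] = None

    @model_validator(mode="after")
    def _sync_reasoning_fields(self):
        # Mirror whichever field was populated into the other, so clients reading
        # reasoning_content keep working alongside clients reading reasoning.
        if self.reasoning is None and self.reasoning_content is not None:
            self.reasoning = self.reasoning_content
        elif self.reasoning_content is None and self.reasoning is not None:
            self.reasoning_content = self.reasoning
        return self
```

The same validator can be shared by `DeltaMessage`.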
Replace BaseHTTPMiddleware-based @app.middleware("http") access log with
a pure ASGI class middleware. Starlette's BaseHTTPMiddleware swallows
http.disconnect on the inner request's receive channel, so
Request.is_disconnected() never flipped True and the abort path at
HttpServerManager._wait_to_token_package never fired — inference kept
running until natural EOS / max_tokens after the client closed the
socket, leaving ghost requests on the GPU.
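A minimal sketch of the pure-ASGI shape described above (class name and log format are illustrative, not the PR's exact middleware). Because it forwards `receive` untouched instead of re-wrapping the request the way `BaseHTTPMiddleware` does, `http.disconnect` still reaches `Request.is_disconnected()` in the endpoint:

```python
import logging
import time

logger = logging.getLogger("lightllm.access")


class AccessLogMiddleware:
    """Pure ASGI middleware: log method, path, status and latency per request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.monotonic()
        status = {"code": None}

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            # receive is passed straight through, so disconnect events are not swallowed.
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            logger.info("%s %s -> %s (%.1f ms)",
                        scope.get("method"), scope.get("path"), status["code"], elapsed_ms)
```

Registered with `app.add_middleware(AccessLogMiddleware)`, it sits outside the router without the response re-buffering that `BaseHTTPMiddleware` introduces.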
… into api-fixes-extract
Summary
Extracted the HTTP API surface changes from the bloated `qwen35_stable_chat_template` branch into a focused PR. The source branch had ~125 changed files mixing API work with unrelated radix-cache / mamba / DP-EP / multimodal / OOM-probe changes; this PR contains only the API-related portion (22 files).

What's in
New endpoints
- `/v1/messages` — Anthropic Messages compat, with `extra_body` forwarding into chat-completion params and a fix for tool_use streaming when deltas for an earlier tool index arrive after a later one opens.
- `/v1/responses` — OpenAI Responses API compat (stateless lifecycle: retrieve/delete/cancel return an error).

OpenAI compat fixes
- Error responses wrapped in the `{error:{message,type,param,code}}` envelope (see the sketch after this list).
- SSE: role-only initial chunk + `data: [DONE]` terminator; reasoning tokens stream immediately; flush the partial-tag buffer on truncation.
- `reasoning_effort`, `prompt_tokens_details`, `ChatMessage.reasoning` alias.
- `max_tokens` default is `None` (was `16384`) so the `-1` sentinel can fall through to the budget logic below.
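A minimal sketch of that envelope, assuming FastAPI's `JSONResponse`; the helper name and defaults are illustrative rather than the PR's exact code:

```python
from typing import Optional

from fastapi.responses import JSONResponse


def create_error_response(status_code: int, message: str,
                          err_type: str = "invalid_request_error",
                          param: Optional[str] = None,
                          code: Optional[str] = None) -> JSONResponse:
    """Wrap an error in the OpenAI-style {"error": {...}} envelope."""
    return JSONResponse(
        status_code=status_code,
        content={"error": {"message": message, "type": err_type, "param": param, "code": code}},
    )


# e.g. the ValueError -> HTTP 400 path below:
#   return create_error_response(400, str(exc))
```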
Sampling/length behavior

- `max_new_tokens=-1` sentinel resolves to `max_req_total_len - prompt_tokens`, so clients that don't specify `max_tokens` get the full remaining budget instead of a 16384 cap (see the sketch after this list).
- Reject prompts whose char length exceeds `8 * max_req_total_len` before tokenizing (cheap rejection of obviously-too-long inputs).
- Catch `ValueError` across all endpoints and return HTTP 400 instead of 500.
- Run `tokenizer.encode` in a thread-pool executor to keep the event loop responsive on long inputs.
- Re-inject `prompt_cache_len` into response metadata.
- Tool name validation gated by the `LIGHTLLM_ENABLE_TOOL_NAME_CHECK` env var.
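A sketch of how these pieces fit together; the function name and the constant are hypothetical stand-ins (the PR runs `tokenizer.encode` in an executor, shown here via `asyncio.to_thread` for brevity):

```python
import asyncio

CHARS_PER_TOKEN_BOUND = 8  # mirrors the 8 * max_req_total_len pre-tokenize check


async def resolve_max_new_tokens(prompt: str, max_new_tokens: int,
                                 max_req_total_len: int, tokenizer) -> int:
    # Cheap rejection before tokenizing: no prompt this long can fit the request budget.
    if len(prompt) > CHARS_PER_TOKEN_BOUND * max_req_total_len:
        raise ValueError("prompt is too long")  # surfaced to the client as HTTP 400

    # Tokenize off the event loop so long prompts don't stall other requests.
    prompt_ids = await asyncio.to_thread(tokenizer.encode, prompt)
    prompt_tokens = len(prompt_ids)

    if max_new_tokens == -1:  # sentinel: grant the full remaining budget
        return max_req_total_len - prompt_tokens
    return max_new_tokens
```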
Misc

- Replace gunicorn `--access-logfile -` with a FastAPI access-log middleware (avoids double-logging once the middleware is in).
- Drop the unused `Function.response` field that was leaking `<response>null</response>` into chat-template renders (added ~7 tokens/tool, drifted prompts vs other engines).
- Alias `assistant.reasoning` → `reasoning_content` for Qwen3 templates so OpenRouter-style replays don't render as empty `<think></think>` blocks.
- Chat templates under `test/chat_template/` for Qwen3.5 fixed and vela-alpha.
- End-to-end smoke test covering every HTTP endpoint (`test/test_api/test_all_endpoints.py`).
- `litellm` declared as `extras_require['anthropic']` (only needed when serving `/v1/messages`).

What's intentionally NOT in this PR
These also touch nominally-API-adjacent files but are tied to other features and will land separately:
- OOM probe (`LIGHTLLM_CHECK_OOM=1`)
- `model_name` label change (backward-incompat, needs separate review)

Test plan
- `black 21.12b0` (the version configured by `.pre-commit-config.yaml`) run clean
- `flake8 6.1.0` ruleset clean (the 2 F824 warnings on this branch pre-exist on main and aren't enforced by the configured flake8 6.1.0)
- End-to-end smoke test against every HTTP endpoint, including `/v1/messages` and `/v1/responses`
- Streaming (`data: [DONE]` terminator) verified against an OpenAI SDK client; see the example below
- `max_tokens` unset behavior allows the full `max_req_total_len - prompt_tokens` budget
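For reference, the OpenAI-SDK streaming check can be reproduced along these lines (base URL, port, and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The first chunk carries only the assistant role; the final usage chunk has an
    # empty choices list; the SDK consumes the data: [DONE] terminator itself.
    print(chunk.choices[0].delta if chunk.choices else chunk.usage)
```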