feat(api): consolidate HTTP API endpoints and fixes #1282
Conversation
Extracted from qwen35_stable_chat_template branch into a focused PR
against main, separated from unrelated radix cache / mamba / DP-EP work
that was bundled in the same branch.
New endpoints:
- /v1/messages (Anthropic Messages compat): extra_body field forwarding,
tool_use stream re-open on interleaved deltas, better missing-litellm error
- /v1/responses (OpenAI Responses API compat, stateless lifecycle)
OpenAI compatibility:
- Wrap error responses in {error:{message,type,param,code}} envelope
- SSE: role-only initial chunk + data:[DONE] terminator
- Stream reasoning tokens immediately, flush buffer on truncation
- reasoning_effort field, prompt_tokens_details, ChatMessage.reasoning alias
- Default max_tokens=None so -1 sentinel falls through to budget
Behavior:
- max_new_tokens=-1 sentinel resolves to max_req_total_len - prompt_tokens
- Reject prompts whose char length > 8 * max_req_total_len pre-tokenize
- Catch ValueError -> HTTP 400 instead of 500 across all endpoints
- Run tokenizer.encode in executor to keep event loop responsive
- Re-inject prompt_cache_len into response metadata
- Tool name validation gated by LIGHTLLM_ENABLE_TOOL_NAME_CHECK env
Misc:
- Replace gunicorn --access-logfile with FastAPI middleware
- Drop unused Function.response that leaked <response>null</response>
into chat templates
- Alias assistant.reasoning -> reasoning_content for Qwen3 templates
- Qwen3.5 fixed and vela-alpha chat templates
- End-to-end smoke test covering every HTTP endpoint
- litellm declared as 'anthropic' extras_require
Hooks bypassed: pre-commit virtualenv install fails on this filesystem
(flock errno=2 in shared storage). Code verified clean by running
black 21.12b0 (hook-pinned version) and flake8 6.1.0 ruleset directly.
Code Review
This pull request introduces the OpenAI Responses API compatibility layer, enabling LightLLM to support the newer OpenAI SDK response format through a new /v1/responses endpoint. Key enhancements include support for reasoning effort parameters, the inclusion of prompt cache details in usage reporting, and the implementation of non-blocking tokenization by offloading it to an executor. The update also refactors error handling across the OpenAI and Anthropic adapters to provide descriptive error envelopes and improves the handling of interleaved tool calls in streaming responses. Review feedback correctly identified missing uuid imports in several API modules, potential breaking changes due to the renaming of reasoning fields in model definitions, and type inconsistencies in streaming generators that could affect ASGI middleware.
| "index": state["anthropic_index"], | ||
| "content_block": { | ||
| "type": "tool_use", | ||
| "id": state["id"] or f"toolu_{uuid.uuid4().hex[:24]}", |
| reasoning_parser_dict = {} | ||
|
|
||
| # Pre-generate a UUID-style request ID (matching the 36888 service format) | ||
| chat_completion_id = f"chatcmpl-{uuid.uuid4().hex}" |
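The review noted that several of the touched API modules use `uuid` in f-strings like the ones above without importing it. The fix is the standard-library import at module top; a generic sketch, not the exact patch:

```python
import uuid  # needed for the toolu_/chatcmpl- style IDs

tool_use_id = f"toolu_{uuid.uuid4().hex[:24]}"
chat_completion_id = f"chatcmpl-{uuid.uuid4().hex}"
```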
```diff
     role: Optional[str] = None
     content: Optional[str] = None
-    reasoning_content: Optional[str] = None
+    reasoning: Optional[str] = None
```
Renaming reasoning_content to reasoning in ChatMessage is a breaking change for existing API clients. To maintain backward compatibility while supporting the new field name, both fields should be present, or reasoning should be added as an alias.
Suggested change:

```diff
-    reasoning: Optional[str] = None
+    reasoning: Optional[str] = None
+    reasoning_content: Optional[str] = None
```
```diff
     content: Optional[str] = None
     tool_calls: Optional[List[ToolCall]] = Field(default=None, examples=[None])
-    reasoning_content: Optional[str] = None
+    reasoning: Optional[str] = None
```
Renaming reasoning_content to reasoning in DeltaMessage is a breaking change for streaming clients. It is recommended to keep both fields to ensure compatibility with existing integrations.
Suggested change:

```diff
-    reasoning: Optional[str] = None
+    reasoning: Optional[str] = None
+    reasoning_content: Optional[str] = None
```
| yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8") | ||
|
|
||
| yield "data: [DONE]\n\n".encode("utf-8") | ||
| usage = UsageInfo( | ||
| prompt_tokens=prompt_tokens, | ||
| completion_tokens=completion_tokens, | ||
| total_tokens=prompt_tokens + completion_tokens, | ||
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | ||
| ) | ||
| usage_chunk = CompletionStreamResponse( | ||
| id=group_request_id, | ||
| created=created_time, | ||
| choices=[], # Empty choices array as per OpenAI spec | ||
| model=request.model, | ||
| usage=usage, | ||
| ) | ||
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | ||
|
|
||
| if request.stream_options and request.stream_options.include_usage: | ||
| usage = UsageInfo( | ||
| prompt_tokens=prompt_tokens, | ||
| completion_tokens=completion_tokens, | ||
| total_tokens=prompt_tokens + completion_tokens, | ||
| ) | ||
| usage_chunk = CompletionStreamResponse( | ||
| id=group_request_id, | ||
| created=created_time, | ||
| choices=[], # Empty choices array as per OpenAI spec | ||
| model=request.model, | ||
| usage=usage, | ||
| ) | ||
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | ||
| yield "data: [DONE]\n\n".encode("utf-8") |
The streaming generator in completions_impl yields a mix of bytes and strings, which is inconsistent and can cause issues with certain ASGI middlewares or clients. Specifically, line 943 and 960 yield encoded bytes, while line 958 yields a string. Additionally, stream_resp.dict() is deprecated in Pydantic v2; model_dump() should be used instead for consistency with other parts of the codebase.
| yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8") | |
| yield "data: [DONE]\n\n".encode("utf-8") | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| if request.stream_options and request.stream_options.include_usage: | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| yield "data: [DONE]\n\n".encode("utf-8") | |
| yield "data: " + stream_resp.model_dump_json() + "\n\n" | |
| usage = UsageInfo( | |
| prompt_tokens=prompt_tokens, | |
| completion_tokens=completion_tokens, | |
| total_tokens=prompt_tokens + completion_tokens, | |
| prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens), | |
| ) | |
| usage_chunk = CompletionStreamResponse( | |
| id=group_request_id, | |
| created=created_time, | |
| choices=[], # Empty choices array as per OpenAI spec | |
| model=request.model, | |
| usage=usage, | |
| ) | |
| yield f"data: {usage_chunk.model_dump_json()}\n\n" | |
| yield "data: [DONE]\n\n" |
```python
async def _safe_stream_wrapper(stream_generator):
    """Wrap a streaming generator to catch ValueError (e.g. input too long) and yield an SSE error
    event instead of letting the exception propagate to Starlette which prints a long traceback."""
    try:
        async for item in stream_generator:
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        yield f"data: {error_data}\n\n"
```
The _safe_stream_wrapper yields a string in the except block, but it may wrap generators that yield bytes (e.g., in api_tgi.py or completions_impl). Mixing types in a single stream can lead to unexpected behavior in downstream consumers. It is safer to yield bytes if the input stream is byte-oriented.
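One way to resolve the mismatch (a sketch of the reviewer's suggestion, not code from the PR) is to track whether the wrapped stream has been yielding bytes and encode the error event to match:

```python
import json


async def _safe_stream_wrapper(stream_generator):
    """Catch ValueError from the wrapped stream and emit an SSE error event,
    mirroring the str/bytes type the stream has been yielding."""
    yielded_bytes = False
    try:
        async for item in stream_generator:
            yielded_bytes = isinstance(item, (bytes, bytearray))
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        event = f"data: {error_data}\n\n"
        yield event.encode("utf-8") if yielded_bytes else event
```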
Critical:

- ChatMessage / DeltaMessage: keep both `reasoning` and `reasoning_content` with a model_validator that auto-syncs between them (sketched below), so legacy clients reading `reasoning_content` continue to work alongside the new OpenAI-aligned `reasoning` field.
- Default `LIGHTLLM_ENABLE_TOOL_NAME_CHECK=True` (was `False`). Tool-name validation must be opt-out, not opt-in, to avoid silently accepting hallucinated tool calls.

Pre-existing polish:

- api_openai.py: `stream_resp.dict()` -> `model_dump_json()` (Pydantic v2)
- api_openai.py / api_tgi.py: yield strings consistently in SSE streams so `_safe_stream_wrapper`'s string error path doesn't switch type mid-stream when wrapping byte-yielding producers.
- httpserver/manager.py: `asyncio.get_event_loop().run_in_executor(...)` -> `asyncio.to_thread(...)` (Python 3.10+ recommended).
- api_openai.py: drop dead `nonlocal offset` (F824).
- build_prompt.py: drop dead `global tokenizer` (F824).
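A minimal sketch of the `model_validator` auto-sync described above, assuming Pydantic v2 and showing only the fields relevant here:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class ChatMessage(BaseModel):
    role: Optional[str] = None
    content: Optional[str] = None
    # Keep both the OpenAI-aligned field and the legacy one.
    reasoning: Optional[str] = None
    reasoning_content: Optional[str] = None

    @model_validator(mode="after")
    def _sync_reasoning_fields(self):
        # Mirror whichever field was populated into the other, so clients reading
        # reasoning_content keep working alongside clients reading reasoning.
        if self.reasoning is None and self.reasoning_content is not None:
            self.reasoning = self.reasoning_content
        elif self.reasoning_content is None and self.reasoning is not None:
            self.reasoning_content = self.reasoning
        return self
```

The same validator can be shared by `DeltaMessage`.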
Replace BaseHTTPMiddleware-based @app.middleware("http") access log with
a pure ASGI class middleware. Starlette's BaseHTTPMiddleware swallows
http.disconnect on the inner request's receive channel, so
Request.is_disconnected() never flipped True and the abort path at
HttpServerManager._wait_to_token_package never fired — inference kept
running until natural EOS / max_tokens after the client closed the
socket, leaving ghost requests on the GPU.
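A minimal sketch of the pure-ASGI shape described above (class name and log format are illustrative, not the PR's exact middleware). Because it forwards `receive` untouched instead of re-wrapping the request the way `BaseHTTPMiddleware` does, `http.disconnect` still reaches `Request.is_disconnected()` in the endpoint:

```python
import logging
import time

logger = logging.getLogger("lightllm.access")


class AccessLogMiddleware:
    """Pure ASGI middleware: log method, path, status and latency per request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.monotonic()
        status = {"code": None}

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            # receive is passed straight through, so disconnect events are not swallowed.
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            logger.info("%s %s -> %s (%.1f ms)",
                        scope.get("method"), scope.get("path"), status["code"], elapsed_ms)
```

Registered with `app.add_middleware(AccessLogMiddleware)`, it sits outside the router without the response re-buffering that `BaseHTTPMiddleware` introduces.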
… into api-fixes-extract
Summary
Extracted the HTTP API surface changes from the bloated `qwen35_stable_chat_template` branch into a focused PR. The source branch had ~125 changed files mixing API work with unrelated radix-cache / mamba / DP-EP / multimodal / OOM-probe changes; this PR contains only the API-related portion (22 files).

What's in
New endpoints
- `/v1/messages` — Anthropic Messages compat, with `extra_body` forwarding into chat-completion params and a fix for tool_use streaming when deltas for an earlier tool index arrive after a later one opens.
- `/v1/responses` — OpenAI Responses API compat (stateless lifecycle: retrieve/delete/cancel return an error).

OpenAI compat fixes
- Error responses wrapped in the `{error:{message,type,param,code}}` envelope (see the sketch after this list).
- SSE: role-only initial chunk + `data: [DONE]` terminator; reasoning tokens stream immediately; flush the partial-tag buffer on truncation.
- `reasoning_effort`, `prompt_tokens_details`, `ChatMessage.reasoning` alias.
- `max_tokens` default is `None` (was `16384`) so the `-1` sentinel can fall through to the budget logic below.
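A minimal sketch of that envelope, assuming FastAPI's `JSONResponse`; the helper name and defaults are illustrative rather than the PR's exact code:

```python
from typing import Optional

from fastapi.responses import JSONResponse


def create_error_response(status_code: int, message: str,
                          err_type: str = "invalid_request_error",
                          param: Optional[str] = None,
                          code: Optional[str] = None) -> JSONResponse:
    """Wrap an error in the OpenAI-style {"error": {...}} envelope."""
    return JSONResponse(
        status_code=status_code,
        content={"error": {"message": message, "type": err_type, "param": param, "code": code}},
    )


# e.g. the ValueError -> HTTP 400 path below:
#   return create_error_response(400, str(exc))
```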
Sampling/length behavior

- `max_new_tokens=-1` sentinel resolves to `max_req_total_len - prompt_tokens`, so clients that don't specify `max_tokens` get the full remaining budget instead of a 16384 cap (see the sketch after this list).
- Reject prompts whose char length exceeds `8 * max_req_total_len` before tokenizing (cheap rejection of obviously-too-long inputs).
- Catch `ValueError` across all endpoints and return HTTP 400 instead of 500.
- Run `tokenizer.encode` in a thread-pool executor to keep the event loop responsive on long inputs.
- Re-inject `prompt_cache_len` into response metadata.
- Tool name validation gated by the `LIGHTLLM_ENABLE_TOOL_NAME_CHECK` env var.
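A sketch of how these pieces fit together; the function name and the constant are hypothetical stand-ins (the PR runs `tokenizer.encode` in an executor, shown here via `asyncio.to_thread` for brevity):

```python
import asyncio

CHARS_PER_TOKEN_BOUND = 8  # mirrors the 8 * max_req_total_len pre-tokenize check


async def resolve_max_new_tokens(prompt: str, max_new_tokens: int,
                                 max_req_total_len: int, tokenizer) -> int:
    # Cheap rejection before tokenizing: no prompt this long can fit the request budget.
    if len(prompt) > CHARS_PER_TOKEN_BOUND * max_req_total_len:
        raise ValueError("prompt is too long")  # surfaced to the client as HTTP 400

    # Tokenize off the event loop so long prompts don't stall other requests.
    prompt_ids = await asyncio.to_thread(tokenizer.encode, prompt)
    prompt_tokens = len(prompt_ids)

    if max_new_tokens == -1:  # sentinel: grant the full remaining budget
        return max_req_total_len - prompt_tokens
    return max_new_tokens
```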
Misc

- Replace gunicorn `--access-logfile -` with a FastAPI access-log middleware (avoids double-logging once the middleware is in).
- Drop the unused `Function.response` field that was leaking `<response>null</response>` into chat-template renders (added ~7 tokens/tool, drifted prompts vs other engines).
- Alias `assistant.reasoning` → `reasoning_content` for Qwen3 templates so OpenRouter-style replays don't render as empty `<think></think>` blocks.
- Chat templates under `test/chat_template/` for Qwen3.5 fixed and vela-alpha.
- End-to-end smoke test covering every HTTP endpoint (`test/test_api/test_all_endpoints.py`).
- `litellm` declared as `extras_require['anthropic']` (only needed when serving `/v1/messages`).

What's intentionally NOT in this PR
These also touch nominally-API-adjacent files but are tied to other features and will land separately:
- OOM probe (`LIGHTLLM_CHECK_OOM=1`)
- `model_name` label change (backward-incompat, needs separate review)

Test plan
- `black 21.12b0` (the version configured by `.pre-commit-config.yaml`) run clean
- `flake8 6.1.0` ruleset clean (the 2 F824 warnings on this branch pre-exist on main and aren't enforced by the configured flake8 6.1.0)
- End-to-end smoke test against every HTTP endpoint, including `/v1/messages` and `/v1/responses`
- Streaming (`data: [DONE]` terminator) verified against an OpenAI SDK client; see the example below
- `max_tokens` unset behavior allows the full `max_req_total_len - prompt_tokens` budget
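For reference, the OpenAI-SDK streaming check can be reproduced along these lines (base URL, port, and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The first chunk carries only the assistant role; the final usage chunk has an
    # empty choices list; the SDK consumes the data: [DONE] terminator itself.
    print(chunk.choices[0].delta if chunk.choices else chunk.usage)
```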