
feat(api): consolidate HTTP API endpoints and fixes #1282

Merged
shihaobai merged 15 commits into main from api-fixes-extract
Apr 29, 2026

Conversation

@sufubao (Collaborator) commented Apr 29, 2026

Summary

Extracted the HTTP API surface changes from the bloated qwen35_stable_chat_template branch into a focused PR. The source branch had ~125 changed files mixing API work with unrelated radix-cache / mamba / DP-EP / multimodal / OOM-probe changes; this PR contains only the API-related portion (22 files).

What's in

New endpoints

  • /v1/messages — Anthropic Messages compat, with extra_body forwarding into chat-completion params and a fix for tool_use streaming when deltas for an earlier tool index arrive after a later one opens (sketched after this list).
  • /v1/responses — OpenAI Responses API compat (stateless lifecycle: retrieve/delete/cancel return error).
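
A minimal sketch of the interleaved tool_use handling (the function name convert_tool_deltas and the delta tuple shape are assumptions for illustration, not the merged code): track which Anthropic content block is currently open and re-open the block belonging to a tool index when one of its deltas arrives out of order.

```python
import uuid


def convert_tool_deltas(deltas):
    """deltas: iterable of (tool_index, name_or_none, arguments_fragment) from chat-completion chunks."""
    blocks = {}        # upstream tool index -> Anthropic content_block index
    ids = {}           # stable toolu_... id per tool index, reused on re-open
    names = {}         # remembered names so a re-opened block can repeat them
    open_block = None  # Anthropic block index currently streaming
    next_index = 1     # index 0 is typically the text block

    for tool_index, name, args_fragment in deltas:
        if tool_index not in blocks:
            blocks[tool_index] = next_index
            next_index += 1
            ids[tool_index] = f"toolu_{uuid.uuid4().hex[:24]}"
        if name:
            names[tool_index] = name
        target = blocks[tool_index]

        if open_block != target:
            # Interleaved delta: close whatever block is open, then (re)open
            # the block that belongs to this tool index.
            if open_block is not None:
                yield {"type": "content_block_stop", "index": open_block}
            yield {
                "type": "content_block_start",
                "index": target,
                "content_block": {
                    "type": "tool_use",
                    "id": ids[tool_index],
                    "name": names.get(tool_index, ""),
                    "input": {},
                },
            }
            open_block = target

        yield {
            "type": "content_block_delta",
            "index": target,
            "delta": {"type": "input_json_delta", "partial_json": args_fragment},
        }

    if open_block is not None:
        yield {"type": "content_block_stop", "index": open_block}
```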

OpenAI compat fixes

  • Wrap error responses in {error:{message,type,param,code}} envelope.
  • SSE alignment: role-only initial chunk + data:[DONE] terminator; reasoning tokens stream immediately; flush partial-tag buffer on truncation (stream shape sketched after this list).
  • New fields: reasoning_effort, prompt_tokens_details, ChatMessage.reasoning alias.
  • max_tokens default is None (was 16384) so the -1 sentinel can fall through to the budget logic below.
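
For reference, a rough sketch of the stream shape this implies (chunk fields follow the public OpenAI chat-completions format; the generator itself is illustrative, not the code in this PR):

```python
import json


def sse(obj):
    return f"data: {json.dumps(obj, ensure_ascii=False)}\n\n"


async def chat_stream(deltas, completion_id, model):
    # 1. role-only initial chunk so SDK clients can initialise the assistant message
    yield sse({
        "id": completion_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}],
    })
    # 2. content / reasoning deltas forwarded as soon as they are produced
    async for delta in deltas:
        yield sse({
            "id": completion_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": delta, "finish_reason": None}],
        })
    # 3. terminator the OpenAI SDKs wait for
    yield "data: [DONE]\n\n"
```

Failures surface in the same envelope as non-streaming errors, i.e. {"error": {"message": ..., "type": ..., "param": ..., "code": ...}}.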

Sampling/length behavior

  • max_new_tokens=-1 sentinel resolves to max_req_total_len - prompt_tokens so clients that don't specify max_tokens get the full remaining budget instead of a 16384 cap (see the sketch after this list).
  • Reject prompts whose char length exceeds 8 * max_req_total_len pre-tokenize (cheap rejection of obviously-too-long inputs).
  • Catch ValueError across all endpoints and return HTTP 400 instead of 500.
  • Run tokenizer.encode in a thread-pool executor to keep the event loop responsive on long inputs.
  • Re-inject prompt_cache_len into response metadata.
  • Optional tool-name validation gated by LIGHTLLM_ENABLE_TOOL_NAME_CHECK env.
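
A condensed sketch of the budget/rejection flow described above; names like build_sampling and value_error_to_400 are illustrative, not the actual helpers in the repo:

```python
import asyncio

from fastapi.responses import JSONResponse


async def build_sampling(request, tokenizer, max_req_total_len):
    # Cheap pre-tokenize rejection: a token rarely spans more than ~8 chars,
    # so a prompt longer than 8 * max_req_total_len chars cannot fit anyway.
    if len(request.prompt) > 8 * max_req_total_len:
        raise ValueError("the input prompt is too long")

    # Tokenize off the event loop so long inputs don't stall other requests.
    prompt_ids = await asyncio.to_thread(tokenizer.encode, request.prompt)

    # max_tokens unset -> -1 sentinel -> full remaining budget.
    max_new_tokens = -1 if request.max_tokens is None else request.max_tokens
    if max_new_tokens == -1:
        max_new_tokens = max_req_total_len - len(prompt_ids)
    return prompt_ids, max_new_tokens


def value_error_to_400(e: ValueError) -> JSONResponse:
    # Bad input becomes HTTP 400 with the OpenAI-style error envelope.
    return JSONResponse(
        status_code=400,
        content={
            "error": {"message": str(e), "type": "invalid_request_error", "param": None, "code": 400},
        },
    )
```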

Misc

  • Replace gunicorn's --access-logfile - with a FastAPI access-log middleware (avoids double-logging once the middleware is in).
  • Drop unused Function.response field that was leaking <response>null</response> into chat-template renders (added ~7 tokens per tool and caused prompt drift relative to other engines).
  • Alias assistant.reasoning -> reasoning_content for Qwen3 templates so OpenRouter-style replays don't render as empty <think></think> blocks.
  • New chat templates under test/chat_template/ for Qwen3.5 fixed and vela-alpha.
  • End-to-end smoke test covering every HTTP endpoint (test/test_api/test_all_endpoints.py).
  • litellm declared as extras_require['anthropic'] (only needed when serving /v1/messages).

What's intentionally NOT in this PR

These also touch nominally-API-adjacent files but are tied to other features and will land separately:

  • ViT memory / token-budget admission (multimodal)
  • Radix cache redesign / hybrid cache hit rate work
  • CPU mamba cache offload
  • DP+EP / qwen3.5-moe layer infer
  • OOM probe (LIGHTLLM_CHECK_OOM=1)
  • Logging refactor (colored output, windowed cache stats)
  • Prometheus metric label restructure (adds model_name label — backward-incompat, needs separate review)

Test plan

  • All 18 modified Python files pass black 21.12b0 (the version configured by .pre-commit-config.yaml)
  • flake8 6.1.0 ruleset clean (the two F824 warnings on this branch pre-exist on main and aren't enforced by the configured ruleset)
  • CI: full test suite
  • Smoke test new endpoints: /v1/messages, /v1/responses
  • Verify SSE stream contract (role-only initial chunk, data:[DONE] terminator) against an OpenAI SDK client
  • Verify error envelope format against an OpenAI SDK client
  • Verify max_tokens unset behavior allows full max_req_total_len - prompt_tokens budget

Note: pre-commit hooks did not run during commit because the project's pre-commit virtualenv install fails on this shared storage (flock errno=2). Code was verified by running black 21.12b0 directly (per-file, to bypass an unrelated asyncio incompatibility with Python 3.12) and the flake8 6.1.0 ruleset. Reviewers should re-run the hooks in their own environment to double-check.

Extracted from qwen35_stable_chat_template branch into a focused PR
against main, separated from unrelated radix cache / mamba / DP-EP work
that was bundled in the same branch.

New endpoints:
- /v1/messages (Anthropic Messages compat): extra_body field forwarding,
  tool_use stream re-open on interleaved deltas, better missing-litellm error
- /v1/responses (OpenAI Responses API compat, stateless lifecycle)

OpenAI compatibility:
- Wrap error responses in {error:{message,type,param,code}} envelope
- SSE: role-only initial chunk + data:[DONE] terminator
- Stream reasoning tokens immediately, flush buffer on truncation
- reasoning_effort field, prompt_tokens_details, ChatMessage.reasoning alias
- Default max_tokens=None so -1 sentinel falls through to budget

Behavior:
- max_new_tokens=-1 sentinel resolves to max_req_total_len - prompt_tokens
- Reject prompts whose char length > 8 * max_req_total_len pre-tokenize
- Catch ValueError -> HTTP 400 instead of 500 across all endpoints
- Run tokenizer.encode in executor to keep event loop responsive
- Re-inject prompt_cache_len into response metadata
- Tool name validation gated by LIGHTLLM_ENABLE_TOOL_NAME_CHECK env

Misc:
- Replace gunicorn --access-logfile with FastAPI middleware
- Drop unused Function.response that leaked <response>null</response>
  into chat templates
- Alias assistant.reasoning -> reasoning_content for Qwen3 templates
- Qwen3.5 fixed and vela-alpha chat templates
- End-to-end smoke test covering every HTTP endpoint
- litellm declared as 'anthropic' extras_require

Hooks bypassed: pre-commit virtualenv install fails on this filesystem
(flock errno=2 in shared storage). Code verified clean by running
black 21.12b0 (hook-pinned version) and flake8 6.1.0 ruleset directly.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces the OpenAI Responses API compatibility layer, enabling LightLLM to support the newer OpenAI SDK response format through a new /v1/responses endpoint. Key enhancements include support for reasoning effort parameters, the inclusion of prompt cache details in usage reporting, and the implementation of non-blocking tokenization by offloading it to an executor. The update also refactors error handling across the OpenAI and Anthropic adapters to provide descriptive error envelopes and improves the handling of interleaved tool calls in streaming responses. Review feedback correctly identified missing uuid imports in several API modules, potential breaking changes due to the renaming of reasoning fields in model definitions, and type inconsistencies in streaming generators that could affect ASGI middleware.

"index": state["anthropic_index"],
"content_block": {
"type": "tool_use",
"id": state["id"] or f"toolu_{uuid.uuid4().hex[:24]}",

critical

The uuid module is used here to generate a unique tool use ID, but it does not appear to be imported in this file. This will cause a NameError at runtime when this branch is hit.

reasoning_parser_dict = {}

# Pre-generate a UUID-style request ID (matching the 36888 service format)
chat_completion_id = f"chatcmpl-{uuid.uuid4().hex}"

critical

The uuid module is used to generate chat_completion_id, but it is not imported in this file. This will lead to a NameError during execution.

role: Optional[str] = None
content: Optional[str] = None
reasoning_content: Optional[str] = None
reasoning: Optional[str] = None

high

Renaming reasoning_content to reasoning in ChatMessage is a breaking change for existing API clients. To maintain backward compatibility while supporting the new field name, both fields should be present, or reasoning should be added as an alias.

Suggested change
- reasoning: Optional[str] = None
+ reasoning: Optional[str] = None
+ reasoning_content: Optional[str] = None

content: Optional[str] = None
tool_calls: Optional[List[ToolCall]] = Field(default=None, examples=[None])
reasoning_content: Optional[str] = None
reasoning: Optional[str] = None

high

Renaming reasoning_content to reasoning in DeltaMessage is a breaking change for streaming clients. It is recommended to keep both fields to ensure compatibility with existing integrations.

Suggested change
- reasoning: Optional[str] = None
+ reasoning: Optional[str] = None
+ reasoning_content: Optional[str] = None

Comment thread on lightllm/server/api_openai.py (outdated)
Comment on lines +943 to +960
yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8")

yield "data: [DONE]\n\n".encode("utf-8")
usage = UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens),
)
usage_chunk = CompletionStreamResponse(
id=group_request_id,
created=created_time,
choices=[], # Empty choices array as per OpenAI spec
model=request.model,
usage=usage,
)
yield f"data: {usage_chunk.model_dump_json()}\n\n"

if request.stream_options and request.stream_options.include_usage:
usage = UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
)
usage_chunk = CompletionStreamResponse(
id=group_request_id,
created=created_time,
choices=[], # Empty choices array as per OpenAI spec
model=request.model,
usage=usage,
)
yield f"data: {usage_chunk.model_dump_json()}\n\n"
yield "data: [DONE]\n\n".encode("utf-8")

medium

The streaming generator in completions_impl yields a mix of bytes and strings, which is inconsistent and can cause issues with certain ASGI middlewares or clients. Specifically, line 943 and 960 yield encoded bytes, while line 958 yields a string. Additionally, stream_resp.dict() is deprecated in Pydantic v2; model_dump() should be used instead for consistency with other parts of the codebase.

Suggested change
yield ("data: " + json.dumps(stream_resp.dict(), ensure_ascii=False) + "\n\n").encode("utf-8")
yield "data: [DONE]\n\n".encode("utf-8")
usage = UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens),
)
usage_chunk = CompletionStreamResponse(
id=group_request_id,
created=created_time,
choices=[], # Empty choices array as per OpenAI spec
model=request.model,
usage=usage,
)
yield f"data: {usage_chunk.model_dump_json()}\n\n"
if request.stream_options and request.stream_options.include_usage:
usage = UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
)
usage_chunk = CompletionStreamResponse(
id=group_request_id,
created=created_time,
choices=[], # Empty choices array as per OpenAI spec
model=request.model,
usage=usage,
)
yield f"data: {usage_chunk.model_dump_json()}\n\n"
yield "data: [DONE]\n\n".encode("utf-8")
yield "data: " + stream_resp.model_dump_json() + "\n\n"
usage = UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens,
prompt_tokens_details=PromptTokensDetails(cached_tokens=cached_tokens),
)
usage_chunk = CompletionStreamResponse(
id=group_request_id,
created=created_time,
choices=[], # Empty choices array as per OpenAI spec
model=request.model,
usage=usage,
)
yield f"data: {usage_chunk.model_dump_json()}\n\n"
yield "data: [DONE]\n\n"

Comment on lines +62 to +70
async def _safe_stream_wrapper(stream_generator):
    """Wrap a streaming generator to catch ValueError (e.g. input too long) and yield an SSE error
    event instead of letting the exception propagate to Starlette which prints a long traceback."""
    try:
        async for item in stream_generator:
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        yield f"data: {error_data}\n\n"

medium

The _safe_stream_wrapper yields a string in the except block, but it may wrap generators that yield bytes (e.g., in api_tgi.py or completions_impl). Mixing types in a single stream can lead to unexpected behavior in downstream consumers. It is safer to yield bytes if the input stream is byte-oriented.
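
One way to keep the wrapper type-consistent, sketched from this suggestion (the follow-up commits below instead make the producers yield strings everywhere): mirror whatever the inner generator last yielded when emitting the error event.

```python
import json


async def _safe_stream_wrapper(stream_generator):
    last_was_bytes = False
    try:
        async for item in stream_generator:
            last_was_bytes = isinstance(item, (bytes, bytearray))
            yield item
    except ValueError as e:
        error_data = json.dumps({"error": {"message": str(e), "type": "invalid_request_error"}})
        event = f"data: {error_data}\n\n"
        # Match the type of the wrapped stream so ASGI consumers see one type only.
        yield event.encode("utf-8") if last_was_bytes else event
```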

sufubao and others added 14 commits April 29, 2026 19:24
Critical:
- ChatMessage / DeltaMessage: keep both `reasoning` and `reasoning_content`
  with a model_validator that auto-syncs between them, so legacy clients
  reading `reasoning_content` continue to work alongside the new
  OpenAI-aligned `reasoning` field (see the sketch below).
- Default LIGHTLLM_ENABLE_TOOL_NAME_CHECK=True (was False). Tool-name
  validation must be opt-out, not opt-in, to avoid silently accepting
  hallucinated tool calls.

Pre-existing polish:
- api_openai.py: stream_resp.dict() -> model_dump_json() (Pydantic v2)
- api_openai.py / api_tgi.py: yield strings consistently in SSE streams
  so _safe_stream_wrapper's string error path doesn't mid-stream-switch
  type with byte-yielding producers.
- httpserver/manager.py: asyncio.get_event_loop().run_in_executor(...)
  -> asyncio.to_thread(...) (Python 3.10+ recommended).
- api_openai.py: drop dead `nonlocal offset` (F824).
- build_prompt.py: drop dead `global tokenizer` (F824).
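
A minimal sketch of that ChatMessage shape, assuming Pydantic v2 (field names mirror the commit message; the validator body is illustrative, not the exact merged code):

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class ChatMessage(BaseModel):
    role: Optional[str] = None
    content: Optional[str] = None
    reasoning: Optional[str] = None          # new OpenAI-aligned field
    reasoning_content: Optional[str] = None  # legacy field kept for old clients

    @model_validator(mode="after")
    def _sync_reasoning_fields(self):
        # Whichever field the caller set, mirror it into the other one.
        if self.reasoning is None and self.reasoning_content is not None:
            self.reasoning = self.reasoning_content
        elif self.reasoning_content is None and self.reasoning is not None:
            self.reasoning_content = self.reasoning
        return self
```
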
Replace BaseHTTPMiddleware-based @app.middleware("http") access log with
a pure ASGI class middleware. Starlette's BaseHTTPMiddleware swallows
http.disconnect on the inner request's receive channel, so
Request.is_disconnected() never flipped True and the abort path at
HttpServerManager._wait_to_token_package never fired — inference kept
running until natural EOS / max_tokens after the client closed the
socket, leaving ghost requests on the GPU.
@shihaobai merged commit e28f984 into main Apr 29, 2026
1 check passed
@shihaobai deleted the api-fixes-extract branch April 29, 2026 15:09