feat(rime): add WebSocket streaming TTS support #5663
Adds opt-in WS streaming to the Rime TTS plugin via use_websocket=True. Pattern mirrors the Cartesia plugin: single-context JSON+base64 WS, ConnectionPool with mark_refreshed_on_get=True, blingfire sentence tokenizer, weakref.WeakSet for stream cleanup.
- New SynthesizeStream class with input/send/recv task split
- _connect_ws / _close_ws (eos shutdown, mirrors Deepgram)
- _model_params helper consolidates the arcana/mist option-walking shared between the WS query string and the HTTP body
- update_options invalidates the pool when the WS URL changes, computed via before/after _ws_url() diff
- Capabilities flips streaming and aligned_transcript on with the flag
- Routes to /ws3 only (mistv2 stays HTTP-only)
012c6ec to
50518df
tinalenguyen
left a comment
thanks for the pr, left a comment! also we do not have a rime plugin testing key at the moment, so we can omit the test file edit for now
    self,
    *,
    base_url: str = RIME_BASE_URL,
    ws_base_url: str = RIME_WS_BASE_URL,
i think it might be confusing to pass both the http and wss url, wdyt of keeping base_url and adding logic to detect the prefix to see if it's meant to be streaming? if the base url is not given then we can check via use_websocket (i think use_websocket could supersede the url)
I like that idea + have implemented it. This is resolved, I believe.
Address review feedback on PR livekit#5663:
- Single base_url param; default chosen from use_websocket. wss/ws prefix also infers streaming mode. Explicit use_websocket=True supersedes.
- Drop rime STREAM_TTS test entry (no upstream Rime test key available).
ChunkedStream POSTs to self._base_url, which is now the wss:// URL when use_websocket=True. Raise upfront with a symmetric message to stream() instead of letting an HTTP POST hit a wss endpoint.
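For readers following the thread, the resolution rule from the two commits above could look roughly like the sketch below. It is illustrative only, not the plugin's code: `resolve_transport`, `check_synthesize_allowed`, and both default URL constants are hypothetical names, and the default values are placeholders borrowed from URLs quoted elsewhere in this thread. It also assumes that `use_websocket=True` combined with an explicit http(s) URL keeps the URL as given and simply turns streaming on.

```python
from typing import Optional, Tuple

# Placeholder defaults (assumptions); the plugin's real constants may differ.
HTTP_DEFAULT = "https://users.rime.ai/v1/rime-tts"
WS_DEFAULT = "wss://users-ws.rime.ai"


def resolve_transport(
    base_url: Optional[str] = None, use_websocket: bool = False
) -> Tuple[str, bool]:
    """Return (base_url, streaming) for a hypothetical single-base_url constructor."""
    # A ws:// or wss:// prefix implies streaming mode.
    url_is_ws = base_url is not None and base_url.startswith(("ws://", "wss://"))
    # An explicit use_websocket=True turns streaming on regardless of the prefix.
    streaming = use_websocket or url_is_ws
    # With no URL given, pick the default that matches the chosen mode.
    if base_url is None:
        base_url = WS_DEFAULT if streaming else HTTP_DEFAULT
    return base_url, streaming


def check_synthesize_allowed(streaming: bool) -> None:
    """Fail upfront instead of letting an HTTP POST hit a WebSocket endpoint."""
    if streaming:
        raise RuntimeError(
            "synthesize() is HTTP-only; use stream() when WebSocket mode is enabled"
        )
```

With this rule, `resolve_transport()` stays on the HTTP default, while `resolve_transport(use_websocket=True)` or any `wss://` URL selects streaming.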
tinalenguyen
left a comment
i tested it out and the websocket path worked well, but when trying rime.TTS() i get:
aiohttp.client_exceptions.WSServerHandshakeError: 400, message='Invalid response status', url='https://users.rime.ai/v1/rime-tts/ws3?speaker=astra&modelId=arcana&audioFormat=pcm&samplingRate=22050&segment=bySentence&lang=eng'
let me know if you're seeing this as well!
Summary
Adds opt-in WebSocket streaming to the Rime TTS plugin via a new use_websocket=True constructor argument. The existing HTTP synthesize path is unchanged and remains the default. When enabled, the plugin sets streaming=True and aligned_transcript=True during construction, opens a long-lived pooled WebSocket to Rime's /ws3 endpoint, and emits word-level timestamps via push_timed_transcript.
New constructor arguments
- use_websocket: bool = False. Opts into the streaming path. Off by default so existing consumers see no behavior change.
- ws_base_url: str = "wss://users-ws.rime.ai". Overridable for self-hosted deployments, parallel to the existing base_url.
- segment: NotGivenOr[str] = NOT_GIVEN. Passed to Rime as a connect-time query param. Defaults to "bySentence" (server-side sentence buffering, mirrors StreamAdapter semantics). Pass "immediate" if the consumer is already feeding sentence-tokenized text and wants to skip server-side buffering.
- tokenizer: NotGivenOr[tokenize.SentenceTokenizer] = NOT_GIVEN. Overridable client-side sentence tokenizer. Defaults to tokenize.blingfire.SentenceTokenizer(). Mirrors the hook Cartesia exposes.
Implementation
The streaming class is similar to the implementation in the Cartesia plugin: single-context JSON-envelope WebSocket, base64 PCM audio frames, weakref.WeakSet[SynthesizeStream] for cleanup, utils.ConnectionPool[aiohttp.ClientWebSocketResponse] with max_session_duration=300 and mark_refreshed_on_get=True. Word timestamps are pushed as TimedString.
Connection lifecycle:
- _connect_ws opens the pooled WebSocket using the URL built from current options. Connect-time errors propagate to the outer _run exception block, which classifies aiohttp.ClientResponseError (covering WSServerHandshakeError) as APIStatusError with the HTTP status code preserved.
- _close_ws follows the graceful-shutdown pattern in the Deepgram plugin: send the eos operation, wait one second for the server's ack, suppress-and-log any send or recv errors during teardown so they don't mask the original cause that evicted the connection from the pool.
- update_options invalidates the pool when the WebSocket URL changes, computed via a before/after _ws_url() diff. This automatically handles model swaps, speaker swaps, and any per-model option that participates in the URL.
A small _model_params(opts) helper consolidates the per-model option walking shared between the WebSocket query string and the HTTP JSON body.
Routes through /ws3, which accepts every model the plugin supports (mistv2, mistv3, arcana). The older /ws2 endpoint is not wired in.
Validating
- update_options mid-session: model swap drops the existing pooled connection and reconnects with the new URL. Verified by observing two distinct _connect_ws calls and matching audio output.
- APIStatusError(status_code=401) with the server message preserved, rather than a generic APIConnectionError.
- tts.stream() followed by end_input() with no push_text() raises APIError immediately at the protocol layer rather than hanging on the receive timeout.
- max_session_duration window share the same WebSocket, with no new handshake.
- use_websocket=False (default), synthesize() behavior is identical to before; _run payload assembly continues to use the same _model_params helper plus HTTP-only fields (samplingRate, reduceLatency for mistv2).
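The before/after URL diff that drives pool invalidation in update_options can be sketched as follows. This is a simplified illustration, not the plugin's code: `TTSSketch`, `FakePool`, and `model_params` are hypothetical stand-ins for the plugin class, utils.ConnectionPool, and _model_params, and the query parameter names are taken from the handshake URL quoted in the review comment above.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def model_params(opts: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for _model_params: options shared by the WS URL and HTTP body."""
    params: Dict[str, Any] = {"speaker": opts["speaker"], "modelId": opts["model"]}
    if opts.get("lang") is not None:
        params["lang"] = opts["lang"]
    return params


class FakePool:
    """Stand-in for the connection pool; only counts invalidations."""

    def __init__(self) -> None:
        self.invalidations = 0

    def invalidate(self) -> None:
        self.invalidations += 1


class TTSSketch:
    def __init__(self, base_url: str, **opts: Any) -> None:
        self.base_url = base_url
        self.opts = dict(opts)
        self.pool = FakePool()

    def ws_url(self) -> str:
        # Connect-time options all live in the query string.
        query = {
            **model_params(self.opts),
            "audioFormat": "pcm",
            "samplingRate": self.opts["sample_rate"],
            "segment": self.opts["segment"],
        }
        return f"{self.base_url}/ws3?{urlencode(query)}"

    def update_options(self, **changes: Any) -> None:
        before = self.ws_url()
        self.opts.update(changes)
        # Only drop pooled connections when the URL actually changed.
        if self.ws_url() != before:
            self.pool.invalidate()
```

Because the comparison is on the final URL, any option that participates in the query string triggers a reconnect, while a no-op update (for example, setting the same speaker again) leaves the pooled connection alone.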