
Enhance Redis HA with client retry and connection health settings #34557

@Blackoutta

Description

Self Checks

  • I have read the Contributing Guide and Language Policy.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I am looking at Redis high availability behavior in the backend and found that the main Redis clients do not appear to be configured to retry failed calls during transient network issues or Redis failover windows.

The current implementation in api/extensions/ext_redis.py builds the main clients through:

  • redis.ConnectionPool(**redis_params) for standalone Redis
  • sentinel.master_for(...) for Sentinel
  • redis.Redis.from_url(...) / RedisCluster.from_url(...) for pub/sub

However, the shared Redis parameters currently only include auth/db/encoding/cache settings and do not pass retry-oriented options such as:

  • retry
  • retry_on_timeout
  • retry_on_error
  • socket_timeout
  • socket_connect_timeout
  • health_check_interval

Since Dify is using redis-py 7.3.0, I verified that:

  • clients created from redis.ConnectionPool(...) use connections with retry._retries == 0
  • Sentinel-managed connections also end up with retry._retries == 0
  • Redis.from_url(...) also results in connections with retry._retries == 0

So even though Sentinel/Cluster may help with topology discovery, individual failed commands are still surfaced immediately instead of being retried by the Redis client. This weakens HA behavior in practice, especially during:

  • master failover
  • brief network blips
  • half-open or stale socket reuse

There is already one local example in api/schedule/queue_monitor_task.py that sets socket_timeout, socket_connect_timeout, and health_check_interval. This suggests the problem is already recognized in one isolated path, but the settings are not applied consistently to the main backend Redis clients.

I would like to propose enhancing Redis HA by adding a shared retry/backoff and connection health policy for all backend Redis client construction paths in api/extensions/ext_redis.py.

2. Additional context or comments

Suggested direction:

  • define one shared retry policy in api/extensions/ext_redis.py, for example with redis.retry.Retry plus exponential backoff
  • pass that policy into standalone, Sentinel, Cluster, and pub/sub client creation
  • add conservative socket_timeout, socket_connect_timeout, and health_check_interval defaults
  • optionally expose the retry and timeout knobs in config so operators can tune them for different deployments

This would make Redis behavior more resilient without forcing every call site to implement its own retry logic.

It would also complement the existing redis_fallback() decorator, which is useful for best-effort paths but does not replace transport-level retry for transient failures.

3. Can you help us with this feature?

  • I am interested in contributing to this feature.
