Fix manul failover with CLIENT PAUSE/UNPAUSE#389
Conversation
|
In actual testing with the Controller and Kvrocks deployed in the same IDC, the system achieved a single-node write QPS of 10k/s with 20MB/s throughput. The write-stop duration (stall time) remained consistently under 10ms. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## unstable #389 +/- ##
============================================
+ Coverage 43.38% 49.67% +6.28%
============================================
Files 37 45 +8
Lines 2971 4103 +1132
============================================
+ Hits 1289 2038 +749
- Misses 1544 1839 +295
- Partials 138 226 +88
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@jihuayu All issues are fixed. Ready for another round of review. Thanks! |
|
@jihuayu @git-hulk By the way,I initially planned to use slave lags for sync checks, but found it adds a 1s delay because the master only updates these stats every second. To get real-time offsets, I had to query both master and slaves, which unfortunately complicates the logic. Why does Kvrocks cap the slave lag update frequency at 1s rather than providing real-time updates? Is it a performance trade-off to avoid overhead? |
|
Hi @Paragrf. Sorry for the delay! The PR looks good to me so far.
Yes, it is mainly a performance trade-off. The slave lag shown by the master is intended as a coarse-grained monitoring/health metric, so updating it once per second avoids adding extra work or locking on the replication hot path. |
@jihuayu @git-hulk OK, if we immediately trigger a _getack command on the server side upon receiving INFO replication, it will force the slave to report its current seq right away. Since INFO replication is currently only invoked by the operator and the controller, the performance impact on the server would be minimal. This also enables the monitoring system to capture real-time slave lags (the current parameter has a 1-second delay between the master and slave, which is virtually meaningless for monitoring purposes). Additionally, the controller's workflow during an active master switch can be simplified. What do you think of this idea? |
We shouldn't let the |
@git-hulk OK,How about adding an asynchronous probe in the controller to expose a slave lag metric for better monitoring accuracy? |
Yes, it would be better to do this in the controller. It can sync the lag between nodes before failover. |
OK, how about we merge this for now? It's already handled this way in the failover logic of this commit. The discussion here is only about the slave lag metric, which we can always add later. |
|
@Paragrf Great thanks for your contribution. |
Motivation
Implement Controller modifications to ensure master-replica data consistency during failover, aligning with server-side changes in apache/kvrocks#3377
Solution
Step 1 (Pause): Send CLIENT PAUSE WRITE from the controller to the current master.
Step 2 (Wait): Monitor the master-replica sequence gap until it hits zero, ensuring no data loss.
Step 3 (Metadata): Update the global topology metadata for the switchover.
Step 4 (Switch & Unpause): Promote the target and demote the old master; then explicitly call CLIENT UNPAUSE on the old master to restore its status.
Step 5 (Replicate): Reconfigure all other followers to sync from the new master.
Configuration Options
To prevent excessive blocking durations during periods of high write traffic, a maximum pause timeout has been introduced; the failover process will fail if the synchronization times out. The following parameters are added to the Controller failover configuration:
"force_on_timeout": false,
"sync_timeout_ms": 100,
"pause_timeout_ms": 500
Related Issues
Fixes #384