[dm/syncer]: Fix secondsBehindMaster spurious drop to 0 between DML batches #12581

mengxian-li wants to merge 2 commits into pingcap:master
Conversation
## Problem

`query-status` occasionally reported `secondsBehindMaster = 0` even when DM was significantly behind the upstream, before jumping back to the real lag value.

## Root Cause

`updateReplicationLagMetric()` runs every 100ms and computes lag from the minimum non-zero entry in `workerJobTSArray`. After each DML batch completes, `successFunc` calls `updateReplicationJobTS(nil, idx)`, which resets that worker's TS to 0. During the brief window between one batch finishing and the next starting, all worker TSes can be 0 simultaneously. If the cron fires in that window, `minTS` is 0, `lag` is computed as 0, and `secondsBehindMaster` is stored as 0: a false reading.

## Fix

Add a `lastNonZeroMinTS` field that remembers the most recent minimum event timestamp from when at least one worker was active. In `updateReplicationLagMetric`, fall back to this value when all current worker TSes are 0 (idle between batches) instead of reporting 0. Reset `lastNonZeroMinTS` in `reset()` alongside `secondsBehindMaster` so task restarts begin cleanly.

This is safe for the genuine catch-up case: when DM is truly caught up, the last processed event has a recent timestamp, so `calcReplicationLag(lastNonZeroMinTS)` returns a near-zero value, which is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestUpdateReplicationLagMetric to cover the four key cases of updateReplicationLagMetric:

1. No events seen yet (lastNonZeroMinTS = 0) → lag is 0.
2. Active worker TS → lag is computed from that TS and lastNonZeroMinTS is updated.
3. All workers reset to 0 between batches → lag does NOT drop to 0; the lastNonZeroMinTS fallback is used (the bug scenario).
4. A newer active TS arrives → lastNonZeroMinTS advances and lag decreases.

Also assert that reset() clears lastNonZeroMinTS alongside secondsBehindMaster in the existing TestRun reset assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
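The four cases above lend themselves to a table-driven structure. A hypothetical, simplified sketch (the real test lives in `syncer_test.go` and exercises DM's actual `Syncer` type; here `lagState` and `lag` are stand-ins, with time passed in explicitly for determinism):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lagState is a simplified stand-in for the syncer's lag state (hypothetical).
type lagState struct {
	workerTS         []int64
	lastNonZeroMinTS atomic.Int64
}

// lag computes replication lag at wall-clock second now, using the
// lastNonZeroMinTS fallback when all workers are idle.
func (s *lagState) lag(now int64) int64 {
	var minTS int64
	for _, ts := range s.workerTS {
		if ts != 0 && (minTS == 0 || ts < minTS) {
			minTS = ts
		}
	}
	if minTS != 0 {
		s.lastNonZeroMinTS.Store(minTS)
		return now - minTS
	}
	if last := s.lastNonZeroMinTS.Load(); last != 0 {
		return now - last
	}
	return 0
}

func main() {
	s := &lagState{workerTS: make([]int64, 2)}
	steps := []struct {
		name  string
		setup func()
		now   int64
		want  int64
	}{
		{"no events yet", func() {}, 100, 0},
		{"active worker", func() { s.workerTS[0] = 90 }, 100, 10},
		{"between batches", func() { s.workerTS[0] = 0 }, 100, 10}, // fallback, not 0
		{"newer event", func() { s.workerTS[1] = 98 }, 100, 2},
	}
	for _, st := range steps {
		st.setup()
		fmt.Printf("%s: lag=%d (want %d)\n", st.name, s.lag(st.now), st.want)
	}
}
```

The third step is the bug scenario: with both worker TSes reset to 0, the metric holds at 10 instead of dropping to 0.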
Code Review
This pull request addresses an issue where the replication lag metric would spuriously drop to zero between DML batches. It introduces a `lastNonZeroMinTS` field to the `Syncer` struct to store the most recent valid event timestamp, ensuring that `updateReplicationLagMetric` has a fallback value when all worker timestamps are reset. The changes include updates to the `Syncer` struct, its `reset` method, and the lag calculation logic, along with a new unit test to verify the fix. I have no feedback to provide.
/ok-to-test |
/retest |
```go
if minTS != int64(0) {
	s.lastNonZeroMinTS.Store(minTS)
	lag = s.calcReplicationLag(minTS)
} else if lastTS := s.lastNonZeroMinTS.Load(); lastTS != int64(0) {
	// All workers are between batches (their TSes were reset to 0 after the
	// last batch completed). Use the most recent known event timestamp so we
	// don't emit a spurious drop to 0 while no worker is actively processing.
	lag = s.calcReplicationLag(lastTS)
}
```
There is an edge case to consider: when DM is actually caught up and the upstream is simply idle, `lastNonZeroMinTS` stays fixed at the last event's timestamp while the wall clock advances. In that state `calcReplicationLag(lastTS)` keeps growing with wall-clock time, so `secondsBehindMaster` starts reporting artificial lag even though there is no backlog.
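The reviewer's concern can be illustrated numerically (a hypothetical sketch, with `calcReplicationLag` simplified to now minus event timestamp and the time points passed in explicitly):

```go
package main

import "fmt"

// calcReplicationLag simplified: lag = current time - event timestamp (seconds).
func calcReplicationLag(now, ts int64) int64 {
	return now - ts
}

func main() {
	// DM processed its last event at t=95 and is fully caught up,
	// but the upstream has been idle since then. lastTS never advances.
	const lastTS = int64(95)

	// At t=100, the fallback reports 5s of lag: plausible.
	fmt.Println(calcReplicationLag(100, lastTS)) // 5

	// One idle minute later, the same stale lastTS yields 65s of
	// "lag" despite there being no backlog at all.
	fmt.Println(calcReplicationLag(160, lastTS)) // 65
}
```

This is why the fallback is only clearly safe while new events keep arriving; a long-idle upstream makes the held value grow stale.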
Problem

Ref: #12580

`secondsBehindMaster` (replication lag) spuriously drops to 0 between DML batches in the syncer. Each DM worker resets its job timestamp to 0 after completing a batch. `updateReplicationLagMetric` computes the minimum non-zero timestamp across all workers, but when all workers finish a batch simultaneously, every worker TS is 0, causing `minTS` to be 0 and reporting a false lag of 0 before the next batch starts.

Fix

Introduce a `lastNonZeroMinTS` atomic field that tracks the last known minimum event timestamp when at least one worker was active. In `updateReplicationLagMetric`, when all worker TSes are 0 (i.e. between batches), fall back to `lastNonZeroMinTS` instead of emitting 0, so the metric holds its last valid value until a new batch begins.

Tests

Added unit tests to `syncer_test.go` covering the four cases described above, and extended the existing assertions to check that `reset()` clears `lastNonZeroMinTS`.

🤖 Generated with Claude Code