[dm/syncer]: Fix secondsBehindMaster spurious drop to 0 between DML batches #12581

mengxian-li wants to merge 2 commits into pingcap:master
Conversation
## Problem

`query-status` occasionally reported `secondsBehindMaster = 0` even when DM was significantly behind the upstream, before jumping back to the real lag value.

## Root Cause

`updateReplicationLagMetric()` runs every 100ms and computes lag from the minimum non-zero entry in `workerJobTSArray`. After each DML batch completes, `successFunc` calls `updateReplicationJobTS(nil, idx)`, which resets that worker's TS to 0. During the brief window between one batch finishing and the next starting, all worker TSes can be 0 simultaneously. If the cron fires in that window, `minTS` is 0, `lag` is computed as 0, and `secondsBehindMaster` is stored as 0: a false reading.

## Fix

Add a `lastNonZeroMinTS` field that remembers the most recent minimum event timestamp from when at least one worker was active. In `updateReplicationLagMetric`, fall back to this value when all current worker TSes are 0 (idle between batches) instead of reporting 0. Reset `lastNonZeroMinTS` in `reset()` alongside `secondsBehindMaster` so task restarts begin cleanly.

This is safe for the genuine catch-up case: when DM is truly caught up, the last processed event has a recent timestamp, so `calcReplicationLag(lastNonZeroMinTS)` returns a near-zero value, which is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestUpdateReplicationLagMetric to cover the four key cases of updateReplicationLagMetric:

1. No events seen yet (lastNonZeroMinTS = 0) → lag is 0.
2. Active worker TS → lag is computed from that TS and lastNonZeroMinTS is updated.
3. All workers reset to 0 between batches → lag does NOT drop to 0; the lastNonZeroMinTS fallback is used (the bug scenario).
4. A newer active TS arrives → lastNonZeroMinTS advances and lag decreases.

Also assert that reset() clears lastNonZeroMinTS alongside secondsBehindMaster in the existing TestRun reset assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
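The four cases above lend themselves to a table-driven structure. A hypothetical, simplified sketch (the real test lives in `syncer_test.go` and exercises DM's actual `Syncer` type; here `lagState` and `lag` are stand-ins, with time passed in explicitly for determinism):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lagState is a simplified stand-in for the syncer's lag state (hypothetical).
type lagState struct {
	workerTS         []int64
	lastNonZeroMinTS atomic.Int64
}

// lag computes replication lag at wall-clock second now, using the
// lastNonZeroMinTS fallback when all workers are idle.
func (s *lagState) lag(now int64) int64 {
	var minTS int64
	for _, ts := range s.workerTS {
		if ts != 0 && (minTS == 0 || ts < minTS) {
			minTS = ts
		}
	}
	if minTS != 0 {
		s.lastNonZeroMinTS.Store(minTS)
		return now - minTS
	}
	if last := s.lastNonZeroMinTS.Load(); last != 0 {
		return now - last
	}
	return 0
}

func main() {
	s := &lagState{workerTS: make([]int64, 2)}
	steps := []struct {
		name  string
		setup func()
		now   int64
		want  int64
	}{
		{"no events yet", func() {}, 100, 0},
		{"active worker", func() { s.workerTS[0] = 90 }, 100, 10},
		{"between batches", func() { s.workerTS[0] = 0 }, 100, 10}, // fallback, not 0
		{"newer event", func() { s.workerTS[1] = 98 }, 100, 2},
	}
	for _, st := range steps {
		st.setup()
		fmt.Printf("%s: lag=%d (want %d)\n", st.name, s.lag(st.now), st.want)
	}
}
```

The third step is the bug scenario: with both worker TSes reset to 0, the metric holds at 10 instead of dropping to 0.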
Code Review
This pull request addresses an issue where the replication lag metric would spuriously drop to zero between DML batches. It introduces a `lastNonZeroMinTS` field to the `Syncer` struct to store the most recent valid event timestamp, ensuring that `updateReplicationLagMetric` has a fallback value when all worker timestamps are reset. The changes include updates to the `Syncer` struct, its `reset` method, and the lag calculation logic, along with a new unit test to verify the fix. I have no feedback to provide.
/ok-to-test |
/retest |
```go
if minTS != int64(0) {
	s.lastNonZeroMinTS.Store(minTS)
	lag = s.calcReplicationLag(minTS)
} else if lastTS := s.lastNonZeroMinTS.Load(); lastTS != int64(0) {
	// All workers are between batches (their TSes were reset to 0 after the
	// last batch completed). Use the most recent known event timestamp so we
	// don't emit a spurious drop to 0 while no worker is actively processing.
	lag = s.calcReplicationLag(lastTS)
}
```
There is an edge case to consider: when DM is actually caught up and the upstream is simply idle, `lastNonZeroMinTS` stays fixed at the last event's timestamp while the wall clock advances. In that state `calcReplicationLag(lastTS)` keeps growing with wall-clock time, so `secondsBehindMaster` starts reporting artificial lag even though there is no backlog.
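The reviewer's concern can be illustrated numerically (a hypothetical sketch, with `calcReplicationLag` simplified to now minus event timestamp and the time points passed in explicitly):

```go
package main

import "fmt"

// calcReplicationLag simplified: lag = current time - event timestamp (seconds).
func calcReplicationLag(now, ts int64) int64 {
	return now - ts
}

func main() {
	// DM processed its last event at t=95 and is fully caught up,
	// but the upstream has been idle since then. lastTS never advances.
	const lastTS = int64(95)

	// At t=100, the fallback reports 5s of lag: plausible.
	fmt.Println(calcReplicationLag(100, lastTS)) // 5

	// One idle minute later, the same stale lastTS yields 65s of
	// "lag" despite there being no backlog at all.
	fmt.Println(calcReplicationLag(160, lastTS)) // 65
}
```

This is why the fallback is only clearly safe while new events keep arriving; a long-idle upstream makes the held value grow stale.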
Problem

Ref: #12580

`secondsBehindMaster` (replication lag) spuriously drops to 0 between DML batches in the syncer. Each DM worker resets its job timestamp to 0 after completing a batch. `updateReplicationLagMetric` computes the minimum non-zero timestamp across all workers, but when all workers finish a batch simultaneously, every worker TS is 0, causing `minTS` to be 0 and reporting a false lag of 0 before the next batch starts.

Fix

Introduce a `lastNonZeroMinTS` atomic field that tracks the last known minimum event timestamp when at least one worker was active. In `updateReplicationLagMetric`, when all worker TSes are 0 (i.e. between batches), fall back to `lastNonZeroMinTS` instead of emitting 0, so the metric holds its last valid value until a new batch begins.

Tests

Added unit tests to `syncer_test.go` covering the four cases described above, and extended the existing assertions to check that `reset()` clears `lastNonZeroMinTS`.

🤖 Generated with Claude Code