[dm/syncer]: Fix secondsBehindMaster spurious drop to 0 between DML batches #12581

Open
mengxian-li wants to merge 2 commits into pingcap:master from mengxian-li:fix/dm-seconds-behind-master-spurious-zero-upstream

Conversation

@mengxian-li

@mengxian-li mengxian-li commented Apr 1, 2026

Problem

Ref: #12580

secondsBehindMaster (replication lag) spuriously drops to 0 between DML batches in the syncer.

Each DM worker resets its job timestamp to 0 after completing a batch. updateReplicationLagMetric computes the minimum non-zero timestamp across all workers — but when all workers finish a batch simultaneously, every worker TS is 0, causing minTS to be 0 and reporting a false lag of 0 before the next batch starts.

Fix

Introduce lastNonZeroMinTS atomic field that tracks the last known minimum event timestamp when at least one worker was active. In updateReplicationLagMetric, when all worker TSes are 0 (i.e. between batches), fall back to lastNonZeroMinTS instead of emitting 0, so the metric holds its last valid value until a new batch begins.

Tests

Added unit tests to syncer_test.go covering:

  • Normal lag reporting when workers are active
  • Fallback to last non-zero TS when all workers are between batches
  • Reset behavior on reset()

🤖 Generated with Claude Code

mengxian-li and others added 2 commits April 1, 2026 10:43
…atches

## Problem

`query-status` occasionally reported `secondsBehindMaster = 0` even when
DM was significantly behind the upstream, before jumping back to the real
lag value.

## Root Cause

`updateReplicationLagMetric()` runs every 100ms and computes lag from the
minimum non-zero entry in `workerJobTSArray`. After each DML batch
completes, `successFunc` calls `updateReplicationJobTS(nil, idx)` which
resets that worker's TS to 0. During the brief window between one batch
finishing and the next starting, all worker TSes can be 0 simultaneously.
If the cron fires in that window, `minTS` is 0, `lag` is computed as 0,
and `secondsBehindMaster` is stored as 0 — a false reading.
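
The race described above can be reproduced in isolation. The following is a minimal sketch, not the actual tiflow code: `minNonZeroTS` and `lagBeforeFix` are hypothetical names, the lag is assumed to be a plain `now - minTS` subtraction, and `now` is passed in as Unix seconds so the result is deterministic.

```go
package main

import "fmt"

// minNonZeroTS returns the smallest non-zero timestamp in workerTS,
// or 0 when every entry is 0 (all workers are idle).
func minNonZeroTS(workerTS []int64) int64 {
	var minTS int64
	for _, ts := range workerTS {
		if ts == 0 {
			continue // this worker has no job in flight
		}
		if minTS == 0 || ts < minTS {
			minTS = ts
		}
	}
	return minTS
}

// lagBeforeFix models the pre-fix behavior: when no worker has a
// timestamp, the reported lag is 0 regardless of the real backlog.
func lagBeforeFix(workerTS []int64, now int64) int64 {
	minTS := minNonZeroTS(workerTS)
	if minTS == 0 {
		return 0
	}
	return now - minTS
}

func main() {
	now := int64(1000)
	// At least one worker is busy: lag is measured from the oldest event.
	fmt.Println(lagBeforeFix([]int64{0, 940, 970}, now))
	// Every worker has just finished a batch: lag collapses to 0.
	fmt.Println(lagBeforeFix([]int64{0, 0, 0}, now))
}
```

If the periodic update happens to observe the second state, the metric reads 0 even though the first state, moments earlier, showed a 60-second lag.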

## Fix

Add a `lastNonZeroMinTS` field that remembers the most recent minimum
event timestamp from when at least one worker was active. In
`updateReplicationLagMetric`, fall back to this value when all current
worker TSes are 0 (idle between batches) instead of reporting 0.

Reset `lastNonZeroMinTS` in `reset()` alongside `secondsBehindMaster` so
task restarts begin cleanly.

This is safe for the genuine catch-up case: when DM is truly caught up,
the last processed event has a recent timestamp, so
`calcReplicationLag(lastNonZeroMinTS)` returns a near-zero value, which
is correct.
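
The fallback can be sketched as follows. This is an illustration under assumptions rather than the patch itself: `lagTracker` is a hypothetical stand-in for the relevant `Syncer` fields, the standard library's `sync/atomic.Int64` (Go 1.19+) is used in place of whatever atomic type the real struct holds, and the lag is again assumed to be `now - ts`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lagTracker is a hypothetical stand-in for the Syncer fields involved.
type lagTracker struct {
	workerTS         []int64      // per-worker job timestamps, 0 when idle
	lastNonZeroMinTS atomic.Int64 // last minimum seen while a worker was active
}

// lag returns the lag against the oldest in-flight event, falling back to
// the last remembered timestamp when every worker is between batches.
func (lt *lagTracker) lag(now int64) int64 {
	var minTS int64
	for _, ts := range lt.workerTS {
		if ts != 0 && (minTS == 0 || ts < minTS) {
			minTS = ts
		}
	}
	if minTS != 0 {
		lt.lastNonZeroMinTS.Store(minTS)
		return now - minTS
	}
	if lastTS := lt.lastNonZeroMinTS.Load(); lastTS != 0 {
		return now - lastTS // all workers idle: hold the last known value
	}
	return 0 // no event has been seen yet
}

// reset clears the worker timestamps and the remembered minimum.
func (lt *lagTracker) reset() {
	for i := range lt.workerTS {
		lt.workerTS[i] = 0
	}
	lt.lastNonZeroMinTS.Store(0)
}

func main() {
	tr := &lagTracker{workerTS: []int64{0, 0}}
	fmt.Println(tr.lag(1000)) // no events seen yet
	tr.workerTS[0] = 940
	fmt.Println(tr.lag(1000)) // active worker
	tr.workerTS[0] = 0
	fmt.Println(tr.lag(1000)) // between batches: fallback is used
	tr.reset()
	fmt.Println(tr.lag(1000)) // after reset
}
```

Note that the fallback branch measures against a fixed timestamp, so its result depends on how long the workers stay idle.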

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestUpdateReplicationLagMetric to cover the four key cases of
updateReplicationLagMetric:

1. No events seen yet (lastNonZeroMinTS=0) → lag is 0.
2. Active worker TS → lag computed from that TS and lastNonZeroMinTS is
   updated.
3. All workers reset to 0 between batches → lag does NOT drop to 0,
   lastNonZeroMinTS fallback is used (the bug scenario).
4. Newer active TS arrives → lastNonZeroMinTS advances and lag decreases.

Also assert that reset() clears lastNonZeroMinTS alongside
secondsBehindMaster in the existing TestRun reset assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 1, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/dm Issues or PRs related to DM. labels Apr 1, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benjamin2037 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 1, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 1, 2026

Hi @mengxian-li. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 1, 2026

Welcome @mengxian-li!

It looks like this is your first PR to pingcap/tiflow 🎉.

I'm the bot to help you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to pingcap/tiflow. 😃

@pingcap-cla-assistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


mengxian-li seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 1, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses an issue where the replication lag metric would spuriously drop to zero between DML batches. It introduces a lastNonZeroMinTS field to the Syncer struct to store the most recent valid event timestamp, ensuring that updateReplicationLagMetric has a fallback value when all worker timestamps are reset. The changes include updates to the Syncer struct, its reset method, and the lag calculation logic, along with a new unit test to verify the fix. I have no feedback to provide.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 2, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@lance6716
Contributor

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 3, 2026
@GMHDBJD
Contributor

GMHDBJD commented Apr 10, 2026

/retest

1 similar comment
@GMHDBJD
Contributor

GMHDBJD commented Apr 17, 2026

/retest

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 17, 2026

@mengxian-li: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-dm-integration-test-next-gen | 40b80de | link | false | `/test pull-dm-integration-test-next-gen` |
| pull-dm-integration-test | 40b80de | link | true | `/test pull-dm-integration-test` |

Full PR test history. Your PR dashboard.


Comment thread dm/syncer/syncer.go
Comment on lines 972 to +979
if minTS != int64(0) {
	s.lastNonZeroMinTS.Store(minTS)
	lag = s.calcReplicationLag(minTS)
} else if lastTS := s.lastNonZeroMinTS.Load(); lastTS != int64(0) {
	// All workers are between batches (their TSes were reset to 0 after the
	// last batch completed). Use the most recent known event timestamp so we
	// don't emit a spurious drop to 0 while no worker is actively processing.
	lag = s.calcReplicationLag(lastTS)
Contributor


Consider the case where DM is actually caught up and the upstream is simply idle. In that state, calcReplicationLag(lastTS) keeps growing with wall-clock time, so secondsBehindMaster starts reporting artificial lag even though there is no backlog.
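
The concern can be illustrated with a small sketch. It assumes, as a simplification not confirmed by the diff, that calcReplicationLag is a plain subtraction of the event timestamp from the current time; `fallbackLag` is a hypothetical name for the fallback branch, and all timestamps are Unix seconds.

```go
package main

import "fmt"

// fallbackLag models the fallback branch: lag is measured against the last
// remembered event timestamp, which stops advancing once the upstream is idle.
func fallbackLag(lastTS, now int64) int64 {
	if lastTS == 0 {
		return 0 // nothing remembered, nothing to report
	}
	return now - lastTS
}

func main() {
	lastTS := int64(1000) // last event applied; DM is fully caught up
	// The upstream writes nothing more, but the clock keeps moving.
	for _, now := range []int64{1000, 1060, 1600} {
		fmt.Printf("idle for %ds, reported lag %ds\n", now-lastTS, fallbackLag(lastTS, now))
	}
}
```

Under this assumption the reported lag equals the idle time, so an idle but fully synced task becomes indistinguishable from one that is falling behind.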


Labels

  • area/dm: Issues or PRs related to DM.
  • contribution: This PR is from a community contributor.
  • do-not-merge/needs-linked-issue
  • do-not-merge/release-note-label-needed: Indicates that a PR should not merge because it's missing one of the release note labels.
  • first-time-contributor: Indicates that the PR was contributed by an external member and is a first-time contributor.
  • ok-to-test: Indicates a PR is ready to be tested.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.


3 participants