
dm/tests: stabilize G11 on release-7.5 (ha_cases, lightning_mode)#12617

Open
joechenrh wants to merge 2 commits into pingcap:release-7.5 from joechenrh:fix/release-7.5-dm-ci-g11

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Apr 21, 2026

Summary

G11 of pull_dm_integration_test on release-7.5 has been failing across builds 1022–1042, blocking PR merges. Investigation identified two pre-existing races that slower runners started exposing about 20 days ago. This PR fixes one at the source and mitigates the other at the test level.

Changes

1. dm/pkg/election/election.go — source fix for the ha_cases hang

Session.Close() issues an etcd Revoke bounded by the lease TTL (60s by default). The session was created without concurrency.WithContext, so its context defaulted to e.cli.Ctx() (the client's lifetime). When Server.Close() runs and e.cancel() fires, the Revoke call therefore still can't be cancelled; it has to wait out its own 60s timeout.

In ha_cases cleanup, pkill -hup dm-master.test hits all three masters concurrently. Each non-leader proposes its LeaseRevoke via raft, but the peers race to call s.etcd.Close() and quorum dies before the proposal commits. The RPC then burns the full 60s per process, exceeding the 120s wait_process_exit deadline and failing the whole G11 group (observed on builds 1022, 1033, 1037, and 1038; the victim is always a non-leader, master2 or master3).

Fix: pass concurrency.WithContext(ctx) so e.cancel() aborts Revoke immediately. The lease is abandoned and expires by TTL, which the etcd docs call out as the acceptable shutdown tradeoff. The only other closeSession call site (session natural expiry at election.go:232) already has a dead lease on the server, so Revoke returns fast regardless of ctx binding.
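
A minimal sketch of the change, assuming the ctx/cancel pair owned by the election as in dm/pkg/election (the helper name newSession is illustrative, not the actual election.go code):

```go
package election

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// newSession is an illustrative helper, not the actual election.go code.
// Binding the session to the election's ctx is the whole fix: once
// e.cancel() fires during Server.Close(), the lease Revoke inside
// Session.Close() is aborted immediately instead of blocking for the
// full 60s TTL; the lease is abandoned and expires server-side.
func newSession(ctx context.Context, cli *clientv3.Client, ttl int) (*concurrency.Session, error) {
	return concurrency.NewSession(cli,
		concurrency.WithTTL(ttl),
		// Without this option the session inherits cli.Ctx() (client
		// lifetime), which e.cancel() cannot reach.
		concurrency.WithContext(ctx),
	)
}
```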

2. dm/tests/lightning_mode/run.sh — pre-create lightning_metadata tables

Two dm-workers each spawn a TiDB Lightning instance, and both concurrently bootstrap lightning_metadata.task_meta_v2 / table_meta via CREATE TABLE IF NOT EXISTS. Lightning opens a *sql.DB without pinning a connection (br/pkg/lightning/importer/meta_manager.go:40-59 on tidb release-7.5). One instance's CREATE bumps schemaVersion on node X; the other's CREATE on node Y short-circuits (the table is already in InfoSchema, so no DDL is scheduled and no version bump happens), and its follow-up SELECT on a pooled connection with a stale schema returns Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist. SQLWithRetry doesn't retry 1146 (br/pkg/lightning/common/retry.go:101-136), so the task fails permanently. Observed once, in build 1039.

The proper fix belongs in tidb/br (pin a *sql.Conn across Init + CheckTaskExist) and should be filed separately; a sketch of that pattern follows. As a DM-side mitigation, pre-creating both tables turns the two CREATEs into no-ops that never hit the DDL queue, closing the race window entirely.
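
For reference, a hedged sketch of what that br-side fix could look like; the function name and the truncated table definition below are illustrative, not the actual meta_manager code:

```go
package main

import (
	"context"
	"database/sql"
)

// bootstrapTaskMeta sketches the pinned-connection pattern. Running the
// bootstrap CREATE and the follow-up SELECT on one *sql.Conn (instead of
// the pooled *sql.DB) means the SELECT can never land on a pooled
// connection whose cached schema predates the CREATE, closing the 1146
// window. The table definition is truncated for illustration.
func bootstrapTaskMeta(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx) // pin a single connection from the pool
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `
		CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (
			task_id BIGINT(20) UNSIGNED NOT NULL PRIMARY KEY
		)`); err != nil {
		return err
	}
	// Same connection, hence a schema view no older than the CREATE above.
	var n int
	return conn.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM lightning_metadata.task_meta_v2").Scan(&n)
}
```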

Not fixed here

A separate problem: G11 packs 37 cases under a 50-min stage timeout while other groups finish in ~17 min. Builds 1025/1026/1032/1035/1036/1040/1041 were "Terminated" by the timeout at whichever test happened to be running. Recommend a follow-up PR to split G11 — out of scope for this CI unblock.

Test plan

  • /test pull-dm-integration-test passes
  • Re-run pull_dm_integration_test multiple times to confirm G11 stability (the two races should no longer trigger; the timeout problem remains until G11 is split)

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first be LGTMed and approved by the reviewers.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. After it has the lgtm and approved labels, please wait for cherry-pick merging approval from the triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/dm Issues or PRs related to DM. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances test stability by implementing a SIGKILL escalation for hung processes and pre-creating Lightning metadata tables to avoid concurrent DDL race conditions. Feedback highlights a potential mismatch in the metadata schema name, the need for database cleanup, and several improvements for process management and SQL schema definitions, including the use of pgrep and fixing non-standard data types.

Comment on lines +62 to +63
run_sql_tidb "CREATE DATABASE IF NOT EXISTS lightning_metadata;"
run_sql_tidb "CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (

high

There is a likely discrepancy in the schema name. While this code creates tables in lightning_metadata, line 97 of this script expects conflict results in dm_meta_test. In DM, the Lightning metadata schema is typically configured as {meta_schema}_{task_name} (e.g., dm_meta_test). If the workers are using dm_meta_test, pre-creating tables in lightning_metadata will not resolve the race condition. Please verify the actual schema name used by the task. Additionally, the lightning_metadata database is created but never dropped; it should be added to the cleanup logic.

Comment thread dm/tests/_utils/wait_process_exit Outdated
# Escalate to SIGKILL so the test can proceed; the hang is reported above for
# later investigation instead of failing the whole G11 group.
pkill -KILL -x "$process" 2>/dev/null || true
ps aux | grep "$process" | grep -v 'grep' | grep -v 'wait_process_exit' | awk '{print $2}' | xargs -r kill -KILL 2>/dev/null || true

medium

The manual process killing logic using ps aux | grep is redundant and potentially unsafe. pkill -KILL -x "$process" (on line 24) already performs an exact match on the process name. The manual grep without exact matching (and with the dot in $process acting as a wildcard) could lead to false positives by matching unintended processes. It is recommended to remove this line.

Comment thread dm/tests/_utils/wait_process_exit Outdated

WAIT_COUNT=0
while [ $WAIT_COUNT -lt 10 ]; do
ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1

medium

Using ps | grep to check for process existence is fragile and prone to false positives. Since pkill is used in this script, pgrep -x should be available and is a much more reliable way to check for an exact process name match.

Suggested change
ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1
pgrep -x "$process" >/dev/null 2>&1

Comment on lines +75 to +76
task_id BIGINT(20) UNSIGNED,
table_id BIGINT(64) NOT NULL,

medium

Two issues in the table definition:

  1. task_id is part of the PRIMARY KEY and should be explicitly defined as NOT NULL for consistency with the task_meta_v2 definition.
  2. BIGINT(64) is a non-standard display width for BIGINT. It appears to be a typo for BIGINT or BIGINT(20).
Suggested change
task_id BIGINT(20) UNSIGNED,
table_id BIGINT(64) NOT NULL,
task_id BIGINT(20) UNSIGNED NOT NULL,
table_id BIGINT NOT NULL,

…ke aborts fast

Session.Close() issues an etcd Revoke bounded by lease TTL (60s by default).
With the old code the session's ctx defaulted to cli.Ctx() (client lifetime),
so during Server.Close() the Revoke call could block up to 60s per master
when quorum collapses mid-shutdown. In ha_cases cleanup (pkill -hup on all
three masters at once), this made one non-leader routinely exceed the
wait_process_exit 120s deadline and fail the whole G11 group.

Pass concurrency.WithContext(ctx) so e.cancel() from Server.Close()
immediately aborts Revoke; the lease is abandoned and naturally expires
via TTL, which is the documented acceptable tradeoff on shutdown.

The only other closeSession call site is session-natural-expiry at line 232
where the lease is already gone on the server, so Revoke returns fast
regardless of ctx binding.
…otstrap race

Two dm-workers each spawn a TiDB Lightning instance in physical import mode
and concurrently bootstrap lightning_metadata.task_meta_v2 / table_meta via
CREATE TABLE IF NOT EXISTS. Lightning's dbMetaMgrBuilder.Init uses the raw
*sql.DB pool without pinning a conn (br/pkg/lightning/importer/meta_manager.go
on tidb release-7.5), so CREATE and the follow-up SELECT can land on
different pooled connections with different InfoSchema caches. The result is
Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist on the
loser — which SQLWithRetry doesn't retry.

Pre-creating both tables before start-task turns both lightning CREATEs
into no-ops that never hit the DDL queue, closing the race window. The
proper fix (pinning a *sql.Conn in Lightning) belongs in tidb/br and should
be filed separately.
@joechenrh joechenrh force-pushed the fix/release-7.5-dm-ci-g11 branch from 2f51647 to ccb5aea on Apr 21, 2026 05:01
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                 Commit   Required  Rerun command
pull-verify               ccb5aea  true      /test pull-verify
pull-dm-integration-test  ccb5aea  true      /test pull-dm-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
