
dm/tests: stabilize G11 on release-7.5 (ha_cases, lightning_mode)#12617

Open
joechenrh wants to merge 2 commits into pingcap:release-7.5 from joechenrh:fix/release-7.5-dm-ci-g11

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Apr 21, 2026

Summary

G11 of pull_dm_integration_test on release-7.5 has been failing across builds 1022–1042, blocking PR merges. Investigation identified two pre-existing races that slower runners started exposing about 20 days ago. This PR fixes one at the source and mitigates the other at the test level.

Changes

1. dm/pkg/election/election.go — source fix for the ha_cases hang

Session.Close() issues an etcd Revoke bounded by the lease TTL (60s by default). The session was created without concurrency.WithContext, so its context defaulted to e.cli.Ctx() (the client's lifetime). When Server.Close() runs and e.cancel() fires, the Revoke call therefore still can't be cancelled; it has to wait out its own 60s timeout.

In ha_cases cleanup, pkill -hup dm-master.test hits all three masters concurrently. Each non-leader proposes its LeaseRevoke via raft, but the peers race to call s.etcd.Close() and quorum dies before the proposal commits. The RPC then burns the full 60s per process, exceeding the 120s wait_process_exit deadline and failing the whole G11 group (observed on builds 1022, 1033, 1037, and 1038; the victim is always a non-leader, master2 or master3).

Fix: pass concurrency.WithContext(ctx) so e.cancel() aborts Revoke immediately. The lease is abandoned and expires by TTL, which the etcd docs call out as the acceptable shutdown tradeoff. The only other closeSession call site (session natural expiry at election.go:232) already has a dead lease on the server, so Revoke returns fast regardless of ctx binding.
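
A minimal sketch of the change, assuming the ctx/cancel pair owned by the election as in dm/pkg/election (the helper name newSession is illustrative, not the actual election.go code):

```go
package election

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// newSession is an illustrative helper, not the actual election.go code.
// Binding the session to the election's ctx is the whole fix: once
// e.cancel() fires during Server.Close(), the lease Revoke inside
// Session.Close() is aborted immediately instead of blocking for the
// full 60s TTL; the lease is abandoned and expires server-side.
func newSession(ctx context.Context, cli *clientv3.Client, ttl int) (*concurrency.Session, error) {
	return concurrency.NewSession(cli,
		concurrency.WithTTL(ttl),
		// Without this option the session inherits cli.Ctx() (client
		// lifetime), which e.cancel() cannot reach.
		concurrency.WithContext(ctx),
	)
}
```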

2. dm/tests/lightning_mode/run.sh — pre-create lightning_metadata tables

Two dm-workers each spawn a TiDB Lightning instance, and both concurrently bootstrap lightning_metadata.task_meta_v2 / table_meta via CREATE TABLE IF NOT EXISTS. Lightning opens a *sql.DB without pinning a connection (br/pkg/lightning/importer/meta_manager.go:40-59 on tidb release-7.5). One instance's CREATE bumps schemaVersion on node X; the other's CREATE on node Y short-circuits (the table is already in InfoSchema, so no DDL is scheduled and no version bump happens), and its follow-up SELECT on a pooled connection with a stale schema returns Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist. SQLWithRetry doesn't retry 1146 (br/pkg/lightning/common/retry.go:101-136), so the task fails permanently. Observed once, in build 1039.

The proper fix belongs in tidb/br (pin a *sql.Conn across Init + CheckTaskExist) and should be filed separately; a sketch of that pattern follows. As a DM-side mitigation, pre-creating both tables turns the two CREATEs into no-ops that never hit the DDL queue, closing the race window entirely.
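
For reference, a hedged sketch of what that br-side fix could look like; the function name and the truncated table definition below are illustrative, not the actual meta_manager code:

```go
package main

import (
	"context"
	"database/sql"
)

// bootstrapTaskMeta sketches the pinned-connection pattern. Running the
// bootstrap CREATE and the follow-up SELECT on one *sql.Conn (instead of
// the pooled *sql.DB) means the SELECT can never land on a pooled
// connection whose cached schema predates the CREATE, closing the 1146
// window. The table definition is truncated for illustration.
func bootstrapTaskMeta(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx) // pin a single connection from the pool
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `
		CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (
			task_id BIGINT(20) UNSIGNED NOT NULL PRIMARY KEY
		)`); err != nil {
		return err
	}
	// Same connection, hence a schema view no older than the CREATE above.
	var n int
	return conn.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM lightning_metadata.task_meta_v2").Scan(&n)
}
```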

Not fixed here

A separate problem: G11 packs 37 cases under a 50-min stage timeout while other groups finish in ~17 min. Builds 1025/1026/1032/1035/1036/1040/1041 were "Terminated" by the timeout at whichever test happened to be running. Recommend a follow-up PR to split G11 — out of scope for this CI unblock.

Test plan

  • /test pull-dm-integration-test passes
  • Re-run pull_dm_integration_test multiple times to confirm G11 stability (the two races should no longer trigger; the timeout problem remains until G11 is split)

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first be LGTMed and approved by the reviewers.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. After it has the lgtm and approved labels, please wait for cherry-pick merging approval from the triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/dm Issues or PRs related to DM. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances test stability by implementing a SIGKILL escalation for hung processes and pre-creating Lightning metadata tables to avoid concurrent DDL race conditions. Feedback highlights a potential mismatch in the metadata schema name, the need for database cleanup, and several improvements for process management and SQL schema definitions, including the use of pgrep and fixing non-standard data types.

Comment on lines +62 to +63
run_sql_tidb "CREATE DATABASE IF NOT EXISTS lightning_metadata;"
run_sql_tidb "CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (

high

There is a likely discrepancy in the schema name. While this code creates tables in lightning_metadata, line 97 of this script expects conflict results in dm_meta_test. In DM, the Lightning metadata schema is typically configured as {meta_schema}_{task_name} (e.g., dm_meta_test). If the workers are using dm_meta_test, pre-creating tables in lightning_metadata will not resolve the race condition. Please verify the actual schema name used by the task. Additionally, the lightning_metadata database is created but never dropped; it should be added to the cleanup logic.

Comment thread dm/tests/_utils/wait_process_exit Outdated
# Escalate to SIGKILL so the test can proceed; the hang is reported above for
# later investigation instead of failing the whole G11 group.
pkill -KILL -x "$process" 2>/dev/null || true
ps aux | grep "$process" | grep -v 'grep' | grep -v 'wait_process_exit' | awk '{print $2}' | xargs -r kill -KILL 2>/dev/null || true

medium

The manual process killing logic using ps aux | grep is redundant and potentially unsafe. pkill -KILL -x "$process" (on line 24) already performs an exact match on the process name. The manual grep without exact matching (and with the dot in $process acting as a wildcard) could lead to false positives by matching unintended processes. It is recommended to remove this line.

Comment thread dm/tests/_utils/wait_process_exit Outdated

WAIT_COUNT=0
while [ $WAIT_COUNT -lt 10 ]; do
ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1

medium

Using ps | grep to check for process existence is fragile and prone to false positives. Since pkill is used in this script, pgrep -x should be available and is a much more reliable way to check for an exact process name match.

Suggested change
ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1
pgrep -x "$process" >/dev/null 2>&1

Comment on lines +75 to +76
task_id BIGINT(20) UNSIGNED,
table_id BIGINT(64) NOT NULL,

medium

Two issues in the table definition:

  1. task_id is part of the PRIMARY KEY and should be explicitly defined as NOT NULL for consistency with the task_meta_v2 definition.
  2. BIGINT(64) is a non-standard display width for BIGINT. It appears to be a typo for BIGINT or BIGINT(20).
Suggested change
task_id BIGINT(20) UNSIGNED,
table_id BIGINT(64) NOT NULL,
task_id BIGINT(20) UNSIGNED NOT NULL,
table_id BIGINT NOT NULL,

…ke aborts fast

Session.Close() issues an etcd Revoke bounded by lease TTL (60s by default).
With the old code the session's ctx defaulted to cli.Ctx() (client lifetime),
so during Server.Close() the Revoke call could block up to 60s per master
when quorum collapses mid-shutdown. In ha_cases cleanup (pkill -hup on all
three masters at once), this made one non-leader routinely exceed the
wait_process_exit 120s deadline and fail the whole G11 group.

Pass concurrency.WithContext(ctx) so e.cancel() from Server.Close()
immediately aborts Revoke; the lease is abandoned and naturally expires
via TTL, which is the documented acceptable tradeoff on shutdown.

The only other closeSession call site is session-natural-expiry at line 232
where the lease is already gone on the server, so Revoke returns fast
regardless of ctx binding.
…otstrap race

Two dm-workers each spawn a TiDB Lightning instance in physical import mode
and concurrently bootstrap lightning_metadata.task_meta_v2 / table_meta via
CREATE TABLE IF NOT EXISTS. Lightning's dbMetaMgrBuilder.Init uses the raw
*sql.DB pool without pinning a conn (br/pkg/lightning/importer/meta_manager.go
on tidb release-7.5), so CREATE and the follow-up SELECT can land on
different pooled connections with different InfoSchema caches. The result is
Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist on the
loser — which SQLWithRetry doesn't retry.

Pre-creating both tables before start-task turns both lightning CREATEs
into no-ops that never hit the DDL queue, closing the race window. The
proper fix (pinning a *sql.Conn in Lightning) belongs in tidb/br and should
be filed separately.
@joechenrh joechenrh force-pushed the fix/release-7.5-dm-ci-g11 branch from 2f51647 to ccb5aea on Apr 21, 2026 05:01
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                 Commit   Required  Rerun command
pull-verify               ccb5aea  true      /test pull-verify
pull-dm-integration-test  ccb5aea  true      /test pull-dm-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
