dm/tests: stabilize G11 on release-7.5 (ha_cases, lightning_mode) #12617
joechenrh wants to merge 2 commits into pingcap:release-7.5
Conversation
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Code Review
This pull request enhances test stability by implementing a SIGKILL escalation for hung processes and pre-creating Lightning metadata tables to avoid concurrent DDL race conditions. Feedback highlights a potential mismatch in the metadata schema name, the need for database cleanup, and several improvements for process management and SQL schema definitions, including the use of pgrep and fixing non-standard data types.
```sh
run_sql_tidb "CREATE DATABASE IF NOT EXISTS lightning_metadata;"
run_sql_tidb "CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (
```
There is a likely discrepancy in the schema name. While this code creates tables in lightning_metadata, line 97 of this script expects conflict results in dm_meta_test. In DM, the Lightning metadata schema is typically configured as {meta_schema}_{task_name} (e.g., dm_meta_test). If the workers are using dm_meta_test, pre-creating tables in lightning_metadata will not resolve the race condition. Please verify the actual schema name used by the task. Additionally, the lightning_metadata database is created but never dropped; it should be added to the cleanup logic.
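The schema-name concern above can be checked by deriving the name the way the reviewer describes ({meta_schema}_{task_name}). The `dm_meta` and `test` values below are assumptions for illustration; the real values come from the task config used by the test:

```shell
# Derive the Lightning metadata schema as {meta_schema}_{task_name}.
# "dm_meta" and "test" are assumed values; check the task config in
# lightning_mode for the real ones.
meta_schema="dm_meta"
task_name="test"
lightning_schema="${meta_schema}_${task_name}"
echo "$lightning_schema"
```

If this derivation matches what the workers actually use, pre-created tables would need to live in `dm_meta_test`, not `lightning_metadata`.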
```sh
# Escalate to SIGKILL so the test can proceed; the hang is reported above for
# later investigation instead of failing the whole G11 group.
pkill -KILL -x "$process" 2>/dev/null || true
ps aux | grep "$process" | grep -v 'grep' | grep -v 'wait_process_exit' | awk '{print $2}' | xargs -r kill -KILL 2>/dev/null || true
```
The manual process killing logic using ps aux | grep is redundant and potentially unsafe. pkill -KILL -x "$process" (on line 24) already performs an exact match on the process name. The manual grep without exact matching (and with the dot in $process acting as a wildcard) could lead to false positives by matching unintended processes. It is recommended to remove this line.
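The false-positive risk the review describes can be demonstrated directly: in an unanchored grep the `.` in a name like `dm-master.test` is a regex wildcard, so an unrelated process name still matches, while an exact-name comparison (the semantics of `pkill`/`pgrep -x`) does not:

```shell
# '.' in the pattern matches any character under grep's regex rules,
# so "dm-masterXtest" is a false positive; exact string comparison
# (what pkill/pgrep -x effectively do) rejects it.
process="dm-master.test"
if echo "dm-masterXtest" | grep -q "$process"; then
    grep_result="match"
else
    grep_result="no match"
fi
if [ "dm-masterXtest" = "$process" ]; then
    exact_result="match"
else
    exact_result="no match"
fi
echo "grep: $grep_result, exact: $exact_result"
```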
```sh
WAIT_COUNT=0
while [ $WAIT_COUNT -lt 10 ]; do
    ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1
```
Using ps | grep to check for process existence is fragile and prone to false positives. Since pkill is used in this script, pgrep -x should be available and is a much more reliable way to check for an exact process name match.
```diff
-ps aux | grep $process | grep -v 'grep' | grep -v 'wait_process_exit' >/dev/null 2>&1
+pgrep -x "$process" >/dev/null 2>&1
```
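A sketch of the full wait loop rewritten around `pgrep -x`, keeping the 10-iteration budget from the original script. The process name here is a made-up one that is guaranteed not to be running, so the loop exits on the first check:

```shell
# Wait-loop sketch using pgrep -x for exact process-name matching.
# "$process" is a placeholder name assumed not to exist, so this exits
# immediately with status "exited".
process="no-such-process-$$"
WAIT_COUNT=0
status="timeout"
while [ "$WAIT_COUNT" -lt 10 ]; do
    if ! pgrep -x "$process" >/dev/null 2>&1; then
        status="exited"
        break
    fi
    sleep 1
    WAIT_COUNT=$((WAIT_COUNT + 1))
done
echo "$status"
```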
```sql
task_id BIGINT(20) UNSIGNED,
table_id BIGINT(64) NOT NULL,
```
Two issues in the table definition:

- `task_id` is part of the `PRIMARY KEY` and should be explicitly defined as `NOT NULL`, for consistency with the `task_meta_v2` definition.
- `BIGINT(64)` is a non-standard display width for `BIGINT`. It appears to be a typo for `BIGINT` or `BIGINT(20)`.
```diff
-task_id BIGINT(20) UNSIGNED,
-table_id BIGINT(64) NOT NULL,
+task_id BIGINT(20) UNSIGNED NOT NULL,
+table_id BIGINT NOT NULL,
```
…ke aborts fast

Session.Close() issues an etcd Revoke bounded by the lease TTL (60s by default). With the old code the session's ctx defaulted to cli.Ctx() (client lifetime), so during Server.Close() the Revoke call could block up to 60s per master when quorum collapses mid-shutdown. In ha_cases cleanup (pkill -hup on all three masters at once), this made one non-leader routinely exceed the wait_process_exit 120s deadline and fail the whole G11 group. Pass concurrency.WithContext(ctx) so e.cancel() from Server.Close() immediately aborts Revoke; the lease is abandoned and naturally expires via TTL, which is the documented acceptable tradeoff on shutdown. The only other closeSession call site is the session-natural-expiry path at line 232, where the lease is already gone on the server, so Revoke returns fast regardless of ctx binding.
…otstrap race

Two dm-workers each spawn a TiDB Lightning instance in physical import mode and concurrently bootstrap lightning_metadata.task_meta_v2 / table_meta via CREATE TABLE IF NOT EXISTS. Lightning's dbMetaMgrBuilder.Init uses the raw *sql.DB pool without pinning a connection (br/pkg/lightning/importer/meta_manager.go on tidb release-7.5), so the CREATE and the follow-up SELECT can land on different pooled connections with different InfoSchema caches. The result is Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist on the loser, which SQLWithRetry doesn't retry. Pre-creating both tables before start-task turns both Lightning CREATEs into no-ops that never hit the DDL queue, closing the race window. The proper fix (pinning a *sql.Conn in Lightning) belongs in tidb/br and should be filed separately.
@joechenrh: The following tests failed. Full PR test history. Your PR dashboard.
Summary
G11 of `pull_dm_integration_test` on release-7.5 has been failing across builds 1022–1042, blocking PR merges. Investigation identifies two pre-existing races exposed by slower runners ~20 days ago. This PR fixes one at source and mitigates the other at the test level.

Changes
1. `dm/pkg/election/election.go` — source fix for the ha_cases hang

`Session.Close()` issues an etcd `Revoke` bounded by lease TTL (60s default). The session was created without `WithContext`, so its ctx defaulted to `e.cli.Ctx()` (client lifetime). When `Server.Close()` runs and `e.cancel()` fires, the Revoke call still can't be cancelled; it must wait out its own 60s timeout.

In ha_cases cleanup, `pkill -hup dm-master.test` hits all three masters concurrently. Each non-leader proposes its `LeaseRevoke` via raft, but peers race to call `s.etcd.Close()` and quorum dies before the proposal commits. The RPC then burns the full 60s per process, exceeding the `wait_process_exit` 120s deadline and failing the whole G11 group (observed on builds 1022, 1033, 1037, 1038 — always a non-leader master2/master3).

Fix: pass
`concurrency.WithContext(ctx)` so `e.cancel()` aborts `Revoke` immediately. The lease is abandoned and expires by TTL, which the etcd docs call out as the acceptable shutdown tradeoff. The only other `closeSession` call site (session natural expiry at election.go:232) already has a dead lease on the server, so Revoke returns fast regardless of ctx binding.

2. `dm/tests/lightning_mode/run.sh` — pre-create `lightning_metadata` tables

Two dm-workers each spawn a TiDB Lightning instance and concurrently bootstrap `lightning_metadata.task_meta_v2` / `table_meta` via `CREATE TABLE IF NOT EXISTS`. Lightning opens `*sql.DB` without pinning a conn (`br/pkg/lightning/importer/meta_manager.go:40-59` on tidb release-7.5), so one instance's CREATE bumps schemaVersion on node X, the other's CREATE on node Y short-circuits (table already in InfoSchema, no DDL scheduled, no bump), and its follow-up SELECT on a pooled conn with stale schema returns `Error 1146: Table 'lightning_metadata.task_meta_v2' doesn't exist`. `SQLWithRetry` doesn't retry 1146 (`br/pkg/lightning/common/retry.go:101-136`), so the task permanently fails. Observed once in build 1039.

The proper fix is in tidb/br (pin a `*sql.Conn` across Init + CheckTaskExist) and should be filed separately. As a DM-side mitigation, pre-creating both tables makes the two CREATEs no-ops that never hit the DDL queue, closing the race window entirely.

Not fixed here

A separate problem: G11 packs 37 cases under a 50-min stage timeout while other groups finish in ~17 min. Builds 1025/1026/1032/1035/1036/1040/1041 were "Terminated" by the timeout at whichever test happened to be running. Recommend a follow-up PR to split G11 — out of scope for this CI unblock.
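The pre-creation mitigation from change 2 can be sketched as below. `run_sql_tidb` is stubbed to echo its argument so the sketch is self-contained; in dm/tests it executes the statement against the TiDB instance, and the column list shown is illustrative rather than the exact schema from the PR:

```shell
# DM-side mitigation sketch: pre-create the Lightning metadata tables before
# start-task so both workers' CREATE TABLE IF NOT EXISTS calls become no-ops.
# run_sql_tidb is a stub here; the real helper runs the SQL against TiDB.
run_sql_tidb() { echo "SQL> $1"; }
out1=$(run_sql_tidb "CREATE DATABASE IF NOT EXISTS lightning_metadata;")
echo "$out1"
out2=$(run_sql_tidb "CREATE TABLE IF NOT EXISTS lightning_metadata.task_meta_v2 (task_id BIGINT(20) UNSIGNED NOT NULL, PRIMARY KEY (task_id));")
echo "$out2"
```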
Test plan

`/test pull-dm-integration-test` passes