Summary
Currently EvenScheduler (and therefore DefaultScheduler) does not move workers onto a fully idle supervisor that returns to service after maintenance. The cluster keeps running with zero used slots on the returning supervisor until an operator manually rebalances or restarts every affected topology.
This proposal introduces an opt-in, binary-trigger rebalance pass inside EvenScheduler. When at least one non-blacklisted supervisor has zero used slots, the scheduler relocates one worker per topology per round-robin iteration onto the idle slots, until the idle capacity reaches an even per-supervisor share. Already-balanced topologies are never touched.
The feature is disabled by default so existing behavior is preserved.
Problem statement
Today's Cluster#needsScheduling returns true only when the topology has fewer assigned workers than desired or has unassigned executors:
public boolean needsScheduling(TopologyDetails topology) {
    int desiredNumWorkers = topology.getNumWorkers();
    int assignedNumWorkers = this.getAssignedNumWorkers(topology);
    return desiredNumWorkers > assignedNumWorkers
            || getUnassignedExecutors(topology).size() > 0;
}
When a supervisor goes down for maintenance and comes back, the topology's workers were already redistributed onto the surviving supervisors, so both conditions are false and the scheduler considers the cluster "fully scheduled". The returning supervisor sits idle indefinitely.
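To illustrate, here is a toy model of the two conditions (the method below is a simplified stand-in with invented parameters, not the actual Cluster implementation). In the post-maintenance state all desired workers are still assigned and no executor is unassigned, so the check returns false:

```java
public class NeedsSchedulingModel {
    // Simplified stand-in for the two conditions in Cluster#needsScheduling.
    public static boolean needsScheduling(int desiredNumWorkers, int assignedNumWorkers,
                                          int unassignedExecutors) {
        return desiredNumWorkers > assignedNumWorkers || unassignedExecutors > 0;
    }

    public static void main(String[] args) {
        // After sup-C returns: all 6 desired workers still run on sup-A/sup-B,
        // nothing is unassigned -> the scheduler sees nothing to do.
        System.out.println(needsScheduling(6, 6, 0)); // prints "false"
    }
}
```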
Initial state:
  sup-A    sup-B    sup-C
 +-+-+-+  +-+-+-+  +-+-+-+
 |W|W| |  |W|W| |  |W|W| |
 +-+-+-+  +-+-+-+  +-+-+-+

After supervisor maintenance (sup-C is down):
  sup-A    sup-B    sup-C
 +-+-+-+  +-+-+-+
 |W|W|W|  |W|W|W|  (down)
 +-+-+-+  +-+-+-+

After sup-C returns (today's behavior):
  sup-A    sup-B    sup-C
 +-+-+-+  +-+-+-+  +-+-+-+
 |W|W|W|  |W|W|W|  | | | |   (stays idle forever)
 +-+-+-+  +-+-+-+  +-+-+-+
Operators must currently restart each affected topology by hand to drain the overloaded supervisors.
Proposed approach
1. Binary trigger
Cluster#needsScheduling gets one extra branch: also return true when at least one non-blacklisted supervisor has zero used slots and the topology is not already on that supervisor. The check is binary — a supervisor either has zero used slots or it does not — so a near-balanced cluster never triggers this path.
Triggers:
  sup-A      sup-B      sup-C
 +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
 |W|W|W|W|  |W|W|W|W|  | | | | |   <-- sup-C has zero used slots
 +-+-+-+-+  +-+-+-+-+  +-+-+-+-+

Does not trigger:
  sup-A      sup-B      sup-C
 +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
 |W|W|W|W|  |W|W|W| |  |W| | | |   <-- (4, 3, 1) is not balanced,
 +-+-+-+-+  +-+-+-+-+  +-+-+-+-+       but sup-C has used > 0,
                                       so the trigger does not fire
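A minimal sketch of the zero-used-slots test, with invented names (the draft's actual method, Cluster#hasIdleSupervisorReusableBy, also checks whether the topology already runs on the idle supervisor and whether the feature flag is enabled; that part is omitted here):

```java
import java.util.Map;
import java.util.Set;

public class BinaryTrigger {
    // True only when some non-blacklisted supervisor has exactly zero used slots.
    public static boolean hasFullyIdleSupervisor(Map<String, Integer> usedSlotsBySupervisor,
                                                 Set<String> blacklisted) {
        return usedSlotsBySupervisor.entrySet().stream()
                .filter(e -> !blacklisted.contains(e.getKey()))
                .anyMatch(e -> e.getValue() == 0);
    }

    public static void main(String[] args) {
        Set<String> none = Set.of();
        // (4, 4, 0): sup-C is fully idle -> triggers
        System.out.println(hasFullyIdleSupervisor(
                Map.of("sup-A", 4, "sup-B", 4, "sup-C", 0), none)); // prints "true"
        // (4, 3, 1): unbalanced, but no supervisor at zero -> does not trigger
        System.out.println(hasFullyIdleSupervisor(
                Map.of("sup-A", 4, "sup-B", 3, "sup-C", 1), none)); // prints "false"
    }
}
```

Because the condition is exact-zero rather than a threshold, small imbalances can never oscillate through this path.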
2. Round-robin relocation across topologies
A new phase at the start of scheduleTopologiesEvenly walks idle slots round-robin across all triggering topologies. Each iteration moves at most one worker per topology, so the returning supervisor ends up hosting workers from many topologies — preserving the per-supervisor workload diversity that a fresh submission produces.
Topologies A, B, C, D each at distribution (4, 4, 0).
sup-C just returned with 4 free slots.
Round-robin iterations:
iter 1: A's worker -> sup-C:0
iter 2: B's worker -> sup-C:1
iter 3: C's worker -> sup-C:2
iter 4: D's worker -> sup-C:3
done (idle capacity exhausted).
Result on sup-C: { worker of A, worker of B, worker of C, worker of D }
If only one topology triggers, that topology fills the idle supervisor up to its even share and stops. If only some topologies are eligible, the remaining idle slots stay empty for this round.
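The iteration order above can be sketched as follows. This is a simplified model, not the draft's redistributeOntoIdleSupervisors: eligibility and per-topology budgets are assumed constant, and all identifiers are invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinRelocation {
    // Walks idle slots round-robin across eligible topologies; each iteration
    // moves at most one worker per topology. Returns idle-slot -> topology.
    public static Map<String, String> fillIdleSlots(List<String> idleSlots,
                                                    List<String> eligibleTopologies) {
        Map<String, String> placement = new LinkedHashMap<>();
        int slot = 0;
        while (slot < idleSlots.size()) {
            boolean movedAny = false;
            for (String topo : eligibleTopologies) {
                if (slot >= idleSlots.size()) break;       // idle capacity exhausted
                placement.put(idleSlots.get(slot++), topo); // one worker this iteration
                movedAny = true;
            }
            if (!movedAny) break; // no eligible topology left
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, String> result = fillIdleSlots(
                List.of("sup-C:0", "sup-C:1", "sup-C:2", "sup-C:3"),
                List.of("A", "B", "C", "D"));
        System.out.println(result); // {sup-C:0=A, sup-C:1=B, sup-C:2=C, sup-C:3=D}
    }
}
```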
3. Per-topology budget per round
Each topology contributes at most
idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)
workers in one scheduling round, capped further by the idle side's free slot capacity. This means:
- topologies with numWorkers < numSupervisors automatically skip the round-robin (their floor is zero), so single-worker topologies cannot ping-pong;
- a "near-balanced" cluster never triggers in the first place;
- one round produces an even per-supervisor share, so the next round's trigger stays silent.
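A worked instance of the budget formula (the method name is made up for this sketch; for non-negative int operands, Java's integer division is the floor the formula calls for):

```java
public class PerTopologyBudget {
    // idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)
    public static int budget(int idleSupervisors, int numWorkers,
                             int nonBlacklistedSupervisors) {
        return idleSupervisors * (numWorkers / nonBlacklistedSupervisors);
    }

    public static void main(String[] args) {
        // 8-worker topology, 3 supervisors, 1 idle: budget = 1 * floor(8/3) = 2
        System.out.println(budget(1, 8, 3)); // prints "2"
        // Single-worker topology on 3 supervisors: floor(1/3) = 0 -> skips the pass
        System.out.println(budget(1, 1, 3)); // prints "0"
    }
}
```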
4. Direct placement onto idle slots
The relocated executors are assigned directly onto idle slots via cluster.freeSlot(victim) + cluster.assign(idleSlot, ...). This bypasses the regular sortSlots/interleaveAll pass that would otherwise be free to drop some of the freed executors back into the just-vacated slots on the busy supervisors.
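A toy model of what "direct placement" buys: the victim slot is freed and its workload is bound to a specific idle slot in one step, so no intermediate pass can re-fill the just-vacated slot. Plain maps stand in for Storm's Cluster state; the names are illustrative, not the draft's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DirectPlacementSketch {
    // assignments: slot id -> topology occupying it (absent key = free slot)
    public static void relocate(Map<String, String> assignments,
                                String victimSlot, String idleSlot) {
        String topology = assignments.remove(victimSlot); // analogous to cluster.freeSlot(victim)
        assignments.put(idleSlot, topology);              // analogous to cluster.assign(idleSlot, ...)
    }

    public static void main(String[] args) {
        Map<String, String> assignments = new LinkedHashMap<>();
        assignments.put("sup-A:2", "topo-1");
        relocate(assignments, "sup-A:2", "sup-C:0");
        System.out.println(assignments); // {sup-C:0=topo-1}
    }
}
```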
Configuration keys
Two new entries in DaemonConfig:
| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| nimbus.even.rebalance.idle-supervisor.enabled | boolean | false | Master switch. When false, the new code paths are short-circuited and behavior is identical to today. |
| nimbus.even.rebalance.max-free-per-topology | int | 0 | Optional upper bound on the number of workers a single topology may release per round. 0 or negative means unbounded (the per-round budget above takes effect). |
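For reference, the corresponding conf/defaults.yaml fragment would look like this (illustrative, simply restating the defaults above):

```
nimbus.even.rebalance.idle-supervisor.enabled: false
nimbus.even.rebalance.max-free-per-topology: 0
```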
Safety guards (already in the draft)
- Opt-in (default off): zero behavior change for existing deployments.
- Binary trigger: an "almost balanced" cluster cannot trigger a relocation; no time-based cooldown is needed because the trigger condition itself rules out instability.
- Drain-to-zero protection: a supervisor that holds only one worker of a topology is never the donor — the same round can never produce a new idle supervisor.
- Round-robin across topologies: prevents the first-scheduled topology from monopolizing the idle capacity, which would concentrate one workload on the returning supervisor.
- Idle-side slot cap: the round-robin pass never tries to free more workers than the idle capacity can absorb.
- Direct placement: relocated executors never bounce back onto the just-vacated busy supervisors.
Implementation outline
The draft (against master / 3.0.0-SNAPSHOT) touches three production source files, conf/defaults.yaml, and a test:
storm-server/src/main/java/org/apache/storm/DaemonConfig.java
- Two new keys (above).
storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java
- needsScheduling: extra branch via hasIdleSupervisorReusableBy().
- hasIdleSupervisorReusableBy(): binary check, returns false when disabled.
storm-server/src/main/java/org/apache/storm/scheduler/EvenScheduler.java
- scheduleTopologiesEvenly(): calls redistributeOntoIdleSupervisors() first.
- redistributeOntoIdleSupervisors(): round-robin across topologies.
- relocateOneWorkerOntoIdleSlot(): pulls one worker from the most-loaded
supervisor of a topology, places it directly onto an idle slot.
conf/defaults.yaml
- Defaults for the two new keys.
storm-server/src/test/java/.../TestEvenSchedulerIdleSupervisor.java
- 9 unit tests covering trigger, drain cap, single-worker no-op,
drain-to-zero protection, one-round even distribution, and
round-robin sharing across multiple topologies.
I will open the PR once this proposal direction is agreed upon.
Backward compatibility
- Default off, so out-of-the-box behavior of
DefaultScheduler / EvenScheduler is byte-for-byte identical.
- No public API removed or renamed. Two new public methods (Cluster#hasIdleSupervisorReusableBy, EvenScheduler#redistributeOntoIdleSupervisors) are additions only.
- New config keys both have safe defaults; no new YAML is required for existing clusters.
Feedback welcome on direction, naming, and scope before I open the PR. Thanks!