
[PROPOSAL] EvenScheduler: opt-in round-robin rebalance to populate returning idle supervisors #8590


Summary

Currently EvenScheduler (and therefore DefaultScheduler) does not move workers onto a fully idle supervisor that returns to service after maintenance. The cluster keeps running with zero used slots on the returned supervisor until an operator manually rebalances or restarts every affected topology.

This proposal introduces an opt-in, binary-trigger rebalance pass inside EvenScheduler. When at least one non-blacklisted supervisor has zero used slots, the scheduler relocates one worker per topology per round-robin iteration onto the idle slots, until the returning supervisor holds an even per-supervisor share. Already-balanced topologies are never touched.

The feature is disabled by default so existing behavior is preserved.


Problem statement

Today's Cluster#needsScheduling returns true only when the topology has fewer assigned workers than desired or has unassigned executors:

public boolean needsScheduling(TopologyDetails topology) {
    int desiredNumWorkers = topology.getNumWorkers();
    int assignedNumWorkers = this.getAssignedNumWorkers(topology);
    return desiredNumWorkers > assignedNumWorkers
        || getUnassignedExecutors(topology).size() > 0;
}

When a supervisor goes down for maintenance and comes back, the topology's workers were already redistributed onto the surviving supervisors, so both conditions are false and the scheduler considers the cluster "fully scheduled". The returning supervisor sits idle indefinitely.

Initial state                  After supervisor maintenance
sup-A    sup-B    sup-C        sup-A    sup-B    sup-C
+-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+
|W|W| |  |W|W| |  |W|W| |      |W|W|W|  |W|W|W|       (sup-C is down)
+-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+

After sup-C returns                  Today's behavior
sup-A    sup-B    sup-C              sup-A    sup-B    sup-C
+-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
|W|W|W|  |W|W|W|  | | | |    -->     |W|W|W|  |W|W|W|  | | | |
+-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
                                                       (stays idle forever)

Operators must currently restart each affected topology by hand to drain the overloaded supervisors.


Proposed approach

1. Binary trigger

Cluster#needsScheduling gains one extra branch: it also returns true when at least one non-blacklisted supervisor has zero used slots and the topology is not already running on that supervisor. The check is binary (a supervisor either has zero used slots or it does not), so a near-balanced cluster never triggers this path. A sketch of the extended check follows the diagram below.

Triggers       sup-A      sup-B      sup-C
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
               |W|W|W|W|  |W|W|W|W|  | | | | |   <-- sup-C has zero used slots,
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+       so the trigger fires

Does not       sup-A      sup-B      sup-C
trigger        +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
               |W|W|W|W|  |W|W|W| |  |W| | | |   <-- (4, 3, 1) is not balanced,
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+       but sup-C has used > 0,
                                                     so the trigger does not fire
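
A minimal sketch of the extended check, reusing the hasIdleSupervisorReusableBy() helper named in the implementation outline below; the exact wiring is still up for discussion:

public boolean needsScheduling(TopologyDetails topology) {
    int desiredNumWorkers = topology.getNumWorkers();
    int assignedNumWorkers = this.getAssignedNumWorkers(topology);
    return desiredNumWorkers > assignedNumWorkers
        || getUnassignedExecutors(topology).size() > 0
        // new opt-in branch: some non-blacklisted supervisor is fully idle
        // and this topology is not running on it yet; returns false when
        // nimbus.even.rebalance.idle-supervisor.enabled is false
        || hasIdleSupervisorReusableBy(topology);
}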

2. Round-robin relocation across topologies

A new phase at the start of scheduleTopologiesEvenly walks idle slots round-robin across all triggering topologies. Each iteration moves at most one worker per topology, so the returning supervisor ends up hosting workers from many topologies — preserving the per-supervisor workload diversity that a fresh submission produces.

Topologies A, B, C, D each at distribution (4, 4, 0).
sup-C just returned with 4 free slots.

Round-robin iterations:
  iter 1: A's worker -> sup-C:0
  iter 2: B's worker -> sup-C:1
  iter 3: C's worker -> sup-C:2
  iter 4: D's worker -> sup-C:3
  done (idle capacity exhausted).

Result on sup-C: { worker of A, worker of B, worker of C, worker of D }

If only one topology triggers, that topology fills the idle supervisor up to its even share and stops. If only some topologies are eligible, the remaining idle slots stay empty for this round.
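
A sketch of the loop shape, assuming hypothetical helpers idleSlots() and eligibleForRebalance() for the bookkeeping (trigger, budget, and blacklist checks); the one-worker-per-topology-per-iteration structure is the point:

private void redistributeOntoIdleSupervisors(Topologies topologies, Cluster cluster) {
    Deque<WorkerSlot> idle = new ArrayDeque<>(idleSlots(cluster));  // hypothetical helper
    boolean moved = true;
    while (!idle.isEmpty() && moved) {
        moved = false;
        for (TopologyDetails td : topologies.getTopologies()) {     // round-robin order
            if (idle.isEmpty()) {
                return;                                             // idle capacity exhausted
            }
            if (!eligibleForRebalance(td, cluster)) {               // hypothetical eligibility check
                continue;
            }
            relocateOneWorkerOntoIdleSlot(cluster, td, idle.poll());
            moved = true;   // at most one worker per topology per iteration
        }
    }
}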

3. Per-topology budget per round

Each topology contributes at most

    idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)

workers in one scheduling round, capped further by the idle side's free slot
capacity (a short arithmetic sketch follows the list below). This means:

  • topologies with numWorkers < numSupervisors automatically skip the round-robin (their floor is zero), so single-worker topologies cannot ping-pong;
  • a near-balanced cluster never triggers in the first place;
  • one round produces an even per-supervisor share, so the next round's trigger stays silent.
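
For concreteness, the budget arithmetic as a hypothetical helper (Java integer division already floors non-negative operands):

static int perRoundBudget(int idleSupervisorCount, int numWorkers,
                          int nonBlacklistedSupervisorCount) {
    return idleSupervisorCount * (numWorkers / nonBlacklistedSupervisorCount);
}

// perRoundBudget(1, 4, 3) == 1  -> a 4-worker topology donates one worker
// perRoundBudget(1, 2, 3) == 0  -> too few workers; skips the round-robin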

4. Direct placement onto idle slots

The relocated executors are assigned directly onto idle slots via cluster.freeSlot(victim) + cluster.assign(idleSlot, ...). This bypasses the regular sortSlots/interleaveAll pass that would otherwise be free to drop some of the freed executors back into the just-vacated slots on the busy supervisors.
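
A minimal sketch of relocateOneWorkerOntoIdleSlot() under that approach; pickMostLoadedVictim() is a hypothetical helper, while freeSlot, assign, getAssignmentById, and getSlotToExecutors are existing Cluster / SchedulerAssignment APIs:

private void relocateOneWorkerOntoIdleSlot(Cluster cluster, TopologyDetails td,
                                           WorkerSlot idleSlot) {
    WorkerSlot victim = pickMostLoadedVictim(cluster, td);  // hypothetical victim selection
    // copy the executor list before freeing, since freeSlot mutates the assignment
    List<ExecutorDetails> executors = new ArrayList<>(
        cluster.getAssignmentById(td.getId()).getSlotToExecutors().get(victim));
    cluster.freeSlot(victim);                        // release the slot on the busy supervisor
    cluster.assign(idleSlot, td.getId(), executors); // place directly; never re-enters sortSlots
}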


Configuration keys

Two new entries in DaemonConfig:

nimbus.even.rebalance.idle-supervisor.enabled  (boolean, default: false)
    Master switch. When false, the new code paths are short-circuited and
    behavior is identical to today.

nimbus.even.rebalance.max-free-per-topology  (int, default: 0)
    Optional upper bound on the number of workers a single topology may
    release per round. Zero or a negative value means unbounded (the
    per-round budget above takes effect).
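
The matching conf/defaults.yaml entries would simply be (sketch; key names as proposed above):

# proposed defaults; both values preserve today's behavior out of the box
nimbus.even.rebalance.idle-supervisor.enabled: false
nimbus.even.rebalance.max-free-per-topology: 0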

Safety guards (already in the draft)

  • Opt-in (default off): zero behavior change for existing deployments.
  • Binary trigger: an "almost balanced" cluster cannot trigger a relocation; no time-based cooldown is needed because the trigger condition itself rules out instability.
  • Drain-to-zero protection: a supervisor that holds only one worker of a topology is never the donor — the same round can never produce a new idle supervisor.
  • Round-robin across topologies: prevents the first-scheduled topology from monopolizing the idle capacity, which would concentrate one workload on the returning supervisor.
  • Idle-side slot cap: the round-robin pass never tries to free more workers than the idle capacity can absorb.
  • Direct placement: relocated executors never bounce back onto the just-vacated busy supervisors.

Implementation outline

The draft (against master / 3.0.0-SNAPSHOT) touches three production source files, conf/defaults.yaml, and a test:

storm-server/src/main/java/org/apache/storm/DaemonConfig.java
   - Two new keys (above).

storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java
   - needsScheduling: extra branch via hasIdleSupervisorReusableBy().
   - hasIdleSupervisorReusableBy(): binary check, returns false when disabled.

storm-server/src/main/java/org/apache/storm/scheduler/EvenScheduler.java
   - scheduleTopologiesEvenly(): calls redistributeOntoIdleSupervisors() first.
   - redistributeOntoIdleSupervisors(): round-robin across topologies.
   - relocateOneWorkerOntoIdleSlot(): pulls one worker from the most-loaded
     supervisor of a topology, places it directly onto an idle slot.

conf/defaults.yaml
   - Defaults for the two new keys.

storm-server/src/test/java/.../TestEvenSchedulerIdleSupervisor.java
   - 9 unit tests covering trigger, drain cap, single-worker no-op,
     drain-to-zero protection, one-round even distribution, and
     round-robin sharing across multiple topologies.

I will open the PR once this proposal direction is agreed upon.


Backward compatibility

  • Default off, so out-of-the-box behavior of DefaultScheduler / EvenScheduler is byte-for-byte identical.
  • No public API removed or renamed. Two new public methods (Cluster#hasIdleSupervisorReusableBy, EvenScheduler#redistributeOntoIdleSupervisors) are additions only.
  • New config keys both have safe defaults; no new YAML is required for existing clusters.

Feedback welcome on direction, naming, and scope before I open the PR. Thanks!
