
[PROPOSAL] EvenScheduler: opt-in round-robin rebalance to populate returning idle supervisors #8590


Summary

Currently EvenScheduler (and therefore DefaultScheduler) does not move workers onto a fully idle supervisor that returns to service after maintenance. The cluster keeps running with zero used slots on the returned supervisor until an operator manually rebalances or restarts every affected topology.

This proposal introduces an opt-in, binary-trigger rebalance pass inside EvenScheduler. When at least one non-blacklisted supervisor has zero used slots, the scheduler relocates one worker per topology per round-robin iteration onto the idle slots, until the returning supervisor holds an even per-supervisor share. Already-balanced topologies are never touched.

The feature is disabled by default so existing behavior is preserved.


Problem statement

Today's Cluster#needsScheduling returns true only when the topology has fewer assigned workers than desired or has unassigned executors:

public boolean needsScheduling(TopologyDetails topology) {
    int desiredNumWorkers = topology.getNumWorkers();
    int assignedNumWorkers = this.getAssignedNumWorkers(topology);
    return desiredNumWorkers > assignedNumWorkers
        || getUnassignedExecutors(topology).size() > 0;
}

When a supervisor goes down for maintenance and comes back, the topology's workers were already redistributed onto the surviving supervisors, so both conditions are false and the scheduler considers the cluster "fully scheduled". The returning supervisor sits idle indefinitely.

Initial state                  After supervisor maintenance
sup-A    sup-B    sup-C        sup-A    sup-B    sup-C
+-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+
|W|W| |  |W|W| |  |W|W| |      |W|W|W|  |W|W|W|       (sup-C is down)
+-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+

After sup-C returns                  Today's behavior
sup-A    sup-B    sup-C              sup-A    sup-B    sup-C
+-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
|W|W|W|  |W|W|W|  | | | |    -->     |W|W|W|  |W|W|W|  | | | |
+-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
                                                       (stays idle forever)

Operators must currently restart each affected topology by hand to drain the overloaded supervisors.


Proposed approach

1. Binary trigger

Cluster#needsScheduling gains one extra branch: it also returns true when at least one non-blacklisted supervisor has zero used slots and the topology is not already running on that supervisor. The check is binary (a supervisor either has zero used slots or it does not), so a near-balanced cluster never triggers this path. A sketch of the extended check follows the diagram below.

Triggers       sup-A      sup-B      sup-C
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
               |W|W|W|W|  |W|W|W|W|  | | | | |   <-- sup-C has zero used slots,
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+       so the trigger fires

Does not       sup-A      sup-B      sup-C
trigger        +-+-+-+-+  +-+-+-+-+  +-+-+-+-+
               |W|W|W|W|  |W|W|W| |  |W| | | |   <-- (4, 3, 1) is not balanced,
               +-+-+-+-+  +-+-+-+-+  +-+-+-+-+       but sup-C has used > 0,
                                                     so the trigger does not fire
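
A minimal sketch of the extended check, reusing the hasIdleSupervisorReusableBy() helper named in the implementation outline below; the exact wiring is still up for discussion:

public boolean needsScheduling(TopologyDetails topology) {
    int desiredNumWorkers = topology.getNumWorkers();
    int assignedNumWorkers = this.getAssignedNumWorkers(topology);
    return desiredNumWorkers > assignedNumWorkers
        || getUnassignedExecutors(topology).size() > 0
        // new opt-in branch: some non-blacklisted supervisor is fully idle
        // and this topology is not running on it yet; returns false when
        // nimbus.even.rebalance.idle-supervisor.enabled is false
        || hasIdleSupervisorReusableBy(topology);
}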

2. Round-robin relocation across topologies

A new phase at the start of scheduleTopologiesEvenly walks idle slots round-robin across all triggering topologies. Each iteration moves at most one worker per topology, so the returning supervisor ends up hosting workers from many topologies — preserving the per-supervisor workload diversity that a fresh submission produces.

Topologies A, B, C, D each at distribution (4, 4, 0).
sup-C just returned with 4 free slots.

Round-robin iterations:
  iter 1: A's worker -> sup-C:0
  iter 2: B's worker -> sup-C:1
  iter 3: C's worker -> sup-C:2
  iter 4: D's worker -> sup-C:3
  done (idle capacity exhausted).

Result on sup-C: { worker of A, worker of B, worker of C, worker of D }

If only one topology triggers, that topology fills the idle supervisor up to its even share and stops. If only some topologies are eligible, the remaining idle slots stay empty for this round.
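
A sketch of the loop shape, assuming hypothetical helpers idleSlots() and eligibleForRebalance() for the bookkeeping (trigger, budget, and blacklist checks); the one-worker-per-topology-per-iteration structure is the point:

private void redistributeOntoIdleSupervisors(Topologies topologies, Cluster cluster) {
    Deque<WorkerSlot> idle = new ArrayDeque<>(idleSlots(cluster));  // hypothetical helper
    boolean moved = true;
    while (!idle.isEmpty() && moved) {
        moved = false;
        for (TopologyDetails td : topologies.getTopologies()) {     // round-robin order
            if (idle.isEmpty()) {
                return;                                             // idle capacity exhausted
            }
            if (!eligibleForRebalance(td, cluster)) {               // hypothetical eligibility check
                continue;
            }
            relocateOneWorkerOntoIdleSlot(cluster, td, idle.poll());
            moved = true;   // at most one worker per topology per iteration
        }
    }
}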

3. Per-topology budget per round

Each topology contributes at most

    idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)

workers in one scheduling round, capped further by the idle side's free slot
capacity (a short arithmetic sketch follows the list below). This means:

  • topologies with numWorkers < numSupervisors automatically skip the round-robin (their floor is zero), so single-worker topologies cannot ping-pong;
  • a near-balanced cluster never triggers in the first place;
  • one round produces an even per-supervisor share, so the next round's trigger stays silent.
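
For concreteness, the budget arithmetic as a hypothetical helper (Java integer division already floors non-negative operands):

static int perRoundBudget(int idleSupervisorCount, int numWorkers,
                          int nonBlacklistedSupervisorCount) {
    return idleSupervisorCount * (numWorkers / nonBlacklistedSupervisorCount);
}

// perRoundBudget(1, 4, 3) == 1  -> a 4-worker topology donates one worker
// perRoundBudget(1, 2, 3) == 0  -> too few workers; skips the round-robin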

4. Direct placement onto idle slots

The relocated executors are assigned directly onto idle slots via cluster.freeSlot(victim) + cluster.assign(idleSlot, ...). This bypasses the regular sortSlots/interleaveAll pass that would otherwise be free to drop some of the freed executors back into the just-vacated slots on the busy supervisors.
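
A minimal sketch of relocateOneWorkerOntoIdleSlot() under that approach; pickMostLoadedVictim() is a hypothetical helper, while freeSlot, assign, getAssignmentById, and getSlotToExecutors are existing Cluster / SchedulerAssignment APIs:

private void relocateOneWorkerOntoIdleSlot(Cluster cluster, TopologyDetails td,
                                           WorkerSlot idleSlot) {
    WorkerSlot victim = pickMostLoadedVictim(cluster, td);  // hypothetical victim selection
    // copy the executor list before freeing, since freeSlot mutates the assignment
    List<ExecutorDetails> executors = new ArrayList<>(
        cluster.getAssignmentById(td.getId()).getSlotToExecutors().get(victim));
    cluster.freeSlot(victim);                        // release the slot on the busy supervisor
    cluster.assign(idleSlot, td.getId(), executors); // place directly; never re-enters sortSlots
}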


Configuration keys

Two new entries in DaemonConfig:

nimbus.even.rebalance.idle-supervisor.enabled  (boolean, default: false)
    Master switch. When false, the new code paths are short-circuited and
    behavior is identical to today.

nimbus.even.rebalance.max-free-per-topology  (int, default: 0)
    Optional upper bound on the number of workers a single topology may
    release per round. Zero or a negative value means unbounded (the
    per-round budget above takes effect).
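
The matching conf/defaults.yaml entries would simply be (sketch; key names as proposed above):

# proposed defaults; both values preserve today's behavior out of the box
nimbus.even.rebalance.idle-supervisor.enabled: false
nimbus.even.rebalance.max-free-per-topology: 0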

Safety guards (already in the draft)

  • Opt-in (default off): zero behavior change for existing deployments.
  • Binary trigger: an "almost balanced" cluster cannot trigger a relocation; no time-based cooldown is needed because the trigger condition itself rules out instability.
  • Drain-to-zero protection: a supervisor that holds only one worker of a topology is never the donor — the same round can never produce a new idle supervisor.
  • Round-robin across topologies: prevents the first-scheduled topology from monopolizing the idle capacity, which would concentrate one workload on the returning supervisor.
  • Idle-side slot cap: the round-robin pass never tries to free more workers than the idle capacity can absorb.
  • Direct placement: relocated executors never bounce back onto the just-vacated busy supervisors.

Implementation outline

The draft (against master / 3.0.0-SNAPSHOT) touches three production source files, conf/defaults.yaml, and a test:

storm-server/src/main/java/org/apache/storm/DaemonConfig.java
   - Two new keys (above).

storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java
   - needsScheduling: extra branch via hasIdleSupervisorReusableBy().
   - hasIdleSupervisorReusableBy(): binary check, returns false when disabled.

storm-server/src/main/java/org/apache/storm/scheduler/EvenScheduler.java
   - scheduleTopologiesEvenly(): calls redistributeOntoIdleSupervisors() first.
   - redistributeOntoIdleSupervisors(): round-robin across topologies.
   - relocateOneWorkerOntoIdleSlot(): pulls one worker from the most-loaded
     supervisor of a topology, places it directly onto an idle slot.

conf/defaults.yaml
   - Defaults for the two new keys.

storm-server/src/test/java/.../TestEvenSchedulerIdleSupervisor.java
   - 9 unit tests covering trigger, drain cap, single-worker no-op,
     drain-to-zero protection, one-round even distribution, and
     round-robin sharing across multiple topologies.

I will open the PR once this proposal direction is agreed upon.


Backward compatibility

  • Default off, so out-of-the-box behavior of DefaultScheduler / EvenScheduler is byte-for-byte identical.
  • No public API removed or renamed. Two new public methods (Cluster#hasIdleSupervisorReusableBy, EvenScheduler#redistributeOntoIdleSupervisors) are additions only.
  • New config keys both have safe defaults; no new YAML is required for existing clusters.

Feedback welcome on direction, naming, and scope before I open the PR. Thanks!
