17 changes: 16 additions & 1 deletion .github/workflows/ci-cd.yaml
@@ -1,4 +1,4 @@
-name: CI/CD
+name: CLI CI/CD

on:
  push:
@@ -17,6 +17,21 @@ jobs:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Set up Java (for OpenAPI Generator)
        uses: actions/setup-java@v4
        with:
          java-version: '26'
          distribution: 'temurin'

      - name: Generate OpenAPI Client
        run: |
          # Install and run OpenAPI Generator
          npm install -g @openapitools/openapi-generator-cli
          openapi-generator-cli generate \
            -i openapi-specs/slurmrest-api-v0.0.44.json \
            -g python \
            -o slurmrest_client/

      - name: Install uv
        uses: astral-sh/setup-uv@v7
        with:
6 changes: 6 additions & 0 deletions .gitignore
@@ -21,3 +21,9 @@ lib64

# VS Code
.vscode/

# Secrets
.env

# Generated Client
slurmrest_client/
8 changes: 8 additions & 0 deletions Makefile
@@ -4,6 +4,14 @@ PIP_BIN ?= $(VENV_DIR)/bin/pip
PYTHON_BIN ?= python3.9
VAULT_SECRET_PATH ?= secret/tid/coact

generate-client:
	npm install -g @openapitools/openapi-generator-cli
	openapi-generator-cli generate \
		-i openapi-specs/slurmrest-api-v0.0.44.json \
		-g python \
		-o slurmrest_client \
		--package-name openapi_client

secrets:
	mkdir etc/.secrets/ -p
#set -e; for i in ldap_binddn ldap_bindpw; do vault kv get --field=$$i $(VAULT_SECRET_PATH) > etc/.secrets/$$i ; done
16 changes: 16 additions & 0 deletions README.md
@@ -36,6 +36,22 @@ then run
make apply
```

## OpenAPI Client Generation

The CLI includes an auto-generated Python client for the Slurm REST API. To generate the client locally:

**Prerequisites:**
- Java runtime (required by OpenAPI Generator)
  - On macOS with Homebrew: `brew install openjdk`
  - Add to your shell profile: `echo 'export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"' >> ~/.zshrc`

**Generate the client:**
```
make generate-client
```

The client is automatically generated in CI/CD and included when installing dependencies with `uv sync`.


# Usage

2 changes: 1 addition & 1 deletion ansible-runner/project
106 changes: 106 additions & 0 deletions docs/slurmrest_migration.md
@@ -0,0 +1,106 @@
# Migration of `sacctmgr` and `sacct` Calls to `slurmrest`

Currently, `sdf-cli` runs the `sacctmgr` and `sacct` CLI tools directly via `subprocess` to gather account association information, gather job accounting information, and toggle the number of nodes assigned to a facility in the event of an overage. As a step towards containerization and more robust interaction with SLURM, these CLI calls should be migrated to `slurmrest` endpoints.

## Current Usage

There are currently three read operations and one write operation:

### Read Operations

- `sacctmgr show assoc where account={','.join(list_of_assoc)} --noheader -P format=Account,GrpNodes,GrpJobs,MaxJobs`
  - `cli/modules/coact.py:1015`
  - Equivalent endpoint: `GET /slurmdb/v0.0.44/associations/`
- `SLURM_TIME_FORMAT=%s sacct --allusers --duplicates --allclusters --allocations --starttime="{start}T00:00:00" --endtime="{start}T23:59:59" --truncate --parsable2 --format=JobID,User,UID,Account,Partition,QOS,Submit,Start,End,Elapsed,NCPUS,AllocNodes,AllocTRES,CPUTimeRAW,NodeList,Reservation,ReservationId,State`
  - `api/scripts/jobs2usage.py:37`
  - Equivalent endpoint: `GET /slurmdb/v0.0.44/jobs/`
- `SLURM_TIME_FORMAT=%s {sacct_bin_path} --allusers --duplicates --allclusters --allocations --starttime="{date}T{start_time}" --endtime="{date}T{end_time}" --truncate --parsable2 --format=JobID,User,UID,Account,Partition,QOS,Submit,Start,End,Elapsed,NCPUS,AllocNodes,AllocTRES,CPUTimeRAW,NodeList,Reservation,ReservationId,State`
  - `cli/modules/coact.py:135`
  - Equivalent endpoint: `GET /slurmdb/v0.0.44/jobs/`

### Write Operations

- `sacctmgr modify -i account name=$facility:_regular_@$cluster set GrpTRES=node=$nodes`
  - `cli/modules/coact.py`
  - Not directly possible: `slurmrest` does not support account resource allocation assignments

## Migration Implications

Currently, three daemon tasks can be migrated easily:
- `coact-jobs-import.sh`
- `coact-reporegistration-daemon.sh`
- `coact-userregistration-daemon.sh`

One remains difficult due to the un-migratable write operation:
- `coact-facility-overage-daemon.sh`

## Possible Alternatives

All potential workarounds within `slurmrest` have significant disadvantages compared with the current `GrpTRES=node=0` approach. `sacctmgr` appears to be the only reliable way to modify account allocations. There is [a ticket](https://support.schedmd.com/show_bug.cgi?id=24356) with SLURM to support more `sacctmgr` features, but it has seen no activity beyond the original post.

### Path Forward

> Execute the same CLI tools via Ansible

We already have to maintain an SSH connection between the future container and the SLURM infrastructure through Ansible. Ansible allows for direct, ad-hoc command execution via its [`command`](https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html) module.

e.g.
```
ansible [pattern] -m command -a 'sacctmgr ...'
```
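
As a rough sketch of how the CLI could assemble that ad-hoc call from Python, the snippet below only builds the argument list. The helper name `build_overage_command`, the `slurm_controllers` host pattern, and the facility and cluster values are hypothetical and not taken from the `sdf-cli` code.

```python
import shlex


def build_overage_command(pattern: str, facility: str, cluster: str, nodes: int) -> list[str]:
    """Build an `ansible` ad-hoc invocation that runs sacctmgr on the remote host.

    Hypothetical helper: the account naming mirrors the write operation
    listed under "Current Usage", but the function is an illustration only.
    """
    sacctmgr = [
        "sacctmgr", "modify", "-i", "account",
        f"name={facility}:_regular_@{cluster}",
        "set", f"GrpTRES=node={nodes}",
    ]
    # The command module receives the remote command line as one string via -a.
    return ["ansible", pattern, "-m", "command", "-a", shlex.join(sacctmgr)]


# Placeholder pattern, facility, and cluster names (assumptions).
argv = build_overage_command("slurm_controllers", "facility_a", "cluster_a", 0)
print(argv)
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; building it as a list rather than a shell string keeps quoting of the inner `sacctmgr` command in one place.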

## Use of the Autogenerated Python Client

`slurmrest` ships with the ability to [generate a Python client](https://slurm.schedmd.com/rest.html#python-guide) via `openapi-generator-cli`. The OpenAPI spec was generated once using the `slurmrest` instance within [`slurm-docker-cluster`](https://github.com/giovtorres/slurm-docker-cluster) and now lives in `openapi-specs/`.

Any future updates to `slurmrest` should continue to support existing endpoints, but any new endpoints will require regenerating the OpenAPI spec, which requires a live `slurmrest` instance.

For local development, the client can be created via `make generate-client` (a Java runtime is needed). For containerization, the client is built in CI/CD and will be packaged inside the container for usage. This was chosen to keep the client out of the git history, as it is large and not managed by SLAC, while still keeping it available for usage.

## Associations Fetching

One notable change from the previous CLI-based data gathering is how associations are fetched. Previously, all associations were fetched in a single call:

```bash
sacctmgr show assoc where account={','.join(list_of_assoc)} --noheader -P format=Account,GrpNodes,GrpJobs,MaxJobs
```

In the `slurmrest` implementation, associations are fetched one account at a time, because requesting them all at once moved enough data over the network to cause instability.
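
A minimal sketch of the per-account approach, assuming the associations endpoint accepts an `account` query parameter (check the spec in `openapi-specs/` for the exact name). The host name and the helper `association_urls` are hypothetical, and the snippet only builds URLs rather than calling the generated client.

```python
from urllib.parse import urlencode, urljoin


def association_urls(base_url: str, accounts: list[str]) -> list[str]:
    """Return one associations URL per account instead of one comma-joined query."""
    endpoint = urljoin(base_url, "/slurmdb/v0.0.44/associations/")
    # `account` as the filter parameter name is an assumption, not confirmed here.
    return [f"{endpoint}?{urlencode({'account': name})}" for name in accounts]


# Placeholder host and account names (assumptions).
urls = association_urls("http://slurmrestd.example:6820", ["acct_a", "acct_b"])
for url in urls:
    print(url)
```

Each request then carries a small response, at the cost of one round trip per account.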

## Data Format Differences

Several format differences between sacct/sacctmgr CLI output and slurmrest REST API responses required explicit handling in the migration.

### Jobs: Memory TRES units

sacct serialises memory TRES with a unit suffix (K/M/G), e.g.:

```
AllocTRES=cpu=128,mem=512G,node=4,billing=128,gres/gpu:a100=4
```

slurmrest returns memory TRES `count` as a **bare integer in megabytes** with no suffix. To remain compatible with `_kilos_to_int()`, which was written for sacct-style suffixed strings, the migration appends an `M` suffix when serialising memory from the REST response:

```python
value = f"{tres.count}M" if tres.type == "mem" else str(tres.count)
```

Without this, `_kilos_to_int("524288")` would interpret the value as bytes rather than megabytes, producing a ~1,048,576× (1024²) underestimate relative to `cluster["mem"]` (which is stored in bytes from `nodememgb * 1073741824`).
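
To illustrate why the suffix matters, here is a small sketch of a suffix-aware parser. `suffixed_to_bytes` is a hypothetical stand-in for `_kilos_to_int()`, whose real implementation is not shown in this document; it assumes binary multiples and treats a bare number as bytes.

```python
# Hypothetical stand-in for `_kilos_to_int()`; binary multiples are an assumption.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def suffixed_to_bytes(value: str) -> int:
    """Convert '512G' or '524288M' to bytes; a bare number is taken as bytes."""
    if value and value[-1].upper() in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1].upper()]
    return int(value)


rest_count = 524288  # megabytes, as slurmrest would report for the 512G example
with_suffix = suffixed_to_bytes(f"{rest_count}M")
bare = suffixed_to_bytes(str(rest_count))

print(with_suffix == suffixed_to_bytes("512G"))  # suffixed REST value matches sacct's 512G
print(bare < with_suffix)                        # the bare value is far too small
```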

### Jobs: Time fields

sacct returns Unix timestamps as integers when `SLURM_TIME_FORMAT=%s` is set. The pre-migration code then called `parse_datetime(int(d["Start"]), force_tz=True)` to convert them.

slurmrest returns timestamps as integers in the same epoch-second format, but nested under a `time` struct:

```python
# sacct: int(d["Start"]) → unix timestamp
# REST: job.time.start → unix timestamp (same value, different path)
pendulum.from_timestamp(job.time.start)
```

No unit conversion is needed, but a guard for `0` / falsy values is required since slurmrest uses `0` to indicate "not set" (e.g. a job that never started has `time.start == 0`).
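
The guard can be sketched as follows. This version uses the standard library's `datetime` so that it runs without extra dependencies; `pendulum.from_timestamp` would slot into the same place. The helper name `epoch_to_datetime` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def epoch_to_datetime(seconds: Optional[int]) -> Optional[datetime]:
    """Return an aware UTC datetime, or None when slurmrest reports 0 / unset."""
    if not seconds:
        # 0 (or a missing value) means "not set", e.g. a job that never started.
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_datetime(0))           # job that never started
print(epoch_to_datetime(1700000000))  # ordinary epoch-second timestamp
```

Returning `None` instead of the 1970 epoch keeps "never started" distinguishable from a real timestamp downstream.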

### Jobs: TRES key format for GPUs

sacct uses `gres/gpu:a100=4` (type/name:subtype=count). slurmrest splits this into `tres.type = "gres/gpu"`, `tres.name = "a100"`, `tres.count = 4`. The migration reconstructs the sacct-style key as `f"{tres.type}/{tres.name}"` where a name is present, otherwise just `tres.type`. The `_calc_resource_hours` GPU detection (`if "gpu" in k`) handles both forms correctly.
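
A sketch of the key reconstruction under the field layout described above. The `Tres` dataclass is a hypothetical stand-in for the generated client's TRES model, and `tres_key` / `to_alloc_tres` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Tres:
    """Hypothetical stand-in for the generated client's TRES model."""
    type: str
    name: str
    count: int


def tres_key(tres: Tres) -> str:
    """Join type and name when a name is present, otherwise use the type alone."""
    return f"{tres.type}/{tres.name}" if tres.name else tres.type


def to_alloc_tres(items: list[Tres]) -> str:
    """Serialise TRES entries into a comma-separated key=count string."""
    return ",".join(f"{tres_key(t)}={t.count}" for t in items)


items = [Tres("cpu", "", 128), Tres("node", "", 4), Tres("gres/gpu", "a100", 4)]
alloc = to_alloc_tres(items)
gpu_keys = [tres_key(t) for t in items if "gpu" in tres_key(t)]
print(alloc)
print(gpu_keys)
```

With these field values the GPU key is joined with a slash rather than sacct's colon; a substring check such as `"gpu" in k` matches either spelling.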
6 changes: 2 additions & 4 deletions import-jobs.sh
@@ -24,15 +24,13 @@ echo ">" $DATE" ("$(date)")"
# full
./venv/bin/python3 ./sdf_click.py coact slurmdump --date $DATE \
| tee ../slurm-job-history/$DATE \
-| ./venv/bin/python3 ./sdf_click.py coact slurmremap \
-| tee ../slurm-job-remapped/$DATE \
| ./venv/bin/python3 ./sdf_click.py coact slurmimport --password-file $PASSWORD_FILE --output=upload >/dev/null

# just for 2023 imports
-#cat ../slurm-job-remapped/$DATE | ./sdf.py coact slurmimport --password-file $PASSWORD_FILE --output=upload >/dev/null
+#cat ../slurm-job-history/$DATE | ./sdf.py coact slurmimport --password-file $PASSWORD_FILE --output=upload >/dev/null

# don't pull data from slurm
-#cat ../slurm-job-history/$DATE | ./sdf.py coact slurmremap | tee ../slurm-job-remapped/$DATE | ./sdf.py coact slurmimport --password-file $PASSWORD_FILE --output=upload >/dev/null
+#cat ../slurm-job-history/$DATE | ./sdf.py coact slurmimport --password-file $PASSWORD_FILE --output=upload >/dev/null

###
# recalculate summaries