fix(training): resolve Azure ML RL submission path and runtime dependency failures #320
Description
Training Framework
SKRL
Issue Type
Training script error
Python Version
3.11
GPU Model
A10
Isaac Sim/Lab Version
Isaac Lab 2.3.2 / Isaac Sim container nvcr.io/nvidia/isaac-lab:2.3.2
Issue Description
Submitting the RL Azure ML job from training/rl currently fails when using the repository implementation in training/rl/scripts/submit-azureml-training.sh.
The first observed failure is an entrypoint path mismatch. Azure ML accepts the job submission, but the run fails immediately because the submitted command points to training/scripts/train.sh, while the repository entrypoint exists at training/rl/scripts/train.sh.
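The mismatch can be caught locally before submission. Below is a minimal sketch that checks candidate entrypoint paths against a directory tree; the helper name `missing_entrypoints` is hypothetical, and the temporary tree only mirrors the two paths named in this issue rather than the real repository.

```python
from pathlib import Path
import tempfile


def missing_entrypoints(code_root: Path, candidates: list[str]) -> list[str]:
    """Return the candidate script paths that are not files under code_root."""
    return [rel for rel in candidates if not (code_root / rel).is_file()]


if __name__ == "__main__":
    # Throwaway tree that mirrors only the layout described in this issue.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        script = root / "training" / "rl" / "scripts" / "train.sh"
        script.parent.mkdir(parents=True)
        script.write_text("#!/usr/bin/env bash\n")

        print(missing_entrypoints(root, [
            "training/scripts/train.sh",     # path the submitted command uses
            "training/rl/scripts/train.sh",  # path that exists in the repository
        ]))
```

Any path reported by the helper would produce the same `No such file or directory` failure once the job starts.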
The next constraint discovered during validation is that a narrow upload rooted only at training/rl is not sufficient by itself. The runtime launched by training/rl/scripts/train.sh imports from the top-level training package, including training.rl, training.utils, and training.stream. That means a training/rl-only snapshot would fix the shell path problem but then fail on Python imports unless the uploaded payload also preserves the parent training/ package layout.
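The import constraint can be illustrated with a filesystem-only check. This is a rough sketch, not the repository's validation logic: `unresolved_modules` is a hypothetical helper, and it approximates importability by looking for a matching directory or `.py` file under the snapshot root instead of actually importing anything.

```python
from pathlib import Path
import tempfile

# Modules the issue says the runtime imports from the top-level package.
REQUIRED = ("training.rl", "training.utils", "training.stream")


def unresolved_modules(snapshot_root: Path, modules=REQUIRED) -> list[str]:
    """Return dotted names with no package dir or .py file under snapshot_root."""
    missing = []
    for name in modules:
        target = snapshot_root.joinpath(*name.split("."))
        if not (target.is_dir() or target.with_suffix(".py").is_file()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for sub in ("rl", "utils", "stream"):
            (root / "training" / sub).mkdir(parents=True)

        # Snapshot rooted above training/: the package layout is preserved.
        print(unresolved_modules(root))
        # Snapshot rooted at training/rl only: the parent package is gone.
        print(unresolved_modules(root / "training" / "rl"))
```

Under these assumptions, only the snapshot rooted above `training/` resolves all three modules.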
Additionally, the entire repository is uploaded as part of the AML job because the .amlignore file is not in the expected location.
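A quick way to see where the ignore file sits relative to the uploaded directory is sketched below. It assumes the expected location is the root of the uploaded code directory, which this issue does not state explicitly; `ignore_file_status` is a hypothetical helper and the temporary tree is illustrative only.

```python
from pathlib import Path
import tempfile


def ignore_file_status(repo_root: Path, code_root: Path, name: str = ".amlignore"):
    """Return (present_at_code_root, other_locations) for the named ignore file."""
    at_root = (code_root / name).is_file()
    # Collect copies of the file that live anywhere other than the code root.
    elsewhere = sorted(
        p.relative_to(repo_root).as_posix()
        for p in repo_root.rglob(name)
        if p.is_file() and p.parent != code_root
    )
    return at_root, elsewhere


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        nested = root / "training" / "rl"
        nested.mkdir(parents=True)
        (nested / ".amlignore").write_text("outputs/\n")

        # Uploading from the repository root: the file exists, but not at the root.
        print(ignore_file_status(root, root))
        # Uploading from the nested directory: the file is at the code root.
        print(ignore_file_status(root, nested))
```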
Training Configuration
Just running from `/training/rl`: `./scripts/submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0`
Error Traceback
{"NonCompliant":"Execution failed. User process 'bash' exited with status code 127. Please check log file 'user_logs/std_log.txt' for error details. Error: bash: training/scripts/train.sh: No such file or directory\n"}
{"code": "ExecutionFailed", "target": "", "category": "UserError", "error_details": [{"key": "exit_codes", "value": "127"}]}
Checklist
- I have verified my environment is synced with `uv sync`
- I have tested with a minimal configuration