Add NRP OSPool EP image builds#307
Conversation
…fixed condor config
…egrid/osgvo-docker-pilot:23-el9-releasee
|
Oops, wrong issue name! |
There was a problem hiding this comment.
Pull request overview
Adds a new osg-htc/nrp-ospool-ep container image (NRP OSPool EP) and updates the GitHub Actions container build workflow to include osg-htc/ images in addition to opensciencegrid/.
Changes:
- Introduces a new
osg-htc/nrp-ospool-epimage (Dockerfile + build-config) with entrypoint/init scripts for pilot/Condor behavior in k8s. - Adds wrapper scripts for singularity/apptainer to strip
--pid/-pusage. - Updates
.github/workflows/build-containers.ymlimage discovery to includeosg-htc/directories.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/build-containers.yml |
Expands image discovery to include osg-htc/ roots. |
osg-htc/nrp-ospool-ep/Dockerfile |
Builds the NRP OSPool EP-derived image; adds kubectl, wrappers, and entrypoint/init wiring. |
osg-htc/nrp-ospool-ep/build-config.json |
Declares build matrix parameters for this image. |
osg-htc/nrp-ospool-ep/scripts/entrypoint.sh |
New top-level entrypoint that sources config snippets, runs the original entrypoint as osg, and supervises shutdown. |
osg-htc/nrp-ospool-ep/scripts/check_master.sh |
Background monitor intended to detect condor_master termination/restart and dump logs. |
osg-htc/nrp-ospool-ep/scripts/singularity_npid.sh |
Wrapper to drop --pid/-p and rewrite --containall/-C to omit pid namespace. |
osg-htc/nrp-ospool-ep/scripts/apptainer_npid.sh |
Same as above for apptainer. |
osg-htc/nrp-ospool-ep/scripts/01_token.sh |
Exports TOKEN from the mounted token file for osgvo-pilot. |
osg-htc/nrp-ospool-ep/scripts/01_no_condor_host.sh |
Unsets CONDOR_HOST before starting the pilot. |
osg-htc/nrp-ospool-ep/scripts/11_set_OSGInstitutionID.sh |
Attempts to derive and export OSG_INSTITUTION_ID via kubectl node label. |
osg-htc/nrp-ospool-ep/scripts/02_validate_singularity.sh |
Runtime validation for singularity execution (with optional --nv). |
osg-htc/nrp-ospool-ep/scripts/02_validate_apptainer.sh |
Runtime validation for apptainer execution (with optional --nv). |
osg-htc/nrp-ospool-ep/scripts/19_set_resources.sh |
Writes resource/slot config into the pilot config file. |
osg-htc/nrp-ospool-ep/scripts/20_advertise_glidein.sh |
Advertises IS_GLIDEIN into STARTD_EXPRS. |
osg-htc/nrp-ospool-ep/scripts/20_advertise_k8s_domain.sh |
Advertises k8s-related attributes (pod/domain/namespace/physical hostname). |
osg-htc/nrp-ospool-ep/scripts/21_advertise_k8s_provisioner.sh |
Advertises provisioner name/type into STARTD_EXPRS. |
osg-htc/nrp-ospool-ep/scripts/22_set_requirements.sh |
Adds matching/provisioning expressions to START based on requested resources and namespace matching. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| tail -100 /pilot/log/MasterLog | ||
|
|
||
| echo "=== tail /pilot/log/StartLog" | ||
| tail -100 /pilotl/log/StartLog |
There was a problem hiding this comment.
The StartLog tail path has a typo: /pilotl/log/StartLog will not exist (looks like it should be /pilot/log/StartLog). This will make the debugging output misleading right when it's needed.
| tail -100 /pilotl/log/StartLog | |
| tail -100 /pilot/log/StartLog |
| echo "condor_master restrated" 1>&2 | ||
| break |
There was a problem hiding this comment.
Typo in error message: "restrated" should be "restarted".
| su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" & | ||
| myproc=$! | ||
|
|
||
| # protection in case it does not terminate oby itself when condor dies or restarts |
There was a problem hiding this comment.
Typo in comment: "oby" should be "on its own" (or similar).
| # protection in case it does not terminate oby itself when condor dies or restarts | |
| # protection in case it does not terminate on its own when condor dies or restarts |
| # Add kubectl, to be able to interact with the k8s cluster | ||
| RUN curl -L "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -o /usr/sbin/kubectl && \ | ||
| chmod u+x /usr/sbin/kubectl |
There was a problem hiding this comment.
This downloads the latest kubectl at build time via stable.txt and doesn't verify an expected checksum/signature. That makes builds non-reproducible and introduces supply-chain risk (a compromised download endpoint would be silently trusted). Prefer pinning to an explicit kubectl version and validating the downloaded binary (e.g., sha256) during the build.
| # else do nothing, let Condor figure it out | ||
|
|
||
| if [ -f "/usr/bin/apptainer" ]; then | ||
| # only test for apptainer functionality if singularity is present |
There was a problem hiding this comment.
The comment says "only test for apptainer functionality if singularity is present" but this script is checking for /usr/bin/apptainer. The comment is misleading; update it to refer to apptainer.
| # only test for apptainer functionality if singularity is present | |
| # only test for apptainer functionality if apptainer is present |
|
|
||
|
|
||
| # | ||
| # Advertise the k8s namespace and physical hostname |
There was a problem hiding this comment.
The header comment says this script advertises the k8s namespace/hostname, but the body is setting NUM_CPUS/MEMORY/DISK/GPU and slot config. Please update the comment to match the actual behavior so future readers don't get misled.
| # Advertise the k8s namespace and physical hostname | |
| # Set resource and slot configuration in the pilot config file, | |
| # including CPUs, memory, disk, and optional GPUs. |
| grep -E "${ORG_DIR_REGEX}") | ||
| fi | ||
|
|
||
| image_json=$(echo -n "${images:-dummy}" | jq -Rcs '.|split("\n") | map(select(. != ""))') |
There was a problem hiding this comment.
Including osg-htc/ in ORG_DIR_REGEX will cause the image list to include paths like osg-htc/nrp-ospool-ep, but the rest of this workflow still hard-codes CONTEXT="opensciencegrid/${{ matrix.name }}" (both build and push jobs). Because build-job-matrix.py sets matrix.name to the basename of the image dir, osg-htc images will be built/pushed from a non-existent opensciencegrid/<name> context. Update the matrix generation and/or workflow so the org prefix is preserved and CONTEXT is set to the actual image directory (e.g., osg-htc/nrp-ospool-ep).
| image_json=$(echo -n "${images:-dummy}" | jq -Rcs '.|split("\n") | map(select(. != ""))') | |
| image_json=$(echo -n "${images:-dummy}" | jq -Rcs ' | |
| split("\n") | |
| | map(select(. != "")) | |
| | map({ | |
| name: (split("/") | last), | |
| context: . | |
| }) | |
| ') |
| echo "condor_master restrated at first step" 1>&2 | ||
| exit 1 |
There was a problem hiding this comment.
Typo in error message: "restrated" should be "restarted".
| trap 'echo signal received!; kill $(jobs -p); wait' SIGINT SIGTERM | ||
|
|
||
| export HOME=/pilot | ||
| su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" & |
There was a problem hiding this comment.
su ... -c "/usr/local/sbin/entrypoint.osg.sh $@" expands $@ in the current shell and loses argument boundaries/quoting (args containing spaces, quotes, glob chars, etc.). This can break passing runtime flags through to entrypoint.osg.sh. Consider using a form that passes arguments separately (via su's --/arguments support or an intermediate bash -lc wrapper) so "$@" semantics are preserved.
| su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" & | |
| su osg -p -s /bin/bash -c 'exec /usr/local/sbin/entrypoint.osg.sh "$@"' bash "$@" & |
| rm -f /root/master_pids.txt | ||
| awk '/CondorVersion/{split($3,a,":"); split(a[2],b,")"); print b[1]}' /pilot/log/MasterLog > /root/master_pids.txt | ||
| npids=`cat /root/master_pids.txt | wc -l` | ||
|
|
||
| if [ $npids -lt 1 ]; then | ||
| echo "condor_master log empty, unexpected after first step" 1>&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| if [ $npids -gt 1 ]; then | ||
| echo "condor_master restrated" 1>&2 | ||
| break | ||
| fi | ||
|
|
||
| if [ $nprocs -ne 1 ]; then | ||
| echo "condor_master not running" 1>&2 | ||
| break |
There was a problem hiding this comment.
check_master.sh computes nprocs only once (before the loop) and never refreshes it, but the loop later uses nprocs to decide whether condor_master is still running. As written, the loop can miss a condor_master exit/crash. Recompute nprocs (and ideally use ps -p "$orgpid") each iteration before the nprocs check.
I bet this will fail, at the very least because the Harbor user for pushing images probably doesn't have permissions to push to the
osg-htcproject.