Skip to content

Add NRP OSPool EP image builds#307

Closed
brianhlin wants to merge 34 commits intoopensciencegrid:mainfrom
brianhlin:OPS-158.nrp-ospool-ep
Closed

Add NRP OSPool EP image builds#307
brianhlin wants to merge 34 commits intoopensciencegrid:mainfrom
brianhlin:OPS-158.nrp-ospool-ep

Conversation

@brianhlin
Copy link
Copy Markdown
Member

I bet this will fail, at the very least because the Harbor user for pushing images probably doesn't have permissions to push to the osg-htc project.

@brianhlin brianhlin requested a review from a team as a code owner April 17, 2026 22:53
Copilot AI review requested due to automatic review settings April 17, 2026 22:53
@brianhlin
Copy link
Copy Markdown
Member Author

Oops, wrong issue name!

@brianhlin brianhlin closed this Apr 17, 2026
@brianhlin brianhlin deleted the OPS-158.nrp-ospool-ep branch April 17, 2026 22:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new osg-htc/nrp-ospool-ep container image (NRP OSPool EP) and updates the GitHub Actions container build workflow to include osg-htc/ images in addition to opensciencegrid/.

Changes:

  • Introduces a new osg-htc/nrp-ospool-ep image (Dockerfile + build-config) with entrypoint/init scripts for pilot/Condor behavior in k8s.
  • Adds wrapper scripts for singularity/apptainer to strip --pid / -p usage.
  • Updates .github/workflows/build-containers.yml image discovery to include osg-htc/ directories.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
.github/workflows/build-containers.yml Expands image discovery to include osg-htc/ roots.
osg-htc/nrp-ospool-ep/Dockerfile Builds the NRP OSPool EP-derived image; adds kubectl, wrappers, and entrypoint/init wiring.
osg-htc/nrp-ospool-ep/build-config.json Declares build matrix parameters for this image.
osg-htc/nrp-ospool-ep/scripts/entrypoint.sh New top-level entrypoint that sources config snippets, runs the original entrypoint as osg, and supervises shutdown.
osg-htc/nrp-ospool-ep/scripts/check_master.sh Background monitor intended to detect condor_master termination/restart and dump logs.
osg-htc/nrp-ospool-ep/scripts/singularity_npid.sh Wrapper to drop --pid/-p and rewrite --containall/-C to omit pid namespace.
osg-htc/nrp-ospool-ep/scripts/apptainer_npid.sh Same as above for apptainer.
osg-htc/nrp-ospool-ep/scripts/01_token.sh Exports TOKEN from the mounted token file for osgvo-pilot.
osg-htc/nrp-ospool-ep/scripts/01_no_condor_host.sh Unsets CONDOR_HOST before starting the pilot.
osg-htc/nrp-ospool-ep/scripts/11_set_OSGInstitutionID.sh Attempts to derive and export OSG_INSTITUTION_ID via kubectl node label.
osg-htc/nrp-ospool-ep/scripts/02_validate_singularity.sh Runtime validation for singularity execution (with optional --nv).
osg-htc/nrp-ospool-ep/scripts/02_validate_apptainer.sh Runtime validation for apptainer execution (with optional --nv).
osg-htc/nrp-ospool-ep/scripts/19_set_resources.sh Writes resource/slot config into the pilot config file.
osg-htc/nrp-ospool-ep/scripts/20_advertise_glidein.sh Advertises IS_GLIDEIN into STARTD_EXPRS.
osg-htc/nrp-ospool-ep/scripts/20_advertise_k8s_domain.sh Advertises k8s-related attributes (pod/domain/namespace/physical hostname).
osg-htc/nrp-ospool-ep/scripts/21_advertise_k8s_provisioner.sh Advertises provisioner name/type into STARTD_EXPRS.
osg-htc/nrp-ospool-ep/scripts/22_set_requirements.sh Adds matching/provisioning expressions to START based on requested resources and namespace matching.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tail -100 /pilot/log/MasterLog

echo "=== tail /pilot/log/StartLog"
tail -100 /pilotl/log/StartLog
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StartLog tail path has a typo: /pilotl/log/StartLog will not exist (looks like it should be /pilot/log/StartLog). This will make the debugging output misleading right when it's needed.

Suggested change
tail -100 /pilotl/log/StartLog
tail -100 /pilot/log/StartLog

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +46
echo "condor_master restrated" 1>&2
break
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in error message: "restrated" should be "restarted".

Copilot uses AI. Check for mistakes.
su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" &
myproc=$!

# protection in case it does not terminate oby itself when condor dies or restarts
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "oby" should be "on its own" (or similar).

Suggested change
# protection in case it does not terminate oby itself when condor dies or restarts
# protection in case it does not terminate on its own when condor dies or restarts

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +24
# Add kubectl, to be able to interact with the k8s cluster
RUN curl -L "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -o /usr/sbin/kubectl && \
chmod u+x /usr/sbin/kubectl
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This downloads the latest kubectl at build time via stable.txt and doesn't verify an expected checksum/signature. That makes builds non-reproducible and introduces supply-chain risk (a compromised download endpoint would be silently trusted). Prefer pinning to an explicit kubectl version and validating the downloaded binary (e.g., sha256) during the build.

Copilot uses AI. Check for mistakes.
# else do nothing, let Condor figure it out

if [ -f "/usr/bin/apptainer" ]; then
# only test for apptainer functionality if singularity is present
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "only test for apptainer functionality if singularity is present" but this script is checking for /usr/bin/apptainer. The comment is misleading; update it to refer to apptainer.

Suggested change
# only test for apptainer functionality if singularity is present
# only test for apptainer functionality if apptainer is present

Copilot uses AI. Check for mistakes.


#
# Advertise the k8s namespace and physical hostname
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header comment says this script advertises the k8s namespace/hostname, but the body is setting NUM_CPUS/MEMORY/DISK/GPU and slot config. Please update the comment to match the actual behavior so future readers don't get misled.

Suggested change
# Advertise the k8s namespace and physical hostname
# Set resource and slot configuration in the pilot config file,
# including CPUs, memory, disk, and optional GPUs.

Copilot uses AI. Check for mistakes.
grep -E "${ORG_DIR_REGEX}")
fi

image_json=$(echo -n "${images:-dummy}" | jq -Rcs '.|split("\n") | map(select(. != ""))')
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including osg-htc/ in ORG_DIR_REGEX will cause the image list to include paths like osg-htc/nrp-ospool-ep, but the rest of this workflow still hard-codes CONTEXT="opensciencegrid/${{ matrix.name }}" (both build and push jobs). Because build-job-matrix.py sets matrix.name to the basename of the image dir, osg-htc images will be built/pushed from a non-existent opensciencegrid/<name> context. Update the matrix generation and/or workflow so the org prefix is preserved and CONTEXT is set to the actual image directory (e.g., osg-htc/nrp-ospool-ep).

Suggested change
image_json=$(echo -n "${images:-dummy}" | jq -Rcs '.|split("\n") | map(select(. != ""))')
image_json=$(echo -n "${images:-dummy}" | jq -Rcs '
split("\n")
| map(select(. != ""))
| map({
name: (split("/") | last),
context: .
})
')

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +20
echo "condor_master restrated at first step" 1>&2
exit 1
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in error message: "restrated" should be "restarted".

Copilot uses AI. Check for mistakes.
trap 'echo signal received!; kill $(jobs -p); wait' SIGINT SIGTERM

export HOME=/pilot
su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" &
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

su ... -c "/usr/local/sbin/entrypoint.osg.sh $@" expands $@ in the current shell and loses argument boundaries/quoting (args containing spaces, quotes, glob chars, etc.). This can break passing runtime flags through to entrypoint.osg.sh. Consider using a form that passes arguments separately (via su's --/arguments support or an intermediate bash -lc wrapper) so "$@" semantics are preserved.

Suggested change
su osg -p -c "/usr/local/sbin/entrypoint.osg.sh $@" &
su osg -p -s /bin/bash -c 'exec /usr/local/sbin/entrypoint.osg.sh "$@"' bash "$@" &

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +51
rm -f /root/master_pids.txt
awk '/CondorVersion/{split($3,a,":"); split(a[2],b,")"); print b[1]}' /pilot/log/MasterLog > /root/master_pids.txt
npids=`cat /root/master_pids.txt | wc -l`

if [ $npids -lt 1 ]; then
echo "condor_master log empty, unexpected after first step" 1>&2
exit 1
fi

if [ $npids -gt 1 ]; then
echo "condor_master restrated" 1>&2
break
fi

if [ $nprocs -ne 1 ]; then
echo "condor_master not running" 1>&2
break
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_master.sh computes nprocs only once (before the loop) and never refreshes it, but the loop later uses nprocs to decide whether condor_master is still running. As written, the loop can miss a condor_master exit/crash. Recompute nprocs (and ideally use ps -p "$orgpid") each iteration before the nprocs check.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants