[FLINK-39404][runtime] HardwareDescription reports incorrect CPU cores in containerized environments with fractional CPU limits #27899
Open
Dennis-Mircea wants to merge 3 commits into apache:master
@flinkbot run azure re-run the last Azure build |
What is the purpose of the change
This PR fixes inaccurate CPU core reporting in containerized environments (Kubernetes, YARN) where fractional CPU limits are used (e.g. 0.5 cores).
Currently, Hardware.getNumberCPUCores() relies on Runtime.getRuntime().availableProcessors(), which returns an int and therefore rounds fractional container CPU limits up to the next whole number (0.5 → 1). This causes:
- ClusterEntrypointUtils computes 4 * ceil(0.5) = 4 threads instead of ceil(4 * 0.5) = 2.
Brief change log
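As background for the changes listed here, the rounding-order difference can be illustrated with a small standalone sketch. The class and method names below are hypothetical and are not part of the Flink code base; only the factor of 4 and the two formulas come from the description above.

```java
// Hypothetical illustration only: compares the two rounding orders from the PR description.
final class ThreadPoolSizingSketch {

    // Old order: the CPU limit is rounded up first (as availableProcessors() effectively
    // does for a fractional limit), then multiplied by 4.
    static int roundThenMultiply(double cpuLimit) {
        int cores = (int) Math.ceil(cpuLimit);
        return 4 * cores;
    }

    // New order: multiply the fractional limit by 4 first, then round up.
    static int multiplyThenRound(double cpuLimit) {
        return (int) Math.ceil(4 * cpuLimit);
    }

    public static void main(String[] args) {
        for (double limit : new double[] {0.5, 1.0, 1.5, 2.0}) {
            System.out.println(
                    "limit=" + limit
                            + " roundThenMultiply=" + roundThenMultiply(limit)
                            + " multiplyThenRound=" + multiplyThenRound(limit));
        }
    }
}
```

The two orders agree for whole-number limits and differ only for fractional ones, which is why non-containerized setups should see no change.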
Hardware.java:
- getContainerCpuLimit() - reads the actual container CPU limit from Linux cgroup files (v2: /sys/fs/cgroup/cpu.max, v1: cpu.cfs_quota_us / cpu.cfs_period_us). Returns the fractional value (e.g. 0.5), or -1 if not in a container or no limit is set.
- getNumberCPUCoresAsDouble() - returns the fractional CPU core count (the container limit, falling back to availableProcessors()). Use this when fractional precision matters (display, arithmetic before rounding).
HardwareDescription.java:
- Changed numberOfCPUCores from int to double (field, constructor, getter, equals, toString).
- extractFromSystem() now uses Hardware.getNumberCPUCoresAsDouble().
- cpuCores now emits fractional values (e.g. 0.5 instead of 1).
ClusterEntrypointUtils.java:
- Changed 4 * Hardware.getNumberCPUCores() to (int) Math.ceil(4 * Hardware.getNumberCPUCoresAsDouble()) so the multiplication happens before the ceiling, avoiding thread over-provisioning (0.5 CPU → 2 threads instead of 4).
Verifying this change
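For reviewers who want to check the cgroup lookup by hand, the following is a minimal sketch of how a quota/period pair can be turned into a fractional core count. It is not the code from this PR: the class name, helper names, and error handling are assumptions. Only the /sys/fs/cgroup/cpu.max path, the v1 file names, and the -1 "no limit" convention come from the change log.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not the PR's implementation.
final class CgroupCpuLimitSketch {

    static final double NO_LIMIT = -1.0;

    // cgroup v2: cpu.max holds "<quota> <period>", or "max <period>" when unlimited.
    static double parseCgroupV2(String cpuMax) {
        String[] parts = cpuMax.trim().split("\\s+");
        if (parts.length != 2 || "max".equals(parts[0])) {
            return NO_LIMIT;
        }
        try {
            return ratio(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
        } catch (NumberFormatException e) {
            return NO_LIMIT;
        }
    }

    // cgroup v1: quota and period live in separate files; a quota of -1 means unlimited.
    static double parseCgroupV1(String quotaText, String periodText) {
        try {
            return ratio(Long.parseLong(quotaText.trim()), Long.parseLong(periodText.trim()));
        } catch (NumberFormatException e) {
            return NO_LIMIT;
        }
    }

    private static double ratio(long quota, long period) {
        return (quota > 0 && period > 0) ? (double) quota / period : NO_LIMIT;
    }

    // Reads a cpu.max-style file; any read failure is treated as "no limit".
    static double readCgroupV2(Path cpuMaxFile) {
        try {
            byte[] bytes = Files.readAllBytes(cpuMaxFile);
            return parseCgroupV2(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException | SecurityException e) {
            return NO_LIMIT;
        }
    }

    // Uses the container limit when present, otherwise the JVM-reported processor count.
    static double coresAsDouble(double containerLimit, int availableProcessors) {
        return containerLimit > 0 ? containerLimit : availableProcessors;
    }

    public static void main(String[] args) {
        double limit = readCgroupV2(Paths.get("/sys/fs/cgroup/cpu.max"));
        double cores = coresAsDouble(limit, Runtime.getRuntime().availableProcessors());
        System.out.println("container limit = " + limit + ", cores = " + cores);
    }
}
```

Keeping the parsing separate from the file access makes the quota/period logic testable on machines without cgroups, which may be a useful pattern for unit tests of this kind of change.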
- No container CPU limit: getContainerCpuLimit() returns -1 and getNumberCPUCores() falls back to availableProcessors(), so behavior is unchanged.
- Container with a fractional CPU limit (cpu: 0.5):
  - getContainerCpuLimit() returns 0.5
  - getNumberCPUCoresAsDouble() returns 0.5
  - getNumberCPUCores() returns 1 (ceiling)
  - cpuCores reports 0.5 instead of 1
  - ClusterEntrypointUtils computes ceil(4 * 0.5) = 2 threads instead of 4
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): no
Documentation