fix: evict expired entries in TokenCache.Get() to prevent memory leak #334

Open
HarshitPal25 wants to merge 1 commit into volcano-sh:main from HarshitPal25:fix/token-cache-stale-entry-leak

Conversation

@HarshitPal25

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a memory leak in TokenCache.Get() where expired entries are never evicted from the cache.

Previously, when TokenCache.Get() encountered an expired entry, it returned (false, false, "") but left the stale entry in the cache map and LRU list. Under high-cardinality token usage (e.g., short-lived K8s service account tokens rotated frequently), dead entries accumulate until LRU eviction pressure pushes them out, starving valid tokens from being cached.

ClientCache.Get() in the same file (client_cache.go:110-114) correctly evicts expired entries inline, proving this was an oversight in TokenCache.

The root cause is that TokenCache.Get() used RLock() (read lock), which prevented it from mutating the cache to remove stale entries. ClientCache.Get() uses a full Lock() and removes expired entries correctly.

This PR:

  • Upgrades TokenCache.Get() from RLock to Lock so it can mutate state on expiry.
  • Evicts expired entries inline (removes from both the LRU list and the map), matching the pattern established by ClientCache.Get().
  • Promotes accessed entries in the LRU list on Get() for consistent eviction ordering.
  • Adds a missing assertion in TestTokenCache_Get_Expired verifying that expired entries are actually removed (cache.Size() == 0), not just hidden by the return value.
  • Updates TestTokenCache_LRUBehavior to reflect that Get() now promotes entries, so eviction order changes correctly.

Which issue(s) this PR fixes:

None (discovered during code review)

Special notes for your reviewer:

The lock upgrade from RLock to Lock in Get() trades a small amount of read concurrency for correctness. Under production workloads, Get() calls are short-lived (map lookup + time check), so the contention impact should be negligible. This matches how ClientCache.Get() already operates with a full Lock().

All existing tests pass with go test -race.

Does this PR introduce a user-facing change?:

NONE

TokenCache.Get() was returning (false, false, "") for expired entries
without actually removing them from the cache, causing stale entries
to accumulate until LRU eviction pressure pushed them out. Under
high-cardinality token usage (e.g., short-lived K8s service account
tokens), this could fill the cache with dead entries and starve valid
tokens.

ClientCache.Get() in the same file correctly evicts expired entries,
proving this was an oversight in TokenCache.

Changes:
- Upgrade TokenCache.Get() from RLock to Lock so it can mutate state.
- Evict expired entries inline (remove from LRU list and map), matching
  the pattern established by ClientCache.Get().
- Promote accessed entries in the LRU list on Get(), ensuring consistent
  eviction ordering (previously only Set() promoted).
- Add missing assertion in TestTokenCache_Get_Expired verifying that
  expired entries are actually removed (cache.Size() == 0), not just
  hidden by the return value.
- Update TestTokenCache_LRUBehavior to reflect that Get() now promotes
  entries, so eviction order changes correctly.

Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
Copilot AI review requested due to automatic review settings May 13, 2026 19:59
@volcano-sh-bot volcano-sh-bot added the kind/bug Something isn't working label May 13, 2026
@volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment


Pull request overview

Fixes a memory leak in TokenCache.Get() where expired entries remained in the cache map and LRU list, mirroring the eviction behavior already present in ClientCache.Get(). Also adds LRU promotion on Get() for consistent ordering.

Changes:

  • Upgrade TokenCache.Get() from RLock to Lock and evict expired entries inline (remove from both LRU list and map).
  • Promote accessed entries to the front of the LRU list on Get().
  • Update tests: assert size goes to 0 after expiry; adjust LRU behavior test to reflect promotion semantics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/workloadmanager/client_cache.go | Replaces read lock with write lock in TokenCache.Get, evicts expired entries, and promotes entries on access. |
| pkg/workloadmanager/client_cache_test.go | Adds size assertion to expiry test and updates LRU behavior test to reflect Get-as-LRU-promotion. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the TokenCache.Get method to evict expired entries and maintain proper LRU ordering by promoting accessed items. To support these modifications, the locking mechanism was changed from a read-lock to a write-lock. Corresponding tests were updated to verify cache eviction and the new LRU behavior. Feedback suggests also updating the lastAccess timestamp during retrieval to implement a sliding TTL, which would prevent frequently used tokens from expiring prematurely.

Comment on lines +249 to +250
// Move to front on access for proper LRU ordering
c.lruList.MoveToFront(entry.element)

medium

While the entry is correctly promoted in the LRU list, entry.lastAccess is not updated. This results in a fixed TTL from the time the entry was first added or updated via Set(), rather than a sliding TTL from the last time it was accessed. Given the field name lastAccess and the LRU nature of the cache, it is more idiomatic to reset the expiration timer on each Get() call. This ensures that frequently used tokens remain in the cache as long as they are active, reducing unnecessary re-validation calls to the Kubernetes TokenReview API.

Suggested change
- // Move to front on access for proper LRU ordering
- c.lruList.MoveToFront(entry.element)
+ // Move to front on access for proper LRU ordering and update access time for sliding TTL
+ entry.lastAccess = time.Now()
+ c.lruList.MoveToFront(entry.element)

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.17%. Comparing base (524e55e) to head (82543e1).
⚠️ Report is 54 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #334      +/-   ##
==========================================
+ Coverage   47.57%   49.17%   +1.60%     
==========================================
  Files          30       30              
  Lines        2819     2861      +42     
==========================================
+ Hits         1341     1407      +66     
+ Misses       1338     1301      -37     
- Partials      140      153      +13     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 49.17% <100.00%> (+1.60%) ⬆️ |


@HarshitPal25

@hzxuzhonghu Hello sir, please check out my work.

Labels

kind/bug Something isn't working size/S


4 participants