fix: evict expired entries in TokenCache.Get() to prevent memory leak by HarshitPal25 · Pull Request #334 · volcano-sh/agentcube

HarshitPal25 · 2026-05-13T19:59:49Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a memory leak in TokenCache.Get(): expired entries were detected on lookup but never evicted from the cache.

Previously, when TokenCache.Get() encountered an expired entry, it returned (false, false, "") but left the stale entry in both the cache map and the LRU list. Under high-cardinality token usage (e.g., short-lived K8s service account tokens that are rotated frequently), these dead entries accumulate until LRU eviction pressure pushes them out, occupying capacity that valid tokens could otherwise use.

ClientCache.Get() in the same file (client_cache.go:110-114) already evicts expired entries inline, which indicates this was an oversight in TokenCache.

The root cause is that TokenCache.Get() used RLock() (read lock), which prevented it from mutating the cache to remove stale entries. ClientCache.Get() uses a full Lock() and removes expired entries correctly.

This PR:

- Upgrades TokenCache.Get() from RLock → Lock so it can mutate state on expiry.
- Evicts expired entries inline (removing them from both the LRU list and the map), matching the pattern established by ClientCache.Get().
- Promotes accessed entries in the LRU list on Get() for consistent eviction ordering.
- Adds a missing assertion in TestTokenCache_Get_Expired verifying that expired entries are actually removed (cache.Size() == 0), not just hidden by the return value.
- Updates TestTokenCache_LRUBehavior so its expected eviction order reflects that Get() now promotes entries.

Which issue(s) this PR fixes:

None (discovered during code review)

Special notes for your reviewer:

The lock upgrade from RLock → Lock in Get() trades a small amount of read concurrency for correctness. Under production workloads, Get() calls are short-lived (map lookup + time check), so the contention impact is negligible. This matches how ClientCache.Get() already operates with a full Lock().

All existing tests pass with go test -race.

Does this PR introduce a user-facing change?:

NONE

TokenCache.Get() was returning (false, false, "") for expired entries without actually removing them from the cache, causing stale entries to accumulate until LRU eviction pressure pushed them out. Under high-cardinality token usage (e.g., short-lived K8s service account tokens), this could fill the cache with dead entries and starve valid tokens. ClientCache.Get() in the same file correctly evicts expired entries, proving this was an oversight in TokenCache.

Changes:
- Upgrade TokenCache.Get() from RLock to Lock so it can mutate state.
- Evict expired entries inline (remove from LRU list and map), matching the pattern established by ClientCache.Get().
- Promote accessed entries in the LRU list on Get(), ensuring consistent eviction ordering (previously only Set() promoted).
- Add missing assertion in TestTokenCache_Get_Expired verifying that expired entries are actually removed (cache.Size() == 0), not just hidden by the return value.
- Update TestTokenCache_LRUBehavior to reflect that Get() now promotes entries, so eviction order changes correctly.

Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>

volcano-sh-bot · 2026-05-13T19:59:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

Needs approval from an approver in each of these files:

pkg/workloadmanager/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Fixes a memory leak in TokenCache.Get() where expired entries remained in the cache map and LRU list; the fix mirrors the eviction behavior already present in ClientCache.Get(). Also adds LRU promotion on Get() for consistent ordering.

Changes:

- Upgrade TokenCache.Get() from RLock to Lock and evict expired entries inline (remove from both LRU list and map).
- Promote accessed entries to the front of the LRU list on Get().
- Update tests: assert size goes to 0 after expiry; adjust LRU behavior test to reflect promotion semantics.
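The LRU test adjustment can be illustrated with a stripped-down LRU that has no TTL and can be switched between promoting and non-promoting reads. The `lru` type, the `survivors` helper, and the set a / set b / get a / set c sequence are hypothetical, chosen to show why the expected evicted key flips once `Get()` promotes; they are not taken from `TestTokenCache_LRUBehavior`.

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a hypothetical, TTL-free LRU used only to show how promoting on
// get changes which key is evicted; it is not the agentcube TokenCache.
type lru struct {
	capacity int
	items    map[string]*list.Element
	order    *list.List // front = most recently used
	promote  bool       // whether get moves the entry to the front
}

func newLRU(capacity int, promote bool) *lru {
	return &lru{capacity: capacity, items: make(map[string]*list.Element), order: list.New(), promote: promote}
}

func (l *lru) get(k string) bool {
	e, ok := l.items[k]
	if ok && l.promote {
		l.order.MoveToFront(e)
	}
	return ok
}

func (l *lru) set(k string) {
	if e, ok := l.items[k]; ok {
		l.order.MoveToFront(e)
		return
	}
	if l.order.Len() >= l.capacity {
		// Evict the least recently used key at the back of the list.
		oldest := l.order.Remove(l.order.Back()).(string)
		delete(l.items, oldest)
	}
	l.items[k] = l.order.PushFront(k)
}

// survivors runs set a, set b, get a, set c on a 2-slot cache and reports
// whether a and b are still present afterwards.
func survivors(promote bool) (aKept, bKept bool) {
	l := newLRU(2, promote)
	l.set("a")
	l.set("b")
	l.get("a")
	l.set("c") // cache is full, so the back of the list is evicted
	_, aKept = l.items["a"]
	_, bKept = l.items["b"]
	return aKept, bKept
}

func main() {
	fmt.Println(survivors(false)) // false true: without promotion "a" is evicted
	fmt.Println(survivors(true))  // true false: with promotion "b" is evicted
}
```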

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
pkg/workloadmanager/client_cache.go	Replaces read lock with write lock in `TokenCache.Get`, evicts expired entries, and promotes entries on access.
pkg/workloadmanager/client_cache_test.go	Adds size assertion to expiry test and updates LRU behavior test to reflect Get-as-LRU-promotion.

gemini-code-assist

Code Review

This pull request updates the TokenCache.Get method to evict expired entries and maintain proper LRU ordering by promoting accessed items. To support these modifications, the locking mechanism was changed from a read-lock to a write-lock. Corresponding tests were updated to verify cache eviction and the new LRU behavior. Feedback suggests also updating the lastAccess timestamp during retrieval to implement a sliding TTL, which would prevent frequently used tokens from expiring prematurely.

gemini-code-assist · 2026-05-13T20:01:40Z

+	// Move to front on access for proper LRU ordering
+	c.lruList.MoveToFront(entry.element)


While the entry is correctly promoted in the LRU list, entry.lastAccess is not updated. This results in a fixed TTL from the time the entry was first added or updated via Set(), rather than a sliding TTL from the last time it was accessed. Given the field name lastAccess and the LRU nature of the cache, it is more idiomatic to reset the expiration timer on each Get() call. This ensures that frequently used tokens remain in the cache as long as they are active, reducing unnecessary re-validation calls to the Kubernetes TokenReview API.

Suggested change

-	// Move to front on access for proper LRU ordering
-	c.lruList.MoveToFront(entry.element)
+	// Move to front on access for proper LRU ordering and update access time for sliding TTL
+	entry.lastAccess = time.Now()
+	c.lruList.MoveToFront(entry.element)
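To show the difference between the current fixed TTL and the suggested sliding TTL, here is a minimal sketch with an injectable clock. The `ttlCache` type, the `sliding` flag, and the 40-second access pattern are hypothetical; the LRU list is left out so that only the effect of resetting `lastAccess` in `Get()` is visible.

```go
package main

import (
	"fmt"
	"time"
)

// entry and ttlCache are hypothetical; the LRU list is omitted so only the
// effect of refreshing lastAccess on get is visible.
type entry struct {
	lastAccess time.Time
}

type ttlCache struct {
	ttl     time.Duration
	sliding bool             // true: get resets lastAccess (the suggested change)
	now     func() time.Time // injectable clock (assumption)
	items   map[string]*entry
}

func (c *ttlCache) set(k string) { c.items[k] = &entry{lastAccess: c.now()} }

func (c *ttlCache) get(k string) bool {
	e, ok := c.items[k]
	if !ok {
		return false
	}
	if c.now().Sub(e.lastAccess) > c.ttl {
		delete(c.items, k) // evict on expiry, as the PR does
		return false
	}
	if c.sliding {
		e.lastAccess = c.now()
	}
	return true
}

// hitAfterActiveUse caches a token with a 1-minute TTL, reads it at 40s,
// 80s and 120s, and reports whether the last read was a hit.
func hitAfterActiveUse(sliding bool) bool {
	clock := time.Unix(0, 0)
	c := &ttlCache{
		ttl:     time.Minute,
		sliding: sliding,
		now:     func() time.Time { return clock },
		items:   map[string]*entry{},
	}
	c.set("token")
	hit := false
	for i := 0; i < 3; i++ {
		clock = clock.Add(40 * time.Second)
		hit = c.get("token")
	}
	return hit
}

func main() {
	fmt.Println(hitAfterActiveUse(false)) // false: fixed TTL expires after 60s despite use
	fmt.Println(hitAfterActiveUse(true))  // true: each read pushes expiry out again
}
```

One trade-off worth weighing, not raised in the review: with a sliding TTL, a token that keeps being presented is never re-validated, so if it were revoked upstream, the cached result could keep being served for as long as traffic continues. A fixed TTL bounds that window to `ttl`.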

codecov-commenter · 2026-05-13T20:05:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.17%. Comparing base (524e55e) to head (82543e1).
⚠️ Report is 54 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #334      +/-   ##
==========================================
+ Coverage   47.57%   49.17%   +1.60%     
==========================================
  Files          30       30              
  Lines        2819     2861      +42     
==========================================
+ Hits         1341     1407      +66     
+ Misses       1338     1301      -37     
- Partials      140      153      +13
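The percentages in the diff can be reproduced from its own counts, assuming Codecov's usual definition (partially covered lines count as not covered) and its default of rounding down to two decimal places; both are assumptions about this project's Codecov configuration:

$$
\begin{aligned}
\text{coverage} &= \frac{\text{hits}}{\text{hits} + \text{misses} + \text{partials}} \\
\text{head (82543e1)} &= \frac{1407}{1407 + 1301 + 153} = \frac{1407}{2861} \approx 0.491786 \;\Rightarrow\; 49.17\% \\
\text{base (524e55e)} &= \frac{1341}{1341 + 1338 + 140} = \frac{1341}{2819} \approx 0.475701 \;\Rightarrow\; 47.57\% \\
\Delta &= 49.17\% - 47.57\% = +1.60\%
\end{aligned}
$$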

Flag	Coverage Δ
unittests	`49.17% <100.00%> (+1.60%)`	⬆️

Flags with carried forward coverage won't be shown.

HarshitPal25 · 2026-05-15T12:13:01Z

@hzxuzhonghu Hello sir, please check out my work.

Copilot AI review requested due to automatic review settings May 13, 2026 19:59

volcano-sh-bot added the kind/bug Something isn't working label May 13, 2026

volcano-sh-bot requested review from LiZhenCheng9527 and acsoto May 13, 2026 19:59

volcano-sh-bot added the size/S label May 13, 2026

Copilot started reviewing on behalf of HarshitPal25 May 13, 2026 20:00

Copilot AI reviewed May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

HarshitPal25 wants to merge 1 commit into volcano-sh:main from HarshitPal25:fix/token-cache-stale-entry-leak

4 participants
