fix: evict expired entries in TokenCache.Get() to prevent memory leak #334
HarshitPal25 wants to merge 1 commit into
Conversation
`TokenCache.Get()` was returning `(false, false, "")` for expired entries without actually removing them from the cache, causing stale entries to accumulate until LRU eviction pressure pushed them out. Under high-cardinality token usage (e.g., short-lived K8s service account tokens), this could fill the cache with dead entries and starve valid tokens. `ClientCache.Get()` in the same file correctly evicts expired entries, proving this was an oversight in TokenCache.

Changes:
- Upgrade `TokenCache.Get()` from RLock to Lock so it can mutate state.
- Evict expired entries inline (remove from LRU list and map), matching the pattern established by `ClientCache.Get()`.
- Promote accessed entries in the LRU list on `Get()`, ensuring consistent eviction ordering (previously only `Set()` promoted).
- Add a missing assertion in `TestTokenCache_Get_Expired` verifying that expired entries are actually removed (`cache.Size() == 0`), not just hidden by the return value.
- Update `TestTokenCache_LRUBehavior` to reflect that `Get()` now promotes entries, so eviction order changes correctly.

Signed-off-by: HarshitPal25 <harshit13082006@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Pull request overview
Fixes a memory leak in TokenCache.Get() where expired entries remained in the cache map and LRU list, mirroring the eviction behavior already present in ClientCache.Get(). Also adds LRU promotion on Get() for consistent ordering.
Changes:
- Upgrade `TokenCache.Get()` from `RLock` to `Lock` and evict expired entries inline (remove from both LRU list and map).
- Promote accessed entries to the front of the LRU list on `Get()`.
- Update tests: assert size goes to 0 after expiry; adjust LRU behavior test to reflect promotion semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/workloadmanager/client_cache.go | Replaces read lock with write lock in TokenCache.Get, evicts expired entries, and promotes entries on access. |
| pkg/workloadmanager/client_cache_test.go | Adds size assertion to expiry test and updates LRU behavior test to reflect Get-as-LRU-promotion. |
Code Review
This pull request updates the TokenCache.Get method to evict expired entries and maintain proper LRU ordering by promoting accessed items. To support these modifications, the locking mechanism was changed from a read-lock to a write-lock. Corresponding tests were updated to verify cache eviction and the new LRU behavior. Feedback suggests also updating the lastAccess timestamp during retrieval to implement a sliding TTL, which would prevent frequently used tokens from expiring prematurely.
Quoted lines from `client_cache.go`:

```go
// Move to front on access for proper LRU ordering
c.lruList.MoveToFront(entry.element)
```
While the entry is correctly promoted in the LRU list, entry.lastAccess is not updated. This results in a fixed TTL from the time the entry was first added or updated via Set(), rather than a sliding TTL from the last time it was accessed. Given the field name lastAccess and the LRU nature of the cache, it is more idiomatic to reset the expiration timer on each Get() call. This ensures that frequently used tokens remain in the cache as long as they are active, reducing unnecessary re-validation calls to the Kubernetes TokenReview API.
The comment attaches a suggested change:

```diff
-// Move to front on access for proper LRU ordering
-c.lruList.MoveToFront(entry.element)
+// Move to front on access for proper LRU ordering and update access time for sliding TTL
+entry.lastAccess = time.Now()
+c.lruList.MoveToFront(entry.element)
```
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #334      +/-   ##
==========================================
+ Coverage   47.57%   49.17%   +1.60%
==========================================
  Files          30       30
  Lines        2819     2861      +42
==========================================
+ Hits         1341     1407      +66
+ Misses       1338     1301      -37
- Partials      140      153      +13
```

☔ View full report in Codecov by Sentry.
@hzxuzhonghu hello sir, please check out my work
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a memory leak in `TokenCache.Get()` where expired entries are never evicted from the cache.

Previously, when `TokenCache.Get()` encountered an expired entry, it returned `(false, false, "")` but left the stale entry in the cache map and LRU list. Under high-cardinality token usage (e.g., short-lived K8s service account tokens rotated frequently), dead entries accumulate until LRU eviction pressure pushes them out, starving valid tokens from being cached. `ClientCache.Get()` in the same file (client_cache.go:110-114) correctly evicts expired entries inline, proving this was an oversight in TokenCache.

The root cause is that `TokenCache.Get()` used `RLock()` (read lock), which prevented it from mutating the cache to remove stale entries. `ClientCache.Get()` uses a full `Lock()` and removes expired entries correctly.

This PR:
- Upgrades `TokenCache.Get()` from `RLock` to `Lock` so it can mutate state on expiry.
- Evicts expired entries inline, matching the pattern established by `ClientCache.Get()`.
- Promotes accessed entries in the LRU list on `Get()` for consistent eviction ordering.
- Adds a missing assertion in `TestTokenCache_Get_Expired` verifying that expired entries are actually removed (`cache.Size() == 0`), not just hidden by the return value.
- Updates `TestTokenCache_LRUBehavior` to reflect that `Get()` now promotes entries, so eviction order changes correctly.

Which issue(s) this PR fixes:
None (discovered during code review)
Special notes for your reviewer:
The lock upgrade from `RLock` → `Lock` in `Get()` trades a small amount of read concurrency for correctness. Under production workloads, `Get()` calls are short-lived (map lookup + time check), so the contention impact is negligible. This matches how `ClientCache.Get()` already operates with a full `Lock()`.

All existing tests pass with `go test -race`.

Does this PR introduce a user-facing change?: