[Question] Best practices for token-efficient incremental code modifications via SDK sessions? #1024
Description
Use Case
Building a web-based code generation platform using the Copilot SDK (Python, v0.2.0). Users create and iteratively modify single-file projects (500–5,000+ lines of HTML/CSS/JS). A typical session involves 5–15 modification requests like "make the character bigger" or "change the background color" on an existing project.
Current Approach (Pseudocode)
For each modification request:
1. Reset session (disconnect + create fresh session) ← clears history
2. Build prompt:
- System message (~1K tokens, same every time)
- Full current project code (2K–30K tokens)
- User's modification request (~50 tokens)
3. Send prompt via session.send()
4. Parse response for line-based patches (REPLACE_LINES / INSERT_AFTER / DELETE_LINES)
5. If patches fail → retry with "return full updated code" (~doubles tokens)
6. Apply patches/extract code → validate → finalize
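The patch step (4–6) can be sketched in plain Python. The exact encoding of a patch isn't shown above, so the dict shape here (`op`, `start`, `end`, `content`, 1-based line numbers) is an assumption to adapt to your parser:

```python
def apply_patches(lines, patches):
    """Apply line-based patches to a file given as a list of lines.

    Line numbers are 1-based and refer to the ORIGINAL file, so patches
    are applied bottom-up to keep earlier edits from shifting later ones.
    (The patch dict shape is an assumption; adapt to your own parser.)
    """
    for p in sorted(patches, key=lambda p: p["start"], reverse=True):
        if p["op"] == "REPLACE_LINES":
            lines[p["start"] - 1 : p["end"]] = p["content"]
        elif p["op"] == "INSERT_AFTER":
            # Insert new lines immediately after line `start`.
            lines[p["start"] : p["start"]] = p["content"]
        elif p["op"] == "DELETE_LINES":
            del lines[p["start"] - 1 : p["end"]]
        else:
            raise ValueError(f"unknown patch op: {p['op']}")
    return lines
```

Applying bottom-up avoids having to renumber patches after each edit, which is also why patches that reference stale line numbers (step 5) are the usual failure mode.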
Optional thinking pre-pass: For complex requests, a separate model call analyzes the code first (GPT-5.2 with reasoning), then the plan is prepended to the main prompt → effectively 2× input tokens.
The Problem: Token Waste
| Scenario | Est. Input Tokens | Est. Output Tokens | Notes |
|---|---|---|---|
| Small project modification | ~4K | ~2K | 500-line project |
| Medium project modification | ~12K | ~8K | 2000-line project |
| Large project modification | ~35K | ~30K | 5000+ lines |
| + Thinking pre-pass | 2× input | same | Two full model calls |
| + Patch failure retry | 3× total | 2× output | Re-sends full code asking for complete output |
Key observations:
- `cache_read_tokens` is consistently 0 in our `session.usage` events, even though we track it
- Every modification resets the session → no context reuse between turns
- For a "change the button color" request on a 3000-line project, we're sending ~15K tokens of unchanged code
- With 10 modifications per session, that's ~150K+ input tokens for what should be incremental edits
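The ~150K figure follows directly from the per-turn arithmetic (token counts are the rough estimates from the table above, not measurements):

```python
def session_input_tokens(project_tokens, system_tokens=1_000,
                         request_tokens=50, modifications=10):
    """Estimate total input tokens when the full project is re-sent
    on every modification (no caching, no incremental context)."""
    per_turn = system_tokens + project_tokens + request_tokens
    return per_turn * modifications

# A ~3000-line project is roughly 15K tokens of code:
total = session_input_tokens(project_tokens=15_000)  # 160,500 for 10 turns
```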
Specific Questions
1. Prompt Caching
I noticed from other issues (e.g. #1005 session logs) that `cacheReadTokens` can be non-zero. Our setup always shows 0.
- Is prompt caching automatic when the system message + prefix stays the same?
- Does resetting the session (disconnect + create_session) break cache eligibility?
- Would keeping the session alive across turns (instead of resetting) enable caching of the static system message portion?
2. Session Continuity vs. Reset
We currently reset the session before each modification to avoid accumulating conversation history (since we always embed the full current code in the prompt anyway). But this may be preventing prompt caching.
Trade-off question: Is it better to:
- (A) Keep the session alive, let history accumulate, and rely on compaction (ref Expose used token information after compaction #1012) when context grows too large?
- (B) Reset each time but find a way to enable prompt caching?
- (C) Some hybrid — e.g., keep session alive for N turns, then reset?
What's the recommended pattern for repeated modifications to the same large context?
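Option (C) can be isolated from the SDK entirely by injecting whatever creates a session. This is a sketch under assumptions: `make_session` and `disconnect()` stand in for your actual client calls, which may differ:

```python
class HybridSessionManager:
    """Option (C): keep a session alive for N turns, then reset.

    `make_session` is whatever creates a fresh SDK session (hypothetical
    placeholder here -- wire it to your actual client). Keeping the
    session alive between resets is what would preserve any prefix-cache
    eligibility for the static system message.
    """

    def __init__(self, make_session, max_turns=5):
        self.make_session = make_session
        self.max_turns = max_turns
        self.session = None
        self.turns = 0

    def get_session(self):
        # Reset only when the turn budget is exhausted, not every request.
        if self.session is None or self.turns >= self.max_turns:
            if self.session is not None:
                self.session.disconnect()
            self.session = self.make_session()
            self.turns = 0
        self.turns += 1
        return self.session
```

The same shape also makes it easy to trigger an early reset when tracked context size (rather than turn count) crosses a threshold.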
3. Reducing Output Tokens
The model frequently ignores patch/diff instructions and returns the entire file instead of targeted changes. This wastes output tokens proportional to project size.
- Are there SDK-level mechanisms to constrain output format?
- Has anyone found reliable prompting strategies that consistently produce diffs rather than full rewrites?
- Would `reasoning_effort: "low"` help for simple modifications while keeping output focused?
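One defensive pattern that avoids the blind "retry with full code" path: classify the reply before deciding. The marker names and the 80% size heuristic below are assumptions, not SDK behavior:

```python
import re

# Patch ops our prompt asks for (adjust to your actual format).
PATCH_MARKERS = re.compile(
    r"^(REPLACE_LINES|INSERT_AFTER|DELETE_LINES)\b", re.MULTILINE
)

def classify_response(reply: str, project_lines: int) -> str:
    """Decide how to treat a model reply: 'patch', 'full', or 'retry'.

    Heuristic sketch: if patch markers are present, apply them; if the
    reply is roughly as long as the project, treat it as a full rewrite
    instead of burning tokens on another retry; otherwise retry with
    stricter format instructions.
    """
    if PATCH_MARKERS.search(reply):
        return "patch"
    # A reply near the project's size is probably a full-file rewrite.
    if reply.count("\n") + 1 >= 0.8 * project_lines:
        return "full"
    return "retry"
```

Accepting an unrequested full rewrite when it arrives is often cheaper than a retry, since the output tokens are already spent.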
4. Thinking Pre-Pass Overhead
For complex requests, we run a separate "thinking" model call (GPT-5.2 with reasoning) to produce a plan, then feed that plan + full code to the main model. This doubles the input token cost.
- With `reasoning_effort` on the main model, is a separate thinking pre-pass still justified?
- Any patterns for "think-then-act" that avoid sending the full context twice?
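One way to avoid paying for the full file twice is to give the thinking pass only a structural skeleton (signatures, selectors, section tags) with line numbers, so the plan can reference locations without the content. This is a rough sketch; the regex is a stand-in tuned for typical HTML/CSS/JS and would need hardening:

```python
import re

# Lines that carry structure: JS functions/classes/consts, CSS rule
# openers, and common HTML section tags. (Illustrative, not exhaustive.)
STRUCTURE = re.compile(
    r"^\s*(function\s+\w+|class\s+\w+|const\s+\w+\s*=|"
    r"[.#]?[\w-]+\s*\{|</?(head|body|script|style|section)[^>]*>)"
)

def skeleton(code: str) -> str:
    """Reduce a file to structural lines plus 1-based line numbers,
    so a planning pre-pass can cite locations without full content."""
    out = []
    for i, line in enumerate(code.splitlines(), start=1):
        if STRUCTURE.match(line):
            out.append(f"L{i}: {line.strip()}")
    return "\n".join(out)
```

The thinking pass then sees a few hundred tokens instead of the full 15K+, and its plan ("edit the `jump` function around L120") still anchors the main call precisely.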
5. Large File Strategies
For projects >5000 lines, we currently force a full rewrite (no patches).
- Are there recommended patterns for chunked/windowed modifications — e.g., only sending the relevant portion of the file + surrounding context?
- Does the SDK's file handling (edit tool) use any internal optimization we could leverage instead of manual prompt construction?
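A minimal sketch of the chunked approach, purely prompt-side (no SDK assumptions): send only a window of lines around the region of interest, then remap the returned window-relative patch line numbers back to full-file coordinates:

```python
def extract_window(lines, center, context=40):
    """Return (window_lines, offset) around a 1-based line of interest.

    `offset` maps window-relative patch line numbers back to the full
    file: full_line = window_line + offset.
    """
    start = max(center - context, 1)           # 1-based inclusive
    end = min(center + context, len(lines))
    return lines[start - 1:end], start - 1

def remap_patch(patch, offset):
    """Shift a window-relative patch into full-file coordinates."""
    p = dict(patch)
    p["start"] = patch["start"] + offset
    if "end" in patch:
        p["end"] = patch["end"] + offset
    return p
```

Locating `center` (e.g. by grepping for the selector or function the user's request mentions) is the hard part; when it can't be found, falling back to the full file is the safe default.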
Environment
- SDK: `github-copilot-sdk==0.2.0` (Python)
- Models: GPT-5.2 (thinking), Claude Sonnet 4 / Claude Opus (main generation)
- Session config: `streaming=True`, `available_tools=["ask_user"]`, session reset per modification
Related
- Expose used token information after compaction #1012 — Token info after compaction (relevant to question 2)
Would love to hear how others in the community handle similar "iterative code modification" workflows with the SDK. Any insights on which of these optimizations yield the biggest token savings would be greatly appreciated! 🙏