
fix: centralize UTF-8 file I/O for Windows compatibility #45

Draft

joshbouncesecurity wants to merge 2 commits into knostic:master from joshbouncesecurity:fix/issue16-09-utf8-io

Conversation

@joshbouncesecurity
Contributor

Summary

Bare open() calls use the system encoding (cp1252 on Windows), causing "'charmap' codec can't decode" errors on any target codebase containing non-ASCII characters (e.g. curly quotes U+2019, accented characters, CJK).

Adds utilities/file_io.py with centralized UTF-8 helpers (open_utf8, read_json, write_json, run_utf8) and migrates ~190 bare open() call sites across libs/openant-core/ (core, parsers, utilities, openant CLI, top-level scripts) to specify encoding="utf-8" explicitly. Also sets encoding/errors on the docker_executor subprocess.run that captures container stdout/stderr as text.

A regression test scans non-test Python files for any bare open() call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16 (does not close the issue).
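The helper module described above might look roughly like the following. This is a sketch, not the actual utilities/file_io.py; the function names come from the PR, but the bodies are assumed from the behavior described (UTF-8 pinned for text modes, encoding/errors injected into subprocess.run only when the caller asks for text output).

```python
import json
import subprocess


def open_utf8(path, mode="r", **kwargs):
    """open() wrapper that pins UTF-8 for text modes; binary modes pass through."""
    if "b" in mode:
        return open(path, mode, **kwargs)  # encoding= is invalid in binary mode
    return open(path, mode, encoding="utf-8", **kwargs)


def read_json(path):
    with open_utf8(path) as f:
        return json.load(f)


def write_json(path, data, **kwargs):
    with open_utf8(path, "w") as f:
        json.dump(data, f, **kwargs)


def run_utf8(*args, **kwargs):
    """subprocess.run wrapper: inject encoding/errors only in text mode,
    and only when the caller hasn't passed an explicit encoding=."""
    text_mode = kwargs.get("text") or kwargs.get("universal_newlines")
    if text_mode:
        kwargs.setdefault("encoding", "utf-8")
        kwargs.setdefault("errors", "replace")
    return subprocess.run(*args, **kwargs)
```

Call sites then swap `open(p)` for `open_utf8(p)` (or add `encoding="utf-8"` directly), so the system locale never decides how text is decoded.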

Test plan

  • Helper unit tests: non-ASCII round-trip via open_utf8, read_json, write_json.
  • run_utf8 captures non-ASCII subprocess output (incl. universal_newlines=True alias).
  • run_utf8 only injects encoding/errors when caller asks for text mode; explicit encoding= is respected.
  • Regression: no bare open( in non-test code (test scans libs/openant-core/ excluding tests/ and the helper).
  • Existing tests pass: pytest tests/ -> 96 passed, 22 skipped (env-dependent).
  • Manual: run a parse against a repo with non-ASCII content on a cp1252 console -- no charmap errors.
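The regression scan in the test plan can be sketched with an AST walk: flag any open() call that neither opens in binary mode nor passes an encoding= keyword. This is an assumed shape, not the project's actual tests/test_file_io.py.

```python
import ast


def bare_open_calls(source: str, filename: str = "<src>") -> list[int]:
    """Return line numbers of open() calls lacking an encoding= kwarg.

    Binary-mode calls like open(p, "rb") are skipped, since encoding=
    is invalid there.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "open"):
            mode = node.args[1] if len(node.args) > 1 else None
            if isinstance(mode, ast.Constant) and "b" in str(mode.value):
                continue  # binary mode: no encoding needed
            if "encoding" not in {kw.arg for kw in node.keywords}:
                hits.append(node.lineno)
    return hits
```

A test would then iterate over the non-test .py files and assert the list is empty for each.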

joshbouncesecurity and others added 2 commits May 4, 2026 21:24
Bare open() calls use the system encoding (cp1252 on Windows), causing
'charmap codec can't decode byte ...' errors when parsing repositories
containing non-ASCII characters such as curly quotes.

Adds utilities/file_io.py with open_utf8, read_json, write_json, and
run_utf8 helpers, and migrates ~190 bare open() call sites across
libs/openant-core/ (core, parsers, utilities, openant CLI, top-level
scripts) to specify encoding="utf-8" explicitly. Also sets
encoding/errors on the docker_executor subprocess.run that captures
container stdout/stderr as text.

Includes a regression test that scans non-test code for any bare open()
call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16.
… UTF-8

Round 1 review fixes for PR knostic#45:

- application_context.py, ast_parser.py, dataset_enhancer.py, report/__main__.py,
  report/generator.py: pass encoding='utf-8' on every Path.read_text() /
  write_text() call. The previous migration only covered open() calls; pathlib's
  text helpers also default to the system locale on Windows (cp1252) and crash
  on non-ASCII source code.
- parsers/{c,go,javascript,php,ruby}/test_pipeline.py: pass
  encoding='utf-8', errors='replace' on subprocess.run(text=True) invocations
  of parser binaries and CodeQL. Only docker_executor.py was migrated before;
  these other call sites had the same Windows cp1252 hazard.
- tests/test_file_io.py: extend regression scan with two new asserts —
  Path.read_text/write_text without encoding=, and subprocess.run(text=True)
  without encoding=. Refactored the call-walking logic into a shared helper.

All 14 file_io tests pass; full tests/ suite: 98 passed, 22 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
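The two round-1 hazards can be illustrated in a few lines (this is illustrative code, not the project's): pathlib's text helpers and subprocess.run(text=True) both fall back to the locale encoding, so UTF-8 has to be pinned explicitly at every call site.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp()) / "snippet.py"
# Path.write_text/read_text default to the locale encoding (cp1252 on
# Windows); passing encoding= makes them deterministic cross-platform.
tmp.write_text('s = "héllo \u2019"\n', encoding="utf-8")
src = tmp.read_text(encoding="utf-8")

# Text-mode subprocess capture: pin the decoder and substitute undecodable
# bytes instead of raising UnicodeDecodeError mid-scan.
proc = subprocess.run(
    [sys.executable, "-c", "print('日本語')"],
    capture_output=True, text=True, encoding="utf-8", errors="replace",
    env={**os.environ, "PYTHONIOENCODING": "utf-8"},
)
```

errors="replace" trades silent substitution characters for a hard crash, which is the right call for best-effort capture of parser and CodeQL output.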
@joshbouncesecurity
Contributor Author

Manual verification

  • Windows cp1252 console: scan a target repo containing non-ASCII content (e.g., Japanese strings, accented chars in comments). Confirm no UnicodeDecodeError / charmap codec can't decode errors.
  • Linux/macOS: same scan succeeds (no regression).
  • Subprocess paths: docker_executor capturing non-ASCII output: no decode crash.
  • Path-method I/O: any Path.read_text() / write_text() calls covered by the round-2 follow-up still work cross-platform.
  • git grep 'open(' libs/openant-core --include='*.py' (excluding tests + the helper module): no bare open( calls remain. The new regression test enforces this in CI.
  • JSON round-trip: utilities/file_io.py write_json then read_json of a dict with {"name": "héllo 日本語"} returns the same dict (not escaped sequences).

@joshbouncesecurity
Contributor Author

Local test results

Verified the migration on Windows by static-grepping the worktree and round-tripping non-ASCII content through utilities.file_io.

Commands run:

# 1. No bare open() in non-test code
git grep -nE "open\([^)]*\)" -- 'libs/openant-core/**/*.py' \
    ':!libs/openant-core/tests/**' ':!libs/openant-core/utilities/file_io.py' \
    | grep -v "encoding=" | grep -v "open_utf8"
# Two hits, both legitimate:
#   parsers/zig/function_extractor.py:43:  with open(full_path, "rb") as f:   <- binary mode, no encoding needed
#   utilities/agentic_enhancer/prompts.py:68:  - Files: open() with external paths   <- text inside a docstring/prompt

# 2. Round-trip via utilities.file_io
python check45.py

Output:

OK default round-trip preserves non-ASCII via \uXXXX escapes
OK ensure_ascii=False writes raw UTF-8 bytes; read_json still round-trips
OK open_utf8 round-trips curly quote on this Windows shell

Outcome:

  • No bare open() in non-test code (only legitimate "rb" and a docstring fragment) ✅
  • read_json / write_json round-trip a dict with {"name": "héllo 日本語", "list": [..., "’"]} byte-identically ✅
  • write_json(..., ensure_ascii=False) writes raw UTF-8 bytes (verified e6 97 a5 e6 9c ac e8 aa 9e for the JP run); read_json still parses it back to the same dict ✅
  • open_utf8 round-trips a U+2019 curly quote on this Windows shell (default cp1252 console) ✅
  • Did not run a full openant scan against a non-ASCII codebase end-to-end — would have required an LLM call. The above confirms the helper works on Windows; the migration's call-site coverage is enforced by the new regression test.
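The ensure_ascii behavior verified above comes straight from the stdlib json module (write_json is assumed to forward keyword args to json.dump), and can be reproduced directly:

```python
import json

data = {"name": "héllo 日本語"}
escaped = json.dumps(data)                  # default: non-ASCII as \uXXXX escapes
raw = json.dumps(data, ensure_ascii=False)  # raw characters; UTF-8 once encoded

assert "\\u" in escaped and "日" not in escaped
assert "日本語" in raw
# both forms parse back to the identical dict
assert json.loads(escaped) == data == json.loads(raw)
```

The "e6 97 a5 e6 9c ac e8 aa 9e" bytes reported in the outcome are exactly the UTF-8 encoding of 日本語, which is what lands on disk when the ensure_ascii=False output is written through a UTF-8 text handle.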
