gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() #148218
ambv wants to merge 6 commits into python:main
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, using the unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break() property accessors.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
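To make the shape of the algorithm concrete, here is a minimal sketch of a boundary scan that applies only a handful of the TR29 rules (GB3, GB4/GB5 without the Control class, GB9, and GB999). It is not the code in this PR: the `_gcb()` classifier is a hypothetical stand-in that knows only a few code points, and the function yields plain strings rather than Segment objects.

```python
def _gcb(ch):
    """Hypothetical, heavily reduced Grapheme_Cluster_Break classifier."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch == "\u200d":
        return "ZWJ"
    if "\u0300" <= ch <= "\u036f":  # Combining Diacritical Marks block
        return "Extend"
    return "Other"


def iter_clusters(text):
    """Yield substrings of text, splitting at a reduced set of TR29 boundaries."""
    start = 0
    for i in range(1, len(text)):
        prev, cur = _gcb(text[i - 1]), _gcb(text[i])
        if prev == "CR" and cur == "LF":
            continue  # GB3: never break between CR and LF
        if prev in ("CR", "LF") or cur in ("CR", "LF"):
            pass  # GB4/GB5: always break around line terminators
        elif cur in ("Extend", "ZWJ"):
            continue  # GB9: never break before Extend or ZWJ
        # GB999: otherwise, break everywhere
        yield text[start:i]
        start = i
    if start < len(text):
        yield text[start:]
```

For example, `list(iter_clusters("e\u0301a\r\nb"))` keeps the combining accent attached to the `e` and the CR LF pair together. The full algorithm additionally needs the Hangul, Prepend, SpacingMark, emoji ZWJ sequence, regional indicator, and Indic conjunct rules.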
Pull request overview
Adds a pure-Python implementation of Unicode TR29 extended grapheme cluster segmentation to mirror unicodedata.iter_graphemes(), and refactors the existing grapheme-break tests so both the C and Python implementations can share the same conformance suite.
Changes:
- Introduces Lib/_py_grapheme.py implementing TR29 Extended Grapheme Cluster segmentation using unicodedata property accessors.
- Refactors GraphemeBreakTest into a BaseGraphemeBreakTest mixin and adds PyGraphemeBreakTest to exercise the Python implementation.
- Shares the TR29 conformance test (GraphemeBreakTest.txt) across both implementations.
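The shared-mixin arrangement can be sketched as follows. This is an illustration of the pattern only, not the PR's test code: the two `split_*` functions are hypothetical stand-ins for the C and pure Python implementations, and the class names are made up for the example.

```python
import re
import unittest


def split_with_regex(text):
    """Hypothetical stand-in for one implementation (keeps CR LF together)."""
    return re.findall(r"\r\n|.", text, flags=re.DOTALL)


def split_with_loop(text):
    """Hypothetical stand-in for a second, independent implementation."""
    out = []
    i = 0
    while i < len(text):
        step = 2 if text[i:i + 2] == "\r\n" else 1
        out.append(text[i:i + step])
        i += step
    return out


class BaseSplitTest:
    # Not a TestCase itself, so unittest never collects it directly.
    # Subclasses set this to the implementation under test.
    split = None

    def test_crlf_stays_together(self):
        self.assertEqual(self.split("a\r\nb"), ["a", "\r\n", "b"])

    def test_empty(self):
        self.assertEqual(self.split(""), [])


class RegexSplitTest(BaseSplitTest, unittest.TestCase):
    split = staticmethod(split_with_regex)


class LoopSplitTest(BaseSplitTest, unittest.TestCase):
    split = staticmethod(split_with_loop)
```

Wrapping the function in `staticmethod` keeps it from being bound as a method when accessed through `self`. Every test written once on the base class then runs against each implementation.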
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Lib/_py_grapheme.py | New pure-Python TR29 grapheme cluster iterator returning Segment objects. |
| Lib/test/test_unicodedata.py | Test refactor into a shared base mixin + new test class targeting _py_grapheme.iter_graphemes. |
Add makegraphemedata() to Tools/unicode/makeunicodedata.py that generates Lib/_py_grapheme_db.py from the Unicode data files (GraphemeBreakProperty.txt, emoji-data.txt, DerivedCoreProperties.txt).

_py_grapheme.py now imports property tables from _py_grapheme_db and uses bisect for lookups instead of calling unicodedata functions added in 3.15. This makes the module usable on Python 3.13 and 3.14 by regenerating the tables for the appropriate Unicode version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
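A bisect-based property lookup of the kind described here might look roughly like the sketch below. The table layout (parallel tuples of range starts, range ends, and values) and all names are assumptions for illustration; the actual format of the generated _py_grapheme_db.py is not shown in this conversation, and the table here is a tiny hand-written excerpt.

```python
from bisect import bisect_right

# Hypothetical excerpt of a generated table: sorted, non-overlapping
# inclusive code point ranges with their Grapheme_Cluster_Break value.
_STARTS = (0x000A, 0x000D, 0x0300, 0x200D)
_ENDS = (0x000A, 0x000D, 0x036F, 0x200D)
_VALUES = ("LF", "CR", "Extend", "ZWJ")


def grapheme_cluster_break(ch, default="Other"):
    """Return the table value for ch, or default if no range covers it."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if there is none).
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= _ENDS[i]:
        return _VALUES[i]
    return default
```

Each lookup costs O(log n) in the number of ranges. Because this excerpt covers only four ranges, any other code point falls through to the default, which will not match the real property value for characters such as controls.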
@StanFromIreland the exercise of seeing a pure Python reimplementation is interesting to me as part of a grander scheme: providing the entirety of unicodedata in a pure Python version. That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have a wider use for perf optimization in the future. In any case, I asked Serhiy on gh-142529 to decide.
The traceback issue gh-130273 has been fixed in the main branch using Adding the pure Python
Does the JIT already support calling a different implementation of a function? Here two modules have to be imported to call a single function ( I'm not convinced by the JIT argument. IMO the I suggest to not backport the traceback fix to 3.13 and 3.14 stable branches. And I don't think that it's worth it to add a pure Python implementation of
The argument isn't about Me exercising the re-implementation of a subset of Now the I agree that we usually keep backported changes small, but it's not unheard of to backport several hundred lines of code. Here, I'd argue that I hear your argument, let's see what Serhiy's got to say.
malemburg left a comment
The new test cases look fine, but I don't see much point in adding a pure Python version of the huge unicodedata database to Python, so -1 on those parts.
If people want to use such a pure Python implementation, they should download a package from PyPI which provides this.
New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, without relying on unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break() that were also added in Python 3.15.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.