Conversation
📝 Walkthrough

Dutch language support is added to the normalization package.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client Code
    participant Operators as DutchOperators
    participant Normalizer as DutchNumberNormalizer
    participant Regex as Regex Engine
    Client->>Operators: expand_written_numbers(text)
    Operators->>Normalizer: __call__(text)
    Normalizer->>Normalizer: preprocess(text)
    Note over Normalizer: Clean spacing/ordinal patterns
    Normalizer->>Normalizer: tokenize by whitespace
    Note over Normalizer: Split into word tokens
    Normalizer->>Normalizer: process_words(tokens)
    Note over Normalizer: Parse Dutch number grammar<br/>Handle composition, ordinals,<br/>multipliers, currency, signs
    Normalizer->>Regex: postprocess(result)
    Note over Regex: Normalize euro-cent patterns<br/>Combine currency forms
    Normalizer-->>Operators: return normalized string
    Operators-->>Client: return result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (7)
tests/unit/languages/dutch_operators_test.py (2)
45-51: Consolidate repeated `get_word_replacements()` calls.

Each assertion calls `operators.get_word_replacements()` again. The function is cheap, but it's cleaner and faster to bind once:

Proposed change

```diff
 def test_word_replacements(operators):
-    assert operators.get_word_replacements()["ge"] == "je"
-    assert operators.get_word_replacements()["da"] == "dat"
-    assert operators.get_word_replacements()["u"] == "je"
-    assert operators.get_word_replacements()["uw"] == "je"
-    assert operators.get_word_replacements()["okee"] == "oke"
-    assert operators.get_word_replacements()["euro"] == "euros"
+    replacements = operators.get_word_replacements()
+    assert replacements["ge"] == "je"
+    assert replacements["da"] == "dat"
+    assert replacements["u"] == "je"
+    assert replacements["uw"] == "je"
+    assert replacements["okee"] == "oke"
+    assert replacements["euro"] == "euros"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` around lines 45 - 51, The test repeatedly calls operators.get_word_replacements(); change test_word_replacements to call operators.get_word_replacements() once, assign the result to a local variable (e.g., replacements) and use replacements["ge"], replacements["da"], replacements["u"], replacements["uw"], replacements["okee"], replacements["euro"] in the assertions to avoid repeated calls and improve clarity.
54-62: Consider adding a negative/edge test for the compound-number Dutch order.

The existing tests cover single-word amounts (`tien euro`) and numeric+symbol forms (`€10`, `$3.50`). Given the Dutch `ones + "en" + tens` composition is a non-trivial branch in `DutchNumberNormalizer.process_words` (the `pending_ones` state), please add at least one assertion exercising it end-to-end, e.g. `"vijfentwintig euro"` or `"een en twintig euro"` → `"25 euros"`. Without it, regressions in the compound-order logic won't be caught by this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add a negative/edge unit test exercising the Dutch compound-number branch by asserting that operators.expand_written_numbers correctly normalizes compound words like "vijfentwintig euro" (or the spaced variant "een en twintig euro") to "25 euros"; this targets the DutchNumberNormalizer.process_words logic and its pending_ones handling, so add an assertion in tests/unit/languages/dutch_operators_test.py alongside test_expand_written_numbers_euro_after_amount_dutch_order that calls operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros" (and optionally another for "een en twintig euro").

normalization/languages/dutch/operators.py (1)
108-111: Move the `DUTCH_REPLACEMENTS` import to module top-level.

There's no circular-import reason to defer this — `replacements.py` imports nothing from this module. Other language implementations import the replacements dict at module top. A local import here adds per-call overhead for no gain.

Proposed change

```diff
 from normalization.languages.dutch.number_normalizer import DutchNumberNormalizer
+from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
 from normalization.languages.dutch.sentence_replacements import (
     DUTCH_SENTENCE_REPLACEMENTS,
 )
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
         return DUTCH_REPLACEMENTS
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The get_word_replacements method currently performs a local import of DUTCH_REPLACEMENTS which adds per-call overhead; move the import of DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top of this module and update get_word_replacements (the function name) to simply return the top-level DUTCH_REPLACEMENTS constant so other language modules follow the same pattern and avoid repeated local imports.

normalization/languages/dutch/replacements.py (1)
26-27: `euro → euros` canonical form is non-standard Dutch.

The Dutch plural of `euro` is `euro` (official) or colloquially `euro's`; `euros` is the English form. Since the pipeline also maps the currency symbol `€` to `"euros"` in `DUTCH_CONFIG.currency_symbol_to_word`, this is internally consistent as a canonical evaluation token, but it's worth a one-line comment here documenting that `euros` is an internal canonical form (not a Dutch word) so future contributors don't "fix" it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/replacements.py` around lines 26 - 27, The mapping "euro": "euros" uses a non-standard Dutch canonical token; update replacements.py to add a one-line comment next to the "euro": "euros" entry clarifying that "euros" is an internal canonical form (not the Dutch plural) and noting that DUTCH_CONFIG.currency_symbol_to_word also maps '€' to "euros" for consistency, so future contributors don't "fix" it to a real Dutch word.

normalization/languages/dutch/number_normalizer.py (3)
5-13: Move module docstring to the top of the file.

The triple-quoted string at lines 5–13 appears after the imports, so Python does not treat it as the module docstring (`__doc__` will be `None`). Place it above the `import` statements for it to be picked up by tooling (`help()`, IDE hover, Sphinx).

Proposed change

```diff
-import re
-from fractions import Fraction
-from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-
-- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
-- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
-- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
-- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
-- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
-"""
+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
+import re
+from fractions import Fraction
+from typing import Iterator, Match
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The module docstring currently appears after the import statements so __doc__ is None; move the triple-quoted Dutch number normalizer string to the very top of the file before any import lines so it becomes the module docstring available to help()/IDE/Sphinx, ensuring you preserve the exact text and spacing and leave the imports unchanged below it.
210-212: `has_prefix` check can misfire on 1-char numeric tokens like `"0"`.

`self.prefixes` contains the string `"0"` (from `"nul": "0"` in `preceding_prefixers`), so for a token of literal `"0"`, `current[0]` is `"0"`, which is in `self.prefixes` → `has_prefix = True`, and `current_without_prefix = ""` fails the regex. Control falls through to the `current_lower not in self.words` branch (since `"0"` isn't a Dutch word) and the token is yielded via `output(current)` — so functionally it survives, but the logic is accidentally load-bearing.

Two small cleanups worth considering:

- Exclude numeric digits from `self.prefixes`, or
- Short-circuit `has_prefix` to only trigger on non-digit first characters:

```diff
-        has_prefix = current[0] in self.prefixes
+        has_prefix = not current[0].isdigit() and current[0] in self.prefixes
```

No user-visible bug today as far as I can tell; flagging for defensive clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 210 - 212, The has_prefix check can mis-detect single-digit tokens (e.g., "0") because self.prefixes may contain digit strings; change the predicate so it only treats a leading prefix when the first character is not a digit (e.g., set has_prefix = (not current[0].isdigit()) and current[0] in self.prefixes) or alternatively remove digit entries from self.prefixes; update the logic around has_prefix/current_without_prefix and the regex match in the function that processes tokens (refer to variables has_prefix, current, current_without_prefix, and self.prefixes) so single-digit numeric tokens are not stripped into empty strings before the numeric regex check.
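The guarded predicate can be exercised standalone; the contents of `prefixes` below (a sign character plus the digit `"0"` from `"nul": "0"`) are an assumption standing in for the normalizer's real `self.prefixes`:

```python
def has_prefix(token: str, prefixes: set[str]) -> bool:
    # Never treat a leading digit as a strippable prefix, so a bare "0"
    # is not reduced to an empty string before the numeric regex check.
    return bool(token) and not token[0].isdigit() and token[0] in prefixes


prefixes = {"-", "0"}  # hypothetical contents, for illustration only
print(has_prefix("-5", prefixes))  # True: "-" is a genuine prefix
print(has_prefix("0", prefixes))   # False: digit first char is exempt
```

With this guard, the "accidentally load-bearing" fallthrough path is never reached for numeric tokens in the first place.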
79-85: `tens_plural` keys like `twintigen` are not standard Dutch plurals.

Building plural variants as `"twintig" + "en"` → `"twintigen"`, `"dertig" + "en"` → `"dertigen"`, etc. is not a real Dutch form (Dutch uses `tientallen`, `in de twintig`, or bare `twintig` for "twenties"). The same construction is used for `multipliers_plural`, but there the results (`honderden`, `duizenden`, `miljoenen`, `miljarden`, `biljoenen`) are actually correct and common — those are fine.

For `tens_plural` specifically, the generated keys are unlikely to appear in real transcripts but will still match if an ASR produces them, potentially yielding surprising output. If these were copy-pasted from the English/French normalizer template, consider dropping `tens_plural` entirely for Dutch, or verify at least one of these forms against corpus data before shipping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 79 - 85, The tens_plural dict is generating nonstandard Dutch keys (e.g., "twintigen") and should be removed to avoid false matches: delete the creation of tens_plural and change tens_suffixed to not include it (keep only tens_ordinal and any genuinely valid suffixed forms), leaving multipliers_plural as-is; update any references that expect tens_plural to instead rely on tens_ordinal or validated corpus-backed forms so tens_suffixed = {**self.tens_ordinal} (or merge only validated entries) and adjust downstream logic that used tens_plural accordingly.
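If `tens_plural` is dropped, the suffixed-tens table can be built from the (valid) ordinal forms alone. A minimal sketch — the variable names mirror the review's description and are assumptions about the real class, and the vocabularies are truncated:

```python
# Assumed mini-vocabulary; the real normalizer covers all tens.
tens = {"twintig": 20, "dertig": 30, "veertig": 40}

# Dutch tens ordinals take "-ste": twintigste, dertigste, veertigste.
tens_ordinal = {word + "ste": value for word, value in tens.items()}

# No synthetic "+en" plural keys such as "twintigen":
tens_suffixed = {**tens_ordinal}
```

This keeps ordinal matching intact while removing the non-words that could silently absorb ASR noise.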
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b667681a-4f7b-4a0d-b733-df033b4abe17
⛔ Files ignored due to path filters (1)
`tests/e2e/files/gladia-3/nl.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- normalization/languages/__init__.py
- normalization/languages/dutch/__init__.py
- normalization/languages/dutch/number_normalizer.py
- normalization/languages/dutch/operators.py
- normalization/languages/dutch/replacements.py
- normalization/languages/dutch/sentence_replacements.py
- tests/unit/languages/dutch_operators_test.py
```python
_TEMPORAL_S_AFTER = (
    r"ochtends|morgens|middags|namiddags|avonds|nachts|"
    r"zaterdags|zondags|weekend|weekends"
)
_RE_TEMPORAL_S = re.compile(
    rf"(?<!\w){_APOST}s(\s+)({_TEMPORAL_S_AFTER})\b",
    re.IGNORECASE,
)
_RE_CLITIC_S = re.compile(rf"(?<!\w){_APOST}s\b", re.IGNORECASE)
```
`_RE_CLITIC_S` may over-expand non-temporal `'s` to `is`.

After the temporal pass consumes `'s ochtends`/`'s avonds`/..., any remaining `'s` preceded by a non-word char (start of string, space, punctuation) is unconditionally replaced by `is`. That corrupts genuine Dutch constructions where standalone `'s` is a reduced genitive/adverbial article, e.g.:

- `'s werelds grootste` (= "the world's largest") → `is werelds grootste`
- `'s lands belangen` → `is lands belangen`

These are less frequent than the informal "dat 's goed" → "dat is goed" case you want to cover, but silently wrong when they do appear. Consider either (a) extending `_TEMPORAL_S_AFTER` to include `werelds|lands|...`, or (b) only replacing `'s` with `is` when the previous token is consistent with a reduced copula (e.g. require `(?:dat|hoe|wat|er|hier|daar)\s+` before it).
Also applies to: 99-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/operators.py` around lines 16 - 24, The current
_RE_CLITIC_S unconditionally matches any leading non-word then _APOST s and will
incorrectly convert genitive/adverbial constructions like "'s werelds" to "is
werelds"; update the replacement regex logic so we only convert reduced copula
instances: replace _RE_CLITIC_S with a pattern that requires a copula-like
preceding token (e.g. use a positive lookbehind such as
(?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or alternatively add
exclusions by extending _TEMPORAL_S_AFTER to include known genitive words like
werelds|lands if you prefer the whitelist approach; adjust the code that
compiles _RE_CLITIC_S to use the chosen narrower pattern so only true "'s"
copula cases are replaced.
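Option (b) can be sketched standalone as a narrowing of the pattern — the copula whitelist and apostrophe class here are illustrative stand-ins, not the module's actual constants:

```python
import re

_APOST = "['\u2019]"  # straight or curly apostrophe (assumed class)
_COPULA_BEFORE = r"(?:dat|hoe|wat|er|hier|daar)"

# Only expand 's to "is" directly after a copula-friendly token, so
# genitive uses like "'s werelds" at the start of a phrase are untouched.
_RE_CLITIC_S = re.compile(
    rf"\b({_COPULA_BEFORE})\s+{_APOST}s\b", re.IGNORECASE
)


def expand_clitic_s(text: str) -> str:
    return _RE_CLITIC_S.sub(r"\1 is", text)


print(expand_clitic_s("dat 's goed"))          # dat is goed
print(expand_clitic_s("'s werelds grootste"))  # left unchanged
```

The trade-off is recall: copula contractions after tokens outside the whitelist are no longer expanded, which is why the whitelist should be grown from corpus data rather than guessed.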
```python
    # Formal / informal pronoun conflation (Flemish customer service)
    # ref uses formal u/uw; models transcribe je — normalise to je
    "u": "je",
    "uw": "je",
```
Aggressive conflation of formal `u` / possessive `uw` → `je` can corrupt evaluation.

Both the subject/object formal pronoun `u` and the possessive `uw` collapse to `je`. While the comment says this is intentional for Flemish customer-service transcripts, applying it unconditionally means:

- Possessive semantics are lost (`uw auto` → `je auto` is ok, but a model output `jouw auto` would normalize to `jouw auto`, producing asymmetric WER behavior between reference and hypothesis if only one side uses the formal register).
- Any non-customer-service Dutch transcript with formal register (medical, legal, news) will be silently destructured — e.g. `u bent` → `je bent`, even when the reference actually said `u`.

At minimum, consider documenting in the operator/module docstring that `DutchOperators` is tuned for Flemish informal CS transcripts, or guard these mappings behind an opt-in config. If this is truly the desired canonical form for WER, the current doc comment is fine but worth a prominent note so downstream users don't get surprised.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/replacements.py` around lines 21 - 24, The
unconditional mappings "u": "je" and "uw": "je" in
normalization/languages/dutch/replacements.py aggressively conflate formal and
informal pronouns and should be either documented or made opt-in; update the
module/class docstring (e.g. DutchOperators) to clearly state these rules are
tuned for Flemish informal customer-service transcripts and their impact on WER,
or refactor the mappings into a guarded/parameterized option (e.g. a boolean
flag passed to the normalization routine that enables the "u"/"uw" → "je"
replacement) so callers can opt in when appropriate; ensure the change
references the exact mapping keys "u" and "uw" and the normalization entry point
(the Dutch normalization function or DutchOperators) so callers know how to
enable/disable it.
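The opt-in variant could look like the sketch below; the flag name, the dict split, and the sample replacement entries are all hypothetical, not the project's actual API:

```python
# Hypothetical guarded mapping: formal-pronoun conflation is off by default.
_BASE_REPLACEMENTS = {"ge": "je", "da": "dat", "okee": "oke"}
_FORMAL_TO_INFORMAL = {"u": "je", "uw": "je"}


def get_word_replacements(
    conflate_formal_pronouns: bool = False,
) -> dict[str, str]:
    replacements = dict(_BASE_REPLACEMENTS)
    if conflate_formal_pronouns:
        # Only Flemish informal customer-service evaluations opt in here.
        replacements.update(_FORMAL_TO_INFORMAL)
    return replacements


print("u" in get_word_replacements())     # False by default
print(get_word_replacements(True)["uw"])  # je
```

Keeping the default conservative means formal-register transcripts evaluate unchanged unless a caller explicitly asks for the conflation.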
Tests for steps must be in the corresponding test files in /tests/unit/step/text
Maybe it would be relevant to add more tests covering this behavior, as it looks very complex
What does this PR do?
This change adds Dutch (nl) normalization—operators, number parsing, Flemish clitic expansion, dialect/word and phrase replacements, plus unit and e2e tests—so transcripts normalize consistently for evaluation.
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
New language
- `normalization/languages/{lang}/` with `operators.py`, `replacements.py`, `__init__.py`
- `replacements.py` (not hardcoded in `operators.py`)
- `LanguageConfig` is filled in with the language's data (separators, currency words, digit words, …)
- `LanguageOperators` — only override methods where the logic changes, not just the data
- `@register_language` and imported in `normalization/languages/__init__.py`
- `tests/unit/languages/`
- `tests/e2e/files/{preset}/{lang}.csv` (e.g. `tests/e2e/files/gladia-3/fr.csv`)

Edit existing language

- `replacements.py`, not inline in `operators.py`
- `None`: the step reading it still handles `None` gracefully

New step

- `name` class attribute set (this is the key used in YAML presets)
- `@register_step` and imported in `steps/text/__init__.py` or `steps/word/__init__.py`
- `operators.config.*` instead
- `steps/text/placeholders.py` and `pipeline/base.py`'s `validate()` is updated
- `tests/unit/steps/`
- `uv run scripts/generate_step_docs.py`

Edit existing step

- `name` is unchanged — if the output changes, create a new step name + new preset instead
- `uv run scripts/generate_step_docs.py`

Preset change

- `pipeline.validate()` passes (runs automatically via `loader.py`)

How was this tested?
Summary by CodeRabbit