Add dutch language normalization#19

Open
egenthon-cmd wants to merge 2 commits into main from feat/dutch-normalizer-bis

Conversation

@egenthon-cmd
Contributor

@egenthon-cmd egenthon-cmd commented Apr 17, 2026

What does this PR do?

This change adds Dutch (nl) normalization—operators, number parsing, Flemish clitic expansion, dialect/word and phrase replacements, plus unit and e2e tests—so transcripts normalize consistently for evaluation.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.


New language

  • Created normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
  • Word substitutions are in replacements.py (not hardcoded in operators.py)
  • LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
  • Subclassed LanguageOperators — only override methods where the logic changes, not just the data
  • Class is decorated with @register_language and imported in normalization/languages/__init__.py
  • Unit tests added in tests/unit/languages/
  • E2e CSV added in tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

New step

  • Unique name class attribute set (this is the key used in YAML presets)
  • Decorated with @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
  • No hardcoded language values — read data from operators.config.* instead
  • If placeholder-based: protect + restore are both in steps/text/placeholders.py and pipeline/base.py's validate() is updated
  • Unit tests added in tests/unit/steps/
  • Step name added to the relevant preset YAML — or a new preset file created if existing presets are affected
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Edit existing step

  • Step name is unchanged — if the output changes, create a new step name + new preset instead
  • No language-specific logic or string literals added inside the step
  • Unit tests updated or added
  • If the docstring changed: ran uv run scripts/generate_step_docs.py

Preset change

  • Existing preset files are not modified — new behaviour goes in a new preset file
  • pipeline.validate() passes (runs automatically via loader.py)

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features
    • Added Dutch language support with comprehensive text normalization
    • Converts written Dutch numbers to numeric formats
    • Expands Dutch contractions and dialectal variants to standard Dutch
    • Normalizes common Dutch words and phrases
    • Supports Dutch currency formatting and symbols

@coderabbitai

coderabbitai bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

Dutch language support is added to the normalization package through a new normalization/languages/dutch module. This includes operators for text normalization, a number converter for Dutch written-number expansion, word and sentence replacement dictionaries, and comprehensive unit tests validating the implementation.

Changes

  • Package Initialization — normalization/languages/__init__.py, normalization/languages/dutch/__init__.py: Parent package imports and exports the new Dutch module; the Dutch package exports DutchOperators and DUTCH_REPLACEMENTS for external use.
  • Dutch Language Operators — normalization/languages/dutch/operators.py: Implements the DutchOperators class with methods for expanding written numbers via DutchNumberNormalizer, normalizing contractions/clitics (Flemish variants and temporal forms), and returning Dutch word replacement rules. Registered as "nl" in the language registry.
  • Dutch Number Normalization — normalization/languages/dutch/number_normalizer.py: Implements DutchNumberNormalizer, converting Dutch spelled-out numbers into numeric strings. Handles composition patterns (e.g. "een en twintig"), ordinals/suffixed forms, multipliers (honderd/duizend/miljoen), decimals (komma), currency symbols, sign affixes, and fractional inputs. Postprocesses euro-cent combinations.
  • Dutch Replacement Data — normalization/languages/dutch/replacements.py, normalization/languages/dutch/sentence_replacements.py: Define DUTCH_REPLACEMENTS, mapping Flemish/colloquial variants to standard Dutch forms (e.g. "ge" → "je", "da" → "dat"), and DUTCH_SENTENCE_REPLACEMENTS for phrase-level normalizations (e.g. numeric variants, spelling corrections).
  • Tests — tests/unit/languages/dutch_operators_test.py: Validates Dutch operator integration, contraction expansion, configuration correctness, word replacements, and written-number expansion with currency handling.

Sequence Diagram

sequenceDiagram
    participant Client as Client Code
    participant Operators as DutchOperators
    participant Normalizer as DutchNumberNormalizer
    participant Regex as Regex Engine
    
    Client->>Operators: expand_written_numbers(text)
    Operators->>Normalizer: __call__(text)
    
    Normalizer->>Normalizer: preprocess(text)
    Note over Normalizer: Clean spacing/ordinal patterns
    
    Normalizer->>Normalizer: tokenize by whitespace
    Note over Normalizer: Split into word tokens
    
    Normalizer->>Normalizer: process_words(tokens)
    Note over Normalizer: Parse Dutch number grammar<br/>Handle composition, ordinals,<br/>multipliers, currency, signs
    
    Normalizer->>Regex: postprocess(result)
    Note over Regex: Normalize euro-cent patterns<br/>Combine currency forms
    
    Normalizer-->>Operators: return normalized string
    Operators-->>Client: return result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Karamouche

Poem

🐰 A fuzzy-eared hop through Dutch words so fine,
Numbers and contractions now perfectly align!
From "een en twintig" to euro cents neat,
The Dutch normalization is wonderfully complete! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title 'Add dutch language normalization' clearly and specifically summarizes the main change: adding Dutch language support to the normalization system.
  • Description check — ✅ Passed: The PR description is complete, with all required sections for a new language addition, including the checklist with all relevant items marked as completed.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (7)
tests/unit/languages/dutch_operators_test.py (2)

45-51: Consolidate repeated get_word_replacements() calls.

Each assertion calls operators.get_word_replacements() again. The function is cheap, but it's cleaner and faster to bind once:

Proposed change
 def test_word_replacements(operators):
-    assert operators.get_word_replacements()["ge"] == "je"
-    assert operators.get_word_replacements()["da"] == "dat"
-    assert operators.get_word_replacements()["u"] == "je"
-    assert operators.get_word_replacements()["uw"] == "je"
-    assert operators.get_word_replacements()["okee"] == "oke"
-    assert operators.get_word_replacements()["euro"] == "euros"
+    replacements = operators.get_word_replacements()
+    assert replacements["ge"] == "je"
+    assert replacements["da"] == "dat"
+    assert replacements["u"] == "je"
+    assert replacements["uw"] == "je"
+    assert replacements["okee"] == "oke"
+    assert replacements["euro"] == "euros"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 45 - 51, The test
repeatedly calls operators.get_word_replacements(); change
test_word_replacements to call operators.get_word_replacements() once, assign
the result to a local variable (e.g., replacements) and use replacements["ge"],
replacements["da"], replacements["u"], replacements["uw"], replacements["okee"],
replacements["euro"] in the assertions to avoid repeated calls and improve
clarity.

54-62: Consider adding a negative/edge test for the compound-number Dutch order.

The existing tests cover single-word amounts (tien euro) and numeric+symbol forms (€10, $3.50). Given that the Dutch ones + "en" + tens composition is a non-trivial branch in DutchNumberNormalizer.process_words (the pending_ones state), please add at least one assertion exercising it end-to-end, e.g. "vijfentwintig euro" → "25 euros" (or the spaced variant "een en twintig euro" → "21 euros"). Without it, regressions in the compound-order logic won't be caught by this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add a
negative/edge unit test exercising the Dutch compound-number branch by asserting
that operators.expand_written_numbers correctly normalizes compound words like
"vijfentwintig euro" (or the spaced variant "een en twintig euro") to "25
euros"; this targets the DutchNumberNormalizer.process_words logic and its
pending_ones handling, so add an assertion in
tests/unit/languages/dutch_operators_test.py alongside
test_expand_written_numbers_euro_after_amount_dutch_order that calls
operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros"
(and optionally another for "een en twintig euro").
normalization/languages/dutch/operators.py (1)

108-111: Move the DUTCH_REPLACEMENTS import to module top-level.

There's no circular-import reason to defer this — replacements.py imports nothing from this module. Other language implementations import the replacements dict at module top. A local import here adds per-call overhead for no gain.

Proposed change
 from normalization.languages.dutch.number_normalizer import DutchNumberNormalizer
+from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
 from normalization.languages.dutch.sentence_replacements import (
     DUTCH_SENTENCE_REPLACEMENTS,
 )
@@
     def get_word_replacements(self) -> dict[str, str]:
-        from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
         return DUTCH_REPLACEMENTS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The
get_word_replacements method currently performs a local import of
DUTCH_REPLACEMENTS which adds per-call overhead; move the import of
DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top of
this module and update get_word_replacements (the function name) to simply
return the top-level DUTCH_REPLACEMENTS constant so other language modules
follow the same pattern and avoid repeated local imports.
normalization/languages/dutch/replacements.py (1)

26-27: euro → euros canonical form is non-standard Dutch.

The Dutch plural of euro is euro (official) or colloquially euro's; euros is the English form. Since the pipeline also maps the currency symbol to "euros" in DUTCH_CONFIG.currency_symbol_to_word, this is internally consistent as a canonical evaluation token, but it's worth a one-line comment here documenting that euros is an internal canonical form (not a Dutch word) so future contributors don't "fix" it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/replacements.py` around lines 26 - 27, The
mapping "euro": "euros" uses a non-standard Dutch canonical token; update
replacements.py to add a one-line comment next to the "euro": "euros" entry
clarifying that "euros" is an internal canonical form (not the Dutch plural) and
noting that DUTCH_CONFIG.currency_symbol_to_word also maps '€' to "euros" for
consistency, so future contributors don't "fix" it to a real Dutch word.
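A sketch of what the suggested documentation could look like; only the "euro" entry comes from the discussion, the rest of the dict is elided:

```python
# Illustrative excerpt of DUTCH_REPLACEMENTS with the suggested comment;
# all other entries omitted for brevity.
DUTCH_REPLACEMENTS = {
    # "euros" is an internal canonical evaluation token, not the Dutch plural
    # ("euro", or colloquially "euro's"); DUTCH_CONFIG.currency_symbol_to_word
    # maps the € symbol to the same token. Don't "fix" this to a Dutch form.
    "euro": "euros",
}
```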
normalization/languages/dutch/number_normalizer.py (3)

5-13: Move module docstring to the top of the file.

The triple-quoted string at lines 5–13 appears after the imports, so Python does not treat it as the module docstring (__doc__ will be None). Place it above the import statements for it to be picked up by tooling (help(), IDE hover, Sphinx).

Proposed change
-import re
-from fractions import Fraction
-from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-
-- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
-- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
-- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
-- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
-- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
-"""
+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
+import re
+from fractions import Fraction
+from typing import Iterator, Match
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The
module docstring currently appears after the import statements so __doc__ is
None; move the triple-quoted Dutch number normalizer string to the very top of
the file before any import lines so it becomes the module docstring available to
help()/IDE/Sphinx, ensuring you preserve the exact text and spacing and leave
the imports unchanged below it.

210-212: has_prefix check can misfire on 1-char numeric tokens like "0".

self.prefixes contains the string "0" (from "nul": "0" in preceding_prefixers), so a literal token "0" has current[0] in self.prefixes → has_prefix = True, and current_without_prefix = "" fails the regex. Control falls through to the current_lower not in self.words branch (since "0" isn't a Dutch word) and the token is yielded via output(current) — so functionally it survives, but the logic is accidentally load-bearing.

Two small cleanups worth considering:

  1. Exclude numeric digits from self.prefixes, or
  2. Short-circuit has_prefix to only trigger on non-digit first characters:
-            has_prefix = current[0] in self.prefixes
+            has_prefix = not current[0].isdigit() and current[0] in self.prefixes

No user-visible bug today as far as I can tell; flagging for defensive clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 210 - 212,
The has_prefix check can mis-detect single-digit tokens (e.g., "0") because
self.prefixes may contain digit strings; change the predicate so it only treats
a leading prefix when the first character is not a digit (e.g., set has_prefix =
(not current[0].isdigit()) and current[0] in self.prefixes) or alternatively
remove digit entries from self.prefixes; update the logic around
has_prefix/current_without_prefix and the regex match in the function that
processes tokens (refer to variables has_prefix, current,
current_without_prefix, and self.prefixes) so single-digit numeric tokens are
not stripped into empty strings before the numeric regex check.

79-85: tens_plural keys like twintigen are not standard Dutch plurals.

Building plural variants as "twintig" + "en" → "twintigen", "dertig" + "en" → "dertigen", etc. does not produce real Dutch forms (Dutch uses tientallen, in de twintig, or bare twintig for "twenties"). The same construction for multipliers_plural is fine — honderden, duizenden, miljoenen, miljarden, biljoenen are actually correct and common.

For tens_plural specifically, the generated keys are unlikely to appear in real transcripts but will still match if an ASR produces them, potentially yielding surprising output. If these were copy-pasted from the English/French normalizer template, consider dropping tens_plural entirely for Dutch, or verify at least one of these forms against corpus data before shipping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/number_normalizer.py` around lines 79 - 85, The
tens_plural dict is generating nonstandard Dutch keys (e.g., "twintigen") and
should be removed to avoid false matches: delete the creation of tens_plural and
change tens_suffixed to not include it (keep only tens_ordinal and any genuinely
valid suffixed forms), leaving multipliers_plural as-is; update any references
that expect tens_plural to instead rely on tens_ordinal or validated
corpus-backed forms so tens_suffixed = {**self.tens_ordinal} (or merge only
validated entries) and adjust downstream logic that used tens_plural
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b667681a-4f7b-4a0d-b733-df033b4abe17

📥 Commits

Reviewing files that changed from the base of the PR and between baec872 and b458749.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3/nl.csv is excluded by !**/*.csv
📒 Files selected for processing (7)
  • normalization/languages/__init__.py
  • normalization/languages/dutch/__init__.py
  • normalization/languages/dutch/number_normalizer.py
  • normalization/languages/dutch/operators.py
  • normalization/languages/dutch/replacements.py
  • normalization/languages/dutch/sentence_replacements.py
  • tests/unit/languages/dutch_operators_test.py

Comment on lines +16 to +24
_TEMPORAL_S_AFTER = (
r"ochtends|morgens|middags|namiddags|avonds|nachts|"
r"zaterdags|zondags|weekend|weekends"
)
_RE_TEMPORAL_S = re.compile(
rf"(?<!\w){_APOST}s(\s+)({_TEMPORAL_S_AFTER})\b",
re.IGNORECASE,
)
_RE_CLITIC_S = re.compile(rf"(?<!\w){_APOST}s\b", re.IGNORECASE)

⚠️ Potential issue | 🟡 Minor

_RE_CLITIC_S may over-expand non-temporal 's to is.

After the temporal pass consumes 's ochtends/avonds/..., any remaining 's preceded by a non-word char (start of string, space, punctuation) is unconditionally replaced by is. That corrupts genuine Dutch constructions where standalone 's is a reduced genitive/adverbial article, e.g.:

  • 's werelds grootste (= "of the world's largest") → is werelds grootste
  • 's lands belangen → is lands belangen

These are less frequent than the informal "dat 's goed" → "dat is goed" case you want to cover, but silently wrong when they do appear. Consider either (a) extending _TEMPORAL_S_AFTER to include werelds|lands|..., or (b) only replacing 's with is when the previous token ends with a vowel/word consistent with a reduced copula (e.g. require (?:dat|hoe|wat|er|hier|daar)\s+ before it).

Also applies to: 99-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/operators.py` around lines 16 - 24, The current
_RE_CLITIC_S unconditionally matches any leading non-word then _APOST s and will
incorrectly convert genitive/adverbial constructions like "'s werelds" to "is
werelds"; update the replacement regex logic so we only convert reduced copula
instances: replace _RE_CLITIC_S with a pattern that requires a copula-like
preceding token (e.g. use a positive lookbehind such as
(?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or alternatively add
exclusions by extending _TEMPORAL_S_AFTER to include known genitive words like
werelds|lands if you prefer the whitelist approach; adjust the code that
compiles _RE_CLITIC_S to use the chosen narrower pattern so only true "'s"
copula cases are replaced.
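Option (b) could look roughly like this; the copula word list and the apostrophe class are assumptions for illustration, not the module's actual _APOST or token inventory:

```python
import re

# Only expand 's to "is" after a copula-friendly preceding token, so genitive
# constructions like "'s werelds grootste" are left alone. Illustrative only.
_APOST = r"['\u2019]"  # straight or curly apostrophe
_RE_CLITIC_S = re.compile(
    rf"\b(dat|hoe|wat|er|hier|daar)\s+{_APOST}s\b",
    re.IGNORECASE,
)

def expand_clitic_s(text: str) -> str:
    # Rebuild "dat 's" as "dat is", keeping the preceding token intact.
    return _RE_CLITIC_S.sub(lambda m: f"{m.group(1)} is", text)

print(expand_clitic_s("dat 's goed"))          # dat is goed
print(expand_clitic_s("'s werelds grootste"))  # unchanged
```

The trade-off is coverage: a whitelist of preceding tokens misses copula uses after other words, while the current blanket rule over-fires on genitives, so the list would need tuning against transcript data.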

Comment on lines +21 to +24
# Formal / informal pronoun conflation (Flemish customer service)
# ref uses formal u/uw; models transcribe je — normalise to je
"u": "je",
"uw": "je",

⚠️ Potential issue | 🟠 Major

Aggressive conflation of formal u / possessive uwje can corrupt evaluation.

Both the subject/object formal pronoun u and the possessive uw collapse to je. While the comment says this is intentional for Flemish customer-service transcripts, applying it unconditionally means:

  • Possessive semantics are lost (uw auto → je auto is fine, but a model output jouw auto would normalize to jouw auto, producing asymmetric WER behavior between reference and hypothesis if only one side uses the formal register).
  • Any non-customer-service Dutch transcript with formal register (medical, legal, news) will be silently rewritten — e.g. u bent → je bent, even when the reference actually said u.

At minimum, consider documenting in the operator/module docstring that DutchOperators is tuned for Flemish informal CS transcripts, or guard these mappings behind an opt-in config. If this is truly the desired canonical form for WER, the current doc comment is fine but worth a prominent note so downstream users don't get surprised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/dutch/replacements.py` around lines 21 - 24, The
unconditional mappings "u": "je" and "uw": "je" in
normalization/languages/dutch/replacements.py aggressively conflate formal and
informal pronouns and should be either documented or made opt-in; update the
module/class docstring (e.g. DutchOperators) to clearly state these rules are
tuned for Flemish informal customer-service transcripts and their impact on WER,
or refactor the mappings into a guarded/parameterized option (e.g. a boolean
flag passed to the normalization routine that enables the "u"/"uw" → "je"
replacement) so callers can opt in when appropriate; ensure the change
references the exact mapping keys "u" and "uw" and the normalization entry point
(the Dutch normalization function or DutchOperators) so callers know how to
enable/disable it.
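One way to make the conflation opt-in, as a rough sketch — the flag name and the dict split are illustrative, not the actual DutchOperators API:

```python
# Base replacements always applied; formal-pronoun conflation behind a flag.
BASE_REPLACEMENTS = {"ge": "je", "da": "dat"}   # Flemish clitics (from the PR)
FORMAL_TO_INFORMAL = {"u": "je", "uw": "je"}    # the contested mappings

def get_word_replacements(conflate_formal: bool = False) -> dict[str, str]:
    """Return word replacements; formal u/uw -> je only when opted in."""
    replacements = dict(BASE_REPLACEMENTS)
    if conflate_formal:
        replacements.update(FORMAL_TO_INFORMAL)
    return replacements

print(get_word_replacements())                      # base mappings only
print(get_word_replacements(conflate_formal=True))  # includes u/uw -> je
```

Defaulting the flag to False keeps formal-register transcripts intact, while the Flemish customer-service evaluation path can enable it explicitly.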

Collaborator

Tests for steps must be in the corresponding test files in /tests/unit/step/text

Collaborator

It might be worth adding more tests covering this behavior, as it looks very complex.
