Skip to content

Feature/grib1 to grib2 error handling#267

Open
dsarmany wants to merge 9 commits intodevelopfrom
feature/grib1-to-grib2-error-handling
Open

Feature/grib1 to grib2 error handling#267
dsarmany wants to merge 9 commits intodevelopfrom
feature/grib1-to-grib2-error-handling

Conversation

@dsarmany
Copy link
Copy Markdown
Collaborator

@dsarmany dsarmany commented Apr 27, 2026

Description

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

🌈🌦️📖🚧 Documentation 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/multio/pull-requests/PR-267

Batch conversion runs over large archives commonly hit a handful of
per-message defects (e.g. producer-side metadata such as
numberOfForecastsInEnsemble=0 while number!=0, or fields that no
recipe can encode). Previously any single such message aborted the
whole run.

- Add --on-error {abort|log-and-skip|skip}, defaulting to
  log-and-skip, wrapping the per-message conversion body in a
  try/catch and reporting a skipped-message tally at the end.
- Add --default-ensemble-size <N>: when numberOfForecastsInEnsemble
  is 0 but number!=0, substitute N instead of throwing. Default 0
  preserves the previous strict behaviour.
- Thread defaultEnsembleSize through mapGrib1ToGrib2 as a new
  trailing parameter (default 0) so existing call sites are
  unaffected.

These flags make the tool usable in large parallel conversion
pipelines while keeping the safe strict default available.
If misc.lengthOfTimeWindow is not set from the standard keys, read the
GRIB1 local-section key lengthOf4DvarWindow (hours) from the input
handle and use it, skipping the 0xFFFF missing sentinel. Fixes cases
where 4D-Var analysis fields produced GRIB2 with a missing
lengthOf4DvarWindow.
For mars type=eme (4D-var model errors, EDA long-window 4Dvar system)
the matching GRIB2 local definition 39 requires numberOfComponents
and modelErrorType in addition to the standard window keys. These
two values have no MARS equivalent and must be forwarded from the
input GRIB1 handle (which already carries local definition 39).

Adds the corresponding misc EntryDefs and MiscRecord slots, and
populates them from the input handle in mapGrib1ToGrib2 when
mars.type == "eme". The metkit encoder now consumes them via the
new AnalysisType::ModelErrors variant.
Hybrid-level messages whose source GRIB carried no vertical-coordinate
parameters (NV=0, e.g. lnsp on ml) previously round-tripped with a
spurious PV137 (1002-double) array because the metkit encoder's
resolve_PvArray_or_throw fallback fires on an empty par dict.

Forward an explicit suppression signal:
  - Add PVPresent to MiscRecord so it round-trips through
    dumpUnscopedRecord into par["PVPresent"].
  - In mapGrib1ToGrib2, when the input handle has no pv key and reports
    NV=0, set misc.pvPresent=false. The matching metkit change honours
    this flag in LevelOp StageAllocate and emits PVPresent=0 with no
    pv key.

Verified end-to-end on an:ml:12 lnsp sh: output NV=0 preserved (was
NV=1002). No regression on hybrid messages with valid pv (NV=276).

NOTE (revisit): pairs with the metkit-side commit honouring
par["PVPresent"]=false. Once both land we can revisit whether
resolve_PvArray_or_throw should still default to PV137 on an empty par.
…erator=9.9 sentinel

Some IFS-produced spectral_complex messages (notably SPP random fields,
type=al, paramId 213131-213160) carry laplacianOperator=9.9 as an IFS
sentinel without actually pre-laplacian-scaling the stored coefficients.
Re-encoding such messages causes ecCodes' grib_ieee_to_long to assert
because (T*(T+1))^9.9 multiplied by the raw coefficients overflows
IEEE32 (FLT_MAX). The assertion is unrecoverable from inside ecCodes
and bypasses the existing log-and-skip path, aborting the whole tool.

Reject these messages up front with eckit::BadValue. The existing
top-level catch in grib1-to-grib2.cc treats BadValue as recoverable:
  - --on-error=abort: terminates with "Bad value:" prefix
  - --on-error=log-and-skip: reports "skipped 1 message(s)", exits 0

SeriousBug would have been semantically wrong here: the precondition
violation is both expected (legacy IFS encoding bug) and recoverable.

Verified by encoding fc:al:3 reproducer (180 msgs): 90 real laplacian
messages (paramIds 213101-213130) succeed; 90 sentinel messages are
skipped cleanly under --on-error=log-and-skip.

NOTE (revisit): upstream IFS encoding bug. Once IFS stops emitting
sentinel laplacianOperator=9.9 (or sets it to 0 when no scaling was
actually applied), this guard becomes dead code.
The existing ensemble branch requires numberOfForecastsInEnsemble
(absent for eme) to set mars.number. The encoder's analysis ModelErrors
path needs mars.number to deduce componentIndex; forward it
unconditionally inside the eme block.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the grib1-to-grib2 conversion tool’s resilience by adding per-message error handling and additional metadata/validation logic to avoid hard failures in known problematic GRIB encodings, plus new GRIB sample templates used by multio test/config tooling.

Changes:

  • Add per-message error handling (--on-error) and a pre-encode guard for known spectral_complex laplacian-scaling overflow cases.
  • Improve metadata forwarding/extraction (explicit PV absence via pvPresent, support for type=eme local definition 39 keys, and a lengthOf4DvarWindow fallback).
  • Add additional unstructured GRIB template samples under tests/multio/config/grib-samples/samples.unstructured/.

Reviewed changes

Copilot reviewed 3 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/multio/tools/grib1-to-grib2.cc Adds per-message error handling, overflow validation, and multiple metadata extraction/forwarding improvements.
src/multio/datamod/MarsMiscGeo.h Extends MiscRecord with additional optional entries (numberOfComponents, modelErrorType, pvPresent).
src/multio/datamod/GribKeys.h Introduces optional GRIB key defs for local definition 39 (numberOfComponents, modelErrorType).
tests/multio/config/grib-samples/samples.unstructured/unstr_latlon.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_inst_pf.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_inst_fc.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_inst_an.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_avg_tpa.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_avg_pf.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_avg_fc.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_avg_an.tmpl Adds new unstructured GRIB sample template.
tests/multio/config/grib-samples/samples.unstructured/unstr_avg.tmpl Adds new unstructured GRIB sample template.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
}

if (skippedCount > 0) {
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--on-error=skip is documented as “silently skip failing messages”, but the tool still emits the end-of-run summary ("grib1-to-grib2: skipped ...") whenever skippedCount > 0. Either suppress this summary when onErrorHandling_ == OnErrorHandling::Skip, or update the option help text to state that a summary will still be printed.

Suggested change
if (skippedCount > 0) {
if (skippedCount > 0 && onErrorHandling_ != OnErrorHandling::Skip) {

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 3.94737% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.26%. Comparing base (273f60f) to head (99b2261).
⚠️ Report is 3 commits behind head on develop.

Files with missing lines Patch % Lines
src/multio/tools/grib1-to-grib2.cc 0.00% 73 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #267      +/-   ##
===========================================
- Coverage    61.40%   61.26%   -0.14%     
===========================================
  Files          320      320              
  Lines        21854    21708     -146     
  Branches      1655     1674      +19     
===========================================
- Hits         13419    13300     -119     
+ Misses        8435     8408      -27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

… for type=me

Extend the existing type=eme forwarding block to also fire for
type=me (model errors). Both types use GRIB1 local definition 39 with
componentIndex (= mars.number), numberOfComponents and modelErrorType.
The metkit encoder's analysis ModelErrors path now handles both.
…ations for type=4i

For type=4i (4D-var analysis iterations) the GRIB1 input carries local
definition 38 with iterationNumber and totalNumberOfIterations.

iterationNumber is now deduced by metkit's IterationConcept from
mars.iteration (= ECMWF MARS keyword "iteration"), so this tool only
needs to populate that mars key. totalNumberOfIterations has no MARS
representation and is forwarded via the parameter dictionary.

Concretely:
  * mars.iteration = dm::parseEntry(dm::ITERATION, h) is set after
    mars.anoffset.
  * The misc data model loses the IterationNumber EntryDef
    (GribKeys.h) and the iterationNumber field (MarsMiscGeo.h);
    multio no longer forwards iterationNumber explicitly.
  * The new type=4i block in grib1-to-grib2.cc forwards only
    totalNumberOfIterations.

Pairs with metkit commit
  mars2grib: Refactor type=4i/eme/me handling into orthogonal Iteration,
  ModelErrors and Analysis concepts
which adds the IterationConcept that consumes mars.iteration.
@dsarmany dsarmany force-pushed the feature/grib1-to-grib2-error-handling branch from fc10c74 to 99b2261 Compare April 28, 2026 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants