[MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0 #18498
rahil-c wants to merge 4 commits into apache:master
Conversation
Bumps lance-core from 1.0.2 to 4.0.0 and the lance-spark connector from 0.0.15 to 0.4.0. Updates affected import paths and adapts to the `LanceArrowUtils.toArrowSchema` signature change (drops the `errorOnDuplicatedFieldNames` parameter).
Lance-spark 0.4.0 (bumped in 7e4967c) ships its own `org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside `lance-spark-base_*.jar`. This collides with Hudi's same-FQCN case class (added in hudi-spark-common). Both jars end up on the classpath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two case classes have different arity (Lance's takes 1 argument, Hudi's takes 2), Scala pattern matches like `case ShowIndexes(table, output)` fail to compile.

Rename Hudi's class to `HoodieShowIndexes` (and its companion object) to sidestep the collision. This is an internal logical-plan class consumed only by Hudi's own parser / CatalystPlanUtils / analyzer, so there are no public SQL or API surface changes.

Call-sites updated:
- Index.scala (definition + companion)
- HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match)
- HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct)
- IndexCommands.scala (doc reference)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With lance-spark 0.4.0, `VectorSchemaRoot.getFieldVectors()` returns vectors in the file's on-disk order rather than in the order of the projection requested via `LanceFileReader.readAll()`. Wrapping vectors positionally therefore mismatches the `UnsafeProjection` built from the requested schema, causing `UnsafeProjection` to call type accessors on the wrong column (e.g. `getInt` on a `VarCharVector`) and fail with `UnsupportedOperationException` for MoR reads, where the `FileGroupRecordBuffer` rearranges columns relative to the file's write order.

Fix by looking up each vector by field name from the requested schema so the `ColumnVector[]` order matches what `UnsafeProjection` expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
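The remapping idea can be sketched in plain Java with no Arrow dependency. Everything here is illustrative: `wrapByName`, the column names, and the `int[]` arrays standing in for Arrow `FieldVector`s are not the actual Hudi code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnOrderSketch {
  // Stand-in for VectorSchemaRoot.getFieldVectors(): columns keyed by name, in on-disk order.
  static Map<String, int[]> onDiskBatch() {
    Map<String, int[]> batch = new LinkedHashMap<>();
    batch.put("price", new int[]{10, 20}); // the file happened to write "price" first
    batch.put("id", new int[]{1, 2});
    return batch;
  }

  // The fix: walk the requested schema and look each column up by name,
  // instead of trusting the positional order of the batch.
  static int[][] wrapByName(List<String> requestedSchema, Map<String, int[]> batch) {
    int[][] columns = new int[requestedSchema.size()][];
    for (int i = 0; i < requestedSchema.size(); i++) {
      int[] col = batch.get(requestedSchema.get(i));
      if (col == null) {
        throw new IllegalStateException("missing column: " + requestedSchema.get(i));
      }
      columns[i] = col;
    }
    return columns;
  }

  public static void main(String[] args) {
    List<String> requested = Arrays.asList("id", "price"); // projection order
    int[][] cols = wrapByName(requested, onDiskBatch());
    // Column 0 is now "id" as the projection expects, regardless of on-disk order.
    System.out.println(Arrays.toString(cols[0]) + " " + Arrays.toString(cols[1]));
    // prints: [1, 2] [10, 20]
  }
}
```

Positional wrapping would instead hand the projection "price" data at index 0, which is exactly the accessor-type mismatch the commit describes.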
revisit renaming of show indexes and other options
yihua left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — the column-ordering fix in LanceRecordIterator is well-implemented with proper name-based lookup and clear error messages, and the ShowIndexes → HoodieShowIndexes rename is applied consistently across all Spark version modules.
retriggering CI
```scala
case class HoodieShowIndexes(table: LogicalPlan,
    override val output: Seq[Attribute] = HoodieShowIndexes.getOutputAttrs) extends Command {
```
Does this affect the SQL syntax support for Hudi? If not and the "show indexes" is tested to be working, I think this is fine.
Let me do a check on this.
@yihua Technically the CI is green for this PR, and we already have tests in the Hudi code base around SHOW INDEXES. For example, TestIndexSyntax.scala, TestSecondaryIndex, and TestExpressionIndex all run `show indexes` SQL in their tests.
Therefore I think there is no breaking change here.
```java
    long maxFileSize) {
  super(file, DEFAULT_BATCH_SIZE, bloomFilterOpt.map(HoodieBloomFilterRowWriteSupport::new));
  this.sparkSchema = sparkSchema;
  this.arrowSchema = LanceArrowUtils.toArrowSchema(sparkSchema, DEFAULT_TIMEZONE, true, false);
```
What is the removed fourth argument?
```java
List<FieldVector> fieldVectors = root.getFieldVectors();
Map<String, FieldVector> byName = new HashMap<>(fieldVectors.size() * 2);
for (FieldVector fv : fieldVectors) {
  byName.put(fv.getName(), fv);
}
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length
      + " for file: " + path);
}
columnVectors = new ColumnVector[sparkFields.length];
for (int i = 0; i < sparkFields.length; i++) {
  String name = sparkFields[i].name();
  FieldVector fv = byName.get(name);
  if (fv == null) {
    throw new HoodieException("Lance batch missing expected column '" + name
        + "' for file: " + path + "; available columns: " + byName.keySet());
  }
  columnVectors[i] = new LanceArrowColumnVector(fv);
}
```
Non-blocker: is the schema per batch/record or per file? Could this schema processing be extracted out per file to reduce overhead?
Let's create a follow-up to track this.
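As one possible shape for that follow-up, the name lookup could be hoisted to a once-per-file computation of source positions, leaving only array indexing per batch. This is an illustrative stdlib-only sketch under the assumption that every batch from one Lance file shares the same column order; `computeReorder` and `applyReorder` are hypothetical names, not existing Hudi methods.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ReorderOnce {
  // Computed once per file: reorder[i] = position in the batch of requested column i.
  static int[] computeReorder(String[] requestedNames, String[] batchNames) {
    Map<String, Integer> pos = new HashMap<>();
    for (int i = 0; i < batchNames.length; i++) {
      pos.put(batchNames[i], i);
    }
    int[] reorder = new int[requestedNames.length];
    for (int i = 0; i < requestedNames.length; i++) {
      Integer p = pos.get(requestedNames[i]);
      if (p == null) {
        throw new IllegalStateException("missing column: " + requestedNames[i]);
      }
      reorder[i] = p;
    }
    return reorder;
  }

  // Applied per batch: plain array indexing, no per-batch map construction.
  static <T> T[] applyReorder(T[] batchVectors, int[] reorder, T[] out) {
    for (int i = 0; i < reorder.length; i++) {
      out[i] = batchVectors[reorder[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] r = computeReorder(new String[]{"id", "price"}, new String[]{"price", "id"});
    String[] wired = applyReorder(new String[]{"priceVec", "idVec"}, r, new String[2]);
    System.out.println(Arrays.toString(wired)); // prints: [idVec, priceVec]
  }
}
```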
Shorten the prose blocks above the column-order remapping in LanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each, keeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow from lance-spark-base) without the full incident narrative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18498      +/-   ##
============================================
+ Coverage     68.84%   68.85%   +0.01%
- Complexity    28195    28455     +260
============================================
  Files          2459     2475      +16
  Lines        135152   136499    +1347
  Branches      16379    16594     +215
============================================
+ Hits          93039    93985     +946
- Misses        34746    34958     +212
- Partials      7367     7556      +189
```

Flags with carried forward coverage won't be shown.
Bumps lance-core from 1.0.2 to 4.0.0 and the lance-spark connector from 0.0.15 to 0.4.0, adapts Hudi to the lance-spark repackaging (`com.lancedb` → `org.lance`) and API signature changes, fixes a column-ordering regression in `LanceRecordIterator` introduced by the new connector's batch layout, and renames Hudi's internal `ShowIndexes` logical plan to `HoodieShowIndexes` so it no longer clashes with the same-named class shipped in lance-spark 0.4.0.

### Describe the issue this Pull Request addresses
Hudi's Lance integration targets an older lance-spark / lance-core release. With lance-spark 0.4.0 the Maven coordinates (`com.lancedb` → `org.lance`), package paths, and a few public signatures changed, and the in-memory Arrow batch layout returned by `VectorSchemaRoot.getFieldVectors()` now reflects the file's on-disk column order rather than the order requested in `LanceFileReader.readAll(columnNames, ...)`. That last change silently broke MoR reads through `LanceRecordIterator`: the `UnsafeProjection` is built from the Spark schema's column order, but the `ColumnVector[]` we were handing it was in the file's on-disk order, so `UnsafeProjection` would dispatch e.g. `getInt(0)` against a `VarCharVector` and throw `UnsupportedOperationException`. Additionally, lance-spark 0.4.0 ships a `ShowIndexes` logical plan of its own, so Hudi's identically-named case class needed to be renamed to avoid the collision.

### Summary and Changelog
1. Dependency bump (`[MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0`)
   - `pom.xml`: `lance.version` 1.0.2 → 4.0.0, `lance.spark.connector.version` 0.0.15 → 0.4.0; groupId `com.lancedb` → `org.lance` for both `lance-core` and the `lance-spark-3.5_${scala.binary.version}` artifact.
   - `HoodieSparkLanceWriter.java`: update import `com.lancedb.lance.spark.arrow.LanceArrowWriter` → `org.lance.spark.arrow.LanceArrowWriter`, and adapt the `LanceArrowUtils.toArrowSchema(...)` call site; the 0.4.0 signature drops the `errorOnDuplicatedFieldNames` parameter.
   - `LanceRecordIterator.java`: update import `com.lancedb.lance.spark.vectorized.LanceArrowColumnVector` → `org.lance.spark.vectorized.LanceArrowColumnVector`.
fix(lance): look up Arrow vectors by field name in LanceRecordIterator)hasNext(), the first batch now builds aMap<String, FieldVector>fromVectorSchemaRoot.getFieldVectors()and walks the SparkStructField[]in order, looking each vector up by name, instead of trusting positional order.HoodieExceptionwith the available column set if Lance returns a mismatched batch.UnsafeProjectionexpects and unblocks MoR reads (the symptom wasUnsupportedOperationExceptionfromArrowVectorAccessor.getIntduringUnsafeProjection.applyinTestLanceDataSource.testBasicUpsertModifyExistingRow/testBasicDeleteOperationon the MoR parameters).3. Rename
ShowIndexes→HoodieShowIndexes([MINOR] Rename Hudi's ShowIndexes logical plan to HoodieShowIndexes)Index.scala: rename the case class and its companion.HoodieSpark33CatalystPlanUtils.scala,HoodieSpark34CatalystPlanUtils.scala,HoodieSpark35CatalystPlanUtils.scala,HoodieSpark40CatalystPlanUtils.scala.HoodieSpark3_3ExtendedSqlAstBuilder.scala,HoodieSpark3_4ExtendedSqlAstBuilder.scala,HoodieSpark3_5ExtendedSqlAstBuilder.scala,HoodieSpark4_0ExtendedSqlAstBuilder.scala.IndexCommands.scala.Impact
- No user-facing change: only how the `SHOW INDEXES` logical plan is named internally changes.
- Builds now pull `org.lance:lance-core:4.0.0` and `org.lance:lance-spark-3.5_${scala.binary.version}:0.4.0` instead of the old `com.lancedb` coordinates. Any external consumers that shade/relocate Lance classes should update their relocations accordingly.
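For downstream builds that relocate Lance classes with maven-shade-plugin, the relocation pattern needs the new groupId/package. A minimal, illustrative fragment (the `shadedPattern` value is an example; use whatever prefix your build already relocates into):

```xml
<!-- Illustrative maven-shade-plugin relocation for the new org.lance packages. -->
<relocations>
  <relocation>
    <pattern>org.lance</pattern>
    <shadedPattern>hidden.org.lance</shadedPattern>
  </relocation>
</relocations>
```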
### Risk Level

Low. The bump is localized to the Lance reader/writer path, the ordering fix is narrowly scoped to `LanceRecordIterator.hasNext()`, and the rename is a name-only change that mechanically updates every call-site. CI (Azure + GitHub Actions) covers COW and MoR paths in `TestLanceDataSource`, which was the failure mode caught and fixed by commit 2.

### Documentation Update
none.
### Contributor's checklist
- Tests added (`TestLanceDataSource` MoR cases exercise the fixed path)