[MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0#18498

Open
rahil-c wants to merge 4 commits into apache:master from rahil-c:bump/lance-4.0.0

Conversation

@rahil-c
Collaborator

@rahil-c rahil-c commented Apr 13, 2026

Bumps lance-core from 1.0.2 to 4.0.0 and the lance-spark connector from 0.0.15 to 0.4.0, adapts Hudi to the lance-spark repackaging (com.lancedb → org.lance) and API signature changes, fixes a column-ordering regression in LanceRecordIterator introduced by the new connector's batch layout, and renames Hudi's internal ShowIndexes logical plan to HoodieShowIndexes so it no longer clashes with Spark 4.0's built-in ShowIndexes.

Describe the issue this Pull Request addresses

Hudi's Lance integration targets an older lance-spark / lance-core release. With lance-spark 0.4.0 the Maven coordinates (com.lancedborg.lance), package paths, and a few public signatures changed, and the in-memory Arrow batch layout returned by VectorSchemaRoot.getFieldVectors() now reflects the file's on-disk column order rather than the order requested in LanceFileReader.readAll(columnNames, ...). That last change silently broke MoR reads through LanceRecordIterator, because the UnsafeProjection is built from the Spark schema's column order but the ColumnVector[] we were handing it was in the file's on-disk order — so UnsafeProjection would dispatch e.g. getInt(0) against a VarCharVector and throw UnsupportedOperationException. Additionally, Spark 4.0 ships a ShowIndexes logical plan of its own, so Hudi's identically-named case class needed to be renamed to avoid conflicts once the codebase starts cross-compiling against Spark 4.

Summary and Changelog

1. Dependency bump ([MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0)

  • pom.xml: lance.version 1.0.2 → 4.0.0, lance.spark.connector.version 0.0.15 → 0.4.0, groupId com.lancedb → org.lance for both lance-core and the lance-spark-3.5_${scala.binary.version} artifact.
  • HoodieSparkLanceWriter.java: update import com.lancedb.lance.spark.arrow.LanceArrowWriter → org.lance.spark.arrow.LanceArrowWriter, and adapt the LanceArrowUtils.toArrowSchema(...) call site — the 0.4.0 signature drops the errorOnDuplicatedFieldNames parameter.
  • LanceRecordIterator.java: update import com.lancedb.lance.spark.vectorized.LanceArrowColumnVector → org.lance.spark.vectorized.LanceArrowColumnVector.

2. Fix column-order regression (fix(lance): look up Arrow vectors by field name in LanceRecordIterator)

  • In hasNext(), the first batch now builds a Map<String, FieldVector> from VectorSchemaRoot.getFieldVectors() and walks the Spark StructField[] in order, looking each vector up by name, instead of trusting positional order.
  • Adds a size/name sanity check that throws HoodieException with the available column set if Lance returns a mismatched batch.
  • This restores the ordering contract the previously-built UnsafeProjection expects and unblocks MoR reads (the symptom was UnsupportedOperationException from ArrowVectorAccessor.getInt during UnsafeProjection.apply in TestLanceDataSource.testBasicUpsertModifyExistingRow / testBasicDeleteOperation on the MoR parameters).
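The name-based remapping summarized above can be modeled with a small, self-contained sketch. This is illustrative only — plain strings stand in for Arrow `FieldVector`s, and `remapByName` is a hypothetical helper, not code from the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnRemapSketch {
  // Reorder columns delivered in the file's on-disk order so they match
  // the requested (Spark schema) order, looking each one up by name.
  static String[] remapByName(String[] fileOrder, String[] requestedOrder) {
    if (fileOrder.length != requestedOrder.length) {
      throw new IllegalStateException("column count mismatch");
    }
    Map<String, String> byName = new HashMap<>();
    for (String col : fileOrder) {
      byName.put(col, col); // in the real iterator the value is a FieldVector
    }
    String[] out = new String[requestedOrder.length];
    for (int i = 0; i < requestedOrder.length; i++) {
      String col = byName.get(requestedOrder[i]);
      if (col == null) {
        throw new IllegalStateException("missing column '" + requestedOrder[i]
            + "'; available: " + byName.keySet());
      }
      out[i] = col;
    }
    return out;
  }

  public static void main(String[] args) {
    // File stores (name, id) on disk; the projection asked for (id, name).
    String[] fixed = remapByName(new String[]{"name", "id"}, new String[]{"id", "name"});
    System.out.println(String.join(",", fixed));
  }
}
```

Positional wrapping would have handed the `name` column to the slot the projection expects `id` in — the same shape of mismatch that made UnsafeProjection call getInt on a VarCharVector.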

3. Rename ShowIndexes → HoodieShowIndexes ([MINOR] Rename Hudi's ShowIndexes logical plan to HoodieShowIndexes)

  • Index.scala: rename the case class and its companion.
  • Pattern-match call-sites updated in HoodieSpark33CatalystPlanUtils.scala, HoodieSpark34CatalystPlanUtils.scala, HoodieSpark35CatalystPlanUtils.scala, HoodieSpark40CatalystPlanUtils.scala.
  • Constructor call-sites updated in HoodieSpark3_3ExtendedSqlAstBuilder.scala, HoodieSpark3_4ExtendedSqlAstBuilder.scala, HoodieSpark3_5ExtendedSqlAstBuilder.scala, HoodieSpark4_0ExtendedSqlAstBuilder.scala.
  • Doc reference updated in IndexCommands.scala.

Impact

  • User-facing: none. Public Hudi APIs, on-disk formats, and config are unchanged. All changes are internal to how Hudi invokes lance-spark and how Hudi's own SHOW INDEXES logical plan is named internally.
  • Dependencies: downstream artifacts will now resolve org.lance:lance-core:4.0.0 and org.lance:lance-spark-3.5_${scala.binary.version}:0.4.0 instead of the old com.lancedb coordinates. Any external consumers that shade/relocate Lance classes should update their relocations accordingly.
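For external consumers that shade/relocate Lance classes, the relocation update might look like the following maven-shade-plugin fragment. This is a hedged sketch — the `shadedPattern` values are illustrative, not prescribed by this PR:

```xml
<!-- Before: relocating the old coordinates -->
<relocation>
  <pattern>com.lancedb</pattern>
  <shadedPattern>shaded.com.lancedb</shadedPattern>
</relocation>
<!-- After: the classes now live under org.lance -->
<relocation>
  <pattern>org.lance</pattern>
  <shadedPattern>shaded.org.lance</shadedPattern>
</relocation>
```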

Risk Level

low — the bump is localized to the Lance reader/writer path, the ordering fix is narrowly scoped to LanceRecordIterator.hasNext(), and the rename is a name-only change that mechanically updates every call-site. CI (Azure + GitHub Actions) covers COW and MoR paths in TestLanceDataSource, which was the failure mode caught and fixed by commit 2.

Documentation Update

none.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable (existing TestLanceDataSource MoR cases exercise the fixed path)

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 13, 2026
Bumps lance-core from 1.0.2 to 4.0.0 and lance-spark connector
from 0.0.15 to 0.4.0. Updates affected import paths and adapts to
the LanceArrowUtils.toArrowSchema signature change (drops the
errorOnDuplicatedFieldNames parameter).
Lance-spark 0.4.0 (bumped in 7e4967c) ships its own
`org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside
`lance-spark-base_*.jar`. This collides with Hudi's own same-FQCN
case class (added in hudi-spark-common). Both jars end up on the
classpath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two
classes have different case-class arity (Lance's is 1-arg, Hudi's
is 2-arg), Scala pattern matches like `case ShowIndexes(table, output)`
fail to compile.

Rename Hudi's class to `HoodieShowIndexes` (and its companion
object) to sidestep the collision. This is an internal logical-plan
class consumed only by Hudi's own parser / CatalystPlanUtils /
analyzer — no public SQL or API surface changes.

Call-sites updated:
- Index.scala (definition + companion)
- HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match)
- HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct)
- IndexCommands.scala (doc reference)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rahil-c rahil-c requested review from voonhous and yihua April 15, 2026 20:45
With lance-spark 0.4.0, VectorSchemaRoot.getFieldVectors() returns
vectors in the file's on-disk order rather than in the order of the
projection requested via LanceFileReader.readAll(). Wrapping vectors
positionally therefore mismatches the UnsafeProjection built from the
requested schema, causing UnsafeProjection to call type accessors on
the wrong column (e.g. getInt on a VarCharVector) and fail with
UnsupportedOperationException for MoR reads where the FileGroupRecordBuffer rearranges columns relative to the file's write order.

Fix by looking up each vector by field name from the requested schema
so the ColumnVector[] order matches what UnsafeProjection expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rahil-c
Collaborator Author

rahil-c commented Apr 16, 2026

Revisit renaming of show indexes; consider other options.

Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — the column-ordering fix in LanceRecordIterator is well-implemented with proper name-based lookup and clear error messages, and the ShowIndexes → HoodieShowIndexes rename is applied consistently across all Spark version modules.

@voonhous
Member

retriggering CI

Comment on lines +68 to +69
```scala
case class HoodieShowIndexes(table: LogicalPlan,
                             override val output: Seq[Attribute] = HoodieShowIndexes.getOutputAttrs) extends Command {
```
Contributor


Does this affect the SQL syntax support for Hudi? If not and the "show indexes" is tested to be working, I think this is fine.

Collaborator Author


Let me do a check on this.

Collaborator Author


@yihua So technically the CI is green for this PR, and we know that we already have tests in hudi code base around SHOW INDEXES. For example the TestIndexSyntax.scala, TestSecondaryIndex and TestExpressionIndex run show indexes sql in their tests.

Therefore I think there is no breaking change with this.

```java
                             long maxFileSize) {
  super(file, DEFAULT_BATCH_SIZE, bloomFilterOpt.map(HoodieBloomFilterRowWriteSupport::new));
  this.sparkSchema = sparkSchema;
  this.arrowSchema = LanceArrowUtils.toArrowSchema(sparkSchema, DEFAULT_TIMEZONE, true, false);
```
Contributor


What is the removed fourth argument?

Collaborator Author


Screenshot 2026-04-20 at 9 05 11 AM — I think in the newer Lance version this fourth argument (`largeVarTypes`) is likely no longer present.

Comment on lines +121 to +141
```java
List<FieldVector> fieldVectors = root.getFieldVectors();
Map<String, FieldVector> byName = new HashMap<>(fieldVectors.size() * 2);
for (FieldVector fv : fieldVectors) {
  byName.put(fv.getName(), fv);
}
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length
      + " for file: " + path);
}
columnVectors = new ColumnVector[sparkFields.length];
for (int i = 0; i < sparkFields.length; i++) {
  String name = sparkFields[i].name();
  FieldVector fv = byName.get(name);
  if (fv == null) {
    throw new HoodieException("Lance batch missing expected column '" + name
        + "' for file: " + path + "; available columns: " + byName.keySet());
  }
  columnVectors[i] = new LanceArrowColumnVector(fv);
}
```
Contributor


Non-blocker: is the schema per batch/record or per file? Could this schema processing be extracted out per file to reduce overhead?

Contributor


Let's create a follow-up to track this.

Contributor

@yihua yihua left a comment


Overall LGTM

@rahil-c
Collaborator Author

rahil-c commented Apr 20, 2026

@yihua @voonhous let's merge this only after we have landed some of the larger changes on the unstructured track.

Shorten the prose blocks above the column-order remapping in
LanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each,
keeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow
from lance-spark-base) without the full incident narrative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.85%. Comparing base (613fc49) to head (6288d8b).
⚠️ Report is 33 commits behind head on master.

Files with missing lines Patch % Lines
...rg/apache/hudi/io/storage/LanceRecordIterator.java 70.58% 3 Missing and 2 partials ⚠️
...pache/spark/sql/catalyst/plans/logical/Index.scala 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18498      +/-   ##
============================================
+ Coverage     68.84%   68.85%   +0.01%     
- Complexity    28195    28455     +260     
============================================
  Files          2459     2475      +16     
  Lines        135152   136499    +1347     
  Branches      16379    16594     +215     
============================================
+ Hits          93039    93985     +946     
- Misses        34746    34958     +212     
- Partials       7367     7556     +189     
Flag Coverage Δ
common-and-other-modules 44.47% <0.00%> (-0.09%) ⬇️
hadoop-mr-java-client 44.80% <ø> (-0.04%) ⬇️
spark-client-hadoop-common 48.40% <0.00%> (-0.07%) ⬇️
spark-java-tests 49.37% <58.62%> (+0.46%) ⬆️
spark-scala-tests 45.31% <34.48%> (-0.21%) ⬇️
utilities 38.02% <4.34%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...apache/hudi/io/storage/HoodieSparkLanceWriter.java 96.15% <100.00%> (ø)
.../apache/spark/sql/hudi/command/IndexCommands.scala 86.90% <ø> (ø)
...che/spark/sql/HoodieSpark33CatalystPlanUtils.scala 74.46% <100.00%> (ø)
...l/parser/HoodieSpark3_3ExtendedSqlAstBuilder.scala 20.41% <100.00%> (+1.29%) ⬆️
...che/spark/sql/HoodieSpark34CatalystPlanUtils.scala 78.94% <100.00%> (ø)
...l/parser/HoodieSpark3_4ExtendedSqlAstBuilder.scala 20.07% <100.00%> (+1.23%) ⬆️
...che/spark/sql/HoodieSpark35CatalystPlanUtils.scala 73.58% <100.00%> (ø)
...l/parser/HoodieSpark3_5ExtendedSqlAstBuilder.scala 20.69% <100.00%> (+1.22%) ⬆️
...che/spark/sql/HoodieSpark40CatalystPlanUtils.scala 73.58% <100.00%> (ø)
...l/parser/HoodieSpark4_0ExtendedSqlAstBuilder.scala 20.24% <100.00%> (+1.16%) ⬆️
... and 2 more

... and 78 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants