[MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0 #18498
rahil-c wants to merge 4 commits into apache:master
Conversation
Bumps lance-core from 1.0.2 to 4.0.0 and the lance-spark connector from 0.0.15 to 0.4.0. Updates affected import paths and adapts to the `LanceArrowUtils.toArrowSchema` signature change (drops the `errorOnDuplicatedFieldNames` parameter).
Lance-spark 0.4.0 (bumped in 7e4967c) ships its own `org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside `lance-spark-base_*.jar`. This collides with Hudi's same-FQCN case class (added in hudi-spark-common). Both jars end up on the classpath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two case classes have different arity (Lance's takes 1 argument, Hudi's takes 2), Scala pattern matches like `case ShowIndexes(table, output)` fail to compile.

Rename Hudi's class to `HoodieShowIndexes` (and its companion object) to sidestep the collision. This is an internal logical-plan class consumed only by Hudi's own parser / CatalystPlanUtils / analyzer, so there are no public SQL or API surface changes.

Call-sites updated:
- Index.scala (definition + companion)
- HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match)
- HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct)
- IndexCommands.scala (doc reference)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With lance-spark 0.4.0, `VectorSchemaRoot.getFieldVectors()` returns vectors in the file's on-disk order rather than in the order of the projection requested via `LanceFileReader.readAll()`. Wrapping vectors positionally therefore mismatches the `UnsafeProjection` built from the requested schema, causing `UnsafeProjection` to call type accessors on the wrong column (e.g. `getInt` on a `VarCharVector`) and fail with `UnsupportedOperationException` for MoR reads, where the `FileGroupRecordBuffer` rearranges columns relative to the file's write order.

Fix by looking up each vector by field name from the requested schema so the `ColumnVector[]` order matches what `UnsafeProjection` expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
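The remapping idea can be sketched in plain Java with no Arrow dependency. Everything here is illustrative: `wrapByName`, the column names, and the `int[]` arrays standing in for Arrow `FieldVector`s are not the actual Hudi code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnOrderSketch {
  // Stand-in for VectorSchemaRoot.getFieldVectors(): columns keyed by name, in on-disk order.
  static Map<String, int[]> onDiskBatch() {
    Map<String, int[]> batch = new LinkedHashMap<>();
    batch.put("price", new int[]{10, 20}); // the file happened to write "price" first
    batch.put("id", new int[]{1, 2});
    return batch;
  }

  // The fix: walk the requested schema and look each column up by name,
  // instead of trusting the positional order of the batch.
  static int[][] wrapByName(List<String> requestedSchema, Map<String, int[]> batch) {
    int[][] columns = new int[requestedSchema.size()][];
    for (int i = 0; i < requestedSchema.size(); i++) {
      int[] col = batch.get(requestedSchema.get(i));
      if (col == null) {
        throw new IllegalStateException("missing column: " + requestedSchema.get(i));
      }
      columns[i] = col;
    }
    return columns;
  }

  public static void main(String[] args) {
    List<String> requested = Arrays.asList("id", "price"); // projection order
    int[][] cols = wrapByName(requested, onDiskBatch());
    // Column 0 is now "id" as the projection expects, regardless of on-disk order.
    System.out.println(Arrays.toString(cols[0]) + " " + Arrays.toString(cols[1]));
    // prints: [1, 2] [10, 20]
  }
}
```

Positional wrapping would instead hand the projection "price" data at index 0, which is exactly the accessor-type mismatch the commit describes.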
revisit renaming of show indexes and other options
yihua left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — the column-ordering fix in LanceRecordIterator is well-implemented with proper name-based lookup and clear error messages, and the ShowIndexes → HoodieShowIndexes rename is applied consistently across all Spark version modules.
retriggering CI
```scala
case class HoodieShowIndexes(table: LogicalPlan,
    override val output: Seq[Attribute] = HoodieShowIndexes.getOutputAttrs) extends Command {
```
Does this affect the SQL syntax support for Hudi? If not and the "show indexes" is tested to be working, I think this is fine.
Let me do a check on this.
@yihua Technically the CI is green for this PR, and we already have tests in the Hudi code base around SHOW INDEXES. For example, TestIndexSyntax.scala, TestSecondaryIndex, and TestExpressionIndex all run `show indexes` SQL in their tests.
Therefore I think there is no breaking change here.
```java
    long maxFileSize) {
  super(file, DEFAULT_BATCH_SIZE, bloomFilterOpt.map(HoodieBloomFilterRowWriteSupport::new));
  this.sparkSchema = sparkSchema;
  this.arrowSchema = LanceArrowUtils.toArrowSchema(sparkSchema, DEFAULT_TIMEZONE, true, false);
```
What is the removed fourth argument?
```java
List<FieldVector> fieldVectors = root.getFieldVectors();
Map<String, FieldVector> byName = new HashMap<>(fieldVectors.size() * 2);
for (FieldVector fv : fieldVectors) {
  byName.put(fv.getName(), fv);
}
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length
      + " for file: " + path);
}
columnVectors = new ColumnVector[sparkFields.length];
for (int i = 0; i < sparkFields.length; i++) {
  String name = sparkFields[i].name();
  FieldVector fv = byName.get(name);
  if (fv == null) {
    throw new HoodieException("Lance batch missing expected column '" + name
        + "' for file: " + path + "; available columns: " + byName.keySet());
  }
  columnVectors[i] = new LanceArrowColumnVector(fv);
}
```
Non-blocker: is the schema per batch/record or per file? Could this schema processing be extracted out per file to reduce overhead?
Let's create a follow-up to track this.
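As one possible shape for that follow-up, the name lookup could be hoisted to a once-per-file computation of source positions, leaving only array indexing per batch. This is an illustrative stdlib-only sketch under the assumption that every batch from one Lance file shares the same column order; `computeReorder` and `applyReorder` are hypothetical names, not existing Hudi methods.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ReorderOnce {
  // Computed once per file: reorder[i] = position in the batch of requested column i.
  static int[] computeReorder(String[] requestedNames, String[] batchNames) {
    Map<String, Integer> pos = new HashMap<>();
    for (int i = 0; i < batchNames.length; i++) {
      pos.put(batchNames[i], i);
    }
    int[] reorder = new int[requestedNames.length];
    for (int i = 0; i < requestedNames.length; i++) {
      Integer p = pos.get(requestedNames[i]);
      if (p == null) {
        throw new IllegalStateException("missing column: " + requestedNames[i]);
      }
      reorder[i] = p;
    }
    return reorder;
  }

  // Applied per batch: plain array indexing, no per-batch map construction.
  static <T> T[] applyReorder(T[] batchVectors, int[] reorder, T[] out) {
    for (int i = 0; i < reorder.length; i++) {
      out[i] = batchVectors[reorder[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] r = computeReorder(new String[]{"id", "price"}, new String[]{"price", "id"});
    String[] wired = applyReorder(new String[]{"priceVec", "idVec"}, r, new String[2]);
    System.out.println(Arrays.toString(wired)); // prints: [idVec, priceVec]
  }
}
```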
Shorten the prose blocks above the column-order remapping in LanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each, keeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow from lance-spark-base) without the full incident narrative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18498      +/-   ##
============================================
+ Coverage     68.84%   68.85%   +0.01%
- Complexity    28195    28455     +260
============================================
  Files          2459     2475      +16
  Lines        135152   136499    +1347
  Branches      16379    16594     +215
============================================
+ Hits          93039    93985     +946
- Misses        34746    34958     +212
- Partials      7367     7556      +189
```

Flags with carried forward coverage won't be shown.
Bumps lance-core from 1.0.2 to 4.0.0 and the lance-spark connector from 0.0.15 to 0.4.0, adapts Hudi to the lance-spark repackaging (`com.lancedb` → `org.lance`) and API signature changes, fixes a column-ordering regression in `LanceRecordIterator` introduced by the new connector's batch layout, and renames Hudi's internal `ShowIndexes` logical plan to `HoodieShowIndexes` so it no longer clashes with the same-named class shipped in lance-spark 0.4.0.

### Describe the issue this Pull Request addresses
Hudi's Lance integration targets an older lance-spark / lance-core release. With lance-spark 0.4.0 the Maven coordinates (`com.lancedb` → `org.lance`), package paths, and a few public signatures changed, and the in-memory Arrow batch layout returned by `VectorSchemaRoot.getFieldVectors()` now reflects the file's on-disk column order rather than the order requested in `LanceFileReader.readAll(columnNames, ...)`. That last change silently broke MoR reads through `LanceRecordIterator`: the `UnsafeProjection` is built from the Spark schema's column order, but the `ColumnVector[]` we were handing it was in the file's on-disk order, so `UnsafeProjection` would dispatch e.g. `getInt(0)` against a `VarCharVector` and throw `UnsupportedOperationException`. Additionally, lance-spark 0.4.0 ships a `ShowIndexes` logical plan of its own, so Hudi's identically-named case class needed to be renamed to avoid the collision.

### Summary and Changelog
1. Dependency bump (`[MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0`)
   - `pom.xml`: `lance.version` 1.0.2 → 4.0.0, `lance.spark.connector.version` 0.0.15 → 0.4.0; groupId `com.lancedb` → `org.lance` for both `lance-core` and the `lance-spark-3.5_${scala.binary.version}` artifact.
   - `HoodieSparkLanceWriter.java`: update import `com.lancedb.lance.spark.arrow.LanceArrowWriter` → `org.lance.spark.arrow.LanceArrowWriter`, and adapt the `LanceArrowUtils.toArrowSchema(...)` call site; the 0.4.0 signature drops the `errorOnDuplicatedFieldNames` parameter.
   - `LanceRecordIterator.java`: update import `com.lancedb.lance.spark.vectorized.LanceArrowColumnVector` → `org.lance.spark.vectorized.LanceArrowColumnVector`.
fix(lance): look up Arrow vectors by field name in LanceRecordIterator)hasNext(), the first batch now builds aMap<String, FieldVector>fromVectorSchemaRoot.getFieldVectors()and walks the SparkStructField[]in order, looking each vector up by name, instead of trusting positional order.HoodieExceptionwith the available column set if Lance returns a mismatched batch.UnsafeProjectionexpects and unblocks MoR reads (the symptom wasUnsupportedOperationExceptionfromArrowVectorAccessor.getIntduringUnsafeProjection.applyinTestLanceDataSource.testBasicUpsertModifyExistingRow/testBasicDeleteOperationon the MoR parameters).3. Rename
ShowIndexes→HoodieShowIndexes([MINOR] Rename Hudi's ShowIndexes logical plan to HoodieShowIndexes)Index.scala: rename the case class and its companion.HoodieSpark33CatalystPlanUtils.scala,HoodieSpark34CatalystPlanUtils.scala,HoodieSpark35CatalystPlanUtils.scala,HoodieSpark40CatalystPlanUtils.scala.HoodieSpark3_3ExtendedSqlAstBuilder.scala,HoodieSpark3_4ExtendedSqlAstBuilder.scala,HoodieSpark3_5ExtendedSqlAstBuilder.scala,HoodieSpark4_0ExtendedSqlAstBuilder.scala.IndexCommands.scala.Impact
- No user-facing change: only how the `SHOW INDEXES` logical plan is named internally changes.
- Builds now pull `org.lance:lance-core:4.0.0` and `org.lance:lance-spark-3.5_${scala.binary.version}:0.4.0` instead of the old `com.lancedb` coordinates. Any external consumers that shade/relocate Lance classes should update their relocations accordingly.
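For downstream builds that relocate Lance classes with maven-shade-plugin, the relocation pattern needs the new groupId/package. A minimal, illustrative fragment (the `shadedPattern` value is an example; use whatever prefix your build already relocates into):

```xml
<!-- Illustrative maven-shade-plugin relocation for the new org.lance packages. -->
<relocations>
  <relocation>
    <pattern>org.lance</pattern>
    <shadedPattern>hidden.org.lance</shadedPattern>
  </relocation>
</relocations>
```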
### Risk Level

Low. The bump is localized to the Lance reader/writer path, the ordering fix is narrowly scoped to `LanceRecordIterator.hasNext()`, and the rename is a name-only change that mechanically updates every call-site. CI (Azure + GitHub Actions) covers COW and MoR paths in `TestLanceDataSource`, which was the failure mode caught and fixed by commit 2.

### Documentation Update
none.
### Contributor's checklist
- Tests added (`TestLanceDataSource` MoR cases exercise the fixed path)