[SYSTEMDS-??] Modality Alignment, Contrastive Learning, PDF and Transcript Loader#2459

Open
b-enedict wants to merge 3 commits into apache:main from b-enedict:feat/alignment-operator
Conversation

@b-enedict

Summary

This PR introduces new functionality for multimodal learning in Scuro, including a contrastive learning operator, a modality alignment operator, and additional data loaders.

Changes

Contrastive Learning Operator

  • Constructs modality pairs via a Cartesian product
  • Uses a user-defined function to label pairs as positive or negative
  • Enables dynamic generation of contrastive samples

Modality Alignment Operator

  • Aligns previously unaligned modalities using feature-based similarity (e.g., ORB, perceptual hashing)
  • Outputs a matching between a primary and secondary modality
  • Matching is applied after representation learning and before fusion

Data Loaders

  • PDF loader: converts document pages into NumPy arrays for OpenCV processing
  • Audio loader: converts audio to text using faster-whisper

This patch adds a new loader for PDF files, converting all pages of a document into NumPy arrays processable by OpenCV.
It also adds a loader that transcribes an audio file into text using faster-whisper.
This patch introduces a modality alignment operator to match previously
unaligned data based on feature similarity.

The operator computes similarities (e.g., ORB descriptors or perceptual
hashing) between a primary and a secondary modality and determines an
optimal matching. The implementation includes an abstract alignment
interface and concrete methods for ORB-based and p-hash-based image
alignment.

Instead of producing reordered modalities, the operator outputs a
matching that is applied after representation learning and before
fusion. This ensures consistent ordering and equal-length modalities for
downstream processing.
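To illustrate the matching step described above, here is a minimal, self-contained sketch (not the actual Scuro implementation): each modality item is reduced to a perceptual-hash-like bit string, pairwise similarity is the Hamming distance, and the optimal matching is found by brute force. The names `hamming` and `align` and the brute-force strategy are illustrative assumptions; a real implementation would use ORB descriptors or p-hashes and a proper assignment solver.

```python
from itertools import permutations

def hamming(a: str, b: str) -> int:
    # Hamming distance between two equal-length bit strings
    # (stand-in for a p-hash comparison)
    return sum(x != y for x, y in zip(a, b))

def align(primary: list[str], secondary: list[str]) -> list[int]:
    """Return a matching: result[i] is the index of the secondary item
    assigned to primary[i], minimizing the total Hamming distance.

    Brute force over permutations; fine for tiny sets, illustrative only.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(secondary)), len(primary)):
        cost = sum(hamming(primary[i], secondary[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)

# Secondary modality is a shuffled version of the primary one:
matching = align(["0110", "1001"], ["1001", "0110"])
# matching[i] points into the secondary modality, so applying it after
# representation learning yields equal-length, consistently ordered modalities.
```

Note that, as in the PR, the sketch outputs an index matching rather than reordered data, so it can be applied after representation learning and before fusion.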
…pairing

This patch introduces a new operator to Scuro for building contrastive
learning pipelines with greater flexibility in handling input modalities.

Previously, contrastive pairs had to be structurally aligned in a
preprocessing step before being used in Scuro. This limited the ability
to work with independently transformed or dynamically generated
modalities.

The new operator constructs contrastive pairs via a Cartesian product of
modalities and optionally extends them with additional modalities that
are already aligned. The resulting combinations are evaluated using a
user-defined function to determine whether a pair represents a positive
or negative sample. Based on this evaluation, the operator outputs both
the assigned label and the corresponding modality pair.

This design enables dynamic label generation and supports scenarios where
modalities are windowed, reshuffled, or transformed differently. It also
allows flexible fusion of modalities after contrastive pairing, improving
the expressiveness of contrastive learning workflows.

Limitations: The Cartesian product can introduce significant
computational overhead for large modality sets, which may require further
optimization.
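The pairing scheme above can be sketched in a few lines. This is a simplified illustration, not the operator's actual API: `contrastive_pairs` and its arguments are hypothetical names, and the optional extension with already-aligned modalities is omitted.

```python
from itertools import product

def contrastive_pairs(mod_a, mod_b, label_fn):
    """Build contrastive samples from the Cartesian product of two
    modalities. A user-defined label_fn marks each pair as positive (1)
    or negative (0); both the label and the pair are emitted."""
    return [(a, b, 1 if label_fn(a, b) else 0)
            for a, b in product(mod_a, mod_b)]

# Items carry a sample id plus data; pairs with matching ids are positive.
image_mod = [("x", "img0"), ("y", "img1")]
text_mod = [("x", "txt0"), ("y", "txt1")]
pairs = contrastive_pairs(image_mod, text_mod,
                          lambda a, b: a[0] == b[0])
# Cartesian product: 2 x 2 = 4 pairs, of which 2 are positive.
```

Because labels come from the user-defined function at pairing time, the modalities can be windowed, reshuffled, or transformed independently beforehand. The quadratic growth of the product is also visible here, matching the limitation noted above.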