[SYSTEMDS-??] Modality Alignment, Contrastive Learning, PDF and Transcript Loader#2459

Open
b-enedict wants to merge 3 commits into apache:main from b-enedict:feat/alignment-operator
Conversation

@b-enedict

Summary

This PR introduces new functionality for multimodal learning in Scuro, including a contrastive learning operator, a modality alignment operator, and additional data loaders.

Changes

Contrastive Learning Operator

  • Constructs modality pairs via a Cartesian product
  • Uses a user-defined function to label pairs as positive or negative
  • Enables dynamic generation of contrastive samples

Modality Alignment Operator

  • Aligns previously unaligned modalities using feature-based similarity (e.g., ORB, perceptual hashing)
  • Outputs a matching between a primary and secondary modality
  • Matching is applied after representation learning and before fusion

Data Loaders

  • PDF loader: converts document pages into NumPy arrays for OpenCV processing
  • Audio loader: converts audio to text using faster-whisper

This patch adds a new loader for PDF files, converting all pages of a document into NumPy arrays processable by OpenCV.
It also adds a loader that transcribes an audio file into text using faster-whisper.
This patch introduces a modality alignment operator to match previously
unaligned data based on feature similarity.

The operator computes similarities (e.g., ORB descriptors or perceptual
hashing) between a primary and a secondary modality and determines an
optimal matching. The implementation includes an abstract alignment
interface and concrete methods for ORB-based and p-hash-based image
alignment.

Instead of producing reordered modalities, the operator outputs a
matching that is applied after representation learning and before
fusion. This ensures consistent ordering and equal-length modalities for
downstream processing.
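To illustrate the matching step described above, here is a minimal, self-contained sketch (not the actual Scuro implementation): each modality item is reduced to a perceptual-hash-like bit string, pairwise similarity is the Hamming distance, and the optimal matching is found by brute force. The names `hamming` and `align` and the brute-force strategy are illustrative assumptions; a real implementation would use ORB descriptors or p-hashes and a proper assignment solver.

```python
from itertools import permutations

def hamming(a: str, b: str) -> int:
    # Hamming distance between two equal-length bit strings
    # (stand-in for a p-hash comparison)
    return sum(x != y for x, y in zip(a, b))

def align(primary: list[str], secondary: list[str]) -> list[int]:
    """Return a matching: result[i] is the index of the secondary item
    assigned to primary[i], minimizing the total Hamming distance.

    Brute force over permutations; fine for tiny sets, illustrative only.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(secondary)), len(primary)):
        cost = sum(hamming(primary[i], secondary[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)

# Secondary modality is a shuffled version of the primary one:
matching = align(["0110", "1001"], ["1001", "0110"])
# matching[i] points into the secondary modality, so applying it after
# representation learning yields equal-length, consistently ordered modalities.
```

Note that, as in the PR, the sketch outputs an index matching rather than reordered data, so it can be applied after representation learning and before fusion.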
…pairing

This patch introduces a new operator to Scuro for building contrastive
learning pipelines with greater flexibility in handling input modalities.

Previously, contrastive pairs had to be structurally aligned in a
preprocessing step before being used in Scuro. This limited the ability
to work with independently transformed or dynamically generated
modalities.

The new operator constructs contrastive pairs via a Cartesian product of
modalities and optionally extends them with additional modalities that
are already aligned. The resulting combinations are evaluated using a
user-defined function to determine whether a pair represents a positive
or negative sample. Based on this evaluation, the operator outputs both
the assigned label and the corresponding modality pair.

This design enables dynamic label generation and supports scenarios where
modalities are windowed, reshuffled, or transformed differently. It also
allows flexible fusion of modalities after contrastive pairing, improving
the expressiveness of contrastive learning workflows.

Limitations: The Cartesian product can introduce significant
computational overhead for large modality sets, which may require further
optimization.
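The pairing scheme above can be sketched in a few lines. This is a simplified illustration, not the operator's actual API: `contrastive_pairs` and its arguments are hypothetical names, and the optional extension with already-aligned modalities is omitted.

```python
from itertools import product

def contrastive_pairs(mod_a, mod_b, label_fn):
    """Build contrastive samples from the Cartesian product of two
    modalities. A user-defined label_fn marks each pair as positive (1)
    or negative (0); both the label and the pair are emitted."""
    return [(a, b, 1 if label_fn(a, b) else 0)
            for a, b in product(mod_a, mod_b)]

# Items carry a sample id plus data; pairs with matching ids are positive.
image_mod = [("x", "img0"), ("y", "img1")]
text_mod = [("x", "txt0"), ("y", "txt1")]
pairs = contrastive_pairs(image_mod, text_mod,
                          lambda a, b: a[0] == b[0])
# Cartesian product: 2 x 2 = 4 pairs, of which 2 are positive.
```

Because labels come from the user-defined function at pairing time, the modalities can be windowed, reshuffled, or transformed independently beforehand. The quadratic growth of the product is also visible here, matching the limitation noted above.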