[SYSTEMDS-??] Modality Alignment, Contrastive Learning, PDF and Transcript Loader #2459
Open
b-enedict wants to merge 3 commits into apache:main from
Conversation
This patch adds new loaders: one loads PDF files by converting each page of the document into NumPy arrays processable by OpenCV, and another loads an audio file and converts it into a transcript using faster-whisper.
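The two loaders might look roughly like the following sketch. This is not the PR's actual implementation: the function names (`samples_to_array`, `load_pdf_pages`, `transcribe`) are hypothetical, and it assumes PyMuPDF (`fitz`) as the PDF renderer, which the PR does not confirm; only faster-whisper is named in the description.

```python
import numpy as np

def samples_to_array(samples: bytes, height: int, width: int,
                     channels: int) -> np.ndarray:
    # Raw pixel bytes (e.g., PyMuPDF's Pixmap.samples) reshaped into an
    # OpenCV-compatible H x W x C uint8 array.
    return np.frombuffer(samples, dtype=np.uint8).reshape(height, width, channels)

def load_pdf_pages(path: str) -> list:
    # Render every page of the document into a numpy array.
    # Assumes PyMuPDF; the PR may use a different renderer.
    import fitz  # third-party dependency (PyMuPDF)
    doc = fitz.open(path)
    return [samples_to_array(pix.samples, pix.height, pix.width, pix.n)
            for pix in (page.get_pixmap() for page in doc)]

def transcribe(path: str, model_size: str = "base") -> str:
    # Convert an audio file into a transcript with faster-whisper.
    from faster_whisper import WhisperModel  # third-party dependency
    segments, _info = WhisperModel(model_size).transcribe(path)
    return " ".join(seg.text.strip() for seg in segments)
```

The array layout matters because downstream OpenCV operators expect `uint8` arrays of shape `(height, width, channels)`.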
This patch introduces a modality alignment operator to match previously unaligned data based on feature similarity. The operator computes similarities (e.g., ORB descriptors or perceptual hashing) between a primary and a secondary modality and determines an optimal matching. The implementation includes an abstract alignment interface and concrete methods for ORB-based and p-hash-based image alignment. Instead of producing reordered modalities, the operator outputs a matching that is applied after representation learning and before fusion. This ensures consistent ordering and equal-length modalities for downstream processing.
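A minimal sketch of the hash-based variant follows. It is illustrative only: it uses a simple average hash in place of the PR's perceptual hash or ORB descriptors, and a greedy nearest-neighbor assignment instead of whatever optimal matching the operator computes; all function names are hypothetical.

```python
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> np.ndarray:
    # Downscale by block-averaging to hash_size x hash_size, then threshold
    # each cell against the mean to produce a compact binary fingerprint.
    if img.ndim == 3:
        img = img.mean(axis=2)  # collapse color channels to grayscale
    h, w = img.shape
    ys = np.arange(hash_size + 1) * h // hash_size
    xs = np.arange(hash_size + 1) * w // hash_size
    small = np.array([[img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(hash_size)] for i in range(hash_size)])
    return (small > small.mean()).flatten()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    # Number of differing bits between two fingerprints.
    return int(np.count_nonzero(a != b))

def align(primary: list, secondary: list) -> list:
    # Greedy matching: each primary item takes the closest unused secondary
    # item. The output is a matching (index pairs), not reordered modalities,
    # so it can be applied after representation learning and before fusion.
    p_hashes = [average_hash(x) for x in primary]
    s_hashes = [average_hash(x) for x in secondary]
    used, matching = set(), []
    for i, ph in enumerate(p_hashes):
        j = min((j for j in range(len(s_hashes)) if j not in used),
                key=lambda j: hamming(ph, s_hashes[j]))
        used.add(j)
        matching.append((i, j))
    return matching
```

Returning index pairs rather than reordered data is the key design point from the description: both modalities keep their storage order, and the matching is consumed later in the pipeline.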
This patch introduces a new operator to Scuro for building contrastive learning pipelines with greater flexibility in handling input modalities. Previously, contrastive pairs had to be structurally aligned in a preprocessing step before being used in Scuro, which limited the ability to work with independently transformed or dynamically generated modalities.

The new operator constructs contrastive pairs via a Cartesian product of modalities and optionally extends them with additional modalities that are already aligned. Each resulting combination is evaluated by a user-defined function to determine whether it represents a positive or negative sample. Based on this evaluation, the operator outputs both the assigned label and the corresponding modality pair.

This design enables dynamic label generation and supports scenarios where modalities are windowed, reshuffled, or transformed differently. It also allows flexible fusion of modalities after contrastive pairing, improving the expressiveness of contrastive learning workflows.

Limitations: the Cartesian product can introduce significant computational overhead for large modality sets, which may require further optimization.
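The pairing logic described above can be sketched as follows. The function name, signature, and the `aligned_extras` parameter are assumptions for illustration, not the operator's actual API.

```python
from itertools import product

def contrastive_pairs(primary: list, secondary: list, label_fn,
                      aligned_extras: list = None) -> list:
    # Cartesian product of the two modalities; a user-defined function
    # assigns each combination a positive/negative label dynamically.
    out = []
    for (i, a), b in product(enumerate(primary), secondary):
        sample = (a, b)
        if aligned_extras is not None:
            # extras are assumed to be already aligned with the primary
            # modality, so they are indexed by the primary position
            sample += (aligned_extras[i],)
        out.append((label_fn(a, b), sample))
    return out
```

Because every combination is materialized, the output grows as `len(primary) * len(secondary)`, which is the quadratic overhead noted in the limitations.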
Summary
This PR introduces new functionality for multimodal learning in Scuro, including a contrastive learning operator, a modality alignment operator, and additional data loaders.
Changes
Contrastive Learning Operator
Modality Alignment Operator
Data Loaders