Add ArrayRecordDataSource based on Grain. #327

Open
copybara-service[bot] wants to merge 1 commit into main from test_474507172
Add ArrayRecordDataSource based on Grain.

## Motivation
Using Grain and ArrayRecord files allows us to perform global shuffling at the very beginning of the pipeline without a shuffle buffer. This should substantially reduce checkpoint size, since no shuffle buffer contents need to be saved with the pipeline state.
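
To illustrate why random access removes the need for a shuffle buffer, here is a minimal sketch using only the Python standard library. It is not Grain's implementation: `globally_shuffled` and the in-memory `records` list are hypothetical stand-ins for an indexable record file. The point is that the iteration state reduces to a seed and a position.

```python
import random
from typing import Iterator, Sequence


def globally_shuffled(
    records: Sequence[bytes], seed: int, start: int = 0
) -> Iterator[bytes]:
    """Yields records in a seeded global order, starting at position `start`.

    Hypothetical helper: assumes `records` supports len() and random access.
    """
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)  # same seed gives the same permutation
    for position in range(start, len(order)):
        yield records[order[position]]


# Stand-in for a random-access record file.
records = [f"record-{i}".encode() for i in range(10)]

full = list(globally_shuffled(records, seed=42))
# Resuming needs only (seed, position); no buffered records are stored.
resumed = list(globally_shuffled(records, seed=42, start=4))
```

In this sketch, `resumed` matches the tail of `full`, so a checkpoint only has to record the seed and how many records were consumed.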

## Summary of changes
- Add `supports_global_shuffle` property to `DataSource`.
- Implement `ArrayRecordDataSource`.
- Change `Task.get_dataset()` to disable the shuffle buffer if the data source supports global shuffling.
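
The changes above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in this PR: the `read()` method, `SequentialSource`, `RandomAccessSource`, `buffer_shuffle`, and the free function `get_dataset_sketch` are all hypothetical, and plain Python iterators stand in for the real dataset objects. Only the `supports_global_shuffle` property name comes from the description.

```python
import random
from typing import Iterator, List


class DataSource:
    """Hypothetical stand-in for the data source base class."""

    @property
    def supports_global_shuffle(self) -> bool:
        # Default: the source cannot shuffle globally on its own.
        return False

    def read(self, shuffle: bool, seed: int) -> Iterator[int]:
        raise NotImplementedError


class SequentialSource(DataSource):
    """Source that can only be read in order (hypothetical)."""

    def __init__(self, n: int):
        self._n = n

    def read(self, shuffle: bool, seed: int) -> Iterator[int]:
        return iter(range(self._n))


class RandomAccessSource(DataSource):
    """Source with random access, so it can shuffle globally (hypothetical)."""

    def __init__(self, n: int):
        self._n = n

    @property
    def supports_global_shuffle(self) -> bool:
        return True

    def read(self, shuffle: bool, seed: int) -> Iterator[int]:
        order = list(range(self._n))
        if shuffle:
            random.Random(seed).shuffle(order)
        return iter(order)


def buffer_shuffle(items: Iterator[int], buffer_size: int, seed: int) -> Iterator[int]:
    """Approximate shuffle using a bounded buffer."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer


def get_dataset_sketch(
    source: DataSource, shuffle: bool, seed: int, shuffle_buffer_size: int = 4
) -> List[int]:
    """Adds a shuffle buffer only when the source cannot shuffle globally."""
    items = source.read(shuffle=shuffle, seed=seed)
    if shuffle and not source.supports_global_shuffle:
        items = buffer_shuffle(items, shuffle_buffer_size, seed)
    return list(items)
```

With this shape, callers keep passing `shuffle=True` as before, and the decision to skip the buffer stays in one place.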

PiperOrigin-RevId: 474507172