Cost-effective and Scalable Oxford Nanopore Technologies Basecalling on AWS

This repository provides an example implementation for cost-effective and scalable Oxford Nanopore Technologies (ONT) basecalling with Dorado on AWS. The solution uses horizontal scaling of basecalling workloads to achieve superior price-performance compared to traditional vertical scaling while enabling flexible capacity adjustment to meet varying turnaround time requirements.

Vertical vs Horizontal Scaling

Vertical vs. horizontal scaling of GPU compute for ONT basecalling. A) Vertical scaling increases the power of a single machine by adding more, or more powerful, CPUs, GPUs, RAM, or storage to handle heavier workloads. B) Horizontal scaling distributes the computational load across multiple machines. This hypothetical example compares processing 400 POD5 files on a single p5.48xlarge instance versus distributing them in batches of 20 files across 20 g6e.xlarge instances.
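The batching arithmetic behind the hypothetical example can be sketched in a few lines of Python (file names are illustrative, not from the data set):

```python
def make_batches(files, batch_size):
    """Split a list of input files into fixed-size batches,
    one batch per GPU instance (illustration only)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

# 400 hypothetical POD5 files distributed in batches of 20,
# i.e. one batch per g6e.xlarge instance
pod5_files = [f"reads_{i:04d}.pod5" for i in range(400)]
batches = make_batches(pod5_files, 20)

print(len(batches))     # 20 batches -> 20 instances
print(len(batches[0]))  # 20 files per batch
```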

Architecture

The architecture combines Nextflow for input data batching and parallelization of basecalling tasks with AWS Batch for managed compute resource provisioning and job scheduling, enabling automatic distribution of basecalling jobs across multiple Amazon EC2 GPU instances. The solution optionally uses Amazon EC2 Spot Instances to further optimize compute costs. Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions.

Key Features:

  • Horizontal scaling: Parallelizes basecalling tasks across multiple smaller GPU instances for superior price-performance
  • Containerized tooling: Dorado basecaller with basecalling models packaged in Docker container for consistent, reproducible execution
  • Spot Instance support: Optional use of Amazon EC2 Spot Instances for up to 90% additional cost savings
  • Amazon EC2 Instance Storage: Use of instance storage for data staging achieves reduced storage costs and enhanced I/O performance
  • S3 Intelligent-Tiering: Amazon S3 Intelligent-Tiering storage class used by default for automatic storage cost optimization
  • Fault tolerance: Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions
  • Flexible throughput: Dynamic scaling enabling flexible capacity adjustment to meet varying turnaround time requirements
  • Monitoring: Amazon CloudWatch integration for job monitoring and logging

The solution leverages the following AWS services and tools:

  • Nextflow: Workflow management system for automating input data batching and task parallelization
  • AWS Batch: Managed batch compute service for dynamic resource provisioning and job scheduling
  • Amazon EC2: Flexible GPU-accelerated compute (G4, G5, G6, P4 and P5 instance types)
  • Amazon ECR: Container registry for storing Dorado basecaller Docker images
  • Amazon S3: Durable, highly available storage for input and output data
  • AWS CDK: Infrastructure-as-code framework for automated deployment

Architecture Diagram

ONT basecalling with Nextflow on AWS Batch. Nextflow automates POD5 input file batching and parallelization of basecalling tasks. AWS Batch distributes basecalling jobs across multiple GPU instances. By utilizing Amazon EC2 Instance Storage, the solution achieves both reduced storage costs and enhanced I/O performance. Docker container images are stored in Amazon ECR. Input and output data are stored in Amazon S3.

Pipeline Overview

ONT Basecalling Metro Map

ONT basecalling pipeline overview. The pipeline accepts POD5 or FAST5 files as input. FAST5 files are converted to POD5 format (orange line). POD5 files are split into batches based on the number of GPU instances configured, and each batch is processed in parallel using the Dorado basecaller. The output is one or more BAM files per basecalling task.

Pre-configured Amazon EC2 GPU Instance Types

AWS Batch compute environments will be created for the following instance types if available in the chosen AWS Region:

| Instance Type | GPU | vCPUs | Memory [GB] | GPU Memory [GB] |
|---|---|---|---|---|
| g4dn.xlarge | 1x NVIDIA T4 | 4 | 16 | 16 |
| g5.xlarge | 1x NVIDIA A10G | 4 | 16 | 24 |
| g6e.xlarge | 1x NVIDIA L40S | 4 | 32 | 48 |
| p4d.24xlarge | 8x NVIDIA A100 | 96 | 1152 | 320 |
| p5.48xlarge | 8x NVIDIA H100 | 192 | 2048 | 640 |

The CDK deployment supports Amazon EC2 Capacity Blocks with AWS Batch compute environments for Amazon EC2 P5 instances, enabling advance reservation of GPU capacity for defined durations at scheduled future dates.

Test Data

The repository includes a Nextflow workflow to ingest and pre-process test data from the publicly available nanopore sequencing data set for Genome in a Bottle (GIAB) sample HG001 (GM12878):

| Attribute | Value |
|---|---|
| Data Set Name | Oxford Nanopore Technologies - Genome in a Bottle Data Release 2025.01 |
| Sample ID | HG001 (GM12878) |
| Flowcell ID | PAW79146 |
| Data Set Source | Registry of Open Data on AWS |
| Data Set URI | s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/ |
| File Format | POD5 |
| File Count | 73 |
| Total Size | 1,046 GB |
| Read Count | ~6M |
| Gigabases | ~99.35 |
| Genome Coverage | ~31x |

POD5 files of flow cell PAW79146 are downloaded from the Registry of Open Data on AWS. The original 73 POD5 files, ranging in size from 731.7 MB to 31.6 GB, are resized into 743 uniformly sized 1.5 GB files using the pod5 tool. The workflow runtime is approximately one hour, and the compute cost is approximately $0.90 based on Amazon EC2 On-Demand pricing in the US East (Northern Virginia) Region. The cost for storing the test data set in the S3 Standard/S3 Intelligent-Tiering Frequent Access tier is $24.06 per month based on Amazon S3 pricing in the US East (Northern Virginia) Region. For instructions on loading the test data, see the Load Test Data section.
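As a sanity check, the quoted monthly storage cost is consistent with a rate of roughly $0.023 per GB-month (an assumed S3 Standard/Intelligent-Tiering Frequent Access rate for us-east-1; verify against current Amazon S3 pricing):

```python
total_size_gb = 1046          # test data set size from the table above
s3_rate_per_gb_month = 0.023  # assumed us-east-1 rate in $/GB-month;
                              # check current Amazon S3 pricing

monthly_cost = total_size_gb * s3_rate_per_gb_month
print(f"${monthly_cost:.2f} per month")  # $24.06 per month
```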

Project Structure

ont-basecalling/
├── cdk/                            # AWS CDK infrastructure code
│   ├── ont_basecalling_on_aws/     # CDK constructs
│   │   ├── batch.py                # AWS Batch configuration
│   │   ├── docker.py               # ECR repositories and image copying
│   │   ├── network.py              # VPC, subnets, security groups
│   │   ├── storage.py              # S3 bucket configuration
│   │   └── ont_basecalling_on_aws_stack.py
│   ├── assets/                     # Lambda functions and assets
│   └── app.py                      # CDK application entry point
├── docker/                         # Docker image definitions
│   ├── dorado/                     # Dorado basecaller containers
│   └── pod5/                       # POD5 tools containers
├── nextflow/                       # Nextflow workflow
│   └── ontbasecalling/
│       ├── main.nf                 # Main workflow
│       ├── nextflow.config         # Workflow configuration
│       ├── nextflow_schema.json    # Pipeline parameter specification
│       ├── modules/local/          # Pipeline-specific modules
│       │   ├── dorado_basecalling/ # Dorado basecalling process definition
│       │   ├── fast5_to_pod5/      # FAST5 to POD5 conversion process definition
│       │   ├── split_pod5/         # Split POD5 process definition
│       ├── workflows/              # Main pipeline workflows
│       │   ├── load_test_data.nf   # Test Data Processing Workflow
│       │   ├── ontbasecalling.nf   # Basecalling Workflow
│       └── conf/                   # Configuration profiles
│           ├── base.config         # Basic pipeline configurations
│           └── modules.config      # Module-specific configurations
└── images/                         # Documentation images

Prerequisites

Software Requirements

  • AWS CLI - configured with appropriate permissions
  • AWS CDK - for infrastructure deployment
  • Docker - for image building
  • Nextflow - for workflow execution
  • Python - for CDK development

AWS Credentials

Active AWS credentials are required before running any CDK commands (cdk synth, cdk deploy). The CDK app makes live AWS API calls at synth time to resolve the AWS account ID and to check GPU instance type availability in the target region. Without valid credentials, cdk synth will fail with botocore.exceptions.NoCredentialsError.

Verify your credentials are active before proceeding:

aws sts get-caller-identity

AWS Permissions

Your AWS credentials need permissions for:

  • Amazon EC2 (VPC, Security Groups, Launch Templates)
  • AWS Batch (Compute Environments, Job Queues)
  • Amazon ECR (Repository management, image push/pull)
  • Amazon S3 (Bucket operations, object access)
  • Amazon IAM (Role and policy management)
  • Amazon CloudWatch (Logging and monitoring)
  • AWS Lambda (Custom resource functions)

Quick Start

1. Set AWS Account ID & Region

export AWS_ACCOUNT_ID=<aws_account_id>
export AWS_REGION=<aws_region>  # default: us-east-1

2. Deploy Infrastructure & Build Docker containers

# clone the repository
git clone https://github.com/aws-samples/sample-ont-basecalling-on-aws.git

# change into the directory containing the CDK app
cd sample-ont-basecalling-on-aws/cdk

# create a virtualenv:
python3 -m venv .venv

# activate virtualenv.
source .venv/bin/activate

# update pip
pip install --upgrade pip

# install the required dependencies.
pip install -r requirements.txt

# Deploy the CDK stack with Dorado 1.3.0
cdk bootstrap  # First time only
cdk deploy

# Optional: Build specific Dorado version(s)
cdk deploy -c dorado_versions='["1.2.0"]'
cdk deploy -c dorado_versions='["1.2.0","1.3.0"]'  # Multiple versions

# Optional: Deploy with EC2 Capacity Block support for P5 instances
# Requires pre-purchased Capacity Block reservation ID and Availability Zone
cdk deploy -c capacity_reservation_id=<reservation-id> -c capacity_reservation_az=<availability-zone>

3. Load Test Data

cd nextflow/ontbasecalling

# load and pre-process Genome in a Bottle test data set
nextflow run main.nf -profile awsbatch --load_test_data=true

Note: The workflow uses --test_data_pub_dir_mode=move by default, which moves split POD5 files from the Nextflow work directory to the output directory instead of copying them. This avoids duplication of large volumes of data (~1TB) and reduces storage costs.

4. Run Basecalling Workflow

cd nextflow/ontbasecalling

nextflow run main.nf -profile awsbatch

Nextflow Workflow Configuration

To display the pipeline configuration documentation:

cd nextflow/ontbasecalling

nextflow run main.nf --help

AWS Configuration

| Option | Type | Description | Default |
|---|---|---|---|
| `--aws_account_id` | integer | The AWS account the workflow will be executed in | - |
| `--aws_region` | string | The AWS Region the workflow will be executed in | us-east-1 |
| `--use_spot` | boolean | Use Amazon EC2 Spot Instances | false |

Software Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--container_registry` | string | The Docker container registry | <aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com |
| `--dorado_version` | string | The version of the Dorado basecaller | 1.3.0 |
| `--pod5_version` | string | The version of the Pod5 tool | 0.3.35 |

Input/Output Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--input_dir` | string | The FAST5/POD5 input directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ |
| `--input_is_fast5` | boolean | Input data in FAST5 format | false |
| `--output_dir` | string | The workflow output directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss> |
| `--report_dir` | string | The workflow report directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss> |
| `--work_dir` | string | The workflow work directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir |
| `--publish_dir_mode` | string | The workflow publishing mode | copy |

Basecalling Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--model` | string | The basecalling model | dna_r10.4.1_e8.2_400bps_hac@v5.2.0 |
| `--gpu_instance_type` | string | The Amazon EC2 GPU instance type used for basecalling | g5 |
| `--gpu_instance_count` | integer | The number of GPU instances used to parallelize basecalling; if set to -1, the number of GPU instances is set automatically to match the throughput of a p5.48xlarge instance | -1 |
| `--merge_output` | boolean | Create a single BAM file per basecalling task | true |

Test Data Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--load_test_data` | boolean | Pre-process and load test data | false |
| `--test_data_input_dir` | string | The source location of the POD5 test data | s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/ |
| `--test_data_output_dir` | string | The destination location for the POD5 test data | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ |
| `--test_data_pub_dir_mode` | string | The publishing mode for the POD5 test data | move |
| `--split_pod5_size_gb` | number | The target file size for the test data POD5 files | 1.5 |

Data Storage

All data is stored in the Amazon S3 bucket created during CDK deployment: s3://ont-basecalling-<aws_region>-<aws_account_id>. By default the S3 Intelligent-Tiering storage class is applied to all objects.

| Data Type | S3 Location | Description |
|---|---|---|
| Test Data | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ | Pre-processed POD5 test files (743 files, ~1TB total) |
| Nextflow Work Directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir/ | Intermediate files and task execution data |
| Workflow Output | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss>/ | Basecalling results (BAM files) |
| Workflow Reports | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss>/ | Execution reports, timeline, and resource usage |
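The timestamped output and report locations follow a simple naming pattern, which can be reconstructed as below (the exact timestamp format is an assumption derived from the `<yyyy-mm-dd_hh-mm-ss>` placeholder; the account ID and region are examples):

```python
from datetime import datetime

def default_output_dir(region, account_id, now=None):
    """Illustrative reconstruction of the default --output_dir pattern;
    bucket name and timestamp layout are taken from the tables above."""
    ts = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    return f"s3://ont-basecalling-{region}-{account_id}/nextflow/output/{ts}"

print(default_output_dir("us-east-1", "123456789012",
                         datetime(2025, 1, 2, 3, 4, 5)))
# s3://ont-basecalling-us-east-1-123456789012/nextflow/output/2025-01-02_03-04-05
```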

Usage Examples

Basic Usage

# Process POD5 files with default settings (horizontal scaling of the test data set across 11 x g6e.xlarge instances)
nextflow run main.nf -profile awsbatch

# Process POD5 files with high accuracy model
nextflow run main.nf -profile awsbatch --model=hac

# Use specific GPU instance type
nextflow run main.nf -profile awsbatch --gpu_instance_type=g4

Advanced Configuration

# Custom S3 paths, model selection and scaling options
nextflow run main.nf -profile awsbatch \
  --input_dir=s3://my-bucket/input/ \
  --output_dir=s3://my-bucket/output/ \
  --model=sup \
  --gpu_instance_type=g6 \
  --gpu_instance_count=30

# Use Amazon EC2 Spot Instances
nextflow run main.nf -profile awsbatch --use_spot=true

Throughput Configuration & Cost Estimation

Basecalling throughput and compute cost estimates based on analysing flow cell PAW79146 of the ONT Genome in a Bottle Data Release 2025.01 with Dorado v1.3.0 and basecalling model dna_r10.4.1_e8.2_400bps_hac@v5.2.0. Compute cost estimates are based on Amazon EC2 on-demand pricing in the US East (Northern Virginia) Region and include instance time consumed for data staging and unstaging.

| Instance Type | Basecalling Throughput [samples/s] | Instances Required for P5 Throughput Parity | Pipeline Runtime [hh:mm:ss] | Compute Cost per Flowcell [$] | Compute Cost per Gbase [$] | Compute Cost per 30x Human Genome [$] |
|---|---|---|---|---|---|---|
| g4dn.xlarge | 6.28 × 10⁶ | 182 | 00:22:47 | 33.15 | 0.33 | 32.08 |
| g5.xlarge | 2.94 × 10⁷ | 39 | 00:21:47 | 13.27 | 0.13 | 12.84 |
| g6e.xlarge | 1.05 × 10⁸ | 11 | 00:28:47 | 9.50 | 0.10 | 9.20 |
| p4d.24xlarge | 6.94 × 10⁸ | 2 | 00:19:55 | 14.40 | 0.14 | 13.94 |
| p5.48xlarge | 1.14 × 10⁹ | 1 | 00:29:16 | 26.85 | 0.27 | 25.98 |
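The "Instances Required for P5 Throughput Parity" column follows directly from the throughput figures: divide the p5.48xlarge throughput by each instance type's throughput and round up. This is also the fleet-sizing rule described for `--gpu_instance_count=-1`:

```python
import math

# Throughput figures [samples/s] from the table above
throughput = {
    "g4dn.xlarge": 6.28e6,
    "g5.xlarge":   2.94e7,
    "g6e.xlarge":  1.05e8,
    "p4d.24xlarge": 6.94e8,
    "p5.48xlarge": 1.14e9,
}

p5 = throughput["p5.48xlarge"]
parity = {itype: math.ceil(p5 / t) for itype, t in throughput.items()}
print(parity)
# {'g4dn.xlarge': 182, 'g5.xlarge': 39, 'g6e.xlarge': 11,
#  'p4d.24xlarge': 2, 'p5.48xlarge': 1}
```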

Cleanup

To avoid ongoing charges, remove all resources created by this solution when they are no longer needed.

1. Delete S3 Data (Optional)

If you loaded test data or ran basecalling workflows, you may want to verify the data in the S3 bucket before destroying the stack. The CDK stack is configured to automatically delete the S3 buckets and their contents on stack deletion, so no manual cleanup is required. However, if you want to inspect or selectively preserve data first:

# List bucket contents
aws s3 ls s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/ --recursive --summarize

# Optional: Download results you want to keep before destroying the stack
aws s3 cp s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/nextflow/output/ ./local-output/ --recursive

2. Clean Up Nextflow Work Directory (Optional)

If you want to remove intermediate Nextflow files before destroying the stack:

nextflow clean -f

3. Destroy the CDK Stack

cd cdk

# Activate virtualenv if not already active
source .venv/bin/activate

# Destroy all resources (S3 buckets and their contents are deleted automatically)
cdk destroy

This removes all AWS resources created by the stack, including:

  • VPC, subnets, and security groups
  • AWS Batch compute environments and job queues
  • Amazon ECR repositories and Docker images
  • Amazon S3 buckets and all stored objects
  • Lambda functions and IAM roles
  • CloudTrail trail and CloudWatch log groups
  • SSM parameters
  • Budget alerts (if configured)

4. Remove the CDK Bootstrap Stack (Optional)

If this was the only CDK application in the account/region and you no longer need CDK bootstrapping:

aws cloudformation delete-stack --stack-name CDKToolkit

Security Considerations

  • All compute instances run in private subnets
  • Amazon ECR repositories use image scanning
  • Amazon S3 buckets enforce SSL and block public access
  • Amazon IAM roles follow least-privilege principles
  • Amazon EC2 VPC Flow Logs enabled for network monitoring

Support and Contributing

  • Report issues via GitHub Issues
  • Follow AWS best practices for security and cost optimization
  • Test changes in development environments before production deployment

Disclaimers

Third Party Packages

This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company's open source approval policy before proceeding.

Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.

THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.

YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.

General

AWS does not represent or warrant that this AWS Content is production ready. You are responsible for making your own independent assessment of the information, guidance, code and other AWS Content provided by AWS, which may include you performing your own independent testing, securing, and optimizing. You should take independent measures to ensure that you comply with your own specific quality control practices and standards, and to ensure that you comply with the local rules, laws, regulations, licenses and terms that apply to you and your content. If you are in a regulated industry, you should take extra care to ensure that your use of this AWS Content, in combination with your own content, complies with applicable regulations (for example, the Health Insurance Portability and Accountability Act of 1996). AWS does not make any representations, warranties or guarantees that this AWS Content will result in a particular outcome or result.
