Cost-effective and Scalable Oxford Nanopore Technologies Basecalling on AWS

This repository provides an example implementation for cost-effective and scalable Oxford Nanopore Technologies (ONT) basecalling with Dorado on AWS. The solution uses horizontal scaling of basecalling workloads to achieve superior price-performance compared to traditional vertical scaling while enabling flexible capacity adjustment to meet varying turnaround time requirements.

Vertical vs Horizontal Scaling

Vertical vs. horizontal scaling of GPU compute for ONT basecalling. A) Vertical scaling increases the power of a single machine by adding more, or more powerful, CPUs, GPUs, RAM, or storage to handle heavier workloads. B) Horizontal scaling distributes the computational load across multiple machines. This hypothetical example compares processing 400 POD5 files on a single p5.48xlarge instance versus distributing them in batches of 20 files across 20 g6e.xlarge instances.
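The batching arithmetic behind the hypothetical example can be sketched in a few lines of Python (file names are illustrative, not from the data set):

```python
def make_batches(files, batch_size):
    """Split a list of input files into fixed-size batches,
    one batch per GPU instance (illustration only)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

# 400 hypothetical POD5 files distributed in batches of 20,
# i.e. one batch per g6e.xlarge instance
pod5_files = [f"reads_{i:04d}.pod5" for i in range(400)]
batches = make_batches(pod5_files, 20)

print(len(batches))     # 20 batches -> 20 instances
print(len(batches[0]))  # 20 files per batch
```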

Architecture

The architecture combines Nextflow for input data batching and parallelization of basecalling tasks with AWS Batch for managed compute resource provisioning and job scheduling, enabling automatic distribution of basecalling jobs across multiple Amazon EC2 GPU instances. The solution optionally uses Amazon EC2 Spot Instances to further optimize compute costs. Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions.

Key Features:

  • Horizontal scaling: Parallelizes basecalling tasks across multiple smaller GPU instances for superior price-performance
  • Containerized tooling: Dorado basecaller with basecalling models packaged in Docker container for consistent, reproducible execution
  • Spot Instance support: Optional use of Amazon EC2 Spot Instances for up to 90% additional cost savings
  • Amazon EC2 Instance Storage: Use of instance storage for data staging achieves reduced storage costs and enhanced I/O performance
  • S3 Intelligent-Tiering: Amazon S3 Intelligent-Tiering storage class used by default for automatic storage cost optimization
  • Fault tolerance: Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions
  • Flexible throughput: Dynamic scaling enabling flexible capacity adjustment to meet varying turnaround time requirements
  • Monitoring: Amazon CloudWatch integration for job monitoring and logging

The solution leverages the following AWS services and tools:

  • Nextflow: Workflow management system for automating input data batching and task parallelization
  • AWS Batch: Managed batch compute service for dynamic resource provisioning and job scheduling
  • Amazon EC2: Flexible GPU-accelerated compute (G4, G5, G6, P4 and P5 instance types)
  • Amazon ECR: Container registry for storing Dorado basecaller Docker images
  • Amazon S3: Durable, highly available storage for input and output data
  • AWS CDK: Infrastructure-as-code framework for automated deployment

Architecture Diagram

ONT basecalling with Nextflow on AWS Batch. Nextflow automates POD5 input file batching and parallelization of basecalling tasks. AWS Batch distributes basecalling jobs across multiple GPU instances. By utilizing Amazon EC2 Instance Storage, the solution achieves both reduced storage costs and enhanced I/O performance. Docker container images are stored in Amazon ECR. Input and output data are stored in Amazon S3.

Pipeline Overview

ONT Basecalling Metro Map

ONT basecalling pipeline overview. The pipeline accepts POD5 or FAST5 files as input. FAST5 files are converted to POD5 format (orange line). POD5 files are split into batches based on the number of GPU instances configured, and each batch is processed in parallel using the Dorado basecaller. The output is one or more BAM files per basecalling task.

Pre-configured Amazon EC2 GPU Instance Types

AWS Batch compute environments will be created for the following instance types if available in the chosen AWS Region:

| Instance Type | GPU | vCPUs | Memory [GB] | GPU Memory [GB] |
|---|---|---|---|---|
| g4dn.xlarge | 1x NVIDIA T4 | 4 | 16 | 16 |
| g5.xlarge | 1x NVIDIA A10G | 4 | 16 | 24 |
| g6e.xlarge | 1x NVIDIA L40S | 4 | 32 | 48 |
| p4d.24xlarge | 8x NVIDIA A100 | 96 | 1152 | 320 |
| p5.48xlarge | 8x NVIDIA H100 | 192 | 2048 | 640 |

The CDK deployment supports Amazon EC2 Capacity Blocks with AWS Batch compute environments for Amazon EC2 P5 instances, enabling advance reservation of GPU capacity for defined durations at scheduled future dates.

Test Data

The repository includes a Nextflow workflow to ingest and pre-process test data from the publicly available nanopore sequencing data set for Genome in a Bottle (GIAB) sample HG001 (GM12878):

| Attribute | Value |
|---|---|
| Data Set Name | Oxford Nanopore Technologies - Genome in a Bottle Data Release 2025.01 |
| Sample ID | HG001 (GM12878) |
| Flowcell ID | PAW79146 |
| Data Set Source | Registry of Open Data on AWS |
| Data Set URI | s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/ |
| File Format | POD5 |
| File Count | 73 |
| Total Size | 1,046 GB |
| Read Count | ~6M |
| Gigabases | ~99.35 |
| Genome Coverage | ~31x |

POD5 files of flow cell PAW79146 are downloaded from the Registry of Open Data on AWS. The original 73 POD5 files, ranging in size from 731.7 MB to 31.6 GB, are resized into 743 uniformly sized 1.5 GB files using the pod5 tool. The workflow runtime is approximately one hour, and the compute cost is approximately $0.90 based on Amazon EC2 On-Demand pricing in the US East (Northern Virginia) Region. The cost for storing the test data set in the S3 Standard/S3 Intelligent-Tiering Frequent Access tier is $24.06 per month based on Amazon S3 pricing in the US East (Northern Virginia) Region. For instructions on loading the test data, see the Load Test Data section.
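As a sanity check, the quoted monthly storage cost is consistent with a rate of roughly $0.023 per GB-month (an assumed S3 Standard/Intelligent-Tiering Frequent Access rate for us-east-1; verify against current Amazon S3 pricing):

```python
total_size_gb = 1046          # test data set size from the table above
s3_rate_per_gb_month = 0.023  # assumed us-east-1 rate in $/GB-month;
                              # check current Amazon S3 pricing

monthly_cost = total_size_gb * s3_rate_per_gb_month
print(f"${monthly_cost:.2f} per month")  # $24.06 per month
```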

Project Structure

ont-basecalling/
├── cdk/                            # AWS CDK infrastructure code
│   ├── ont_basecalling_on_aws/     # CDK constructs
│   │   ├── batch.py                # AWS Batch configuration
│   │   ├── docker.py               # ECR repositories and image copying
│   │   ├── network.py              # VPC, subnets, security groups
│   │   ├── storage.py              # S3 bucket configuration
│   │   └── ont_basecalling_on_aws_stack.py
│   ├── assets/                     # Lambda functions and assets
│   └── app.py                      # CDK application entry point
├── docker/                         # Docker image definitions
│   ├── dorado/                     # Dorado basecaller containers
│   └── pod5/                       # POD5 tools containers
├── nextflow/                       # Nextflow workflow
│   └── ontbasecalling/
│       ├── main.nf                 # Main workflow
│       ├── nextflow.config         # Workflow configuration
│       ├── nextflow_schema.json    # Pipeline parameter specification
│       ├── modules/local/          # Pipeline-specific modules
│       │   ├── dorado_basecalling/ # Dorado basecalling process definition
│       │   ├── fast5_to_pod5/      # FAST5 to POD5 conversion process definition
│       │   ├── split_pod5/         # Split POD5 process definition
│       ├── workflows/              # Main pipeline workflows
│       │   ├── load_test_data.nf   # Test Data Processing Workflow
│       │   ├── ontbasecalling.nf   # Basecalling Workflow
│       └── conf/                   # Configuration profiles
│           ├── base.config         # Basic pipeline configurations
│           └── modules.config      # Module-specific configurations
└── images/                         # Documentation images

Prerequisites

Software Requirements

  • AWS CLI - configured with appropriate permissions
  • AWS CDK - for infrastructure deployment
  • Docker - for image building
  • Nextflow - for workflow execution
  • Python - for CDK development

AWS Credentials

Active AWS credentials are required before running any CDK commands (cdk synth, cdk deploy). The CDK app makes live AWS API calls at synth time to resolve the AWS account ID and to check GPU instance type availability in the target region. Without valid credentials, cdk synth will fail with botocore.exceptions.NoCredentialsError.

Verify your credentials are active before proceeding:

aws sts get-caller-identity

AWS Permissions

Your AWS credentials need permissions for:

  • Amazon EC2 (VPC, Security Groups, Launch Templates)
  • AWS Batch (Compute Environments, Job Queues)
  • Amazon ECR (Repository management, image push/pull)
  • Amazon S3 (Bucket operations, object access)
  • Amazon IAM (Role and policy management)
  • Amazon CloudWatch (Logging and monitoring)
  • AWS Lambda (Custom resource functions)

Quick Start

1. Set AWS Account ID & Region

export AWS_ACCOUNT_ID=<aws_account_id>
export AWS_REGION=<aws_region>  # default: us-east-1

2. Deploy Infrastructure & Build Docker containers

# clone the repository
git clone https://github.com/aws-samples/sample-ont-basecalling-on-aws.git

# change into the directory containing the CDK app
cd sample-ont-basecalling-on-aws/cdk

# create a virtualenv:
python3 -m venv .venv

# activate virtualenv.
source .venv/bin/activate

# update pip
pip install --upgrade pip

# install the required dependencies.
pip install -r requirements.txt

# Deploy the CDK stack with Dorado 1.3.0
cdk bootstrap  # First time only
cdk deploy

# Optional: Build specific Dorado version(s)
cdk deploy -c dorado_versions='["1.2.0"]'
cdk deploy -c dorado_versions='["1.2.0","1.3.0"]'  # Multiple versions

# Optional: Deploy with EC2 Capacity Block support for P5 instances
# Requires pre-purchased Capacity Block reservation ID and Availability Zone
cdk deploy -c capacity_reservation_id=<reservation-id> -c capacity_reservation_az=<availability-zone>

3. Load Test Data

cd nextflow/ontbasecalling

# load and pre-process Genome in a Bottle test data set
nextflow run main.nf -profile awsbatch --load_test_data=true

Note: The workflow uses --test_data_pub_dir_mode=move by default, which moves split POD5 files from the Nextflow work directory to the output directory instead of copying them. This avoids duplication of large volumes of data (~1TB) and reduces storage costs.

4. Run Basecalling Workflow

cd nextflow/ontbasecalling

nextflow run main.nf -profile awsbatch

Nextflow Workflow Configuration

To display the pipeline configuration documentation:

cd nextflow/ontbasecalling

nextflow run main.nf --help

AWS Configuration

| Option | Type | Description | Default |
|---|---|---|---|
| `--aws_account_id` | integer | The AWS account the workflow will be executed in | - |
| `--aws_region` | string | The AWS Region the workflow will be executed in | us-east-1 |
| `--use_spot` | boolean | Use Amazon EC2 Spot Instances | false |

Software Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--container_registry` | string | The Docker container registry | <aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com |
| `--dorado_version` | string | The version of the Dorado basecaller | 1.3.0 |
| `--pod5_version` | string | The version of the Pod5 tool | 0.3.35 |

Input/Output Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--input_dir` | string | The FAST5/POD5 input directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ |
| `--input_is_fast5` | boolean | Input data in FAST5 format | false |
| `--output_dir` | string | The workflow output directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss> |
| `--report_dir` | string | The workflow report directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss> |
| `--work_dir` | string | The workflow work directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir |
| `--publish_dir_mode` | string | The workflow publishing mode | copy |

Basecalling Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--model` | string | The basecalling model | dna_r10.4.1_e8.2_400bps_hac@v5.2.0 |
| `--gpu_instance_type` | string | The Amazon EC2 GPU instance type used for basecalling | g5 |
| `--gpu_instance_count` | integer | The number of GPU instances used to parallelize basecalling; if set to -1, the number of GPU instances is set automatically to match the throughput of a p5.48xlarge instance | -1 |
| `--merge_output` | boolean | Create a single BAM file per basecalling task | true |

Test Data Options

| Option | Type | Description | Default |
|---|---|---|---|
| `--load_test_data` | boolean | Pre-process and load test data | false |
| `--test_data_input_dir` | string | The source location of the POD5 test data | s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/ |
| `--test_data_output_dir` | string | The destination location for the POD5 test data | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ |
| `--test_data_pub_dir_mode` | string | The publishing mode for the POD5 test data | move |
| `--split_pod5_size_gb` | number | The target file size for the test data POD5 files | 1.5 |

Data Storage

All data is stored in the Amazon S3 bucket created during CDK deployment: s3://ont-basecalling-<aws_region>-<aws_account_id>. By default the S3 Intelligent-Tiering storage class is applied to all objects.

| Data Type | S3 Location | Description |
|---|---|---|
| Test Data | s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/ | Pre-processed POD5 test files (743 files, ~1TB total) |
| Nextflow Work Directory | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir/ | Intermediate files and task execution data |
| Workflow Output | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss>/ | Basecalling results (BAM files) |
| Workflow Reports | s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss>/ | Execution reports, timeline, and resource usage |
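The timestamped output and report locations follow a simple naming pattern, which can be reconstructed as below (the exact timestamp format is an assumption derived from the `<yyyy-mm-dd_hh-mm-ss>` placeholder; the account ID and region are examples):

```python
from datetime import datetime

def default_output_dir(region, account_id, now=None):
    """Illustrative reconstruction of the default --output_dir pattern;
    bucket name and timestamp layout are taken from the tables above."""
    ts = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    return f"s3://ont-basecalling-{region}-{account_id}/nextflow/output/{ts}"

print(default_output_dir("us-east-1", "123456789012",
                         datetime(2025, 1, 2, 3, 4, 5)))
# s3://ont-basecalling-us-east-1-123456789012/nextflow/output/2025-01-02_03-04-05
```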

Usage Examples

Basic Usage

# Process POD5 files with default settings (horizontal scaling of the test data set across 11 x g6e.xlarge instances)
nextflow run main.nf -profile awsbatch

# Process POD5 files with high accuracy model
nextflow run main.nf -profile awsbatch --model=hac

# Use specific GPU instance type
nextflow run main.nf -profile awsbatch --gpu_instance_type=g4

Advanced Configuration

# Custom S3 paths, model selection and scaling options
nextflow run main.nf -profile awsbatch \
  --input_dir=s3://my-bucket/input/ \
  --output_dir=s3://my-bucket/output/ \
  --model=sup \
  --gpu_instance_type=g6 \
  --gpu_instance_count=30

# Use Amazon EC2 Spot Instances
nextflow run main.nf -profile awsbatch --use_spot=true

Throughput Configuration & Cost Estimation

Basecalling throughput and compute cost estimates based on analysing flow cell PAW79146 of the ONT Genome in a Bottle Data Release 2025.01 with Dorado v1.3.0 and basecalling model dna_r10.4.1_e8.2_400bps_hac@v5.2.0. Compute cost estimates are based on Amazon EC2 on-demand pricing in the US East (Northern Virginia) Region and include instance time consumed for data staging and unstaging.

| Instance Type | Basecalling Throughput [samples/s] | Instances Required for P5 Throughput Parity | Pipeline Runtime [hh:mm:ss] | Compute Cost per Flowcell [$] | Compute Cost per Gbase [$] | Compute Cost per 30x Human Genome [$] |
|---|---|---|---|---|---|---|
| g4dn.xlarge | 6.28 × 10⁶ | 182 | 00:22:47 | 33.15 | 0.33 | 32.08 |
| g5.xlarge | 2.94 × 10⁷ | 39 | 00:21:47 | 13.27 | 0.13 | 12.84 |
| g6e.xlarge | 1.05 × 10⁸ | 11 | 00:28:47 | 9.50 | 0.10 | 9.20 |
| p4d.24xlarge | 6.94 × 10⁸ | 2 | 00:19:55 | 14.40 | 0.14 | 13.94 |
| p5.48xlarge | 1.14 × 10⁹ | 1 | 00:29:16 | 26.85 | 0.27 | 25.98 |
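The "Instances Required for P5 Throughput Parity" column follows directly from the throughput figures: divide the p5.48xlarge throughput by each instance type's throughput and round up. This is also the fleet-sizing rule described for `--gpu_instance_count=-1`:

```python
import math

# Throughput figures [samples/s] from the table above
throughput = {
    "g4dn.xlarge": 6.28e6,
    "g5.xlarge":   2.94e7,
    "g6e.xlarge":  1.05e8,
    "p4d.24xlarge": 6.94e8,
    "p5.48xlarge": 1.14e9,
}

p5 = throughput["p5.48xlarge"]
parity = {itype: math.ceil(p5 / t) for itype, t in throughput.items()}
print(parity)
# {'g4dn.xlarge': 182, 'g5.xlarge': 39, 'g6e.xlarge': 11,
#  'p4d.24xlarge': 2, 'p5.48xlarge': 1}
```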

Cleanup

To avoid ongoing charges, remove all resources created by this solution when they are no longer needed.

1. Delete S3 Data (Optional)

If you loaded test data or ran basecalling workflows, you may want to verify the data in the S3 bucket before destroying the stack. The CDK stack is configured to automatically delete the S3 buckets and their contents on stack deletion, so no manual cleanup is required. However, if you want to inspect or selectively preserve data first:

# List bucket contents
aws s3 ls s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/ --recursive --summarize

# Optional: Download results you want to keep before destroying the stack
aws s3 cp s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/nextflow/output/ ./local-output/ --recursive

2. Clean Up Nextflow Work Directory (Optional)

If you want to remove intermediate Nextflow files before destroying the stack:

nextflow clean -f

3. Destroy the CDK Stack

cd cdk

# Activate virtualenv if not already active
source .venv/bin/activate

# Destroy all resources (S3 buckets and their contents are deleted automatically)
cdk destroy

This removes all AWS resources created by the stack, including:

  • VPC, subnets, and security groups
  • AWS Batch compute environments and job queues
  • Amazon ECR repositories and Docker images
  • Amazon S3 buckets and all stored objects
  • Lambda functions and IAM roles
  • CloudTrail trail and CloudWatch log groups
  • SSM parameters
  • Budget alerts (if configured)

4. Remove the CDK Bootstrap Stack (Optional)

If this was the only CDK application in the account/region and you no longer need CDK bootstrapping:

aws cloudformation delete-stack --stack-name CDKToolkit

Security Considerations

  • All compute instances run in private subnets
  • Amazon ECR repositories use image scanning
  • Amazon S3 buckets enforce SSL and block public access
  • Amazon IAM roles follow least-privilege principles
  • Amazon EC2 VPC Flow Logs enabled for network monitoring

Support and Contributing

  • Report issues via GitHub Issues
  • Follow AWS best practices for security and cost optimization
  • Test changes in development environments before production deployment

Disclaimers

Third Party Packages

This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company's open source approval policy before proceeding.

Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.

THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.

YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.

General

AWS does not represent or warrant that this AWS Content is production ready. You are responsible for making your own independent assessment of the information, guidance, code and other AWS Content provided by AWS, which may include you performing your own independent testing, securing, and optimizing. You should take independent measures to ensure that you comply with your own specific quality control practices and standards, and to ensure that you comply with the local rules, laws, regulations, licenses and terms that apply to you and your content. If you are in a regulated industry, you should take extra care to ensure that your use of this AWS Content, in combination with your own content, complies with applicable regulations (for example, the Health Insurance Portability and Accountability Act of 1996). AWS does not make any representations, warranties or guarantees that this AWS Content will result in a particular outcome or result.
