This repository provides an example implementation for cost-effective and scalable Oxford Nanopore Technologies (ONT) basecalling with Dorado on AWS. The solution uses horizontal scaling of basecalling workloads to achieve superior price-performance compared to traditional vertical scaling while enabling flexible capacity adjustment to meet varying turnaround time requirements.
Vertical vs. horizontal scaling of GPU compute for ONT basecalling. A) Vertical scaling involves increasing the power of a single machine by adding more or more powerful CPU, GPU, RAM, or storage to handle heavier workloads. B) Horizontal scaling distributes the computational load across multiple machines. This hypothetical example compares processing 400 POD5 files on a single p5.48xlarge instance versus distributing them in batches of 20 files across 20 g6e.xlarge instances.
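The distribution step in (B) amounts to fixed-size chunking of the input file list before dispatch. A minimal Python sketch of that logic, using illustrative file names (the actual batching is performed by Nextflow, not this code):

```python
def make_batches(files, batch_size):
    """Split a list of input files into fixed-size batches for parallel dispatch."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

# Hypothetical scenario from the figure: 400 POD5 files in batches of 20,
# one batch per g6e.xlarge instance.
pod5_files = [f"reads_{i:04d}.pod5" for i in range(400)]
batches = make_batches(pod5_files, 20)
print(len(batches))  # → 20
```

Each batch then becomes one independent basecalling job, which is what allows the workload to scale horizontally across instances.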
The architecture combines Nextflow for input data batching and parallelization of basecalling tasks with AWS Batch for managed compute resource provisioning and job scheduling, enabling automatic distribution of basecalling jobs across multiple Amazon EC2 GPU instances. The solution optionally uses Amazon EC2 Spot Instances to further reduce compute costs, and Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions.
Key Features:
- Horizontal scaling: Parallelizes basecalling tasks across multiple smaller GPU instances for superior price-performance
- Containerized tooling: Dorado basecaller with basecalling models packaged in Docker container for consistent, reproducible execution
- Spot Instance support: Optional use of Amazon EC2 Spot Instances for up to 90% additional cost savings
- Amazon EC2 Instance Storage: Use of instance storage for data staging achieves reduced storage costs and enhanced I/O performance
- S3 Intelligent-Tiering: Amazon S3 Intelligent-Tiering storage class used by default for automatic storage cost optimization
- Fault tolerance: Nextflow's built-in retry and resume functionality provides resilience against Spot interruptions
- Flexible throughput: Dynamic scaling enabling flexible capacity adjustment to meet varying turnaround time requirements
- Monitoring: Amazon CloudWatch integration for job monitoring and logging
The solution leverages the following AWS services and tools:
- Nextflow: Workflow management system for automating input data batching and task parallelization
- AWS Batch: Managed batch compute service for dynamic resource provisioning and job scheduling
- Amazon EC2: Flexible GPU-accelerated compute (G4, G5, G6, P4 and P5 instance types)
- Amazon ECR: Container registry for storing Dorado basecaller Docker images
- Amazon S3: Durable, highly available storage for input and output data
- AWS CDK: Infrastructure-as-code framework for automated deployment
ONT basecalling with Nextflow on AWS Batch. Nextflow automates POD5 input file batching and parallelization of basecalling tasks. AWS Batch distributes basecalling jobs across multiple GPU instances. By utilizing Amazon EC2 Instance Storage, the solution achieves both reduced storage costs and enhanced I/O performance. Docker container images are stored in Amazon ECR. Input and output data are stored in Amazon S3.
ONT basecalling pipeline overview. The pipeline accepts POD5 or FAST5 files as input. FAST5 files are converted to POD5 format (orange line). POD5 files are split into batches based on the number of GPU instances configured, and each batch is processed in parallel using the Dorado basecaller. The output is one or more BAM files per basecalling task.
AWS Batch compute environments will be created for the following instance types if available in the chosen AWS Region:
| Instance Type | GPU | vCPUs | Memory [GB] | GPU Memory [GB] |
|---|---|---|---|---|
| g4dn.xlarge | 1x NVIDIA T4 | 4 | 16 | 16 |
| g5.xlarge | 1x NVIDIA A10G | 4 | 16 | 24 |
| g6e.xlarge | 1x NVIDIA L40S | 4 | 32 | 48 |
| p4d.24xlarge | 8x NVIDIA A100 | 96 | 1152 | 320 |
| p5.48xlarge | 8x NVIDIA H100 | 192 | 2048 | 640 |
The CDK deployment supports Amazon EC2 Capacity Blocks with AWS Batch compute environments for Amazon EC2 P5 instances, enabling advance reservation of GPU capacity for defined durations at scheduled future dates.
The repository includes a Nextflow workflow to ingest and pre-process test data from the publicly available nanopore sequencing data set for Genome in a Bottle (GIAB) sample HG001 (GM12878):
| Attribute | Value |
|---|---|
| Data Set Name | Oxford Nanopore Technologies - Genome in a Bottle Data Release 2025.01 |
| Sample ID | HG001 (GM12878) |
| Flowcell ID | PAW79146 |
| Data Set Source | Registry of Open Data on AWS |
| Data Set URI | `s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/` |
| File Format | POD5 |
| File Count | 73 |
| Total Size | 1,046 GB |
| Read Count | ~6M |
| Gigabases | ~99.35 |
| Genome Coverage | ~31x |
POD5 files of flow cell PAW79146 are downloaded from the Registry of Open Data on AWS. The original 73 POD5 files, ranging in size from 731.7 MB to 31.6 GB, are resized into 743 uniformly sized 1.5 GB files using the pod5 tool. The workflow runtime is approximately one hour, and the compute cost is approximately $0.90 based on Amazon EC2 On-Demand pricing in the US East (Northern Virginia) Region. The cost of storing the test data set in the S3 Standard/S3 Intelligent-Tiering Frequent Access tier is $24.06 per month based on Amazon S3 pricing in the US East (Northern Virginia) Region. For instructions on loading the test data, see the Load Test Data section.
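The $24.06/month figure follows directly from per-GB pricing. A quick sanity check, assuming the published S3 Standard rate of $0.023 per GB-month in us-east-1 (pricing changes over time, so treat the rate as an assumption):

```python
total_gb = 1046              # total size of the resized test data set
price_per_gb_month = 0.023   # assumed S3 Standard / Intelligent-Tiering Frequent Access rate, us-east-1
monthly_storage_cost = total_gb * price_per_gb_month
print(f"${monthly_storage_cost:.2f} per month")  # → $24.06 per month
```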
```
ont-basecalling/
├── cdk/                                # AWS CDK infrastructure code
│   ├── ont_basecalling_on_aws/         # CDK constructs
│   │   ├── batch.py                    # AWS Batch configuration
│   │   ├── docker.py                   # ECR repositories and image copying
│   │   ├── network.py                  # VPC, subnets, security groups
│   │   ├── storage.py                  # S3 bucket configuration
│   │   └── ont_basecalling_on_aws_stack.py
│   ├── assets/                         # Lambda functions and assets
│   └── app.py                          # CDK application entry point
├── docker/                             # Docker image definitions
│   ├── dorado/                         # Dorado basecaller containers
│   └── pod5/                           # POD5 tools containers
├── nextflow/                           # Nextflow workflow
│   └── ontbasecalling/
│       ├── main.nf                     # Main workflow
│       ├── nextflow.config             # Workflow configuration
│       ├── nextflow_schema.json        # Pipeline parameter specification
│       ├── modules/local/              # Pipeline-specific modules
│       │   ├── dorado_basecalling/     # Dorado basecalling process definition
│       │   ├── fast5_to_pod5/          # FAST5 to POD5 conversion process definition
│       │   └── split_pod5/             # Split POD5 process definition
│       ├── workflows/                  # Main pipeline workflows
│       │   ├── load_test_data.nf       # Test data processing workflow
│       │   └── ontbasecalling.nf       # Basecalling workflow
│       └── conf/                       # Configuration profiles
│           ├── base.config             # Basic pipeline configurations
│           └── modules.config          # Module-specific configurations
└── images/                             # Documentation images
```
- AWS CLI - configured with appropriate permissions
- AWS CDK - for infrastructure deployment
- Docker - for image building
- Nextflow - for workflow execution
- Python - for CDK development
Active AWS credentials are required before running any CDK commands (cdk synth, cdk deploy).
The CDK app makes live AWS API calls at synth time to resolve the AWS account ID and to check
GPU instance type availability in the target region. Without valid credentials, cdk synth will
fail with botocore.exceptions.NoCredentialsError.
Verify your credentials are active before proceeding:
```bash
aws sts get-caller-identity
```

Your AWS credentials need permissions for:
- Amazon EC2 (VPC, Security Groups, Launch Templates)
- AWS Batch (Compute Environments, Job Queues)
- Amazon ECR (Repository management, image push/pull)
- Amazon S3 (Bucket operations, object access)
- AWS IAM (Role and policy management)
- Amazon CloudWatch (Logging and monitoring)
- AWS Lambda (Custom resource functions)
```bash
export AWS_ACCOUNT_ID=<aws_account_id>
export AWS_REGION=<aws_region>  # default: us-east-1
```

```bash
# clone the repository
git clone https://github.com/aws-samples/sample-ont-basecalling-on-aws.git

# change into the directory containing the CDK app
cd sample-ont-basecalling-on-aws/cdk

# create a virtualenv
python3 -m venv .venv

# activate the virtualenv
source .venv/bin/activate

# update pip
pip install --upgrade pip

# install the required dependencies
pip install -r requirements.txt
```

```bash
# deploy the CDK stack with Dorado 1.3.0
cdk bootstrap  # first time only
cdk deploy

# optional: build specific Dorado version(s)
cdk deploy -c dorado_versions='["1.2.0"]'
cdk deploy -c dorado_versions='["1.2.0","1.3.0"]'  # multiple versions

# optional: deploy with EC2 Capacity Block support for P5 instances
# requires a pre-purchased Capacity Block reservation ID and Availability Zone
cdk deploy -c capacity_reservation_id=<reservation-id> -c capacity_reservation_az=<availability-zone>
```

```bash
cd nextflow/ontbasecalling

# load and pre-process the Genome in a Bottle test data set
nextflow run main.nf -profile awsbatch --load_test_data=true
```

Note: The workflow uses `--test_data_pub_dir_mode=move` by default, which moves split POD5 files from the Nextflow work directory to the output directory instead of copying them. This avoids duplicating a large volume of data (~1 TB) and reduces storage costs.
```bash
cd nextflow/ontbasecalling
nextflow run main.nf -profile awsbatch
```

To display the pipeline configuration documentation:

```bash
cd nextflow/ontbasecalling
nextflow run main.nf --help
```

| Option | Type | Description | Default |
|---|---|---|---|
| `--aws_account_id` | integer | The AWS account the workflow will be executed in | - |
| `--aws_region` | string | The AWS Region the workflow will be executed in | us-east-1 |
| `--use_spot` | boolean | Use Amazon EC2 Spot Instances | false |
| Option | Type | Description | Default |
|---|---|---|---|
| `--container_registry` | string | The Docker container registry | `<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com` |
| `--dorado_version` | string | The version of the Dorado basecaller | 1.3.0 |
| `--pod5_version` | string | The version of the Pod5 tool | 0.3.35 |
| Option | Type | Description | Default |
|---|---|---|---|
| `--input_dir` | string | The FAST5/POD5 input directory | `s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/` |
| `--input_is_fast5` | boolean | Input data is in FAST5 format | false |
| `--output_dir` | string | The workflow output directory | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss>` |
| `--report_dir` | string | The workflow report directory | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss>` |
| `--work_dir` | string | The workflow work directory | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir` |
| `--publish_dir_mode` | string | The workflow publishing mode | copy |
| Option | Type | Description | Default |
|---|---|---|---|
| `--model` | string | The basecalling model | `dna_r10.4.1_e8.2_400bps_hac@v5.2.0` |
| `--gpu_instance_type` | string | The Amazon EC2 GPU instance type used for basecalling | g5 |
| `--gpu_instance_count` | integer | The number of GPU instances used to parallelize basecalling; if set to -1, the number of GPU instances is set automatically to match the throughput of a p5.48xlarge instance | -1 |
| `--merge_output` | boolean | Create a single BAM file per basecalling task | true |
| Option | Type | Description | Default |
|---|---|---|---|
| `--load_test_data` | boolean | Pre-process and load test data | false |
| `--test_data_input_dir` | string | The source location of the POD5 test data | `s3://ont-open-data/giab_2025.01/flowcells/HG001/PAW79146/pod5/` |
| `--test_data_output_dir` | string | The destination location for the POD5 test data | `s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/` |
| `--test_data_pub_dir_mode` | string | The publishing mode for the POD5 test data | move |
| `--split_pod5_size_gb` | number | The target file size in GB for the test data POD5 files | 1.5 |
All data is stored in the Amazon S3 bucket created during CDK deployment: `s3://ont-basecalling-<aws_region>-<aws_account_id>`. By default, the S3 Intelligent-Tiering storage class is applied to all objects.
| Data Type | S3 Location | Description |
|---|---|---|
| Test Data | `s3://ont-basecalling-<aws_region>-<aws_account_id>/test_data/` | Pre-processed POD5 test files (743 files, ~1 TB total) |
| Nextflow Work Directory | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/workdir/` | Intermediate files and task execution data |
| Workflow Output | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/output/<yyyy-mm-dd_hh-mm-ss>/` | Basecalling results (BAM files) |
| Workflow Reports | `s3://ont-basecalling-<aws_region>-<aws_account_id>/nextflow/reports/<yyyy-mm-dd_hh-mm-ss>/` | Execution reports, timeline, and resource usage |
```bash
# process POD5 files with default settings (horizontal scaling of the test data set across 11 x g6e.xlarge instances)
nextflow run main.nf -profile awsbatch

# process POD5 files with the high accuracy model
nextflow run main.nf -profile awsbatch --model=hac

# use a specific GPU instance type
nextflow run main.nf -profile awsbatch --gpu_instance_type=g4
```

```bash
# custom S3 paths, model selection, and scaling options
nextflow run main.nf -profile awsbatch \
    --input_dir=s3://my-bucket/input/ \
    --output_dir=s3://my-bucket/output/ \
    --model=sup \
    --gpu_instance_type=g6 \
    --gpu_instance_count=30

# use Amazon EC2 Spot Instances
nextflow run main.nf -profile awsbatch --use_spot=true
```

Basecalling throughput and compute cost estimates are based on analyzing flow cell PAW79146 of the ONT Genome in a Bottle Data Release 2025.01 with Dorado v1.3.0 and basecalling model dna_r10.4.1_e8.2_400bps_hac@v5.2.0. Compute cost estimates are based on Amazon EC2 On-Demand pricing in the US East (Northern Virginia) Region and include instance time consumed for data staging and unstaging.
| Instance Type | Basecalling Throughput [samples/s] | Instances Required for P5 Throughput Parity | Pipeline Runtime [hh:mm:ss] | Compute Cost Per Flowcell [$] | Compute Cost Per Gbase [$] | Compute Cost Per 30x Human Genome [$] |
|---|---|---|---|---|---|---|
| g4dn.xlarge | 6.28 × 10<sup>6</sup> | 182 | 00:22:47 | 33.15 | 0.33 | 32.08 |
| g5.xlarge | 2.94 × 10<sup>7</sup> | 39 | 00:21:47 | 13.27 | 0.13 | 12.84 |
| g6e.xlarge | 1.05 × 10<sup>8</sup> | 11 | 00:28:47 | 9.50 | 0.10 | 9.20 |
| p4d.24xlarge | 6.94 × 10<sup>8</sup> | 2 | 00:19:55 | 14.40 | 0.14 | 13.94 |
| p5.48xlarge | 1.14 × 10<sup>9</sup> | 1 | 00:29:16 | 26.85 | 0.27 | 25.98 |
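The "Instances Required for P5 Throughput Parity" column (and the default `--gpu_instance_count=-1` behavior) reduces to a ceiling division of per-instance throughputs. A sketch that reproduces the counts in the table above:

```python
import math

# basecalling throughput in samples/s, taken from the benchmark table above
throughput = {
    "g4dn.xlarge": 6.28e6,
    "g5.xlarge": 2.94e7,
    "g6e.xlarge": 1.05e8,
    "p4d.24xlarge": 6.94e8,
    "p5.48xlarge": 1.14e9,
}

def instances_for_p5_parity(instance_type: str) -> int:
    """Smallest instance count whose aggregate throughput matches one p5.48xlarge."""
    return math.ceil(throughput["p5.48xlarge"] / throughput[instance_type])

for t in throughput:
    print(f"{t}: {instances_for_p5_parity(t)} instance(s)")
# g4dn.xlarge: 182, g5.xlarge: 39, g6e.xlarge: 11, p4d.24xlarge: 2, p5.48xlarge: 1
```

This is an illustrative calculation, not the pipeline's actual implementation; the workflow ships its own throughput figures and sizing logic.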
To avoid ongoing charges, remove all resources created by this solution when they are no longer needed.
If you loaded test data or ran basecalling workflows, you may want to verify the data in the S3 bucket before destroying the stack. The CDK stack is configured to automatically delete the S3 buckets and their contents on stack deletion, so no manual cleanup is required. However, if you want to inspect or selectively preserve data first:
```bash
# list bucket contents
aws s3 ls s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/ --recursive --summarize

# optional: download results you want to keep before destroying the stack
aws s3 cp s3://ont-basecalling-${AWS_REGION}-${AWS_ACCOUNT_ID}/nextflow/output/ ./local-output/ --recursive
```

If you want to remove intermediate Nextflow files before destroying the stack:

```bash
nextflow clean -f
```

```bash
cd cdk

# activate the virtualenv if not already active
source .venv/bin/activate

# destroy all resources (S3 buckets and their contents are deleted automatically)
cdk destroy
```

This removes all AWS resources created by the stack, including:
- VPC, subnets, and security groups
- AWS Batch compute environments and job queues
- Amazon ECR repositories and Docker images
- Amazon S3 buckets and all stored objects
- Lambda functions and IAM roles
- CloudTrail trail and CloudWatch log groups
- SSM parameters
- Budget alerts (if configured)
If this was the only CDK application in the account/region and you no longer need CDK bootstrapping:
```bash
aws cloudformation delete-stack --stack-name CDKToolkit
```

- All compute instances run in private subnets
- Amazon ECR repositories use image scanning
- Amazon S3 buckets enforce SSL and block public access
- Amazon IAM roles follow least-privilege principles
- Amazon EC2 VPC Flow Logs enabled for network monitoring
- Report issues via GitHub Issues
- Follow AWS best practices for security and cost optimization
- Test changes in development environments before production deployment
This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company's open source approval policy before proceeding.
Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.
THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.
YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.
- NVIDIA deep learning container, repository: https://hub.docker.com/layers/nvidia/cuda/13.1.0-runtime-ubuntu24.04, license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
- Dorado high-performance engine for Oxford Nanopore read analysis, repository: https://github.com/nanoporetech/dorado, license: https://github.com/nanoporetech/dorado/blob/master/LICENCE.txt - Oxford Nanopore Technologies PLC. Public License Version 1.0
- Pod5 high performance file format for nanopore reads, repository: https://github.com/nanoporetech/pod5-file-format, license: https://github.com/nanoporetech/pod5-file-format/blob/master/LICENSE.md - Mozilla Public License Version 2.0
AWS does not represent or warrant that this AWS Content is production ready. You are responsible for making your own independent assessment of the information, guidance, code and other AWS Content provided by AWS, which may include you performing your own independent testing, securing, and optimizing. You should take independent measures to ensure that you comply with your own specific quality control practices and standards, and to ensure that you comply with the local rules, laws, regulations, licenses and terms that apply to you and your content. If you are in a regulated industry, you should take extra care to ensure that your use of this AWS Content, in combination with your own content, complies with applicable regulations (for example, the Health Insurance Portability and Accountability Act of 1996). AWS does not make any representations, warranties or guarantees that this AWS Content will result in a particular outcome or result.

