When working with Oxford Nanopore sequencing data, it is common to need only a subset of raw reads (raw signals in POD5 format): for example, reads that map to a region of interest, a shortlist of problematic reads you want to inspect, or reads you want to share with a collaborator to reproduce an issue. Because our group works with raw nanopore data (see https://squidbase.org/), we ran into the problem that extracting reads by read ID from large POD5 datasets can be slower than you would expect, especially when the run is split across many POD5 files.
pod5_fast_extract is a small command-line tool that speeds this up by adding a simple indexing step. It scans your POD5 directory once, builds a lightweight map from read_id → source file, and then uses that map to run pod5 filter only on the files that actually contain the reads you asked for. In practice, this avoids repeatedly scanning files that cannot possibly contain your target reads, which can substantially reduce I/O and wall time.
The core idea
A run directory is like a shelf of books, and each read is like a page. Without an index, extracting a set of pages means opening every book and flipping through it until you find the pages you need. With an index, you first write down which pages are in which book, and from then on you only open the relevant books.
That is exactly what this tool does: build the “table of contents” once, then reuse it.
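The "table of contents" is just an inverted mapping: scan each file once, record which read IDs it holds, and flip that into read_id → file. The sketch below shows the indexing logic only; it is not the tool's actual code, and the per-file read IDs are passed in as a plain dict rather than read from real POD5 files (in practice they would come from a one-time scan with the pod5 library).

```python
import json
from pathlib import Path

def build_index(reads_per_file):
    """Invert {pod5_file: [read_id, ...]} into {read_id: pod5_file}.

    In the real tool the per-file read IDs come from scanning each POD5
    file once; here the scan result is passed in so the indexing logic
    stays self-contained.
    """
    index = {}
    for pod5_file, read_ids in reads_per_file.items():
        for read_id in read_ids:
            index[read_id] = pod5_file
    return index

def save_index(index, index_path):
    # Persist the index as JSON so later extractions skip the scan entirely.
    Path(index_path).write_text(json.dumps(index, indent=2))

# Example: two files, three reads.
index = build_index({"run_a.pod5": ["r1", "r2"], "run_b.pod5": ["r3"]})
# index["r3"] == "run_b.pod5": one dict lookup replaces a scan of every file.
```

Once the JSON index exists, answering "which file holds this read?" is a constant-time dictionary lookup instead of an O(files) scan, which is where the wall-time savings come from.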
I will be transparent: my first attempt was more ambitious. I had ChatGPT draft a more elaborate approach that relied on SQL to index and query target read IDs. In principle, that can be a clean solution. In practice, given my limited time for debugging generated code, and the fact that my coding skills are focused on what I use day-to-day as a bioinformatician, I reverted to tools I know well and can maintain confidently: Python, plain dictionaries, and a simple CLI. The result is not over-engineered, but it is robust and practical.
Typical use cases
- Extracting raw signals whose basecalled reads passed a mapping or QC filter, for targeted downstream analysis
- Pulling a small subset for debugging basecalling or signal processing steps
- Sharing a compact POD5 subset with a collaborator to reproduce an analysis
- Repeated "extract and re-run" cycles during method development
- Removing unwanted reads in POD5 format (e.g. human reads) before sharing in public repositories (e.g. SquiDBase)
Steps performed or supported
- Clone the repo from GitHub (https://github.com/Cuypers-Wim/pod5_fast_extract) and install the necessary packages with conda from the provided environment.yml
- Build an index (in JSON format) that maps read IDs to file names
- Create a text file containing one read ID per line (for example ids.txt). The tool then invokes the existing pod5 filter command only on the relevant files; outputs are written to the subsets/ directory.
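The extraction step can be sketched as follows: given the JSON index (loaded as a dict) and the IDs from ids.txt, group the requested reads by source file and plan one pod5 filter call per file that actually contains them. This is a minimal sketch rather than the tool's implementation; the --ids/--output flags follow the pod5 CLI documentation (check pod5 filter --help for your version), and the per-file ID lists and output names under subsets/ are illustrative.

```python
from pathlib import Path

def plan_filter_commands(index, wanted_ids, out_dir="subsets"):
    """Group requested read IDs by source POD5 file, then build one
    'pod5 filter' command per file that actually contains target reads.
    Returns (ids_per_file, commands); IDs absent from the index are skipped.
    """
    ids_per_file = {}
    for read_id in wanted_ids:
        src = index.get(read_id)
        if src is not None:
            ids_per_file.setdefault(src, []).append(read_id)

    commands = []
    for src in ids_per_file:
        stem = Path(src).stem
        # Each command filters a single file down to just its target reads.
        commands.append([
            "pod5", "filter", src,
            "--ids", f"{out_dir}/{stem}.ids.txt",   # per-file ID list
            "--output", f"{out_dir}/{stem}.pod5",   # extracted subset
        ])
    return ids_per_file, commands

index = {"r1": "run_a.pod5", "r2": "run_a.pod5", "r3": "run_b.pod5"}
ids_per_file, commands = plan_filter_commands(index, ["r2", "r3", "missing"])
# Only two pod5 filter calls are planned, not one per file in the run.
```

The commands could then be executed with subprocess.run; the key point is that files containing none of the requested reads are never opened at all.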
Final thoughts
Working with POD5 files is a common requirement in my own workflows and for others dealing with raw nanopore data, even if most nanopore users only need this functionality occasionally. I hope this tool can be helpful to the community, and that the idea behind it may eventually be incorporated into future releases of the POD5 tools provided by Oxford Nanopore.
Project link: https://github.com/Cuypers-Wim/pod5_fast_extract