Fast processing of fastq files with faster
Description
faster
is a new program I just released for working with fastq files. It is written in Rust and is comparably fast to seqtk
, but offers some useful functionalities:
- get a table with detailed statistics about a fastq file - number of reads, bases, min/max/quartiles lengths, N50, Q20%…, similar to the output of
seqkit stats -a
- get the read lengths for all reads
- get gc content per read
- get geometric mean of phred scores per read
- filter reads based on length
- subsample reads (by proportion of all reads in the file)
Download/run or compile
faster
has a single executable - compiled binaries are available on github for x86_64 Linux, macOS and Windows. You will have to make the file executable (chmod a+x faster
) and for MacOS, allow running external apps in your security settings. If this does not work for you, you will have to compile it yourself (which is pretty easy though). Below is an example on how to setup a Rust toolchain and compile faster
:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone https://github.com/angelovangel/faster.git
cd faster
cargo build --release
# the binary is now under ./target/release/, run it like this:
./target/release/faster /path/to/fastq/file.fastq.gz
# or put the faster executable in your path to make it easier to run
Usage
faster
takes one fastq/fastq.gz file as an argument and, when used with the --table
flag, outputs a tab-separated table with statistics to stdout. There are options to obtain the length, GC-content, ‘mean’ phred scores per read, to filter reads by length, or to sample reads from the fastq file, see -help
for details. For many fastq files, the best way is to run it with gnu parallel
(see example below).
# for help
faster --help # or -h
# get a table with statistics
faster -t /path/to/fastq/file.fastq
# get a table for many files, with parallel
parallel faster -t ::: /path/to/fastq/*.fastq.gz
# again with parallel, but get rid of the table header
parallel faster -t ::: /path/to/fastq/*.fastq.gz | sed -n '/^file/!p'
# subsample a fraction of the reads
faster -s 0.1 /path/to/fastq/file.fastq.gz > subsample.fastq
Benchmarks
To get an idea how faster
compares to other tools, I have benchmarked it with two other popular programs and 3 different datasets. I am aware that these tools have different and often much richer functionality (especially seqkit, I use it all the time), so these comparisons are for orientation only.
The benchmarks were performed with hyperfine (-r 15 --warmup 2
) on a MacBook Pro with an 8-core 2.3 GHz Quad-Core Intel Core i5 and 8 GB RAM. For Illumina reads, faster
is slightly slower than seqstats
(written in C using the klib
library by Heng Li - the fastest thing possible out there), and for Nanopore it is even a bit faster than seqstats
. seqkit stats
performs worse of the three tools tested, but bear in mind the extraordinarily rich functionality it has.
dataset A - a small Nanopore fastq file with 37k reads and 350M bases
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
faster -s datasetA.fastq |
398.1 ± 21.2 | 380.4 | 469.6 | 1.00 |
seqstats datasetA.fastq |
633.6 ± 54.1 | 593.3 | 773.6 | 1.59 ± 0.16 |
seqkit stats -a datasetA.fastq |
1864.5 ± 70.3 | 1828.7 | 2117.3 | 4.68 ± 0.31 |
dataset B - a small Illumina fastq.gz file with ~100k reads
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
faster -s datasetB.fastq.gz |
181.7 ± 2.3 | 177.7 | 184.6 | 1.36 ± 0.09 |
seqstats datasetB.fastq.gz |
133.4 ± 8.4 | 125.7 | 154.2 | 1.00 |
seqkit stats -a datasetB.fastq.gz |
932.6 ± 37.0 | 873.8 | 1028.9 | 6.99 ± 0.52 |
dataset C - a small Illumina iSeq run, 11.5M reads and 1.7G bases, using gnu parallel
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
parallel faster -s ::: *.fastq.gz |
6.438 ± 0.384 | 6.009 | 7.062 | 1.43 ± 0.15 |
parallel seqstats ::: *.fastq.gz |
4.488 ± 0.394 | 4.120 | 5.312 | 1.00 |
parallel seqkit stats -a ::: *.fastq.gz |
40.156 ± 1.747 | 38.762 | 44.132 | 8.95 ± 0.88 |
Reference
faster
uses the excellent Rust-Bio library:
Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.