Fast processing of fastq files with `faster`

Description

faster is a new program I just released for working with fastq files. It is written in Rust, is about as fast as seqtk, and offers some useful extra functionality:

get a table with detailed statistics about a fastq file (number of reads, bases, min/max/quartile read lengths, N50, Q20%…), similar to the output of seqkit stats -a
get the read lengths for all reads
get gc content per read
get geometric mean of phred scores per read
filter reads based on length
subsample reads (by proportion of all reads in the file)
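
N50 is commonly defined as the read length at which reads of that length or longer contain at least half of all bases. The sketch below computes it from a list of read lengths, assuming one length per line on stdin. The helper name n50 is hypothetical, and this is not necessarily how faster computes the value internally.

```shell
# Hypothetical helper: N50 from read lengths, one per line on stdin.
n50() {
  # Sort longest first, then walk down until half of all bases are covered.
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Toy input: 20 bases in total, the two longest reads (6 + 5) cover half.
printf '%s\n' 2 3 4 5 6 | n50   # prints 5
```

The read lengths reported by faster could be piped into such a function; see faster --help for the option that prints them.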

Download/run or compile

faster is a single executable. Compiled binaries are available on GitHub for x86_64 Linux, macOS and Windows. You will have to make the file executable (chmod a+x faster) and, on macOS, allow running external apps in your security settings. If this does not work for you, you will have to compile it yourself, which is pretty easy. Below is an example of how to set up a Rust toolchain and compile faster:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

git clone https://github.com/angelovangel/faster.git

cd faster
cargo build --release

# the binary is now under ./target/release/, run it like this:
./target/release/faster /path/to/fastq/file.fastq.gz

# or put the faster executable in your PATH to make it easier to run
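
One way to put the executable in your PATH is sketched below. The helper name install_bin and the default directory ~/.local/bin are assumptions chosen for illustration; any directory you control works.

```shell
# Hypothetical helper: copy a binary into a directory and prepend that
# directory to PATH for the current shell session.
install_bin() {
  src="$1"
  dest_dir="${2:-$HOME/.local/bin}"   # assumed default location
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  chmod a+x "$dest_dir/$(basename "$src")" || return 1
  PATH="$dest_dir:$PATH"
  export PATH
}

# After compiling, from inside the cloned repository:
# install_bin ./target/release/faster
```

The PATH change only lasts for the current session. To make it permanent, add the export line to your shell profile.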

Usage

faster takes one fastq/fastq.gz file as an argument and, when used with the --table flag, outputs a tab-separated table with statistics to stdout. There are options to obtain the length, GC content or ‘mean’ phred score per read, to filter reads by length, and to subsample reads from the fastq file; see --help for details. To process many fastq files, the best way is to run it with GNU parallel (see the example below).

# for help
faster --help # or -h

# get a table with statistics
faster -t /path/to/fastq/file.fastq

# get a table for many files, with parallel
parallel faster -t ::: /path/to/fastq/*.fastq.gz

# again with parallel, but get rid of the table header
parallel faster -t ::: /path/to/fastq/*.fastq.gz | sed -n '/^file/!p'

# subsample a fraction of the reads
faster -s 0.1 /path/to/fastq/file.fastq.gz > subsample.fastq
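
When collecting the tables from many files into one file, it is often handy to keep the header once instead of dropping it entirely. A minimal sketch, assuming every file produces the same header line; the helper name keep_first_header is hypothetical.

```shell
# Hypothetical helper: print the first line (the header) once and drop
# every later line that is identical to it.
keep_first_header() {
  awk 'NR == 1 { header = $0; print; next } $0 != header'
}

# Possible use with the parallel example above:
# parallel faster -t ::: /path/to/fastq/*.fastq.gz | keep_first_header > stats.tsv
```

Comparing against the first line, rather than matching a fixed pattern, keeps data rows intact even if a file name happens to start with the same word as the header.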

Benchmarks

To get an idea of how faster compares to other tools, I benchmarked it against two other popular programs on three datasets. I am aware that these tools have different and often much richer functionality (especially seqkit, which I use all the time), so these comparisons are for orientation only. The benchmarks were performed with hyperfine (-r 15 --warmup 2) on a MacBook Pro with an 8-core 2.3 GHz Quad-Core Intel Core i5 and 8 GB RAM. For Illumina reads, faster is slightly slower than seqstats (written in C using the klib library by Heng Li, about the fastest thing out there), and for Nanopore reads it is even a bit faster than seqstats. seqkit stats is the slowest of the three tools tested, but bear in mind the extraordinarily rich functionality it has.

dataset A - a small Nanopore fastq file with 37k reads and 350M bases

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`faster -s datasetA.fastq`	398.1 ± 21.2	380.4	469.6	1.00
`seqstats datasetA.fastq`	633.6 ± 54.1	593.3	773.6	1.59 ± 0.16
`seqkit stats -a datasetA.fastq`	1864.5 ± 70.3	1828.7	2117.3	4.68 ± 0.31
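
The Relative column is each mean divided by the fastest mean. The sketch below reproduces the ratios for dataset A from the means in the table; the helper name relative_to_fastest is hypothetical, and the ± uncertainty that hyperfine also reports is not computed.

```shell
# Hypothetical helper: read "name<TAB>mean" lines and print each mean
# relative to the smallest one.
relative_to_fastest() {
  awk -F '\t' '
    { name[NR] = $1; mean[NR] = $2 + 0
      if (NR == 1 || mean[NR] < best) best = mean[NR] }
    END { for (i = 1; i <= NR; i++) printf "%s\t%.2f\n", name[i], mean[i] / best }'
}

# Mean run times in ms, copied from the dataset A table.
printf '%s\t%s\n' faster 398.1 seqstats 633.6 seqkit 1864.5 | relative_to_fastest
# prints 1.00, 1.59 and 4.68 for faster, seqstats and seqkit
```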

dataset B - a small Illumina fastq.gz file with ~100k reads

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`faster -s datasetB.fastq.gz`	181.7 ± 2.3	177.7	184.6	1.36 ± 0.09
`seqstats datasetB.fastq.gz`	133.4 ± 8.4	125.7	154.2	1.00
`seqkit stats -a datasetB.fastq.gz`	932.6 ± 37.0	873.8	1028.9	6.99 ± 0.52

dataset C - a small Illumina iSeq run, 11.5M reads and 1.7G bases, using `gnu parallel`

Command	Mean [s]	Min [s]	Max [s]	Relative
`parallel faster -s ::: *.fastq.gz`	6.438 ± 0.384	6.009	7.062	1.43 ± 0.15
`parallel seqstats ::: *.fastq.gz`	4.488 ± 0.394	4.120	5.312	1.00
`parallel seqkit stats -a ::: *.fastq.gz`	40.156 ± 1.747	38.762	44.132	8.95 ± 0.88

Reference

faster uses the excellent Rust-Bio library:

Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.
