Skip to content

Benchmark results

Head-to-head performance vs R DADA2 on representative datasets. Numbers come from the benchmark harness (see Tooling & metrics) run on our cluster — the datasets are large, so these are run manually and the tables below are regenerated from each run's summary.csv with bench_table.py.

Methodology

  • Function-vs-function. Each mode compares the dada2-rs subcommand against its direct R analog, one process each: dada-pooledpool=TRUE, multi-input dadapool=FALSE, dada-pseudopool="pseudo".
  • Wall is the fair end-to-end time (R as a single process). Speedup = R wall ÷ dada2-rs wall.
  • Peak RSS is the per-process high-water mark. The ratio is reported as dada2-rs ÷ R, so it reads directly as a fraction of the comparator's memory: <1× means dada2-rs uses less (e.g. 0.7× = 70% of R's RSS, a 30% reduction) and >1× means more (2.0× = twice R's). This fraction-of-the- comparator convention is the one commonly used in memory benchmarks to surface a reduction directly, and it handles both wins and increases in one column — at the cost of running in the opposite direction to Speedup (where higher is better). Bold marks the rows where dada2-rs wins the axis.
  • Build: release-native unless noted. See build target.
  • Correctness (ASV concordance) is validated separately — see concordance tooling.

Populate from a cluster run

We generate the results using comparison/benchmark/bench_pooled.py and summarize with:

python3 comparison/benchmark/bench_table.py \
    "pooled=bench_true/summary.csv" \
    "per-sample=bench_false/summary.csv" \
    "pseudo=bench_pseudo/summary.csv"
then paste the table below.

PacBio HiFi

  • dada2-rs v0.1.1-9a0c5da3 (v0.1.1 prerelease, release-native build)
  • Data is from Hergenrother 2024, 93 total samples, PacBio Sequel IIe.
  • Default settings using FASTQ input for dada in both tools
  • 24 threads, 96GB memory allocated on an exclusive node run

Overall workflow

PacBio is benchmarked from --kmer-size 5 (matched to R's fixed KMER_SIZE = 5 — isolates kernel + threading speed) to --kmer-size 7 (dada2-rs recommended default for PacBio — adds the k-mer screen-effectiveness gain). See the PacBio notes.

Overall workflow (remove-primers FASTQ -> removing chimeras)

Run Pooling mode k-mer sample jobs dada2-rs wall (s) R wall (s) Speedup (R÷rs) dada2-rs peak (MB) R peak (MB) Peak RSS (rs÷R)
PacBio_pooled_k5 full 5 NA 4336.8 11885.4 2.7× 29469 40062 0.7×
PacBio_pooled_k6 full 6 NA 1052.0 11885.4* 11.3x 34365 40062* 0.9x
PacBio_pooled_k7 full 7 NA 761.1 11885.4* 15.6x 53781 40062* 1.3x
PacBio_pseudo_k5 pseudo 5 6 2080.3 9299.1 4.5× 6173 3081 2.0×
PacBio_pseudo_k6 pseudo 6 6 496.5 9299.1* 18.7x 6877 3081* 2.2x
PacBio_pseudo_k7 pseudo 7 6 381.2 9299.1* 24.4x 10907 3081* 3.5x
PacBio_nopool_k5 none 5 6 859.0 7219.1 8.4× 5873 2783 2.1×
PacBio_nopool_k6 none 6 6 223.1 7219.1* 32.4x 6206 2783* 2.2x
PacBio_nopool_k7 none 7 6 180.3 7219.1* 40.0x 8640 2783* 3.1x
PacBio_pseudo_k5_j1 pseudo 5 1 2207.9 9299.1* 4.2x 2880 3081* 0.9x
PacBio_pseudo_k6_j1 pseudo 6 1 642.0 9299.1* 14.5x 2801 3081* 0.9x
PacBio_pseudo_k7_j1 pseudo 7 1 555.0 9299.1* 16.8x 3717 3081* 1.2x
PacBio_nopool_k5_j1 none 5 1 914.5 7219.1* 7.9x 2280 2783* 0.8x
PacBio_nopool_k6_j1 none 6 1 288.4 7219.1* 25.0x 2158 2783* 0.8x
PacBio_nopool_k7_j1 none 7 1 262.7 7219.1* 27.5x 2733 2783* 1.0x

* - values from k=5 run.

Preliminary observation and results

Recommendation: use --kmer-size 7 for PacBio runs, especially if your compute has a decent memory footprint. If you need to reduce memory and try to retain throughput, use --kmer-size 7 and --sample-jobs 1 for PacBio runs

k-mer size

  • R DADA2 currently has a fixed (hard-coded) k-mer size of 5 for screening. This appears to have the effect of reducing k-mer screening efficiency almost to zero (a no-op), with almost every unique sequence proceeding to NW alignment.
  • Increasing the k-mer length marginally has a significant effect, dramatically reducing walltime.
  • Preliminary followup (to be added) indicates this has essentially no impact on the ASVs derived post-chimera removal as well as their counts, and may actually increase sensitivity slightly with a few more ASVs recovered in the dada2-rs runs.
  • We are also evaluating the overall effect increasing k-mer sizes have on clustering with the different pooling results. Initial k=8 results confirms this increases memory usage dramatically (>100GB), so we don't anticipate this to have much practical use.

ASVs

  • Pre-chimera results differ slightly from R DADA2 (with some ASVs found unique to both runs). However as noted above these are essentially removed with chimera removal, suggesting their lower abundance

Memory vs walltime

  • dada2-rs currently uses more memory by default (last column) under default settings, particular with higher k-mer settings. This is at the dada-pooled/dada-pseudo/dada stage (see below)
  • dada-pseudo and dada (per-sample) memory overhead can be alleviated to run each sample serially by setting the number of concurrent jobs to one (--sample-jobs 1), with a small walltime penalty.
  • We're looking into potential paths to further reduce this footprint

Per step comparisons

PacBio pooled, k=7 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
remove_primers 19.0 20.9 31 5410.2 284.3×
learn 25.2 20.6 2791 159.7 6.3×
dada 698.8 18.8 53781 6226.2 8.9×
make_table 0.3 1.0 111 0.1 0.2×
remove_bimera 17.8 23.6 103 19.8 1.1×
TOTAL 761.1 19.0 53781 11885.4 15.6×

PacBio pseudo, k=7 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
remove_primers 19.0 20.8 29 5380.1 283.6×
learn 25.1 20.6 2809 155.2 6.2×
dada 329.5 21.7 10907 3688.6 11.2×
make_table 0.2 1.0 75 0.1 0.3×
remove_bimera 7.4 23.5 85 8.3 1.1×
TOTAL 381.2 21.6 10907 9299.1 24.4×

PacBio, no pooling, k=7 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
remove_primers 19.1 20.8 29 5445.1 285.7×
learn 25.1 20.5 2717 159.9 6.4×
dada 131.9 21.0 8640 1542.7 11.7×
make_table 0.1 1.0 54 0.0 0.3×
remove_bimera 4.0 23.3 69 4.5 1.1×
TOTAL 180.3 21.0 8640 7219.1 40.0×

Note that most of the walltime improvement is actually dominated by the remove-primers step (which combines two steps from DADA2: removePrimers and filterAndTrim, and parallelizes both steps; R removePrimers processes data serially). In many cases alternative tools like cutadapt can help close this gap and may prove more flexible for others. However significant improvements are also apparent for both the learn-errors and dada steps for each pooling mode.

Streaming vs cached (--cache-samples)

dada-pseudo streams samples by default (re-reading each sample per round) rather than holding every sample's dereplicated state resident. --cache-samples opts into the older all-in-memory behavior. This characterizes that tradeoff on PacBio (k=5, node-exclusive single rep), comparing streaming (default) as baseline against cached and with R. --sample-jobs 1 is used to make R comparisons more apples-to-apples. Numbers are filtered per stack withcompare_bench.py --stack <stack>; only the dada step moves materially (other steps run on identical inputs).

Stack mode dada wall (s) dada peak Δ wall Δ peak
dada2-rs streaming (default) 2092.4 2.34 GB
dada2-rs cached 2033.5 8.18 GB −2.8% +249%
R-split streaming (default) 3737.8 2.86 GB
R-split cached 3454.6 14.86 GB −7.6% +420%
R-single streaming (default) 3735.1 3.02 GB†
R-single cached 3400.4 15.69 GB† −9.0% +420%

R-single runs the whole pipeline as one process, so peak RSS is only resolvable process-wide (not per step); it reflects the dada phase, which dominates.

Verdict: caching buys a small walltime gain on the dada step at a large peak-memory cost, and the pattern holds across all three stacks. Streaming is the right default; --cache-samples stays an opt-in for the walltime-bound, memory-rich case.

Two cross-stack notes: - streaming-vs-streaming: dada2-rs is leaner and faster on dada (2.34 GB / 2092 s vs 2.86 GB / 3738 s) - cached-vs-cached, dada2-rs is ~1.8× leaner (8.18 GB vs 14.86 GB) — R's resident-derep cache is not magically compact.

See issue #22 for the keep-but-default-off decision.

Pre/post-update A/B (memory + speed work)

A dada2-rs-vs-itself regression check (PacBio, node-exclusive, no cache flag).

pre is main before recent memory/speed improvements, post is current main. This spans the whole update window — a composite of the memory-representation change (#23, qual_sum:[u32] deferred division) and the learn-errors/dada speedups (issue #3 checklist: skip per-iteration setup, parallel build_trans_mat, SIMD kmer_dist8), not an isolated change. R is not a variable here, so no R numbers apply.

All rows are the same PacBio dataset, varying mode and k. learn wall falls ~−13 to −15% in every row below; the dada step is broken out:

mode k dada wall Δ dada CPU Δ dada peak (pre → post) peak Δ
pooled (--pool true) 5 −14.9% −15.5% 36.14 → 28.61 GB −20.8%
pooled (--pool true) 7 −14.4% −15.2% 36.04 → 28.69 GB −20.4%
pseudo (streaming default) 5 −14.9% −15.1% 6.00 → 6.30 GB +4.9%
pseudo (cached, --cache-samples) 7 −16.8% −17.4% 18.27 → 13.12 GB −28.2%

Verdict: the speed win is mode-independentdada wall and CPU fall together (~−15 to −17%) at flat cores (−0.3 to −1.0%) in every mode, i.e. genuinely less work, not better parallel packing. - The memory win is resident-bound: the modes that hold all samples resident shed a large fraction of the high-water mark (pooled −20%, ~7.5 GB; cached −28%, ~5.2 GB), while the streaming pseudo path keeps only a small working set, so there is no RSS reduction to capture (a small ~300 MB uptick, single rep — read as flat, not a regression). - The peak ordering tracks residency exactly — pooled 36 GB > cached 18 GB > streaming 6 GB. Pooled dada peak is ~36 GB at both k values: at that scale the resident all-samples data dominates RSS, the 4^k screen array is washed out, and the memory delta is representation-driven, not k-related.

Illumina MiSeq (F1000, 384 samples)

The majority of recent memory work focused on PacBio data. How does this affect Illumina data processing?

  • dada2-rs v0.1.1-9a0c5da3 (v0.1.1 prerelease, release-native build)
  • Data is the F1000 'MiSeqSOP' full data set: 384 samples, MiSeq v2 run.
  • Default settings using FASTQ input for dada in both tools
  • 24 threads, 96GB memory allocated on an exclusive node run

Overall workflow (filter and trim FASTQ -> removing chimeras)

Run dada2-rs wall (s) R wall (s) Speedup dada2-rs peak (MB) R peak (MB) Peak RSS (rs÷R)
MiSeqSOP_pooled_k5 417.8 1284.6 3.1× 3653 6105 0.6×
MiSeqSOP_pooled_k6 414.0 1284.6* 3.1x 3666 6105* 0.6x
MiSeqSOP_pooled_k7 416.1 1284.6* 3.1x 3660 6105* 0.6x
MiSeqSOP_pseudo_k5 165.6 1224.1 7.4× 1483 2001 0.7×
MiSeqSOP_pseudo_k6 165.5 1224.1* 7.4× 1517 2001* 0.8x
MiSeqSOP_pseudo_k7 164.8 1224.1* 7.4× 1424 2001* 0.7x
MiSeqSOP_nopool_k5 93.2 791.3 8.5× 1554 1673 0.9×
MiSeqSOP_nopool_k6 94.7 791.3 8.4x 1456 1673* 0.9×
MiSeqSOP_nopool_k7 94.2 791.3 8.4x 1414 1673* 0.9×

* - values from k=5 run

Tests prior to memory updates were from 2.4x (pooled, k=5) to 5.9x (no pooling, k=5). This round suggests a modest improvement: 3.1x to 8.5x, all at k=5.

Prior data also suggested a small improvement for pooled samples using a higher k-mer, but in this latest run it seems to disappear, replaced with an overall improvement in performance and memory.

Per step comparisons

MiSeqSOP pooled, k=5 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
filter 11.4 20.3 21 30.8 2.7×
learn_fwd 18.9 21.5 642 83.0 4.4×
learn_rev 23.0 20.7 770 107.8 4.7×
dada_fwd 196.1 12.9 3653 436.1 2.2×
dada_rev 157.8 9.9 3009 355.1 2.3×
merge 7.2 15.6 1652 245.8 34.1×
make_table 0.6 1.0 337 0.3 0.5×
remove_bimera 2.8 23.1 47 2.6 0.9×
TOTAL 417.8 12.9 3653 1284.6 3.1×

MiSeqSOP, pseudo, k5 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
filter 11.3 20.4 21 31.7 2.8×
learn_fwd 18.7 21.6 648 83.2 4.4×
learn_rev 22.4 21.1 772 109.4 4.9×
dada_fwd 65.4 20.8 686 433.9 6.6×
dada_rev 43.1 19.9 568 332.9 7.7×
merge 4.1 17.1 1483 208.9 51.5×
make_table 0.2 1.0 144 0.1 0.6×
remove_bimera 0.3 20.6 21 0.2 0.8×
TOTAL 165.6 20.5 1483 1224.1 7.4×

MiSeqSOP, no pooling, k5 — dada2-rs vs R (R-single end-to-end wall)

Step dada2-rs wall (s) dada2-rs cores dada2-rs peak (MB) R wall (s) Speedup
filter 11.2 20.6 21 31.1 2.8×
learn_fwd 19.0 21.5 635 80.8 4.3×
learn_rev 22.4 21.0 769 107.6 4.8×
dada_fwd 22.2 19.9 573 196.4 8.8×
dada_rev 14.8 19.0 467 155.5 10.5×
merge 3.4 15.5 1554 198.5 59.1×
make_table 0.1 1.0 62 0.1 1.0×
remove_bimera 0.1 17.8 21 0.1 0.8×
TOTAL 93.2 20.2 1554 791.3 8.5×

Notably the big improvement is with merge-pairs, largely due to threading, though overall steps are generally faster. One interesting point: the full pooling run suggests that the two dada steps are not fully utilizing all cores, which may be a further point to assess.

Regenerating these tables

After a run, distill its summary.csv to Markdown:

# scorecard across modes (one row per run)
python3 comparison/benchmark/bench_table.py \
    "pooled=bench_true/summary.csv" \
    "per-sample=bench_false/summary.csv" \
    "pseudo=bench_pseudo/summary.csv"

# per-step breakdown for one run
python3 comparison/benchmark/bench_table.py --per-step bench_pseudo/summary.csv

Paste the output into the tables above.