Benchmark results¶
Head-to-head performance vs R DADA2 on representative datasets. Numbers come from
the benchmark harness (see Tooling & metrics) run on our
cluster — the datasets are large, so these are run manually and the tables below
are regenerated from each run's summary.csv with
bench_table.py.
Methodology¶
- Function-vs-function. Each mode compares the dada2-rs subcommand against
its direct R analog, one process each:
dada-pooled↔pool=TRUE, multi-inputdada↔pool=FALSE,dada-pseudo↔pool="pseudo". - Wall is the fair end-to-end time (R as a single process). Speedup = R wall ÷ dada2-rs wall.
- Peak RSS is the per-process high-water mark. The ratio is reported as dada2-rs ÷ R, so it reads directly as a fraction of the comparator's memory: <1× means dada2-rs uses less (e.g. 0.7× = 70% of R's RSS, a 30% reduction) and >1× means more (2.0× = twice R's). This fraction-of-the- comparator convention is the one commonly used in memory benchmarks to surface a reduction directly, and it handles both wins and increases in one column — at the cost of running in the opposite direction to Speedup (where higher is better). Bold marks the rows where dada2-rs wins the axis.
- Build:
release-nativeunless noted. See build target. - Correctness (ASV concordance) is validated separately — see concordance tooling.
Populate from a cluster run
We generate the results using comparison/benchmark/bench_pooled.py
and summarize with:
PacBio HiFi¶
dada2-rs v0.1.1-9a0c5da3(v0.1.1 prerelease,release-nativebuild)- Data is from Hergenrother 2024, 93 total samples, PacBio Sequel IIe.
- Default settings using FASTQ input for
dadain both tools - 24 threads, 96GB memory allocated on an exclusive node run
Overall workflow¶
PacBio is benchmarked from --kmer-size 5 (matched to R's fixed
KMER_SIZE = 5 — isolates kernel + threading speed) to --kmer-size 7
(dada2-rs recommended default for PacBio — adds the k-mer screen-effectiveness gain). See the PacBio notes.
Overall workflow (remove-primers FASTQ -> removing chimeras)
| Run | Pooling mode | k-mer | sample jobs | dada2-rs wall (s) | R wall (s) | Speedup (R÷rs) | dada2-rs peak (MB) | R peak (MB) | Peak RSS (rs÷R) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio_pooled_k5 | full | 5 | NA | 4336.8 | 11885.4 | 2.7× | 29469 | 40062 | 0.7× |
| PacBio_pooled_k6 | full | 6 | NA | 1052.0 | 11885.4* | 11.3x | 34365 | 40062* | 0.9x |
| PacBio_pooled_k7 | full | 7 | NA | 761.1 | 11885.4* | 15.6x | 53781 | 40062* | 1.3x |
| PacBio_pseudo_k5 | pseudo | 5 | 6 | 2080.3 | 9299.1 | 4.5× | 6173 | 3081 | 2.0× |
| PacBio_pseudo_k6 | pseudo | 6 | 6 | 496.5 | 9299.1* | 18.7x | 6877 | 3081* | 2.2x |
| PacBio_pseudo_k7 | pseudo | 7 | 6 | 381.2 | 9299.1* | 24.4x | 10907 | 3081* | 3.5x |
| PacBio_nopool_k5 | none | 5 | 6 | 859.0 | 7219.1 | 8.4× | 5873 | 2783 | 2.1× |
| PacBio_nopool_k6 | none | 6 | 6 | 223.1 | 7219.1* | 32.4x | 6206 | 2783* | 2.2x |
| PacBio_nopool_k7 | none | 7 | 6 | 180.3 | 7219.1* | 40.0x | 8640 | 2783* | 3.1x |
| PacBio_pseudo_k5_j1 | pseudo | 5 | 1 | 2207.9 | 9299.1* | 4.2x | 2880 | 3081* | 0.9x |
| PacBio_pseudo_k6_j1 | pseudo | 6 | 1 | 642.0 | 9299.1* | 14.5x | 2801 | 3081* | 0.9x |
| PacBio_pseudo_k7_j1 | pseudo | 7 | 1 | 555.0 | 9299.1* | 16.8x | 3717 | 3081* | 1.2x |
| PacBio_nopool_k5_j1 | none | 5 | 1 | 914.5 | 7219.1* | 7.9x | 2280 | 2783* | 0.8x |
| PacBio_nopool_k6_j1 | none | 6 | 1 | 288.4 | 7219.1* | 25.0x | 2158 | 2783* | 0.8x |
| PacBio_nopool_k7_j1 | none | 7 | 1 | 262.7 | 7219.1* | 27.5x | 2733 | 2783* | 1.0x |
* - values from k=5 run.
Preliminary observation and results¶
Recommendation: use --kmer-size 7 for PacBio runs, especially if your compute has a decent memory footprint. If you need to reduce memory and try to retain throughput, use --kmer-size 7 and --sample-jobs 1 for PacBio runs
k-mer size¶
- R DADA2 currently has a fixed (hard-coded) k-mer size of 5 for screening. This appears to have the effect of reducing k-mer screening efficiency almost to zero (a no-op), with almost every unique sequence proceeding to NW alignment.
- Increasing the k-mer length marginally has a significant effect, dramatically reducing walltime.
- Preliminary followup (to be added) indicates this has essentially no impact on the ASVs derived post-chimera removal as well as their counts, and may actually increase sensitivity slightly with a few more ASVs recovered in the
dada2-rsruns. - We are also evaluating the overall effect increasing k-mer sizes have on clustering with the different pooling results. Initial k=8 results confirms this increases memory usage dramatically (>100GB), so we don't anticipate this to have much practical use.
ASVs¶
- Pre-chimera results differ slightly from R DADA2 (with some ASVs found unique to both runs). However as noted above these are essentially removed with chimera removal, suggesting their lower abundance
Memory vs walltime¶
dada2-rscurrently uses more memory by default (last column) under default settings, particular with higher k-mer settings. This is at thedada-pooled/dada-pseudo/dadastage (see below)dada-pseudoanddada(per-sample) memory overhead can be alleviated to run each sample serially by setting the number of concurrent jobs to one (--sample-jobs 1), with a small walltime penalty.- We're looking into potential paths to further reduce this footprint
Per step comparisons¶
PacBio pooled, k=7 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| remove_primers | 19.0 | 20.9 | 31 | 5410.2 | 284.3× |
| learn | 25.2 | 20.6 | 2791 | 159.7 | 6.3× |
| dada | 698.8 | 18.8 | 53781 | 6226.2 | 8.9× |
| make_table | 0.3 | 1.0 | 111 | 0.1 | 0.2× |
| remove_bimera | 17.8 | 23.6 | 103 | 19.8 | 1.1× |
| TOTAL | 761.1 | 19.0 | 53781 | 11885.4 | 15.6× |
PacBio pseudo, k=7 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| remove_primers | 19.0 | 20.8 | 29 | 5380.1 | 283.6× |
| learn | 25.1 | 20.6 | 2809 | 155.2 | 6.2× |
| dada | 329.5 | 21.7 | 10907 | 3688.6 | 11.2× |
| make_table | 0.2 | 1.0 | 75 | 0.1 | 0.3× |
| remove_bimera | 7.4 | 23.5 | 85 | 8.3 | 1.1× |
| TOTAL | 381.2 | 21.6 | 10907 | 9299.1 | 24.4× |
PacBio, no pooling, k=7 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| remove_primers | 19.1 | 20.8 | 29 | 5445.1 | 285.7× |
| learn | 25.1 | 20.5 | 2717 | 159.9 | 6.4× |
| dada | 131.9 | 21.0 | 8640 | 1542.7 | 11.7× |
| make_table | 0.1 | 1.0 | 54 | 0.0 | 0.3× |
| remove_bimera | 4.0 | 23.3 | 69 | 4.5 | 1.1× |
| TOTAL | 180.3 | 21.0 | 8640 | 7219.1 | 40.0× |
Note that most of the walltime improvement is actually dominated by the remove-primers step (which combines two steps from DADA2: removePrimers and filterAndTrim, and parallelizes both steps; R removePrimers processes data serially). In many cases alternative tools like cutadapt can help close this gap and may prove more flexible for others. However significant improvements are also apparent for both the learn-errors and dada steps for each pooling mode.
Streaming vs cached (--cache-samples)¶
dada-pseudo streams samples by default (re-reading each sample per round) rather
than holding every sample's dereplicated state resident. --cache-samples opts
into the older all-in-memory behavior. This characterizes that tradeoff on PacBio
(k=5, node-exclusive single rep), comparing streaming (default) as baseline
against cached and with R. --sample-jobs 1 is used to make R comparisons more apples-to-apples. Numbers are filtered per stack withcompare_bench.py --stack <stack>; only the dada step moves materially (other
steps run on identical inputs).
| Stack | mode | dada wall (s) | dada peak | Δ wall | Δ peak |
|---|---|---|---|---|---|
| dada2-rs | streaming (default) | 2092.4 | 2.34 GB | — | — |
| dada2-rs | cached | 2033.5 | 8.18 GB | −2.8% | +249% |
| R-split | streaming (default) | 3737.8 | 2.86 GB | — | — |
| R-split | cached | 3454.6 | 14.86 GB | −7.6% | +420% |
| R-single | streaming (default) | 3735.1 | 3.02 GB† | — | — |
| R-single | cached | 3400.4 | 15.69 GB† | −9.0% | +420% |
† R-single runs the whole pipeline as one process, so peak RSS is only resolvable
process-wide (not per step); it reflects the dada phase, which dominates.
Verdict: caching buys a small walltime gain on the dada step at a large peak-memory
cost, and the pattern holds across all three stacks. Streaming is the right
default; --cache-samples stays an opt-in for the walltime-bound, memory-rich case.
Two cross-stack notes:
- streaming-vs-streaming: dada2-rs is leaner and faster on dada (2.34 GB / 2092 s vs 2.86 GB / 3738 s)
- cached-vs-cached, dada2-rs is ~1.8× leaner (8.18 GB vs 14.86 GB) — R's resident-derep cache is not magically compact.
See issue #22 for the keep-but-default-off decision.
Pre/post-update A/B (memory + speed work)¶
A dada2-rs-vs-itself regression check (PacBio, node-exclusive, no cache flag).
pre is main before recent memory/speed improvements, post is current main. This spans the whole update window — a composite of the memory-representation change (#23, qual_sum:[u32] deferred division) and the learn-errors/dada speedups (issue #3 checklist: skip per-iteration setup, parallel build_trans_mat, SIMD kmer_dist8), not an isolated change. R is not a variable here, so no R numbers apply.
All rows are the same PacBio dataset, varying mode and k. learn wall falls ~−13 to −15% in every row below; the dada step is broken out:
| mode | k | dada wall Δ | dada CPU Δ | dada peak (pre → post) | peak Δ |
|---|---|---|---|---|---|
pooled (--pool true) |
5 | −14.9% | −15.5% | 36.14 → 28.61 GB | −20.8% |
pooled (--pool true) |
7 | −14.4% | −15.2% | 36.04 → 28.69 GB | −20.4% |
| pseudo (streaming default) | 5 | −14.9% | −15.1% | 6.00 → 6.30 GB | +4.9% |
pseudo (cached, --cache-samples) |
7 | −16.8% | −17.4% | 18.27 → 13.12 GB | −28.2% |
Verdict: the speed win is mode-independent
— dada wall and CPU fall together (~−15 to −17%) at flat cores (−0.3 to −1.0%) in every mode, i.e. genuinely less work, not better parallel packing.
- The memory win is resident-bound: the modes that hold all samples resident shed a large fraction of the high-water mark (pooled −20%, ~7.5 GB; cached −28%, ~5.2 GB), while the streaming pseudo path keeps only a small working set, so there is no RSS reduction to capture (a small ~300 MB uptick, single rep — read as flat, not a regression).
- The peak ordering tracks residency exactly — pooled 36 GB > cached 18 GB > streaming 6 GB. Pooled dada peak is ~36 GB at both k values: at that scale the resident all-samples data dominates RSS, the 4^k screen array is washed out, and the memory delta is representation-driven, not k-related.
Illumina MiSeq (F1000, 384 samples)¶
The majority of recent memory work focused on PacBio data. How does this affect Illumina data processing?
dada2-rs v0.1.1-9a0c5da3(v0.1.1 prerelease,release-nativebuild)- Data is the F1000 'MiSeqSOP' full data set: 384 samples, MiSeq v2 run.
- Default settings using FASTQ input for
dadain both tools - 24 threads, 96GB memory allocated on an exclusive node run
Overall workflow (filter and trim FASTQ -> removing chimeras)
| Run | dada2-rs wall (s) | R wall (s) | Speedup | dada2-rs peak (MB) | R peak (MB) | Peak RSS (rs÷R) |
|---|---|---|---|---|---|---|
| MiSeqSOP_pooled_k5 | 417.8 | 1284.6 | 3.1× | 3653 | 6105 | 0.6× |
| MiSeqSOP_pooled_k6 | 414.0 | 1284.6* | 3.1x | 3666 | 6105* | 0.6x |
| MiSeqSOP_pooled_k7 | 416.1 | 1284.6* | 3.1x | 3660 | 6105* | 0.6x |
| MiSeqSOP_pseudo_k5 | 165.6 | 1224.1 | 7.4× | 1483 | 2001 | 0.7× |
| MiSeqSOP_pseudo_k6 | 165.5 | 1224.1* | 7.4× | 1517 | 2001* | 0.8x |
| MiSeqSOP_pseudo_k7 | 164.8 | 1224.1* | 7.4× | 1424 | 2001* | 0.7x |
| MiSeqSOP_nopool_k5 | 93.2 | 791.3 | 8.5× | 1554 | 1673 | 0.9× |
| MiSeqSOP_nopool_k6 | 94.7 | 791.3 | 8.4x | 1456 | 1673* | 0.9× |
| MiSeqSOP_nopool_k7 | 94.2 | 791.3 | 8.4x | 1414 | 1673* | 0.9× |
* - values from k=5 run
Tests prior to memory updates were from 2.4x (pooled, k=5) to 5.9x (no pooling, k=5). This round suggests a modest improvement: 3.1x to 8.5x, all at k=5.
Prior data also suggested a small improvement for pooled samples using a higher k-mer, but in this latest run it seems to disappear, replaced with an overall improvement in performance and memory.
Per step comparisons¶
MiSeqSOP pooled, k=5 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| filter | 11.4 | 20.3 | 21 | 30.8 | 2.7× |
| learn_fwd | 18.9 | 21.5 | 642 | 83.0 | 4.4× |
| learn_rev | 23.0 | 20.7 | 770 | 107.8 | 4.7× |
| dada_fwd | 196.1 | 12.9 | 3653 | 436.1 | 2.2× |
| dada_rev | 157.8 | 9.9 | 3009 | 355.1 | 2.3× |
| merge | 7.2 | 15.6 | 1652 | 245.8 | 34.1× |
| make_table | 0.6 | 1.0 | 337 | 0.3 | 0.5× |
| remove_bimera | 2.8 | 23.1 | 47 | 2.6 | 0.9× |
| TOTAL | 417.8 | 12.9 | 3653 | 1284.6 | 3.1× |
MiSeqSOP, pseudo, k5 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| filter | 11.3 | 20.4 | 21 | 31.7 | 2.8× |
| learn_fwd | 18.7 | 21.6 | 648 | 83.2 | 4.4× |
| learn_rev | 22.4 | 21.1 | 772 | 109.4 | 4.9× |
| dada_fwd | 65.4 | 20.8 | 686 | 433.9 | 6.6× |
| dada_rev | 43.1 | 19.9 | 568 | 332.9 | 7.7× |
| merge | 4.1 | 17.1 | 1483 | 208.9 | 51.5× |
| make_table | 0.2 | 1.0 | 144 | 0.1 | 0.6× |
| remove_bimera | 0.3 | 20.6 | 21 | 0.2 | 0.8× |
| TOTAL | 165.6 | 20.5 | 1483 | 1224.1 | 7.4× |
MiSeqSOP, no pooling, k5 — dada2-rs vs R (R-single end-to-end wall)
| Step | dada2-rs wall (s) | dada2-rs cores | dada2-rs peak (MB) | R wall (s) | Speedup |
|---|---|---|---|---|---|
| filter | 11.2 | 20.6 | 21 | 31.1 | 2.8× |
| learn_fwd | 19.0 | 21.5 | 635 | 80.8 | 4.3× |
| learn_rev | 22.4 | 21.0 | 769 | 107.6 | 4.8× |
| dada_fwd | 22.2 | 19.9 | 573 | 196.4 | 8.8× |
| dada_rev | 14.8 | 19.0 | 467 | 155.5 | 10.5× |
| merge | 3.4 | 15.5 | 1554 | 198.5 | 59.1× |
| make_table | 0.1 | 1.0 | 62 | 0.1 | 1.0× |
| remove_bimera | 0.1 | 17.8 | 21 | 0.1 | 0.8× |
| TOTAL | 93.2 | 20.2 | 1554 | 791.3 | 8.5× |
Notably the big improvement is with merge-pairs, largely due to threading, though overall steps are generally faster. One interesting point: the full pooling run suggests that the two dada steps are not fully utilizing all cores, which may be a further point to assess.
Regenerating these tables¶
After a run, distill its summary.csv to Markdown:
# scorecard across modes (one row per run)
python3 comparison/benchmark/bench_table.py \
"pooled=bench_true/summary.csv" \
"per-sample=bench_false/summary.csv" \
"pseudo=bench_pseudo/summary.csv"
# per-step breakdown for one run
python3 comparison/benchmark/bench_table.py --per-step bench_pseudo/summary.csv
Paste the output into the tables above.