Hey everyone,
I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.
Ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29GB) on a single Hetzner CCX43 instance.
Exact dedup vs DuckDB + SHA-256
|                    | fastdedup | DuckDB  |
|--------------------|-----------|---------|
| Wall clock (mm:ss) | 2:55      | 7:55    |
| Peak RAM           | 688 MB    | 21.9 GB |
| CPU cores          | 1         | 4+      |
| Records/sec        | ~85,000   | ~31,000 |
| Duplicates removed | 51,392    | 51,392  |
2.7x faster, 32x less RAM, on a single core vs 4+. Duplicate counts match exactly.
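For anyone curious what hash-based exact dedup looks like under the hood, here's a minimal Rust sketch of the general technique (simplified, not fastdedup's actual code; it assumes the `sha2` crate): hash each record's text and keep only the first occurrence of each digest.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashSet;

/// Keep the first occurrence of each distinct text, comparing SHA-256 digests.
fn dedup_exact<'a>(records: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen: HashSet<[u8; 32]> = HashSet::new();
    let mut kept = Vec::new();
    for text in records {
        let mut digest = [0u8; 32];
        digest.copy_from_slice(&Sha256::digest(text));
        // Only 32-byte digests are retained, so memory grows with the number
        // of records seen rather than with the total text size.
        if seen.insert(digest) {
            kept.push(text);
        }
    }
    kept
}
```

Keeping only digests in memory is why peak RAM can stay far below the dataset size.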
Fuzzy dedup (MinHash + LSH) vs datatrove
|                    | fastdedup      | datatrove                         |
|--------------------|----------------|-----------------------------------|
| Wall clock         | 36:44          | 3h50m+ (stage 1 only, terminated) |
| Peak RAM           | 23 GB          | 1.1 GB                            |
| Completed          | Y              | N                                 |
| Duplicates removed | 105,044 (0.7%) | —                                 |
datatrove’s stage 1 alone ran for 3h50m and we terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling — fastdedup uses character n-grams directly which is significantly cheaper.
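To make the tokenization difference concrete: character n-gram shingling is essentially a single pass over the characters. Here's a simplified sketch of the general idea in Rust (not the exact fastdedup code; the shingle length is just an example):

```rust
use std::collections::HashSet;

/// Produce the set of character n-grams (shingles) for a document.
/// No word-tokenization step is needed; this is one linear pass.
fn char_shingles(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}
```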
On the RAM trade-off: datatrove streams to disk keeping RAM low at the cost of heavy I/O between stages. fastdedup holds the LSH index in memory for speed. Different trade-offs — worth being transparent about.
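For context on what "holding the LSH index in memory" means: the standard MinHash banding scheme splits each signature into bands and buckets documents by a hash of each band, and documents sharing a bucket become candidate pairs. A rough sketch of that structure (the textbook technique, not fastdedup's internals):

```rust
use std::collections::HashMap;

/// In-memory LSH index: each MinHash signature is split into `bands` bands of
/// `rows` values, and documents are bucketed per band by a hash of that band.
struct LshIndex {
    bands: usize,
    rows: usize,
    tables: Vec<HashMap<u64, Vec<usize>>>, // one bucket table per band
}

impl LshIndex {
    fn new(bands: usize, rows: usize) -> Self {
        Self { bands, rows, tables: vec![HashMap::new(); bands] }
    }

    /// Insert a signature; return earlier doc ids sharing at least one bucket.
    fn insert(&mut self, doc_id: usize, signature: &[u64]) -> Vec<usize> {
        assert_eq!(signature.len(), self.bands * self.rows);
        let mut candidates = Vec::new();
        for (band, table) in self.tables.iter_mut().enumerate() {
            let slice = &signature[band * self.rows..(band + 1) * self.rows];
            // Toy band hash; a real implementation would use a proper hasher.
            let key = slice.iter().fold(0u64, |acc, &v| acc.rotate_left(5) ^ v);
            let bucket = table.entry(key).or_default();
            candidates.extend(bucket.iter().copied());
            bucket.push(doc_id);
        }
        candidates
    }
}
```

Every bucket table lives in RAM, which is essentially where the ~23 GB figure comes from at this scale.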
Caveats
- Fuzzy dedup requires ~23GB RAM at this scale — cloud instance workload, not a laptop workload
- datatrove is designed for distributed execution and tasks=1 is not its optimal configuration — this benchmark represents how someone might run it locally
- Tiered storage to spill the LSH index to disk is on the roadmap
There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]
Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]
Happy to answer questions about the implementation or methodology.
This is an impressive benchmark! Rust for data processing is definitely the way to go.
Some questions:
- Are you planning to support fuzzy dedup with different similarity thresholds?
- Could this be integrated with the Datatrove library for a unified pipeline?
- Any plans for distributed processing across multiple machines?
The 2.7x speedup with 32x less RAM is huge for large-scale dataset preprocessing. Have you considered publishing this as a HuggingFace Space or integrating with the datasets library?
Great work!
Thanks so much! Really appreciate the kind words.
Re: your questions:
1. Fuzzy dedup with different similarity thresholds?
Already supported! The --threshold flag lets you set any Jaccard similarity threshold (0.0-1.0). For example:
```bash
fastdedup fuzzy --input data.parquet --output clean.parquet \
  --threshold 0.85 --field text
```
You can also configure the number of MinHash permutations with --num-hashes (default 128). Higher values = more accurate but slower.
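One note on tuning: the classic MinHash/LSH heuristic is that with b bands of r rows each (b × r = number of hashes), the scheme starts reliably catching pairs above roughly (1/b)^(1/r) similarity. A quick sketch of that calculation (the textbook rule of thumb, not necessarily how fastdedup maps --threshold to band/row counts internally):

```rust
/// Approximate Jaccard similarity at which an LSH scheme with `bands` bands of
/// `rows` rows each starts reliably catching pairs: t ≈ (1/bands)^(1/rows).
fn lsh_threshold(bands: usize, rows: usize) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}

fn main() {
    // 128 hashes split as 32 bands x 4 rows -> threshold ≈ 0.42
    // 128 hashes split as 16 bands x 8 rows -> threshold ≈ 0.71
    println!("{:.2}", lsh_threshold(32, 4));
    println!("{:.2}", lsh_threshold(16, 8));
}
```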
2. Datatrove integration?
Interesting idea! Right now fastdedup is a standalone CLI, but I've considered adding Python bindings: I could expose the Rust core via PyO3 for direct Python import.
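If I go that route, the binding layer would look roughly like this (purely a hypothetical sketch with PyO3; the function name and signature here are made up, nothing like this exists yet):

```rust
use pyo3::prelude::*;

/// Hypothetical Python-facing wrapper around a Rust fuzzy-dedup core.
/// Returns the indices of records that would survive deduplication.
#[pyfunction]
fn fuzzy_dedup(texts: Vec<String>, threshold: f64, num_hashes: usize) -> PyResult<Vec<usize>> {
    // A real binding would call into the same core the CLI uses;
    // this stub just returns every index so the module compiles on its own.
    let _ = (threshold, num_hashes);
    Ok((0..texts.len()).collect())
}

/// Module definition for `import fastdedup` (PyO3 0.21+ Bound API).
#[pymodule]
fn fastdedup(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fuzzy_dedup, m)?)?;
    Ok(())
}
```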
3. Distributed processing?
Not currently planned. The design philosophy is single-machine efficiency.
That said, the architecture could support it.
But honestly, single-machine performance might be enough for most use cases. What scale are you thinking?
Re: HuggingFace integration:
The demo Space is already live: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo
A proper Python package would probably be cleaner long-term, and that's most likely what will happen next.
What would be most useful for your workflow? Happy to prioritize based on real use cases! Thanks again.