Hey everyone,
I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.
Ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29GB) on a single Hetzner CCX43 instance.
Exact dedup vs DuckDB + SHA-256
|                    | fastdedup | DuckDB  |
|--------------------|-----------|---------|
| Wall clock (mm:ss) | 2:55      | 7:55    |
| Peak RAM           | 688 MB    | 21.9 GB |
| CPU cores          | 1         | 4+      |
| Records/sec        | ~85,000   | ~31,000 |
| Duplicates removed | 51,392    | 51,392  |
2.7x faster, 32x less RAM, on a single core vs 4+. Duplicate counts match exactly.
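For anyone curious what hash-based exact dedup looks like under the hood, here's a minimal Rust sketch of the general technique (simplified, not fastdedup's actual code; it assumes the `sha2` crate): hash each record's text and keep only the first occurrence of each digest.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashSet;

/// Keep the first occurrence of each distinct text, comparing SHA-256 digests.
fn dedup_exact<'a>(records: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen: HashSet<[u8; 32]> = HashSet::new();
    let mut kept = Vec::new();
    for text in records {
        let mut digest = [0u8; 32];
        digest.copy_from_slice(&Sha256::digest(text));
        // Only 32-byte digests are retained, so memory grows with the number
        // of records seen rather than with the total text size.
        if seen.insert(digest) {
            kept.push(text);
        }
    }
    kept
}
```

Keeping only digests in memory is why peak RAM can stay far below the dataset size.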
Fuzzy dedup (MinHash + LSH) vs datatrove
|                    | fastdedup      | datatrove                         |
|--------------------|----------------|-----------------------------------|
| Wall clock         | 36:44          | 3h50m+ (stage 1 only, terminated) |
| Peak RAM           | 23 GB          | 1.1 GB                            |
| Completed          | Y              | N                                 |
| Duplicates removed | 105,044 (0.7%) | —                                 |
datatrove’s stage 1 alone ran for 3h50m and we terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling — fastdedup uses character n-grams directly which is significantly cheaper.
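To make the tokenization difference concrete: character n-gram shingling is essentially a single pass over the characters. Here's a simplified sketch of the general idea in Rust (not the exact fastdedup code; the shingle length is just an example):

```rust
use std::collections::HashSet;

/// Produce the set of character n-grams (shingles) for a document.
/// No word-tokenization step is needed; this is one linear pass.
fn char_shingles(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}
```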
On the RAM trade-off: datatrove streams to disk keeping RAM low at the cost of heavy I/O between stages. fastdedup holds the LSH index in memory for speed. Different trade-offs — worth being transparent about.
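For context on what "holding the LSH index in memory" means: the standard MinHash banding scheme splits each signature into bands and buckets documents by a hash of each band, and documents sharing a bucket become candidate pairs. A rough sketch of that structure (the textbook technique, not fastdedup's internals):

```rust
use std::collections::HashMap;

/// In-memory LSH index: each MinHash signature is split into `bands` bands of
/// `rows` values, and documents are bucketed per band by a hash of that band.
struct LshIndex {
    bands: usize,
    rows: usize,
    tables: Vec<HashMap<u64, Vec<usize>>>, // one bucket table per band
}

impl LshIndex {
    fn new(bands: usize, rows: usize) -> Self {
        Self { bands, rows, tables: vec![HashMap::new(); bands] }
    }

    /// Insert a signature; return earlier doc ids sharing at least one bucket.
    fn insert(&mut self, doc_id: usize, signature: &[u64]) -> Vec<usize> {
        assert_eq!(signature.len(), self.bands * self.rows);
        let mut candidates = Vec::new();
        for (band, table) in self.tables.iter_mut().enumerate() {
            let slice = &signature[band * self.rows..(band + 1) * self.rows];
            // Toy band hash; a real implementation would use a proper hasher.
            let key = slice.iter().fold(0u64, |acc, &v| acc.rotate_left(5) ^ v);
            let bucket = table.entry(key).or_default();
            candidates.extend(bucket.iter().copied());
            bucket.push(doc_id);
        }
        candidates
    }
}
```

Every bucket table lives in RAM, which is essentially where the ~23 GB figure comes from at this scale.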
Caveats
- Fuzzy dedup requires ~23GB RAM at this scale — cloud instance workload, not a laptop workload
- datatrove is designed for distributed execution and tasks=1 is not its optimal configuration — this benchmark represents how someone might run it locally
- Tiered storage to spill the LSH index to disk is on the roadmap
There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]
Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]
Happy to answer questions about the implementation or methodology.
This is an impressive benchmark! Rust for data processing is definitely the way to go.
Some questions:
- Are you planning to support fuzzy dedup with different similarity thresholds?
- Could this be integrated with the Datatrove library for a unified pipeline?
- Any plans for distributed processing across multiple machines?
The 2.7x speedup with 32x less RAM is huge for large-scale dataset preprocessing. Have you considered publishing this as a HuggingFace Space or integrating with the datasets library?
Great work!
Thanks so much! Really appreciate the kind words.
Re: your questions:
1. Fuzzy dedup with different similarity thresholds?
Already supported! The --threshold flag lets you set any Jaccard similarity threshold (0.0-1.0). For example:
```bash
fastdedup fuzzy --input data.parquet --output clean.parquet \
  --threshold 0.85 --field text
```
You can also configure the number of MinHash permutations with --num-hashes (default 128). Higher values = more accurate but slower.
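One note on tuning: the classic MinHash/LSH heuristic is that with b bands of r rows each (b × r = number of hashes), the scheme starts reliably catching pairs above roughly (1/b)^(1/r) similarity. A quick sketch of that calculation (the textbook rule of thumb, not necessarily how fastdedup maps --threshold to band/row counts internally):

```rust
/// Approximate Jaccard similarity at which an LSH scheme with `bands` bands of
/// `rows` rows each starts reliably catching pairs: t ≈ (1/bands)^(1/rows).
fn lsh_threshold(bands: usize, rows: usize) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}

fn main() {
    // 128 hashes split as 32 bands x 4 rows -> threshold ≈ 0.42
    // 128 hashes split as 16 bands x 8 rows -> threshold ≈ 0.71
    println!("{:.2}", lsh_threshold(32, 4));
    println!("{:.2}", lsh_threshold(16, 8));
}
```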
2. Datatrove integration?
Interesting idea! Right now fastdedup is a standalone CLI, but I've considered adding Python bindings: I could expose the Rust core via PyO3 for direct Python import.
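If I go that route, the binding layer would look roughly like this (purely a hypothetical sketch with PyO3; the function name and signature here are made up, nothing like this exists yet):

```rust
use pyo3::prelude::*;

/// Hypothetical Python-facing wrapper around a Rust fuzzy-dedup core.
/// Returns the indices of records that would survive deduplication.
#[pyfunction]
fn fuzzy_dedup(texts: Vec<String>, threshold: f64, num_hashes: usize) -> PyResult<Vec<usize>> {
    // A real binding would call into the same core the CLI uses;
    // this stub just returns every index so the module compiles on its own.
    let _ = (threshold, num_hashes);
    Ok((0..texts.len()).collect())
}

/// Module definition for `import fastdedup` (PyO3 0.21+ Bound API).
#[pymodule]
fn fastdedup(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fuzzy_dedup, m)?)?;
    Ok(())
}
```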
3. Distributed processing?
Not currently planned. The design philosophy is single-machine efficiency.
That said, the architecture could support it.
But honestly, single-machine performance might be enough for most use cases. What scale are you thinking?
Re: HuggingFace integration:
The demo Space is already live: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo
A proper Python package would probably be cleaner long-term, and that's most likely what will happen next.
What would be most useful for your workflow? Happy to prioritize based on real use cases! Thanks again.