Calculate MinHash Duplicate Ratio in Pretraining Data
Pretraining corpora for large language models now span trillions of tokens. The portion of those tokens that represent exact or near-exact duplicates of other tokens in the same corpus directly…

Pretraining corpora for large language models now span trillions of tokens. The portion of those tokens that represent exact or near-exact duplicates of other tokens in the same corpus directly affects compute efficiency, contamination risk, and downstream benchmark validity. MinHash, combined with Locality-Sensitive Hashing (LSH), provides the dominant method for measuring this overlap at scale. Calculating the duplicate ratio with this method requires explicit choices around n-gram size, signature length, similarity threshold, and bucketing strategy. Each choice changes the measured ratio. Each carries trade-offs between precision, recall, and memory.
The Mechanics of MinHash for Lexical Overlap Detection
MinHash is a probabilistic estimator of the Jaccard similarity between two sets. For two documents A and B, treated as sets of n-grams, Jaccard similarity is defined as |A ∩ B| / |A ∪ B|. A single hash function h maps each element of A and B to an integer. The probability that the minimum value of h(A) equals the minimum value of h(B) equals J(A, B). Repeating this with k independent hash functions produces a signature vector of length k. The fraction of positions where the two signatures match estimates J(A, B).
The estimator is unbiased. Its variance decreases as 1/k. A signature length of 128 yields a standard error of approximately 0.044 around the true similarity. A signature length of 256 reduces that error to roughly 0.031. Production pipelines for pretraining data almost always operate in the 128–256 range. Below 128, error grows too large for reliable thresholding. Above 256, memory overhead becomes material at trillion-token scale.
MinHash produces an unbiased estimator of Jaccard similarity. Variance scales as 1/k. Signature length 128–256 represents the empirical sweet spot for pretraining corpora.
MinHash detects lexical overlap, not semantic overlap. Two documents that paraphrase the same content using different words will not register as duplicates. Two documents that share boilerplate, headers, or repeated phrases will register strongly. The metric measures surface-level redundancy, which is what pretraining pipelines primarily aim to remove.
The mathematical foundation matters because it constrains what the tool can and cannot do. MinHash provides a lower bound on the computational work required to estimate set similarity without materializing the full intersection. The alternative—computing exact Jaccard similarity for every pair in a billion-document corpus—requires O(N²) set intersections. MinHash compresses each document into a fixed-length signature and reduces the problem to comparing integer vectors, a fundamentally different computational class.
One subtlety often missed in engineering discussions: the hash functions must be independent. In practice, pipelines derive multiple hash values from a single hash function by using different seeds or by combining the output of a fast hash (like MurmurHash3) with random linear transformations. This approximation introduces correlation between hash values. At k = 128, the effect is negligible. At k = 1024 or higher, correlated hashes can inflate variance beyond the theoretical 1/k bound. Production systems rarely push k this high, but the constraint is worth noting for teams experimenting with larger signature lengths.
Configuring N-gram Tokenization and Signature Lengths
The choice of n-gram size determines what counts as a duplicate. Word-level 5-grams capture phrasal redundancy. Word-level 13-grams capture extended syntactic overlap. Character-level n-grams, often used for code, range from 5 to 10 characters. The selection depends on document structure and the granularity of duplication expected in the source corpus.
Common Web crawl duplicates fall into three categories:
- Exact copies: identical byte sequences, often from mirrors and reposts.
- Near-duplicates with minor edits: whitespace differences, punctuation changes, header and footer variations.
- Template-generated content: large blocks of boilerplate with small variable substitutions.
Word-level 5-grams handle the first two categories. Word-level 9–13-grams handle the third. Code corpora, where variable names carry semantic weight, often use character-level 5–10-grams to avoid missing renames and reformats.
Signature length interacts with threshold sensitivity. At k = 128, two documents with true Jaccard similarity of 0.80 produce a signature-match estimate that fluctuates between roughly 0.75 and 0.85 across runs. At k = 256, the same documents produce estimates in a tighter band of 0.77–0.83. For threshold-based decisions (is this document a duplicate?), the wider confidence interval at k = 128 translates to higher false-positive or false-negative rates at the boundary.
| Parameter | Typical Range | Effect on Duplicate Detection |
|---|---|---|
| N-gram size (words) | 5–13 | Larger n-gram = stricter duplication, misses minor edits |
| N-gram size (chars) | 5–10 | Used for code and noisy OCR text |
| Signature length (k) | 128–256 | Higher k = lower variance, higher memory |
| Hash function | MurmurHash3 / xxHash | Choice affects speed, not accuracy |
The table summarizes standard parameter ranges drawn from published pretraining data studies. No single configuration fits all domains. Code data tolerates smaller n-grams because syntax dominates. Natural language tolerates larger n-grams because topical repetition occurs at the phrase level.
N-gram normalization deserves separate attention. Before extracting n-grams, most pipelines normalize text: lowercasing, collapsing whitespace, stripping punctuation, and applying consistent Unicode normalization (typically NFC). The normalization decisions directly affect what counts as a duplicate. A pipeline that preserves case will report fewer duplicates than one that lowercases everything. A pipeline that strips HTML tags will merge documents that differ only in markup. These decisions are not neutral—they define the duplicate concept the pipeline measures.
Tokenization boundary effects compound at scale. Word-level n-grams break differently depending on the tokenizer. A whitespace splitter produces different n-gram sets than a BPE tokenizer on the same text. For MinHash duplicate detection, the convention is to use a simple whitespace-and-punctuation splitter rather than the model's training tokenizer. The reason is pragmatic: MinHash operates on the raw text layer, not the subword layer. Mismatched tokenization between the deduplication pipeline and the training tokenizer introduces artifacts that are difficult to diagnose.
Scaling Near-Duplicate Detection with Locality-Sensitive Hashing
Pairwise comparison across N documents requires O(N²) similarity computations. At N = 10⁹ documents, this operation is infeasible. LSH addresses the problem by banding the signature matrix. Each signature of length k is divided into b bands of r rows each, where b × r = k. Two documents hash to the same bucket in any band if all r rows of their signatures match in that band. Documents sharing at least one bucket become candidate pairs.
The banding technique amplifies the probability that high-similarity pairs collide. For a target similarity threshold t and parameters b, r, the probability of detection is:
P(detection) = 1 − (1 − t^r)^b
Standard parameter choices for pretraining data:
- b = 16, r = 8 (total k = 128): tuned for t ≈ 0.7
- b = 32, r = 8 (total k = 256): tuned for t ≈ 0.8
- b = 16, r = 16 (total k = 256): tuned for t ≈ 0.85
Increasing b at fixed k sharpens the threshold but reduces recall below it. Increasing r broadens the threshold at the cost of larger buckets and more candidates. The (b, r) pair defines an operating point on the precision-recall curve.
LSH also controls memory. Each document is assigned to b bucket IDs. For a corpus of 10⁹ documents with b = 16, the system stores 16 × 10⁹ bucket assignments. At 8 bytes per assignment, the index requires roughly 128 GB. Larger b values scale linearly. Distributed implementations partition the bucket space across workers, but the per-worker memory footprint remains a binding constraint.
The S-curve shape of the LSH probability function has practical consequences. Near the threshold t, the detection probability transitions from near-zero to near-one over a narrow similarity range. This is the feature that makes LSH useful—it acts as a soft classifier. But the transition width depends on r. Larger r produces a sharper transition, meaning the system more cleanly separates duplicates from non-duplicates at the cost of missing borderline cases. Smaller r catches more borderline pairs but floods the verification step with false candidates.
In practice, the verification step is the computational bottleneck, not the LSH indexing step. For a corpus of 10⁹ documents with b = 16, indexing is a single-pass operation. Verification requires reading candidate document pairs from disk and computing exact Jaccard similarity. If the average document has 50 LSH collisions, the system must verify 50 × 10⁹ / 2 candidate pairs. Even at 10⁶ verifications per second, this takes roughly 7,000 CPU-hours. Reducing false-positive rates by tuning (b, r) parameters directly translates to reduced verification cost.
Defining and Interpreting the Duplicate Ratio Threshold
The duplicate ratio is the fraction of documents in a corpus sharing Jaccard similarity above a defined threshold with at least one other document. The threshold is a free parameter. Published pretraining datasets report ratios at thresholds ranging from 0.7 to 0.95. The choice reflects the curation team's tolerance for redundancy and the downstream model's sensitivity to repeated content.
Threshold 0.7 catches aggressive near-duplicates. Documents with 70% shared n-grams almost always originate from the same source or template. Threshold 0.8 catches moderate duplicates, including lightly edited mirrors. Threshold 0.9 catches only tightly overlapping content, leaving most paraphrases and topical overlaps in the corpus.
| Threshold | Document Type Captured | Typical Pretraining Corpus Ratios |
|---|---|---|
| 0.7 | Aggressive near-duplicates, templates | 5–15% |
| 0.8 | Lightly edited mirrors | 2–8% |
| 0.9 | Tightly overlapping content | 0.5–3% |
| 0.95 | Near-exact duplicates | <1% |
Ratios vary by source. Web crawl corpora report 5–15% at threshold 0.7. Curated book corpora report <1% at the same threshold. Code corpora cluster around 10–20% at threshold 0.8 due to forks, copies, and template repositories. The numbers reflect the source distribution, not a fixed standard.
No universal threshold for "optimal" duplicate ratio exists. The ratio is a function of source data, n-gram size, and similarity threshold. Comparisons across corpora require matched configurations.
The threshold also determines removal decisions. Removing documents above threshold 0.8 in a Web crawl often eliminates 5–10% of the corpus. Removing above threshold 0.95 eliminates <1%. The trade-off is between data volume and contamination risk. Aggressive removal shrinks the training set but may discard legitimate variations of important content.
A critical detail often glossed over: the duplicate ratio is not a single number. It is a function of the threshold. A team that reports "5% duplicates" without specifying the threshold, n-gram size, and signature length is reporting an incomparable metric. When Google described deduplication of their pretraining data, the reported ratio depended on all three parameters. When Meta reported deduplication for LLaMA's training set, the configuration differed. Cross-paper comparisons of duplicate ratios require matched configurations, which almost never exist.
The direction of deduplication also matters. Some pipelines remove the duplicate copy, keeping the first occurrence. Others remove both copies. Some apply a quality heuristic, keeping the copy from the higher-quality source. The choice affects the measured ratio only if the metric is defined as "fraction of documents removed" versus "fraction of documents that have at least one duplicate in the corpus." These are different numbers. The former counts each removed document once. The latter counts each document that participates in a duplicate pair, regardless of which copy is removed.
Deduplication also interacts with other filtering stages. A pipeline that applies quality classifiers before deduplication measures a different ratio than one that deduplicates first. The reason is that quality filters remove low-quality documents that are disproportionately likely to be duplicates (spam, boilerplate, auto-generated content). Running deduplication after quality filtering produces a lower measured ratio than running it before, even on the same corpus.
Implementation Strategies for Distributed Pretraining Pipelines
Standard libraries provide MinHash implementations in Python and distributed frameworks. datasketch offers the primary Python implementation, with MinHash and MinHashLSH classes supporting signature generation, batch insertion, and query operations. Apache Spark's MLlib includes MinHash and LSH modules for distributed corpora. The implementations differ in throughput, memory profile, and integration with downstream tokenizers.
A typical pipeline for pretraining data proceeds in five stages:
1. Document tokenization into normalized word sequences, removing punctuation, lowercasing, and applying consistent Unicode normalization.
2. N-gram extraction (word or character) with the selected n value.
3. Signature generation per document using k hash functions. A signature of 128 hashes requires approximately 1 KB of storage per document at 8-byte integers.
4. LSH indexing with (b, r) parameters tuned to the target threshold. Documents sharing a bucket become duplicate candidates.
5. Candidate pair verification with exact Jaccard computation. This step filters LSH false positives by computing the true similarity for each candidate pair.
Distributed implementations partition the corpus into shards, compute signatures per shard, and either compute LSH buckets locally or exchange bucket assignments across workers. The bottleneck is memory during candidate pair expansion. A document with 100 LSH collisions across the corpus generates 100 candidate pairs. Verification of these pairs requires reading and re-hashing the candidate documents.
Throughput estimates vary by framework. Single-node datasketch processes roughly 10⁴–10⁵ documents per minute on commodity hardware. Spark implementations on a 100-node cluster reach 10⁷–10⁸ documents per hour, depending on signature length and bucket configuration. These figures exclude I/O and tokenization, which often dominate end-to-end runtime.
The duplicate ratio itself is computed as:
duplicate_ratio = (number of documents with at least one verified duplicate) / (total number of documents)
The denominator excludes empty or malformed documents that failed tokenization. The numerator counts each document once, regardless of how many duplicates it has. Some pipelines weight duplicates by token count rather than document count. The metric definition must match the downstream use case.
Deduplication is baseline hygiene in any serious pretraining stack. The same curation logic appears across content archives wherever redundant data inflates storage costs and skews downstream models. The engineering problem is not whether to deduplicate, but how to tune the parameters so that the measured ratio reflects the actual redundancy in the data rather than artifacts of the configuration.
For teams building from scratch, a practical approach is to start with conservative parameters—k = 128, b = 16, r = 8, word-level 5-grams—and measure the duplicate ratio at multiple thresholds (0.7, 0.8, 0.9). This produces a ratio curve rather than a single number. The curve reveals how sensitive the corpus is to threshold changes. A steep curve indicates most duplicates are near-exact copies. A flat curve indicates a spectrum of overlap that requires careful threshold selection.
Limitations and Open Variables
MinHash does not detect semantic duplicates. Two documents conveying identical information in different words register as distinct. Embedding-based similarity methods, including sentence-BERT and dense retrievers, catch these cases but operate at higher computational cost and with different precision characteristics. Hybrid pipelines use MinHash for first-pass lexical filtering and embeddings for second-pass semantic review.
The exact computational cost for trillion-token corpora remains a function of the distributed framework, shard size, and memory budget. Published estimates are not directly comparable due to varying hardware configurations. Benchmarks published by individual teams apply only to their specific pipelines.
Code corpora require special handling. Standard word-level MinHash on source code over-reports duplication due to common keywords and operators. Character-level n-grams and AST-aware tokenization produce more meaningful ratios. The threshold also differs: code corpora tolerate higher ratios because forks, templates, and library inclusions are legitimate content rather than contamination.
OCR-extracted text introduces noise that affects n-gram stability. Character-level n-grams absorb this noise more gracefully than word-level n-grams, which break on every OCR misrecognition. For scanned document corpora, character-level MinHash is the empirical default.
The duplicate ratio is a necessary but insufficient quality metric. A corpus with 0% duplicates can still contain misinformation, bias, and low-quality content. MinHash measures surface redundancy. It does not measure semantic redundancy, factual accuracy, or domain coverage. Pretraining pipelines should pair MinHash-based deduplication with classifier-based quality filtering, embedding-based diversity analysis, and downstream benchmark evaluation. The metric remains a load-bearing component of any serious pretraining curation stack, not a standalone verdict.
An emerging area of investigation is document-level versus passage-level deduplication. Standard MinHash operates on whole documents. But duplicate passages within otherwise distinct documents—shared paragraphs, quoted text, boilerplate legal notices—escape detection. Passage-level MinHash splits documents into fixed-length chunks and applies the same pipeline at finer granularity. The trade-off is increased computation and a more complex deduplication graph. Some teams apply passage-level deduplication only to documents flagged by document-level MinHash as partially overlapping, reducing the overhead while catching intra-document redundancy.
The field is converging on a layered approach. Fast lexical methods (MinHash + LSH) serve as the first filter. Embedding-based methods operate on the survivors. Human review or quality classifiers operate on flagged edge cases. No single method is sufficient. The duplicate ratio measured by MinHash is the starting point of this pipeline, not the end.