Use Cases & Demos¶
Each section: claim, code, numbers. Runnable commands where available.
Bioinformatics: Genome-Scale Motif Search¶
21,000× faster than BioPython on real CRISPR guide search. Index a genome once, then look up thousands of short motifs — restriction enzyme sites, CRISPR guides, primers — in time proportional only to pattern length.
Workload: build an index over GRCh38 chromosome 22 (50 MB, 50.8M bases) and batch_find_all 1,000 sgRNA guides sampled from the Brunello lentiviral knockout library.
from parot import Index
genome = open("GRCh38_chr22.fa").read()
index = Index(genome) # Build once: ~1.2 seconds
# batch_find_all 1,000 guides: ~2.3 ms total (~2.3 μs per guide)
guides = [...] # 20-mer sgRNAs from Brunello
hits = index.batch_find_all(guides)
Results (Apple M2 Pro, 1,000 guides × GRCh38 chr22):
| Competitor | Regime | Median (ms) | Speedup vs parot |
|---|---|---|---|
| parot batch_find_all | index | 2.31 | 1.0x |
| bytes.find loop | substring | 49,333 | 21,313x |
| BioPython Seq.count_overlap | count-only | 49,576 | 21,418x |
| ripgrep (subprocess) | substring | 189.3 | 82x |
BioPython has no native find-all; count-only used for parity. Same class of data structure that powers BWA and Bowtie2 — now a pip install away.
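The bytes.find baseline row in the table above can be sketched as a plain-Python loop (illustrative only, not part of the parot API) — it rescans the whole genome for every guide, which is why it loses by four orders of magnitude:

```python
# Naive baseline: one full scan of the genome per guide, restarting
# just past each hit so overlapping matches are counted too.
# Work is O(genome length x number of guides), versus the index's
# per-lookup cost proportional only to pattern length.
def find_all_naive(genome: bytes, pattern: bytes) -> list[int]:
    positions, start = [], 0
    while (pos := genome.find(pattern, start)) != -1:
        positions.append(pos)
        start = pos + 1  # step by one to allow overlaps
    return positions

print(find_all_naive(b"ACGTACGTACGT", b"ACGT"))  # [0, 4, 8]
```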
See also: API Reference for Index methods | Benchmarks for full performance tables | Concepts for background
Content Moderation: Spam & Copypasta Detection¶
4 of 4 planted campaigns detected in 46 ms. Bot networks amplify messages by posting near-identical content. Exact-match hashing misses campaigns with minor variation. Word-boundary-aware duplicate detection catches them all.
Demo: 5,160 synthetic posts — 5,000 organic and 160 spam across 4 coordinated campaigns (crypto scam, astroturfing, disinformation, phishing). Run find_duplicates() on the concatenated feed.
import parot
# Concatenate all posts in a feed batch
feed_text = "\n".join(post["text"] for post in posts)
# Surface repeated phrases (46ms on 336K characters)
results = parot.find_duplicates(
feed_text,
min_words=8,
min_chars=30,
max_words=60,
)
Results (Apple Silicon):
| Suspected Campaign | Occurrences | Words | Type |
|---|---|---|---|
| Invest now in TurboMoon coin and earn guaranteed returns... | 45 | 33 | Crypto scam |
| Breaking news: according to multiple unnamed sources... | 60 | 20 | Disinformation |
| Urgent security alert your account has been compromised... | 25 | 29 | Phishing |
| I switched to BrandX last month and honestly it changed... | 30 | 22 | Astroturfing |
Insight: bot copypasta is distinguishable from organic repetition by phrase length. Campaigns are 20–33 words; organic template overlap is 8–14 words. A simple heuristic (>15 words + >10 occurrences) separates signal from noise.
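The heuristic above can be applied directly to the columnar output — a minimal sketch, assuming list-like count and number_of_words columns as shown in the earlier examples:

```python
# Flag phrases longer than 15 words that recur more than 10 times:
# the length/frequency band where campaigns live and organic
# template overlap does not.
def flag_campaigns(phrases, counts, word_lengths,
                   min_words=15, min_count=10):
    return [
        phrase
        for phrase, count, words in zip(phrases, counts, word_lengths)
        if words > min_words and count > min_count
    ]

phrases = ["Invest now in TurboMoon coin and earn...", "great product thanks"]
flagged = flag_campaigns(phrases, counts=[45, 12], word_lengths=[33, 9])
print(flagged)  # only the 33-word, 45-occurrence phrase survives
```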
See also: API Reference for find_duplicates() parameters | Concepts for how phrase detection works
Writing and Editing¶
Find unintentional repeated phrases in manuscripts, articles, and documentation. Output includes UTF-16 offsets for editor plugins (VSCode, Obsidian).
import parot
result = parot.find_duplicates(manuscript_text, min_words=3)
phrases = parot.materialize_strings(result, "phrase")
for phrase, count, words in zip(phrases, result["count"], result["number_of_words"]):
print(f'{count}x ({words} words) "{phrase}"')
Pride and Prejudice (752 KB): 75 ms, 2,025 phrases. Complete Shakespeare (5.6 MB): 796 ms, 10,308 phrases. King James Bible (4.6 MB): 2,385 ms, 57,009 phrases. case_insensitive=True is free — zero overhead, 7.6% more phrases found.
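Editor plugins (VSCode, Obsidian) address text in UTF-16 code units, while Python strings index by code point. A minimal converter for mapping a UTF-16 offset back to a Python index — illustrative glue code, not part of the parot API:

```python
# Walk the string, counting UTF-16 code units: characters outside the
# Basic Multilingual Plane (emoji, some CJK) occupy two units.
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    units = 0
    for i, ch in enumerate(text):
        if units == utf16_offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1  # astral chars take 2 units
    return len(text)

s = "a\U0001F600b"  # 'a', emoji (2 UTF-16 units), 'b'
print(utf16_offset_to_index(s, 3))  # 2 -> Python index of 'b'
```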
See also: API Reference for find_duplicates() | Benchmarks for parameter tuning
Plagiarism Detection: Finding Copied Passages¶
Check a manuscript against known sources in milliseconds. Two complementary primitives: longest_common_substring returns the single longest shared run; find_duplicates surfaces every shared phrase.
Demo: student essay vs. source document. The student copied 5 paragraphs verbatim and wrote their own transitions.
import parot
# Find the single longest shared passage (0.57ms)
lcs = parot.longest_common_substring(source, essay)
# Find ALL shared phrases between documents
combined = source + "\n\n" + essay
duplicates = parot.find_duplicates(combined, min_words=5) # columnar Arrow dict
phrases = parot.materialize_strings(duplicates, "phrase")
shared = [p for p, c in zip(phrases, duplicates["count"]) if c >= 2]
Results (Apple Silicon):
- Longest shared passage: 277 characters, 34 words -- found in 0.57ms
- All shared phrases: 15 distinct passages totaling 128 words -- found in 1ms
- Verdict: 37% of the essay was plagiarized
The 15 detected phrases range from 4 to 28 words, capturing every copied passage including fragments split across paragraph boundaries.
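The verdict arithmetic is straightforward: sum the shared word counts and divide by the essay's length. A sketch of that step, assuming a ~346-word essay (implied by the numbers above, not stated in the demo):

```python
# 128 shared words out of an assumed 346-word essay -> ~37% plagiarized.
def plagiarized_fraction(shared_word_counts: list[int],
                         essay_word_count: int) -> float:
    return sum(shared_word_counts) / essay_word_count

print(round(plagiarized_fraction([128], 346), 2))  # 0.37
```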
See also: API Reference for longest_common_substring() | Benchmarks for performance at scale
Client-Side Browser Search¶
Ship Parot in WebAssembly for instant full-text search with no server round-trip. Live demo searches 10 MB of Dickens in your browser.
import init, { Index } from 'parot';
await init();
const text = document.body.innerText; // ~100 KB of docs
const index = new Index(text, 0); // Build: ~10 ms
const count = index.count("search term");
const positions = index.findAll("search term");
// Structured results with context (columnar Arrow layout)
const r = index.search("search term", 50);
const dec = new TextDecoder();
const decode = (bytes, offsets, i) =>
dec.decode(bytes.slice(offsets[i], offsets[i + 1]));
for (let i = 0; i < r.positions.length; i++) {
const before = decode(r.beforeBytes, r.beforeOffsets, i);
const matched = decode(r.matchedBytes, r.matchedOffsets, i);
const after = decode(r.afterBytes, r.afterOffsets, i);
console.log(`...${before}[${matched}]${after}...`);
}
Query time depends only on pattern length — searching 10 MB takes the same time as 10 KB. memory_compactness=0 (default) puts the in-browser footprint at ~5× text; memory_compactness=4 shrinks it to ~1.4× at the cost of slower find_all. See Benchmarks for the indexOf-vs-Parot comparison.
See also: JavaScript API | Benchmarks | Concepts
Editor Plugins¶
Real-time duplicate highlighting via the WASM backend. Extensions live in their own repositories:
- concordance-vscode — VSCode extension
- concordance-obsidian — Obsidian plugin
Data Pipelines (pandas / polars replacement)¶
36,000× faster than polars str.contains on 500K rows, 50 patterns. The pandas and polars accessors ship in the box — one-line change.
import pandas as pd
import parot # registers .parot accessor
df = pd.read_parquet("logs.parquet") # 500K rows
# Before (1,200 ms):
mask = df["message"].str.contains("error")
# After (0.03 ms):
mask = df.parot.contains("message", "error")
# Multi-pattern prevalence table
df.parot.batch_count("message", ["error", "warning", "fatal", "timeout"])
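What the prevalence table computes can be sketched in plain Python for comparison — note that str.count is non-overlapping, and parot's exact counting semantics may differ:

```python
# Total occurrences of each pattern across all messages,
# as a pattern -> count mapping.
def prevalence(messages: list[str], patterns: list[str]) -> dict[str, int]:
    return {p: sum(msg.count(p) for msg in messages) for p in patterns}

logs = ["error: timeout", "warning: disk", "fatal error"]
print(prevalence(logs, ["error", "warning", "fatal", "timeout"]))
# {'error': 2, 'warning': 1, 'fatal': 1, 'timeout': 1}
```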
See also: Python API | DataFrame benchmarks
LLM Training Data Dedup¶
The same linear-time approach used in "Deduplicating Training Data Makes Language Models Better" (Google, 2022). Their run needed 600 GB+ of RAM for C4. Parot targets small-to-medium corpora — manuscripts, docs, note collections — finishing the King James Bible in 2.4 s.
import parot
duplicates = parot.find_duplicates(
corpus_text,
min_words=5,
enable_block_detection=True,
)
See also: Concepts
Text Compression and Analysis¶
Parot powers downstream compression and Lempel-Ziv-style analysis tools — for example, qzip (block-transform compression). These build on Parot's self-index capability: construct the index once, then query or extract from the compressed structure directly without keeping the original text in memory.
from parot import Index
index = Index(data) # build once
assert index.count("some pattern") # query the compressed structure
See also: Python API
What's next?¶
- New to Parot? Start with Concepts.
- Ready to code? The API Reference has complete signatures.
- Want performance details? See Benchmarks.