
Use Cases & Demos

Each section: claim, code, numbers. Runnable commands where available.


Genome Search: Indexed Motif Lookup

21,000× faster than BioPython on real CRISPR guide search. Index a genome once, then look up thousands of short motifs — restriction enzyme sites, CRISPR guides, primers — in time proportional only to pattern length.

Workload: build an index over GRCh38 chromosome 22 (50 MB, 50.8M bases) and batch_find_all 1,000 sgRNA guides sampled from the Brunello lentiviral knockout library.

from parot import Index

genome = open("GRCh38_chr22.fa").read()
index = Index(genome)  # Build once: ~1.2 seconds

# batch_find_all 1,000 guides: ~2.3 ms total (~2.3 μs per guide)
guides = [...]  # 20-mer sgRNAs from Brunello
hits = index.batch_find_all(guides)

Results (Apple M2 Pro, 1,000 guides × GRCh38 chr22):

| Competitor | Regime | Median (ms) | Speedup vs parot |
|---|---|---|---|
| parot batch_find_all | index | 2.31 | 1.0x |
| bytes.find loop | substring | 49,333 | 21,313x |
| BioPython Seq.count_overlap | count-only | 49,576 | 21,418x |
| ripgrep (subprocess) | substring | 189.3 | 82x |

BioPython has no native find-all, so its count-only method is used for parity. This is the same class of data structure that powers BWA and Bowtie2 — now a pip install away.
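For reference, the bytes.find loop baseline in the table can be reproduced in a few lines of plain Python (a sketch of the measured loop, not parot code):

```python
def find_all_naive(haystack: bytes, needle: bytes) -> list[int]:
    """Repeated bytes.find: O(text length) per pattern, per guide."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits

# Toy genome; the benchmark runs this once per guide over 50.8M bases
print(find_all_naive(b"ACGTACGTTTACGT", b"ACGT"))  # [0, 4, 10]
```

Each call rescans the whole text, which is why 1,000 guides take ~49 seconds here versus ~2.3 ms with the index.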

uv run examples/demo_genome_search.py

See also: API Reference for Index methods | Benchmarks for full performance tables | Concepts for background


Content Moderation: Spam & Copypasta Detection

4 of 4 planted campaigns detected in 46 ms. Bot networks amplify messages by posting near-identical content. Exact-match hashing misses campaigns with minor variation. Word-boundary-aware duplicate detection catches them all.

Demo: 5,160 synthetic posts — 5,000 organic and 160 spam across 4 coordinated campaigns (crypto scam, astroturfing, disinformation, phishing). Run find_duplicates() on the concatenated feed.

import parot

# Concatenate all posts in a feed batch
feed_text = "\n".join(post["text"] for post in posts)

# Surface repeated phrases (46ms on 336K characters)
results = parot.find_duplicates(
    feed_text,
    min_words=8,
    min_chars=30,
    max_words=60,
)

Results (Apple Silicon):

| Suspected Campaign | Occurrences | Words | Type |
|---|---|---|---|
| Invest now in TurboMoon coin and earn guaranteed returns... | 45 | 33 | Crypto scam |
| Breaking news: according to multiple unnamed sources... | 60 | 20 | Disinformation |
| Urgent security alert your account has been compromised... | 25 | 29 | Phishing |
| I switched to BrandX last month and honestly it changed... | 30 | 22 | Astroturfing |

Insight: bot copypasta is distinguishable from organic repetition by phrase length. Campaigns are 20–33 words; organic template overlap is 8–14 words. A simple heuristic (>15 words + >10 occurrences) separates signal from noise.
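That heuristic is a one-line filter over the columnar output. A minimal sketch with hypothetical data (the column names mirror the examples on this page; the phrases are invented):

```python
# Hypothetical columnar result, shaped like find_duplicates output
phrases = [
    "Invest now in TurboMoon coin and earn guaranteed returns",
    "thanks for the follow",
    "I switched to BrandX last month and honestly it changed",
]
counts = [45, 12, 30]
word_lengths = [33, 4, 22]

# Campaign filter: long phrases (>15 words) repeated often (>10 times)
flagged = [
    p for p, c, w in zip(phrases, counts, word_lengths)
    if w > 15 and c > 10
]
print(flagged)  # both campaign phrases; the short organic phrase is dropped
```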

uv run examples/demo_content_moderation.py

See also: API Reference for find_duplicates() parameters | Concepts for how phrase detection works


Writing and Editing

Find unintentional repeated phrases in manuscripts, articles, and documentation. Output includes UTF-16 offsets for editor plugins (VSCode, Obsidian).

import parot

result = parot.find_duplicates(manuscript_text, min_words=3)
phrases = parot.materialize_strings(result, "phrase")
for phrase, count, words in zip(phrases, result["count"], result["number_of_words"]):
    print(f'{count}x  ({words} words)  "{phrase}"')

  • Pride and Prejudice (752 KB): 75 ms, 2,025 phrases
  • Complete Shakespeare (5.6 MB): 796 ms, 10,308 phrases
  • King James Bible (4.6 MB): 2,385 ms, 57,009 phrases

case_insensitive=True is free: zero overhead and 7.6% more phrases found.
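The UTF-16 offsets matter because VSCode and Obsidian address positions in UTF-16 code units, not Unicode code points. Converting a Python string index is a one-liner (a general sketch, independent of parot):

```python
def to_utf16_offset(text: str, index: int) -> int:
    """Convert a Python code-point index to a UTF-16 code-unit offset,
    the unit editor APIs like VSCode's expect."""
    return len(text[:index].encode("utf-16-le")) // 2

s = "a\U0001F600b"            # the emoji is one code point, two UTF-16 units
print(to_utf16_offset(s, 2))  # 3: 'b' sits at UTF-16 offset 3, not 2
```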

See also: API Reference for find_duplicates() | Benchmarks for parameter tuning


Plagiarism Detection: Finding Copied Passages

Check a manuscript against known sources in milliseconds. Two complementary primitives: longest_common_substring returns the single longest shared run; find_duplicates surfaces every shared phrase.

Demo: student essay vs. source document. The student copied 5 paragraphs verbatim and wrote their own transitions.

import parot

# Find the single longest shared passage (0.57ms)
lcs = parot.longest_common_substring(source, essay)

# Find ALL shared phrases between documents
combined = source + "\n\n" + essay
duplicates = parot.find_duplicates(combined, min_words=5)  # columnar Arrow dict
phrases = parot.materialize_strings(duplicates, "phrase")
shared = [p for p, c in zip(phrases, duplicates["count"]) if c >= 2]

Results (Apple Silicon):

  • Longest shared passage: 277 characters, 34 words, found in 0.57 ms
  • All shared phrases: 15 distinct passages totaling 128 words, found in 1 ms
  • Verdict: 37% of the essay was plagiarized

The 15 detected phrases range from 4 to 28 words, capturing every copied passage including fragments split across paragraph boundaries.
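For intuition, longest common substring has a textbook quadratic dynamic program. Parot's suffix-based index replaces it with a linear-time structure, but the sketch below shows what is being computed:

```python
def lcs_dp(a: str, b: str) -> str:
    """Quadratic-time reference implementation of longest common
    substring; suffix-based indexes compute the same answer in
    linear time."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the shared run by one
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

print(repr(lcs_dp("the cat sat on the mat", "a cat sat on a hat")))
# ' cat sat on '
```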

uv run examples/demo_plagiarism_detection.py

See also: API Reference for longest_common_substring() | Benchmarks for performance at scale


In-Browser Full-Text Search (WebAssembly)

Ship Parot in WebAssembly for instant full-text search with no server round-trip. A live demo searches 10 MB of Dickens in your browser.

import init, { Index } from 'parot';
await init();

const text = document.body.innerText;  // ~100 KB of docs
const index = new Index(text, 0);      // memory_compactness=0; build: ~10 ms

const count = index.count("search term");
const positions = index.findAll("search term");

// Structured results with context (columnar Arrow layout)
const r = index.search("search term", 50);
const dec = new TextDecoder();
const decode = (bytes, offsets, i) =>
  dec.decode(bytes.slice(offsets[i], offsets[i + 1]));
for (let i = 0; i < r.positions.length; i++) {
  const before = decode(r.beforeBytes, r.beforeOffsets, i);
  const matched = decode(r.matchedBytes, r.matchedOffsets, i);
  const after = decode(r.afterBytes, r.afterOffsets, i);
  console.log(`...${before}[${matched}]${after}...`);
}

Query time depends only on pattern length — searching 10 MB takes the same time as 10 KB. memory_compactness=0 (default) puts the in-browser footprint at ~5× text; memory_compactness=4 shrinks it to ~1.4× at the cost of slower find_all. See Benchmarks for the indexOf-vs-Parot comparison.

See also: JavaScript API | Benchmarks | Concepts


Editor Plugins

Real-time duplicate highlighting via the WASM backend. Extensions live in their own repositories:

  • concordance-vscode — VSCode extension
  • concordance-obsidian — Obsidian plugin

Data Pipelines (pandas / polars replacement)

36,000× faster than polars str.contains on 500K rows, 50 patterns. The pandas and polars accessors ship in the box — one-line change.

import pandas as pd
import parot                          # registers .parot accessor

df = pd.read_parquet("logs.parquet")              # 500K rows

# Before (1,200 ms):
mask = df["message"].str.contains("error")

# After (0.03 ms):
mask = df.parot.contains("message", "error")

# Multi-pattern prevalence table
df.parot.batch_count("message", ["error", "warning", "fatal", "timeout"])
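For reference, the prevalence table produced by batch_count is equivalent to counting per-pattern matches over the column. A plain-Python sketch of the semantics (not the accessor's implementation):

```python
messages = [
    "error: disk full",
    "warning: retry scheduled",
    "fatal error: timeout while connecting",
]
patterns = ["error", "warning", "fatal", "timeout"]

# How many rows contain each pattern (substring match per row)
prevalence = {p: sum(p in m for m in messages) for p in patterns}
print(prevalence)  # {'error': 2, 'warning': 1, 'fatal': 1, 'timeout': 1}
```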

See also: Python API | DataFrame benchmarks


LLM Training Data Dedup

The same linear-time approach used in "Deduplicating Training Data Makes Language Models Better" (Google, 2022). Their run needed 600 GB+ of RAM for C4. Parot targets small-to-medium corpora — manuscripts, docs, note collections — finishing the King James Bible in 2.4 s.

import parot

duplicates = parot.find_duplicates(
    corpus_text,
    min_words=5,
    enable_block_detection=True,
)
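A downstream step in dedup pipelines is dropping the repeated spans once they are found. A minimal paragraph-level sketch in plain Python (illustrative only, not Parot's API):

```python
def dedup_paragraphs(corpus: str) -> str:
    """Keep the first occurrence of each paragraph, drop exact repeats
    (whitespace- and case-insensitive)."""
    seen, kept = set(), []
    for para in corpus.split("\n\n"):
        key = " ".join(para.lower().split())  # normalize for comparison
        if key and key not in seen:
            seen.add(key)
            kept.append(para)
    return "\n\n".join(kept)

text = "Hello world.\n\nSame para.\n\nsame  para.\n\nUnique."
print(dedup_paragraphs(text))
```

find_duplicates goes further by surfacing near-boundary and sub-paragraph repeats that exact paragraph hashing misses.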

See also: Concepts


Text Compression and Analysis

Parot powers downstream compression and Lempel-Ziv-style analysis tools — for example, qzip (block-transform compression). These build on Parot's self-index capability: construct the index once, then query or extract from the compressed structure directly without keeping the original text in memory.

from parot import Index

index = Index(data)                        # build once
assert index.count("some pattern") > 0     # query the compressed structure

See also: Python API


What's next?