Demos¶
Runnable examples showing Parot in real-world scenarios. Each Python demo is a standalone script — install Parot (`pip install "git+https://github.com/sophiaconsulting/parot#subdirectory=bindings/python"`) and run the file.
Content Moderation: Spam & Copypasta Detection¶
4 of 4 planted campaigns detected in 46 ms. Bot networks amplify messages by posting near-identical content. Exact-match hashing misses campaigns with minor variation. Word-boundary-aware duplicate detection catches them all.
Demo: 5,160 synthetic posts — 5,000 organic and 160 spam across 4 coordinated campaigns (crypto scam, astroturfing, disinformation, phishing). Run find_duplicates() on the concatenated feed.
```python
import parot

# Concatenate all posts in a feed batch
feed_text = "\n".join(post["text"] for post in posts)

# Surface repeated phrases (46 ms on 336K characters)
results = parot.find_duplicates(
    feed_text,
    min_words=8,
    min_chars=30,
    max_words=60,
)
```
Results (Apple Silicon):
| Suspected Campaign | Occurrences | Words | Type |
|---|---|---|---|
| Invest now in TurboMoon coin and earn guaranteed returns... | 45 | 33 | Crypto scam |
| Breaking news: according to multiple unnamed sources... | 60 | 20 | Disinformation |
| Urgent security alert your account has been compromised... | 25 | 29 | Phishing |
| I switched to BrandX last month and honestly it changed... | 30 | 22 | Astroturfing |
Insight: bot copypasta is distinguishable from organic repetition by phrase length. Campaigns are 20–33 words; organic template overlap is 8–14 words. A simple heuristic (>15 words + >10 occurrences) separates signal from noise.
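A minimal sketch of that heuristic, assuming each `find_duplicates` result is a dict carrying the repeated phrase and its occurrence count — the `"count"` key appears in the plagiarism demo below, while `"text"` is a hypothetical field name for the phrase itself:

```python
def flag_campaigns(duplicates, min_phrase_words=15, min_occurrences=10):
    # Long phrases repeated many times are campaign-like; short template
    # overlap or rare long phrases fall below one of the thresholds.
    return [
        d for d in duplicates
        if len(d["text"].split()) > min_phrase_words
        and d["count"] > min_occurrences
    ]

sample = [
    {"text": " ".join(["buy"] * 33), "count": 45},   # long + frequent: campaign
    {"text": "thanks for the update", "count": 40},  # short template: organic
    {"text": " ".join(["word"] * 20), "count": 5},   # long but rare: organic
]
flag_campaigns(sample)  # keeps only the first entry
```

Both thresholds are tunable; the values above are the ones the demo's data happens to separate cleanly.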
Bioinformatics: Genome-Scale Motif Search¶
21,000× faster than BioPython on 50 MB of GRCh38 chr22. Index a genome once; look up thousands of short motifs in time proportional only to pattern length.
Demo: 19 real biological motifs (EcoRI, BamHI, HindIII, TATA box, E-box, etc.) on synthetic genomes up to 50 MB, compared against bytes.count(), re.findall(), and BioPython.
```python
from parot import Index

genome = open("genome.fasta").read()
index = Index(genome)  # Build once

# Each query: ~0.03 ms regardless of genome size
enzymes = ["GAATTC", "GGATCC", "AAGCTT", "GCGGCCGC"]
for motif in enzymes:
    print(f"{motif}: {index.count(motif):,} occurrences")
```
Results (Apple M2 Pro, 1,000 sgRNA guides × GRCh38 chr22, 50 MB):
| Competitor | Regime | Median (ms) | Parot speedup |
|---|---|---|---|
| parot batch_find_all | index | 2.31 | 1.0x |
| bytes.find loop | substring | 49,333 | 21,313x |
| BioPython Seq.count_overlap | count-only | 49,576 | 21,418x |
| ripgrep (subprocess) | substring | 189.3 | 82x |
Query time stays flat as the genome grows — pattern length, not corpus size, is what matters. Same class of data structure that powers BWA and Bowtie2.
Plagiarism Detection: Finding Copied Passages¶
37% of the essay plagiarized, detected in 1 ms. Two complementary primitives: longest_common_substring finds the single longest shared run; find_duplicates surfaces every shared phrase.
Demo: student essay vs. source document. The student copied 5 paragraphs verbatim and wrote their own transitions.
```python
import parot

# Find the single longest shared passage (0.57 ms)
lcs = parot.longest_common_substring(source, essay)

# Find ALL shared phrases between documents
combined = source + "\n\n" + essay
duplicates = parot.find_duplicates(combined, min_words=5)
shared = [d for d in duplicates if d["count"] >= 2]
```
Results (Apple Silicon):
- Longest shared passage: 277 characters, 34 words — found in 0.57ms
- All shared phrases: 15 distinct passages totaling 128 words — found in 1ms
- Verdict: 37% of the essay was plagiarized
The 15 detected phrases range from 4 to 28 words, capturing every copied passage including fragments split across paragraph boundaries.
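The percentage verdict itself is simple arithmetic. A hypothetical scorer (the demo doesn't show its scoring code) divides the essay words covered by shared phrases by the essay's total word count:

```python
def plagiarism_score(essay, shared_phrases):
    # Hypothetical scoring: fraction of essay words that fall inside a
    # detected shared phrase. Ignores overlaps between phrases.
    copied = sum(len(p.split()) for p in shared_phrases if p in essay)
    return copied / max(len(essay.split()), 1)

essay = "the quick brown fox jumps over the lazy dog near the river bank"
round(plagiarism_score(essay, ["quick brown fox", "lazy dog"]), 2)  # 0.38
```

With the demo's numbers — 128 shared words against the essay's total — the same division yields the 37% figure.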
Client-Side Browser Search¶
1,287× faster than indexOf on 10 MB. Ship Parot in WebAssembly for instant full-text search with no server round-trip.
Open the live demo → — it searches 10 MB of Dickens in your browser.
Or use it in any project:
```js
import init, { Index } from 'parot';

await init();
const text = document.body.innerText;  // ~100 KB of docs
const index = new Index(text, 0);      // Build: ~10 ms
const count = index.count("search term");
const positions = index.findAll("search term");
console.log(`${count} matches, index is ${index.heapSize()} bytes`);
```
Query time depends only on pattern length — searching 10 MB takes the same time as 10 KB. memory_compactness=0 (default) puts the WASM footprint at ~5× text; memory_compactness=4 shrinks it to ~1.4× at the cost of slower find_all.
Results (Apple M4):
| Text size | indexOf loop | parot | Speedup |
|---|---|---|---|
| 10 KB | 0.32ms | 0.22ms | 1.5x |
| 100 KB | 3.8ms | 0.22ms | 17x |
| 1 MB | 39ms | 0.26ms | 143x |
| 10 MB | 383ms | 0.29ms | 1,287x |
The demo includes a live text editor — paste your own content and search it instantly.
Index Visualizer¶
An educational tool for exploring Parot's index structures. Type a short string; every data structure is computed and visualized live.
Or run locally with `just demo` and open http://localhost:8765/examples/demo_visualizer.html.
What it shows:
- Index table — all suffixes sorted lexicographically, with rank and position
- Shared-prefix bars — shared-prefix lengths shown as colored bars
- Transform string — each character in its own cell, with sentinel highlighted
- Longest repeated substring — highlighted in the original text
- Distinct substring count — total unique substrings
- Round-trip — demonstrates perfect recovery from the reversible transform
Try typing "mississippi" or "abracadabra" — classic textbook examples that reveal beautiful patterns.
Text Complexity Analyzer¶
Paste text; get instant metrics — distinct substring count, longest repeated substring, and every duplicate phrase color-coded inline. A writing tool for finding accidental repetition.
Or run locally with `just demo` and open http://localhost:8765/examples/demo_text_complexity.html.
Features:
- Information density — distinct substring count measures how much unique content exists
- Duplicate phrase detection — every repeated phrase highlighted in a different color
- Interactive table — click any phrase to scroll to it in the text
- Compare mode — paste two texts side by side to compare complexity profiles
- Tunable parameters — adjust min words, toggle case sensitivity
Includes presets: Lorem ipsum (highly repetitive), Dickens (literary), random characters (high entropy). The contrast between these is immediately striking.
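For intuition on what the distinct-substring count measures, here is a naive set-based version — O(n²) substrings, fine for short strings only; Parot derives the same number from its index without enumerating anything:

```python
def distinct_substrings(s):
    # Enumerate every non-empty substring and deduplicate via a set.
    return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

distinct_substrings("aaaa")  # 4  -> highly repetitive, low density
distinct_substrings("abcd")  # 10 -> all n*(n+1)/2 substrings are distinct
```

Repetitive text collapses many substrings into one, which is exactly why Lorem ipsum scores so much lower than random characters of the same length.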
Longest Common Substring Finder¶
Two side-by-side text panes. Paste text in each; the longest shared passage is highlighted in both. Immediately intuitive: "what do these two texts have in common?"
Or run locally with `just demo` and open http://localhost:8765/examples/demo_lcs.html.
Features:
- Side-by-side highlighting — the LCS highlighted in green in both panes
- Stats — LCS length (chars and words), similarity percentage, time taken
- Multiple occurrences — if the LCS appears more than once, all instances are highlighted
- Preset examples — original vs plagiarized, two Wikipedia articles, similar code snippets
Useful for comparing documents, detecting copied content, or finding shared code patterns.