Concepts¶
The primitives Parot exposes, in plain English. You don't need to read this to use the library — the Guide tells you which tool to pick — but understanding the primitives helps you make better decisions.
What is an Index?¶
A compressed structure that answers "where does this pattern appear?" in time proportional to pattern length — regardless of how large the text is.
Parot's Index is a self-index: it replaces the original text with a compressed structure you can query directly. At the default memory_compactness=0 the index stays at ~5× text with microsecond-latency batch_find_all; memory_compactness=4 shrinks it to ~1.4× text at the cost of slower find_all.
Think of a library. Searching without an index means reading every page of every book (that's str.count() or indexOf). Parot's index is a compact card catalog that replaces the books: once built, you don't need the originals to answer queries.
Two core operations:
- `count(pattern)`: how many times does the pattern appear? Always fast: scales with pattern length, not corpus size.
- `find_all(pattern)`: where does each occurrence start? Speed depends on `memory_compactness` (see the Guide).
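To make the semantics concrete, here is the naive linear-scan baseline these operations replace, written with Python's standard library. This illustrates what `count` and `find_all` return, not how Parot computes them:

```python
import re

def naive_count(text: str, pattern: str) -> int:
    # Count occurrences, including overlapping ones, via a lookahead.
    return len(re.findall(f"(?={re.escape(pattern)})", text))

def naive_find_all(text: str, pattern: str) -> list[int]:
    # Start offset of every occurrence; O(n) work per query.
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", text)]

text = "banana"
print(naive_count(text, "ana"))     # 2 (overlapping matches at 1 and 3)
print(naive_find_all(text, "ana"))  # [1, 3]
```

Both functions rescan the whole text on every call, which is exactly the O(n)-per-query row in the comparison table below.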
| Approach | Memory | Query time | Original text needed? |
|---|---|---|---|
| Scan the text (`indexOf`, `str.count()`) | 1× text | O(n) per query | Yes |
| Parot Index (`memory_compactness=0`, default) | ~5× text | Scales with pattern length | No |
| Parot Index (`memory_compactness=4`) | ~1.4× text | `count` unchanged; slower `find_all` | No |
When to use an Index
An index pays for itself after ~150 queries. If you're searching for many patterns in the same text, or the text is large (>1 MB), Parot is almost certainly the right choice. See Benchmarks for the break-even analysis.
What is duplicate phrase detection?¶
Find every phrase that appears more than once in your text — with word-boundary awareness.
Raw substring matching finds byte sequences. If "the quick brown fox" repeats, it would also report "e quick brown f" as a duplicate. Parot clips to word boundaries only, producing meaningful phrases humans can act on.
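The clipping idea can be sketched in a few lines of Python. This is a simplified illustration of the concept (ASCII-only, using `str.isalnum()` as the word test), not Parot's actual boundary logic:

```python
def clip_to_word_boundaries(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink a raw [start, end) span so it begins and ends on whole words."""
    # If the span starts mid-word, skip forward past the partial word.
    if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
        while start < end and text[start].isalnum():
            start += 1
        while start < end and not text[start].isalnum():
            start += 1
    # If the span ends mid-word, back up before the partial word.
    if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        while end > start and text[end - 1].isalnum():
            end -= 1
        while end > start and not text[end - 1].isalnum():
            end -= 1
    return start, end

text = "the quick brown fox ... the quick brown fox"
s, e = clip_to_word_boundaries(text, 2, 17)  # raw span "e quick brown f"
print(text[s:e])                             # "quick brown"
```

A raw span like `"e quick brown f"` collapses to the meaningful phrase `"quick brown"`; spans that already sit on word boundaries pass through unchanged.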
Use cases:
- Writing & editing. Unintentional repetition in manuscripts and documentation.
- Plagiarism detection. Shared passages between documents.
- LLM training data. Training-corpus dedup — the same approach used in Google's 2022 paper.
- Content moderation. Coordinated spam campaigns and copypasta.
On the King James Bible (4.6 MB, 824K words), a full pass finds 57,009 duplicate phrases in 2,385 ms.
What is LCS (Longest Common Substring)?¶
The longest passage shared between two texts.
Given two documents, LCS returns the longest contiguous sequence of characters that appears in both. (Not "longest common subsequence", which allows gaps.)
Parot computes LCS in linear time — vs. O(nm) for dynamic programming. On 10 KB inputs, that's 532× faster than pylcs.
Use cases: plagiarism detection, comparing document versions, finding shared content.
What is sentence splitting?¶
Rule-based sentence boundary detection with abbreviation and multi-script support. The duplicate detection pipeline needs it: phrases never span sentence boundaries.
- 100+ abbreviations (Mr., Dr., Prof., U.S.A., ...)
- Decimal numbers (3.14 is not a sentence break)
- URLs and email addresses
- CJK punctuation (。?!), Arabic (؟), Devanagari (।), and more
Newlines are always treated as boundaries.
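A toy version of the rule-based approach fits in a short function. This sketch (hypothetical, with a tiny sample abbreviation set; Parot's splitter handles far more rules and scripts) shows why abbreviations and decimals need special treatment:

```python
import re

ABBREVIATIONS = {"mr.", "dr.", "prof.", "u.s.a."}  # tiny sample list

def split_sentences(text: str) -> list[str]:
    """Break after . ! ? followed by whitespace, unless the preceding
    token is a known abbreviation. Decimals like 3.14 never trigger a
    break, because the inner '.' is not followed by whitespace."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s)", text):
        end = m.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # "Dr." is a title, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived. Pi is 3.14. Great!"))
# ['Dr. Smith arrived.', 'Pi is 3.14.', 'Great!']
```

Naively splitting on every period would instead produce fragments like `"Dr."` and `"Pi is 3."`, which is exactly the failure mode the rule list above exists to prevent.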
How it fits together¶
One Rust core; Python, JavaScript/WASM, and CLI bindings from a single codebase.
Text (UTF-8)
│
▼
Rust core (linear-time construction)
│
├── Index — count / find_all / batch / search / extract (self-index)
├── Duplicate phrase detection — word-boundary aware
├── Similarity — common passages, text_similarity, LCS
├── String analysis — distinct substrings, longest repeated
└── Sentence splitter / tokenizer
- Start here. The Guide picks the right tool for your use case.
- API details. The API Reference has complete signatures for Python, JavaScript, Rust, and CLI.