Guide¶
When to use what¶
Build an index, then query¶
Use an Index when:
- You search the same text many times.
- You need substring search (not word-level).
- The text is large (>100 KB) — the bigger, the more you win.
- You want to discard the original text (self-index).
```python
from parot import Index, materialize_strings

index = Index(corpus)  # build once

for motif in thousand_motifs:  # each query: microseconds
    n = index.count(motif)

result = index.search("pattern", context=50)  # columnar Arrow dict
for before, matched, after in zip(
    materialize_strings(result, "before"),
    materialize_strings(result, "matched"),
    materialize_strings(result, "after"),
):
    print(f"...{before}[{matched}]{after}...")
```
Don't use it when:
- You'll run fewer than ~150 queries on the same text.
- The text changes between queries (the index must be rebuilt).
- You need regex (Parot does exact substring only — use fast-regex for index-accelerated regex).
See the API Reference for complete signatures.
Duplicate detection¶
Find repeated phrases in manuscripts, documentation, note collections, or LLM training data. The same linear-time approach used in "Deduplicating Training Data Makes Language Models Better" (Google, 2022) — made accessible via pip and npm.
Defaults (min_words=4, min_chars=9) balance precision and recall. case_insensitive=True collapses phrases that differ only in case ("he said" / "He said" become one). That reports fewer distinct phrases, but each one covers more of the text — what you want when deduplicating. Leave it off when you want every distinct phrase. See benchmarks for parameter sensitivity.
See the Duplicate Detection API for usage.
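The word-boundary-aware idea can be pictured with a toy shingling sketch in plain Python. `repeated_phrases` is a hypothetical stand-in, not Parot's API, and its O(n · min_words) loop only mimics what Parot does in linear time; it does show how `min_words`, `min_chars`, and `case_insensitive` interact:

```python
from collections import defaultdict

def repeated_phrases(text, min_words=4, min_chars=9, case_insensitive=False):
    """Toy sketch (NOT Parot's API): count every min_words-word window
    (a "shingle") and report those occurring more than once. Parot's real
    implementation is linear-time over a suffix structure."""
    source = text.lower() if case_insensitive else text
    words = source.split()
    counts = defaultdict(int)
    for i in range(len(words) - min_words + 1):
        phrase = " ".join(words[i:i + min_words])
        if len(phrase) >= min_chars:  # drop trivially short phrases
            counts[phrase] += 1
    return {p: c for p, c in counts.items() if c > 1}

text = "he said hello to me and He said hello to me"
repeated_phrases(text, case_insensitive=True)
# {'he said hello to': 2, 'said hello to me': 2}
repeated_phrases(text)
# {'said hello to me': 2}  -- "he said" / "He said" stay distinct
```

Note how `case_insensitive=True` collapses the two case variants into one phrase with a higher count, exactly the trade-off described above.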
Case-insensitive and whitespace-insensitive search¶
Normalize at build time. Queries are auto-normalized to match, and find_all() returns positions in the original (unnormalized) text — so you can use the offsets directly for highlighting or slicing.
Use when your corpus has inconsistent casing or whitespace: user-generated content, OCR output, mixed formatting.
```python
from parot import Index

corpus = "Hello World\nhello world\nHELLO WORLD"

# Case-insensitive: "hello" matches all three variants
index = Index(corpus, case_insensitive=True)
assert index.count("hello world") == 3

# Whitespace-insensitive: "Hello World" matches "hello world"
index = Index(corpus, normalize_whitespace=True)
positions = index.find_all("Hello World")  # positions in original text

# Both:
index = Index(corpus, case_insensitive=True, normalize_whitespace=True)
```
See the Rust crate's builder API: it exposes `case_insensitive(true)` and `normalize_whitespace(true)` toggles that behave identically to the Python and JavaScript examples above.
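The offsets-in-original-text guarantee is easy to picture with a plain-Python sketch. `find_all_ci` below is illustrative, not Parot's API; for ASCII text, lowercasing never changes string length, so positions found in the normalized copy index directly into the original:

```python
def find_all_ci(text: str, pattern: str) -> list[int]:
    """Conceptual model of case_insensitive=True (NOT Parot's API):
    search a lowercased copy, report offsets into the original.
    ASCII lowercasing preserves length, so the offsets line up."""
    norm_text, norm_pat = text.lower(), pattern.lower()
    positions, start = [], 0
    while (pos := norm_text.find(norm_pat, start)) != -1:
        positions.append(pos)
        start = pos + 1
    return positions

corpus = "Hello World\nhello world\nHELLO WORLD"
hits = find_all_ci(corpus, "hello world")    # [0, 12, 24]
print([corpus[i:i + 11] for i in hits])      # original-case slices
# ['Hello World', 'hello world', 'HELLO WORLD']
```

Because the returned offsets point into the unnormalized text, the slices come back in their original casing, ready for highlighting.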
Tuning memory_compactness — the memory/speed knob¶
Two operations: count (how many?) and find_all (where?). count is always fast — scales with pattern length. find_all speed depends on memory_compactness, which controls how much positional detail the index keeps.
| memory_compactness | Memory (1 MB text) | count() | find_all() | Best for |
|---|---|---|---|---|
| 0 (default) | 3.4 MB | 0.28 ms | 0.32 ms | Default — fastest find_all, all use cases |
| 1 | 2.2 MB | 0.27 ms | 1.0 ms | Mild memory pressure |
| 2 | 1.6 MB | 0.28 ms | 2.7 ms | Smaller artifact, find_all still usable |
| 3 | 1.3 MB | 0.27 ms | 6.0 ms | |
| 4 | 1.2 MB | 0.26 ms | 12.6 ms | Memory-constrained environments |
Higher values reduce memory footprint at the cost of find_all query latency. count() is unaffected.
Normalization is free
case_insensitive and normalize_whitespace have zero measurable overhead on build time, memory, or count() speed. Enable them whenever your use case benefits.
Choosing the right level:

- You want fast find_all for highlighting matches: use memory_compactness=0. At 10 MB of text, the index is ~50 MB — fine for browsers and desktop apps.
- You mostly call count(): still stick with memory_compactness=0 (default). Count speed is identical across all levels, and level 0 also makes find_all() fast in case you need it later.
Rule of thumb
Default to memory_compactness=0. Raise it only to shrink the on-disk / in-memory artifact.
For the conceptual primitives Parot exposes, see Concepts.
Limitations¶
- No regex. Parot does exact substring matching. For `colou?r` or `log.*error`, use fast-regex (index-accelerated) or a regex engine.
- No fuzzy matching. For approximate or edit-distance search, use fast-fuzz or fuse.js.
- No streaming / incremental indexing. Build from the complete text. If the text changes, rebuild.
- Single-threaded construction. Index building uses a single thread. For >1 GB texts, build time is significant.
FAQ¶
Why not just use indexOf?
indexOf scans the entire text for every query — linear in text size per call. On 10 MB, that's ~10 ms per query. Parot's query cost is proportional to pattern length, not text size — typically under 100 µs. The gap grows linearly with the corpus.
Below ~150 queries on 1 MB of text, indexOf wins. Above that, Parot wins and the gap grows without bound.
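The break-even point follows from simple cost accounting. The function below is an illustration with caller-supplied costs, not part of Parot; the ~150 figure above comes from Parot's own benchmarks:

```python
def break_even_queries(build_ms: float, scan_ms: float, indexed_ms: float) -> float:
    """Queries needed before building an index pays off.

    Scanning costs n * scan_ms in total; indexing costs
    build_ms + n * indexed_ms. Indexing wins once
    n > build_ms / (scan_ms - indexed_ms).
    """
    assert scan_ms > indexed_ms, "indexing never pays off otherwise"
    return build_ms / (scan_ms - indexed_ms)
```

Plug in measured costs for your own corpus and hardware; past the threshold, every extra query widens the gap, because the per-query saving is constant while the build cost is paid once.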
Why not use Pagefind / FlexSearch / lunr.js / Fuse.js?
Those are token-based search engines — they split text into words and match by terms. Excellent for document retrieval ("pages containing these keywords"); they can't do arbitrary substring search.
"icroserv" (a substring of "microservices") returns zero results in lunr.js, FlexSearch, and Fuse.js. Parot finds it instantly. Use token-based engines when you want ranking and fuzzy matching; use Parot when you need exact substring positions.
What about regex?
Regex engines (including String.prototype.matchAll) scan the whole text per query — same linear cost as indexOf. Parot trades generality for speed: exact substrings only, but microsecond-latency after one build. For regex with the same speedup profile, see fast-regex.
When should I use Parot instead of Elasticsearch?
When your queries don't align with word boundaries: arbitrary substring search, duplicate phrase detection, non-tokenizable data (DNA, binary protocols), client-side WASM search, or one-shot analysis.
Rule of thumb: if you'd use str.find() or grep today, Parot is a drop-in acceleration. If you'd use Elasticsearch today, keep using Elasticsearch.
How does Parot compare to Aho-Corasick?
| | Parot | Aho-Corasick |
|---|---|---|
| Build cost | linear in the text (~50 ms/MB) | linear in the patterns (~0.1 ms) |
| Query cost | scales with pattern length, independent of text size | must scan the entire text |
| Best when | querying the same text many times | querying different texts with the same patterns |
| Break-even | ~150 queries on the same text | always wins for one-shot multi-pattern |
Parot indexes the text once. Aho-Corasick indexes the patterns once. See the multi-pattern benchmarks.
Why is the DNA benchmark so extreme (21,000x)?
Genome sequences are the ideal case. Large text (50 MB+), short queries (4–20 bp), no tokenization to fall back on, exact matches only. bwa and bowtie — the standard tools in every genomics lab — are built on comparable data structures.
What's new here?
Nothing in the underlying algorithms — they've existed for 30 years. What's new is the packaging: one Rust core with Python, JS/WASM, and CLI bindings from a single codebase, native pandas / polars / Arrow integration, and word-boundary-aware duplicate detection. The contribution is making it all accessible from a single pip install.