Guide

When to use what

Build an index, then query

Use an Index when:

  • You search the same text many times.
  • You need substring search (not word-level).
  • The text is large (>100 KB); the larger the text, the bigger the payoff.
  • You want to discard the original text (self-index).
from parot import Index, materialize_strings

index = Index(corpus)                # build once

for motif in thousand_motifs:        # each query: microseconds
    n = index.count(motif)

result = index.search("pattern", context=50)      # columnar Arrow dict
for before, matched, after in zip(
    materialize_strings(result, "before"),
    materialize_strings(result, "matched"),
    materialize_strings(result, "after"),
):
    print(f"...{before}[{matched}]{after}...")

Don't use it when:

  • You'll run fewer than ~150 queries on the same text.
  • The text changes between queries (the index must be rebuilt).
  • You need regex (Parot does exact substring only — use fast-regex for index-accelerated regex).

See the API Reference for complete signatures.

Duplicate detection

Find repeated phrases in manuscripts, documentation, note collections, or LLM training data. The same linear-time approach used in "Deduplicating Training Data Makes Language Models Better" (Google, 2022) — made accessible via pip and npm.

Defaults (min_words=4, min_chars=9) balance precision and recall. case_insensitive=True collapses phrases that differ only in case ("he said" / "He said" become one). That reports fewer distinct phrases, but each one covers more of the text — what you want when deduplicating. Leave it off when you want every distinct phrase. See benchmarks for parameter sensitivity.

See the Duplicate Detection API for usage.
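To make min_words and case_insensitive concrete, here is a deliberately naive pure-Python sketch of word-level duplicate detection. The function name naive_duplicate_phrases is ours, not Parot's API; it ignores min_chars and runs in O(n·k) rather than Parot's linear time — it only illustrates what the parameters mean.

```python
from collections import Counter

def naive_duplicate_phrases(text, min_words=4, case_insensitive=True):
    # Slide a window of min_words words over the text and count
    # each window; any window seen more than once is a duplicate phrase.
    words = text.lower().split() if case_insensitive else text.split()
    windows = [" ".join(words[i:i + min_words])
               for i in range(len(words) - min_words + 1)]
    return {phrase: n for phrase, n in Counter(windows).items() if n > 1}

text = "he said hello to her and He said hello to him"
naive_duplicate_phrases(text)                          # {"he said hello to": 2}
naive_duplicate_phrases(text, case_insensitive=False)  # {} — case split the phrase
```

Note how case_insensitive=True collapses "he said hello to" and "He said hello to" into one phrase with count 2, exactly the behavior described above.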

Normalization

Normalize at build time. Queries are auto-normalized to match, and find_all() returns positions in the original (unnormalized) text, so you can use the offsets directly for highlighting or slicing.

Use when your corpus has inconsistent casing or whitespace: user-generated content, OCR output, mixed formatting.

from parot import Index

corpus = "Hello World\nhello world\nHELLO WORLD"

# Case-insensitive: "hello" matches all three variants
index = Index(corpus, case_insensitive=True)
assert index.count("hello world") == 3

# Whitespace-insensitive: "Hello   World" matches "hello world"
index = Index(corpus, normalize_whitespace=True)
positions = index.find_all("Hello World")  # positions in original text

# Both:
index = Index(corpus, case_insensitive=True, normalize_whitespace=True)
In JavaScript:

import { Index } from 'parot';

const index = new Index(text, 2, true, true); // case + whitespace
const count = index.count("hello world"); // matches regardless of case/spacing

See the Rust crate's builder API. The index exposes case_insensitive(true) and normalize_whitespace(true) toggles that behave identically to the Python and JavaScript examples above.
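The "positions in the original text" behavior can be sketched in plain Python — this is not Parot's implementation, just the idea: normalize while recording, for each normalized character, its offset in the original string, then map match positions back through that table.

```python
def normalize_with_map(text):
    # Lowercase, collapse whitespace runs to a single space, and record
    # each normalized character's offset in the original text.
    norm_chars, offsets = [], []
    i = 0
    while i < len(text):
        if text[i].isspace():
            j = i
            while j < len(text) and text[j].isspace():
                j += 1
            norm_chars.append(" ")
            offsets.append(i)       # the run maps to its first original char
            i = j
        else:
            norm_chars.append(text[i].lower())
            offsets.append(i)
            i += 1
    return "".join(norm_chars), offsets

norm, offsets = normalize_with_map("Hello   World")
pos = norm.find("world")      # search the normalized text
orig_pos = offsets[pos]       # 8: position of "World" in the original
```

An index built over the normalized text plus this offset table gives you normalized matching with original-text slicing, which is what the guarantee above amounts to.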

Tuning memory_compactness — the memory/speed knob

Two operations: count (how many?) and find_all (where?). count is always fast — scales with pattern length. find_all speed depends on memory_compactness, which controls how much positional detail the index keeps.

memory_compactness   Memory (1 MB text)   count()    find_all()   Best for
0 (default)          3.4 MB               0.28 ms    0.32 ms      Fastest find_all; all use cases
1                    2.2 MB               0.27 ms    1.0 ms       Mild memory pressure
2                    1.6 MB               0.28 ms    2.7 ms       Smaller artifact, find_all still usable
3                    1.3 MB               0.27 ms    6.0 ms
4                    1.2 MB               0.26 ms    12.6 ms      Memory-constrained environments

Higher values reduce memory footprint at the cost of find_all query latency. count() is unaffected.
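One way to read the table above is as a latency budget: pick the most compact level whose find_all() time you can still afford. A small sketch, with the table's numbers hard-coded and a helper name (smallest_level_within) that is ours, not Parot's:

```python
# (memory_mb, count_ms, find_all_ms) per memory_compactness level,
# for 1 MB of text, taken from the table above.
TRADEOFF = {
    0: (3.4, 0.28, 0.32),
    1: (2.2, 0.27, 1.0),
    2: (1.6, 0.28, 2.7),
    3: (1.3, 0.27, 6.0),
    4: (1.2, 0.26, 12.6),
}

def smallest_level_within(find_all_budget_ms):
    # Most compact level (highest number) whose find_all latency
    # still fits the budget; level 0 is the latency floor.
    for level in sorted(TRADEOFF, reverse=True):
        if TRADEOFF[level][2] <= find_all_budget_ms:
            return level
    return 0

smallest_level_within(1.0)   # level 1: 1.0 ms fits, 2.7 ms does not
```

If you never call find_all(), the budget is effectively infinite and the most compact level always wins.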

Normalization is free

case_insensitive and normalize_whitespace have zero measurable overhead on build time, memory, or count() speed. Enable them whenever your use case benefits.

Choosing the right level:

Highlighting matches: use memory_compactness=0 for fast find_all. At 10 MB of text, the index is ~50 MB — fine for browsers and desktop apps.

index = Index(text, memory_compactness=0)
positions = index.find_all("pattern")

Counting only: stick with memory_compactness=0 (default). Count speed is identical across all levels, and level 0 also keeps find_all() fast in case you need it later.

index = Index(text, memory_compactness=0)
for motif in motifs:
    n = index.count(motif)  # microseconds per query

Memory-constrained: use memory_compactness=2 or 4. This trades find_all speed for a smaller artifact; count is unchanged.

index = Index(text, memory_compactness=4)
# Count: same speed as level 0
# find_all: slower, but uses ~4x less memory than level 0

Rule of thumb

Default to memory_compactness=0. Raise it only to shrink the on-disk / in-memory artifact.


For the conceptual primitives Parot exposes, see Concepts.


Limitations

  • No regex. Parot does exact substring matching. For colou?r or log.*error, use fast-regex (index-accelerated) or a regex engine.
  • No fuzzy matching. For approximate or edit-distance search, use fast-fuzz or fuse.js.
  • No streaming / incremental indexing. Build from the complete text. If the text changes, rebuild.
  • Single-threaded construction. Index building uses a single thread. For >1 GB texts, build time is significant.

FAQ

Why not just use indexOf?

indexOf scans the entire text for every query — linear in text size per call. On 10 MB, that's ~10 ms per query. Parot's query cost is proportional to pattern length, not text size — typically under 100 µs. The gap grows linearly with the corpus.

Below ~150 queries on 1 MB of text, indexOf wins. Above that, Parot wins and the gap grows without bound.
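The crossover can be written as a toy cost model. The helper below is ours and the constants are illustrative (loosely taken from the numbers in this guide); the measured ~150 break-even depends on real build and query costs, so treat this as the shape of the argument, not the benchmark.

```python
import math

def break_even_queries(text_mb, build_ms_per_mb=50.0,
                       scan_ms_per_mb=1.0, indexed_query_ms=0.28):
    # A naive scan (indexOf) pays scan_ms_per_mb * text_mb on EVERY query.
    # The index pays a one-time build cost, then ~constant time per query.
    # Break-even is the query count where the totals cross.
    build = build_ms_per_mb * text_mb
    per_query_saving = scan_ms_per_mb * text_mb - indexed_query_ms
    if per_query_saving <= 0:
        return math.inf   # text so small that scanning never loses
    return math.ceil(build / per_query_saving)

break_even_queries(1.0)    # ~70 with these made-up constants
break_even_queries(10.0)   # fewer: bigger text amortizes the build faster
```

Note that break-even *falls* as the text grows — each query saves more scanning — which is why "the larger the text, the bigger the payoff."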

Why not use Pagefind / FlexSearch / lunr.js / Fuse.js?

Those are token-based search engines — they split text into words and match by terms. Excellent for document retrieval ("pages containing these keywords"); they can't do arbitrary substring search.

"icroserv" (a substring of "microservices") returns zero results in lunr.js, FlexSearch, and Fuse.js. Parot finds it instantly. Use token-based engines when you want ranking and fuzzy matching; use Parot when you need exact substring positions.

What about regex?

Regex engines (including String.prototype.matchAll) scan the whole text per query — same linear cost as indexOf. Parot trades generality for speed: exact substrings only, but microsecond-latency after one build. For regex with the same speedup profile, see fast-regex.

When should I use Parot instead of Elasticsearch?

When your queries don't align with word boundaries: arbitrary substring search, duplicate phrase detection, non-tokenizable data (DNA, binary protocols), client-side WASM search, or one-shot analysis.

Rule of thumb: if you'd use str.find() or grep today, Parot is a drop-in acceleration. If you'd use Elasticsearch today, keep using Elasticsearch.

How does Parot compare to Aho-Corasick?
             Parot                                            Aho-Corasick
Build cost   linear in the text (~50 ms/MB)                   linear in the patterns (~0.1 ms)
Query cost   scales with pattern length, not text size        must scan the entire text
Best when    querying the same text many times                querying different texts with the same patterns
Break-even   ~150 queries on the same text                    always wins for one-shot multi-pattern

Parot indexes the text once. Aho-Corasick indexes the patterns once. See the multi-pattern benchmarks.

Why is the DNA benchmark so extreme (21,000x)?

Genome sequences are the ideal case. Large text (50 MB+), short queries (4–20 bp), no tokenization to fall back on, exact matches only. bwa and bowtie — the standard tools in every genomics lab — are built on comparable data structures.

What's new here?

Nothing in the underlying algorithms — they've existed for 30 years. What's new is the packaging: one Rust core with Python, JS/WASM, and CLI bindings from a single codebase, native pandas / polars / Arrow integration, and word-boundary-aware duplicate detection. The contribution is making it all accessible from a single pip install.