# Parot

An extremely fast text search and analysis library, written in Rust, with Python, JavaScript/WASM, and CLI front-ends. Build an index once, query it millions of times.
## Benchmark

17 tools on 10 MB of Wikipedia-EN. One index, many queries. Every position is cross-verified against every competitor — mismatches abort the run. Parot's 187 ms index build happens once; queries are effectively constant-time regardless of corpus size.
| Tool | Lang | 100 phrases | 1,000 phrases | 10,000 phrases |
|---|---|---|---|---|
| Parot | Rust | 0.34 ms 1.00× | 4.59 ms 1.00× | ~46 ms 1.00× |
| ripgrep* | Rust | 123 ms 361× | 93.4 ms 20× | ~250 ms 5× |
| ahocorasick-rs | Rust | 53.7 ms 158× | 118 ms 26× | ~180 ms 4× |
| pyahocorasick | Python | 116 ms 339× | 158 ms 34× | ~250 ms 5× |
| stringzilla | Python | 42.6 ms 125× | 424 ms 92× | ~4.2 s 91× |
| modern-ahocorasick | JS | 341 ms 998× | 442 ms 96× | 695 ms 15× |
| `bytes.find` loop | Python | 383 ms 1,124× | 4.1 s 894× | ~41 s 891× |
| `String.indexOf` | JS | 403 ms 1,181× | 4.3 s 929× | 43.3 s 941× |
| `str.find` loop | Python | 427 ms 1,253× | 4.5 s 973× | ~45 s 978× |
| pyarrow `count_substring` | Python | 1,336 ms 3,918× | 14.4 s 3,129× | ~144 s 3,130× |
| `RegExp.exec /g` | JS | 396 ms 1,161× | 24.5 s 5,329× | 256 s 5,565× |
| `String.matchAll` | JS | 394 ms 1,156× | 24.8 s 5,403× | 258 s 5,609× |
| regex (mrab) | Python | 449 ms 1,317× | 30.5 s 6,640× | ~305 s 6,630× |
| google-re2* | Python | 3,027 ms 8,877× | 31.6 s 6,870× | ~316 s 6,870× |
| `re.finditer` | Python | 677 ms 1,985× | 71.4 s 15,551× | ~714 s 15,520× |
| polars `count_matches` | Python | 1,457 ms 4,273× | 106 s 23,134× | ~1,060 s 23,040× |
| pandas `str.count` | Python | 1,709 ms 5,011× | 109 s 23,672× | ~1,090 s 23,700× |
Apple M2 Pro · 100, 1,000, and 10,000 phrases of 2–20 words · median of 3 runs. Reproduce with `just bench-hero`. *Soft parity (non-overlapping multi-pattern semantics). ~ in the 10,000 column = projected from 1,000-phrase scaling, pending a real run.
→ Live browser demo → Full benchmark suite
## Why it's fast
Parot is sublinear: after a one-time O(N) build, each query is O(m + k) — where m is the pattern length and k is the number of matches. Every other tool in the table is O(N) per query; double the corpus, double the time. Parot's query time doesn't change.
That's why the gap grows with your text — 125× on 10 MB, thousands of times on a gigabyte.
## Install
Parot is in pre-release. Python ships to TestPyPI; JavaScript ships to npm under the `beta` dist-tag. Both move to their primary registries once the API stabilises.
The `--extra-index-url` lets pip pull Parot's runtime deps (`numpy`, `loguru`, `rich`, `typer`) from PyPI:
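The command block appears to have been lost in extraction; for the setup described (package on TestPyPI, dependencies on PyPI), the standard invocation would be:

```shell
pip install parot --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/
```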
Or with uv:
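Assuming the same TestPyPI layout, the uv equivalent would be:

```shell
uv pip install parot --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/
```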
Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).
## Quick start
Drop-in string replacement. Wrap any string. Every str method still works; the slow ones become microsecond-latency queries, plus findall, finditer, kwic, and summary on top.
```python
import parot

text = parot.text(open("genome.fa").read())  # wraps str, index built lazily

text.count("ATTGCC")        # microsecond-latency
"CRISPR" in text            # microsecond-latency
text.findall("TATA")        # sorted list of all positions

for m in text.finditer("GATTACA"):
    print(m.before(40), m.match, m.after(40))  # with surrounding context
```
DataFrames. Replace df["col"].str.contains() with one line. Up to 36,000× faster.
```python
import pandas as pd
import parot  # registers the .parot accessor

df = pd.read_parquet("logs.parquet")          # 500K rows

mask = df["message"].str.contains("error")    # 1,200 ms
mask = df.parot.contains("message", "error")  # 0.03 ms — 36,000× faster

df.parot.batch_count("message", ["error", "warning", "fatal", "timeout"])
```
Power-user API. Full index with numpy arrays, batch operations, and serialization.
```python
from parot import Index

index = Index(open("shakespeare.txt").read())  # build once

index.count("the king")                       # ~0.003 ms
index.find_all("my lord")                     # numpy array of positions
index.search("good sir", context=50)          # with surrounding context
index.batch_count(["king", "queen", "duke"])  # parallel over patterns

index.save("shakespeare.parot")               # resumable, cross-platform
```
Duplicates, similarity, analysis.
```python
import parot

parot.find_duplicates(text, min_words=4)      # repeated phrases
parot.common_passages(manuscript, reference)  # shared passages
parot.text_similarity(manuscript, reference)  # float 0..1
parot.longest_common_substring(draft_a, draft_b)
parot.unique_fragment_count(corpus)
```
JavaScript / WASM. Same engine, any browser or Node.js. Indices serialize across Python and JS.
```javascript
import { Index } from 'parot';

const index = new Index(text, 0);

index.count('pattern');       // microsecond-latency
index.findAll('pattern');     // Uint32Array of positions
index.search('pattern', 50);  // with surrounding context
index.batchCount(['error', 'warning', 'fatal']);  // multi-pattern

const bytes = index.serialize();  // interchange with Python
```
CLI.
```shell
parot scan manuscript.md --top 20   # duplicate phrases
parot search corpus.txt "pattern"   # positions in a file
parot count corpus.txt "pattern"    # occurrence count
parot lcs a.md b.md                 # longest common substring
parot info manuscript.md            # size, word count, distinct substrings
```
→ Full Python API · Full JavaScript API · CLI Reference
## What's in the box
Parot is not just "fast grep" — it's a text-analysis toolkit. Everything below is a single call on an index you already built.
### Search

| Capability | Python | JS | CLI |
|---|---|---|---|
| `count`, `find`, `find_all`, `index`, `__contains__` — single-pattern | ✓ | ✓ | ✓ |
| `search`, `extract` — matches with surrounding context | ✓ | ✓ | — |
| `finditer` — lazy iterator with `.before()` / `.after()` per match | ✓ | — | — |
| `batch_count`, `batch_find_all`, `batch_search`, `batch_extract` — multi-pattern, parallel | ✓ | ✓ | — |
| `kwic` — Key-Word-In-Context DataFrame | ✓ | — | — |
| `summary` — pattern prevalence DataFrame | ✓ | — | — |
### Multi-document / segment-aware

| Capability | Python | JS | CLI |
|---|---|---|---|
| `Index.from_strings`, `from_series`, `from_dataframe`, `from_arrow`, `from_pyarrow`, `from_polars` | ✓ | ✓ | — |
| `contains_mask`, `find_segments`, `count_per_segment` — per-segment answers | ✓ | ✓ | — |
| `batch_contains_mask`, `batch_count_per_segment`, `batch_find_segments` | ✓ | ✓ | — |
| `filter`, `grep` — return the segments (and IDs) matching a pattern | ✓ | ✓ | — |
### Duplicates & repetition

| Capability | Python | JS | CLI |
|---|---|---|---|
| `find_duplicates` — repeated phrases, word-boundary aware, sentence-clipped | ✓ | ✓ | ✓ |
| `find_duplicates_normalized` — with case folding + whitespace collapsing | ✓ | ✓ | — |
| `find_duplicates_from_path` — memory-mapped, won't load into the Python heap | ✓ | — | ✓ |
| `batch_find_duplicates` — many documents in parallel | ✓ | ✓ | — |
### Similarity & comparison

| Capability | Python | JS | CLI |
|---|---|---|---|
| `common_passages` — every shared passage between two documents + coverage | ✓ | ✓ | — |
| `batch_common_passages` — one reference vs. many candidates | ✓ | ✓ | — |
| `text_similarity`, `batch_text_similarity` — scalar coverage score | ✓ | ✓ | — |
| `longest_common_substring` — longest shared run between two texts | ✓ | ✓ | ✓ |
| `unique_fragment_count` — structural corpus fingerprint | ✓ | ✓ | ✓ |
### DataFrame integration

| Capability | Pandas | Polars |
|---|---|---|
| `series.parot.contains` / `count` / `find_all` | ✓ | ✓ |
| `series.parot.batch_contains` / `batch_count` — multi-pattern, one result frame | ✓ | ✓ |
| `df.parot.contains` / `count` / `batch_contains` / `batch_count` — column-level accessors | ✓ | ✓ |
| `pl.col("text").parot.contains` / `count` — expression-level in lazy queries | — | ✓ |
| Persistent per-column index cache with `build_index` / `invalidate` / `memory_bytes` | ✓ | ✓ |
| `result_to_frame`, `batch_result_to_frames`, `batch_summary_frame` — Arrow → DataFrame | ✓ | ✓ |
### Persistence & interchange

| Capability | Python | JS |
|---|---|---|
| `Index.save` / `Index.load` — file-backed, resumable | ✓ | ✓ |
| `Index.to_bytes` / `Index.from_bytes` — in-memory serialization | ✓ | ✓ |
| Cross-platform format: a Python-built index loads in the browser and vice versa | ✓ | ✓ |
## Configuration

- Memory / speed knob. `memory_compactness` trades `find_all` latency for a smaller RAM footprint (0 = fastest, 4 = most compact).
- Case-insensitive matching via a build-time flag, applied to both corpus and queries.
- Whitespace normalization — results remap back to original positions.
- Introspection. `len(idx)`, `idx.memory_bytes`, `idx.config`, `idx.has_segments`, `idx.segment_count`, `idx[i]` character access.
## The substring gap
Token-based search libraries (fuse.js, lunr, minisearch) split text into words. They can't find arbitrary substrings:
| Query | fuse.js | lunr | minisearch | Parot |
|---|---|---|---|---|
| "rown fox" | — | ✓ | ✓ | ✓ |
| "ATTA" | — | — | — | ✓ |
| "Script dev" | — | — | — | ✓ |
| "ghbor-no" | — | — | — | ✓ |
| "ix arr" | — | — | — | ✓ |
| "uick brown f" | — | ✓ | ✓ | ✓ |
| Score | 0/6 | 2/6 | 2/6 | 6/6 |
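The failure mode is easy to reproduce with plain Python. This is an illustrative sketch of naive word-level lookup, not any library's actual internals: a query that crosses a token boundary matches no indexed word, while substring search over the raw text finds it.

```python
text = "the quick brown fox jumps over the lazy dog"

# A token-based engine indexes whole words.
tokens = set(text.split())

query = "rown fox"
print(any(query in token for token in tokens))  # False: no single token contains it
print(query in text)                            # True: substring search crosses boundaries
```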
Parot's WASM build runs in any browser — no server round-trip, no backend needed.
## When to use Parot
- You query the same text multiple times
- Text is large (>100 KB) — the bigger, the more you win
- You need substring search, not just word search
- You're replacing pandas/polars string operations on a column
- You're finding repeated phrases (writing, plagiarism, LLM dedup, moderation)
- You need client-side full-text search in the browser via WASM
If you'd use str.find() or grep today, Parot is a drop-in acceleration. If you'd use Elasticsearch today, keep using Elasticsearch.
## Ecosystem
Projects built on Parot:
| Project | Description |
|---|---|
| fast-diff | Structural document diff |
| fast-regex | Index-accelerated regular expressions |
| fast-dedup | Large-scale document deduplication |
| fast-fuzz | Fuzzy string matching at scale |
| qzip | Block-transform compression |
## FAQ
**When should I use Parot instead of Elasticsearch?**
When your queries don't align with word boundaries: arbitrary substring search, duplicate phrase detection, non-tokenizable data (DNA, binary protocols), client-side WASM search, or one-shot analysis.
**Is Parot a good fit for my workload?**
Parot is designed for workloads where the same text is queried repeatedly. If you query each document once and discard it, a linear scanner may be the right tool. If you query each document many times — or scan many patterns against one corpus — Parot wins, and the gap grows without bound.
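As a rough, back-of-envelope illustration of that break-even using the 10 MB benchmark numbers above (187 ms build; 0.34 ms per 100 phrases for Parot vs. 123 ms per 100 for ripgrep):

```python
# After how many single-phrase queries does the one-time build pay for
# itself versus the fastest linear scanner in the table?
build_ms = 187                 # Parot's one-time index build
parot_per_query = 0.34 / 100   # ms per phrase (100-phrase column)
ripgrep_per_query = 123 / 100  # ms per phrase (same column)

break_even = build_ms / (ripgrep_per_query - parot_per_query)
print(round(break_even))  # 152 — roughly 150 queries
```

Beyond that point every additional query is nearly free for the index, while the scanner keeps paying full price.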
**Does it work in the browser?**
Yes. The WASM build runs in any modern browser and Node.js with the same API as the native library. See the live browser demo.
## Licensing
Parot is distributed under the Elastic License 2.0 — a source-available license that keeps the code open for almost every real-world use while protecting the project from being resold as a hosted service.
Free, no contract or key required: production use (including commercial), shipping Parot inside a closed-source product, internal tools, research, academic work, and open-source projects.
Three restrictions: no hosted reseller services, no circumventing the license-key gate on save / load, no stripping copyright or license notices.
What requires a key: saving and loading index artifacts (`Index.save` / `load`, `to_bytes` / `from_bytes`, JS `serialize` / `deserialize`). Build and query are free forever. Set `PAROT_LICENSE_KEY=<key>` before running any save/load.
- Free trial key — rotated every 14 days. Fine for evaluation, research, CI, and reproducible benchmarks. Email hello@sophiaconsulting.ai and we'll send you the current key.
- Commercial key — flat-rate, annual, bound to your team. Email hello@sophiaconsulting.ai with a one-line description of your use case.
- Hosted-reseller terms — talk to us before shipping.
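For example, in a shell session (the key value is a placeholder for whichever key you receive):

```shell
export PAROT_LICENSE_KEY="<key>"   # set before any save/load call
```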
## Stability, security, contributing

- Stability — API contract documented in STABILITY. Pin to a tag or commit SHA for reproducible builds during the `0.x` series.
- Security — vulnerability reports: see SECURITY.
- Contributing — see Contributing for setup, testing, and PR guidelines.