# Parot

An extremely fast text search and analysis library, written in Rust, with Python, JavaScript/WASM, and CLI front-ends. Build an index once, query it millions of times.
## Benchmark

17 tools on 10 MB of Wikipedia-EN. One index, many queries. Every position is cross-verified against every competitor — mismatches abort the run. Parot's 187 ms index build happens once; queries are effectively constant-time regardless of corpus size.
| Tool | Lang | 100 phrases | 1,000 phrases | 10,000 phrases |
|---|---|---|---|---|
| Parot | Rust | 0.34 ms 1.00× | 4.59 ms 1.00× | ~46 ms 1.00× |
| ripgrep* | Rust | 123 ms 361× | 93.4 ms 20× | ~250 ms 5× |
| ahocorasick-rs | Rust | 53.7 ms 158× | 118 ms 26× | ~180 ms 4× |
| pyahocorasick | Python | 116 ms 339× | 158 ms 34× | ~250 ms 5× |
| stringzilla | Python | 42.6 ms 125× | 424 ms 92× | ~4.2 s 91× |
| modern-ahocorasick | JS | 341 ms 998× | 442 ms 96× | 695 ms 15× |
| `bytes.find` loop | Python | 383 ms 1,124× | 4.1 s 894× | ~41 s 891× |
| `String.indexOf` | JS | 403 ms 1,181× | 4.3 s 929× | 43.3 s 941× |
| `str.find` loop | Python | 427 ms 1,253× | 4.5 s 973× | ~45 s 978× |
| pyarrow `count_substring` | Python | 1,336 ms 3,918× | 14.4 s 3,129× | ~144 s 3,130× |
| `RegExp.exec /g` | JS | 396 ms 1,161× | 24.5 s 5,329× | 256 s 5,565× |
| `String.matchAll` | JS | 394 ms 1,156× | 24.8 s 5,403× | 258 s 5,609× |
| regex (mrab) | Python | 449 ms 1,317× | 30.5 s 6,640× | ~305 s 6,630× |
| google-re2* | Python | 3,027 ms 8,877× | 31.6 s 6,870× | ~316 s 6,870× |
| `re.finditer` | Python | 677 ms 1,985× | 71.4 s 15,551× | ~714 s 15,520× |
| polars `count_matches` | Python | 1,457 ms 4,273× | 106 s 23,134× | ~1,060 s 23,040× |
| pandas `str.count` | Python | 1,709 ms 5,011× | 109 s 23,672× | ~1,090 s 23,700× |
Apple M2 Pro · 100, 1,000, and 10,000 phrases of 2–20 words · median of 3 runs. Reproduce with `just bench-hero`. *Soft parity (non-overlapping multi-pattern semantics). ~ in the 10,000 column = projected from 1,000-phrase scaling, pending a real run.
→ Live browser demo → Full benchmark suite
## Why it's fast
Parot is sublinear: after a one-time O(N) build, each query is O(m + k) — where m is the pattern length and k is the number of matches. Every other tool in the table is O(N) per query; double the corpus, double the time. Parot's query time doesn't change.
That's why the gap grows with your text — 125× on 10 MB, thousands of times on a gigabyte.
## Install
Parot is in pre-release. Python ships to TestPyPI; JavaScript ships to npm under the `beta` dist-tag. Both move to their primary registries once the API stabilises.
The `--extra-index-url` lets pip pull Parot's runtime deps (`numpy`, `loguru`, `rich`, `typer`) from PyPI:
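The command block appears to have been lost in extraction; for the setup described (package on TestPyPI, dependencies on PyPI), the standard invocation would be:

```shell
pip install parot --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/
```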
Or with uv:
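Assuming the same TestPyPI layout, the uv equivalent would be:

```shell
uv pip install parot --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/
```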
Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).
## Quick start
Drop-in string replacement. Wrap any string. Every str method still works; the slow ones become microsecond-latency queries, plus findall, finditer, kwic, and summary on top.
```python
import parot

text = parot.text(open("genome.fa").read())  # wraps str, index built lazily

text.count("ATTGCC")        # microsecond-latency
"CRISPR" in text            # microsecond-latency
text.findall("TATA")        # sorted list of all positions

for m in text.finditer("GATTACA"):
    print(m.before(40), m.match, m.after(40))  # with surrounding context
```
DataFrames. Replace df["col"].str.contains() with one line. Up to 36,000× faster.
```python
import pandas as pd
import parot  # registers the .parot accessor

df = pd.read_parquet("logs.parquet")          # 500K rows

mask = df["message"].str.contains("error")    # 1,200 ms
mask = df.parot.contains("message", "error")  # 0.03 ms — 36,000× faster

df.parot.batch_count("message", ["error", "warning", "fatal", "timeout"])
```
Power-user API. Full index with numpy arrays, batch operations, and serialization.
```python
from parot import Index

index = Index(open("shakespeare.txt").read())  # build once

index.count("the king")                       # ~0.003 ms
index.find_all("my lord")                     # numpy array of positions
index.search("good sir", context=50)          # with surrounding context
index.batch_count(["king", "queen", "duke"])  # parallel over patterns

index.save("shakespeare.parot")               # resumable, cross-platform
```
Duplicates, similarity, analysis.
```python
import parot

parot.find_duplicates(text, min_words=4)      # repeated phrases
parot.common_passages(manuscript, reference)  # shared passages
parot.text_similarity(manuscript, reference)  # float 0..1
parot.longest_common_substring(draft_a, draft_b)
parot.unique_fragment_count(corpus)
```
JavaScript / WASM. Same engine, any browser or Node.js. Indices serialize across Python and JS.
```javascript
import { Index } from 'parot';

const index = new Index(text, 0);

index.count('pattern');       // microsecond-latency
index.findAll('pattern');     // Uint32Array of positions
index.search('pattern', 50);  // with surrounding context
index.batchCount(['error', 'warning', 'fatal']);  // multi-pattern

const bytes = index.serialize();  // interchange with Python
```
CLI.
```shell
parot scan manuscript.md --top 20   # duplicate phrases
parot search corpus.txt "pattern"   # positions in a file
parot count corpus.txt "pattern"    # occurrence count
parot lcs a.md b.md                 # longest common substring
parot info manuscript.md            # size, word count, distinct substrings
```
→ Full Python API · Full JavaScript API · CLI Reference
## What's in the box
Parot is not just "fast grep" — it's a text-analysis toolkit. Everything below is a single call on an index you already built.
### Search

| Capability | Python | JS | CLI |
|---|---|---|---|
| `count`, `find`, `find_all`, `index`, `__contains__` — single-pattern | ✓ | ✓ | ✓ |
| `search`, `extract` — matches with surrounding context | ✓ | ✓ | — |
| `finditer` — lazy iterator with `.before()` / `.after()` per match | ✓ | — | — |
| `batch_count`, `batch_find_all`, `batch_search`, `batch_extract` — multi-pattern, parallel | ✓ | ✓ | — |
| `kwic` — Key-Word-In-Context DataFrame | ✓ | — | — |
| `summary` — pattern prevalence DataFrame | ✓ | — | — |
### Multi-document / segment-aware

| Capability | Python | JS | CLI |
|---|---|---|---|
| `Index.from_strings`, `from_series`, `from_dataframe`, `from_arrow`, `from_pyarrow`, `from_polars` | ✓ | ✓ | — |
| `contains_mask`, `find_segments`, `count_per_segment` — per-segment answers | ✓ | ✓ | — |
| `batch_contains_mask`, `batch_count_per_segment`, `batch_find_segments` | ✓ | ✓ | — |
| `filter`, `grep` — return the segments (and IDs) matching a pattern | ✓ | ✓ | — |
### Duplicates & repetition

| Capability | Python | JS | CLI |
|---|---|---|---|
| `find_duplicates` — repeated phrases, word-boundary aware, sentence-clipped | ✓ | ✓ | ✓ |
| `find_duplicates_normalized` — with case folding + whitespace collapsing | ✓ | ✓ | — |
| `find_duplicates_from_path` — memory-mapped, won't load into the Python heap | ✓ | — | ✓ |
| `batch_find_duplicates` — many documents in parallel | ✓ | ✓ | — |
### Similarity & comparison

| Capability | Python | JS | CLI |
|---|---|---|---|
| `common_passages` — every shared passage between two documents + coverage | ✓ | ✓ | — |
| `batch_common_passages` — one reference vs. many candidates | ✓ | ✓ | — |
| `text_similarity`, `batch_text_similarity` — scalar coverage score | ✓ | ✓ | — |
| `longest_common_substring` — longest shared run between two texts | ✓ | ✓ | ✓ |
| `unique_fragment_count` — structural corpus fingerprint | ✓ | ✓ | ✓ |
### DataFrame integration

| Capability | Pandas | Polars |
|---|---|---|
| `series.parot.contains` / `count` / `find_all` | ✓ | ✓ |
| `series.parot.batch_contains` / `batch_count` — multi-pattern, one result frame | ✓ | ✓ |
| `df.parot.contains` / `count` / `batch_contains` / `batch_count` — column-level accessors | ✓ | ✓ |
| `pl.col("text").parot.contains` / `count` — expression-level in lazy queries | — | ✓ |
| Persistent per-column index cache with `build_index` / `invalidate` / `memory_bytes` | ✓ | ✓ |
| `result_to_frame`, `batch_result_to_frames`, `batch_summary_frame` — Arrow → DataFrame | ✓ | ✓ |
### Persistence & interchange

| Capability | Python | JS |
|---|---|---|
| `Index.save` / `Index.load` — file-backed, resumable | ✓ | ✓ |
| `Index.to_bytes` / `Index.from_bytes` — in-memory serialization | ✓ | ✓ |
| Cross-platform format: a Python-built index loads in the browser and vice versa | ✓ | ✓ |
## Configuration

- Memory / speed knob. `memory_compactness` trades `find_all` latency for a smaller RAM footprint (0 = fastest, 4 = most compact).
- Case-insensitive matching via a build-time flag, applied to both corpus and queries.
- Whitespace normalization — results remap back to original positions.
- Introspection. `len(idx)`, `idx.memory_bytes`, `idx.config`, `idx.has_segments`, `idx.segment_count`, `idx[i]` character access.
## The substring gap
Token-based search libraries (fuse.js, lunr, minisearch) split text into words. They can't find arbitrary substrings:
| Query | fuse.js | lunr | minisearch | Parot |
|---|---|---|---|---|
| "rown fox" | — | ✓ | ✓ | ✓ |
| "ATTA" | — | — | — | ✓ |
| "Script dev" | — | — | — | ✓ |
| "ghbor-no" | — | — | — | ✓ |
| "ix arr" | — | — | — | ✓ |
| "uick brown f" | — | ✓ | ✓ | ✓ |
| Score | 0/6 | 2/6 | 2/6 | 6/6 |
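The failure mode is easy to reproduce with plain Python. This is an illustrative sketch of naive word-level lookup, not any library's actual internals: a query that crosses a token boundary matches no indexed word, while substring search over the raw text finds it.

```python
text = "the quick brown fox jumps over the lazy dog"

# A token-based engine indexes whole words.
tokens = set(text.split())

query = "rown fox"
print(any(query in token for token in tokens))  # False: no single token contains it
print(query in text)                            # True: substring search crosses boundaries
```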
Parot's WASM build runs in any browser — no server round-trip, no backend needed.
## When to use Parot
- You query the same text multiple times
- Text is large (>100 KB) — the bigger, the more you win
- You need substring search, not just word search
- You're replacing pandas/polars string operations on a column
- You're finding repeated phrases (writing, plagiarism, LLM dedup, moderation)
- You need client-side full-text search in the browser via WASM
If you'd use str.find() or grep today, Parot is a drop-in acceleration. If you'd use Elasticsearch today, keep using Elasticsearch.
## Ecosystem
Projects built on Parot:
| Project | Description |
|---|---|
| fast-diff | Structural document diff |
| fast-regex | Index-accelerated regular expressions |
| fast-dedup | Large-scale document deduplication |
| fast-fuzz | Fuzzy string matching at scale |
| qzip | Block-transform compression |
## FAQ
**When should I use Parot instead of Elasticsearch?**
When your queries don't align with word boundaries: arbitrary substring search, duplicate phrase detection, non-tokenizable data (DNA, binary protocols), client-side WASM search, or one-shot analysis.
**Is Parot a good fit for my workload?**
Parot is designed for workloads where the same text is queried repeatedly. If you query each document once and discard it, a linear scanner may be the right tool. If you query each document many times — or scan many patterns against one corpus — Parot wins, and the gap grows without bound.
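As a rough, back-of-envelope illustration of that break-even using the 10 MB benchmark numbers above (187 ms build; 0.34 ms per 100 phrases for Parot vs. 123 ms per 100 for ripgrep):

```python
# After how many single-phrase queries does the one-time build pay for
# itself versus the fastest linear scanner in the table?
build_ms = 187                 # Parot's one-time index build
parot_per_query = 0.34 / 100   # ms per phrase (100-phrase column)
ripgrep_per_query = 123 / 100  # ms per phrase (same column)

break_even = build_ms / (ripgrep_per_query - parot_per_query)
print(round(break_even))  # 152 — roughly 150 queries
```

Beyond that point every additional query is nearly free for the index, while the scanner keeps paying full price.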
**Does it work in the browser?**
Yes. The WASM build runs in any modern browser and Node.js with the same API as the native library. See the live browser demo.
## Licensing
Parot is distributed under the Elastic License 2.0 — a source-available license that keeps the code open for almost every real-world use while protecting the project from being resold as a hosted service.
Free, no contract or key required: production use (including commercial), shipping Parot inside a closed-source product, internal tools, research, academic work, and open-source projects.
Three restrictions: no hosted reseller services, no circumventing the license-key gate on save / load, no stripping copyright or license notices.
What requires a key: saving and loading index artifacts (`Index.save` / `load`, `to_bytes` / `from_bytes`, JS `serialize` / `deserialize`). Build and query are free forever. Set `PAROT_LICENSE_KEY=<key>` before running any save/load.
- Free trial key — rotated every 14 days. Fine for evaluation, research, CI, and reproducible benchmarks. Email hello@sophiaconsulting.ai and we'll send you the current key.
- Commercial key — flat-rate, annual, bound to your team. Email hello@sophiaconsulting.ai with a one-line description of your use case.
- Hosted-reseller terms — talk to us before shipping.
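For example, in a shell session (the key value is a placeholder for whichever key you receive):

```shell
export PAROT_LICENSE_KEY="<key>"   # set before any save/load call
```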
## Stability, security, contributing

- Stability — API contract documented in STABILITY. Pin to a tag or commit SHA for reproducible builds during the `0.x` series.
- Security — vulnerability reports: see SECURITY.
- Contributing — see Contributing for setup, testing, and PR guidelines.