
Parot

An extremely fast text search and analysis library, written in Rust, with Python, JavaScript/WASM, and CLI frontends. Build an index once, query it millions of times.

License: Elastic License 2.0 TestPyPI npm@beta

Benchmark

Parot's 187 ms index build happens once — queries are constant-time regardless of corpus size.

17 competitors on 10 MB of Wikipedia-EN. One index, many queries. Every position cross-verified against every competitor — mismatches abort the run.

Tool Lang 100 phrases 1,000 phrases 10,000 phrases
Parot Rust 0.34 ms 1.00× 4.59 ms 1.00× ~46 ms 1.00×
ripgrep* Rust 123 ms 361× 93.4 ms 20× ~250 ms
ahocorasick-rs Rust 53.7 ms 158× 118 ms 26× ~180 ms
pyahocorasick Python 116 ms 339× 158 ms 34× ~250 ms
stringzilla Python 42.6 ms 125× 424 ms 92× ~4.2 s 91×
modern-ahocorasick JS 341 ms 998× 442 ms 96× 695 ms 15×
bytes.find loop Python 383 ms 1,124× 4.1 s 894× ~41 s 891×
String.indexOf JS 403 ms 1,181× 4.3 s 929× 43.3 s 941×
str.find loop Python 427 ms 1,253× 4.5 s 973× ~45 s 978×
pyarrow count_substring Python 1,336 ms 3,918× 14.4 s 3,129× ~144 s 3,130×
RegExp.exec /g JS 396 ms 1,161× 24.5 s 5,329× 256 s 5,565×
String.matchAll JS 394 ms 1,156× 24.8 s 5,403× 258 s 5,609×
regex (mrab) Python 449 ms 1,317× 30.5 s 6,640× ~305 s 6,630×
google-re2* Python 3,027 ms 8,877× 31.6 s 6,870× ~316 s 6,870×
re.finditer Python 677 ms 1,985× 71.4 s 15,551× ~714 s 15,520×
polars count_matches Python 1,457 ms 4,273× 106 s 23,134× ~1,060 s 23,040×
pandas str.count Python 1,709 ms 5,011× 109 s 23,672× ~1,090 s 23,700×

Apple M2 Pro · 100, 1,000 and 10,000 phrases of 2–20 words · median of 3 runs. Reproduce with just bench-hero. *soft parity (non-overlapping multi-pattern semantics). ~ in the 10,000 column = projected from 1,000-phrase scaling, pending a real run.

→ Live browser demo → Full benchmark suite

Why it's fast

Parot is sublinear: after a one-time O(N) build, each query is O(m + k) — where m is the pattern length and k is the number of matches. Every other tool in the table is O(N) per query; double the corpus, double the time. Parot's query time doesn't change.

That's why the gap grows with your text — 125× on 10 MB, thousands of times on a gigabyte.


Install

Parot is in pre-release. Python ships to TestPyPI; JavaScript ships to npm under the beta dist-tag. Both move to their primary registries once the API stabilises.

The --extra-index-url flag lets pip pull Parot's runtime deps (numpy, loguru, rich, typer) from PyPI:

pip install -i https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple/ \
            parot

Or with uv:

uv pip install -i https://test.pypi.org/simple/ \
               --extra-index-url https://pypi.org/simple/ \
               parot
npm install parot@beta          # or: pnpm add parot@beta / yarn add parot@beta

Clone and build (not yet on Homebrew or crates.io):

git clone https://github.com/sophiaconsulting/parot.git && cd parot
cargo install --path crates/cli

Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).


Quick start

Drop-in string replacement. Wrap any string: every str method still works, the slow ones become microsecond-latency queries, and findall, finditer, kwic, and summary come on top.

import parot

text = parot.text(open("genome.fa").read())   # wraps str, index built lazily

text.count("ATTGCC")          # microsecond-latency
"CRISPR" in text              # microsecond-latency
text.findall("TATA")          # sorted list of all positions
for m in text.finditer("GATTACA"):
    print(m.before(40), m.match, m.after(40))   # with surrounding context

DataFrames. Replace df["col"].str.contains() with one line. Up to 36,000× faster.

import pandas as pd
import parot                          # registers .parot accessor

df = pd.read_parquet("logs.parquet")              # 500K rows

mask = df["message"].str.contains("error")        # 1,200 ms
mask = df.parot.contains("message", "error")      # 0.03 ms — 36,000× faster

df.parot.batch_count("message", ["error", "warning", "fatal", "timeout"])

Power-user API. Full index with numpy arrays, batch operations, and serialization.

from parot import Index

index = Index(open("shakespeare.txt").read())    # build once
index.count("the king")                          # ~0.003 ms
index.find_all("my lord")                        # numpy array of positions
index.search("good sir", context=50)             # with surrounding context
index.batch_count(["king", "queen", "duke"])     # parallel over patterns
index.save("shakespeare.parot")                  # resumable, cross-platform

Duplicates, similarity, analysis.

import parot

parot.find_duplicates(text, min_words=4)              # repeated phrases
parot.common_passages(manuscript, reference)          # shared passages
parot.text_similarity(manuscript, reference)          # float 0..1
parot.longest_common_substring(draft_a, draft_b)
parot.unique_fragment_count(corpus)

JavaScript / WASM. Same engine, any browser or Node.js. Indices serialize across Python and JS.

import { Index } from 'parot';

const index = new Index(text, 0);
index.count('pattern');                          // microsecond-latency
index.findAll('pattern');                        // Uint32Array of positions
index.search('pattern', 50);                     // with surrounding context
index.batchCount(['error', 'warning', 'fatal']); // multi-pattern
const bytes = index.serialize();                 // interchange with Python

CLI.

parot scan    manuscript.md --top 20   # duplicate phrases
parot search  corpus.txt "pattern"     # positions in a file
parot count   corpus.txt "pattern"     # occurrence count
parot lcs     a.md b.md                # longest common substring
parot info    manuscript.md            # size, word count, distinct substrings

Full Python API · Full JavaScript API · CLI Reference


What's in the box

Parot is not just "fast grep" — it's a text-analysis toolkit. Everything below is a single call on an index you already built.

Search
Capability Python JS CLI
count, find, find_all, index, __contains__ — single-pattern
search, extract — matches with surrounding context
finditer — lazy iterator with .before() / .after() per match
batch_count, batch_find_all, batch_search, batch_extract — multi-pattern, parallel
kwic — Key-Word-In-Context DataFrame
summary — pattern prevalence DataFrame
Multi-document / segment-aware
Capability Python JS CLI
Index.from_strings, from_series, from_dataframe, from_arrow, from_pyarrow, from_polars
contains_mask, find_segments, count_per_segment — per-segment answers
batch_contains_mask, batch_count_per_segment, batch_find_segments
filter, grep — return the segments (and IDs) matching a pattern
Duplicates & repetition
Capability Python JS CLI
find_duplicates — repeated phrases, word-boundary aware, sentence-clipped
find_duplicates_normalized — with case folding + whitespace collapsing
find_duplicates_from_path — memory-mapped, never loads the file into the Python heap
batch_find_duplicates — many documents in parallel
Similarity & comparison
Capability Python JS CLI
common_passages — every shared passage between two documents + coverage
batch_common_passages — one reference vs. many candidates
text_similarity, batch_text_similarity — scalar coverage score
longest_common_substring — longest shared run between two texts
unique_fragment_count — structural corpus fingerprint
DataFrame integration
Capability Pandas Polars
series.parot.contains / count / find_all
series.parot.batch_contains / batch_count — multi-pattern, one result frame
df.parot.contains / count / batch_contains / batch_count — column-level accessors
pl.col("text").parot.contains / count — expression-level in lazy queries
Persistent per-column index cache with build_index / invalidate / memory_bytes
result_to_frame, batch_result_to_frames, batch_summary_frame — Arrow → DataFrame
Persistence & interchange
Capability Python JS
Index.save / Index.load — file-backed, resumable
Index.to_bytes / Index.from_bytes — in-memory serialization
Cross-platform format: a Python-built index loads in the browser and vice versa

Configuration

  • Memory / speed knob. memory_compactness trades find_all latency for a smaller RAM footprint (0 = fastest, 4 = most compact).
  • Case-insensitive matching via a build-time flag, applied to both corpus and queries.
  • Whitespace normalization — results remap back to original positions.
  • Introspection. len(idx), idx.memory_bytes, idx.config, idx.has_segments, idx.segment_count, idx[i] character access.
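The position-remapping behind whitespace normalization can be sketched in a few lines of plain Python. This is an illustration of the remap idea only, not Parot's implementation: collapse whitespace runs while recording, for each normalized position, the offset it came from.

```python
def normalize_with_map(text):
    """Collapse whitespace runs to single spaces, recording for each
    normalized position the offset it came from in the original text."""
    norm, mapping = [], []
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:          # keep only the first char of a run
                norm.append(" ")
                mapping.append(i)
            prev_space = True
        else:
            norm.append(ch)
            mapping.append(i)
            prev_space = False
    return "".join(norm), mapping

orig = "good\n\n   sir"
norm, mapping = normalize_with_map(orig)
pos = norm.find("good sir")    # match found in the normalized view
print(mapping[pos])            # remapped to the original offset
```

A match found in the normalized view is reported at `mapping[pos]`, so results always point into the caller's original text.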

The substring gap

Token-based search libraries (fuse.js, lunr, minisearch) split text into words. They can't find arbitrary substrings:

Query fuse.js lunr minisearch Parot
"rown fox"
"ATTA"
"Script dev"
"ghbor-no"
"ix arr"
"uick brown f"
Score 0/6 2/6 2/6 6/6
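The failure mode is easy to reproduce: a token index only knows whole words, so any query that starts or ends mid-word, or crosses a word boundary, cannot match. A minimal illustration in Python:

```python
text = "The quick brown fox jumps over the lazy dog"

# Token-based indexing (the fuse.js/lunr/minisearch model): whole words only.
tokens = set(text.lower().split())
print("fox" in tokens)        # whole word: found
print("rown fox" in tokens)   # crosses a word boundary: missed

# Substring search sees every character offset.
print("rown fox" in text)     # found
```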

Parot's WASM build runs in any browser — no server round-trip, no backend needed.


When to use Parot

  • You query the same text multiple times
  • Text is large (>100 KB) — the bigger, the more you win
  • You need substring search, not just word search
  • You're replacing pandas/polars string operations on a column
  • You're finding repeated phrases (writing, plagiarism, LLM dedup, moderation)
  • You need client-side full-text search in the browser via WASM

If you'd use str.find() or grep today, Parot is a drop-in acceleration. If you'd use Elasticsearch today, keep using Elasticsearch.


Ecosystem

Projects built on Parot:

Project Description
fast-diff Structural document diff
fast-regex Index-accelerated regular expressions
fast-dedup Large-scale document deduplication
fast-fuzz Fuzzy string matching at scale
qzip Block-transform compression

FAQ

When should I use Parot instead of Elasticsearch?

When your queries don't align with word boundaries: arbitrary substring search, duplicate phrase detection, non-tokenizable data (DNA, binary protocols), client-side WASM search, or one-shot analysis.

Is Parot a good fit for my workload?

Parot is designed for workloads where the same text is queried repeatedly. If you query each document once and discard it, a linear scanner may be the right tool. If you query each document many times — or scan many patterns against one corpus — Parot wins, and the gap grows without bound.
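The build cost amortizes quickly. Using the hero-benchmark numbers (187 ms build; 0.34 ms for Parot vs 123 ms for ripgrep per 100 phrases) and assuming per-phrase cost scales linearly from that column, a back-of-envelope break-even:

```python
build_ms = 187.0         # one-time index build
parot_ms = 0.34 / 100    # per phrase, from the 100-phrase column
scan_ms = 123.0 / 100    # ripgrep, same column

# Break-even n: build_ms + n * parot_ms == n * scan_ms
n = build_ms / (scan_ms - parot_ms)
print(round(n))          # queries until the build has paid for itself
```

Roughly 150 single-phrase queries and the index is ahead; every query after that is nearly free.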

Does it work in the browser?

Yes. The WASM build runs in any modern browser and Node.js with the same API as the native library. See the live browser demo.


Licensing

Parot is distributed under the Elastic License 2.0 — a source-available license that keeps the code open for almost every real-world use while protecting the project from being resold as a hosted service.

Free, no contract or key required: production use (including commercial), shipping Parot inside a closed-source product, internal tools, research, academic work, and open-source projects.

Three restrictions: no hosted reseller services, no circumventing the license-key gate on save / load, no stripping copyright or license notices.

What requires a key: saving and loading index artifacts (Index.save / load, to_bytes / from_bytes, JS serialize / deserialize). Build and query are free forever. Set PAROT_LICENSE_KEY=<key> before running any save/load.

  • Free trial key — rotated every 14 days. Fine for evaluation, research, CI, and reproducible benchmarks. Email hello@sophiaconsulting.ai and we'll send you the current key.
  • Commercial key — flat-rate, annual, bound to your team. Email hello@sophiaconsulting.ai with a one-line description of your use case.
  • Hosted-reseller terms — talk to us before shipping.

Stability, security, contributing

  • Stability — API contract documented in STABILITY. Pin to a tag or commit SHA for reproducible builds during the 0.x series.
  • Security — vulnerability reports: see SECURITY.
  • Contributing — see Contributing for setup, testing, and PR guidelines.