Python API Reference¶
Every signature, parameter table, and docstring on this page is rendered live from the installed
parot package. If the code changes, this page changes on the next mkdocs build; no manual edits are needed.
For usage-oriented explanations and tutorials, see the Guide. For conceptual background, see Concepts.
Breaking change in 0.3.0: the low-level, mechanism-leaky top-level primitives were removed. Build an
Index and call its methods. The previous top-level search class was renamed to Index, and the .locate* methods were renamed to .find_all* / .find_segments*.
Index — compressed full-text search¶
Build an index once from a string, bytes, list of segments, or pandas Series,
then query patterns in time proportional only to pattern length — independent
of text length. The memory_compactness tuning knob trades memory for find_all()
speed:
| memory_compactness | Memory | find_all() speed |
|---|---|---|
| 0 (default) | ~5× text | Fastest |
| 2 | ~1.8× text | Moderate |
| 4 | ~1.3× text | Slowest |
See the Guide for when to use each level.
Index
¶
Index(data: str | bytes, memory_compactness: int = 0, case_insensitive: bool = False, normalize_whitespace: bool = False)
Compressed full-text index with O(p) pattern matching.
Can be built from str, bytes, or multi-document sources (from_strings, from_arrow, from_pyarrow, from_polars, from_dataframe). search() and batch_search() work with all construction paths.
Use from_strings() or from_series() to build with stored boundaries, enabling segment-aware methods without explicit boundary arguments.
extract
¶
Extract matched text with forward context as a columnar Arrow dict.
Returns the same shape as one element of batch_extract: keys
text_bytes (np.ndarray[uint8]) and text_offsets
(np.ndarray[uint32], length n_hits + 1). Use
materialize_strings(result, "text") to recover a list[str].
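The text_bytes/text_offsets pair is the standard Arrow variable-size string layout: one flat byte buffer plus an offsets array delimiting each value. A minimal sketch of what materialize_strings does (this is a hypothetical reimplementation for illustration, not the library's code):

```python
import numpy as np

def materialize_text_column(result: dict, prefix: str) -> list[str]:
    # Decode an Arrow-style string column (<prefix>_bytes + <prefix>_offsets)
    # into a plain list[str]. Hit i spans offsets[i]..offsets[i+1].
    data = result[f"{prefix}_bytes"]        # uint8 buffer of UTF-8 bytes
    offsets = result[f"{prefix}_offsets"]   # uint32, length n_hits + 1
    return [
        bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
        for i in range(len(offsets) - 1)
    ]

# A hand-built result with two hits: "foo" and "bar"
result = {
    "text_bytes": np.frombuffer(b"foobar", dtype=np.uint8),
    "text_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
print(materialize_text_column(result, "text"))  # ['foo', 'bar']
```

Because the byte buffer is contiguous, n hits cost two array allocations rather than n string objects until you actually materialize.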
search
¶
Search and return structured hits as a columnar Arrow dict.
Returns the same shape as one element of batch_search: positions,
starts, ends (np.ndarray[uint64]); matched_bytes/
matched_offsets, before_bytes/before_offsets,
after_bytes/after_offsets (Arrow string columns). Works with
every construction path. Use materialize_strings(result, "matched")
to recover hit strings.
filter
¶
Return matching segments as columnar Arrow dict: text_bytes + text_offsets.
grep
¶
Return matching segment IDs + text as columnar Arrow dict: segment_ids + text_bytes + text_offsets.
batch_extract
¶
Extract matched text with forward context for many patterns.
Returns one columnar Arrow dict per pattern with keys text_bytes
(np.ndarray[uint8]) and text_offsets (np.ndarray[uint32],
length n_hits + 1). Use materialize_strings(d, "text") per
record.
batch_search
¶
Batch structured search with Arrow-shape columnar results.
Returns one dict per pattern, each with positions/starts/ends
(np.ndarray[uint64]) and Arrow string columns
matched_bytes/matched_offsets,
before_bytes/before_offsets,
after_bytes/after_offsets. Works with every construction path.
find
¶
Like str.find: first codepoint offset of pattern, or -1 if absent.
findall
¶
Sorted list of every position of pattern. Plain Python ints.
finditer
¶
finditer(pattern: str | bytes, *, eager: bool = False) -> Iterator[Match]
Iterator of Match objects, lazy by default.
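"Lazy by default" means no hit is located until the consumer asks for it. A plain-Python sketch of that iteration contract (the real finditer yields Match objects backed by the index, not tuples):

```python
from typing import Iterator

def lazy_finditer(text: str, pattern: str) -> Iterator[tuple[int, int]]:
    # Yield (start, end) spans one at a time; nothing past the current
    # hit is computed until next() is called again.
    pos = text.find(pattern)
    while pos != -1:
        yield pos, pos + len(pattern)
        pos = text.find(pattern, pos + 1)

hits = lazy_finditer("the cat sat on the mat", "at")
print(next(hits))   # (5, 7) -- only the first hit has been computed
print(list(hits))   # [(9, 11), (20, 22)] -- the rest, on demand
```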
kwic
¶
Key-word-in-context DataFrame (pandas default; framework='polars' supported).
summary
¶
Per-pattern prevalence DataFrame: total_count, segment_count, segment_fraction.
incremental
¶
incremental() -> IncrementalSearch
Create a new forkable incremental match state rooted at this index.
set_license_key
staticmethod
¶
Install a license for the current process. Required before any
.save() / .load() / .from_bytes() / .to_bytes() call on
an encrypted .parot artifact.
Duplicate phrase detection¶
Find repeated passages in a single document or across a batch of documents.
These functions return columnar dicts with numpy arrays and Arrow-style
string columns — see Concepts → Columnar Arrow Format for
the layout.
find_duplicates
¶
find_duplicates(text: str | bytes, sanitized_text: Optional[str | bytes] = None, removed_ranges: Optional[list[tuple[int, int]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> dict
find_duplicates_normalized
¶
find_duplicates_normalized(text: str | bytes, case_insensitive: bool = False, collapse_whitespace: bool = True, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict
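The normalization the *_normalized entry points apply can be pictured as case folding plus whitespace-run collapsing. A sketch under that assumption (the library's exact folding rules may differ):

```python
import re

def normalize(text: str, case_insensitive: bool = False,
              collapse_whitespace: bool = True) -> str:
    # Assumed semantics: lowercase the text, then collapse every run of
    # whitespace to a single space.
    if case_insensitive:
        text = text.lower()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize("Hello\t\tWORLD\n", case_insensitive=True))  # 'hello world'
```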
find_duplicates_from_path
¶
find_duplicates_from_path(path: str, case_insensitive: Optional[bool] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, collapse_whitespace: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict
Memory-map a file and run the duplicate-detection pipeline against it.
Lower-overhead entry point for very large inputs: avoids reading the file
into a Python str object before passing it across the FFI boundary.
Returns the same columnar dict shape as find_duplicates.
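The memory-mapping trick is plain OS machinery: the file's bytes are exposed to the process without first copying them into a Python str. A self-contained demonstration using only the standard library (the search here is str.find-style, just to show the zero-copy access pattern):

```python
import mmap
import os
import tempfile

# Write a sample file, then scan it through a memory map.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"lorem ipsum dolor sit amet lorem ipsum")
    path = f.name

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first = mm.find(b"lorem ipsum")              # 0
    second = mm.find(b"lorem ipsum", first + 1)  # the repeated passage
    print(first, second)

os.unlink(path)
```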
batch_find_duplicates
¶
batch_find_duplicates(texts: list[str | bytes], sanitized_texts: Optional[list[str | bytes]] = None, removed_ranges_per_text: Optional[list[list[tuple[int, int]]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> list[dict]
Parallelized version of find_duplicates over a list of texts.
Runs the full single-text duplicate-detection pipeline on each element of
texts independently, using rayon under the hood when there are 4 or more
inputs. Returns a list of columnar Arrow dicts, one per input text, each
with the same shape as find_duplicates.
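The batch shape is embarrassingly parallel: the per-text pipeline runs independently on each input and the outputs line up with the inputs. The same structure in pure Python, with a toy stand-in for the pipeline (counting repeated words rather than duplicated passages):

```python
from concurrent.futures import ThreadPoolExecutor

def count_repeated_words(text: str) -> int:
    # Toy per-text "pipeline": how many distinct words occur more than once.
    words = text.split()
    return sum(1 for w in set(words) if words.count(w) > 1)

texts = ["a b a c", "x y z", "q q q q"]

# One result per input, in input order -- the shape batch_* functions return.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(count_repeated_words, texts))
print(results)  # [1, 0, 1]
```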
batch_find_duplicates_normalized
¶
batch_find_duplicates_normalized(texts: list[str | bytes], case_insensitive: bool = False, collapse_whitespace: bool = False, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> list[dict]
Parallelized version of find_duplicates_normalized over a list of texts.
Cross-document similarity and shared passages¶
Score how much two or more documents share, and extract the matching passages themselves.
batch_text_similarity
¶
batch_common_passages
¶
Convenience primitives¶
Small helpers that build an ephemeral index per call. Prefer constructing an
Index and reusing it when you need more than one query.
unique_fragment_count
¶
Count unique fragments. Operates on UTF-16 code units for str, UTF-8 bytes for bytes.
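Assuming "fragments" means distinct non-empty substrings, the counting semantics can be checked against a naive set-based version. This sketch is quadratic in memory and only illustrative; the library counts in compressed-index space, and for str it counts positions in UTF-16 code units, which matters for characters outside the BMP:

```python
def unique_fragment_count_naive(text: str) -> int:
    # Enumerate every non-empty substring and count the distinct ones.
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})

# "abab" has 7 distinct substrings: a, b, ab, ba, aba, bab, abab
print(unique_fragment_count_naive("abab"))  # 7
```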
unique_fragment_count_bytes
¶
Count unique fragments operating on raw bytes.
DataFrame helpers¶
Convert columnar results into pandas/polars DataFrames for downstream analysis.
result_to_frame
¶
Convert an Arrow columnar dict to a pandas or polars DataFrame.
Auto-detects string columns (_bytes/_offsets pairs) and materializes them. Fixed-size numpy columns are kept as-is.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | dict | Columnar dict from search/extract/filter/grep. | required |
| framework | str | "pandas" (default) or "polars". | 'pandas' |
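The conversion rule is mechanical: every *_bytes/*_offsets pair becomes one materialized string column, and fixed-size numpy columns pass through unchanged. A minimal reimplementation for illustration (not the library's code):

```python
import numpy as np
import pandas as pd

def result_to_frame_sketch(result: dict) -> pd.DataFrame:
    # Pair up *_bytes/*_offsets keys into string columns; keep
    # fixed-size numpy columns as-is.
    cols = {}
    for key, value in result.items():
        if key.endswith("_offsets"):
            continue  # consumed together with its *_bytes partner
        if key.endswith("_bytes"):
            name = key[: -len("_bytes")]
            data, offsets = value, result[f"{name}_offsets"]
            cols[name] = [
                bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
                for i in range(len(offsets) - 1)
            ]
        else:
            cols[key] = value
    return pd.DataFrame(cols)

result = {
    "positions": np.array([3, 17], dtype=np.uint64),
    "matched_bytes": np.frombuffer(b"catcat", dtype=np.uint8),
    "matched_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
df = result_to_frame_sketch(result)
print(df)  # two rows: positions [3, 17], matched ['cat', 'cat']
```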
result_to_strings
¶
Materialize one Arrow string column from a columnar dict into a list[str].
batch_result_to_frames
¶
Apply result_to_frame to each element of a batch result list.
batch_contains_frame
¶
batch_contains_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a DataFrame (n_segments x n_patterns) of boolean contains flags.
batch_count_frame
¶
batch_count_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a DataFrame (n_segments x n_patterns) of per-segment occurrence counts.
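The output shape is a segments-by-patterns count matrix. A naive stand-in built directly from segment strings, for illustration (str.count counts non-overlapping occurrences; the library counts against the index instead of rescanning):

```python
import pandas as pd

def batch_count_frame_sketch(segments: list[str],
                             patterns: list[str]) -> pd.DataFrame:
    # One row per segment, one column per pattern, cell = occurrence count.
    return pd.DataFrame(
        {p: [seg.count(p) for seg in segments] for p in patterns}
    )

segments = ["the cat sat", "dogs and cats", "no match here"]
df = batch_count_frame_sketch(segments, ["cat", "dog"])
print(df.shape)  # (3, 2)
```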
batch_summary_frame
¶
batch_summary_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a summary DataFrame with one row per pattern.
Accelerated string type¶
fast_str
¶
Bases: str
str subclass with O(p) substring operations via a lazy Index.
See module docstring for the design rationale and the full list of accelerated methods.
idx
property
¶
idx: Index
The underlying Index, built on first access and cached.
Access this directly if you want the full Index API:
```python
text.idx.batch_count(["a", "b", "c"])
text.idx.search("Scrooge", context=100)
text.idx.save("dickens.parot")
```
memory_bytes
property
¶
Index heap usage in bytes, or None if the index isn't built yet.
build_index
¶
build_index() -> fast_str
Force the index build now (instead of waiting for first use).
Useful when you want predictable timings — e.g. before a hot loop.
Returns self for chaining.
findall
¶
Sorted list[int] of every occurrence of sub. (re.findall feel.)
finditer
¶
finditer(sub: str, *, eager: bool = False) -> Iterator[Match]
Iterator of Match objects, lazy by default. (re.finditer feel.)
kwic
¶
Key-word-in-context DataFrame. framework="pandas"|"polars".
summary
¶
Pattern prevalence DataFrame. framework="pandas"|"polars".
Incremental search¶
Forkable incremental pattern matcher on top of an Index. Obtain one via
Index.incremental(). Useful for building regex engines, bidirectional
search, and co-occurrence probes on top of Parot's search index.
IncrementalSearch
¶
Forkable incremental pattern matcher on top of an Index.
Constructed via Index.incremental(). Holds a strong reference to the
parent Index. Methods mutate in place (extend/extend_byte) or produce
an independent copy (fork). Indexes are static — this does NOT add new
documents to an existing index.
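The fork/extend contract can be sketched over a plain string: extend() narrows the current match state by one character in place, fork() produces an independent copy that can diverge. This toy version rescans on every query; the real IncrementalSearch extends against the compressed index without rescanning (the count() method here is part of the sketch, not a documented API):

```python
class IncrementalSketch:
    def __init__(self, text: str, pattern: str = ""):
        self.text, self.pattern = text, pattern

    def extend(self, ch: str) -> "IncrementalSketch":
        # Mutate in place, like extend/extend_byte.
        self.pattern += ch
        return self

    def fork(self) -> "IncrementalSketch":
        # Independent copy: extending the fork leaves the parent untouched.
        return IncrementalSketch(self.text, self.pattern)

    def count(self) -> int:
        # Sketch-only helper: non-overlapping occurrences of the state.
        return self.text.count(self.pattern) if self.pattern else 0

state = IncrementalSketch("banana").extend("a").extend("n")
branch = state.fork().extend("a")   # diverges to "ana"
print(state.pattern, state.count())   # an 2
print(branch.pattern, branch.count()) # ana 1 (non-overlapping count)
```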
Result types¶
Match
¶
Match(idx: Index, start: int, end: int, match: str)
A single Index hit. Lazy by default.
Attributes¶
start : int
Codepoint offset where the match begins in the source text.
end : int
Codepoint offset where the match ends (exclusive).
match : str
The matched substring (always equal to the pattern that produced this
Match — stored, not re-fetched).
span : tuple[int, int]
(start, end) — matches re.Match.span().
Methods¶
before(n) / after(n) : fetch n codepoints of context on either side.
expand(n) : return a KwicHit(before, match, after) triple, n codepoints
on each side. This is the on-demand replacement for re-running
search with a wider context window.
group(0) : return match (re.Match compatibility shim).
Notes¶
Match objects do not hold a strong reference to the source text — they hold the Index, which holds the source. Dropping the Index invalidates all outstanding Matches.
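The expand(n) semantics above can be sketched with plain string slicing: up to n codepoints of context on each side, clamped at the text boundaries (which is why len(before) <= n). KwicHitSketch stands in for the library's KwicHit:

```python
from typing import NamedTuple

class KwicHitSketch(NamedTuple):
    before: str
    match: str
    after: str

def expand(text: str, start: int, end: int, n: int) -> KwicHitSketch:
    # Clamp the left edge at 0; Python slicing clamps the right edge.
    return KwicHitSketch(
        before=text[max(0, start - n):start],
        match=text[start:end],
        after=text[end:end + n],
    )

hit = expand("Marley was dead, to begin with.", 11, 15, 5)
print(hit)  # KwicHitSketch(before=' was ', match='dead', after=', to ')
```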
KwicHit
¶
Bases: NamedTuple
Key-word-in-context: a single hit's (before, match, after) triple.
Returned by Match.expand(n). The before and after slices are
codepoint-bounded (so len(before) <= n and they're real Python strs you
can slice / display directly).