
Python API Reference

Every signature, parameter table, and docstring on this page is rendered live from the installed parot package. If the code changes, this page changes on the next mkdocs build — no manual edits.

For usage-oriented explanations and tutorials, see the Guide. For conceptual background, see Concepts.

Breaking change in 0.3.0: the low-level, mechanism-leaky top-level primitives were removed. Build an Index and call its methods. The previous top-level search class was renamed to Index, and the .locate* methods were renamed to .find_all* / .find_segments*.


Build an index once from a string, bytes, list of segments, or pandas Series, then query patterns in time proportional only to pattern length — independent of text length. The memory_compactness tuning knob trades memory for find_all() speed:

memory_compactness   Memory       find_all() speed
0 (default)          ~5× text     Fastest
2                    ~1.8× text   Moderate
4                    ~1.3× text   Slowest

See the Guide for when to use each level.
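The multipliers above also give a quick capacity estimate. A back-of-envelope helper (the function name estimated_index_bytes is ours, and the figures are the approximate "~" multipliers from the table, so treat results as rough):

```python
# Rough memory estimate for an Index, using the approximate multipliers
# from the table above. These are "~" figures, not guarantees.
MEMORY_MULTIPLIER = {0: 5.0, 2: 1.8, 4: 1.3}

def estimated_index_bytes(text_len_bytes: int, memory_compactness: int = 0) -> int:
    """Approximate heap usage of an Index over text_len_bytes of input."""
    return int(text_len_bytes * MEMORY_MULTIPLIER[memory_compactness])

# A 100 MB corpus at the default level needs roughly 500 MB:
print(estimated_index_bytes(100_000_000))     # 500000000
print(estimated_index_bytes(100_000_000, 4))  # 130000000
```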

Index

Index(data: str | bytes, memory_compactness: int = 0, case_insensitive: bool = False, normalize_whitespace: bool = False)

Compressed full-text index with O(p) pattern matching.

Can be built from str, bytes, or multi-document sources (from_strings, from_arrow, from_pyarrow, from_polars, from_dataframe). search() and batch_search() work with all construction paths.

Use from_strings() or from_series() to build with stored boundaries, enabling segment-aware methods without explicit boundary arguments.

extract

extract(pattern: str | bytes, max_context: int = 100) -> dict

Extract matched text with forward context as a columnar Arrow dict.

Returns the same shape as one element of batch_extract: keys text_bytes (np.ndarray[uint8]) and text_offsets (np.ndarray[uint32], length n_hits + 1). Use materialize_strings(result, "text") to recover a list[str].
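The text_bytes/text_offsets pair is the standard Arrow string-column layout: string i is the byte slice offsets[i]:offsets[i+1] of the concatenated buffer. A pure-numpy sketch of what materialize_strings(result, "text") presumably does (illustration, not parot's actual implementation):

```python
import numpy as np

def materialize(result: dict, key: str = "text") -> list[str]:
    # Arrow string layout: string i is bytes[offsets[i]:offsets[i+1]].
    data = result[f"{key}_bytes"]        # np.ndarray[uint8], all strings concatenated
    offsets = result[f"{key}_offsets"]   # np.ndarray[uint32], length n_hits + 1
    buf = data.tobytes()
    return [buf[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(len(offsets) - 1)]

# Two hits, "foo" and "bar", packed columnar:
demo = {
    "text_bytes": np.frombuffer(b"foobar", dtype=np.uint8),
    "text_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
print(materialize(demo))  # ['foo', 'bar']
```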

search

search(pattern: str | bytes, context: int = 50) -> dict

Search and return structured hits as a columnar Arrow dict.

Returns the same shape as one element of batch_search: positions, starts, ends (np.ndarray[uint64]); matched_bytes/matched_offsets, before_bytes/before_offsets, after_bytes/after_offsets (Arrow string columns). Works with every construction path. Use materialize_strings(result, "matched") to recover hit strings.

filter

filter(pattern: str | bytes) -> dict

Return matching segments as columnar Arrow dict: text_bytes + text_offsets.

grep

grep(pattern: str | bytes) -> dict

Return matching segment IDs + text as columnar Arrow dict: segment_ids + text_bytes + text_offsets.

batch_extract

batch_extract(patterns: Sequence[str | bytes], max_context: int = 100) -> list[dict]

Extract matched text with forward context for many patterns.

Returns one columnar Arrow dict per pattern with keys text_bytes (np.ndarray[uint8]) and text_offsets (np.ndarray[uint32], length n_hits + 1). Use materialize_strings(d, "text") per record.

batch_search

batch_search(patterns: Sequence[str | bytes], context: int = 50) -> list[dict]

Batch structured search with Arrow-shape columnar results.

Returns one dict per pattern, each with positions/starts/ends (np.ndarray[uint64]) and Arrow string columns matched_bytes/matched_offsets, before_bytes/before_offsets, after_bytes/after_offsets. Works with every construction path.

find

find(pattern: str | bytes) -> int

Like str.find: first codepoint offset of pattern, or -1 if absent.

findall

findall(pattern: str | bytes) -> list[int]

Sorted list of every position of pattern. Plain Python ints.
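On small inputs the result should agree with naive str scanning. A reference for the findall contract, assuming overlapping occurrences are reported (an assumption consistent with "every position"):

```python
def findall_ref(text: str, pattern: str) -> list[int]:
    # Reference semantics: every offset where pattern occurs,
    # overlapping occurrences included, in sorted order.
    out, i = [], text.find(pattern)
    while i != -1:
        out.append(i)
        i = text.find(pattern, i + 1)
    return out

print(findall_ref("banana", "ana"))  # [1, 3] (overlapping hits both reported)
```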

index

index(pattern: str | bytes) -> int

Like str.index: first position of pattern or ValueError.

finditer

finditer(pattern: str | bytes, *, eager: bool = False) -> Iterator[Match]

Iterator of Match objects, lazy by default.

kwic

kwic(pattern: str | bytes, context: int = 40, *, framework: str = 'pandas') -> Any

Key-word-in-context DataFrame (pandas default; framework='polars' supported).

summary

summary(patterns: Sequence[str | bytes], *, framework: str = 'pandas') -> Any

Per-pattern prevalence DataFrame: total_count, segment_count, segment_fraction.
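The three columns can be pinned down with a naive pandas reference over a list of segments (summary_ref is a hypothetical helper; whether parot counts overlapping occurrences is not stated on this page, so this sketch uses non-overlapping str.count):

```python
import pandas as pd

def summary_ref(segments: list[str], patterns: list[str]) -> pd.DataFrame:
    # Reference for the three documented columns, computed by brute force.
    rows = []
    for p in patterns:
        counts = [s.count(p) for s in segments]       # per-segment occurrences
        seg_hits = sum(c > 0 for c in counts)          # segments containing p
        rows.append({
            "pattern": p,
            "total_count": sum(counts),
            "segment_count": seg_hits,
            "segment_fraction": seg_hits / len(segments),
        })
    return pd.DataFrame(rows)

docs = ["the cat sat", "a cat and a cat", "no match here"]
print(summary_ref(docs, ["cat"]))
```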

incremental

incremental() -> IncrementalSearch

Create a new forkable incremental match state rooted at this index.

set_license_key staticmethod

set_license_key(key: str) -> None

Install a license for the current process. Required before any .save() / .load() / .from_bytes() / .to_bytes() call on an encrypted .parot artifact.


Duplicate phrase detection

Find repeated passages in a single document or across a batch of documents. These functions return columnar dicts with numpy arrays and Arrow-style string columns — see Concepts → Columnar Arrow Format for the layout.

find_duplicates

find_duplicates(text: str | bytes, sanitized_text: Optional[str | bytes] = None, removed_ranges: Optional[list[tuple[int, int]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> dict

find_duplicates_normalized

find_duplicates_normalized(text: str | bytes, case_insensitive: bool = False, collapse_whitespace: bool = True, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict

find_duplicates_from_path

find_duplicates_from_path(path: str, case_insensitive: Optional[bool] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, collapse_whitespace: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict

Memory-map a file and run the duplicate-detection pipeline against it.

Lower-overhead entry point for very large inputs: avoids reading the file into a Python str object before passing it across the FFI boundary. Returns the same columnar dict shape as find_duplicates.

batch_find_duplicates

batch_find_duplicates(texts: list[str | bytes], sanitized_texts: Optional[list[str | bytes]] = None, removed_ranges_per_text: Optional[list[list[tuple[int, int]]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> list[dict]

Parallelized version of find_duplicates over a list of texts.

Runs the full single-text duplicate-detection pipeline on each element of texts independently, using rayon under the hood when there are 4 or more inputs. Returns a list of columnar Arrow dicts, one per input text, each with the same shape as find_duplicates.

batch_find_duplicates_normalized

batch_find_duplicates_normalized(texts: list[str | bytes], case_insensitive: bool = False, collapse_whitespace: bool = False, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> list[dict]

Parallelized version of find_duplicates_normalized over a list of texts.
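As a mental model of what these functions look for, here is a naive word-window sketch (repeated_passages is ours; parot's pipeline layers block detection, sentence clipping, and dedup on top of the core question "which passages repeat?"):

```python
from collections import defaultdict

def repeated_passages(text: str, min_words: int = 3) -> dict[str, list[int]]:
    # Naive reference: every min_words-long word window that occurs more
    # than once, mapped to its starting word positions.
    words = text.split()
    seen = defaultdict(list)
    for i in range(len(words) - min_words + 1):
        seen[" ".join(words[i:i + min_words])].append(i)
    return {phrase: pos for phrase, pos in seen.items() if len(pos) > 1}

text = "to be or not to be or not that is it"
print(repeated_passages(text, min_words=4))  # {'to be or not': [0, 4]}
```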


Cross-document similarity and shared passages

Score how much two or more documents share, and extract the matching passages themselves.

text_similarity

text_similarity(text_a: str, text_b: str) -> float

batch_text_similarity

batch_text_similarity(reference: str, candidates: Sequence[str]) -> NDArray[float64]

common_passages

common_passages(text_a: str, text_b: str) -> dict

batch_common_passages

batch_common_passages(reference: str, candidates: Sequence[str]) -> list[dict]

longest_common_substring

longest_common_substring(text1: str | bytes, text2: str | bytes) -> str
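For intuition, the classic dynamic-programming reference agrees with this primitive on the answer, though it runs in O(len(text1) * len(text2)) rather than index time:

```python
def lcs_ref(a: str, b: str) -> str:
    # Classic DP: row[j] holds the length of the common suffix of
    # a[:i+1] and b[:j+1]; track the best (length, end-in-a) pair.
    best_len, best_end, prev = 0, 0, [0] * (len(b) + 1)
    for i, ca in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b):
            if ca == cb:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best_len:
                    best_len, best_end = cur[j + 1], i + 1
        prev = cur
    return a[best_end - best_len:best_end]

print(repr(lcs_ref("the quick brown fox", "a quick brown dog")))  # ' quick brown '
```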

Convenience primitives

Small helpers that build an ephemeral index per call. Prefer constructing an Index and reusing it when you need more than one query.

count

count(data: bytes, pattern: bytes) -> int

search_range

search_range(data: bytes, pattern: bytes) -> tuple[int, int]

unique_fragment_count

unique_fragment_count(text: str | bytes) -> int

Count unique fragments. Operates on UTF-16 code units for str, UTF-8 bytes for bytes.

unique_fragment_count_bytes

unique_fragment_count_bytes(data: bytes | str) -> int

Count unique fragments operating on raw bytes.


DataFrame helpers

Convert columnar results into pandas/polars DataFrames for downstream analysis.

result_to_frame

result_to_frame(result: dict, framework: str = 'pandas') -> Any

Convert an Arrow columnar dict to a pandas or polars DataFrame.

Auto-detects string columns (_bytes/_offsets pairs) and materializes them. Fixed-size numpy columns are kept as-is.

Parameters:

Name        Type   Description                                      Default
result      dict   Columnar dict from search/extract/filter/grep.   required
framework   str    "pandas" (default) or "polars".                  'pandas'
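The auto-detection rule can be sketched in a few lines of numpy and pandas (to_frame_ref is our reimplementation for illustration, not parot's code):

```python
import numpy as np
import pandas as pd

def to_frame_ref(result: dict) -> pd.DataFrame:
    cols = {}
    for key, val in result.items():
        if key.endswith("_bytes") and key[:-6] + "_offsets" in result:
            # A _bytes/_offsets pair: materialize into Python strings.
            name = key[:-6]
            buf = np.asarray(val, dtype=np.uint8).tobytes()
            off = result[name + "_offsets"]
            cols[name] = [buf[off[i]:off[i + 1]].decode("utf-8")
                          for i in range(len(off) - 1)]
        elif key.endswith("_offsets"):
            continue  # consumed together with its _bytes partner
        else:
            cols[key] = val  # fixed-size numpy column, kept as-is
    return pd.DataFrame(cols)

demo = {
    "positions": np.array([4, 10], dtype=np.uint64),
    "matched_bytes": np.frombuffer(b"catcat", dtype=np.uint8),
    "matched_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
print(to_frame_ref(demo))
```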

result_to_strings

result_to_strings(result: dict, key: str = 'text') -> list[str]

Materialize one Arrow string column from a columnar dict into a list[str].

batch_result_to_frames

batch_result_to_frames(results: list[dict], framework: str = 'pandas') -> list[Any]

Apply result_to_frame to each element of a batch result list.

batch_contains_frame

batch_contains_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any

Return a DataFrame (n_segments x n_patterns) of boolean contains flags.

batch_count_frame

batch_count_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any

Return a DataFrame (n_segments x n_patterns) of per-segment occurrence counts.

batch_summary_frame

batch_summary_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any

Return a summary DataFrame with one row per pattern.

materialize_strings

materialize_strings(col: dict, key: str) -> list[str]

Accelerated string type

fast_str

fast_str(value: str = '')

Bases: str

str subclass with O(p) substring operations via a lazy Index.

See module docstring for the design rationale and the full list of accelerated methods.

idx property

idx: Index

The underlying Index, built on first access and cached.

Access this directly if you want the full Index API:

text.idx.batch_count(["a", "b", "c"])
text.idx.search("Scrooge", context=100)
text.idx.save("dickens.parot")

index_built property

index_built: bool

True if the index has already been built (no-op cost).

memory_bytes property

memory_bytes: int | None

Index heap usage in bytes, or None if the index isn't built yet.

build_index

build_index() -> fast_str

Force the index build now (instead of waiting for first use).

Useful when you want predictable timings — e.g. before a hot loop. Returns self for chaining.

invalidate

invalidate() -> None

Drop the cached Index. Frees memory; the next accelerated call rebuilds it.
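The build-on-first-access lifecycle can be pictured with a plain-Python analogue (LazyIndexed and its toy character index are hypothetical; fast_str's real index is the parot Index):

```python
class LazyIndexed:
    # Minimal analogue of fast_str's lifecycle: the expensive structure is
    # built on first access, cached, reported via index_built, and dropped
    # again by invalidate().
    def __init__(self, text: str):
        self._text = text
        self._idx = None

    @property
    def index_built(self) -> bool:
        return self._idx is not None

    @property
    def idx(self) -> dict:
        if self._idx is None:
            # Stand-in "index": positions of every character.
            self._idx = {}
            for i, ch in enumerate(self._text):
                self._idx.setdefault(ch, []).append(i)
        return self._idx

    def invalidate(self) -> None:
        self._idx = None

s = LazyIndexed("abracadabra")
assert not s.index_built   # nothing built yet
print(s.idx["a"])          # [0, 3, 5, 7, 10]; the build happens on this access
```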

findall

findall(sub: str) -> list[int]

Sorted list[int] of every occurrence of sub. (re.findall feel.)

finditer

finditer(sub: str, *, eager: bool = False) -> Iterator[Match]

Iterator of Match objects, lazy by default. (re.finditer feel.)

kwic

kwic(pattern: str, context: int = 40, *, framework: str = 'pandas') -> Any

Key-word-in-context DataFrame. framework="pandas"|"polars".

summary

summary(patterns: Sequence[str], *, framework: str = 'pandas') -> Any

Pattern prevalence DataFrame. framework="pandas"|"polars".

text module-attribute

text = fast_str

Incremental search

Forkable incremental pattern matcher on top of an Index. Obtain one via Index.incremental(). Useful for building regex engines, bidirectional search, and co-occurrence probes on top of Parot's search index.

IncrementalSearch

Forkable incremental pattern matcher on top of an Index.

Constructed via Index.incremental(). Holds a strong reference to the parent Index. Methods mutate in place (extend/extend_byte) or produce an independent copy (fork). Indexes are static — this does NOT add new documents to an existing index.

extend

extend(data: str | bytes) -> bool

Extend the pattern by appending data. Returns True if the match range is still non-empty after the extension.

extend_byte

extend_byte(byte: int) -> bool

Extend the pattern by a single byte (0-255). Returns True if the match range is still non-empty.

fork

Return an independent copy of this search state.

count

count() -> int

Number of occurrences of the current pattern.

range

range() -> tuple[int, int]

Current search range (lo, hi) in internal coordinates.

is_empty

is_empty() -> bool

True if the match range is empty (no occurrences).

pattern_len

pattern_len() -> int

Length of the current pattern (bytes appended so far).

locate

locate() -> NDArray[uint64]

Positions of every occurrence of the current pattern.
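The contract above can be modelled with a naive rescanning class (reference semantics only; the real IncrementalSearch presumably narrows an internal range instead of rescanning, which is what makes extend cheap):

```python
class IncrementalRef:
    # Naive model of the IncrementalSearch contract: extend() grows the
    # pattern and reports whether any occurrence survives; fork() copies
    # the state; count()/locate() answer for the current pattern.
    def __init__(self, text: str, pattern: str = ""):
        self._text, self._pattern = text, pattern

    def extend(self, data: str) -> bool:
        self._pattern += data
        return not self.is_empty()

    def fork(self) -> "IncrementalRef":
        return IncrementalRef(self._text, self._pattern)

    def locate(self) -> list[int]:
        out, i = [], self._text.find(self._pattern)
        while i != -1:
            out.append(i)
            i = self._text.find(self._pattern, i + 1)
        return out

    def count(self) -> int:
        return len(self.locate())

    def is_empty(self) -> bool:
        return self.count() == 0

state = IncrementalRef("banana")
state.extend("an")                    # True: "an" occurs
branch = state.fork()
branch.extend("x")                    # False: "anx" never occurs
print(state.count(), branch.count())  # 2 0
```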


Result types

Match

Match(idx: Index, start: int, end: int, match: str)

A single Index hit. Lazy by default.

Attributes

start : int
Codepoint offset where the match begins in the source text.

end : int
Codepoint offset where the match ends (exclusive).

match : str
The matched substring (always equal to the pattern that produced this Match; stored, not re-fetched).

span : tuple[int, int]
(start, end); matches re.Match.span().

Methods

before(n) / after(n)
Fetch n codepoints of context on either side.

expand(n)
Return a KwicHit(before, match, after) triple, n codepoints on each side. This is the on-demand replacement for re-running search with a wider context window.

group(0)
Return match (re.Match compatibility shim).

Notes

Match objects do not hold a strong reference to the source text — they hold the Index, which holds the source. Dropping the Index invalidates all outstanding Matches.

KwicHit

Bases: NamedTuple

Key-word-in-context: a single hit's (before, match, after) triple.

Returned by Match.expand(n). The before and after slices are codepoint-bounded (so len(before) <= n and they're real Python strs you can slice / display directly).
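A sketch of the expand(n) contract on a plain string, with KwicHit reconstructed as a NamedTuple per the description above (clamping at the text boundaries is what makes len(before) <= n rather than == n; expand_ref is our stand-in for the method):

```python
from typing import NamedTuple

class KwicHit(NamedTuple):
    before: str
    match: str
    after: str

def expand_ref(text: str, start: int, end: int, n: int) -> KwicHit:
    # Clamp at the boundaries so before/after are at most n codepoints.
    return KwicHit(text[max(0, start - n):start],
                   text[start:end],
                   text[end:end + n])

hit = expand_ref("Marley was dead: to begin with.", 7, 10, 5)
print(hit)  # KwicHit(before='rley ', match='was', after=' dead')
```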