Python API Reference¶
Every signature, parameter table, and docstring on this page is rendered live from the installed
parot package. If the code changes, this page changes on the next mkdocs build; no manual edits are needed.
For usage-oriented explanations and tutorials, see the Guide. For conceptual background, see Concepts.
Breaking change in 0.3.0: the low-level, mechanism-leaky top-level primitives were removed. Build an
Index and call its methods. The previous top-level search class was renamed to Index, and the .locate* methods were renamed to .find_all* / .find_segments*.
Index — compressed full-text search¶
Build an index once from a string, bytes, list of segments, or pandas Series,
then query patterns in time proportional only to pattern length — independent
of text length. The memory_compactness tuning knob trades memory for find_all()
speed:
| memory_compactness | Memory | find_all() speed |
|---|---|---|
| 0 (default) | ~5× text | Fastest |
| 2 | ~1.8× text | Moderate |
| 4 | ~1.3× text | Slowest |
See the Guide for when to use each level.
Index
¶
Index(data: str | bytes, memory_compactness: int = 0, case_insensitive: bool = False, normalize_whitespace: bool = False)
Compressed full-text index with O(p) pattern matching.
Can be built from str, bytes, or multi-document sources (from_strings, from_arrow, from_pyarrow, from_polars, from_dataframe). search() and batch_search() work with all construction paths.
Use from_strings() or from_series() to build with stored boundaries, enabling segment-aware methods without explicit boundary arguments.
extract
¶
Extract matched text with forward context as a columnar Arrow dict.
Returns the same shape as one element of batch_extract: keys
text_bytes (np.ndarray[uint8]) and text_offsets
(np.ndarray[uint32], length n_hits + 1). Use
materialize_strings(result, "text") to recover a list[str].
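The text_bytes/text_offsets pair is the standard Arrow variable-size string layout: one flat byte buffer plus an offsets array delimiting each value. A minimal sketch of what materialize_strings does (this is a hypothetical reimplementation for illustration, not the library's code):

```python
import numpy as np

def materialize_text_column(result: dict, prefix: str) -> list[str]:
    # Decode an Arrow-style string column (<prefix>_bytes + <prefix>_offsets)
    # into a plain list[str]. Hit i spans offsets[i]..offsets[i+1].
    data = result[f"{prefix}_bytes"]        # uint8 buffer of UTF-8 bytes
    offsets = result[f"{prefix}_offsets"]   # uint32, length n_hits + 1
    return [
        bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
        for i in range(len(offsets) - 1)
    ]

# A hand-built result with two hits: "foo" and "bar"
result = {
    "text_bytes": np.frombuffer(b"foobar", dtype=np.uint8),
    "text_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
print(materialize_text_column(result, "text"))  # ['foo', 'bar']
```

Because the byte buffer is contiguous, n hits cost two array allocations rather than n string objects until you actually materialize.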
search
¶
Search and return structured hits as a columnar Arrow dict.
Returns the same shape as one element of batch_search: positions,
starts, ends (np.ndarray[uint64]); matched_bytes/
matched_offsets, before_bytes/before_offsets,
after_bytes/after_offsets (Arrow string columns). Works with
every construction path. Use materialize_strings(result, "matched")
to recover hit strings.
filter
¶
Return matching segments as columnar Arrow dict: text_bytes + text_offsets.
grep
¶
Return matching segment IDs + text as columnar Arrow dict: segment_ids + text_bytes + text_offsets.
batch_extract
¶
Extract matched text with forward context for many patterns.
Returns one columnar Arrow dict per pattern with keys text_bytes
(np.ndarray[uint8]) and text_offsets (np.ndarray[uint32],
length n_hits + 1). Use materialize_strings(d, "text") per
record.
batch_search
¶
Batch structured search with Arrow-shape columnar results.
Returns one dict per pattern, each with positions/starts/ends
(np.ndarray[uint64]) and Arrow string columns
matched_bytes/matched_offsets,
before_bytes/before_offsets,
after_bytes/after_offsets. Works with every construction path.
find
¶
Like str.find: first codepoint offset of pattern, or -1 if absent.
findall
¶
Sorted list of every position of pattern. Plain Python ints.
finditer
¶
finditer(pattern: str | bytes, *, eager: bool = False) -> Iterator[Match]
Iterator of Match objects, lazy by default.
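"Lazy by default" means no hit is located until the consumer asks for it. A plain-Python sketch of that iteration contract (the real finditer yields Match objects backed by the index, not tuples):

```python
from typing import Iterator

def lazy_finditer(text: str, pattern: str) -> Iterator[tuple[int, int]]:
    # Yield (start, end) spans one at a time; nothing past the current
    # hit is computed until next() is called again.
    pos = text.find(pattern)
    while pos != -1:
        yield pos, pos + len(pattern)
        pos = text.find(pattern, pos + 1)

hits = lazy_finditer("the cat sat on the mat", "at")
print(next(hits))   # (5, 7) -- only the first hit has been computed
print(list(hits))   # [(9, 11), (20, 22)] -- the rest, on demand
```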
kwic
¶
Key-word-in-context DataFrame (pandas default; framework='polars' supported).
summary
¶
Per-pattern prevalence DataFrame: total_count, segment_count, segment_fraction.
incremental
¶
incremental() -> IncrementalSearch
Create a new forkable incremental match state rooted at this index.
set_license_key
staticmethod
¶
Install a license for the current process. Required before any
.save() / .load() / .from_bytes() / .to_bytes() call on
an encrypted .parot artifact.
Duplicate phrase detection¶
Find repeated passages in a single document or across a batch of documents.
These functions return columnar dicts with numpy arrays and Arrow-style
string columns — see Concepts → Columnar Arrow Format for
the layout.
find_duplicates
¶
find_duplicates(text: str | bytes, sanitized_text: Optional[str | bytes] = None, removed_ranges: Optional[list[tuple[int, int]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> dict
find_duplicates_normalized
¶
find_duplicates_normalized(text: str | bytes, case_insensitive: bool = False, collapse_whitespace: bool = True, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict
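The normalization the *_normalized entry points apply can be pictured as case folding plus whitespace-run collapsing. A sketch under that assumption (the library's exact folding rules may differ):

```python
import re

def normalize(text: str, case_insensitive: bool = False,
              collapse_whitespace: bool = True) -> str:
    # Assumed semantics: lowercase the text, then collapse every run of
    # whitespace to a single space.
    if case_insensitive:
        text = text.lower()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize("Hello\t\tWORLD\n", case_insensitive=True))  # 'hello world'
```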
find_duplicates_from_path
¶
find_duplicates_from_path(path: str, case_insensitive: Optional[bool] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, collapse_whitespace: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> dict
Memory-map a file and run the duplicate-detection pipeline against it.
Lower-overhead entry point for very large inputs: avoids reading the file
into a Python str object before passing it across the FFI boundary.
Returns the same columnar dict shape as find_duplicates.
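The memory-mapping trick is plain OS machinery: the file's bytes are exposed to the process without first copying them into a Python str. A self-contained demonstration using only the standard library (the search here is str.find-style, just to show the zero-copy access pattern):

```python
import mmap
import os
import tempfile

# Write a sample file, then scan it through a memory map.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"lorem ipsum dolor sit amet lorem ipsum")
    path = f.name

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first = mm.find(b"lorem ipsum")              # 0
    second = mm.find(b"lorem ipsum", first + 1)  # the repeated passage
    print(first, second)

os.unlink(path)
```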
batch_find_duplicates
¶
batch_find_duplicates(texts: list[str | bytes], sanitized_texts: Optional[list[str | bytes]] = None, removed_ranges_per_text: Optional[list[list[tuple[int, int]]]] = None, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None, collapse_whitespace: Optional[bool] = None) -> list[dict]
Parallelized version of find_duplicates over a list of texts.
Runs the full single-text duplicate-detection pipeline on each element of
texts independently, using rayon under the hood when there are 4 or more
inputs. Returns a list of columnar Arrow dicts, one per input text, each
with the same shape as find_duplicates.
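The batch shape is embarrassingly parallel: the per-text pipeline runs independently on each input and the outputs line up with the inputs. The same structure in pure Python, with a toy stand-in for the pipeline (counting repeated words rather than duplicated passages):

```python
from concurrent.futures import ThreadPoolExecutor

def count_repeated_words(text: str) -> int:
    # Toy per-text "pipeline": how many distinct words occur more than once.
    words = text.split()
    return sum(1 for w in set(words) if words.count(w) > 1)

texts = ["a b a c", "x y z", "q q q q"]

# One result per input, in input order -- the shape batch_* functions return.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(count_repeated_words, texts))
print(results)  # [1, 0, 1]
```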
batch_find_duplicates_normalized
¶
batch_find_duplicates_normalized(texts: list[str | bytes], case_insensitive: bool = False, collapse_whitespace: bool = False, min_words: Optional[int] = None, min_chars: Optional[int] = None, max_words: Optional[int] = None, min_words_in_substring: Optional[int] = None, enable_block_detection: Optional[bool] = None, clip_sentences: Optional[bool] = None, skip_dedup: Optional[bool] = None) -> list[dict]
Parallelized version of find_duplicates_normalized over a list of texts.
Cross-document similarity and shared passages¶
Score how much two or more documents share, and extract the matching passages themselves.
batch_text_similarity
¶
batch_common_passages
¶
Convenience primitives¶
Small helpers that build an ephemeral index per call. Prefer constructing an
Index and reusing it when you need more than one query.
unique_fragment_count
¶
Count unique fragments. Operates on UTF-16 code units for str, UTF-8 bytes for bytes.
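Assuming "fragments" means distinct non-empty substrings, the counting semantics can be checked against a naive set-based version. This sketch is quadratic in memory and only illustrative; the library counts in compressed-index space, and for str it counts positions in UTF-16 code units, which matters for characters outside the BMP:

```python
def unique_fragment_count_naive(text: str) -> int:
    # Enumerate every non-empty substring and count the distinct ones.
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})

# "abab" has 7 distinct substrings: a, b, ab, ba, aba, bab, abab
print(unique_fragment_count_naive("abab"))  # 7
```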
unique_fragment_count_bytes
¶
Count unique fragments operating on raw bytes.
DataFrame helpers¶
Convert columnar results into pandas/polars DataFrames for downstream analysis.
result_to_frame
¶
Convert an Arrow columnar dict to a pandas or polars DataFrame.
Auto-detects string columns (_bytes/_offsets pairs) and materializes them. Fixed-size numpy columns are kept as-is.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | dict | Columnar dict from search/extract/filter/grep. | required |
| framework | str | "pandas" (default) or "polars". | 'pandas' |
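The conversion rule is mechanical: every *_bytes/*_offsets pair becomes one materialized string column, and fixed-size numpy columns pass through unchanged. A minimal reimplementation for illustration (not the library's code):

```python
import numpy as np
import pandas as pd

def result_to_frame_sketch(result: dict) -> pd.DataFrame:
    # Pair up *_bytes/*_offsets keys into string columns; keep
    # fixed-size numpy columns as-is.
    cols = {}
    for key, value in result.items():
        if key.endswith("_offsets"):
            continue  # consumed together with its *_bytes partner
        if key.endswith("_bytes"):
            name = key[: -len("_bytes")]
            data, offsets = value, result[f"{name}_offsets"]
            cols[name] = [
                bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
                for i in range(len(offsets) - 1)
            ]
        else:
            cols[key] = value
    return pd.DataFrame(cols)

result = {
    "positions": np.array([3, 17], dtype=np.uint64),
    "matched_bytes": np.frombuffer(b"catcat", dtype=np.uint8),
    "matched_offsets": np.array([0, 3, 6], dtype=np.uint32),
}
df = result_to_frame_sketch(result)
print(df)  # two rows: positions [3, 17], matched ['cat', 'cat']
```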
result_to_strings
¶
Materialize one Arrow string column from a columnar dict into a list[str].
batch_result_to_frames
¶
Apply result_to_frame to each element of a batch result list.
batch_contains_frame
¶
batch_contains_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a DataFrame (n_segments x n_patterns) of boolean contains flags.
batch_count_frame
¶
batch_count_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a DataFrame (n_segments x n_patterns) of per-segment occurrence counts.
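The output shape is a segments-by-patterns count matrix. A naive stand-in built directly from segment strings, for illustration (str.count counts non-overlapping occurrences; the library counts against the index instead of rescanning):

```python
import pandas as pd

def batch_count_frame_sketch(segments: list[str],
                             patterns: list[str]) -> pd.DataFrame:
    # One row per segment, one column per pattern, cell = occurrence count.
    return pd.DataFrame(
        {p: [seg.count(p) for seg in segments] for p in patterns}
    )

segments = ["the cat sat", "dogs and cats", "no match here"]
df = batch_count_frame_sketch(segments, ["cat", "dog"])
print(df.shape)  # (3, 2)
```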
batch_summary_frame
¶
batch_summary_frame(idx: Index, patterns: Sequence[str], framework: str = 'pandas') -> Any
Return a summary DataFrame with one row per pattern.
Accelerated string type¶
fast_str
¶
Bases: str
str subclass with O(p) substring operations via a lazy Index.
See module docstring for the design rationale and the full list of accelerated methods.
idx
property
¶
idx: Index
The underlying Index, built on first access and cached.
Access this directly if you want the full Index API:
```python
text.idx.batch_count(["a", "b", "c"])
text.idx.search("Scrooge", context=100)
text.idx.save("dickens.parot")
```
memory_bytes
property
¶
Index heap usage in bytes, or None if the index isn't built yet.
build_index
¶
build_index() -> fast_str
Force the index build now (instead of waiting for first use).
Useful when you want predictable timings — e.g. before a hot loop.
Returns self for chaining.
findall
¶
Sorted list[int] of every occurrence of sub. (re.findall feel.)
finditer
¶
finditer(sub: str, *, eager: bool = False) -> Iterator[Match]
Iterator of Match objects, lazy by default. (re.finditer feel.)
kwic
¶
Key-word-in-context DataFrame. framework="pandas"|"polars".
summary
¶
Pattern prevalence DataFrame. framework="pandas"|"polars".
Incremental search¶
Forkable incremental pattern matcher on top of an Index. Obtain one via
Index.incremental(). Useful for building regex engines, bidirectional
search, and co-occurrence probes on top of Parot's search index.
IncrementalSearch
¶
Forkable incremental pattern matcher on top of an Index.
Constructed via Index.incremental(). Holds a strong reference to the
parent Index. Methods mutate in place (extend/extend_byte) or produce
an independent copy (fork). Indexes are static — this does NOT add new
documents to an existing index.
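The fork/extend contract can be sketched over a plain string: extend() narrows the current match state by one character in place, fork() produces an independent copy that can diverge. This toy version rescans on every query; the real IncrementalSearch extends against the compressed index without rescanning (the count() method here is part of the sketch, not a documented API):

```python
class IncrementalSketch:
    def __init__(self, text: str, pattern: str = ""):
        self.text, self.pattern = text, pattern

    def extend(self, ch: str) -> "IncrementalSketch":
        # Mutate in place, like extend/extend_byte.
        self.pattern += ch
        return self

    def fork(self) -> "IncrementalSketch":
        # Independent copy: extending the fork leaves the parent untouched.
        return IncrementalSketch(self.text, self.pattern)

    def count(self) -> int:
        # Sketch-only helper: non-overlapping occurrences of the state.
        return self.text.count(self.pattern) if self.pattern else 0

state = IncrementalSketch("banana").extend("a").extend("n")
branch = state.fork().extend("a")   # diverges to "ana"
print(state.pattern, state.count())   # an 2
print(branch.pattern, branch.count()) # ana 1 (non-overlapping count)
```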
Result types¶
Match
¶
Match(idx: Index, start: int, end: int, match: str)
A single Index hit. Lazy by default.
Attributes¶
start : int
Codepoint offset where the match begins in the source text.
end : int
Codepoint offset where the match ends (exclusive).
match : str
The matched substring (always equal to the pattern that produced this
Match — stored, not re-fetched).
span : tuple[int, int]
(start, end) — matches re.Match.span().
Methods¶
before(n) / after(n) : fetch n codepoints of context on either side.
expand(n) : return a KwicHit(before, match, after) triple, n codepoints
on each side. This is the on-demand replacement for re-running
search with a wider context window.
group(0) : return match (re.Match compatibility shim).
Notes¶
Match objects do not hold a strong reference to the source text — they hold the Index, which holds the source. Dropping the Index invalidates all outstanding Matches.
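The expand(n) semantics above can be sketched with plain string slicing: up to n codepoints of context on each side, clamped at the text boundaries (which is why len(before) <= n). KwicHitSketch stands in for the library's KwicHit:

```python
from typing import NamedTuple

class KwicHitSketch(NamedTuple):
    before: str
    match: str
    after: str

def expand(text: str, start: int, end: int, n: int) -> KwicHitSketch:
    # Clamp the left edge at 0; Python slicing clamps the right edge.
    return KwicHitSketch(
        before=text[max(0, start - n):start],
        match=text[start:end],
        after=text[end:end + n],
    )

hit = expand("Marley was dead, to begin with.", 11, 15, 5)
print(hit)  # KwicHitSketch(before=' was ', match='dead', after=', to ')
```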
KwicHit
¶
Bases: NamedTuple
Key-word-in-context: a single hit's (before, match, after) triple.
Returned by Match.expand(n). The before and after slices are
codepoint-bounded (so len(before) <= n and they're real Python strs you
can slice / display directly).