Matching Module

The conllu_tools.matching module provides utilities for finding and matching complex linguistic patterns in CoNLL-U data. It supports pattern-based searching across parsed corpora using a flexible condition-based matching system.

Key Features:

  • Token-level pattern matching with conditions on any CoNLL-U field

  • Sentence-level pattern sequences for multi-token matching

  • Support for negation, substring matching, and value alternatives

  • Pattern building from simple string syntax

Quick Example

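import conllu

from conllu_tools.matching import build_pattern, find_in_corpus

# Load a corpus as a list of conllu.TokenList ("corpus.conllu" is a placeholder path)
with open("corpus.conllu", encoding="utf-8") as f:
    corpus = conllu.parse(f.read())

# Build a pattern to find adjective + noun sequences
pattern = build_pattern("ADJ+NOUN", name="adj_noun")

# Find all matches in the corpus
matches = find_in_corpus(corpus, [pattern])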

Utility Functions

build_pattern

conllu_tools.matching.build_pattern(pattern_str, name=None)

Build a SentencePattern from a pattern string.

See the documentation for a detailed explanation of the syntax for the pattern string.

Parameters:
  • pattern_str (str) – The pattern string to parse.

  • name (str | None) – Optional name for the pattern.

Returns:

The constructed SentencePattern instance.

Return type:

SentencePattern

find_in_corpus

conllu_tools.matching.find_in_corpus(corpus, patterns)

Find all matches of given patterns in the corpus.

Parameters:
  • corpus (list[conllu.TokenList]) – The corpus to search.

  • patterns (list[SentencePattern]) – The patterns to match.

Returns:

The list of all match results.

Return type:

list[MatchResult]
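Because the return value is a flat list, a common next step is to summarise it. The sketch below tallies matches per pattern and lists the sentences that contain a hit. It relies only on the documented pattern_name and sentence_id fields. To stay runnable without a corpus it uses stand-in objects rather than real MatchResult instances, and the pattern name "det_noun" and the sentence IDs are made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for MatchResult objects (hypothetical data); only the two
# documented fields used below are filled in.
results = [
    SimpleNamespace(pattern_name="adj_noun", sentence_id="s1"),
    SimpleNamespace(pattern_name="adj_noun", sentence_id="s2"),
    SimpleNamespace(pattern_name="det_noun", sentence_id="s2"),
]

# How many matches did each pattern produce?
per_pattern = Counter(r.pattern_name for r in results)

# Which sentences contain at least one match?
sentences_with_hits = sorted({r.sentence_id for r in results})

print(per_pattern)
print(sentences_with_hits)
```

The same two expressions should work unchanged on the list returned by find_in_corpus, since they only read attributes that MatchResult documents.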

Pattern Classes

SentencePattern

class conllu_tools.matching.SentencePattern(pattern, name=None)

Bases: object

Represents a sequence of TokenPattern objects to be matched, in order, against the tokens of a sentence.

name

The name of the SentencePattern.

Type:

str

pattern

The list of TokenPatterns to match in sequence.

Type:

list[TokenPattern]

__init__(pattern, name=None)

Initialize the SentencePattern.

reset()

Reset the matching state.

Return type:

None

match(sentence)

Match the pattern against the given sentence.

Uses a backtracking algorithm: when a partial match fails, matching restarts from the token immediately after the one where the failed attempt began, so that no possible match is skipped.

Return type:

list[MatchResult]
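The restart rule described above can be illustrated with a deliberately simplified sketch. This is not the library's code. It assumes each pattern element matches exactly one token, compares plain UPOS strings, and also restarts one position later after a successful match (so overlapping matches would be reported). The function name find_sequences is hypothetical.

```python
def find_sequences(tags, pattern):
    """Return (start, end) index pairs where `pattern` occurs in `tags`."""
    if not pattern:
        return []
    spans = []
    start = 0
    while start + len(pattern) <= len(tags):
        offset = 0
        # Extend the partial match one token at a time.
        while offset < len(pattern) and tags[start + offset] == pattern[offset]:
            offset += 1
        if offset == len(pattern):
            spans.append((start, start + offset))
        # Whether the attempt succeeded or failed, resume one position after
        # where it started, so no candidate start is skipped.
        start += 1
    return spans


tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADJ", "NOUN"]
spans = find_sequences(tags, ["ADJ", "NOUN"])
print(spans)
```

In this input the attempt starting at index 1 fails on its second token (ADJ followed by ADJ). Restarting at index 2 rather than index 3 is what lets the ADJ NOUN pair at indices 2 and 3 be found.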

explain()

Provide a string explanation of the SentencePattern.

Return type:

str

__repr__()

Return a string representation of the SentencePattern.

Return type:

str

__str__()

Return a string description of the SentencePattern.

Return type:

str

TokenPattern

class conllu_tools.matching.TokenPattern(conditions=None, negate=False, count=None, min_count=None, max_count=None)

Bases: object

Represents a series of conditions to match in a token.

conditions

The list of conditions to match.

Type:

list[Condition]

negate

Whether to negate the match result.

Type:

bool

count

Exact number of times the pattern should match.

Type:

int

min_count

Minimum number of times the pattern should match.

Type:

int

max_count

Maximum number of times the pattern should match.

Type:

int

match_multiple

Whether the pattern can match multiple times.

Type:

bool

__init__(conditions=None, negate=False, count=None, min_count=None, max_count=None)

Initialize the TokenPattern.

test(target)

Test if a token meets the conditions in the pattern.

Return type:

bool

property is_satisfied: bool

Check if the minimum count is satisfied.

property is_exceeded: bool

Check if the maximum count is exceeded.
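One way to picture how the count bounds and these two properties could interact is the stand-alone sketch below. It is an illustration, not the library's implementation: the class name CountBounds, the matched counter, the defaults of 1, and the reading of max_count=None as "no upper limit" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountBounds:
    """Hypothetical helper that tracks how often a token pattern has matched."""

    min_count: int = 1
    max_count: Optional[int] = 1  # None is taken to mean "no upper limit" (assumption)
    matched: int = 0

    @property
    def is_satisfied(self) -> bool:
        # Enough consecutive tokens have matched to meet the minimum.
        return self.matched >= self.min_count

    @property
    def is_exceeded(self) -> bool:
        # More tokens have matched than the maximum allows.
        return self.max_count is not None and self.matched > self.max_count


bounds = CountBounds(min_count=1, max_count=3)
satisfied_before = bounds.is_satisfied

states = []
for _ in range(4):
    bounds.matched += 1
    states.append((bounds.matched, bounds.is_satisfied, bounds.is_exceeded))

print(satisfied_before)
print(states)
```

Under these assumptions a matcher could keep consuming tokens while is_exceeded is false, and move on to the next pattern element only once is_satisfied is true.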

property is_valid: bool

Check if the TokenPattern is properly configured.

explain()

Provide a string explanation of the TokenPattern.

Return type:

str

__repr__()

Return a string representation of the TokenPattern.

Return type:

str

__str__()

Return a string description of the TokenPattern.

Return type:

str

Condition

class conllu_tools.matching.Condition(key=None, values=None, match_type='equals', match_any=False, negate=False)

Bases: object

Represents a condition to be met by the properties of a token.

A Condition can also represent a container for other Conditions, allowing for nested logical structures.
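The container idea can be sketched with a small recursive stand-in. The class Cond below is hypothetical and is not the library's Condition. It assumes that a condition with no key is a container whose values are themselves conditions, that tokens behave like dictionaries keyed by CoNLL-U field name, and it supports only exact equality.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cond:
    """Hypothetical stand-in for a condition; leaf if `key` is set, else container."""

    key: Optional[str] = None
    values: list = field(default_factory=list)
    match_any: bool = False
    negate: bool = False

    def test(self, token: dict) -> bool:
        if self.key is None:
            # Container: every value is itself a Cond, tested against the token.
            hits = [child.test(token) for child in self.values]
        else:
            # Leaf: compare the token's field with each literal value.
            hits = [token.get(self.key) == value for value in self.values]
        result = any(hits) if self.match_any else all(hits)
        return not result if self.negate else result


# (upos is NOUN or PROPN) AND NOT (lemma is "res")
noun_like = Cond(key="upos", values=["NOUN", "PROPN"], match_any=True)
not_res = Cond(key="lemma", values=["res"], negate=True)
both = Cond(values=[noun_like, not_res])  # container: all children must hold

token_a = {"form": "rem", "lemma": "res", "upos": "NOUN"}
token_b = {"form": "Caesar", "lemma": "Caesar", "upos": "PROPN"}

print(both.test(token_a))
print(both.test(token_b))
```

Nesting in this way lets "any of" and "all of" groups be mixed freely, which a single flat list of values cannot express.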

key

The token attribute key to check.

Type:

str | None

values

The values or nested Conditions to match against.

Type:

list[str | Condition]

match_type

The type of match to perform ('equals', 'contains', 'startswith', or 'endswith').

Type:

str

match_any

Whether to match any of the values (True) or all (False).

Type:

bool

negate

Whether to negate the result of the condition.

Type:

bool
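The interplay of match_type, match_any, and negate can be illustrated with a plain function over strings. This is a sketch only: the function name check is hypothetical, and the meaning of each match type is inferred from its name (for example, 'contains' as a substring test), which the library may implement differently.

```python
# Comparison used for each match type (semantics assumed from the names).
_MATCHERS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: expected in actual,
    "startswith": lambda actual, expected: actual.startswith(expected),
    "endswith": lambda actual, expected: actual.endswith(expected),
}


def check(actual, values, match_type="equals", match_any=False, negate=False):
    """Test one field value against a list of expected values."""
    compare = _MATCHERS[match_type]
    hits = [compare(actual, expected) for expected in values]
    # match_any=True needs one hit; match_any=False needs every value to hit.
    result = any(hits) if match_any else all(hits)
    # negate flips the combined result, not the individual comparisons.
    return not result if negate else result


print(check("NOUN", ["NOUN", "PROPN"], match_any=True))
print(check("NOUN", ["NOUN", "PROPN"]))
print(check("amabilis", ["ama", "bil"], match_type="contains"))
print(check("VERB", ["NOUN"], negate=True))
```

With 'equals', requiring all of several different values can never succeed for a single field, so value alternatives are only useful together with match_any=True. The substring-style match types are where the "all" mode becomes meaningful.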

__init__(key=None, values=None, match_type='equals', match_any=False, negate=False)

Initialize the Condition.

test(target)

Test if the passed token meets the condition.

Return type:

bool

explain()

Provide a string explanation of the Condition.

Return type:

str

property is_valid: bool

Check if the Condition is properly configured.

property is_container: bool

Check if this Condition is a container for other Conditions.

__repr__()

Return a string representation of the Condition.

Return type:

str

__str__()

Return a string description of the Condition.

Return type:

str

Result Classes

MatchResult

class conllu_tools.matching.MatchResult(pattern_name, sentence_id, tokens)

Bases: object

Stores a single match: the name of the pattern that matched, the ID of the sentence it was found in, and the matched tokens.

pattern_name: str

sentence_id: str

tokens: list[Token]

property substring: str

Return the matched substring.

property lemmata: list[str]

Return the lemmata of the matched tokens.

property forms: list[str]

Return the forms of the matched tokens.
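The three properties can be pictured as simple projections of the token list. The sketch below uses a hypothetical stand-in class, Match, rather than the real MatchResult. It assumes tokens are dictionary-like with "form" and "lemma" keys (as conllu tokens are) and that substring joins the forms with single spaces, which may not be how the library reconstructs the text.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in mirroring the documented MatchResult fields."""

    pattern_name: str
    sentence_id: str
    tokens: list

    @property
    def forms(self) -> list:
        # Surface forms, in sentence order.
        return [token["form"] for token in self.tokens]

    @property
    def lemmata(self) -> list:
        # Dictionary forms, in sentence order.
        return [token["lemma"] for token in self.tokens]

    @property
    def substring(self) -> str:
        # Assumption: forms joined by a single space.
        return " ".join(self.forms)


match = Match(
    pattern_name="adj_noun",
    sentence_id="s1",
    tokens=[
        {"form": "bonam", "lemma": "bonus", "upos": "ADJ"},
        {"form": "famam", "lemma": "fama", "upos": "NOUN"},
    ],
)

print(match.forms)
print(match.lemmata)
print(match.substring)
```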

__repr__()

Return a representation of the MatchResult.

Return type:

str

__str__()

Return a string representation of the MatchResult.

Return type:

str

__init__(pattern_name, sentence_id, tokens)