Matching
This guide covers pattern matching for finding linguistic structures in CoNLL-U annotated corpora.
Overview
The matching module provides a pattern language for searching CoNLL-U data. You can find:
Tokens matching specific POS tags, lemmas, or morphological features
Sequential patterns of tokens (e.g., DET + ADJ + NOUN)
Complex linguistic constructions with quantifiers and negation
Matches based on substring patterns (contains, starts with, ends with)
Quick Start
Find all nouns followed by a verb in a corpus:
import conllu
from conllu_tools.matching import build_pattern, find_in_corpus
# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())
# Build and search for pattern
pattern = build_pattern('NOUN+VERB', name='noun-verb')
matches = find_in_corpus(corpus, [pattern])
# Print results
for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f"  Lemmata: {match.lemmata}")
Basic Usage
Building Patterns
The build_pattern() function parses a pattern string into a SentencePattern object:
from conllu_tools.matching import build_pattern
# Simple UPOS pattern
noun_pattern = build_pattern('NOUN')
# Pattern with conditions
ablative_noun = build_pattern('NOUN:feats=(Case=Abl)')
# Multi-token sequence
det_noun = build_pattern('DET+NOUN')
# Named pattern for identification
pattern = build_pattern('NOUN+VERB', name='subject-verb')
Searching a Corpus
Use find_in_corpus() to search for patterns across all sentences:
from conllu_tools.matching import build_pattern, find_in_corpus
# Create multiple patterns
patterns = [
    build_pattern('NOUN+VERB', name='noun-verb'),
    build_pattern('DET+NOUN', name='det-noun'),
]
# Search corpus
matches = find_in_corpus(corpus, patterns)
# Print each match alongside the name of the pattern that produced it
for match in matches:
    print(f"Pattern '{match.pattern_name}': {match.substring}")
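If you want the matches bucketed per pattern rather than printed in order, a dictionary keyed on pattern_name is enough. The sketch below uses a hypothetical stand-in class, FakeMatch, that carries only the pattern_name and substring attributes shown above, so it runs without a corpus. With real results you would pass the list returned by find_in_corpus() instead.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeMatch:
    """Hypothetical stand-in exposing two attributes of a real match result."""
    pattern_name: str
    substring: str


def group_by_pattern(matches):
    """Return {pattern_name: [substring, ...]}, keeping the original match order."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[match.pattern_name].append(match.substring)
    return dict(grouped)


# Invented demo data, not output from a real corpus
demo = [
    FakeMatch('det-noun', 'una scala'),
    FakeMatch('noun-verb', 'rex videt'),
    FakeMatch('det-noun', 'illa urbs'),
]
grouped = group_by_pattern(demo)
for name, substrings in grouped.items():
    print(f"{name}: {substrings}")
```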
Working with Match Results
Each MatchResult provides access to matched content:
for match in matches:
    # Pattern identification
    print(f"Pattern: {match.pattern_name}")
    print(f"Sentence: {match.sentence_id}")
    # Matched text
    print(f"Text: {match.substring}")  # "una scala"
    print(f"Forms: {match.forms}")  # ['una', 'scala']
    print(f"Lemmata: {match.lemmata}")  # ['unus', 'scalae']
    # Access individual tokens
    for token in match.tokens:
        print(f"  {token['form']} ({token['upos']})")
Pattern Syntax
Token Patterns
The basic structure of a token pattern is:
UPOS:attribute=value
The UPOS tag comes first, followed by optional attribute conditions separated by colons.
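To make the shape of a token pattern concrete, here is a minimal sketch that splits a pattern string into its UPOS part and its attribute conditions. The function split_token_pattern is a hypothetical helper written for this illustration and is not the library's parser. It assumes that no value contains a colon, so it would mis-split a dependency subtype such as nsubj:pass.

```python
def split_token_pattern(text):
    """Split 'UPOS:attr=value:attr=value' into (upos, {attr: value}).

    Illustration only: assumes values never contain ':'.
    """
    upos, *conditions = text.split(':')
    attributes = {}
    for condition in conditions:
        # Split on the first '=' only, so 'feats=(Case=Abl)' keeps its value whole
        key, _, value = condition.partition('=')
        attributes[key] = value
    return upos, attributes


print(split_token_pattern('NOUN'))
print(split_token_pattern('NOUN:feats=(Case=Abl)'))
print(split_token_pattern('VERB:form=est:deprel=root'))
```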
UPOS Matching
Match tokens by their universal part-of-speech tag:
# Single UPOS
pattern = build_pattern('NOUN')
# Multiple UPOS options (OR)
pattern = build_pattern('NOUN|VERB')
# Any UPOS (wildcard)
pattern = build_pattern('*')
Valid UPOS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
Attribute Conditions
Add conditions on token attributes using UD/CoNLL-U field names:
# Match by lemma
pattern = build_pattern('NOUN:lemma=rex')
# Match by form
pattern = build_pattern('VERB:form=est')
# Match by head
pattern = build_pattern('NOUN:head=0') # Root nouns
# Match by dependency relation
pattern = build_pattern('*:deprel=nsubj')
Available attributes: id, form, lemma, xpos, feats, head, deprel, deps, misc
Multiple Values (OR)
Use pipes to match any of several values:
# Lemma is either 'rex' or 'regina'
pattern = build_pattern('NOUN:lemma=rex|regina')
# UPOS is NOUN or PROPN
pattern = build_pattern('NOUN|PROPN')
Morphological Features
Match on the feats column using parentheses with comma-separated conditions:
# Singular ablative noun
pattern = build_pattern('NOUN:feats=(Number=Sing,Case=Abl)')
# Ablative or dative noun
pattern = build_pattern('NOUN:feats=(Case=Abl|Dat)')
# Multiple feature conditions
pattern = build_pattern('VERB:feats=(Mood=Sub,Tense=Pres)')
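Read from the examples above, a comma inside the parentheses requires every listed feature to match, while a pipe inside one value allows any of its alternatives. The sketch below spells out that reading in plain Python. Both parse_feats and feats_match are hypothetical helpers for illustration, not part of the matching module.

```python
def parse_feats(column):
    """Turn a CoNLL-U FEATS column into a dict; '_' means no features."""
    if column == '_':
        return {}
    return dict(pair.split('=', 1) for pair in column.split('|'))


def feats_match(feats, required):
    """True if every required feature is present with one of its allowed values."""
    return all(feats.get(name) in allowed for name, allowed in required.items())


# Equivalent in spirit to feats=(Number=Sing,Case=Abl|Dat)
required = {'Number': {'Sing'}, 'Case': {'Abl', 'Dat'}}

print(feats_match(parse_feats('Case=Abl|Gender=Fem|Number=Sing'), required))  # True
print(feats_match(parse_feats('Case=Nom|Gender=Fem|Number=Sing'), required))  # False
print(feats_match(parse_feats('_'), required))  # False
```

The last call shows why a token without features can never satisfy a feature condition under this reading: there is nothing to compare against.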
Match Type Modifiers
A leading < and/or a trailing > on a value changes how it is compared with the token's string:
# Exact match (default)
pattern = build_pattern('NOUN:form=a')
# Contains
pattern = build_pattern('NOUN:form=<ae>') # Form contains 'ae'
# Starts with
pattern = build_pattern('NOUN:form=<ab') # Form starts with 'ab'
# Ends with
pattern = build_pattern('NOUN:form=um>') # Form ends with 'um'
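The four cases above map onto ordinary string operations. The following sketch shows one way to interpret a raw value. The helpers parse_value and value_matches are hypothetical, and the label 'exact' for the default case is an assumption; the names 'contains', 'startswith' and 'endswith' follow the match_type values used later in this guide.

```python
def parse_value(raw):
    """Return (match_type, text) for a raw pattern value such as '<ae>'."""
    starts = raw.startswith('<')
    ends = raw.endswith('>')
    if starts and ends:
        return 'contains', raw[1:-1]
    if starts:
        return 'startswith', raw[1:]
    if ends:
        return 'endswith', raw[:-1]
    return 'exact', raw  # 'exact' is an assumed label for the default


def value_matches(raw, actual):
    """Compare a token string with a raw pattern value."""
    match_type, text = parse_value(raw)
    if match_type == 'contains':
        return text in actual
    if match_type == 'startswith':
        return actual.startswith(text)
    if match_type == 'endswith':
        return actual.endswith(text)
    return actual == text


for raw, form in [('<ae>', 'caelum'), ('<ab', 'abest'), ('um>', 'bellum'), ('a', 'ab')]:
    print(raw, form, value_matches(raw, form))
```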
Negation
Use ! to negate conditions:
# Not a noun
pattern = build_pattern('!NOUN')
# Noun not in singular
pattern = build_pattern('NOUN:feats=(Number=!Sing)')
# Form does not contain 'ae'
pattern = build_pattern('*:form=!<ae>')
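Negation simply inverts the result of the underlying test, which means a negated pattern matches every other kind of token, punctuation included. The sketch below illustrates this for UPOS patterns only. The helper upos_matches is hypothetical, and treating a leading ! as negating the whole list of alternatives is an assumption, not documented behavior.

```python
def upos_matches(pattern, upos):
    """Match a UPOS pattern with '*' wildcard, '|' alternatives and leading '!'."""
    negate = pattern.startswith('!')
    body = pattern[1:] if negate else pattern
    hit = body == '*' or upos in body.split('|')
    # Negation flips the outcome of the positive test
    return hit != negate


tags = ['DET', 'NOUN', 'PUNCT', 'VERB']
print([tag for tag in tags if upos_matches('!NOUN', tag)])  # ['DET', 'PUNCT', 'VERB']
print([tag for tag in tags if upos_matches('NOUN|VERB', tag)])  # ['NOUN', 'VERB']
```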
Quantifiers
Use regex-style quantifiers to match multiple consecutive tokens:
# Match exactly 2 adjectives
pattern = build_pattern('ADJ{2}')
# Match 0-3 tokens of any type (optional sequence)
pattern = build_pattern('*{0,3}')
# Match 1-5 adjectives
pattern = build_pattern('ADJ{1,5}')
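A quantifier can be thought of as a (minimum, maximum) pair attached to the token pattern before it, with an unquantified pattern meaning exactly one token. The sketch below extracts that pair with a regular expression. The function split_quantifier is a hypothetical helper and only handles the {n} and {min,max} forms shown above.

```python
import re

# Optional trailing {n} or {min,max}; everything before it is the base pattern
QUANTIFIER = re.compile(r'^(?P<base>.*?)(?:\{(?P<min>\d+)(?:,(?P<max>\d+))?\})?$')


def split_quantifier(text):
    """Return (base_pattern, min_count, max_count); the default is exactly one."""
    found = QUANTIFIER.match(text)
    base = found.group('base')
    if found.group('min') is None:
        return base, 1, 1
    low = int(found.group('min'))
    high = int(found.group('max')) if found.group('max') is not None else low
    return base, low, high


for text in ['NOUN', 'ADJ{2}', '*{0,3}', 'ADJ{1,5}']:
    print(text, split_quantifier(text))
```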
Sentence Patterns
Combine token patterns with + to match sequences:
# Determiner followed by noun
pattern = build_pattern('DET+NOUN')
# Preposition phrase: ADP + accusative noun
pattern = build_pattern('ADP+NOUN:feats=(Case=Acc)')
# Subject-verb-object with gaps
pattern = build_pattern('*:deprel=nsubj+*{0,10}+VERB:deprel=root+*{0,10}+*:deprel=obj')
# Noun phrase with optional adjectives
pattern = build_pattern('DET+ADJ{0,2}+NOUN')
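To see how a sequence with quantifiers can be matched against a sentence, the sketch below runs a small backtracking matcher over a list of UPOS tags. It is an illustration of the idea, not the library's algorithm: match_at and find_spans are hypothetical helpers, and the choices made here (greedy quantifiers, one reported span per start position) are assumptions that this guide does not confirm for the real matcher.

```python
def match_at(tags, start, pattern):
    """Return the end index of a match beginning at `start`, or None.

    `pattern` is a list of (upos, min_count, max_count); '*' matches any tag.
    Each element takes as many tags as it can, backing off if the rest fails.
    """
    if not pattern:
        return start
    (upos, low, high), rest = pattern[0], pattern[1:]
    # Count how many consecutive tags this element could consume
    count = 0
    while (count < high and start + count < len(tags)
           and (upos == '*' or tags[start + count] == upos)):
        count += 1
    # Try the longest run first, then progressively shorter ones
    for used in range(count, low - 1, -1):
        end = match_at(tags, start + used, rest)
        if end is not None:
            return end
    return None


def find_spans(tags, pattern):
    """Return (start, end) index pairs, one per start position that matches."""
    spans = []
    for start in range(len(tags)):
        end = match_at(tags, start, pattern)
        if end is not None and end > start:
            spans.append((start, end))
    return spans


# Equivalent in spirit to 'DET+ADJ{0,2}+NOUN'
det_adj_noun = [('DET', 1, 1), ('ADJ', 0, 2), ('NOUN', 1, 1)]
tags = ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'DET', 'NOUN']
print(find_spans(tags, det_adj_noun))  # [(0, 4), (5, 7)]
```

The second span shows the {0,2} quantifier matching zero adjectives, which is what makes the adjectives optional.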
Advanced Usage
Manual Pattern Construction
For patterns that are awkward to express as a string, construct the pattern objects directly:
from conllu_tools.matching import Condition, TokenPattern, SentencePattern
# Create conditions
case_nom = Condition(key='Case', values=['Nom'])
case_acc = Condition(key='Case', values=['Acc'])
feats_cond = Condition(key='feats', values=[case_nom, case_acc], match_any=True)
upos_cond = Condition(key='upos', values=['NOUN'])
# Create token pattern
noun_pattern = TokenPattern(conditions=[upos_cond, feats_cond])
# Create sentence pattern
pattern = SentencePattern(pattern=[noun_pattern], name='nom-or-acc-noun')
Condition Types
The Condition class supports various match types:
from conllu_tools.matching import Condition
# Exact match (default)
cond = Condition(key='lemma', values=['rex'])
# Contains
cond = Condition(key='form', values=['ae'], match_type='contains')
# Starts with
cond = Condition(key='form', values=['ab'], match_type='startswith')
# Ends with
cond = Condition(key='form', values=['um'], match_type='endswith')
# Negation
cond = Condition(key='upos', values=['NOUN'], negate=True)
# Match any of multiple values
cond = Condition(key='lemma', values=['rex', 'regina'], match_any=True)
Nested Conditions
For dictionary-type attributes like feats, use nested conditions:
from conllu_tools.matching import Condition
# Nested conditions for features
case_cond = Condition(key='Case', values=['Abl'])
number_cond = Condition(key='Number', values=['Sing'])
# All conditions must match
feats_all = Condition(key='feats', values=[case_cond, number_cond], match_any=False)
# Any condition matches
feats_any = Condition(key='feats', values=[case_cond, number_cond], match_any=True)
Pattern Explanation
Get a human-readable explanation of a pattern:
pattern = build_pattern('NOUN:feats=(Case=Abl,Number=Sing)+VERB')
print(pattern.explain())
# Output:
# This pattern matches a sequence of the following token patterns:
# Token Pattern 1: Matches a token when 'upos' equals NOUN and ...
# Token Pattern 2: Matches a token when 'upos' equals VERB
Matching Individual Sentences
Apply patterns to single sentences:
import conllu
from conllu_tools.matching import build_pattern
# CoNLL-U columns are tab-separated; "\t" keeps the tabs explicit
conllu_text = """# sent_id = example-1
# text = Rex magnam urbem videt.
1\tRex\trex\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t4\tnsubj\t_\t_
2\tmagnam\tmagnus\tADJ\t_\tCase=Acc|Gender=Fem|Number=Sing\t3\tamod\t_\t_
3\turbem\turbs\tNOUN\t_\tCase=Acc|Gender=Fem|Number=Sing\t4\tobj\t_\t_
4\tvidet\tvideo\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\t_
5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_
"""
sentence = conllu.parse(conllu_text)[0]
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = pattern.match(sentence)
for match in matches:
    print(f"Found: {match.substring}")  # "magnam urbem"
Common Issues
Pattern Not Matching
Problem: Pattern doesn’t find expected matches.
Solutions:
Check UPOS tags match your corpus annotation:
# Debug: print actual UPOS values
for token in sentence:
    print(f"{token['form']}: {token['upos']}")
Verify feature names and values:
# Debug: print actual features
for token in sentence:
    print(f"{token['form']}: {token['feats']}")
Test simpler patterns first:
# Start simple
pattern = build_pattern('NOUN') # Does this match?
# Then add conditions
pattern = build_pattern('NOUN:feats=(Case=Abl)')
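When printing token by token is too noisy, tallying the tags and feature values that actually occur can reveal a mismatch faster, for example a corpus that tags proper names as PROPN where the pattern expects NOUN. The sketch below uses a hand-written list of dictionaries as a stand-in for one parsed sentence, so it runs without a corpus file. The feats value None for the punctuation token mirrors how the conllu parser represents an empty (_) column.

```python
from collections import Counter

# Stand-in for one parsed sentence: tokens with dict-style access
sentence = [
    {'form': 'Rex', 'upos': 'NOUN', 'feats': {'Case': 'Nom', 'Number': 'Sing'}},
    {'form': 'magnam', 'upos': 'ADJ', 'feats': {'Case': 'Acc', 'Number': 'Sing'}},
    {'form': 'urbem', 'upos': 'NOUN', 'feats': {'Case': 'Acc', 'Number': 'Sing'}},
    {'form': 'videt', 'upos': 'VERB', 'feats': {'Mood': 'Ind', 'Number': 'Sing'}},
    {'form': '.', 'upos': 'PUNCT', 'feats': None},
]

# How often each UPOS tag occurs
upos_counts = Counter(token['upos'] for token in sentence)

# How often each Feature=Value pair occurs; `or {}` skips tokens without features
feature_counts = Counter(
    f"{name}={value}"
    for token in sentence
    for name, value in (token['feats'] or {}).items()
)

print(upos_counts.most_common())
print(feature_counts.most_common())
```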
Empty Feature Handling
Problem: Tokens with empty features (_) don’t match feature conditions.
Solution: Feature conditions only match when features exist:
# This won't match tokens with feats=_
pattern = build_pattern('NOUN:feats=(Case=Nom)')
# To find nouns regardless of their features, use alternation or a separate pattern without feature conditions
nouns = build_pattern('NOUN', name='all-nouns')
nom_nouns = build_pattern('NOUN:feats=(Case=Nom)', name='nominative-nouns')
Quantifier Behavior
Problem: Quantified patterns match more or fewer tokens than expected.
Solution: Quantifiers apply to the preceding token pattern only:
# Matches: DET + (0-2 ADJ) + NOUN
pattern = build_pattern('DET+ADJ{0,2}+NOUN')
# NOT: (DET + ADJ){0,2} + NOUN
# Each token pattern is separate
See Also
Loading - Loading CoNLL-U files
Validation - Validating CoNLL-U files
Matching Module - Detailed matching API