Utils
The conllu_tools.utils module provides utilities for working with morphological annotations, including XPOS format conversion, feature validation, and UPOS tag mapping. These tools are particularly useful for harmonizing annotations from different Latin treebanks (Perseus, PROIEL, ITTB, LLCT) into a common format.
Morphology Normalization
Normalizes XPOS and FEATS together, with automatic format detection and validation.
Quick Start
Normalizing morphological annotations:
from conllu_tools.utils.normalization import normalize_morphology
from conllu_tools.io import load_language_data
# Load feature set for Latin
feature_set = load_language_data('feats', language='la')
# Normalize XPOS and FEATS together
xpos, feats = normalize_morphology(
upos='VERB',
xpos='v-s-ga-g-',
feats='Aspect=Perf|Case=Gen|Degree=Pos|Number=Sing|Voice=Act',
feature_set=feature_set,
ref_features='VerbForm=Ger' # Missing feature added from reference
)
print(xpos)
# Output: 'v-stga-g-'
print(feats)
# Output: {'Aspect': 'Perf', 'Case': 'Gen', 'Degree': 'Pos', 'Number': 'Sing', 'VerbForm': 'Ger', 'Voice': 'Act'}
Basic Usage
from conllu_tools.utils.normalization import normalize_morphology
from conllu_tools.io import load_language_data
feature_set = load_language_data('feats', language='la')
# Basic normalization
xpos, feats = normalize_morphology(
upos='NOUN',
xpos='n-s---mn-',
feats='Case=Nom|Gender=Masc|Number=Sing',
feature_set=feature_set
)
print(xpos) # 'n-s---mn-' (validated)
print(feats) # {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}
With Reference Features
Use ref_features to fill in missing features from a reference source:
# Features are incomplete - missing NumForm
xpos, feats = normalize_morphology(
upos='NUM',
xpos='m-p---fa-',
feats='Case=Acc|Gender=Fem|Number=Plur',
feature_set=feature_set,
ref_features='NumForm=Word' # Will be added
)
print(feats)
# Output: {'Case': 'Acc', 'Gender': 'Fem', 'NumForm': 'Word', 'Number': 'Plur'}
What Gets Normalized
The normalizer performs these operations:
Formats XPOS: Auto-detects and converts LLCT, ITTB, PROIEL formats to Perseus
Validates XPOS: Checks each position against UPOS-specific validity rules
Reconciles features: Merges feats with ref_features (feats take precedence)
Validates features: Filters out features invalid for the given UPOS
Generates XPOS from features: Creates XPOS positions from validated features
Reconciles XPOS: Merges provided and generated XPOS (provided takes precedence)
Returns tuple: (normalized_xpos, validated_features)
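The two reconciliation steps can be pictured with a small standalone sketch. The helpers `parse_feats`, `reconcile_features`, and `reconcile_xpos` are hypothetical names, not part of conllu_tools; they only illustrate the precedence rules stated above (feats over ref_features, provided XPOS over generated XPOS), and the library's actual implementation may differ.

```python
def parse_feats(feats):
    """Parse a 'Key=Value|Key=Value' string into a dict (hypothetical helper)."""
    if not feats or feats == '_':
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))


def reconcile_features(feats, ref_features):
    """Merge two feature strings; values from `feats` win on conflict."""
    merged = parse_feats(ref_features)
    merged.update(parse_feats(feats))  # feats take precedence
    return dict(sorted(merged.items()))


def reconcile_xpos(provided, generated):
    """Merge two 9-character tags; a filled position in `provided` wins."""
    return ''.join(p if p != '-' else g for p, g in zip(provided, generated))


merged = reconcile_features('Case=Acc|Gender=Fem|Number=Plur', 'NumForm=Word|Case=Nom')
print(merged)
# {'Case': 'Acc', 'Gender': 'Fem', 'NumForm': 'Word', 'Number': 'Plur'}

# Position 4 is empty in the provided tag, so 't' is taken from the generated tag;
# position 5 keeps the provided 'g' even though the generated tag has 'd'.
print(reconcile_xpos('v-s-ga-g-', '--stda-g-'))
# v-stga-g-
```

Only gaps are filled in either direction: a value that is already present in the primary source is never overwritten by the secondary one.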
UPOS Utilities
Convert between project-specific POS tags and Universal POS (UPOS) tags.
DALME to UPOS
Convert DALME project tags to Universal POS:
from conllu_tools.utils.upos import dalme_to_upos
dalme_tag = 'coordinating conjunction'
upos_tag = dalme_to_upos(dalme_tag)
print(upos_tag) # 'CCONJ'
UPOS to Perseus
Convert a UPOS tag to a Perseus XPOS tag:
from conllu_tools.utils.upos import upos_to_perseus
upos = "NOUN"
perseus_tag = upos_to_perseus(upos)
print(perseus_tag) # 'n'
Feature Utilities
Convert Features to XPOS
The features_to_xpos function creates Perseus XPOS positions from a feature string or dictionary:
from conllu_tools.utils.features import features_to_xpos
# Generate XPOS from features
xpos = features_to_xpos('Case=Nom|Gender=Masc|Number=Sing')
print(xpos) # '--s---mn-' (positions 3, 7, 8 filled)
xpos = features_to_xpos('Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act')
print(xpos) # '-3spia---' (verb positions filled)
# Works with dictionaries too
xpos = features_to_xpos({'Case': 'Acc', 'Gender': 'Fem', 'Number': 'Plur'})
print(xpos) # '--p---fa-' (positions 3, 7, 8 filled)
xpos = features_to_xpos({'Degree': 'Sup'})
print(xpos) # '--------s' (position 9 filled - superlative adjective)
Feature-to-Position Mapping:
The function uses the FEATS_TO_XPOS mapping constant:
# Example mappings (feature, value) → (position, character)
FEATS_TO_XPOS = {
('Person', '1'): (2, '1'), # First person
('Person', '2'): (2, '2'), # Second person
('Person', '3'): (2, '3'), # Third person
('Number', 'Sing'): (3, 's'), # Singular
('Number', 'Plur'): (3, 'p'), # Plural
('Aspect', 'Imp'): (4, 'i'), # Imperfect
('Aspect', 'Perf'): (4, 't'), # Future Perfect
('Tense', 'Pres'): (4, 'p'), # Present
('Tense', 'Past'): (4, 'r'), # Perfect
('Tense', 'Pqp'): (4, 'l'), # Pluperfect
('Tense', 'Fut'): (4, 'f'), # Future
('VerbForm', 'Inf'): (5, 'n'), # Infinitive
('VerbForm', 'Part'): (5, 'p'), # Participle
('VerbForm', 'Ger'): (5, 'd'), # Gerund
('VerbForm', 'Gdv'): (5, 'g'), # Gerundive
('VerbForm', 'Sup'): (5, 'u'), # Supine
('Mood', 'Ind'): (5, 'i'), # Indicative
('Mood', 'Sub'): (5, 's'), # Subjunctive
('Mood', 'Imp'): (5, 'm'), # Imperative
('Voice', 'Act'): (6, 'a'), # Active
('Voice', 'Pass'): (6, 'p'), # Passive
('VerbType', 'Deponent'): (6, 'd'), # Deponent
('Gender', 'Fem'): (7, 'f'), # Feminine
('Gender', 'Masc'): (7, 'm'), # Masculine
('Gender', 'Neut'): (7, 'n'), # Neuter
('Case', 'Abl'): (8, 'b'), # Ablative
('Case', 'Acc'): (8, 'a'), # Accusative
('Case', 'Dat'): (8, 'd'), # Dative
('Case', 'Gen'): (8, 'g'), # Genitive
('Case', 'Nom'): (8, 'n'), # Nominative
('Case', 'Voc'): (8, 'v'), # Vocative
('Case', 'Loc'): (8, 'l'), # Locative
('Case', 'Ins'): (8, 'i'), # Instrumental
('Degree', 'Cmp'): (9, 'c'), # Comparative
('Degree', 'Pos'): (9, 'p'), # Positive
('Degree', 'Sup'): (9, 's'), # Superlative
('Degree', 'Abs'): (9, 'a'), # Absolute
}
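As a rough illustration of how a mapping of this shape can be applied, the sketch below fills a 9-character tag from a feature dictionary. `FEATS_TO_XPOS_EXCERPT` and `fill_positions` are hypothetical names used only for this example; the excerpt copies a few entries from the mapping above, and the library's own function may handle more cases.

```python
# A few entries copied from the mapping above: (feature, value) -> (position, character)
FEATS_TO_XPOS_EXCERPT = {
    ('Number', 'Sing'): (3, 's'),
    ('Number', 'Plur'): (3, 'p'),
    ('Gender', 'Fem'): (7, 'f'),
    ('Gender', 'Masc'): (7, 'm'),
    ('Case', 'Nom'): (8, 'n'),
    ('Case', 'Acc'): (8, 'a'),
    ('Degree', 'Sup'): (9, 's'),
}


def fill_positions(feats):
    """Build a 9-character tag from a feature dict (hypothetical helper)."""
    tag = ['-'] * 9
    for (name, value), (position, char) in FEATS_TO_XPOS_EXCERPT.items():
        if feats.get(name) == value:
            tag[position - 1] = char  # positions are 1-based
    return ''.join(tag)


print(fill_positions({'Case': 'Acc', 'Gender': 'Fem', 'Number': 'Plur'}))
# --p---fa-
print(fill_positions({'Degree': 'Sup'}))
# --------s
```

Position 1 stays empty here because it encodes the part of speech, which cannot be derived from the features alone.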
Use cases:
Generating XPOS when only morphological features are available
Filling missing XPOS positions from features
Creating initial tags for new annotations
Component of the normalize_morphology function
Convert XPOS to Features
The xpos_to_features function extracts morphological features from a Perseus-format XPOS string:
from conllu_tools.utils.features import xpos_to_features
# Extract features from verb XPOS
feats = xpos_to_features('v3spia---')
print(feats)
# Output: {'Person': '3', 'Number': 'Sing', 'Tense': 'Pres', 'Mood': 'Ind', 'Voice': 'Act'}
# Extract features from noun XPOS
feats = xpos_to_features('n-s---mn-')
print(feats)
# Output: {'Number': 'Sing', 'Gender': 'Masc', 'Case': 'Nom'}
# Extract features from adjective XPOS
feats = xpos_to_features('a-p---fgs')
print(feats)
# Output: {'Number': 'Plur', 'Gender': 'Fem', 'Case': 'Gen', 'Degree': 'Sup'}
# Positions with '-' are skipped
feats = xpos_to_features('---------')
print(feats)
# Output: {}
Position-to-Feature Mapping:
The function uses the XPOS_TO_FEATS mapping constant (inverse of FEATS_TO_XPOS):
# Example mappings (position, character) → (feature, value)
XPOS_TO_FEATS = {
(2, '1'): ('Person', '1'), # First person
(2, '2'): ('Person', '2'), # Second person
(2, '3'): ('Person', '3'), # Third person
(3, 's'): ('Number', 'Sing'), # Singular
(3, 'p'): ('Number', 'Plur'), # Plural
(4, 'p'): ('Tense', 'Pres'), # Present
(4, 'r'): ('Tense', 'Past'), # Perfect
(4, 'i'): ('Aspect', 'Imp'), # Imperfect
(5, 'i'): ('Mood', 'Ind'), # Indicative
(5, 's'): ('Mood', 'Sub'), # Subjunctive
(5, 'n'): ('VerbForm', 'Inf'), # Infinitive
(5, 'p'): ('VerbForm', 'Part'), # Participle
(6, 'a'): ('Voice', 'Act'), # Active
(6, 'p'): ('Voice', 'Pass'), # Passive
(7, 'm'): ('Gender', 'Masc'), # Masculine
(7, 'f'): ('Gender', 'Fem'), # Feminine
(7, 'n'): ('Gender', 'Neut'), # Neuter
(8, 'n'): ('Case', 'Nom'), # Nominative
(8, 'g'): ('Case', 'Gen'), # Genitive
(8, 'a'): ('Case', 'Acc'), # Accusative
(9, 'p'): ('Degree', 'Pos'), # Positive
(9, 'c'): ('Degree', 'Cmp'), # Comparative
(9, 's'): ('Degree', 'Sup'), # Superlative
# ... and more
}
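The inverse relationship between the two mappings, and the consistency check mentioned among the use cases, can be sketched as follows. The names `XPOS_TO_FEATS_EXCERPT`, `read_positions`, and `find_conflicts` are hypothetical and exist only in this example; the excerpt covers a handful of entries, not the full constant.

```python
# A few entries in the (feature, value) -> (position, character) direction
FEATS_TO_XPOS_EXCERPT = {
    ('Person', '3'): (2, '3'),
    ('Number', 'Sing'): (3, 's'),
    ('Tense', 'Pres'): (4, 'p'),
    ('Mood', 'Ind'): (5, 'i'),
    ('Voice', 'Act'): (6, 'a'),
    ('Voice', 'Pass'): (6, 'p'),
}

# Swapping keys and values gives (position, character) -> (feature, value).
# This only works because every (position, character) pair is unique.
XPOS_TO_FEATS_EXCERPT = {
    target: source for source, target in FEATS_TO_XPOS_EXCERPT.items()
}


def read_positions(xpos):
    """Collect the features encoded in a positional tag (hypothetical helper)."""
    feats = {}
    for position, char in enumerate(xpos, start=1):
        pair = XPOS_TO_FEATS_EXCERPT.get((position, char))
        if pair is not None:  # '-' and unmapped characters are skipped
            name, value = pair
            feats[name] = value
    return feats


def find_conflicts(xpos, feats):
    """Return {feature: (value_from_xpos, value_from_feats)} where they disagree."""
    from_xpos = read_positions(xpos)
    return {
        name: (from_xpos[name], feats[name])
        for name in from_xpos
        if name in feats and from_xpos[name] != feats[name]
    }


print(read_positions('v3spia---'))
# {'Person': '3', 'Number': 'Sing', 'Tense': 'Pres', 'Mood': 'Ind', 'Voice': 'Act'}
print(find_conflicts('v3spia---', {'Mood': 'Ind', 'Voice': 'Pass'}))
# {'Voice': ('Act', 'Pass')}
```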
Use cases:
Extracting features from XPOS when FEATS column is empty or incomplete
Verifying consistency between XPOS and FEATS
Converting legacy annotations that only have XPOS
Component of the normalize_morphology function
Validate Features
Filter features to ensure they are valid for a given UPOS:
from conllu_tools.utils.features import validate_features
from conllu_tools.io import load_language_data
# Load feature set
feature_set = load_language_data('feats', language='la')
# Validate features - invalid ones will be filtered out
validated = validate_features(
upos='NOUN',
feats='Case=Nom|Gender=Fem|Number=Sing|Mood=Ind', # Mood invalid for NOUN
feature_set=feature_set
)
print(validated)
# Output: {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
# Note: Mood=Ind removed (not valid for NOUN)
# Works with dictionaries too
validated = validate_features(
upos='VERB',
feats={'Mood': 'Ind', 'Case': 'Nom', 'Tense': 'Pres'}, # Case invalid for most verbs
feature_set=feature_set
)
print(validated)
# Output: {'Mood': 'Ind', 'Tense': 'Pres'}
# Note: Case=Nom removed (typically invalid for finite verbs)
What it does:
Checks if each feature is valid for the given UPOS
Removes features marked as invalid (0) in the feature set
Removes unknown features not in the feature set
Normalizes feature names (case-insensitive matching)
Returns only valid features as a dictionary
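The filtering steps listed above can be sketched with a toy feature set. Everything here is an assumption made for illustration: the nested layout of `TOY_FEATURE_SET` (feature, then UPOS, then value mapped to 1 or 0) and the helper name `filter_features` are hypothetical, and the data returned by load_language_data may be organized differently.

```python
# Toy feature set (assumed layout): feature -> UPOS -> value -> 1 (valid) or 0 (invalid)
TOY_FEATURE_SET = {
    'Case': {'NOUN': {'Nom': 1, 'Acc': 1}, 'VERB': {'Nom': 0, 'Acc': 0}},
    'Gender': {'NOUN': {'Fem': 1, 'Masc': 1}},
    'Number': {'NOUN': {'Sing': 1, 'Plur': 1}, 'VERB': {'Sing': 1, 'Plur': 1}},
    'Mood': {'VERB': {'Ind': 1, 'Sub': 1}},
}


def filter_features(upos, feats, feature_set):
    """Keep only the features flagged as valid (1) for `upos` (hypothetical helper)."""
    # Map lowercased names to their canonical spelling for case-insensitive matching
    canonical = {name.lower(): name for name in feature_set}
    kept = {}
    for name, value in feats.items():
        known = canonical.get(name.lower())
        if known is None:
            continue  # unknown feature: dropped
        # Missing UPOS or value entries count as invalid, the same as an explicit 0
        if feature_set[known].get(upos, {}).get(value, 0) == 1:
            kept[known] = value
    return kept


noun_feats = {'case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing', 'Mood': 'Ind'}
print(filter_features('NOUN', noun_feats, TOY_FEATURE_SET))
# {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```

The lowercase 'case' key is returned under its canonical name 'Case', and 'Mood' is dropped because the toy data has no NOUN entry for it.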
Use cases:
Pre-validation before file-level validation
Checking which features are compatible with a UPOS
Cleaning annotations during conversion
Component of the normalize_morphology function
XPOS Utilities
Format XPOS
The format_xpos function automatically detects the input XPOS format and converts it to the Perseus standard:
from conllu_tools.utils.xpos import format_xpos
# LLCT format (pipe-separated, 10 parts)
xpos = format_xpos(
upos='VERB',
xpos='v|v|3|s|p|i|a|-|-|-',
feats='Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act'
)
print(xpos) # 'v3spia---'
# ITTB format (pipe-separated features like 'gen4|tem1|mod1')
xpos = format_xpos(
upos='VERB',
xpos='gen4|tem1|mod1',
feats='Mood=Ind|Tense=Pres|Voice=Pass'
)
print(xpos) # 'v--pip---'
# PROIEL format (minimal codes, relies on FEATS)
xpos = format_xpos(
upos='NOUN',
xpos='Nb',
feats='Case=Acc|Gender=Neut|Number=Sing'
)
print(xpos) # 'n-s---na-'
# Perseus format (already correct, just validates UPOS)
xpos = format_xpos(
upos='NOUN',
xpos='a-s---fn-', # Wrong first character
feats='Case=Nom|Gender=Fem|Number=Sing'
)
print(xpos) # 'n-s---fn-' (UPOS character corrected)
# Unknown/None - generates default
xpos = format_xpos(
upos='ADJ',
xpos=None,
feats='Case=Nom|Gender=Masc|Number=Sing'
)
print(xpos) # 'a--------' (default for ADJ)
Format Detection:
The function uses regex patterns to detect input format:
PERSEUS_XPOS_MATCHER: [nvapmdcrileugt-]{9} (9-character Perseus format)
LLCT_XPOS_MATCHER: Pipe-separated 10-part format
ITTB_XPOS_MATCHER: Pipe-separated with feature codes (e.g., gen4|tem1)
PROIEL_XPOS_MATCHER: Single or double character codes
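Regex-based format detection along these lines can be sketched as below. The patterns are simplified stand-ins written for this example, not the library's actual matcher constants, and `detect_format` is a hypothetical helper name. The order of the checks matters in this sketch, because a two-character PROIEL code would otherwise also satisfy the looser ITTB pattern.

```python
import re

# Simplified stand-in patterns (assumptions); the library's matchers may differ.
PATTERNS = [
    # 9 positional characters, e.g. 'v3spia---'
    ('perseus', re.compile(r'[a-z-][a-z0-9-]{8}')),
    # 10 one-character parts separated by pipes, e.g. 'v|v|3|s|p|i|a|-|-|-'
    ('llct', re.compile(r'[a-z0-9-](?:\|[a-z0-9-]){9}')),
    # one or two characters starting with a capital, e.g. 'Nb'
    ('proiel', re.compile(r'[A-Z][a-z-]?')),
    # pipe-separated alphanumeric codes, e.g. 'gen4|tem1|mod1'
    ('ittb', re.compile(r'[A-Za-z0-9]{2,}(?:\|[A-Za-z0-9]{2,})*')),
]


def detect_format(xpos):
    """Return the name of the first pattern that matches the whole tag."""
    if not xpos:
        return 'unknown'
    for name, pattern in PATTERNS:
        if pattern.fullmatch(xpos):
            return name
    return 'unknown'


for tag in ['v3spia---', 'v|v|3|s|p|i|a|-|-|-', 'Nb', 'gen4|tem1|mod1', None]:
    print(tag, '->', detect_format(tag))
```

Using `fullmatch` rather than `match` keeps a pattern from accepting a tag that merely starts with a valid prefix.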
Use cases:
Harmonizing annotations from different treebanks
Converting legacy annotations to standard format
Processing mixed-format corpora
Component of the normalize_morphology function
Validate XPOS
Ensure XPOS positions are valid for the given UPOS:
from conllu_tools.utils.xpos import validate_xpos
# Remove invalid positions for UPOS
validated = validate_xpos(upos='NOUN', xpos='n1s---mn-')
print(validated)
# Output: 'n-s---mn-' (position 2 cleared - only valid for verbs)
# Validate verb XPOS
validated = validate_xpos(upos='VERB', xpos='v3spia---')
print(validated)
# Output: 'v3spia---' (all positions valid for verbs)
# Handle short or malformed XPOS - returns default with UPOS code
validated = validate_xpos(upos='ADJ', xpos='A')
print(validated)
# Output: 'a--------' (returns default since length != 9)
# Handle None XPOS
validated = validate_xpos(upos='NOUN', xpos=None)
print(validated)
# Output: 'n--------' (default for NOUN)
Position validity rules (Perseus format):
Position 1: Should match UPOS (set by caller or format_xpos)
Position 2: Only valid for ‘v’ (verbs)
Position 3: Valid for n, v, a, p, m (nouns, verbs, adjectives, pronouns, numerals)
Positions 4-6: Only valid for ‘v’ (verbs)
Positions 7-8: Valid for n, v, a, p, m
Position 9: Only valid for ‘a’ (adjectives)
What it does:
Validates each position (2-9) against UPOS-specific rules
Replaces invalid positions with ‘-’
Returns default XPOS if input is None or not exactly 9 characters
Returns validated Perseus-format XPOS string
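The position rules and the behaviour listed above can be expressed as a lookup table, as in the following sketch. `VALID_AT_POSITION`, `UPOS_CODES`, and `clear_invalid_positions` are hypothetical names for this example, and the UPOS-to-character subset is an assumption limited to the three tags used in the examples; the library's own tables may be larger.

```python
# Perseus part-of-speech characters allowed to carry a value at each position (2-9),
# transcribed from the rules listed above
VALID_AT_POSITION = {
    2: 'v',
    3: 'nvapm',
    4: 'v',
    5: 'v',
    6: 'v',
    7: 'nvapm',
    8: 'nvapm',
    9: 'a',
}

# Assumed UPOS -> Perseus character codes (illustrative subset only)
UPOS_CODES = {'NOUN': 'n', 'VERB': 'v', 'ADJ': 'a'}


def clear_invalid_positions(upos, xpos):
    """Blank out positions 2-9 that the rules do not allow for `upos`."""
    code = UPOS_CODES.get(upos, '-')
    if xpos is None or len(xpos) != 9:
        return code + '-' * 8  # default tag for missing or malformed input
    chars = [xpos[0]]  # position 1 is left to the caller
    for position in range(2, 10):
        char = xpos[position - 1]
        chars.append(char if code in VALID_AT_POSITION[position] else '-')
    return ''.join(chars)


print(clear_invalid_positions('NOUN', 'n1s---mn-'))  # n-s---mn-
print(clear_invalid_positions('VERB', 'v3spia---'))  # v3spia---
print(clear_invalid_positions('ADJ', 'A'))           # a--------
print(clear_invalid_positions('NOUN', None))         # n--------
```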
Use cases:
Correcting UPOS/XPOS mismatches
Validating positional tag structure
Cleaning imported tags
Component of the normalize_morphology function
Low-level Conversion Tools
The XPOS converters normalize language-specific POS tags from different treebanks to a common Perseus-style format.
These are used by the higher level functions described above.
Supported Treebanks
PROIEL: e.g., Pp
ITTB: e.g., J3|modJ|tem3|gen6
LLCT: e.g., v|v|1|s|r|i|a|-|-|-
Perseus: Target format (e.g., v3spia---)
ITTB to Perseus
The ITTB converter takes UPOS and XPOS:
from conllu_tools.utils.xpos import ittb_to_perseus
print(ittb_to_perseus('ADJ', 'gen2|casB|grp3'))
# Returns 'a-s---fgs'
print(ittb_to_perseus('ADJ', 'gen1|casA|grn2'))
# Returns 'a-s---mnc'
print(ittb_to_perseus('NOUN', 'gen1|casA'))
# Returns 'n-s---mn-'
PROIEL to Perseus
The PROIEL converter takes UPOS and FEATS:
from conllu_tools.utils.xpos import proiel_to_perseus
print(proiel_to_perseus('NOUN', 'Case=Nom|Gender=Masc|Number=Sing'))
# Returns 'n-s---mn-'
print(proiel_to_perseus('VERB', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Pass'))
# Returns 'v3spip---'
print(proiel_to_perseus('PRON', 'Case=Dat|Number=Sing|Person=1'))
# Returns 'p1s----d-'
LLCT to Perseus
LLCT uses a pipe-separated format that combines UPOS, XPOS, and FEATS information. The converter needs all three columns to generate standard XPOS:
from conllu_tools.utils.xpos import llct_to_perseus
# LLCT format: requires UPOS, XPOS (10-part pipe-separated), and FEATS
upos = 'VERB'
xpos = 'v|v|3|s|p|i|a|-|-|-' # POS|POS_repeat|person|number|tense|mood|voice|gender|case|degree
feats = 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act'
new_xpos = llct_to_perseus(upos, xpos, feats)
print(new_xpos) # "v3spia---"
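For the structural part of this conversion, the relationship between the 10-part LLCT tag and the 9-character Perseus tag can be sketched in a few lines. `collapse_llct` is a hypothetical helper for illustration only: it handles the layout described in the comment above and ignores UPOS and FEATS, which the real llct_to_perseus converter also takes into account.

```python
def collapse_llct(xpos):
    """Collapse a 10-part pipe-separated tag into a 9-character positional tag."""
    parts = xpos.split('|')
    if len(parts) != 10:
        raise ValueError(f'expected 10 pipe-separated parts, got {len(parts)}')
    # parts[1] repeats the part of speech, so it is dropped; the remaining
    # eight parts line up with Perseus positions 2-9
    return parts[0] + ''.join(parts[2:])


print(collapse_llct('v|v|3|s|p|i|a|-|-|-'))  # v3spia---
print(collapse_llct('v|v|1|s|r|i|a|-|-|-'))  # v1sria---
```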