Utils

The conllu_tools.utils module provides utilities for working with morphological annotations, including XPOS format conversion, feature validation, and UPOS tag mapping. These tools are particularly useful for harmonizing annotations from different Latin treebanks (Perseus, PROIEL, ITTB, LLCT) into a common format.

Morphology Normalization

The normalize_morphology function normalizes XPOS and FEATS together, with automatic format detection and validation.

Quick Start

Normalizing morphological annotations:

from conllu_tools.utils.normalization import normalize_morphology
from conllu_tools.io import load_language_data

# Load feature set for Latin
feature_set = load_language_data('feats', language='la')

# Normalize XPOS and FEATS together
xpos, feats = normalize_morphology(
    upos='VERB',
    xpos='v-s-ga-g-',
    feats='Aspect=Perf|Case=Gen|Degree=Pos|Number=Sing|Voice=Act',
    feature_set=feature_set,
    ref_features='VerbForm=Ger'  # Missing feature added from reference
)

print(xpos)
# Output: 'v-stga-g-'

print(feats)
# Output: {'Aspect': 'Perf', 'Case': 'Gen', 'Degree': 'Pos', 'Number': 'Sing', 'VerbForm': 'Ger', 'Voice': 'Act'}

Basic Usage

from conllu_tools.utils.normalization import normalize_morphology
from conllu_tools.io import load_language_data

feature_set = load_language_data('feats', language='la')

# Basic normalization
xpos, feats = normalize_morphology(
    upos='NOUN',
    xpos='n-s---mn-',
    feats='Case=Nom|Gender=Masc|Number=Sing',
    feature_set=feature_set
)

print(xpos)  # 'n-s---mn-' (validated)
print(feats)  # {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}

With Reference Features

Use ref_features to fill in missing features from a reference source:

# Features are incomplete - missing NumForm
xpos, feats = normalize_morphology(
    upos='NUM',
    xpos='m-p---fa-',
    feats='Case=Acc|Gender=Fem|Number=Plur',
    feature_set=feature_set,
    ref_features='NumForm=Word'  # Will be added
)

print(feats)
# Output: {'Case': 'Acc', 'Gender': 'Fem', 'NumForm': 'Word', 'Number': 'Plur'}

What Gets Normalized

The normalizer performs these operations:

  1. Formats XPOS: Auto-detects and converts LLCT, ITTB, and PROIEL formats to Perseus

  2. Validates XPOS: Checks each position against UPOS-specific validity rules

  3. Reconciles features: Merges feats with ref_features (feats take precedence)

  4. Validates features: Filters out features invalid for the given UPOS

  5. Generates XPOS from features: Creates XPOS positions from validated features

  6. Reconciles XPOS: Merges provided and generated XPOS (provided takes precedence)

  7. Returns tuple: (normalized_xpos, validated_features)
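
The two reconciliation steps (3 and 6) can be pictured with a minimal, self-contained sketch. The helper names parse_feats, reconcile_features, and reconcile_xpos are hypothetical and are not part of conllu_tools; the sketch only illustrates the stated precedence rules (feats over ref_features, provided XPOS over generated XPOS) and may differ from the library's internals.

```python
def parse_feats(feats):
    """Parse a 'Key=Value|Key=Value' string (or copy a dict) into a dict."""
    if isinstance(feats, dict):
        return dict(feats)
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def reconcile_features(feats, ref_features=None):
    """Merge two feature sets; values from `feats` win on conflict."""
    merged = parse_feats(ref_features)
    merged.update(parse_feats(feats))  # feats overwrite reference values
    return dict(sorted(merged.items()))


def reconcile_xpos(provided, generated):
    """Merge two 9-character tags position by position; `provided` wins."""
    return "".join(p if p != "-" else g for p, g in zip(provided, generated))


# Case appears in both inputs: the value from feats (Acc) is kept
print(reconcile_features("Case=Acc|Gender=Fem|Number=Plur", "Case=Nom|NumForm=Word"))
# {'Case': 'Acc', 'Gender': 'Fem', 'NumForm': 'Word', 'Number': 'Plur'}

# Position 4 is empty in the provided tag, so it is filled from the generated one;
# position 5 keeps the provided 'g' even though the generated tag has 'd'
print(reconcile_xpos("v-s-ga-g-", "--stda-g-"))
# v-stga-g-
```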

UPOS Utilities

Convert between Universal POS tags and project- or treebank-specific POS tags.

DALME to UPOS

Convert DALME project tags to Universal POS:

from conllu_tools.utils.upos import dalme_to_upos

dalme_tag = 'coordinating conjunction'
upos_tag = dalme_to_upos(dalme_tag)
print(upos_tag)  # 'CCONJ'

UPOS to Perseus

Convert a UPOS tag to a Perseus XPOS tag:

from conllu_tools.utils.upos import upos_to_perseus

upos = "NOUN"
perseus_tag = upos_to_perseus(upos)
print(perseus_tag)  # 'n'

Feature Utilities

Convert Features to XPOS

The features_to_xpos function fills Perseus XPOS positions from a feature string or dictionary:

from conllu_tools.utils.features import features_to_xpos

# Generate XPOS from features
xpos = features_to_xpos('Case=Nom|Gender=Masc|Number=Sing')
print(xpos)  # '--s---mn-' (positions 3, 7, 8 filled)

xpos = features_to_xpos('Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act')
print(xpos)  # '-3spia---' (verb positions filled)

# Works with dictionaries too
xpos = features_to_xpos({'Case': 'Acc', 'Gender': 'Fem', 'Number': 'Plur'})
print(xpos)  # '--p---fa-' (positions 3, 7, 8 filled)

xpos = features_to_xpos({'Degree': 'Sup'})
print(xpos)  # '--------s' (position 9 filled - superlative adjective)

Feature-to-Position Mapping:

The function uses the FEATS_TO_XPOS mapping constant:

# Example mappings (feature, value) → (position, character)
FEATS_TO_XPOS = {
    ('Person', '1'): (2, '1'),  # First person
    ('Person', '2'): (2, '2'),  # Second person
    ('Person', '3'): (2, '3'),  # Third person
    ('Number', 'Sing'): (3, 's'),  # Singular
    ('Number', 'Plur'): (3, 'p'),  # Plural
    ('Aspect', 'Imp'): (4, 'i'),  # Imperfect
    ('Aspect', 'Perf'): (4, 't'),  # Future Perfect
    ('Tense', 'Pres'): (4, 'p'),  # Present
    ('Tense', 'Past'): (4, 'r'),  # Perfect
    ('Tense', 'Pqp'): (4, 'l'),  # Pluperfect
    ('Tense', 'Fut'): (4, 'f'),  # Future
    ('VerbForm', 'Inf'): (5, 'n'),  # Infinitive
    ('VerbForm', 'Part'): (5, 'p'),  # Participle
    ('VerbForm', 'Ger'): (5, 'd'),  # Gerund
    ('VerbForm', 'Gdv'): (5, 'g'),  # Gerundive
    ('VerbForm', 'Sup'): (5, 'u'),  # Supine
    ('Mood', 'Ind'): (5, 'i'),  # Indicative
    ('Mood', 'Sub'): (5, 's'),  # Subjunctive
    ('Mood', 'Imp'): (5, 'm'),  # Imperative
    ('Voice', 'Act'): (6, 'a'),  # Active
    ('Voice', 'Pass'): (6, 'p'),  # Passive
    ('VerbType', 'Deponent'): (6, 'd'),  # Deponent
    ('Gender', 'Fem'): (7, 'f'),  # Feminine
    ('Gender', 'Masc'): (7, 'm'),  # Masculine
    ('Gender', 'Neut'): (7, 'n'),  # Neuter
    ('Case', 'Abl'): (8, 'b'),  # Ablative
    ('Case', 'Acc'): (8, 'a'),  # Accusative
    ('Case', 'Dat'): (8, 'd'),  # Dative
    ('Case', 'Gen'): (8, 'g'),  # Genitive
    ('Case', 'Nom'): (8, 'n'),  # Nominative
    ('Case', 'Voc'): (8, 'v'),  # Vocative
    ('Case', 'Loc'): (8, 'l'),  # Locative
    ('Case', 'Ins'): (8, 'i'),  # Instrumental
    ('Degree', 'Cmp'): (9, 'c'),  # Comparative
    ('Degree', 'Pos'): (9, 'p'),  # Positive
    ('Degree', 'Sup'): (9, 's'),  # Superlative
    ('Degree', 'Abs'): (9, 'a'),  # Absolute
}
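
To show how a mapping of this shape can drive tag generation, here is a minimal sketch that works on a small excerpt of the table. FEATS_TO_XPOS_EXCERPT and sketch_features_to_xpos are hypothetical names used for illustration; the real features_to_xpos may handle parsing and edge cases differently.

```python
# Excerpt of the mapping above: (feature, value) -> (1-based position, character)
FEATS_TO_XPOS_EXCERPT = {
    ("Person", "3"): (2, "3"),
    ("Number", "Sing"): (3, "s"),
    ("Tense", "Pres"): (4, "p"),
    ("Mood", "Ind"): (5, "i"),
    ("Voice", "Act"): (6, "a"),
    ("Degree", "Sup"): (9, "s"),
}


def sketch_features_to_xpos(feats):
    """Fill a 9-character tag of '-' placeholders from a feature dict."""
    slots = ["-"] * 9
    for feature, value in feats.items():
        target = FEATS_TO_XPOS_EXCERPT.get((feature, value))
        if target is None:
            continue  # no positional slot for this feature in the excerpt
        position, char = target
        slots[position - 1] = char  # convert the 1-based position to a list index
    return "".join(slots)


verb_feats = {"Mood": "Ind", "Number": "Sing", "Person": "3", "Tense": "Pres", "Voice": "Act"}
print(sketch_features_to_xpos(verb_feats))         # -3spia---
print(sketch_features_to_xpos({"Degree": "Sup"}))  # --------s
```

Position 1 stays '-' in this sketch because none of the mapped features determines the part-of-speech character.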

Use cases:

  • Generating XPOS when only morphological features are available

  • Filling missing XPOS positions from features

  • Creating initial tags for new annotations

  • Component of normalize_morphology function

Convert XPOS to Features

The xpos_to_features function extracts morphological features from a Perseus-format XPOS string:

from conllu_tools.utils.features import xpos_to_features

# Extract features from verb XPOS
feats = xpos_to_features('v3spia---')
print(feats)
# Output: {'Person': '3', 'Number': 'Sing', 'Tense': 'Pres', 'Mood': 'Ind', 'Voice': 'Act'}

# Extract features from noun XPOS
feats = xpos_to_features('n-s---mn-')
print(feats)
# Output: {'Number': 'Sing', 'Gender': 'Masc', 'Case': 'Nom'}

# Extract features from adjective XPOS
feats = xpos_to_features('a-p---fgs')
print(feats)
# Output: {'Number': 'Plur', 'Gender': 'Fem', 'Case': 'Gen', 'Degree': 'Sup'}

# Positions with '-' are skipped
feats = xpos_to_features('---------')
print(feats)
# Output: {}

Position-to-Feature Mapping:

The function uses the XPOS_TO_FEATS mapping constant (inverse of FEATS_TO_XPOS):

# Example mappings (position, character) → (feature, value)
XPOS_TO_FEATS = {
    (2, '1'): ('Person', '1'),   # First person
    (2, '2'): ('Person', '2'),   # Second person
    (2, '3'): ('Person', '3'),   # Third person
    (3, 's'): ('Number', 'Sing'), # Singular
    (3, 'p'): ('Number', 'Plur'), # Plural
    (4, 'p'): ('Tense', 'Pres'),  # Present
    (4, 'r'): ('Tense', 'Past'),  # Perfect
    (4, 'i'): ('Aspect', 'Imp'),  # Imperfect
    (5, 'i'): ('Mood', 'Ind'),    # Indicative
    (5, 's'): ('Mood', 'Sub'),    # Subjunctive
    (5, 'n'): ('VerbForm', 'Inf'), # Infinitive
    (5, 'p'): ('VerbForm', 'Part'), # Participle
    (6, 'a'): ('Voice', 'Act'),   # Active
    (6, 'p'): ('Voice', 'Pass'),  # Passive
    (7, 'm'): ('Gender', 'Masc'), # Masculine
    (7, 'f'): ('Gender', 'Fem'),  # Feminine
    (7, 'n'): ('Gender', 'Neut'), # Neuter
    (8, 'n'): ('Case', 'Nom'),    # Nominative
    (8, 'g'): ('Case', 'Gen'),    # Genitive
    (8, 'a'): ('Case', 'Acc'),    # Accusative
    (9, 'p'): ('Degree', 'Pos'),  # Positive
    (9, 'c'): ('Degree', 'Cmp'),  # Comparative
    (9, 's'): ('Degree', 'Sup'),  # Superlative
    # ... and more
}
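
Because every (position, character) pair in the forward table is unique, the inverse table can be derived by swapping keys and values. The sketch below does this for a small excerpt and then reads a tag position by position. The names ending in _EXCERPT and sketch_xpos_to_features are hypothetical, and this is an illustration under those assumptions rather than the library's implementation.

```python
# Excerpt of the forward mapping: (feature, value) -> (1-based position, character)
FEATS_TO_XPOS_EXCERPT = {
    ("Number", "Sing"): (3, "s"),
    ("Number", "Plur"): (3, "p"),
    ("Gender", "Masc"): (7, "m"),
    ("Gender", "Fem"): (7, "f"),
    ("Case", "Nom"): (8, "n"),
    ("Case", "Gen"): (8, "g"),
    ("Degree", "Sup"): (9, "s"),
}

# Swap keys and values; this is safe only because the targets are unique
XPOS_TO_FEATS_EXCERPT = {target: source for source, target in FEATS_TO_XPOS_EXCERPT.items()}


def sketch_xpos_to_features(xpos):
    """Read features from positions 2-9 of a Perseus-style tag."""
    feats = {}
    for position, char in enumerate(xpos, start=1):
        if position == 1 or char == "-":
            continue  # position 1 is the POS code; '-' means unspecified
        pair = XPOS_TO_FEATS_EXCERPT.get((position, char))
        if pair is not None:
            feature, value = pair
            feats[feature] = value
    return feats


print(sketch_xpos_to_features("n-s---mn-"))
# {'Number': 'Sing', 'Gender': 'Masc', 'Case': 'Nom'}
print(sketch_xpos_to_features("---------"))
# {}
```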

Use cases:

  • Extracting features from XPOS when FEATS column is empty or incomplete

  • Verifying consistency between XPOS and FEATS

  • Converting legacy annotations that only have XPOS

  • Component of normalize_morphology function

Validate Features

Filter features to ensure they are valid for a given UPOS:

from conllu_tools.utils.features import validate_features
from conllu_tools.io import load_language_data

# Load feature set
feature_set = load_language_data('feats', language='la')

# Validate features - invalid ones will be filtered out
validated = validate_features(
    upos='NOUN',
    feats='Case=Nom|Gender=Fem|Number=Sing|Mood=Ind',  # Mood invalid for NOUN
    feature_set=feature_set
)

print(validated)
# Output: {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
# Note: Mood=Ind removed (not valid for NOUN)

# Works with dictionaries too
validated = validate_features(
    upos='VERB',
    feats={'Mood': 'Ind', 'Case': 'Nom', 'Tense': 'Pres'},  # Case invalid for most verbs
    feature_set=feature_set
)

print(validated)
# Output: {'Mood': 'Ind', 'Tense': 'Pres'}
# Note: Case=Nom removed (typically invalid for finite verbs)

What it does:

  • Checks if each feature is valid for the given UPOS

  • Removes features marked as invalid (0) in the feature set

  • Removes unknown features not in the feature set

  • Normalizes feature names (case-insensitive matching)

  • Returns only valid features as a dictionary
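
The filtering logic can be sketched against a toy feature set. The nested layout used here (feature, then UPOS, then value mapped to 1 or 0) is an assumption made for illustration, as are the names TOY_FEATURE_SET and sketch_validate_features; the data returned by load_language_data may be organized differently.

```python
# Hypothetical layout: feature -> UPOS -> value -> 1 (valid) or 0 (invalid)
TOY_FEATURE_SET = {
    "Case": {"NOUN": {"Nom": 1, "Gen": 1}, "VERB": {"Nom": 0, "Gen": 0}},
    "Number": {"NOUN": {"Sing": 1, "Plur": 1}, "VERB": {"Sing": 1, "Plur": 1}},
    "Mood": {"NOUN": {"Ind": 0}, "VERB": {"Ind": 1, "Sub": 1}},
}


def sketch_validate_features(upos, feats, feature_set):
    """Keep only the features that the toy feature set marks as valid for `upos`."""
    # Map lowercased names to canonical spellings for case-insensitive matching
    canonical = {name.lower(): name for name in feature_set}
    valid = {}
    for name, value in feats.items():
        key = canonical.get(name.lower())
        if key is None:
            continue  # unknown feature: dropped
        if feature_set[key].get(upos, {}).get(value, 0) == 1:
            valid[key] = value  # stored under the canonical feature name
    return valid


noisy = {"case": "Nom", "Number": "Sing", "Mood": "Ind", "Foo": "Bar"}
print(sketch_validate_features("NOUN", noisy, TOY_FEATURE_SET))
# {'Case': 'Nom', 'Number': 'Sing'}
```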

Use cases:

  • Pre-validation before file-level validation

  • Checking which features are compatible with a UPOS

  • Cleaning annotations during conversion

  • Component of normalize_morphology function

XPOS Utilities

Format XPOS

The format_xpos function auto-detects the input XPOS format and converts it to the Perseus standard:

from conllu_tools.utils.xpos import format_xpos

# LLCT format (pipe-separated, 10 parts)
xpos = format_xpos(
    upos='VERB',
    xpos='v|v|3|s|p|i|a|-|-|-',
    feats='Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act'
)
print(xpos)  # 'v3spia---'

# ITTB format (pipe-separated features like 'gen4|tem1|mod1')
xpos = format_xpos(
    upos='VERB',
    xpos='gen4|tem1|mod1',
    feats='Mood=Ind|Tense=Pres|Voice=Pass'
)
print(xpos)  # 'v--pip---'

# PROIEL format (minimal codes, relies on FEATS)
xpos = format_xpos(
    upos='NOUN',
    xpos='Nb',
    feats='Case=Acc|Gender=Neut|Number=Sing'
)
print(xpos)  # 'n-s---na-'

# Perseus format (kept as-is; only the first character is aligned with UPOS)
xpos = format_xpos(
    upos='NOUN',
    xpos='a-s---fn-',  # Wrong first character
    feats='Case=Nom|Gender=Fem|Number=Sing'
)
print(xpos)  # 'n-s---fn-' (UPOS character corrected)

# Unknown/None - generates default
xpos = format_xpos(
    upos='ADJ',
    xpos=None,
    feats='Case=Nom|Gender=Masc|Number=Sing'
)
print(xpos)  # 'a--------' (default for ADJ)

Format Detection:

The function uses regex patterns to detect input format:

  • PERSEUS_XPOS_MATCHER: [nvapmdcrileugt-]{9} - 9-character Perseus format

  • LLCT_XPOS_MATCHER: Pipe-separated 10-part format

  • ITTB_XPOS_MATCHER: Pipe-separated with feature codes (e.g., gen4|tem1)

  • PROIEL_XPOS_MATCHER: Single or double character codes
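
Regex-based detection of this kind can be sketched as an ordered list of patterns tried in turn. The patterns and the detect_xpos_format helper below are illustrative assumptions written for this example; they are not the library's matcher constants and are likely looser or stricter than the real ones.

```python
import re

# Illustrative patterns only, checked in order; the first full match wins
PATTERNS = [
    # Nine characters, e.g. 'v3spia---'
    ("perseus", re.compile(r"[a-z0-9-]{9}")),
    # Ten single-character parts separated by pipes, e.g. 'v|v|3|s|p|i|a|-|-|-'
    ("llct", re.compile(r"[a-z0-9-](?:\|[a-z0-9-]){9}")),
    # Optional leading code, then four-character feature codes, e.g. 'gen4|tem1|mod1'
    ("ittb", re.compile(r"(?:[A-Z][0-9]\|)?[a-z]{3}[A-Za-z0-9](?:\|[a-z]{3}[A-Za-z0-9])*")),
    # One or two characters, e.g. 'Nb'
    ("proiel", re.compile(r"[A-Z][a-z-]?")),
]


def detect_xpos_format(xpos):
    """Return the name of the first pattern that matches the whole tag."""
    if not xpos:
        return "unknown"
    for name, pattern in PATTERNS:
        if pattern.fullmatch(xpos):
            return name
    return "unknown"


for tag in ["v3spia---", "v|v|3|s|p|i|a|-|-|-", "gen4|tem1|mod1", "Nb", None]:
    print(tag, "->", detect_xpos_format(tag))
```

Using fullmatch rather than search keeps a short PROIEL-style code from matching as a prefix of a longer tag.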

Use cases:

  • Harmonizing annotations from different treebanks

  • Converting legacy annotations to standard format

  • Processing mixed-format corpora

  • Component of normalize_morphology function

Validate XPOS

Ensure XPOS positions are valid for the given UPOS:

from conllu_tools.utils.xpos import validate_xpos

# Remove invalid positions for UPOS
validated = validate_xpos(upos='NOUN', xpos='n1s---mn-')
print(validated)
# Output: 'n-s---mn-' (position 2 cleared - only valid for verbs)

# Validate verb XPOS
validated = validate_xpos(upos='VERB', xpos='v3spia---')
print(validated)
# Output: 'v3spia---' (all positions valid for verbs)

# Handle short or malformed XPOS - returns default with UPOS code
validated = validate_xpos(upos='ADJ', xpos='A')
print(validated)
# Output: 'a--------' (returns default since length != 9)

# Handle None XPOS
validated = validate_xpos(upos='NOUN', xpos=None)
print(validated)
# Output: 'n--------' (default for NOUN)

Position validity rules (Perseus format):

  • Position 1: Should match UPOS (set by caller or format_xpos)

  • Position 2: Only valid for ‘v’ (verbs)

  • Position 3: Valid for n, v, a, p, m (nouns, verbs, adjectives, pronouns, numerals)

  • Positions 4-6: Only valid for ‘v’ (verbs)

  • Positions 7-8: Valid for n, v, a, p, m

  • Position 9: Only valid for ‘a’ (adjectives)

What it does:

  • Validates each position (2-9) against UPOS-specific rules

  • Replaces invalid positions with ‘-’

  • Returns default XPOS if input is None or not exactly 9 characters

  • Returns validated Perseus-format XPOS string
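
The position rules listed above can be encoded as a small lookup table. In this sketch, VALID_FOR_POSITION and sketch_validate_xpos are hypothetical names, and the function takes the Perseus first-position code (such as 'n' or 'v') in place of a UPOS tag to avoid assuming a UPOS mapping; the real validate_xpos may differ in detail.

```python
# For each position 2-9, the first-position codes allowed to fill it
VALID_FOR_POSITION = {
    2: {"v"},
    3: {"n", "v", "a", "p", "m"},
    4: {"v"},
    5: {"v"},
    6: {"v"},
    7: {"n", "v", "a", "p", "m"},
    8: {"n", "v", "a", "p", "m"},
    9: {"a"},
}


def sketch_validate_xpos(pos_code, xpos):
    """Blank out positions that the rules do not allow for `pos_code`."""
    default = pos_code + "-" * 8
    if xpos is None or len(xpos) != 9:
        return default  # missing or malformed input falls back to the default tag
    cleaned = [pos_code]
    for position in range(2, 10):
        char = xpos[position - 1]
        allowed = pos_code in VALID_FOR_POSITION[position]
        cleaned.append(char if allowed else "-")
    return "".join(cleaned)


print(sketch_validate_xpos("n", "n1s---mn-"))  # n-s---mn-
print(sketch_validate_xpos("v", "v3spia---"))  # v3spia---
print(sketch_validate_xpos("a", "A"))          # a--------
print(sketch_validate_xpos("n", None))         # n--------
```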

Use cases:

  • Correcting UPOS/XPOS mismatches

  • Validating positional tag structure

  • Cleaning imported tags

  • Component of normalize_morphology function

Low-level Conversion Tools

The XPOS converters normalize language-specific POS tags from different treebanks to a common Perseus-style format.

These are used by the higher-level functions described above.

Supported Treebanks

  • PROIEL: e.g., Pp

  • ITTB: e.g., J3|modJ|tem3|gen6

  • LLCT: e.g., v|v|1|s|r|i|a|-|-|-

  • Perseus: Target format (e.g., v3spia---)

ITTB to Perseus

The ITTB converter takes UPOS and XPOS:

from conllu_tools.utils.xpos import ittb_to_perseus

print(ittb_to_perseus('ADJ', 'gen2|casB|grp3'))
# Returns 'a-s---fgs'

print(ittb_to_perseus('ADJ', 'gen1|casA|grn2'))
# Returns 'a-s---mnc'

print(ittb_to_perseus('NOUN', 'gen1|casA'))
# Returns 'n-s---mn-'

PROIEL to Perseus

The PROIEL converter takes UPOS and FEATS:

from conllu_tools.utils.xpos import proiel_to_perseus

print(proiel_to_perseus('NOUN', 'Case=Nom|Gender=Masc|Number=Sing'))
# Returns 'n-s---mn-'

print(proiel_to_perseus('VERB', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Pass'))
# Returns 'v3spip---'

print(proiel_to_perseus('PRON', 'Case=Dat|Number=Sing|Person=1'))
# Returns 'p1s----d-'

LLCT to Perseus

LLCT uses a pipe-separated format that combines UPOS, XPOS, and FEATS information. The converter needs all three columns to generate standard XPOS:

from conllu_tools.utils.xpos import llct_to_perseus

# LLCT format: requires UPOS, XPOS (10-part pipe-separated), and FEATS
upos = 'VERB'
xpos = 'v|v|3|s|p|i|a|-|-|-'  # POS|POS_repeat|person|number|tense|mood|voice|gender|case|degree
feats = 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act'

new_xpos = llct_to_perseus(upos, xpos, feats)
print(new_xpos)  # "v3spia---"
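
Given the field layout in the comment above (POS, repeated POS, then eight feature slots), the structural part of the conversion amounts to dropping the repeated POS field and joining the rest. The sketch_llct_collapse helper below is a hypothetical, simplified illustration of that step only; the real llct_to_perseus also consults UPOS and FEATS, which this sketch ignores.

```python
def sketch_llct_collapse(xpos):
    """Collapse a 10-part pipe-separated tag into a 9-character string."""
    parts = xpos.split("|")
    if len(parts) != 10:
        raise ValueError(f"expected 10 pipe-separated parts, got {len(parts)}")
    # Keep the first POS field, skip the repeated one, join the eight feature slots
    return parts[0] + "".join(parts[2:])


print(sketch_llct_collapse("v|v|3|s|p|i|a|-|-|-"))  # v3spia---
print(sketch_llct_collapse("v|v|1|s|r|i|a|-|-|-"))  # v1sria---
```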