Quick Start Guide
This guide will get you up and running with CoNLL-U Tools in minutes.
Installation
First, install the package:
pip install conllu_tools
Verify the installation:
python -c "import conllu_tools; print('Ready to go!')"
Conversion
Let’s convert a CoNLL-U file to brat format for visual annotation.
CoNLL-U to brat
from conllu_tools.io import conllu_to_brat
# Convert CoNLL-U to brat format
conllu_to_brat(
    conllu_filename='my_corpus.conllu',
    output_directory='brat_annotations/',
    output_root=True,  # Show ROOT nodes
    sents_per_doc=10,  # 10 sentences per document
)
print("Conversion complete! Check brat_annotations directory")
What happens:
Creates .txt files with raw text
Creates .ann files with standoff annotations
Adds configuration files for brat visualization
Adds metadata.json with information about the conversion parameters (used by brat_to_conllu)
Splits long files into manageable documents
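To spot-check the result, you can list what the conversion wrote. A minimal sketch, assuming the brat_annotations/ directory created above:
from pathlib import Path
# List the generated text documents, standoff annotations, and metadata file
out = Path('brat_annotations/')
print(sorted(p.name for p in out.glob('*.txt')))   # raw text documents
print(sorted(p.name for p in out.glob('*.ann')))   # standoff annotations
print((out / 'metadata.json').exists())            # conversion parameters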
brat to CoNLL-U
After annotating in brat, convert back to CoNLL-U:
from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data
# Load feature set for validation
feature_set = load_language_data('feats', language='la')
# Convert back to CoNLL-U
brat_to_conllu(
    input_directory='brat_annotations/',
    output_directory='updated_conllu/',
    ref_conllu='my_corpus.conllu',  # Original for features
    feature_set=feature_set,
)
print("Converted back to CoNLL-U!")
Validation
Validate CoNLL-U files for format and linguistic correctness:
from conllu_tools import ConlluValidator
# Create validator
validator = ConlluValidator(lang='la', level=2)
# Run validation checks
reporter = validator.validate_file('path/to/yourfile.conllu')
# Print error count
print(f'Errors found: {reporter.get_error_count()}')
# Inspect first error
sent_id, order, testlevel, error = reporter.errors[0]
print(f'Sentence ID: {sent_id}') # e.g. 34
print(f'Testing at level: {testlevel}') # e.g. 2
print(f'Error test level: {error.testlevel}') # e.g. 1
print(f'Error type: {error.error_type}') # e.g. "Metadata"
print(f'Test ID: {error.testid}') # e.g. "text-mismatch"
print(f'Error message: {error.msg}') # Full error message (see below)
# Print all errors formatted as strings
for error in reporter.format_errors():
    print(error)
# Example output:
# Sentence 34:
# [L2 Metadata text-mismatch] The text attribute does not match the text
# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....'
# Reconstructed: 'Una scala ....' (first diff at position 9)
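For a long error list, a quick overview by error type can help. A small sketch using the reporter tuple layout shown above:
from collections import Counter
# Count errors per error_type across the whole report
counts = Counter(error.error_type for _, _, _, error in reporter.errors)
for error_type, n in counts.most_common():
    print(f'{error_type}: {n}')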
Evaluation
Evaluate parser output against gold standard:
from conllu_tools import ConlluEvaluator
# Compare gold standard with system output
evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')
scores = evaluator.evaluate_files(
    gold_path='path/to/gold_standard.conllu',
    system_path='path/to/parser_output.conllu',
)
# Print scores
print(f"Unlabeled Attachment Score (UAS): {scores['UAS']:.2f}%")
print(f"Labeled Attachment Score (LAS): {scores['LAS']:.2f}%")
Pattern Matching
Find linguistic patterns in your CoNLL-U corpus.
Basic Pattern Search
import conllu
from conllu_tools.matching import build_pattern, find_in_corpus
# Load corpus
with open('corpus.conllu', encoding='utf-8') as f:
    corpus = conllu.parse(f.read())
# Find all adjective + noun sequences
pattern = build_pattern('ADJ+NOUN', name='adj-noun')
matches = find_in_corpus(corpus, [pattern])
for match in matches:
    print(f"[{match.sentence_id}] {match.substring}")
    print(f" Forms: {match.forms}")
    print(f" Lemmata: {match.lemmata}")
Pattern Syntax Examples
# Match by UPOS
build_pattern('NOUN') # Any noun
build_pattern('NOUN|VERB') # Noun or verb
build_pattern('*') # Any token
# Match with conditions
build_pattern('NOUN:lemma=rex') # Noun with lemma 'rex'
build_pattern('NOUN:feats=(Case=Abl)') # Ablative noun
build_pattern('NOUN:feats=(Case=Nom,Number=Sing)') # Singular nominative
# Multi-token sequences
build_pattern('DET+NOUN') # Determiner + noun
build_pattern('ADP+NOUN:feats=(Case=Acc)') # Preposition + accusative noun
build_pattern('DET+ADJ{0,2}+NOUN') # Det + 0-2 adjectives + noun
# Negation and substring matching
build_pattern('!PUNCT') # Not punctuation
build_pattern('*:form=<ae>') # Form contains 'ae'
build_pattern('NOUN:form=um>') # Noun ending in 'um'
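Since find_in_corpus takes a list of patterns, several of the strings above can presumably be searched in one pass (a sketch reusing the corpus loaded in the basic example; the pattern names are illustrative):
patterns = [
    build_pattern('ADP+NOUN:feats=(Case=Acc)', name='prep-acc'),
    build_pattern('DET+ADJ{0,2}+NOUN', name='noun-phrase'),
]
for match in find_in_corpus(corpus, patterns):
    print(f"[{match.sentence_id}] {match.substring}")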
Utils
Utilities for tagset conversion, morphology normalization, and feature validation.
Normalize Morphology
The main utility function for harmonizing morphological annotations:
from conllu_tools.io import load_language_data
from conllu_tools.utils import normalize_morphology
feature_set = load_language_data('feats', language='la')
xpos, feats = normalize_morphology(
    upos='VERB',
    xpos='v|v|3|s|p|i|a|-|-|-',  # LLCT format (auto-detected)
    feats='Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act',
    feature_set=feature_set,
)
print(xpos) # 'v3spia---' (converted to Perseus format)
print(feats) # {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', ...}
What it does: Auto-detects XPOS format (LLCT, ITTB, PROIEL) → converts to Perseus → validates against UPOS → reconciles features.
Feature Conversion
from conllu_tools.utils import feature_string_to_dict, feature_dict_to_string
# String ↔ dictionary conversion
feat_dict = feature_string_to_dict("Case=Nom|Gender=Masc|Number=Sing")
# {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}
feat_str = feature_dict_to_string({'Number': 'Sing', 'Case': 'Gen'})
# 'Case=Gen|Number=Sing' (sorted alphabetically)
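The two helpers round-trip cleanly, which is handy for editing a single feature in place (a small sketch):
# Parse, modify one feature, and serialise back
feats = feature_string_to_dict('Case=Nom|Number=Sing')
feats['Case'] = 'Acc'
print(feature_dict_to_string(feats))  # 'Case=Acc|Number=Sing'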
XPOS Conversion
from conllu_tools.utils import format_xpos, upos_to_perseus
# Convert UPOS to Perseus code
upos_to_perseus('NOUN') # 'n'
# Auto-detect and convert any XPOS format to Perseus
format_xpos('VERB', 'v|v|3|s|p|i|a|-|-|-', feats) # LLCT → 'v3spia---'
format_xpos('NOUN', 'gen2|casA', feats) # ITTB → 'n-s---fn-'
format_xpos('NOUN', 'Nb', feats) # PROIEL → 'n-s---na-'
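The feats argument above is the token's morphological feature dictionary. A sketch building it with feature_string_to_dict from the previous section; the expected output matches the normalize_morphology example:
from conllu_tools.utils import feature_string_to_dict, format_xpos
# Build the feature dict, then convert the LLCT XPOS string
feats = feature_string_to_dict('Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act')
print(format_xpos('VERB', 'v|v|3|s|p|i|a|-|-|-', feats))  # expected: 'v3spia---'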
Next Steps
Now that you’ve seen the basics, dive deeper:
User Guides
Conversion - Detailed conversion workflows
Validation - Comprehensive validation guide
Evaluation - Advanced evaluation metrics
Matching - Pattern matching guide
Utils - Utility functions guide
Examples
I/O Examples - I/O examples
Validation Examples - Validation examples
Evaluation Examples - Evaluation examples
Normalization Examples - Normalization examples
API Reference
IO Module - I/O and conversion API
Validation Module - Validation API
Evaluation Module - Evaluation API
Matching Module - Pattern matching API
Utils Module - Utility functions API
Constants Module - Package constants
Need Help?
Documentation: Use the search bar in these docs
More examples: Check the test files for more usage examples