Validation Module
The conllu_tools.validation module provides comprehensive validation tools for CoNLL-U
format files with configurable validation levels.
Validation Levels:
Level 1: Basic format validation (Unicode, ID sequence, basic structure)
Level 2: Universal guidelines (metadata, MISC column, feature format, enhanced deps)
Level 3: Content validation (relation direction, functional leaves, projective punct)
Level 4: Language-specific format validation
Level 5: Language-specific content validation
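The level numbers can be modeled as an ordered enumeration. The sketch below is illustrative only: `ValidationLevel` and `levels_up_to` are hypothetical names, not part of conllu_tools, and the code assumes that a higher level also runs the checks of every lower level, which the list above does not state explicitly.

```python
from enum import IntEnum


class ValidationLevel(IntEnum):
    """Hypothetical enum mirroring the five documented levels."""

    BASIC_FORMAT = 1
    UNIVERSAL_GUIDELINES = 2
    CONTENT = 3
    LANGUAGE_FORMAT = 4
    LANGUAGE_CONTENT = 5


def levels_up_to(level: int) -> list[ValidationLevel]:
    """Return the levels that would run, ASSUMING levels are cumulative."""
    requested = ValidationLevel(level)  # raises ValueError outside 1-5
    return [lv for lv in ValidationLevel if lv <= requested]
```

Under that assumption, `levels_up_to(2)` yields the basic-format and universal-guidelines levels, matching the constructor default of `level=2`.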
Main Classes
ConlluValidator
- class conllu_tools.validation.validator.ConlluValidator(lang='ud', level=2, add_features=None, add_deprels=None, add_auxiliaries=None, add_whitespace_exceptions=None, load_dalme=False, sentence_concordance=None)
Bases:
FormatValidationMixin, IdSequenceValidationMixin, UnicodeValidationMixin, EnhancedDepsValidationMixin, MetadataValidationMixin, MiscValidationMixin, StructureValidationMixin, ContentValidationMixin, CharacterValidationMixin, LanguageFormatValidationMixin, LanguageContentValidationMixin, FeatureValidationMixin
Validator for CoNLL-U files with configurable validation levels.
- __init__(lang='ud', level=2, add_features=None, add_deprels=None, add_auxiliaries=None, add_whitespace_exceptions=None, load_dalme=False, sentence_concordance=None)
Initialize the validator.
- Parameters:
lang (str) – Language code for language-specific validation.
level (int) – Validation level (1-5).
add_features (str | None) – Path to additional features JSON file.
add_deprels (str | None) – Path to additional deprels JSON file.
add_auxiliaries (str | None) – Path to additional auxiliaries JSON file.
add_whitespace_exceptions (str | None) – Path to additional whitespace exceptions file.
load_dalme (bool) – Whether to load DALME data.
sentence_concordance (dict[str, dict[str, Any]] | None) – Mapping of sentence IDs to additional metadata.
- validate_file(filepath)
Validate a CoNLL-U file.
- Parameters:
filepath – Path to the CoNLL-U file to validate.
- Returns:
List of formatted error messages.
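A minimal usage sketch follows, based only on the constructor and `validate_file` signatures documented above. The helpers `collect_errors` and `validate_directory` are hypothetical, and the code assumes that `filepath` may be passed as a plain string and that an empty list means no errors were found; check both against the library before relying on them.

```python
from pathlib import Path

try:
    # Import path taken from the class signature documented above.
    from conllu_tools.validation.validator import ConlluValidator
except ImportError:
    ConlluValidator = None  # library not installed; collect_errors still works


def collect_errors(validator, paths):
    """Call validator.validate_file on each path and keep files with errors.

    `validator` is any object with a validate_file(filepath) method that
    returns a list of formatted error messages.
    """
    report = {}
    for path in paths:
        # Assumption: filepath may be passed as a plain string.
        messages = validator.validate_file(str(path))
        if messages:
            report[str(path)] = list(messages)
    return report


def validate_directory(directory, lang="ud", level=2):
    """Hypothetical helper: validate every *.conllu file in a directory."""
    if ConlluValidator is None:
        raise RuntimeError("conllu_tools is not installed")
    validator = ConlluValidator(lang=lang, level=level)
    return collect_errors(validator, sorted(Path(directory).glob("*.conllu")))
```

Keeping `collect_errors` independent of the concrete class makes it easy to exercise with a stand-in object in unit tests.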
ErrorReporter
ErrorEntry
Exceptions
Validation Mixins
The validator is composed of multiple mixin classes, each handling a specific aspect of
validation. These are combined in ConlluValidator but can be referenced for understanding
the validation architecture.
Note
These mixins are internal implementation details and are not part of the public API. They are documented here for advanced users who want to understand or extend the validation system.
Format Validation Mixins:
- FormatValidationMixin - Basic CoNLL-U format validation
- IdSequenceValidationMixin - Token ID sequence validation
- UnicodeValidationMixin - Unicode normalization and character validation
- FeatureValidationMixin - FEATS column format validation
Content Validation Mixins:
- MetadataValidationMixin - Sentence metadata validation
- MiscValidationMixin - MISC column validation
- StructureValidationMixin - Dependency tree structure validation
- ContentValidationMixin - Content-level validation (relations, subjects, etc.)
- EnhancedDepsValidationMixin - Enhanced dependency validation
- CharacterValidationMixin - Character constraint validation
Language-Specific Mixins:
- LanguageFormatValidationMixin - Language-specific format rules
- LanguageContentValidationMixin - Language-specific content rules
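To make the composition pattern concrete, here is a minimal, self-contained sketch of mixin-based validation. All class and method names (`ErrorCollector`, `ColumnCountMixin`, `IdSequenceMixin`, `MiniValidator`, `check_columns`, `check_ids`) are hypothetical and far simpler than the real mixins; the sketch only illustrates how independent checks can be combined through multiple inheritance, not how conllu_tools implements them.

```python
class ErrorCollector:
    """Hypothetical base class holding the shared error list."""

    def __init__(self):
        self.errors = []

    def report(self, line_no, message):
        self.errors.append(f"line {line_no}: {message}")


class ColumnCountMixin:
    """Checks that each token line has the 10 tab-separated CoNLL-U columns."""

    def check_columns(self, line_no, line):
        if len(line.split("\t")) != 10:
            self.report(line_no, "expected 10 columns")


class IdSequenceMixin:
    """Checks that integer token IDs count up from 1 within a sentence."""

    def check_ids(self, lines):
        expected = 1
        for line_no, line in lines:
            token_id = line.split("\t")[0]
            if not token_id.isdigit():
                continue  # skip multiword ranges (1-2) and empty nodes (1.1)
            if int(token_id) != expected:
                self.report(line_no, f"expected ID {expected}, got {token_id}")
            expected = int(token_id) + 1  # resynchronize after a mismatch


class MiniValidator(ColumnCountMixin, IdSequenceMixin, ErrorCollector):
    """Combines the checks; each mixin relies on report() from the base."""

    def validate_sentence(self, text):
        tokens = [
            (n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if line and not line.startswith("#")  # drop blanks and comments
        ]
        for n, line in tokens:
            self.check_columns(n, line)
        self.check_ids(tokens)
        return self.errors
```

Each mixin contributes one check and depends only on the shared `report` method, so new checks can be added by writing another mixin and listing it among the bases.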