Validation Module

The conllu_tools.validation module validates CoNLL-U files at one of five configurable levels, ranging from basic format checks to language-specific content rules.

Validation Levels:

  • Level 1: Basic format validation (Unicode, ID sequence, basic structure)

  • Level 2: Universal guidelines (metadata, MISC column, feature format, enhanced dependencies)

  • Level 3: Content validation (relation direction, functional leaves, projective punctuation)

  • Level 4: Language-specific format validation

  • Level 5: Language-specific content validation
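
The level descriptions above can be captured in a small lookup table, for instance when building a help message for a command-line wrapper. The sketch below is illustrative only: LEVEL_DESCRIPTIONS and describe_levels are hypothetical names, and it assumes that a higher level also runs the checks of every lower level, which this page does not state explicitly.

```python
from __future__ import annotations

# Illustrative lookup table; not part of conllu_tools.
LEVEL_DESCRIPTIONS = {
    1: "Basic format validation",
    2: "Universal guidelines",
    3: "Content validation",
    4: "Language-specific format validation",
    5: "Language-specific content validation",
}


def describe_levels(level: int) -> list[str]:
    """Return descriptions for levels 1..level (assumes levels are cumulative)."""
    if level not in LEVEL_DESCRIPTIONS:
        raise ValueError(f"level must be between 1 and 5, got {level}")
    return [f"Level {n}: {LEVEL_DESCRIPTIONS[n]}" for n in range(1, level + 1)]
```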

Main Classes

ConlluValidator

class conllu_tools.validation.validator.ConlluValidator(lang='ud', level=2, add_features=None, add_deprels=None, add_auxiliaries=None, add_whitespace_exceptions=None, load_dalme=False, sentence_concordance=None)

Bases: FormatValidationMixin, IdSequenceValidationMixin, UnicodeValidationMixin, EnhancedDepsValidationMixin, MetadataValidationMixin, MiscValidationMixin, StructureValidationMixin, ContentValidationMixin, CharacterValidationMixin, LanguageFormatValidationMixin, LanguageContentValidationMixin, FeatureValidationMixin

Validator for CoNLL-U files with configurable validation levels.

__init__(lang='ud', level=2, add_features=None, add_deprels=None, add_auxiliaries=None, add_whitespace_exceptions=None, load_dalme=False, sentence_concordance=None)

Initialize the validator.

Parameters:
  • lang (str) – Language code for language-specific validation.

  • level (int) – Validation level (1-5).

  • add_features (str | None) – Path to additional features JSON file.

  • add_deprels (str | None) – Path to additional deprels JSON file.

  • add_auxiliaries (str | None) – Path to additional auxiliaries JSON file.

  • add_whitespace_exceptions (str | None) – Path to additional whitespace exceptions file.

  • load_dalme (bool) – Whether to load DALME data.

  • sentence_concordance (dict[str, dict[str, Any]] | None) – Mapping of sentence IDs to additional metadata.

validate_file(filepath)

Validate a CoNLL-U file.

Parameters:

filepath (str | Path) – Path to the CoNLL-U file

Return type:

ErrorReporter

Returns:

The ErrorReporter that collected the validation errors; call format_errors() on it to get the list of formatted error messages.

validate_string(content)

Validate CoNLL-U content from a string.

Parameters:

content (str) – CoNLL-U content as a string

Return type:

ErrorReporter

Returns:

The ErrorReporter that collected the validation errors; call format_errors() on it to get the list of formatted error messages.
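
As a usage sketch, the helper below runs a validator over a string and collects the error count together with the formatted messages. It relies only on the methods documented on this page (validate_string, get_error_count, format_errors) and assumes the return value is an ErrorReporter, as the listed return type indicates. summarize_validation is a hypothetical helper, not part of the package.

```python
from __future__ import annotations

from typing import Any


def summarize_validation(validator: Any, content: str) -> tuple[int, list[str]]:
    """Validate `content` and return (error count, formatted messages).

    `validator` is expected to behave like ConlluValidator, i.e. expose
    validate_string() returning an object with get_error_count() and
    format_errors(). This helper is hypothetical, not part of conllu_tools.
    """
    reporter = validator.validate_string(content)
    return reporter.get_error_count(), reporter.format_errors()
```

With the package installed, this would presumably be called as summarize_validation(ConlluValidator(lang='ud', level=2), text). The same pattern should apply to validate_file with a path instead of a string.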

ErrorReporter

class conllu_tools.validation.error_reporter.ErrorReporter

Bases: object

Manages error collection and reporting for validation.

__init__()

Initialize the error reporter.

reset()

Reset the error reporter state.

Return type:

None

warn(msg, error_type, testlevel=0, testid='some-test', line_no=None, node_id=None)

Record a validation warning/error.

Parameters:
  • msg (str) – Error message

  • error_type (str) – Type/category of error

  • testlevel (int) – Level of the test (1-5)

  • testid (str) – Identifier for the test

  • line_no (int | None) – Line number where error occurred

  • node_id (str | None) – Node ID if applicable

Return type:

None

format_errors()

Format all errors as a list of strings.

Return type:

list[str]

get_error_count()

Get the total number of errors.

Return type:

int
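
To make the collect, count, format, reset lifecycle concrete, here is a minimal stand-in that mirrors the documented method names and the warn() parameters. It is a sketch for illustration, not the library's implementation: SketchReporter is a hypothetical class, and the message layout produced by format_errors is an assumption.

```python
from __future__ import annotations


class SketchReporter:
    """Illustrative stand-in mirroring the documented ErrorReporter methods."""

    def __init__(self) -> None:
        self.errors: list[dict] = []

    def reset(self) -> None:
        """Drop everything collected so far."""
        self.errors.clear()

    def warn(self, msg, error_type, testlevel=0, testid="some-test",
             line_no=None, node_id=None) -> None:
        """Record one problem; same parameters as the documented warn()."""
        self.errors.append({
            "msg": msg,
            "error_type": error_type,
            "testlevel": testlevel,
            "testid": testid,
            "line_no": line_no,
            "node_id": node_id,
        })

    def format_errors(self) -> list[str]:
        """Render each record as one line (layout is an assumption)."""
        lines = []
        for e in self.errors:
            where = f"Line {e['line_no']}" if e["line_no"] is not None else "Line ?"
            lines.append(
                f"[{where}]: [L{e['testlevel']} {e['error_type']} {e['testid']}] {e['msg']}"
            )
        return lines

    def get_error_count(self) -> int:
        return len(self.errors)
```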

ErrorEntry

class conllu_tools.validation.error_reporter.ErrorEntry(alt_id, testlevel, error_type, testid, msg, node_id, line_no, tree_counter)

Bases: object

Represents a single validation error.

alt_id: str | None
testlevel: int
error_type: str
testid: str
msg: str
node_id: str | None
line_no: int | None
tree_counter: int | None

__str__()

Format the error as a string.

Return type:

str

__init__(alt_id, testlevel, error_type, testid, msg, node_id, line_no, tree_counter)
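
The annotated fields and the generated-looking __init__ suggest that ErrorEntry is a dataclass, although this page does not confirm it. Under that assumption, a stand-in with the same fields in the same order could look like the sketch below. ErrorEntrySketch is a hypothetical name, and its __str__ layout is invented for illustration; the real output may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ErrorEntrySketch:
    """Stand-in with the same fields, in the same order, as ErrorEntry."""

    alt_id: str | None
    testlevel: int
    error_type: str
    testid: str
    msg: str
    node_id: str | None
    line_no: int | None
    tree_counter: int | None

    def __str__(self) -> str:
        # Hypothetical layout: location first, then level, type, test id.
        parts = []
        if self.line_no is not None:
            parts.append(f"Line {self.line_no}")
        if self.node_id is not None:
            parts.append(f"Node {self.node_id}")
        location = " ".join(parts)
        return f"[{location}]: [L{self.testlevel} {self.error_type} {self.testid}] {self.msg}"
```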

Exceptions

exception conllu_tools.validation.error_reporter.ValidationError

Bases: Exception

Custom exception for validation errors.
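
This page does not say when the library raises ValidationError. One plausible use in calling code is a fail-fast policy that turns a non-empty error list into an exception. The sketch below shows that pattern with a local stand-in class, so it runs without the package; SketchValidationError and raise_if_errors are hypothetical names.

```python
from __future__ import annotations


class SketchValidationError(Exception):
    """Local stand-in for conllu_tools' ValidationError."""


def raise_if_errors(messages: list[str]) -> None:
    """Raise when any formatted error messages are present (hypothetical policy)."""
    if messages:
        raise SketchValidationError(
            f"{len(messages)} validation error(s); first: {messages[0]}"
        )
```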

Validation Mixins

The validator is composed of multiple mixin classes, each handling a specific aspect of validation. These are combined in ConlluValidator but can be referenced for understanding the validation architecture.

Note

These mixins are internal implementation details and are not part of the public API. They are documented here for advanced users who want to understand or extend the validation system.

Format Validation Mixins:

  • FormatValidationMixin - Basic CoNLL-U format validation

  • IdSequenceValidationMixin - Token ID sequence validation

  • UnicodeValidationMixin - Unicode normalization and character validation

  • FeatureValidationMixin - FEATS column format validation

Content Validation Mixins:

  • MetadataValidationMixin - Sentence metadata validation

  • MiscValidationMixin - MISC column validation

  • StructureValidationMixin - Dependency tree structure validation

  • ContentValidationMixin - Content-level validation (relations, subjects, etc.)

  • EnhancedDepsValidationMixin - Enhanced dependency validation

  • CharacterValidationMixin - Character constraint validation

Language-Specific Mixins:

  • LanguageFormatValidationMixin - Language-specific format rules

  • LanguageContentValidationMixin - Language-specific content rules
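
The mixin architecture described in this section can be illustrated with a toy example: each mixin contributes one check method, and a concrete class inherits from all of them and calls the checks in turn. The classes and checks below are hypothetical and far simpler than the real mixins; the only CoNLL-U facts used are that token lines have 10 tab-separated columns and that comment lines start with "#".

```python
from __future__ import annotations


class ColumnCountMixin:
    """Hypothetical format check: a token line has 10 tab-separated columns."""

    def check_columns(self, line: str, line_no: int) -> None:
        if len(line.split("\t")) != 10:
            self.problems.append(f"line {line_no}: expected 10 columns")


class TrailingSpaceMixin:
    """Hypothetical format check: a line must not end with whitespace."""

    def check_trailing_space(self, line: str, line_no: int) -> None:
        if line != line.rstrip():
            self.problems.append(f"line {line_no}: trailing whitespace")


class TinyValidator(ColumnCountMixin, TrailingSpaceMixin):
    """Combines the mixins; each one contributes a single check method."""

    def __init__(self) -> None:
        self.problems: list[str] = []

    def validate_string(self, content: str) -> list[str]:
        self.problems = []
        for line_no, line in enumerate(content.splitlines(), start=1):
            if not line or line.startswith("#"):
                continue  # blank sentence separators and comment lines
            self.check_columns(line, line_no)
            self.check_trailing_space(line, line_no)
        return self.problems
```

Keeping each check in its own mixin lets a group of related rules be read and tested in isolation, while the combined class still exposes a single entry point.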