Utils Module

The conllu_tools.utils module provides utilities for working with different tagsets and formats, including morphology normalization, XPOS format conversion, and feature validation.

Key Capabilities:

  • Normalize morphological annotations across different treebank formats

  • Convert between UPOS tags and Perseus XPOS codes

  • Validate and convert features and XPOS strings

  • Convert XPOS formats from LLCT, ITTB, and PROIEL treebanks to Perseus standard

Morphology Normalization

The main entry point for normalizing morphological information.

conllu_tools.utils.normalization.normalize_morphology(upos, xpos, feats, feature_set, ref_features=None)

Normalize morphological information.

Takes UPOS, XPOS, and FEATS, normalizes and validates them against a provided feature set, and reconciles with reference features if provided.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • xpos (str) – The language-specific Part of Speech tag.

  • feats (dict[str, str] | str) – A feature string or dictionary of features.

  • feature_set (dict[str, Any]) – A feature set dictionary defining valid features.

  • ref_features (dict[str, str] | str | None) – A reference feature string or dictionary to reconcile with (optional).

Return type:

tuple[str, dict[str, str]]

Returns:

A tuple containing the normalized XPOS string and validated feature dictionary.

UPOS Utilities

Convert between different POS tag systems.

Convert from and to UPOS.

conllu_tools.utils.upos.upos_to_perseus(upos_tag)

Convert a UPOS tag to a Perseus XPOS tag.

Return type:

str

conllu_tools.utils.upos.dalme_to_upos(dalmepos_tag)

Convert a DALME POS tag to a Universal POS tag.

Return type:

str

Feature Utilities

Convert and validate morphological features.

Feature string and dictionary conversion utilities.

conllu_tools.utils.features.feature_string_to_dict(feat_string)

Convert a feature string to a dictionary.

Parameters:

feat_string (str | None) – A feature string.

Return type:

dict[str, str]

Returns:

A dictionary of features.

conllu_tools.utils.features.feature_dict_to_string(feat_dict)

Convert a feature dictionary to a string.

Parameters:

feat_dict (dict[str, Any] | None) – A dictionary of features.

Return type:

str

Returns:

A feature string.
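
The two conversions above correspond to the CoNLL-U FEATS column, where features are written as Name=Value pairs joined by "|" and an underscore stands for an empty feature set. The following is a minimal sketch of that round trip, not the library's implementation. The names sketch_string_to_dict and sketch_dict_to_string are hypothetical, and the case-insensitive sorting of feature names is an assumption taken from the CoNLL-U convention.

```python
def sketch_string_to_dict(feat_string):
    """Parse 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    # None, the empty string, and "_" all mean "no features".
    if not feat_string or feat_string == "_":
        return {}
    feats = {}
    for pair in feat_string.split("|"):
        if "=" not in pair:
            continue  # skip malformed pairs rather than raising
        name, value = pair.split("=", 1)
        feats[name] = value
    return feats


def sketch_dict_to_string(feat_dict):
    """Serialize a feature dictionary back into a FEATS string."""
    if not feat_dict:
        return "_"
    # Assumption: feature names are ordered case-insensitively, as in CoNLL-U.
    pairs = sorted(feat_dict.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


feats = sketch_string_to_dict("Number=Sing|Case=Nom")
print(feats)
print(sketch_dict_to_string(feats))
```

Whether the library functions skip malformed pairs or sort their output in the same way is not stated in this reference, so check the source before relying on either behavior.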

conllu_tools.utils.features.features_to_xpos(feats)

Convert features to XPOS in Perseus format.

Parameters:

feats (dict[str, str] | str) – A feature string or dictionary of features.

Return type:

str

Returns:

An XPOS string in Perseus format.

conllu_tools.utils.features.xpos_to_features(xpos)

Convert XPOS in Perseus format to features.

Parameters:

xpos (str) – An XPOS string in Perseus format.

Return type:

dict[str, str]

Returns:

A dictionary of features.
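
A Perseus XPOS tag is a nine-character positional string (part of speech, person, number, tense, mood, voice, gender, case, degree), with "-" marking an unset position. The sketch below illustrates the idea for three of those positions only. It is not the library's code: the lookup tables, the function names, and the extra pos_char argument are assumptions made for illustration, and the real functions may cover more positions or use different mappings.

```python
# Zero-based indices of three of the nine Perseus positions.
NUMBER_POS, GENDER_POS, CASE_POS = 2, 6, 7

# Assumed code-to-value tables (illustrative, not taken from the library).
NUMBER = {"s": "Sing", "p": "Plur"}
GENDER = {"m": "Masc", "f": "Fem", "n": "Neut"}
CASE = {"n": "Nom", "g": "Gen", "d": "Dat", "a": "Acc",
        "b": "Abl", "v": "Voc", "l": "Loc"}

SLOTS = [
    ("Number", NUMBER_POS, NUMBER),
    ("Gender", GENDER_POS, GENDER),
    ("Case", CASE_POS, CASE),
]


def sketch_xpos_to_features(xpos):
    """Decode the number, gender, and case positions of a Perseus tag."""
    feats = {}
    for name, pos, table in SLOTS:
        code = xpos[pos] if pos < len(xpos) else "-"
        if code in table:
            feats[name] = table[code]
    return feats


def sketch_features_to_xpos(pos_char, feats):
    """Encode features into a nine-character tag starting with pos_char."""
    chars = [pos_char] + ["-"] * 8
    for name, pos, table in SLOTS:
        for code, value in table.items():
            if feats.get(name) == value:
                chars[pos] = code
    return "".join(chars)


noun = sketch_xpos_to_features("n-s---mn-")
print(noun)
print(sketch_features_to_xpos("n", noun))
```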

conllu_tools.utils.features.validate_features(upos, feats, feature_set)

Ensure features are valid for the given UPOS, based on the feature set.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • feats (dict[str, str] | str) – A feature string or dictionary of features.

  • feature_set (dict[str, Any]) – A feature set dictionary defining valid features.

Return type:

dict[str, str]

Returns:

A validated feature dictionary.
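
As a rough illustration of what validating features against a feature set can look like, the sketch below keeps only the features whose value is listed for the given UPOS. The layout of FEATURE_SET (feature name, then UPOS, then allowed values) is a hypothetical simplification, and dropping invalid features silently is an assumption; this reference does not say how the library structures its feature set or handles invalid entries.

```python
# Hypothetical, simplified feature set: feature -> UPOS -> allowed values.
FEATURE_SET = {
    "Case": {"NOUN": {"Nom", "Gen", "Acc"}, "ADJ": {"Nom", "Gen", "Acc"}},
    "Tense": {"VERB": {"Pres", "Past"}},
}


def sketch_validate_features(upos, feats, feature_set):
    """Return only the features that the feature set allows for this UPOS."""
    valid = {}
    for name, value in feats.items():
        allowed = feature_set.get(name, {}).get(upos, set())
        if value in allowed:
            valid[name] = value
    return valid


checked = sketch_validate_features(
    "NOUN", {"Case": "Nom", "Tense": "Pres"}, FEATURE_SET
)
print(checked)
```

In this sketch, Tense is removed because the hypothetical feature set lists it only for VERB.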

XPOS Utilities

Convert and validate XPOS tags across different treebank formats.

Format XPOS

Auto-detect and convert XPOS formats to Perseus standard.

Convert various XPOS formats to Perseus XPOS format.

conllu_tools.utils.xpos.format_xpos.format_xpos(upos, xpos, feats)

Convert morphology data in various formats to Perseus XPOS.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • xpos (str | None) – An XPOS string in almost any style (LLCT, ITTB, PROIEL, Perseus, DALME, etc.).

  • feats (dict[str, str] | str | None) – A feature string or dictionary of features.

Return type:

str

Returns:

A Perseus XPOS string.

Validate XPOS

Validate XPOS positions against UPOS-specific rules.

XPOS validation.

conllu_tools.utils.xpos.validate.validate_xpos(upos, xpos)

Ensure the XPOS is valid for the given UPOS.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • xpos (str | None) – The language-specific Part of Speech tag.

Return type:

str

Returns:

A validated XPOS string.
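
One way to picture position-based validation is a table that lists, for each UPOS, which of the nine Perseus positions may carry a value; every other position is reset to "-". The sketch below follows that idea. The ALLOWED_POSITIONS table and the padding behavior are assumptions for illustration only, and the library's actual rules are not documented here.

```python
# Hypothetical rules: zero-based Perseus positions that may be set per UPOS.
ALLOWED_POSITIONS = {
    "NOUN": {0, 2, 6, 7},           # pos, number, gender, case
    "VERB": {0, 1, 2, 3, 4, 5},     # pos, person, number, tense, mood, voice
}


def sketch_validate_xpos(upos, xpos):
    """Blank out positions that the (assumed) rules do not allow for upos."""
    # Treat None as empty, then pad or trim to exactly nine characters.
    xpos = (xpos or "").ljust(9, "-")[:9]
    allowed = ALLOWED_POSITIONS.get(upos)
    if allowed is None:
        return xpos  # no rule for this UPOS: leave the tag unchanged
    return "".join(ch if i in allowed else "-" for i, ch in enumerate(xpos))


cleaned = sketch_validate_xpos("NOUN", "n3s---mn-")
print(cleaned)
```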

ITTB to Perseus

Convert Index Thomisticus Treebank XPOS to Perseus format.

Functions for converting between ITTB and Perseus XPOS tags.

conllu_tools.utils.xpos.ittb_converters.ittb_to_perseus(upos, xpos)

Convert ITTB UPOS and XPOS to Perseus XPOS tag.

Parameters:
  • upos (str | None) – The Universal Part of Speech tag.

  • xpos (str | None) – The ITTB XPOS tag.

Return type:

str

Returns:

A Perseus XPOS string.

PROIEL to Perseus

Convert PROIEL Treebank XPOS to Perseus format.

Functions for converting between PROIEL and Perseus XPOS tags.

conllu_tools.utils.xpos.proiel_converters.proiel_to_perseus(upos, feats)

Convert PROIEL UPOS and FEATS to Perseus XPOS tag.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • feats (dict[str, str] | str) – A feature string or dictionary of features.

Return type:

str

Returns:

A Perseus XPOS string.

LLCT to Perseus

Convert Late Latin Charter Treebank XPOS to Perseus format.

Functions for converting between LLCT and Perseus XPOS tags.

conllu_tools.utils.xpos.llct_converters.llct_to_perseus(upos, xpos, feats)

Convert LLCT UPOS, XPOS, and FEATS to Perseus XPOS tag.

Parameters:
  • upos (str) – The Universal Part of Speech tag.

  • xpos (str) – An LLCT XPOS string.

  • feats (dict[str, str] | str) – A feature string or dictionary of features.

Return type:

str

Returns:

A Perseus XPOS string.

brat Utilities

Utilities for working with the brat standoff annotation format. These are used by the conversion tools in the IO module but can also be used independently.

Utilities for BRAT standoff format.

conllu_tools.utils.brat.type_to_safe_type(typestring)

Rewrite characters in CoNLL-X types that cannot be directly used in identifiers in brat-flavored standoff.

Parameters:

typestring (str) – The original CoNLL-X type string.

Return type:

str

Returns:

A brat-compatible type string.

conllu_tools.utils.brat.safe_type_to_type(typestring)

Rewrite characters in brat-flavored standoff types back to CoNLL-X format.

Parameters:

typestring (str) – The brat-safe type string.

Return type:

str

Returns:

The original CoNLL-X type string.
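
The pair of functions above describes a reversible character rewrite. A minimal sketch of that technique is shown here, assuming a small substitution table; the CHAR_MAP entries and the function names are hypothetical and may not match the characters or replacement tokens the library actually uses.

```python
# Hypothetical substitution table: unsafe character -> identifier-safe token.
CHAR_MAP = {":": "_colon_", ".": "_period_"}


def sketch_to_safe(typestring):
    """Replace each unsafe character with its token."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in typestring)


def sketch_from_safe(typestring):
    """Replace each token with the character it stands for."""
    for ch, token in CHAR_MAP.items():
        typestring = typestring.replace(token, ch)
    return typestring


safe = sketch_to_safe("nsubj:pass")
print(safe)
print(sketch_from_safe(safe))
```

The round trip in this sketch is only lossless when the original type string does not already contain one of the tokens.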

conllu_tools.utils.brat.parse_annotation_line(line)

Parse a BRAT annotation line into its components.

Parameters:

line (str) – A single line from a BRAT .ann file.

Return type:

dict[str, Any] | None

Returns:

A dictionary with annotation details, or None if the line is invalid.

conllu_tools.utils.brat.format_annotation(ann)

Format an annotation dict back into BRAT format.

Parameters:

ann (dict[str, Any]) – A dictionary with annotation details.

Return type:

str

Returns:

A string formatted for a BRAT .ann file.
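
In the brat standoff format, a text-bound annotation line has the form ID, tab, "Type start end", tab, text, and a binary relation line has the form ID, tab, "Type Arg1:ID Arg2:ID". The sketch below parses and re-serializes those two line types. It is an illustration under assumptions: the dictionary keys (id, type, start, end, text, arg1, arg2) and the function names are hypothetical, discontinuous spans are not handled, and the library's own dictionaries may be shaped differently.

```python
def sketch_parse_line(line):
    """Parse one .ann line into a dict, or return None if it is not recognized."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2 or not parts[0]:
        return None
    ann_id, body = parts[0], parts[1].split()
    # Text-bound annotation: "T1<TAB>NOUN 0 6<TAB>Gallia"
    if ann_id.startswith("T") and len(parts) == 3 and len(body) == 3:
        try:
            start, end = int(body[1]), int(body[2])
        except ValueError:
            return None
        return {"id": ann_id, "type": body[0],
                "start": start, "end": end, "text": parts[2]}
    # Binary relation: "R1<TAB>nsubj Arg1:T2 Arg2:T1"
    if ann_id.startswith("R") and len(body) == 3:
        args = dict(item.split(":", 1) for item in body[1:] if ":" in item)
        if set(args) != {"Arg1", "Arg2"}:
            return None
        return {"id": ann_id, "type": body[0],
                "arg1": args["Arg1"], "arg2": args["Arg2"]}
    return None


def sketch_format(ann):
    """Serialize a dict produced by sketch_parse_line back into an .ann line."""
    if ann["id"].startswith("T"):
        span = f"{ann['type']} {ann['start']} {ann['end']}"
        return f"{ann['id']}\t{span}\t{ann['text']}"
    return f"{ann['id']}\t{ann['type']} Arg1:{ann['arg1']} Arg2:{ann['arg2']}"


entity = sketch_parse_line("T1\tNOUN 0 6\tGallia")
relation = sketch_parse_line("R1\tnsubj Arg1:T2 Arg2:T1")
print(entity)
print(sketch_format(relation))
```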

conllu_tools.utils.brat.read_annotations(filepath)

Read and parse all annotations from a BRAT .ann file.

Parameters:

filepath (str) – Path to the BRAT .ann file.

Return type:

list[dict[str, str | int]]

Returns:

A list of annotation dictionaries.

conllu_tools.utils.brat.read_text_lines(filepath)

Read the text content from a BRAT .txt file.

Parameters:

filepath (str) – Path to the BRAT .txt file.

Return type:

list[str]

Returns:

The text content of the file as a list of strings.

conllu_tools.utils.brat.sort_annotations_set(annotations)

Sort a set of annotations by ID number to maintain consistent ordering.

Parameters:

annotations (list[dict[str, Any]]) – A list of annotation dictionaries.

Return type:

list[dict[str, str | int]]

Returns:

A sorted list of annotation dictionaries.

conllu_tools.utils.brat.sort_annotations(annotations)

Sort annotations by type and ID number.

Parameters:

annotations (list[dict[str, Any]]) – A list of annotation dictionaries.

Return type:

list[dict[str, str | int]]

Returns:

A sorted list of annotation dictionaries.
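
Sorting brat IDs as plain strings would place T10 before T2, so a sort by ID number has to compare the numeric part as an integer. The sketch below shows one way to do that. It assumes each annotation dictionary has an "id" key and treats the ID prefix (T before R) as the annotation type; both are assumptions, since this reference does not specify the dictionary layout or what "type" refers to.

```python
import re

# Assumed ordering of ID prefixes: entities (T) before relations (R).
PREFIX_ORDER = {"T": 0, "R": 1}


def _id_key(ann):
    """Build a sort key of (prefix rank, numeric part) from an ID like 'T12'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", ann["id"])
    if match is None:
        # Unrecognized IDs sort last.
        return (len(PREFIX_ORDER), 0)
    prefix, number = match.group(1), int(match.group(2))
    return (PREFIX_ORDER.get(prefix, len(PREFIX_ORDER)), number)


def sketch_sort(annotations):
    """Return a new list ordered by prefix and then by ID number."""
    return sorted(annotations, key=_id_key)


anns = [{"id": "T10"}, {"id": "R1"}, {"id": "T2"}]
ordered = [ann["id"] for ann in sketch_sort(anns)]
print(ordered)
```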

conllu_tools.utils.brat.write_annotations(filepath, annotations)

Write annotations to a BRAT .ann file.

Parameters:
  • filepath (str | Path) – Path to the output BRAT .ann file.

  • annotations (list[dict[str, str | int]]) – A list of annotation dictionaries to write.

Return type:

None

conllu_tools.utils.brat.write_text(filepath, doctext)

Write document text to a BRAT .txt file.

Parameters:
  • filepath (str | Path) – Path to the output BRAT .txt file.

  • doctext (list[str]) – A list of strings representing the document text.

Return type:

None

conllu_tools.utils.brat.write_auxiliary_files(output_directory, metadata)

Add metadata and default BRAT configuration files to the output directory.

Parameters:
  • output_directory (str) – The directory to write the configuration files to.

  • metadata (dict[str, Any]) – Dictionary with metadata values for the directory.

Return type:

None

conllu_tools.utils.brat.get_next_id_number(annotations, prefix)

Find the next available ID number for a given prefix (T or R).

Parameters:
  • annotations (list[dict[str, Any]]) – A list of annotation dictionaries.

  • prefix (str) – The prefix to search for (‘T’ for entities, ‘R’ for relations).

Return type:

int

Returns:

The next available ID number for the given prefix.
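
A simple way to compute the next available ID number is to take the highest number already used with the prefix and add one. The sketch below does that, assuming each annotation dictionary stores its ID under an "id" key. Whether the library reuses gaps in the numbering or always continues from the maximum is not stated here, so the max-plus-one rule is an assumption.

```python
def sketch_next_id(annotations, prefix):
    """Return one more than the highest ID number used with this prefix."""
    numbers = []
    for ann in annotations:
        ann_id = ann.get("id", "")
        suffix = ann_id[len(prefix):]
        # Only count IDs such as 'T7' whose suffix is purely numeric.
        if ann_id.startswith(prefix) and suffix.isdigit():
            numbers.append(int(suffix))
    return max(numbers, default=0) + 1


existing = [{"id": "T1"}, {"id": "T7"}, {"id": "R2"}]
print(sketch_next_id(existing, "T"))
print(sketch_next_id(existing, "R"))
```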