Conversion
This guide covers converting between brat standoff annotation format and CoNLL-U format.
Brat is a web-based annotation tool that uses standoff format (separate .ann and .txt files).
CoNLL-U is a tabular format for representing linguistic annotations, especially dependency trees.
The conversion utilities allow you to:
Convert CoNLL-U treebanks to Brat format for annotation/review
Convert Brat annotations back to CoNLL-U format
Handle enhanced dependencies and complex structures
Preserve morphological features and metadata
CoNLL-U to Brat
Converting CoNLL-U files to Brat standoff format enables visual annotation and review using the Brat tool.
Basic Usage
from conllu_tools.io import conllu_to_brat
conllu_to_brat(
conllu_filename='input.conllu',
output_directory='brat_output/',
output_root=True,
sents_per_doc=10
)
This creates:
.txtfiles containing the raw text.annfiles containing the annotations in Brat standoff formatConfiguration files (
annotation.conf,visual.conf,tools.conf)metadata.jsonfile containing conversion parameters (used for round-trip conversion)
Parameters
- conllu_filename (str)
Path to the input CoNLL-U file.
- output_directory (str)
Directory where Brat files will be created.
- output_root (bool, optional)
If
True, creates explicit ROOT nodes for sentence roots. This makes the dependency structure more visible in Brat. Default:True.- sents_per_doc (int, optional)
Number of sentences per output document. If specified, long files are split into multiple documents (e.g.,
doc-001,doc-002). IfNone, all sentences go into one document. Default:None.
Output Structure
The output directory will contain:
brat_output/
├── annotation.conf # Defines annotation types
├── visual.conf # Visual styling
├── tools.conf # Tool configuration
├── metadata.json # Conversion parameters (for round-trip conversion)
├── input-doc-001.txt # Text for document 1
├── input-doc-001.ann # Annotations for document 1
├── input-doc-002.txt
└── input-doc-002.ann
Features
Multiword tokens: Correctly handles multi-word tokens
ROOT nodes: Optional explicit ROOT entity for dependency visualization
Metadata preservation: Stores original CoNLL-U metadata
Automatic splitting: Can split large files into manageable documents
metadata.json: Automatically creates a metadata file containing conversion parameters
The metadata.json File
The conllu_to_brat function automatically creates a metadata.json file in the output directory containing:
{
"conllu_filename": "input.conllu",
"sents_per_doc": 10,
"output_root": true
}
Purpose: This file stores the conversion parameters so that brat_to_conllu can automatically use the same settings for round-trip conversion. When you call brat_to_conllu, it will:
Check for
metadata.jsonin the input directoryLoad the parameters from the file if present
Use those values if you don’t explicitly provide them
Benefits:
Ensures consistency between forward and reverse conversions
Eliminates the need to remember or document conversion parameters
Reduces errors from parameter mismatches
Note: You can override metadata values by explicitly passing parameters to brat_to_conllu.
Brat to CoNLL-U
Converting Brat annotations back to CoNLL-U format after manual annotation or review.
Basic Usage
from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data
# Load feature set for validation
feature_set = load_language_data('feats', language='la')
brat_to_conllu(
input_directory='brat_output/',
output_directory='conllu_output/',
ref_conllu='original.conllu',
feature_set=feature_set,
output_root=True,
sents_per_doc=10
)
Parameters
- input_directory (str)
Directory containing Brat
.annand.txtfiles.- output_directory (str)
Directory where the output CoNLL-U file will be created. The output file will be named
{basename}-from_brat.conlluwhere{basename}is derived from the reference file name.- ref_conllu (str)
Path to a reference CoNLL-U file. This provides morphological features (LEMMA, XPOS, FEATS) and other information not captured in Brat annotations. Optional if
metadata.jsonexists in the input directory with this information.- feature_set (dict)
Feature set definition loaded from
conllu_tools.io.load_language_data(). Used to validate and normalize features. Required parameter.- output_root (bool, optional)
Must match the value used when creating the Brat files (whether ROOT entities were included). While not used in the conversion logic itself, this parameter is required to ensure consistency and must match the original conversion settings. Optional if
metadata.jsonexists in the input directory with this information.- sents_per_doc (int, optional)
Must match the value used when creating the Brat files. Can be
Noneif all sentences were in one document. Optional ifmetadata.jsonexists in the input directory.
Features
Morphology from reference: Retrieves LEMMA, XPOS, FEATS from reference file
Dependency from Brat: Uses HEAD and DEPREL from Brat annotations
Enhanced dependencies: Supports enhanced dependency graphs
ID renumbering: Handles ID conflicts when merging multiple files
Automatic UPOS correction: Changes VERB→AUX for auxiliary verbs based on deprel
Automatic parameter loading: Reads conversion parameters from
metadata.jsonif present
Using metadata.json for Automatic Configuration
If you converted to Brat using conllu_to_brat, a metadata.json file was created. You can use it to simplify the conversion back:
from conllu_tools.io import brat_to_conllu
from conllu_tools.io import load_language_data
# Load feature set (still required)
feature_set = load_language_data('feats', language='la')
# metadata.json in input directory provides ref_conllu, sents_per_doc, and output_root
brat_to_conllu(
input_directory='brat_output/',
output_directory='conllu_output/',
feature_set=feature_set
)
# No need to specify ref_conllu, output_root, or sents_per_doc!
The function will automatically:
Look for
metadata.jsoninbrat_output/Read
conllu_filename(used asref_conllu)Read
output_rootandsents_per_docvaluesUse those parameters for conversion
Manual parameters override metadata: If you provide parameters explicitly, they take precedence over metadata values.
Troubleshooting
Missing metadata.json
If you get AssertionError: No ref_conllu value passed and no metadata file found:
Cause: The brat_to_conllu function needs to know the reference file and conversion parameters, but no metadata.json exists and you didn’t provide them explicitly.
Solutions:
If you have the original parameters, provide them explicitly:
brat_to_conllu( input_directory='brat_output/', output_directory='conllu_output/', feature_set=feature_set, ref_conllu='original.conllu', output_root=True, sents_per_doc=10 )
If you don’t know the parameters, check how the Brat files were created or try:
# Start with defaults brat_to_conllu( input_directory='brat_output/', output_directory='conllu_output/', feature_set=feature_set, ref_conllu='original.conllu', output_root=False, # Try False first sents_per_doc=None )
ID Mismatches
If you get “ID mismatch” or “Sentence length mismatch” errors during brat_to_conllu:
Ensure the reference file has the same sentences as the Brat files
Check that sentence order hasn’t changed
Verify
sents_per_docparameter matches original conversion (or use metadata.json)Check that you haven’t added or removed sentences in Brat
Token Mismatch Errors
If you get “Token mismatch in sentence X” errors:
Cause: The tokens in the Brat .txt file don’t match the tokens in the reference CoNLL-U file.
Solutions:
Ensure you’re using the correct reference file (the one originally converted)
Check that no tokens were edited in the Brat
.txtfilesVerify that tokenization is identical between files (case-insensitive comparison is used)
Missing Features
If converted CoNLL-U is missing morphological features:
Ensure you provided a valid reference CoNLL-U file with features
Check that the reference file sentences align with Brat sentences
Verify the feature_set is loaded correctly with:
load_language_data('feats', language='la')Remember: Brat only stores UPOS, HEAD, and DEPREL - all other fields come from the reference file
Missing Output File
If no output file is created:
Check that the input directory contains
.annfilesVerify the output directory path is writable
Ensure
feature_setparameter is provided (it’s required)Check for error messages about missing reference files
Encoding Issues
If you see garbled characters:
Ensure all files use UTF-8 encoding
Check that your text editor/viewer supports UTF-8
See Also
IO Module - Detailed API documentation
I/O Examples - Conversion examples