estnltk.teicorpus module

Module for reading koondcorpus files. See http://www.cl.ut.ee/korpused/segakorpus/index.php?lang=et

The corpus contains a variety of documents from different domains and can be used freely for non-commercial purposes. Estnltk can read the XML-formatted files of the corpus and parse the documents, paragraphs, sentences and words, along with some additional metadata found in the XML files.

The implementation is currently quite simplistic, but it should be sufficient for simpler use cases. In the resulting documents, paragraphs are separated by two newlines and sentences by a single newline. The original plain text is not available for the TEI XML files. Note that all punctuation has been separated from words in the TEI files.
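The layout described above can be illustrated with a small standalone sketch. It does not use estnltk; `join_document` is a hypothetical helper, and joining tokens with single spaces is an assumption based on the note that punctuation is already separated from words.

```python
def join_document(paragraphs):
    """Build document text from nested lists.

    paragraphs: list of paragraphs, each a list of sentences,
    each sentence a list of tokens (hypothetical input shape).
    """
    return "\n\n".join(                      # two newlines between paragraphs
        "\n".join(                           # one newline between sentences
            " ".join(tokens)                 # assumption: tokens joined by spaces
            for tokens in sentences
        )
        for sentences in paragraphs
    )


paragraphs = [
    [["Tere", ",", "maailm", "!"], ["See", "on", "lause", "."]],
    [["Uus", "osa", "algab", "siit", "."]],
]
text = join_document(paragraphs)
print(text)
```

Because punctuation stays detached ("Tere , maailm !"), the text produced this way will generally differ from the original running text.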

TODO: Make this module faster; it is very slow, probably due to BeautifulSoup4.

estnltk.teicorpus.parse_tei_corpus(path, target=['artikkel'])

Parse documents from a TEI style XML file.

Parameters:

path: str

The path of the XML file.

target: list of str

List of <div> types that are considered documents in the XML file (default: ['artikkel']).

Returns:

list of estnltk.text.Text
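To make the role of the target parameter concrete, here is a minimal sketch of selecting <div> elements by their type attribute. It is not the module's implementation: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup4, and the sample markup (plain <div>, <p> and <s> elements) and the `select_documents` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample; real corpus files are larger and may
# use different element names (assumption for illustration only).
SAMPLE = """<text>
  <div type="artikkel"><p><s>Esimene lause .</s><s>Teine lause .</s></p></div>
  <div type="reklaam"><p><s>Osta kohe !</s></p></div>
</text>"""


def select_documents(xml_string, target=("artikkel",)):
    """Return the text of every <div> whose type is listed in target."""
    root = ET.fromstring(xml_string)
    docs = []
    for div in root.iter("div"):
        if div.get("type") not in target:
            continue  # this <div> type is not treated as a document
        paragraphs = [
            "\n".join((s.text or "").strip() for s in p.iter("s"))
            for p in div.iter("p")
        ]
        docs.append("\n\n".join(paragraphs))
    return docs


docs = select_documents(SAMPLE)
both = select_documents(SAMPLE, target=("artikkel", "reklaam"))
print(docs)
```

With the default target only the "artikkel" division is returned; widening target also picks up the second division.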

estnltk.teicorpus.parse_tei_corpora(root, prefix='', suffix='.xml', target=['artikkel'])

Parse documents from TEI style XML files.

Gives each document a FILE attribute that denotes the original filename.

Parameters:

root: str

The path of the directory containing the TEI corpus XML files.

prefix: str

The prefix of filenames to include (default: '').

suffix: str

The suffix of filenames to include (default: '.xml').

target: list of str

List of <div> types that are considered documents in the XML files (default: ['artikkel']).

Returns:

list of estnltk.text.Text

Corpus containing the parsed documents from all files. The file path is stored in the FILE attribute of each document.
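The prefix and suffix parameters act as a filename filter. A rough standalone sketch of such a filter follows; `select_files` is a hypothetical helper, and it only looks at the top level of the directory, since whether the library descends into subdirectories is not stated here.

```python
import os
import tempfile


def select_files(root, prefix="", suffix=".xml"):
    """Return sorted paths in root whose names match prefix and suffix."""
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)  # top level only (assumption)
        if name.startswith(prefix) and name.endswith(suffix)
    )


# Demonstrate on a throwaway directory with made-up filenames.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("aja_a.xml", "aja_b.xml", "ilu_c.xml", "aja_notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    selected = [os.path.basename(p) for p in select_files(tmp, prefix="aja_")]
    everything = [os.path.basename(p) for p in select_files(tmp)]

print(selected)
print(everything)
```

With the defaults every ".xml" file in the directory matches; a non-empty prefix narrows the selection to one group of files.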