estnltk.text module

The Text module contains the central functionality of Estnltk. It sets up the standard pipeline for tokenization and tagging and hooks it up with the Text class.

class estnltk.text.Text(text_or_instance, **kwargs)[source]

Central class of Estnltk and the main interface for performing all NLP operations.
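
Text is a dict subclass: the raw text and every tagged layer are stored as dictionary entries, which is why inherited dict methods such as keys(), items() and update() appear alongside the NLP methods. A minimal sketch of this design in plain Python (illustrative, not Estnltk source; layer names and record shapes are assumptions based on the documented API):

```python
class MiniText(dict):
    def __init__(self, raw):
        super().__init__()
        self["text"] = raw

    def tokenize_words(self):
        """Create a 'words' layer of {'start', 'end', 'text'} records."""
        words, pos = [], 0
        for token in self["text"].split():
            start = self["text"].index(token, pos)
            end = start + len(token)
            words.append({"start": start, "end": end, "text": token})
            pos = end
        self["words"] = words
        return self

    def is_tagged(self, layer):
        # A layer counts as "tagged" once it exists as a dict entry.
        return layer in self

t = MiniText("Alles see oli").tokenize_words()
assert t.is_tagged("words") and not t.is_tagged("sentences")
assert [w["text"] for w in t["words"]] == ["Alles", "see", "oli"]
```

Real Text instances build their layers with pluggable tokenizers and taggers; the sketch only shows the dict-backed storage convention.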

Attributes

get

Methods

capitalize()
clean() Return a copy of this Text instance with invalid characters removed.
clear() Remove all items from the underlying dict.
copy() Return a shallow copy of the underlying dict.
count(sub, *args)
divide([layer, by]) Divide the Text into pieces by keeping references to original elements, when possible.
ends(layer) Retrieve end positions of elements of given layer.
endswith(suffix, *args)
find(sub, *args)
fix_spelling() Fix the spelling of the text.
fromkeys(iterable[, value]) Return a new dict with keys from iterable and values set to value.
get_analysis_element(element[, sep]) The list of analysis elements of words layer.
get_elements_in_span(element, span)
get_kwargs() Get the keyword arguments that were passed to the Text when it was constructed.
index(sub, *args)
is_multi(layer)
is_simple(layer)
is_tagged(layer) Is the given layer tokenized/tagged?
is_valid() Does this text contain only allowed/valid characters, as defined by the TextCleaner instance supplied to this Text?
isalnum()
isalpha()
isdigit()
islower()
isspace()
istitle()
isupper()
items() Return a view of the dict's (key, value) pairs.
keys() Return a view of the dict's keys.
lower()
lstrip(*args)
pop(k[, d]) Remove key k and return its value; if k is not found, return d if given, otherwise raise KeyError.
popitem() Remove and return a (key, value) pair; raise KeyError if the dict is empty.
replace(old, new, *args)
rfind(sub, *args)
rindex(sub, *args)
rstrip(*args)
setdefault(k[, d]) Return the value of k if present, otherwise insert k with value d and return d.
spans(layer) Retrieve (start, end) tuples denoting the spans of given layer elements.
split_by(layer[, sep]) Split the text into multiple instances defined by elements of given layer.
split_by_regex(regex_or_pattern[, flags, gaps]) Split the text into multiple instances using a regex.
split_by_sentences() Split the text into individual sentences.
split_by_words() Split the text into individual words.
split_given_spans(spans[, sep]) Split the text into several pieces.
starts(layer) Retrieve start positions of elements of given layer.
startswith(prefix, *args)
strip(*args)
tag(layer) Tag the annotations of given layer.
tag_all() Tag all layers.
tag_analysis() Tag words layer with morphological analysis attributes.
tag_clause_annotations() Tag clause annotations in words layer.
tag_clauses() Create clauses multilayer.
tag_labels() Tag named entity labels in the words layer.
tag_named_entities() Tag named_entities layer.
tag_syntax() Tag syntax attribute in the words layer.
tag_timexes() Create timexes layer.
tag_verb_chains() Create verb_chains layer.
tag_with_regex(name, pattern[, flags])
tag_wordnet(**kwargs) Create wordnet attribute in words layer.
texts(layer[, sep]) Retrieve texts for given layer.
texts_from_spans(spans[, sep]) Retrieve texts from a list of (start, end) position spans.
tokenize_paragraphs() Apply paragraph tokenization to this Text instance.
tokenize_sentences() Apply sentence tokenization to this Text instance.
tokenize_words() Apply word tokenization and create words layer.
update([E, ]**F) Update the dict from mapping/iterable E and keyword arguments F.
values() Return a view of the dict's values.
analysis

The list of analyses of words layer elements.

clause_annotations

The list of clause annotations in words layer.

clause_indices

The list of clause indices in words layer. The indices are unique only in the boundary of a single sentence.

clause_texts

The texts of clauses multilayer elements. Non-consecutive spans are concatenated with a space character by default. Use the texts() method to supply custom separators.

clauses

The elements of clauses multilayer.

clean()[source]

Return a copy of this Text instance with invalid characters removed.

descriptions

Human readable word descriptions.

divide(layer='words', by='sentences')[source]

Divide the Text into pieces, keeping references to the original elements where possible. Keeping references is not possible only when the element is a multispan.

Parameters:

layer: str

The name of the layer whose elements are collected and distributed into the resulting bins.

by: str

Each resulting bin is defined by spans of this element.

Returns:

list of (list of dict)
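
As a sketch of the divide semantics, assume layer elements are dicts with start and end positions (the data below is illustrative): each element is placed into the bin whose span covers it, and the bins hold references to the original dicts rather than copies.

```python
# Illustrative data shapes: Estnltk layer elements carry
# 'start'/'end' character positions.
words = [
    {"start": 0, "end": 4, "text": "Tere"},
    {"start": 4, "end": 5, "text": "."},
    {"start": 6, "end": 11, "text": "Head."},
]
sentence_spans = [(0, 5), (6, 11)]

def divide(elements, bins):
    """Place each element into the bin whose span covers it,
    keeping references to the original element dicts."""
    result = [[] for _ in bins]
    for el in elements:
        for i, (start, end) in enumerate(bins):
            if start <= el["start"] and el["end"] <= end:
                result[i].append(el)  # a reference, not a copy
                break
    return result

bins = divide(words, sentence_spans)
assert [w["text"] for w in bins[0]] == ["Tere", "."]
assert bins[1][0] is words[2]  # same object: a reference was kept
```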

endings

The list of word endings.

Ambiguous cases are separated with a pipe character by default. Use get_analysis_element() to specify a custom separator for ambiguous entries.

ends(layer)[source]

Retrieve end positions of elements of given layer.

fix_spelling()[source]

Fix the spelling of the text.

Note that this method uses the first suggestion given for each misspelled word. It does not perform any sophisticated analysis to determine which of the suggestions fits best into the context.

Returns:

Text

A copy of this instance with automatically fixed spelling.

forms

The list of word forms.

Ambiguous cases are separated with a pipe character by default. Use get_analysis_element() to specify a custom separator for ambiguous entries.

get_analysis_element(element, sep='|')[source]

The list of analysis elements of words layer.

Parameters:

element: str

The name of the element, for example “lemma”, “postag”.

sep: str

The separator for ambiguous analyses (default: “|”). As morphological analysis cannot always yield unambiguous results, ambiguous values are returned separated by the pipe character by default.
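
A sketch of how ambiguous analyses collapse into a single separated string, under the assumption that each word carries a list of candidate analysis dicts (data illustrative, not Estnltk source):

```python
def get_analysis_element(words, element, sep="|"):
    """Extract one value per word; duplicate candidate values are
    dropped, remaining ambiguous values are joined with sep."""
    result = []
    for word in words:
        values = []
        for analysis in word["analysis"]:
            value = analysis[element]
            if value not in values:
                values.append(value)
        result.append(sep.join(values))
    return result

words = [
    {"analysis": [{"lemma": "olema"}]},
    {"analysis": [{"lemma": "sai"}, {"lemma": "saama"}]},  # ambiguous
]
assert get_analysis_element(words, "lemma") == ["olema", "sai|saama"]
```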

get_kwargs()[source]

Get the keyword arguments that were passed to the Text when it was constructed.

invalid_characters

List of invalid characters found in this text.

is_tagged(layer)[source]

Is the given layer tokenized/tagged?

is_valid()[source]

Does this text contain only allowed/valid characters, as defined by the TextCleaner instance supplied to this Text?

labels

Named entity labels.

layer_tagger_mapping

Dictionary that maps layer names to taggers that can create that layer.

lemma_lists

Lemma lists.

Ambiguous cases are separate list elements.

lemmas

The list of lemmas.

Ambiguous cases are separated with a pipe character by default. Use get_analysis_element() to specify a custom separator for ambiguous entries.

named_entities

The elements of named_entities layer.

named_entity_labels

The named entity labels without BIO prefixes.

named_entity_spans

The spans of named entities.

named_entity_texts

The texts representing named entities.

paragraph_ends

The end positions of paragraphs layer elements.

paragraph_spans

The list of spans representing paragraphs layer elements.

paragraph_starts

The start positions of paragraphs layer elements.

paragraph_texts

The list of texts representing paragraphs layer elements.

paragraphs

Return the list of paragraphs layer elements.

postag_descriptions

Human-readable POS-tag descriptions.

postags

The list of word part-of-speech tags.

Ambiguous cases are separated with a pipe character by default. Use get_analysis_element() to specify a custom separator for ambiguous entries.

root_tokens

Root tokens of word roots.

roots

The list of word roots.

Ambiguous cases are separated with a pipe character by default. Use get_analysis_element() to specify a custom separator for ambiguous entries.

sentence_ends

The list of end positions representing sentences layer elements.

sentence_spans

The list of spans representing sentences layer elements.

sentence_starts

The list of start positions representing sentences layer elements.

sentence_texts

The list of texts representing sentences layer elements.

sentences

The list of sentences layer elements.

spans(layer)[source]

Retrieve (start, end) tuples denoting the spans of given layer elements.

Returns:

list of (int, int)

List of (start, end) tuples.

spellcheck_results

The list of True/False values denoting the correct spelling of words.

spelling

Flag incorrectly spelled words. Returns a list of booleans, where the element at each position denotes whether the word at the same position is spelled correctly.

spelling_suggestions

The list of spelling suggestions per misspelled word.

split_by(layer, sep=' ')[source]

Split the text into multiple instances defined by elements of given layer.

The spans of the layer elements are extracted and fed to the split_given_spans() method.

Parameters:

layer: str

String determining the layer that is used to define the start and end positions of resulting splits.

sep: str (default: ‘ ‘)

The separator to use to join texts of multilayer elements.

Returns:

list of Text

split_by_regex(regex_or_pattern, flags=32, gaps=True)[source]

Split the text into multiple instances using a regex.

Parameters:

regex_or_pattern: str or compiled pattern

The regular expression to use for splitting.

flags: int (default: re.U)

The regular expression flags (only used when the user has not supplied a compiled pattern).

gaps: boolean (default: True)

If True, the regions matched by the regex are not included in the resulting Text instances, which is the usual splitting behaviour. If False, only the regions matched by the regex are included in the result.

Returns:

list of Text

The Text instances obtained by splitting.
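
The default flags=32 is re.U (re.UNICODE has the value 32), and the gaps parameter mirrors the difference between re.split and re.findall in the standard library:

```python
import re

text = "üks, kaks, kolm"

# gaps=True (default): matched regions act as separators and are
# dropped; the text between them is kept, like re.split.
assert re.split(r",\s*", text) == ["üks", "kaks", "kolm"]

# gaps=False: only the regions matched by the regex are kept,
# like re.findall.
assert re.findall(r",\s*", text) == [", ", ", "]
```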

split_by_sentences()[source]

Split the text into individual sentences.

split_by_words()[source]

Split the text into individual words.

split_given_spans(spans, sep=' ')[source]

Split the text into several pieces.

Resulting texts have all the layers that are present in the text instance being split. The elements are copied to the resulting pieces whose spans cover them. However, this can result in empty layers if no element of a split layer fits into the span of a particular output piece.

The positions of copied layer elements are translated relative to the containing span, so they are consistent with the returned text lengths.

Parameters:

spans: list of spans.

The positions determining the regions that will end up as individual pieces. Spans themselves can be lists of spans, which denote multilayer-style text regions.

sep: str

The separator that is used to join together text pieces of multilayer spans.

Returns:

list of Text

One instance of text per span.
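
A sketch of the position translation described above, for simple (start, end) spans only (the real method also accepts multilayer spans whose texts are joined with sep); data shapes are illustrative:

```python
def split_given_spans(raw, word_layer, spans):
    """For each (start, end) span, copy the layer elements that fit
    inside it and shift their positions relative to the span start."""
    pieces = []
    for start, end in spans:
        elements = [
            {"start": el["start"] - start,
             "end": el["end"] - start,
             "text": el["text"]}
            for el in word_layer
            if start <= el["start"] and el["end"] <= end
        ]
        pieces.append({"text": raw[start:end], "words": elements})
    return pieces

raw = "Tere! Head aega!"
words = [{"start": 0, "end": 4, "text": "Tere"},
         {"start": 6, "end": 10, "text": "Head"},
         {"start": 11, "end": 15, "text": "aega"}]
pieces = split_given_spans(raw, words, [(0, 5), (6, 16)])
assert pieces[0]["text"] == "Tere!"
# Positions are translated: "Head" starts at 6 in raw, 0 in its piece.
assert pieces[1]["words"][0] == {"start": 0, "end": 4, "text": "Head"}
```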

starts(layer)[source]

Retrieve start positions of elements of given layer.

synsets

The list of annotated synsets of words layer.

syntax_lists

Return syntax annotation variants for every word.

tag(layer)[source]

Tag the annotations of given layer. It can automatically tag any built-in layer type.

tag_all()[source]

Tag all layers.

tag_analysis()[source]

Tag words layer with morphological analysis attributes.

tag_clause_annotations()[source]

Tag clause annotations in words layer. Depends on morphological analysis.

tag_clauses()[source]

Create clauses multilayer.

Depends on clause annotations.

tag_labels()[source]

Tag named entity labels in the words layer.

tag_named_entities()[source]

Tag named_entities layer.

This automatically performs morphological analysis along with all dependencies.

tag_syntax()[source]

Tag syntax attribute in the words layer.

tag_timexes()[source]

Create timexes layer. Depends on morphological analysis data in the words layer and tags it automatically, if it is not present.

tag_verb_chains()[source]

Create verb_chains layer. Depends on clauses layer.

tag_wordnet(**kwargs)[source]

Create wordnet attribute in words layer.

See tag_text() method for applicable keyword arguments.

text

The raw underlying text that was used to initialize the Text instance.

texts(layer, sep=' ')[source]

Retrieve texts for given layer.

Parameters:

sep: str

Separator for multilayer elements (default: ‘ ‘).

Returns:

list of str

List of strings that make up given layer.

texts_from_spans(spans, sep=' ')[source]

Retrieve texts from a list of (start, end) position spans.

Parameters:

sep: str

Separator for multilayer elements (default: ‘ ‘).

Returns:

list of str

List of strings that correspond to given spans.
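
A sketch of the span-to-text mapping, assuming simple spans are (start, end) tuples and multilayer spans are lists of such tuples joined with sep (illustrative, not Estnltk source):

```python
def texts_from_spans(raw, spans, sep=" "):
    """Map spans to text snippets; a multilayer span (a list of
    (start, end) tuples) yields its snippets joined with sep."""
    texts = []
    for span in spans:
        if isinstance(span, tuple):
            texts.append(raw[span[0]:span[1]])
        else:
            texts.append(sep.join(raw[s:e] for s, e in span))
    return texts

raw = "A B C D"
assert texts_from_spans(raw, [(0, 1), [(2, 3), (6, 7)]]) == ["A", "B D"]
```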

timex_ends

The list of end positions of timexes layer elements.

timex_ids

The list of timex id-s of timexes layer elements.

timex_spans

The list of spans of timexes layer elements.

timex_starts

The list of start positions of timexes layer elements.

timex_texts

The list of texts representing timexes layer elements.

timex_types

The list of timex types of timexes layer elements.

timex_values

The list of timex values of timexes layer elements.

timexes

The list of elements in timexes layer.

tokenize_paragraphs()[source]

Apply paragraph tokenization to this Text instance. Creates paragraphs layer.

tokenize_sentences()[source]

Apply sentence tokenization to this Text instance. Creates sentences layer. Automatically tokenizes paragraphs, if they are not already tokenized. Also, if word tokenization has already been performed, tries to fit the sentence tokenization to the existing word tokenization.

tokenize_words()[source]

Apply word tokenization and create words layer.

Automatically creates paragraphs and sentences layers.

verb_chain_clause_indices

The clause indices of verb_chains elements.

verb_chain_ends

The end positions of verb_chains elements.

verb_chain_moods

The mood attributes of verb_chains elements.

verb_chain_morphs

The morph attributes of verb_chains elements.

verb_chain_other_verbs

The other verb attributes of verb_chains elements.

verb_chain_patterns

The patterns of verb_chains elements.

verb_chain_polarities

The polarities of verb_chains elements.

verb_chain_roots

The chain roots of verb_chains elements.

verb_chain_starts

The start positions of verb_chains elements.

verb_chain_tenses

The tense attributes of verb_chains elements.

verb_chain_texts

The list of texts of verb_chains layer elements.

verb_chain_voices

The voice attributes of verb_chains elements.

verb_chains

The list of elements of verb_chains layer.

word_ends

The list of end positions representing words layer elements.

word_literals

The list of literals per word in words layer.

word_spans

The list of spans representing words layer elements.

word_starts

The list of start positions representing words layer elements.

word_texts

The list of words representing words layer elements.

wordnet_annotations

The list of wordnet annotations of words layer.

words

The list of word elements in words layer.

class estnltk.text.ZipBuilder(text)[source]

Helper class to aggregate various Text properties in a simple way. Uses the builder pattern.

Example:

text = Text('Alles see oli, kui käisin koolis')
text.get.word_texts.lemmas.postags.as_dataframe

text.get - initiates a new ZipBuilder instance on the Text object.

.word_texts - adds word texts
.lemmas - adds lemmas
.postags - adds postags

.as_dataframe - builds the final object and returns a dataframe
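
A minimal sketch of the builder pattern described above (illustrative, not Estnltk source): attribute access accumulates property names, and a terminal accessor such as as_dict or as_list assembles the collected values. The as_dataframe accessor additionally needs pandas, so it is omitted here.

```python
class MiniZipBuilder:
    def __init__(self, data):
        self._data = data    # maps property name -> list of values
        self._props = []     # property names collected so far

    def __getattr__(self, name):
        # Terminal accessors build and return the aggregate.
        if name == "as_dict":
            return {p: self._data[p] for p in self._props}
        if name == "as_list":
            return [self._data[p] for p in self._props]
        if name == "as_zip":
            return zip(*(self._data[p] for p in self._props))
        # Any other attribute name accumulates a property to collect.
        self._props.append(name)
        return self

data = {"word_texts": ["koolis"], "lemmas": ["kool"], "postags": ["S"]}
built = MiniZipBuilder(data).word_texts.lemmas.as_dict
assert built == {"word_texts": ["koolis"], "lemmas": ["kool"]}
```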

Attributes

as_dataframe
as_dict
as_list
as_zip

Methods

__call__(props)