estnltk.textcleaner module

class estnltk.textcleaner.TextCleaner(alphabet='abcdefghijklmnoprsšzžtuvwõäöüxyzABCDEFGHIJKLMNOPRSŠZŽTUVWÕÄÖÜXYZ0123456789 \t\n\r\x0b\x0c!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~–')

Class for comparing texts against a predefined alphabet and filtering out unwanted characters.

Methods

clean(text) Remove all unwanted characters from text.
compute_report(texts[, context_size]) Compute statistics of invalid characters on given texts.
find_invalid_chars(text[, context_size]) Find invalid characters in text and store information about the findings.
invalid_characters(text) Give a simple list of invalid characters present in text.
is_valid(text) Check if the text is valid and contains no unwanted characters.
report(texts[, n_examples, context_size, f]) Compute statistics of invalid characters and print them.

clean(text)

Remove all unwanted characters from text.
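
The filtering described for clean can be illustrated with a small stand-in that does not import estnltk. ALPHABET and clean_text are hypothetical names, and the alphabet below is a shortened assumption, not the class default.

```python
import string

# Hypothetical, shortened alphabet for illustration (not the estnltk default).
ALPHABET = set(string.ascii_letters + "šžõäöüŠŽÕÄÖÜ" + string.digits + " .,!?")


def clean_text(text, alphabet=ALPHABET):
    """Keep only the characters that belong to the alphabet."""
    return "".join(ch for ch in text if ch in alphabet)


# The smiley and the euro sign are outside the alphabet and are dropped;
# the spaces around them stay.
cleaned = clean_text("Tere, maailm! ☺ €5")
print(cleaned)
```

Removing characters leaves the surrounding whitespace in place, so a cleaned text may contain doubled spaces where an invalid character used to be.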

compute_report(texts, context_size=10)

Compute statistics of invalid characters on given texts.

Parameters:

texts: list of str

The texts to search for invalid characters.

context_size: int

How many characters to return as the context.

Returns:

dict of (char -> list of tuple (index, context))

A dictionary whose keys are the invalid characters. Each value is a list of tuples, one per occurrence, containing the character index and a context string.
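
To make the documented return shape concrete, here is a stand-in that builds a dictionary of the same form. It is a sketch and not estnltk's implementation: compute_report_sketch and ALPHABET are hypothetical names, the index is assumed to be the character offset within its own text, and the context is assumed to extend context_size characters on each side.

```python
import string
from collections import defaultdict

# Hypothetical, shortened alphabet for illustration (not the estnltk default).
ALPHABET = set(string.ascii_letters + string.digits + " .,!?")


def compute_report_sketch(texts, context_size=10, alphabet=ALPHABET):
    """Map each invalid character to a list of (index, context) tuples."""
    found = defaultdict(list)
    for text in texts:
        for idx, ch in enumerate(text):
            if ch not in alphabet:
                # Assumption: a symmetric window around the offending character.
                start = max(0, idx - context_size)
                context = text[start:idx + context_size + 1]
                found[ch].append((idx, context))
    return dict(found)


result = compute_report_sketch(["cost 5€ only", "a§b"], context_size=2)
print(result)
```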

find_invalid_chars(text, context_size=20)

Find invalid characters in text and store information about the findings.

Parameters:

text: str

The text to search for invalid characters.

context_size: int

How many characters to return as the context.

invalid_characters(text)

Give a simple list of invalid characters present in text.

is_valid(text)

Check if the text is valid and contains no unwanted characters.

Returns:

True if the text has no unwanted characters.
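
The relationship between invalid_characters and is_valid can be sketched as follows: a text is valid exactly when its list of invalid characters is empty. The function names and ALPHABET are hypothetical. The sketch also assumes the list holds each invalid character once, in order of first appearance; the documentation above does not say whether the real method deduplicates.

```python
import string

# Hypothetical, shortened alphabet for illustration (not the estnltk default).
ALPHABET = set(string.ascii_letters + string.digits + " .,!?")


def invalid_characters_sketch(text, alphabet=ALPHABET):
    """Return the distinct invalid characters in order of first appearance."""
    seen = []
    for ch in text:
        if ch not in alphabet and ch not in seen:
            seen.append(ch)
    return seen


def is_valid_sketch(text, alphabet=ALPHABET):
    """True when no character of the text falls outside the alphabet."""
    return not invalid_characters_sketch(text, alphabet)


print(invalid_characters_sketch("a§b€c§"))
print(is_valid_sketch("Hello, world!"))
```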

report(texts, n_examples=10, context_size=10, f=sys.stdout)

Compute statistics of invalid characters and print them.

Parameters:

texts: list of str

The texts to search for invalid characters.

n_examples: int

How many examples to display per invalid character.

context_size: int

How many characters to return as the context.

f: file

The file to print the report to (default is sys.stdout).
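
The role of the n_examples and f parameters can be shown with a stand-in that writes to any file-like object. This is a sketch under assumptions: report_sketch and ALPHABET are hypothetical names, and the output layout is invented for illustration, since the real report format is not specified above.

```python
import io
import string
import sys
from collections import defaultdict

# Hypothetical, shortened alphabet for illustration (not the estnltk default).
ALPHABET = set(string.ascii_letters + string.digits + " .,!?")


def report_sketch(texts, n_examples=10, context_size=10, f=sys.stdout,
                  alphabet=ALPHABET):
    """Print a count and up to n_examples contexts per invalid character."""
    found = defaultdict(list)
    for text in texts:
        for idx, ch in enumerate(text):
            if ch not in alphabet:
                start = max(0, idx - context_size)
                found[ch].append(text[start:idx + context_size + 1])
    for ch, contexts in sorted(found.items()):
        print(f"{ch!r}: {len(contexts)} occurrence(s)", file=f)
        # Only the first n_examples contexts are shown for each character.
        for context in contexts[:n_examples]:
            print(f"    {context!r}", file=f)


# Capture the report in memory instead of printing to the console.
buffer = io.StringIO()
report_sketch(["5€ and 6€", "7€"], n_examples=2, context_size=1, f=buffer)
output = buffer.getvalue()
print(output, end="")
```

Passing an in-memory buffer or an open file as f makes it possible to save or inspect the report rather than only printing it.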