estnltk.vabamorf.morf module

Morphological analysis and synthesis functionality of the estnltk vabamorf package.

Attributes

PACKAGE_PATH: str
The path where the vabamorf package is located.
DICT_PATH: str
The path of the default vabamorf dictionary embedded with vabamorf.
DEFAULT_ET_PATH: str
The path to the default morphoanalyzer lexicon that comes with the vabamorf library.
DEFAULT_ET3_PATH: str
The path to the default disambiguator lexicon that comes with the vabamorf library.
phonetic_markers: str
List of characters that make up phonetic markup.
compound_markers: str
List of characters that make up compound markup.
phonetic_regex: regex
Regular expression matching any phonetic marker.
compound_regex: regex
Regular expression matching any compound marker.
class estnltk.vabamorf.morf.Vabamorf(lex_path='/home/uku/anaconda3/lib/python3.5/site-packages/estnltk/vabamorf/dct/et.dct', disamb_lex_path='/home/uku/anaconda3/lib/python3.5/site-packages/estnltk/vabamorf/dct/et3.dct')[source]

Class for performing main tasks of morphological analysis.

Attributes

pid: int
Current process id.
morf: Vabamorf
Instance of the Vabamorf class. There should be only one instance per process, as vabamorf has global state that might get corrupted by using multiple instances in a single process.

Methods

analyze(words, **kwargs) Perform morphological analysis and disambiguation of given text.
disambiguate(words) Disambiguate previously analyzed words.
fix_spelling(words[, join, joinstring]) Simple function for quickly correcting misspelled words.
instance() Return a PyVabamorf instance.
spellcheck(words[, suggestions]) Spellcheck given sentence.
synthesize(lemma, form[, partofspeech, ...]) Synthesize a single word based on given morphological attributes.

analyze(words, **kwargs)[source]

Perform morphological analysis and disambiguation of given text.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

disambiguate: boolean (default: True)

Disambiguate the output and remove inconsistent analyses.

guess: boolean (default: True)

Use guessing for unknown words.

propername: boolean (default: True)

Perform additional analysis of proper names.

compound: boolean (default: True)

Add compound word markers to root forms.

phonetic: boolean (default: False)

Add phonetic information to root forms.

Returns:

list of (list of dict)

List of analyses for each word in the input.
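The handling of the words parameter described above can be illustrated with a minimal sketch. The helper name as_word_list is hypothetical and not part of estnltk; it only shows what "default behaviour of str.split()" means for a string input, and that a pretokenized list is taken as given.

```python
def as_word_list(words):
    """Return a list of tokens from a string or a pretokenized list.

    Hypothetical helper for illustration; not part of estnltk.
    """
    if isinstance(words, str):
        # str.split() without arguments splits on runs of whitespace
        # and ignores leading and trailing whitespace.
        return words.split()
    # A pretokenized list is kept as-is, so multi-word tokens survive.
    return list(words)


print(as_word_list("Tere  hommikust,\tmaailm"))
print(as_word_list(["New York", "on", "suur"]))
```

Note that punctuation stays attached to the neighbouring word when a plain string is passed, because only whitespace is used as the separator.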

disambiguate(words)[source]

Disambiguate previously analyzed words.

Parameters:

words: list of dict

A sentence of words.

Returns:

list of dict

Sentence of disambiguated words.

fix_spelling(words, join=True, joinstring=' ')[source]

Simple function for quickly correcting misspelled words.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

join: boolean (default: True)

Whether to join the list of words into a single string.

joinstring: str (default: ' ')

The string that will be used to join together the fixed words.

Returns:

str

If join is True.

list of str

If join is False.
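The effect of join and joinstring can be sketched without the library. The snippet below is an illustration only: the function name join_fixed is hypothetical, the input is a hand-written list in the shape documented for spellcheck() (keys 'word', 'spelling', 'suggestions'), and taking the first suggestion for a misspelled word is an assumption about the correction policy.

```python
def join_fixed(checked, join=True, joinstring=' '):
    """Build corrected output from spellcheck-style dictionaries.

    Hypothetical helper; the "first suggestion wins" policy is an assumption.
    """
    fixed = []
    for item in checked:
        if item['spelling'] or not item['suggestions']:
            # Correct word, or nothing better to offer: keep the original.
            fixed.append(item['word'])
        else:
            fixed.append(item['suggestions'][0])
    # join=True gives a single str, join=False gives a list of str.
    return joinstring.join(fixed) if join else fixed


# Hand-written sample data, not real library output.
checked = [
    {'word': 'tere', 'spelling': True, 'suggestions': []},
    {'word': 'hommikst', 'spelling': False, 'suggestions': ['hommikust']},
]
print(join_fixed(checked))
print(join_fixed(checked, join=False))
print(join_fixed(checked, joinstring='_'))
```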

static instance()[source]

Return a PyVabamorf instance.

Returns the previously initialized instance, or creates a new one if none exists. A new instance is also created if the process has been forked.
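The one-instance-per-process behaviour described above is commonly implemented by remembering the process id alongside the cached object. The following is a minimal sketch of that pattern, not the library's code: the class names Engine and PerProcess are hypothetical stand-ins.

```python
import os


class Engine:
    """Hypothetical stand-in for an object that wraps process-global state."""


class PerProcess:
    pid = None      # id of the process that created the cached engine
    engine = None   # the cached engine, if any

    @staticmethod
    def instance():
        current = os.getpid()
        # Create a fresh engine on first use, or when the stored pid no
        # longer matches (which is the case in a child after a fork).
        if PerProcess.engine is None or PerProcess.pid != current:
            PerProcess.pid = current
            PerProcess.engine = Engine()
        return PerProcess.engine


first = PerProcess.instance()
second = PerProcess.instance()
print(first is second)
```

Comparing the stored pid with os.getpid() on every call is cheap and avoids sharing an object created in the parent process with its children.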

spellcheck(words, suggestions=True)[source]

Spellcheck given sentence.

Note that the spellchecker does not respect pre-tokenized words and concatenates token sequences such as “New York”.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

suggestions: boolean (default: True)

Add spelling suggestions to the result.

Returns:

list of dict

Each dictionary contains the following values:

'word': the original word
'spelling': True if the word was spelled correctly
'suggestions': list of suggested strings in case of incorrect spelling

synthesize(lemma, form, partofspeech='', hint='', guess=True, phonetic=False)[source]

Synthesize a single word based on given morphological attributes.

Parameters:

lemma: str

The lemma of the word(s) to be synthesized.

form: str

The form of the word(s) to be synthesized.

partofspeech: str

Part-of-speech.

hint: str

Hint.

guess: boolean (default: True)

Use heuristics when synthesizing unknown words.

phonetic: boolean (default: False)

Add phonetic markup to synthesized words.

Returns:

list

List of synthesized words.

estnltk.vabamorf.morf.analyze(words, **kwargs)[source]

Perform morphological analysis and disambiguation of given text.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

disambiguate: boolean (default: True)

Disambiguate the output and remove inconsistent analyses.

guess: boolean (default: True)

Use guessing for unknown words.

propername: boolean (default: True)

Perform additional analysis of proper names.

compound: boolean (default: True)

Add compound word markers to root forms.

phonetic: boolean (default: False)

Add phonetic information to root forms.

Returns:

list of (list of dict)

List of analyses for each word in the input.

estnltk.vabamorf.morf.convert(word)[source]

Convert the given word to UTF-8 encoded bytes for the SWIG wrapper.

estnltk.vabamorf.morf.deconvert(word)[source]

Convert the output of the wrapper back to text. The result is unicode on Python 2 and str on Python 3.
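The round trip between text and UTF-8 bytes can be sketched as follows for Python 3. The function names to_wrapper_bytes and from_wrapper_bytes are hypothetical; the snippet shows only the encoding step and makes no claim about how convert() and deconvert() handle other input types.

```python
def to_wrapper_bytes(word):
    """Encode text as UTF-8 bytes; pass bytes through unchanged.

    Hypothetical helper for illustration (Python 3 only).
    """
    if isinstance(word, bytes):
        return word
    return word.encode('utf-8')


def from_wrapper_bytes(data):
    """Decode UTF-8 bytes back to str; pass str through unchanged."""
    if isinstance(data, bytes):
        return data.decode('utf-8')
    return data


encoded = to_wrapper_bytes('jäääär')
print(encoded)
print(from_wrapper_bytes(encoded))
```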

estnltk.vabamorf.morf.disambiguate(words)[source]

Disambiguate previously analyzed words.

Parameters:

words: list of dict

A sentence of words.

Returns:

list of dict

Sentence of disambiguated words.

estnltk.vabamorf.morf.fix_spelling(words, join=True, joinstring=' ')[source]

Simple function for quickly correcting misspelled words.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

join: boolean (default: True)

Whether to join the list of words into a single string.

joinstring: str (default: ' ')

The string that will be used to join together the fixed words.

Returns:

str

If join is True.

list of str

If join is False.

estnltk.vabamorf.morf.get_group_tokens(root)[source]

Extract tokens in hyphenated groups (e.g. saunameheks-tallimeheks).

Parameters:

root: str

The root form.

Returns:

list of (list of str)

List of grouped root tokens.
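The "list of (list of str)" return shape can be pictured with a short sketch. It assumes that a hyphen separates the groups and an underscore separates compound parts within a group; the function name group_tokens and the sample root are hypothetical, and the library function may additionally strip markup from each token.

```python
def group_tokens(root):
    """Split a root into hyphen-separated groups of compound parts.

    Hypothetical sketch; assumes '-' separates groups and '_' separates
    the parts of a compound inside one group.
    """
    return [group.split('_') for group in root.split('-')]


# Hypothetical root form for a hyphenated word pair.
print(group_tokens('sauna_mees-talli_mees'))
```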

estnltk.vabamorf.morf.get_root(root, phonetic, compound)[source]

Get the root form without markers.

Parameters:

root: str

The word root form.

phonetic: boolean

If True, add phonetic information to the root forms.

compound: boolean

If True, add compound word markers to root forms.
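One way to read the two flags is that each kind of markup is stripped unless its flag is True. The sketch below follows that reading, which is an assumption. The marker characters and the name root_form are also hypothetical; the module's own phonetic_markers and compound_markers attributes define the real sets.

```python
import re

# Hypothetical marker sets, chosen for illustration only.
PHONETIC = re.compile(r'[~?\]<]')
COMPOUND = re.compile(r'[_+=]')


def root_form(root, phonetic, compound):
    """Strip each kind of markup unless the matching flag is True."""
    if not phonetic:
        root = PHONETIC.sub('', root)
    if not compound:
        root = COMPOUND.sub('', root)
    return root


marked = 's<auna_m<ees'   # hand-written sample root with both kinds of markup
print(root_form(marked, phonetic=False, compound=False))
print(root_form(marked, phonetic=False, compound=True))
print(root_form(marked, phonetic=True, compound=False))
```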

estnltk.vabamorf.morf.postprocess_result(morphresult, trim_phonetic, trim_compound)[source]

Postprocess vabamorf wrapper output.

estnltk.vabamorf.morf.regex_from_markers(markers)[source]

Given a string of characters, construct a regex that matches them.

Parameters:

markers: str

The string containing the marker characters.

Returns:

regex

The regular expression matching the given markers.
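Building a regex from a string of marker characters needs care because some markers (such as + or ?) are regex metacharacters. A minimal sketch of one way to do it, using re.escape on each character and joining with alternation, is shown below. The name markers_to_regex and the sample markers are hypothetical, and the library may construct its pattern differently.

```python
import re


def markers_to_regex(markers):
    """Compile a pattern that matches any single character in `markers`.

    Hypothetical sketch; each character is escaped so that regex
    metacharacters are matched literally.
    """
    return re.compile('|'.join(re.escape(ch) for ch in markers))


pattern = markers_to_regex('_+=')
# Removing every match trims the markup from a string.
print(pattern.sub('', 'all_maa+raud=tee'))
```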

estnltk.vabamorf.morf.spellcheck(words, suggestions=True)[source]

Spellcheck given sentence.

Note that the spellchecker does not respect pre-tokenized words and concatenates token sequences such as “New York”.

Parameters:

words: list of str or str

Either a list of pretokenized words or a string. A string is split using the default behaviour of str.split().

suggestions: boolean (default: True)

Add spelling suggestions to the result.

Returns:

list of dict

Each dictionary contains the following values:

'word': the original word
'spelling': True if the word was spelled correctly
'suggestions': list of suggested strings in case of incorrect spelling

estnltk.vabamorf.morf.synthesize(lemma, form, partofspeech='', hint='', guess=True, phonetic=False)[source]

Synthesize a single word based on given morphological attributes.

Parameters:

lemma: str

The lemma of the word(s) to be synthesized.

form: str

The form of the word(s) to be synthesized.

partofspeech: str

Part-of-speech.

hint: str

Hint.

guess: boolean (default: True)

Use heuristics when synthesizing unknown words.

phonetic: boolean (default: False)

Add phonetic markup to synthesized words.

Returns:

list

List of synthesized words.

estnltk.vabamorf.morf.trim_compounds(root)[source]

Function that trims compound markup from the root.

Parameters:

root: str

The string from which to remove the compound markup.

Returns:

str

The string with compound markup removed.

estnltk.vabamorf.morf.trim_phonetics(root)[source]

Function that trims phonetic markup from the root.

Parameters:

root: str

The string from which to remove the phonetic markup.

Returns:

str

The string with phonetic markup removed.
