HTML Prettyprinter

Visualizing information is one of the most crucial steps in text processing software and arises in many uses cases. Estnltk comes with HTML PrettyPrinter that can help building Web applications and custom tools that deal with text processing.

PrettyPrinter is capable of very different types of visualization. From visualization of simple given word to multiple and overlapping word types and even parts of whole sentences. Here is a list of properties that can be modified with the help of PrettyPrinter and the matching name of the value that the module is expecting:

  • Change font color - ‘color’
  • Change background color - ‘background’
  • Change font style - ‘font’
  • Change font weight - ‘weight’
  • Change font style - ‘italics’
  • Add underline - ‘underline’
  • Change font size - ‘size’
  • Change letter spacing - ‘tracking’

Example #1 Formating specific word in all of text with different visual format.:

from IPython.display import HTML
from estnltk import Text
from estnltk import PrettyPrinter

text = Text('This must be formatted here and here')
text.tag_with_regex('annotations', 'here')

pp = PrettyPrinter(background='annotations')

The result of this short program will be:

HTML(pp.render(text, True))
PrettyPrinter This must be formatted here and here

Class Text is what does all the analysis. If we are looking to mark a specific word as in this case is the word ‘here’ then we must bind the annotation to the word ‘here’ with the help of a function of Text called tags_with_regex('annotations', 'here') that tags the value of ‘annotations’ to the word ‘here’. This will later be used to find the exact index where to start and end the selected formating.

When we create a new class PrettyPrinter variable by pp = PrettyPrinter(background='annotations'), we add arguments describing what property will be added to which tag, in our case, everything that is tagged as ‘annotations’ will get a different background color. The rgb(102, 204, 255) is a stock value that is added as background color if no other color is specified during initiation of the PrettyPrinter class object.

Keep in mind that if we activate PrettyPrinter function with the argument False instead of True, then the result will not be the full HTML text, but only the formatted text inside the HTML body paragraph.

Example #2 Formating the same property with different visual format depending on the specific word:

text = Text('Nimisõnad värvitakse').tag_analysis()
rules =[
            ('Nimisõnad', 'green'),
            ('värvitakse', 'blue')
        ]
pp = PrettyPrinter(background='words', background_value=rules)
html = pp.render(text, True)

The result of this program will be:

HTML(html)
PrettyPrinter Nimisõnad värvitakse

This time we gave the PrettyPrinter class object two arguments: background='words', background_value=rules. The background value ‘words’ means that we will not be adding any specific tags as in the previous case, but instead use the original tag that is used in case of every word. PrettyPrinter will check itself what words match the rules specified in the list ‘rules’. Now the second argument background_value=rules shows PrettyPrinter what values will be given to what tag values. Basically what our ‘rules’ say to the PrettyPrinter is that each word ‘Nimisõnad’ will be given a green background color and the word ‘värvitakse’ will be given a blue background color. Because different words can have different visual properties of the same type(eg. background color, font color, font size etc.) the css marks are numbered based on the number of overlapping values.

Example #3 Using word type tags as rule parameters:

# TODO: currently doesn't work on POS-tags

text = Text('Suured kollased kõrvad ja').tag_analysis()
rules =[
            ('A', 'blue'),
            ('S', 'green')
        ]
pp = PrettyPrinter(background='words', background_value=rules)
html = pp.render(text, True)

This time the defining parameters are ‘A’ and ‘S’ which stand for different word types. The list of different tags can be found below:

  • A - adjective
  • C - comparative adjective
  • D - adverb
  • G - non declinable adjective
  • H - real name
  • I - interjection
  • J - conjunction
  • K - co-expression
  • N - cardinal numeral
  • O - ordinal numeral
  • P - pronoun
  • S - noun
  • U - superlative adjective
  • V - verb
  • X - other type of word belonging together with a verb
  • Y - abbreviation
  • Z - sign

PrettyPrinter will sort everything else out by itself. The result of this will be:

HTML(html)
PrettyPrinter Suured kollased kõrvad ja

As we can see from the results, all adjectives have been marked with a css background mark tag for color blue and the noun in the sentence has been marked with a css background mark tag for color green. In this way it is possible to visually separate all words that are of a specific type simply and effectively.

Example #4 Using different category visual representation dor different parts of text:

# TODO: not working correctly

text = Text('Esimene ja teine märgend')
text.tag_with_regex('A', 'Esimene ja')
text.tag_with_regex('B', 'ja teine')

pp = PrettyPrinter(color='A', background='B')
html = pp.render(text, False)

This time we want to highlight two different word types with different properties, font color and background color. To do this, we have to add both layers as PrettyPrinter class parameters and tie those to a certain value. With text.tag_with_regex('A', 'Esimene ja') we bind the formating option in PerttyPrinter parameters color='A' applies to ‘Esimene ja’ part of the text. What happens is that we will have two different css formats, each changing different things. Here we can also see that the formatting works with overlapping layers, because the word ‘ja’ is in both ‘A’ and ‘B’. The output with ‘False’ as the second parameter in render, will be the following:

HTML(html)
Esimene ja teine märgend

Here we can see, that the word ‘ja’ has two class tags, ‘background’ and ‘color’.

Generating just the css

It is possible, to use PrettyPrinter to generate just the css formatting without the HTML or the actual word content. In this case we just supply the PrettyPrinter class object with the necessary parameters and additional rules(if needed) and the class will generate the required css mark tags.

Example #5 generating one layer css:

pp = PrettyPrinter(color='layer')
css_format = pp.css

This is the simplest form and the result will be:

print(css_format)
mark {
        background:none;
}
mark.color {
        color: rgb(0, 0, 102);
}

Example #6 generating css with user defined color value:

pp = PrettyPrinter(color='layer', color_value='color_you_have_never_seen')
css_format = pp.css

Similar to last one, the result will be simple color marking, but with the user define value:

print(css_format)
mark {
        background:none;
}
mark.color {
        color: color_you_have_never_seen;
}

Example #7 generating css with rules:

rules = [
    ('Nimisõnad', 'green'),
    ('värvitakse', 'blue')
]
pp = PrettyPrinter(color='layer', color_value=rules)
css_format = pp.css

This simple program generates two mark color classes that define two sets of font color. :

print(css_format)
mark {
        background:none;
}
mark.color_0 {
        color: green;
}
mark.color_1 {
        color: blue;
}

Example #8 generating full css without rules :

AESTHETICS = {
    'color': 'layer1',
    'background': 'layer2',
    'font': 'layer3',
    'weight': 'layer4',
    'italics': 'layer5',
    'underline': 'layer6',
    'size': 'layer7',
    'tracking': 'layer8'
}
pp = PrettyPrinter(**AESTHETICS)
css_format = pp.css

This program returns the css default formatting for all the properties in AESTHETICS. :

print(css_format)
mark {
        background:none;
}
mark.tracking {
        letter-spacing: 0.03em;
}
mark.weight {
        font-weight: bold;
}
mark.background {
        background-color: rgb(102, 204, 255);
}
mark.underline {
        text-decoration: underline;
}
mark.font {
        font-family: sans-serif;
}
mark.italics {
        font-style: italic;
}
mark.size {
        font-size: 120%;
}
mark.color {
        color: rgb(0, 0, 102);
}