hOCR


hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). The format encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.

Software

OCR software such as Tesseract and OCRopus can output recognition results as hOCR files.

Example

The following example is an extract of an hOCR file:

...

Die Darlehenssumme ist in ihrem ursprünglichen Umfange zu ver-

...

The recognized text is stored in ordinary text nodes of the HTML file.
The division into separate lines and words is given by the
surrounding span elements. Standard HTML elements are also used,
for example the p element for a paragraph. Additional information is
carried in attributes, such as:

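  • layout element types such as "ocr_par", "ocr_line", "ocrx_word", given in the class attribute
  • geometric information for each element as a bounding box "bbox", given in the title attribute
  • language information "lang"
  • recognition confidence values such as "x_wconf", also given in the title attribute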
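
The attributes listed above can be read back with an ordinary HTML parser. The following is a minimal sketch in Python using only the standard library, not a reference implementation. The words in SAMPLE come from the extract above, but every bbox, x_wconf and lang value is invented for illustration, and the names parse_title, WordCollector and extract_words are hypothetical helpers.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment: the words come from the extract above, but all
# bbox, x_wconf and lang values are invented for illustration only.
SAMPLE = """
<p class="ocr_par" lang="deu" title="bbox 100 200 900 240">
 <span class="ocr_line" title="bbox 100 200 900 240">
  <span class="ocrx_word" title="bbox 100 205 160 235; x_wconf 93">Die</span>
  <span class="ocrx_word" title="bbox 175 203 520 238; x_wconf 89">Darlehenssumme</span>
  <span class="ocrx_word" title="bbox 535 206 580 234; x_wconf 96">ist</span>
 </span>
</p>
"""


def parse_title(title):
    """Split a title attribute such as "bbox 1 2 3 4; x_wconf 93" into a dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []  # one entry per open span: its word dict, or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            conf = props.get("x_wconf")
            word = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(conf[0]) if conf else None,
            }
            self.words.append(word)
            self._stack.append(word)
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only text directly inside a word span belongs to that word.
        if self._stack and self._stack[-1] is not None:
            self._stack[-1]["text"] += data


def extract_words(hocr):
    """Return a list of word dicts found in an hOCR string."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


for word in extract_words(SAMPLE):
    print(word["text"], word["bbox"], word["conf"])
```

Because hOCR is plain HTML, no special library is needed for simple cases. A tolerant HTML parser is used here rather than an XML parser so that files written as HTML instead of XHTML are also accepted.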