Free PDF Converter Utilities
How Optical Character Recognition (OCR) Works
OCR is a technology that converts images containing text into
editable formats. It lets you process scanned books, screenshots,
and photos of text and get editable documents such as TXT, DOC, or
PDF files. The technology is widely used in many areas, and the
most advanced OCR systems can handle almost all types of images,
even complex ones such as scanned magazine pages with pictures and
columns or photos taken with a mobile phone.
How does modern OCR work? The process of converting an image
into an editable document is divided into several steps; each step
is a set of related algorithms that does one part of the OCR job.
The general steps in the OCR process are:
Loading the image as a bitmap from a given source. The source can
be a file or a pointer to a memory block. A good OCR system must
also understand many image formats: BMP, TIFF (both single-page and
multi-page images), JPEG, PNG, and so on. PDF files must be supported
as well: many documents are stored as images in PDF format, and the
only way to extract text from such files is to perform OCR.
Detecting the most important image features, such as resolution and
inversion. Many OCR algorithms expect a predefined range of font
sizes and particular foreground/background colors, so the image must
be rescaled and inverted before processing when necessary.
Deskewing and denoising. The image can be skewed or noisy, so
deskewing and denoising algorithms are applied to improve its quality.
Binarization. Many OCR algorithms require a bi-tonal image, so a
color or grayscale image must be converted to a black-and-white image.
This process is called "binarization", and it is a very important
step because incorrect binarization will cause many problems.
Line detection and removal. This step is required to improve
page layout analysis, to achieve better recognition quality for
underlined text, to detect tables, and so on.
Page layout analysis; this step is also called "zoning". At
this stage the OCR system must detect the positions and types of all
important areas in the image.
Detection of text lines and words. Sometimes this is not an easy
task because of varying font sizes and small spaces between words.
Analysis of merged and broken characters. It is very common for
a character to be broken into several parts, or for several
characters to touch each other; such cases must be detected and
the correct position of every character found.
Recognition of characters. This is the main algorithm of OCR:
the image of every character must be converted to the appropriate
character code. Sometimes this algorithm produces several candidate
codes for an uncertain image. For instance, recognizing the image of
an "I" character can produce the codes "I", "|", "1", and "l"; the
final character code is selected later.
Dictionary support. This step can improve recognition quality:
some characters, such as "1" and "I" or "C" and "G", can look very
similar, and a dictionary can help make the decision.
Saving the results in the selected output format, for instance
searchable PDF, DOC, RTF, or TXT. It is important to preserve the
original page layout: columns, fonts, colors, pictures, background,
and so on.
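As an illustration of the binarization step in the list above, here
is a minimal sketch in Python. It assumes the image is already a
grayscale bitmap given as rows of integers from 0 to 255, and it picks
the threshold with Otsu's method, which is one common choice and not
necessarily what any particular OCR product uses. The function names
otsu_threshold and binarize are made up for this example.

```python
def otsu_threshold(pixels):
    """Pick a gray level t (0..255) that best separates dark from light.

    Otsu's method: try every t and keep the one that maximizes the
    between-class variance of the two groups (<= t and > t).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no dark pixels yet
        if w1 == 0:
            break     # no light pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Convert rows of 0..255 values to rows of 1 (ink) and 0 (background).

    Assumes dark text on a light background; an inverted image would
    have to be flipped first, as described in the feature-detection step.
    """
    if threshold is None:
        threshold = otsu_threshold([p for row in gray for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in gray]


if __name__ == "__main__":
    gray = [[20, 30, 200],
            [25, 210, 220]]
    print(binarize(gray))
```

A single global threshold like this tends to work on clean scans with
even lighting; photos with shadows usually need a locally adaptive
threshold instead, which is one reason this step is easy to get wrong.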
This is not a complete list. Many other minor algorithms must also
be implemented to achieve good recognition on various image types,
but in most cases they are not essential and they vary between
OCR systems.
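The interplay between the recognition and dictionary steps can be
sketched as follows. This is only an illustration under stated
assumptions: the recognizer is assumed to return, for each character
position, a list of (character, confidence) pairs, and the dictionary
is a plain set of lowercase words. The function name resolve_word and
the confidence values are hypothetical.

```python
from itertools import product


def resolve_word(candidates, dictionary):
    """Choose the final spelling of one word.

    candidates: one list per character position, each holding
                (char, confidence) pairs from the recognizer.
    dictionary: set of known lowercase words.

    Every combination is scored by the product of its confidences.
    The best-scoring dictionary word wins; if no combination is in the
    dictionary, the best-scoring combination overall is returned.
    Note: this brute-force search grows exponentially with word length,
    so it is only suitable as a demonstration.
    """
    best_dict_word, best_dict_score = None, -1.0
    best_any_word, best_any_score = "", -1.0

    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, conf in combo:
            score *= conf

        if score > best_any_score:
            best_any_word, best_any_score = word, score
        if word.lower() in dictionary and score > best_dict_score:
            best_dict_word, best_dict_score = word, score

    return best_dict_word if best_dict_word is not None else best_any_word


if __name__ == "__main__":
    # The first glyph is ambiguous between "1", "l" and "I".
    word_candidates = [
        [("1", 0.40), ("l", 0.35), ("I", 0.25)],
        [("i", 0.90)],
        [("s", 0.95)],
        [("t", 0.90)],
    ]
    print(resolve_word(word_candidates, {"list", "lost"}))
    print(resolve_word(word_candidates, set()))
```

With the dictionary the ambiguous first glyph resolves to "l" even
though the recognizer ranked "1" slightly higher; without it, the raw
top-ranked characters are kept.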
Every OCR step is very important; the whole OCR process will
fail if even one of its steps cannot handle the given image correctly.
Every algorithm must work correctly on the widest possible range of
images, which is why only a few good universal OCR systems are
available. On the other hand, if some features of the given images
are known, the task becomes much easier: better recognition quality
is possible when only one kind of image must be processed. To achieve
the best results in that case, a good OCR system must allow the most
important parameters of every algorithm to be adjusted; sometimes
this is the only way to improve recognition quality. Unfortunately,
no current OCR system is comparable to the human eye, and it seems
that none will be created in the near future.
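To make the idea of adjustable parameters concrete, here is a small
sketch of text-line detection using a horizontal projection profile
(counting ink pixels per row). It assumes a binarized bitmap given as
rows of 0 and 1, where 1 is ink. The function name find_text_lines
and the parameters min_ink and min_height are invented for this
example and do not come from any specific OCR system.

```python
def find_text_lines(bitmap, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges, inclusive, that look like text lines.

    bitmap:     rows of 0/1 values, 1 = ink.
    min_ink:    a row counts as "inked" only if it has at least this
                many ink pixels; raise it to ignore isolated specks.
    min_height: runs of inked rows shorter than this are dropped.
    """
    lines = []
    start = None  # first row of the run currently being tracked
    for y, row in enumerate(bitmap):
        inked = sum(row) >= min_ink
        if inked and start is None:
            start = y
        elif not inked and start is not None:
            if y - start >= min_height:
                lines.append((start, y - 1))
            start = None
    # Close a run that reaches the bottom edge of the image.
    if start is not None and len(bitmap) - start >= min_height:
        lines.append((start, len(bitmap) - 1))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 1, 0, 0],  # a single noise speck
        [0, 0, 0, 0],
        [1, 1, 1, 0],
    ]
    print(find_text_lines(page))             # default settings
    print(find_text_lines(page, min_ink=2))  # tuned to skip specks
```

With the defaults the noise speck is reported as a line of its own.
If it is known in advance that the images contain such specks,
raising min_ink removes the false line without touching the rest of
the algorithm.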
Free OCR is an efficient tool designed to convert scanned
documents and PDF files into editable electronic text files
quickly and easily. OCR (Optical Character Recognition) comes
in handy when you need to grab text out of an image produced by
a digital camera, mobile phone, or scanner. A few clicks will
extract the text from all your image files. Download and try
this program for free now!