Scanning
GMS uses high-quality production scanners to digitise your documents. This allows us to create high-quality images efficiently.
Scanning works as follows
When an image is digitised or scanned, a raster technique is used. A grid is placed over the image and a measurement is taken at each point of the grid. These point measurements are called pixels. The more pixels, the more detail.
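As an illustration of the raster technique, the sketch below samples a simple shape on a coarse grid and on a fine grid. The function names and the disc standing in for a "document" are assumptions made for this example, not part of any scanner software.

```python
def rasterise(shape, width_px, height_px):
    """Sample shape(x, y) at the centre of every grid cell.

    x and y run from 0 to 1 across the image; the result is a list of
    rows holding 1 (ink) or 0 (background) for each pixel.
    """
    return [
        [1 if shape((col + 0.5) / width_px, (row + 0.5) / height_px) else 0
         for col in range(width_px)]
        for row in range(height_px)
    ]


def disc(x, y):
    """Hypothetical source image: a filled circle of radius 0.4."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.4 ** 2


coarse = rasterise(disc, 4, 4)    # 16 point measurements
fine = rasterise(disc, 16, 16)    # 256 point measurements

for row in fine:
    print("".join("#" if pixel else "." for pixel in row))
```

Printing the coarse grid the same way gives a blocky shape with cut-off corners, while the fine grid follows the outline of the circle much more closely.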
Pixels
You probably know that, if you zoom in closely on an image, the image becomes ‘pixelated’ into individual blocks. Each block represents a point on the grid and therefore one pixel. These pixels, or coloured blocks, are a way of storing images.
Resolution
The resolution is the number of pixels per unit of length, which is referred to as DPI (or PPI) when scanning. DPI stands for Dots Per Inch; PPI, Pixels Per Inch, is the official term. The most common quality for making scans is 300 DPI, that is, 300 pixels across by 300 pixels down for every square inch. The detail at this resolution is so high that document details remain readable when zooming in on scans, while the file size remains acceptable.
To use OCR, a minimum resolution of 300 DPI is required, so that the characters are legible for the software.
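To make the relationship between resolution, page size and file size concrete, here is a small calculation. The helper names are chosen for this example; the only inputs are the A4 paper size (210 mm by 297 mm) and the 300 DPI mentioned above.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_in, height_in, dpi=300):
    """Return (width_px, height_px) for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size before any compression is applied."""
    return width_px * height_px * bits_per_pixel // 8


# An A4 page measures 210 mm x 297 mm.
a4_w, a4_h = pixel_dimensions(210 / MM_PER_INCH, 297 / MM_PER_INCH, dpi=300)
print(a4_w, a4_h)                                # 2480 3508
print(uncompressed_size_bytes(a4_w, a4_h, 24))   # 26099520 bytes in 24-bit colour
print(uncompressed_size_bytes(a4_w, a4_h, 1))    # 1087480 bytes in black and white
```

Doubling the resolution to 600 DPI quadruples the number of pixels, which is why 300 DPI is a practical balance between detail and file size. Real files are usually much smaller than these raw figures because they are compressed.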
Contrast
Another important variable in digitising and making characters recognisable is contrast, especially the contrast between the characters and the background (brightness contrast). The ideal is a black character on the whitest possible background.
The character must be clear and distinguishable from the background. GMS uses special software to create the highest possible contrast, which optimises recognition. We do this by working with drop-out colours and/or indexed colours.
This is important because, especially in archives, the source material may have yellowed or the ink may have faded. This lowers the contrast ratio and makes the text difficult for our software to distinguish from the background. Emphasising the text (making it blacker) and brightening the background creates a higher contrast.
When software is used to increase the contrast, some details may be lost (they disappear into the background). It is therefore not possible to recognise all document types. Our advisers are always ready to inform you about this.
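The effect of darkening the text and brightening the background can be sketched with a simple linear contrast stretch followed by a black-and-white threshold. This is a generic illustration, not the software GMS uses; the pixel values are invented to represent faded ink on yellowed paper.

```python
def stretch_contrast(pixels):
    """Rescale grey values so the darkest becomes 0 and the brightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # completely flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def binarise(pixels, threshold=128):
    """Map every pixel to black (0) or white (255)."""
    return [0 if p < threshold else 255 for p in pixels]


# One row of a scan: yellowed paper (about 200) with faded ink (about 140).
row = [200, 198, 140, 145, 201, 199]

print(binarise(row))                     # the faded ink is lost: every pixel turns white
print(binarise(stretch_contrast(row)))   # after stretching, the ink turns black
```

The same example also shows how detail can disappear: any mark that ends up on the wrong side of the threshold merges with the background.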
OCR
Before recognition, the OCR software straightens (deskews) the image so that the text can be recognised properly. The software does this in memory, so the original is not affected.
Where necessary, the software will also apply automatic corrections, because not every document is the same.
After the image has been made suitable for OCR, the software recognises the text by looking for patterns in the image.
The software has patterns of pixels that can be translated into ASCII characters. Because the software can distinguish a background and a character, a pattern in (black) pixels can be recognised. This is then compared to an index of ASCII characters, in order to arrive at a plausible result.
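A minimal sketch of this kind of pattern comparison is shown below. It assumes tiny 3 by 5 bitmaps and a three-character index invented for the example; production OCR engines use far richer models, so this only illustrates the principle of picking the most plausible match.

```python
# Hypothetical index of character patterns: '#' is a black pixel, '.' is background.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def similarity(a, b):
    """Fraction of grid cells on which two equally sized bitmaps agree."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def recognise(glyph, templates=TEMPLATES):
    """Return (character, score) for the template closest to the glyph."""
    scored = ((ch, similarity(glyph, pattern)) for ch, pattern in templates.items())
    return max(scored, key=lambda pair: pair[1])


# A scanned "L" with one speck of noise in the middle row.
noisy_l = ("#..", "#..", "#.#", "#..", "###")
character, score = recognise(noisy_l)
print(character, round(score, 2))
```

The score doubles as a measure of how certain the match is: the noisy glyph agrees with the "L" pattern on 14 of its 15 pixels and with the other patterns on far fewer.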
By also recognising spaces (white space), words can be formed. By recognising special characters, sentences can be formed. In this way, the entire document is in fact indexed and added to the metadata of the images.
By adding this data to the metadata, the document becomes searchable. This metadata can also be used for follow-up processes, for example by adding certain values to your workflows, so that the document can be processed automatically.
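How recognised text makes a document searchable can be illustrated with a simple word index. The document ids and texts below are made up, and the structure is an assumption for this example rather than a description of how GMS stores metadata.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(".,;:!?")  # drop surrounding punctuation
            if word:
                index[word].add(doc_id)
    return index


# Hypothetical OCR output for two scanned documents.
ocr_text = {
    "scan-001": "Invoice number 4711.",
    "scan-002": "Delivery note for invoice 4711",
}

index = build_index(ocr_text)
print(sorted(index["invoice"]))    # both scans mention an invoice
print(sorted(index["delivery"]))   # only the delivery note
```

A follow-up process could use the same lookup to route a document, for example by sending every scan that contains the word "invoice" to an accounts workflow.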
Is OCR completely reliable?
No, OCR on its own is not completely reliable, but a high degree of reliability can be created.
Because the software is based on pattern recognition and links each pattern to the most plausible result (ASCII character), it is not 100% reliable. You can imagine that an “I” (upper case i) and an “l” (lower case l) can hardly be distinguished in this way. Another common confusion is between 0 (zero) and o.
The difference between these pixel patterns is almost negligible, which is why it is difficult for the software to make a choice.
However, reliability can be significantly improved when the software takes context into account. If you know that no numeric values can appear in the text, numeric values can be excluded, which prevents (for example) a 0 from being recognised where an o was written.
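The idea of excluding characters based on context can be sketched as follows. The candidate list and its confidence values are invented for the example; the point is only that restricting the allowed character set changes which candidate wins.

```python
import string


def pick_character(candidates, allowed):
    """Return the most confident candidate that the context allows.

    candidates: list of (character, confidence) pairs for one position.
    allowed: set of characters that may occur in this field.
    Returns None when no candidate is allowed.
    """
    permitted = [(ch, conf) for ch, conf in candidates if ch in allowed]
    if not permitted:
        return None
    return max(permitted, key=lambda pair: pair[1])[0]


# Hypothetical recogniser output: the zero is marginally more likely than the o.
candidates = [("0", 0.51), ("o", 0.49)]

letters_only = set(string.ascii_letters)
letters_and_digits = set(string.ascii_letters + string.digits)

print(pick_character(candidates, letters_and_digits))  # 0
print(pick_character(candidates, letters_only))        # o
```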
This can go much further, for example by working with dictionaries (such as those used in spell checkers) or with known formats (such as those used in zip code recognition). The quality of the recognition can then increase considerably.
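Both ideas can be sketched in a few lines. The look-alike table, the tiny dictionary and the zip code format (four digits, an optional space, two capital letters) are assumptions chosen for this example; a real system would use the word lists and formats that apply to your documents.

```python
import re
from itertools import product

# Assumed table of characters that are easily confused with each other.
CONFUSABLE = {"0": "0oO", "o": "o0", "O": "O0", "l": "lI1", "I": "Il1", "1": "1lI"}

# Assumed zip code format: four digits, optional space, two capital letters.
ZIP_CODE = re.compile(r"\d{4} ?[A-Z]{2}")


def variants(word):
    """Yield every spelling obtained by swapping look-alike characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    for combination in product(*options):
        yield "".join(combination)


def correct_with_dictionary(word, dictionary):
    """Return the first variant found in the dictionary, else the word unchanged."""
    for candidate in variants(word):
        if candidate in dictionary:
            return candidate
    return word


def looks_like_zip_code(text):
    """Check the recognised text against the assumed zip code format."""
    return ZIP_CODE.fullmatch(text) is not None


dictionary = {"book", "total", "invoice"}
print(correct_with_dictionary("b00k", dictionary))   # book
print(correct_with_dictionary("tota1", dictionary))  # total
print(looks_like_zip_code("1234 AB"))                # True
print(looks_like_zip_code("12E4 AB"))                # False
```

A failed format check does not say what the right value is, but it is a cheap way to flag a field for closer inspection.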
However, to get as close as possible to a 100% reliable result, you must apply a visual check. During a visual check, our software presents every character (or word) that it has doubts about to a user for verification. We are able to set the software parameters so that every case whose recognition certainty falls below a certain percentage is presented for visual inspection. This allows 99.99% certainty that the text is properly recognised.
It is often not necessary for the text throughout the document to be 100% reliable, as long as a few key index fields are, for example the information needed for subsequent steps in your process (indexing, classification, etc.). In that case the visual check only needs to cover a few words per document, so it takes little time and costs less.
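Selecting doubtful cases by percentage could look like the sketch below. The function name, the 90% threshold and the confidence values are assumptions for illustration; the actual parameters are set per project.

```python
def split_for_review(results, threshold=0.90):
    """Separate OCR results into accepted items and items needing a visual check.

    results: list of (text, confidence) pairs with confidence between 0 and 1.
    Items below the threshold are returned as doubtful.
    """
    accepted, doubtful = [], []
    for text, confidence in results:
        target = accepted if confidence >= threshold else doubtful
        target.append((text, confidence))
    return accepted, doubtful


# Hypothetical recogniser output for four words on a scanned invoice.
words = [("Invoice", 0.99), ("4O711", 0.62), ("Total", 0.97), ("l25.00", 0.71)]

accepted, doubtful = split_for_review(words, threshold=0.90)
print([text for text, _ in doubtful])   # ['4O711', 'l25.00']
```

Raising the threshold sends more words to the operator and increases reliability; lowering it saves time at the cost of more undetected errors.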
Full text OCR
Characters are not the only elements that can be recognised; even the font, images and the layout in which the source text is formatted can be recognised. This type of OCR is also referred to as “Full text OCR”. We mainly apply this when recognising books or scanning editable documents.
Previously, only the OCR-A and OCR-B fonts were supported for proper recognition. Nowadays almost all fonts are recognised by the software. Text formatting, logos, graphics, etc. are also recognised. The result is an almost exact, editable replica of the scanned document.