OCR (Optical Character Recognition)

Looking for a way to make your scanned documents searchable? GMS is the digitization specialist. Our advanced recognition software makes your documents searchable, fully automatically. From scanning to recognition, GMS offers the total solution.

Scanning

GMS uses high-quality production scanners to create an image of each document. This allows us to digitize large volumes efficiently without compromising image quality.

Scanning works as follows

When digitizing or scanning images, a raster technique is used. A grid is placed over the image within which point measurements are carried out. These point measurements are also called pixels. The more pixels, the more detail.

Pixels

You probably know that, if you zoom in closely on an image, the image becomes ‘pixelated’ into individual blocks. Each block represents a point on the grid and therefore one pixel. These pixels, or coloured blocks, are a way of storing images.

Resolution

The resolution is the number of pixels per unit of length. When scanning, it is expressed in DPI (Dots Per Inch) or PPI (Pixels Per Inch, which is strictly speaking the correct term for pixels). The most common quality for scans is 300 DPI: 300 pixels across and 300 pixels down for every square inch of the original. At this resolution, document details remain readable when zooming in on scans, while the file size remains acceptable.
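To make the trade-off between detail and file size concrete, here is a small sketch of the arithmetic. The function names are illustrative only, and the sizes are for uncompressed images; real scan files are normally compressed and therefore smaller.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size before any compression is applied."""
    return width_px * height_px * bits_per_pixel // 8


# An A4 page is 210 mm x 297 mm; 1 inch = 25.4 mm.
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4

width_px, height_px = pixel_dimensions(A4_WIDTH_IN, A4_HEIGHT_IN, dpi=300)
print(width_px, height_px)                             # 2480 3508
print(uncompressed_size_bytes(width_px, height_px, 24))  # 26099520 (24-bit colour)
print(uncompressed_size_bytes(width_px, height_px, 1))   # 1087480 (black and white)
```

Doubling the resolution to 600 DPI quadruples the number of pixels, which is why 300 DPI is a common compromise.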

To use OCR, a minimum resolution of 300 DPI is required, so that the characters are legible for the software.

Contrast

Another important variable in digitizing and in making characters recognisable is contrast, especially the contrast between the characters and the background (brightness contrast). Think of a black character on the whitest possible background.

The character must be clearly distinguishable from the background. GMS uses special software to create the highest possible contrast, which optimises recognition. We do this by working with drop-out colours and/or indexable colours.

This is important because, especially in archives, the source material may have yellowed or the ink faded. This affects the contrast ratio and makes the background difficult for our software to distinguish from the text. Emphasising the text (making it blacker) and brightening the background creates a higher contrast.
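The idea of making the text blacker and the background brighter can be sketched with a simple linear contrast stretch followed by a threshold. This is only an illustration of the principle on a tiny hand-made grayscale grid (0 is black, 255 is white); it is not the software GMS uses, and the threshold of 128 is an assumed value.

```python
def stretch_contrast(gray):
    """Rescale values linearly so the darkest pixel becomes 0 and the lightest 255."""
    flat = [value for row in gray for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # a uniform image has no contrast to stretch
        return [row[:] for row in gray]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row] for row in gray]


def binarize(gray, threshold=128):
    """Map each pixel to 0 (ink) or 255 (background)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# Faded ink (around 150) on yellowed paper (around 200): a vertical stroke.
faded = [
    [200, 150, 200],
    [198, 150, 201],
    [200, 155, 200],
]

print(binarize(faded))                    # stroke is lost: every pixel becomes 255
print(binarize(stretch_contrast(faded)))  # stroke survives as a column of 0s
```

Without the stretch, the faded stroke is brighter than the threshold and disappears entirely, which also shows how contrast processing can lose detail when it is tuned badly.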

Increasing the contrast with software can cause some details to be lost (they disappear into the background). It is therefore not possible to recognise all document types. Our advisers are always ready to inform you about this.

OCR

Before recognition, the OCR software first straightens the image so that the text can be recognised properly. The software does this in memory, so the original is not affected.

Where necessary, the software will also add (automatic) corrections, because not every document is the same.

After the image has been made suitable for OCR, the software recognises it by detecting patterns in the image.

The software has patterns of pixels that can be translated into ASCII characters. Because the software can distinguish a background and a character, a pattern in (black) pixels can be recognised. This is then compared to an index of ASCII characters, in order to arrive at a plausible result.
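A toy version of this pattern comparison is shown below. It assumes tiny 3x5 bitmaps and a hypothetical template set of three letters, and it simply counts agreeing pixels; production OCR engines use far more robust techniques, so treat this purely as an illustration of "compare against an index and take the most plausible result".

```python
# Hypothetical 3x5 templates: '#' is an ink pixel, '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Fraction of pixels that agree between two equally sized bitmaps."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognise(glyph):
    """Return (character, score) for the template that matches best."""
    scored = ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


# A scanned "T" in which one ink pixel was lost.
noisy_t = ["###", ".#.", ".#.", "...", ".#."]

character, score = recognise(noisy_t)
print(character, round(score, 2))  # T 0.93
```

Note that the templates for "I" and "T" already agree on 13 of 15 pixels, which hints at why similar-looking characters are easy to confuse.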

By also recognising spaces (white space), words can be formed. By recognising special characters, sentences can be formed. In this way, the entire document is in fact indexed and added to the metadata of the images.
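Forming words from white space can be sketched as follows, assuming each recognised character comes with the horizontal positions where it starts and ends. The data layout and the gap threshold of 5 pixels are assumptions made for this example.

```python
def group_into_words(boxes, min_gap=5):
    """Group characters into words.

    boxes: list of (character, x_start, x_end), sorted left to right.
    A new word starts when the white gap to the previous character
    is at least min_gap pixels wide.
    """
    words, current, previous_end = [], "", None
    for character, x_start, x_end in boxes:
        if previous_end is not None and x_start - previous_end >= min_gap:
            words.append(current)
            current = ""
        current += character
        previous_end = x_end
    if current:
        words.append(current)
    return words


line = [("O", 0, 8), ("C", 10, 18), ("R", 20, 28), ("i", 40, 43), ("s", 45, 52)]
print(group_into_words(line))  # ['OCR', 'is']
```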

By adding this data to the metadata, the document becomes searchable. This metadata can also be used for follow-up processes, for example by adding certain values to your workflows, so that the document can be processed automatically.
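One common way to make recognised text searchable is an inverted index that maps each word to the pages it occurs on. The sketch below assumes the OCR result is available as plain text per page; how a particular archive or document management system stores and queries this data will differ.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the sorted list of page numbers it occurs on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical OCR output for a two-page document.
pages = {
    1: "Invoice for scanning services",
    2: "Scanning and OCR of archive boxes",
}

index = build_index(pages)
print(index["scanning"])        # [1, 2]
print(index.get("ocr", []))     # [2]
print(index.get("missing", []))  # []
```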

Is OCR completely reliable?

No, OCR on its own is not completely reliable, but a high degree of reliability can be created.

Because the software is based on pattern recognition and links each pattern to the most plausible result (ASCII character), it is not 100% reliable. You can imagine that an “I” (uppercase i) and an “l” (lowercase L) can hardly be distinguished in this way. Another common confusion is between 0 (zero) and o (the letter).

The difference between these pixel patterns is almost negligible, which is why it is difficult for the software to make a choice.

However, reliability can be significantly improved when the software takes context into account. If you know that no numeric values can appear in the text, the digits can be excluded, which prevents (for example) a 0 from being recognised instead of an o.
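Excluding characters that cannot occur can be sketched as a filter over the candidate results. The candidate list with scores is a hypothetical output format, chosen only to illustrate the principle.

```python
import string


def best_allowed(candidates, allowed):
    """Return the highest-scoring candidate character that is in `allowed`.

    candidates: list of (character, score) pairs.
    Returns None when no candidate is permitted.
    """
    permitted = [(ch, score) for ch, score in candidates if ch in allowed]
    if not permitted:
        return None
    return max(permitted, key=lambda pair: pair[1])[0]


# The pixel pattern is almost equally likely to be a zero or the letter o.
candidates = [("0", 0.51), ("o", 0.49)]

letters_and_digits = set(string.ascii_letters + string.digits)
letters_only = set(string.ascii_letters)

print(best_allowed(candidates, letters_and_digits))  # 0
print(best_allowed(candidates, letters_only))        # o
```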

This goes much further, for example when working with dictionaries (such as those used in spelling checkers) or with fixed formats (such as those used in zip code recognition). The quality of the recognition can then increase considerably.
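Both ideas can be sketched with the Python standard library. The word list and the zip code format (four digits followed by two capital letters) are assumptions made for this example; a real project would use the dictionaries and formats that apply to the documents in question.

```python
import difflib
import re

# Hypothetical word list; a real system would use a full dictionary.
WORD_LIST = ["archive", "document", "invoice", "scanner"]

# Assumed format for illustration: four digits, optional space, two capitals.
ZIP_CODE = re.compile(r"\d{4} ?[A-Z]{2}")


def correct_word(word, vocabulary=WORD_LIST, cutoff=0.8):
    """Replace a word with its closest dictionary entry, if one is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def looks_like_zip_code(text):
    """Check whether the recognised text fits the expected zip code format."""
    return ZIP_CODE.fullmatch(text) is not None


print(correct_word("d0cument"))        # document
print(looks_like_zip_code("1234 AB"))  # True
print(looks_like_zip_code("12E4 AB"))  # False
```

A value that fails the format check is a natural candidate for a visual check rather than for silent correction.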

However, to achieve a 100% reliable result, you must apply a visual check. During a visual check, our software presents all characters (or words) that it has doubts about to a user for inspection. We can set the software parameters so that every result whose recognition certainty falls below a chosen percentage is presented for visual inspection. This allows 99.99% certainty that the text is properly recognised.
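Selecting doubtful results for visual inspection amounts to splitting the output on a confidence threshold. The sketch below assumes each recognised word comes with a confidence score between 0 and 1; the threshold of 0.95 and the sample data are illustrative.

```python
def split_for_review(results, threshold=0.95):
    """Split (text, confidence) pairs into accepted results and results to review."""
    accepted = [item for item in results if item[1] >= threshold]
    needs_review = [item for item in results if item[1] < threshold]
    return accepted, needs_review


# Hypothetical OCR output with confidence scores.
results = [("Invoice", 0.99), ("2O24", 0.62), ("Total", 0.97), ("l5.00", 0.71)]

accepted, needs_review = split_for_review(results)
print(accepted)      # [('Invoice', 0.99), ('Total', 0.97)]
print(needs_review)  # [('2O24', 0.62), ('l5.00', 0.71)]
```

Raising the threshold sends more words to a human reviewer and increases certainty; lowering it reduces the review effort.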

It is often not necessary for the text throughout the document to be 100% reliable; only a few key index fields need to be. These are, for example, fields with information needed for subsequent steps in your process (indexing, classification, etc.). In this case the visual check only needs to cover a few words per document, so it takes less time and costs less.

Full text OCR

Characters are not the only elements that can be recognised; even the font, images and the layout in which the source text is formatted can be recognised. This type of OCR is also referred to as “Full text OCR”. We mainly apply this when recognising books or scanning editable documents.

Previously, only the OCR-A and OCR-B fonts were supported for proper recognition. Nowadays almost all fonts are recognised by the software. Text formatting, logos, graphics, etc. are also recognised. The result is an almost exact, editable replica of the image.

Roadmap for OCR

1. ASSESSMENT

Before we start the digitization process, we inventory the source documents. We also discuss your requests and requirements in detail. We involve our specialists, each of whom provides input from his or her own discipline.

2. PROJECT PLAN

GMS draws up a dedicated project plan for every project. It is essential, especially when it comes to files and documents, to properly frame the project, including the customer’s requests and requirements. Dates, appointments and deadlines are recorded here.

3. RECOGNITION

GMS digitizes in accordance with the requirements. When agreed, GMS implements the project in accordance with the requirements of substitution. Additional checks on content, quality and indexing are performed to meet the substitution requirements.

4. FINISHING

The digitized files and data are processed and delivered to the client in accordance with the requests and requirements, so that they seamlessly match your work processes.

Are the documents suitable for OCR?

PDF documents of hundreds of pages, you know how it is… It would be easier if these were searchable, so that you and your employees could quickly find the correct information.

Your scanned documents must be recognised in order to arrive at a searchable file or archive. That means that the text in the documents must be recognised, so that you or your employees can search for texts.

However, not every document is suitable for OCR, which is why our specialists first assess your documents and determine what it takes to have them reliably recognised. Your documents may need to be edited before they can be recognised, for example by using software to increase the contrast so that the text stands out better. The emphasis of the scans is then not on visual resemblance to the original but on reliable recognition. Our advisers are ready to advise you on this process.

Quality and reliability

The reliability of the recognised data depends on the source material. Well-optimised scans can provide a very high degree of reliability. However, in order to achieve a reliable output, it is not only the quality of the scans that is important, but also the possibility of performing checks and using reference tables.

GMS offers the appropriate automated and visual checks on the recognised data, so that we can guarantee an almost error-free result. However, it is often the case that only part of the data is important. That part is called the “key index fields”: in short, values that your employees search for frequently, such as reference numbers, product numbers or patient numbers. In these cases, we can add extra checks to these key index fields.

OCR for documents and regular books

Documents and books are usually stored as a multipage file at the ‘issue’ level. The PDF and PDF/A file formats are ideal for this. Their major advantage is integrated searchability: both the image and the OCR result are embedded in the PDF. In addition, the PDF standard is also suitable for ECM and DMS applications.
