XtremeDocumentStudio .NET
Next-generation multi-format document-processing component suite for .NET developers
Compatibility
Visual Studio 2010/2012/2013/2015

Search text on images with XtremeDocumentStudio .NET's HTML5 Document Viewer

Learn to enable text search capability on images using OCR
By Abhishek, Shivaranjini & Pradeep

In Version 2016 R3 of XtremeDocumentStudio .NET's HTML5 Document Viewer control, we introduced text search capability on scanned documents using Optical Character Recognition (OCR). This feature is supported on all widely used image formats, such as TIFF, JPEG, and PNG. In the future, we plan to extend this feature to scanned PDFs and Word formats (DOCX, DOC, etc.) that contain images, and to make the viewer more interactive with text highlighting, copying, and so on.

To enable search capability for images, we need to process them through an OCR engine. OCR recognizes the text in an input image. For example, a JPEG image may contain textual information but, because JPEG is a raster image format, that text is stored as pixels rather than as text. It only looks like text to our eyes, unlike a paragraph on a web page, which is stored as text inside a paragraph tag. Even when the image is embedded in a PDF or a web page, its text is not available as text. Wouldn't it be great if we could search for text when images are loaded on the viewer?

We can now do that with XtremeDocumentStudio .NET’s HTML5 Document Viewer control.

Enabling this feature involves the use of the "Tesseract" library, one of the most popular open-source libraries for OCR.
Note: Tesseract and the associated libraries used for OCR are native Windows DLLs. We have designed the feature so that these DLLs remain external modules that can be optionally enabled and used. The DLLs are dynamically loaded only when the main digitization module (Gnostice.XtremeDigitizationEngine.dll) is referenced in your application. The core XtremeDocumentStudio library is still fully managed code.

To enable text search on images please follow the steps listed below:

We have provided "DigitizerSettings" on the "Preferences" class of the "DocumentViewer" control. Please set it as follows:


<script>

$(document).ready(function() {
    var preferences = new gnostice.Preferences();

    // Digitizer settings

    // To enable text search, set this property to true
    preferences.digitizerSettings.digitizationEnabled = true;

    // This feature supports searching for text in multiple languages.
    // Please provide the set of languages you intend to support.
    // An example of specifying English and French is shown here:
    preferences.digitizerSettings.textLanguage = "eng+fra";

    var documentViewer = new gnostice.DocumentViewer('doc-viewer-id', preferences);
});

</script>
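
If the list of supported languages comes from configuration rather than a hard-coded string, the "+"-separated value can be assembled by a small helper. The sketch below is only an illustration: buildLanguageString is a hypothetical function, not part of the Gnostice API, and it assumes that language codes are joined with "+" exactly as in the "eng+fra" example above.

```javascript
// Hypothetical helper (not part of the Gnostice API): joins OCR language
// codes with "+" to produce a value such as "eng+fra".
function buildLanguageString(codes) {
    if (!Array.isArray(codes) || codes.length === 0) {
        throw new Error("At least one language code is required");
    }
    var seen = {};
    var result = [];
    codes.forEach(function (code) {
        var normalized = String(code).trim().toLowerCase();
        // Reject empty codes and codes that would break the "+" separator.
        if (normalized === "" || /[+\s]/.test(normalized)) {
            throw new Error("Invalid language code: " + code);
        }
        // Skip duplicates while keeping the original order.
        if (!seen[normalized]) {
            seen[normalized] = true;
            result.push(normalized);
        }
    });
    return result.join("+");
}
```

The returned string can then be assigned to the language property of digitizerSettings in place of the literal, for example buildLanguageString(["eng", "fra"]).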

We now need to copy the supporting DLLs to the viewer application's "bin" directory.

Please copy the following DLLs from "[installation folder]\XtremeDocumentStudio Ultimate\Bin\XDE" (referred to below as the XDE folder):

The following DLLs are present in the x86 and x64 folders in the XDE folder specified above. Please choose the appropriate files for your application's architecture. For example, if your viewer application targets "x86", copy these files from the x86 folder to the viewer application's "bin" directory.
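
The copy step can be automated as part of a build or deployment script. The following Node.js sketch is one possible way to do it and is not something shipped with the product: xdeNativeDir and copyDlls are hypothetical names, the installation directory is a placeholder you must supply, and the script copies every .dll file it finds in the chosen folder, which may be more than the specific files listed in this article.

```javascript
var fs = require("fs");
var path = require("path");

// Builds the Windows path of the architecture-specific XDE folder.
// installDir is a placeholder for wherever the product is installed.
function xdeNativeDir(installDir, arch) {
    if (arch !== "x86" && arch !== "x64") {
        throw new Error("Expected \"x86\" or \"x64\", got: " + arch);
    }
    return path.win32.join(installDir, "XtremeDocumentStudio Ultimate", "Bin", "XDE", arch);
}

// Copies every .dll file from sourceDir into binDir and returns the copied names.
// Assumption: all DLLs in the folder are wanted; filter the list if they are not.
function copyDlls(sourceDir, binDir) {
    var copied = [];
    fs.readdirSync(sourceDir).forEach(function (name) {
        if (path.extname(name).toLowerCase() === ".dll") {
            fs.copyFileSync(path.join(sourceDir, name), path.join(binDir, name));
            copied.push(name);
        }
    });
    return copied;
}
```

On a Windows build machine, a call such as copyDlls(xdeNativeDir(installDir, "x86"), binDir) would place the 32-bit DLLs in the viewer application's "bin" directory, where installDir and binDir are your own paths.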

Also ensure that "Gnostice.XtremeDigitizationEngine.dll" is referenced in your viewer application.

The final step is to copy the "tessdata" folder present in the XDE folder specified above to the viewer application's "bin" directory. This folder contains data for the OCR engine, called Training Data, for the languages the application needs to support. The default folder we ship contains training data only for the English language. If you need to support more languages, please download Training Data for those additional languages from https://github.com/tesseract-ocr/tessdata

If you take a look at the folder, you will see that all the files start with the "eng" prefix, which denotes the English language.

preferences.digitizerSettings.textRecognitionLanguage = "eng+fra";

The above line from the code snippet shows how to specify English and French as the languages to match the text against. For this to work, you also need to download the French language Training Data files, which start with the prefix "fra", and place them in the "tessdata" folder inside the viewer application's "bin" directory.
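
A missing Training Data file is easy to overlook, so it can help to check the "tessdata" folder before deploying. The Node.js sketch below is a hypothetical build-time check, not part of the viewer or of the Gnostice API. It assumes that the files for a language start with the language code followed by a dot (for example, "eng.traineddata"), in line with the prefix naming described above.

```javascript
var fs = require("fs");

// Returns the language codes from a "+"-separated string (such as "eng+fra")
// for which tessdataDir contains no file named "<code>.<something>".
function missingTrainingData(tessdataDir, languageString) {
    var files = fs.readdirSync(tessdataDir);
    return languageString.split("+").filter(function (lang) {
        return !files.some(function (name) {
            return name.indexOf(lang + ".") === 0;
        });
    });
}
```

With only the default English data installed, calling missingTrainingData with the path to the "tessdata" folder and "eng+fra" would report "fra" as missing, which is a signal to download the French files before enabling that language.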

© 2002-2017 Gnostice Information Technologies Private Limited. All rights reserved.