XtremeDocumentStudio .NET
Next-generation multi-format document-processing component suite for .NET developers
Compatibility: Visual Studio 2010/2012/2013/2015

How to convert scanned images to searchable PDF using OCR in .NET

Learn to use the new digitization feature of XtremeDocumentStudio .NET.

In Version 2015 R11 of XtremeDocumentStudio .NET, we introduced a document digitization feature. Many of our customers have asked us for this feature in the past, even though it requires the use of an unmanaged library.

Digitization involves recognizing specific content elements in the input document and converting them to a format that supports those elements in a better or more useful fashion. For example, a JPEG image might contain text, but its raster content does not store the text as text. It only looks like text to our eyes, unlike a paragraph on a web page, which is marked up as text inside a paragraph tag. When the image is embedded in a PDF or a web page, the text is not available as text. Wouldn't it be great if the text were selectable, just like text on a web page or in an MS Word document?

In the last release, we added a new class called DigitizerSettings. The Preferences property of the DocumentConverter component exposes a DigitizerSettings instance, which you use to specify what you would like to digitize. Here is a simple code snippet that demonstrates how it can be done.

private void button1_Click(object sender, EventArgs e) {
  DocumentConverter dc = new DocumentConverter();
  // Digitize every image found in the input document
  dc.Preferences.DigitizerSettings.DigitizationMode =
    Gnostice.Core.DigitizationEngine.DigitizationMode.AllImages;
  // Recognize text elements in those images
  dc.Preferences.DigitizerSettings.RecognizeElementTypes =
    Gnostice.Core.DigitizationEngine.RecognizeElementTypes.TEXT;

  try {
    // Convert the scanned image to a PDF file
    dc.ConvertToFile(@"H:\Screenshot-2.png", "searchable.pdf");
  } catch (Exception err) {
    MessageBox.Show("Error:\n" + err.Message);
  }
  // Close the form once the conversion attempt has finished
  Close();
}
Figure: The original image with rasterized text, which was converted to a PDF with selectable and searchable text.

To get this working, you need to add references to Gnostice.XtremeDigitizationEngine.dll and Tesseract.dll in your project. You also need to copy the contents of the [installation folder]\XtremeDocumentStudio Ultimate\Bin\XDE folder to your project's bin folder so that the OCR component can do its job.
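
The copy step can also be done in code, for example once at application startup. The following is only a minimal sketch that uses nothing beyond System.IO. The XdeDeployment class and its CopyDirectory method are hypothetical names introduced here for illustration and are not part of XtremeDocumentStudio .NET. The sketch assumes that a plain recursive file copy is all the OCR component needs.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the Gnostice API): copies the XDE
// folder contents next to the application's executable.
public static class XdeDeployment
{
    // Recursively copies all files and subfolders from sourceDir to
    // targetDir, overwriting existing files. Returns the number of
    // files copied.
    public static int CopyDirectory(string sourceDir, string targetDir)
    {
        if (!Directory.Exists(sourceDir))
            throw new DirectoryNotFoundException("Source folder not found: " + sourceDir);

        // No-op if the target folder already exists
        Directory.CreateDirectory(targetDir);
        int copied = 0;

        foreach (string file in Directory.GetFiles(sourceDir))
        {
            string dest = Path.Combine(targetDir, Path.GetFileName(file));
            File.Copy(file, dest, true); // true = overwrite
            copied++;
        }

        foreach (string dir in Directory.GetDirectories(sourceDir))
        {
            string destSub = Path.Combine(targetDir, Path.GetFileName(dir));
            copied += CopyDirectory(dir, destSub);
        }

        return copied;
    }
}
```

A call such as XdeDeployment.CopyDirectory(xdeFolder, AppDomain.CurrentDomain.BaseDirectory) would copy the files next to the executable, where xdeFolder is the Bin\XDE path under your own installation folder. A post-build copy step in the project would achieve the same result without any code.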

© 2002-2017 Gnostice Information Technologies Private Limited. All rights reserved.