Scanned to Searchable PDF Conversion (OCR)

The Document Converter component can digitize scanned documents (images or scanned PDFs) while converting them to the PDF format. The text recognized in each image is added as an invisible layer when the image is written to the PDF, which makes the PDF "searchable". The open-source Tesseract library performs the text recognition.

The steps to convert an image or a scanned PDF to a "searchable" PDF are shown below.

1. Open Visual Studio and create a new console application.

2. Use either the NuGet GUI or the Package Manager Console (Install-Package followed by the package name) to install the following NuGet packages in your project:

   a. Gnostice.DocumentStudio.Converter

   b. Gnostice.DocumentStudio.OCR

3. Use the code snippet shown below to perform the conversion along with OCR.

    // Requires "using System;" and "using System.Collections.Generic;" in addition to
    // the Gnostice namespaces provided by the packages installed in step 2.

    // Create the DocumentConverter instance
    DocumentConverter documentConverter = new DocumentConverter();

    // Files to be converted (images and/or scanned PDFs)
    List<string> inputFiles = new List<string>() {
        @"scanned_image.jpg",
        @"scanned_pdf.pdf"
    };

    // Base file name for the output files
    string baseFileName = "converted";

    // Output file format
    string outputFileFormat = "pdf";

    // OCR settings: digitize all images, with image enhancement turned off
    documentConverter.Preferences.DigitizerSettings.DigitizationMode = DigitizationMode.AllImages;
    documentConverter.Preferences.DigitizerSettings.ImageEnhancementSettings.ImageEnhancementMode =
        ImageEnhancementMode.OFF;

    // Language(s) used in the scanned document
    documentConverter.Preferences.DigitizerSettings.OCRSettings.DocumentLanguage = "eng";

    // Recognize text elements
    documentConverter.Preferences.DigitizerSettings.RecognizeElementTypes =
        RecognizeElementTypes.TEXT;

    // Convert to searchable PDF, writing the output to the current directory
    documentConverter.ConvertToFile(inputFiles, outputFileFormat, Environment.CurrentDirectory,
        baseFileName, ConversionMode.ConvertToSeperateFiles);

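4. Run the application. The converted output files are written to the current working directory, which by default is the application's bin directory when the application is run from Visual Studio.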

Enabling additional languages

The Tesseract OCR library uses training data to recognize text. The Gnostice.DocumentStudio.OCR add-on NuGet package ships with training data for English only, but Tesseract can recognize many more languages. To enable an additional language, download its training data from the Tesseract GitHub page and copy it to the tessdata folder, which is located in the same folder as your binaries. Then list the languages in the DocumentLanguage setting, as shown in the code snippet above. Separate multiple languages with a plus sign; for example, use "eng+deu+fra" for English, German, and French.
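A missing training data file is an easy mistake to make when adding languages, so it can help to check the tessdata folder before starting a conversion. The following is a minimal sketch, not part of the Gnostice API: the OcrLanguages class and its BuildLanguageString and FindMissing methods are hypothetical helpers introduced here for illustration. The sketch assumes that training data files follow Tesseract's usual naming, a language code followed by ".traineddata" (for example, deu.traineddata).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of Gnostice Document Studio.
public static class OcrLanguages
{
    // Joins Tesseract language codes with a plus sign,
    // the format expected by the DocumentLanguage setting.
    public static string BuildLanguageString(IEnumerable<string> codes)
    {
        return string.Join("+", codes);
    }

    // Returns the language codes that have no "<code>.traineddata"
    // file in the given tessdata folder (assumed file naming).
    public static List<string> FindMissing(string tessdataDir, IEnumerable<string> codes)
    {
        return codes
            .Where(code => !File.Exists(Path.Combine(tessdataDir, code + ".traineddata")))
            .ToList();
    }
}
```

A typical use would be to call FindMissing with Path.Combine(AppContext.BaseDirectory, "tessdata") and the list of language codes, report any codes it returns, and otherwise assign the result of BuildLanguageString to the DocumentLanguage setting.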

Enabling image enhancement

Before passing images to Tesseract, the document converter can optionally enhance them to improve text detection accuracy. Which enhancement techniques help depends on many factors, such as the quality of the original physical document, the scanning fidelity, and the quality and resolution of the scanned image. Applying an unsuitable image processing technique can degrade detection accuracy, and image processing also slows down the conversion considerably. For these reasons, choose the technique or set of techniques carefully for a given set of scanned images. The supported enhancement techniques are gray, skew correction, and scaling. The following code snippet shows how to enable these techniques.

    ImageEnhancementSettings imeSettings = documentConverter.Preferences.DigitizerSettings.ImageEnhancementSettings;
    imeSettings.ImageEnhancementMode = ImageEnhancementMode.USE_SPECIFIED_TECHNIQUES;
 
    // Gray
    imeSettings.ImageEnhancementTechniques.Add(new Gray());
 
    // Skew correction
    imeSettings.ImageEnhancementTechniques.Add(new SkewCorrection());
 
    // Scaling
    float scaleFactor = 2.5f;
    imeSettings.ImageEnhancementTechniques.Add(new Scaling(scaleFactor));
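
The right scale factor depends on the resolution of the scan. Tesseract's own guidance on improving output quality suggests that it tends to work best on images of roughly 300 DPI or more, so one possible starting point is to scale low-resolution scans up toward that figure. The sketch below is an illustration only: OcrScaling and SuggestScaleFactor are hypothetical names, the 300 DPI default is taken from that Tesseract guidance rather than from Gnostice documentation, and the sketch assumes that the value passed to Scaling is a multiplicative factor.

```csharp
using System;

// Hypothetical helper, not part of Gnostice Document Studio.
public static class OcrScaling
{
    // Returns the factor that brings an image from sourceDpi up to targetDpi.
    // The result is never below 1, so images already at or above the
    // target resolution are left at their original size.
    public static float SuggestScaleFactor(float sourceDpi, float targetDpi = 300f)
    {
        if (sourceDpi <= 0f)
            throw new ArgumentOutOfRangeException(nameof(sourceDpi), "DPI must be positive.");
        return Math.Max(1f, targetDpi / sourceDpi);
    }
}
```

The returned value could then be passed to the Scaling technique in place of a hard-coded factor. Treat it as a first guess and compare the OCR results with and without scaling on a few representative pages, since unnecessary enhancement can reduce accuracy and slows down the conversion.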