
Convert scanned documents to searchable PDF using Web APIs

Scanned images or PDF files to searchable PDF files
by Santosh Patil


If you are new to StarDocs, we suggest you read the introductory article and the getting started article first. This article builds on the steps explained in those foundational articles to avoid repetition.

The API reference documentation can be found here.

Digitization is the process of converting analog content into digital form so that the content can be processed further. Digitizing printed matter consists of two steps. The first step is acquiring the printed pages as a set of images, using a scanner or a high-resolution camera. The second step is optical recognition and digitization of the content present in the acquired images.

Going a step further, the recognized text and the original image can be combined so that the image is retained as-is while the document also becomes searchable. The PDF file format allows this kind of composition: the original image forms the main page content, and the recognized text is superimposed on the image as an invisible layer. The invisible text is placed and sized to match the original content in the image as closely as possible, so that the selection highlight of the invisible text lines up with the corresponding printed text. The resulting searchable PDF remains legible to human readers while also supporting further processing such as content search and copy/extraction.
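
To make the idea of matching the invisible text to the printed text more concrete, here is a minimal sketch of the coordinate mapping involved. It is not how StarDocs works internally, and the function name `pixelBoxToPdfRect` and the shape of the `box` object are hypothetical. The sketch assumes an unrotated page in the default PDF coordinate system (origin at the bottom-left corner, 72 units per inch) and a page whose size equals the image size at the given scan resolution.

```javascript
// Hypothetical helper for illustration only; not part of the StarDocs SDK.
// Converts a recognized word's bounding box from image pixels
// (origin at top-left, y grows downward) to PDF points
// (origin at bottom-left, y grows upward, 72 points per inch).
function pixelBoxToPdfRect(box, imageHeightPx, dpi) {
  return {
    x: box.left * 72 / dpi,
    // PDF measures y from the bottom of the page, so flip the axis and
    // use the bottom edge of the box as the rectangle's origin
    y: (imageHeightPx - box.top - box.height) * 72 / dpi,
    width: box.width * 72 / dpi,
    height: box.height * 72 / dpi
  };
}

// A US Letter page scanned at 300 dpi is 2550 x 3300 pixels
var rect = pixelBoxToPdfRect(
  { left: 300, top: 150, width: 600, height: 60 }, 3300, 300);
console.log(rect); // x: 72, y: 741.6, width: 144, height: 14.4 (in points)
```

A word that starts one inch from the left edge of the scan ends up 72 points from the left edge of the PDF page, so the invisible text drawn in that rectangle sits directly over the printed word.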

StarDocs provides APIs to create a searchable PDF from a scanned image or a scanned PDF. These APIs are part of the Document Converter APIs. They accept scanned input either as one or more image files or as a PDF file, and produce a searchable PDF file.

The screenshot below shows a scanned image being viewed in the StarDocs HTML viewer after it has been acquired.

The screenshot below shows the same image after it has been converted to a searchable PDF file. The user can now search for content in the viewer.

Let's look at the API for converting scanned content to searchable PDF. The code snippets below are given in JavaScript, C#, Delphi, and Java.

After authenticating and uploading the scanned document (image or PDF file) that you want to make searchable, you will get the document URL, or a list of URLs if you uploaded several files. We pass this URL (or list) to the document converter API as shown below.

JavaScript:

// Set up connection details
var stardocs = new Gnostice.StarDocs(
  new Gnostice.ConnectionInfo(
    'https://api.gnostice.com/stardocs/v1', 
    '<API Key>', 
    '<API Secret>'),
  new Preferences(
    // Whether to force full permissions on PDF files protected 
    // with a permissions/owner/master password
    new DocPasswordSettings(true))
);

// Authenticate
stardocs.auth.loginApp()
  .done(function(response) {
    // Upload file
    var selectedFile = document.getElementById('input').files[0];
    stardocs.storage.upload(selectedFile) 
      .done(function(response) {
        var documentUrl = response.documents[0].url;

        // Setup the digitizer settings
        var digitizerSettings = {
          // Supported values are "off" (default), "allImages"
          digitizationMode: "allImages",
          // Array of strings listing the languages of the text present in 
          // the scanned document. "eng" is default.
          documentLanguages: ["eng", "deu"],
          // The type of elements that need to be recognized and digitized.
          // Currently only "text" is supported
          recognizeElements: ["text"],
          // Whether any skew correction should be performed (default is true)
          skewCorrection: true,
          // Which image enhancement techniques (if any) should be applied to 
          // the input image before attempting to recognize the elements
          imageEnhancementSettings: 
          {
            // Supported values are "off" (default), "auto" and "useSpecified"
            enhancementMode: "auto"
          }
        };
        // Convert to searchable PDF
        stardocs.docOperations.convertToPDF("convertToSingleFile", [documentUrl],
            null, null, null, digitizerSettings)
          .done(function(response) {
            var newDocUrl = response.documents[0].url;
            
            // Do something with resultant document (newDocUrl)
            // ...
          });
      });
  });

C#:

// Set up connection details
StarDocs starDocs = new StarDocs(
  new ConnectionInfo(
    new Uri("https://api.gnostice.com/stardocs/v1"),
    "<API Key>",
    "<API Secret>"), 
  new Preferences(
    // Force full permissions on PDF files protected 
    // with a permissions/owner/master password
    new DocPasswordSettings(true))
);

// Authenticate
starDocs.Auth.loginApp();

// Input file
FileObject fileObjectInput = new FileObject(@"C:\Documents\Statement.pdf");
List<FileObject> fileObjectInputs = new List<FileObject>() { fileObjectInput };

// Setup the digitizer settings
ConverterDigitizerSettings digitizerSettings = new ConverterDigitizerSettings();
digitizerSettings.DigitizationMode = DigitizationMode.AllImages;
// Array of strings listing the languages of the text present in 
// the scanned document. "eng" is default.
digitizerSettings.DocumentLanguages = new string[] { "eng", "deu" };
// The type of elements that need to be recognized and digitized.
// Currently only text is supported
digitizerSettings.RecognizeElements = RecognizableElementType.Text;
// Which image enhancement techniques (if any) should be applied to 
// the input image before attempting to recognize the elements
digitizerSettings.ImageEnhancementSettings.ImageEnhancementMode = 
  ImageEnhancementMode.Auto;
// Whether any skew correction should be performed (default is true)
digitizerSettings.SkewCorrection = true;

// Convert to searchable PDF
List<DocObject> outFiles = 
  starDocs.DocOperations.ConvertToPDF(fileObjectInputs, null, null, 
    null, ConversionMode.ConvertToSingleFile, digitizerSettings);

DocObject docObjectOutput = outFiles[0];

// Do something with resultant document (docObjectOutput)
// ...

Delphi:

var
  StarDocs: TgtStarDocsSDK;
  LInFiles: TObjectList<TgtFileObject>;
  LOutFiles: TObjectList<TgtDocObject>;
  FileObjectInput: TgtFileObject;
  DocObjectOutput: TgtDocObject;
  DocumentLanguages: TArray<string>;
begin
  StarDocs := nil;
  LInFiles := nil;
  LOutFiles := nil;
  DocObjectOutput := nil;
  try
    // Set up connection details
    StarDocs := TgtStarDocsSDK.Create(nil);
    StarDocs.ConnectionInfo.ApiServerUri.URI :=
      'https://api.gnostice.com/stardocs/v1';
    StarDocs.ConnectionInfo.ApiKey := '<API Key>';
    StarDocs.ConnectionInfo.ApiSecret := '<API Secret>';
    // Force full permissions on PDF files protected 
    // with a permissions/owner/master password
    StarDocs.Preferences.DocPasswordSettings.ForceFullPermission := True;

    // Authenticate
    StarDocs.Auth.loginApp;

    // Input file
    LInFiles := TObjectList<TgtFileObject>.Create;
    LInFiles.Add(TgtFileObject.Create
        ('D:\Work\Demos\build2016\demos\SampleFiles\OCR\Deutsch.png'));

    // Setup the digitizer settings
    StarDocs.DocOperations.ConverterDigitizerSettings.DigitizationMode 
      := dmoAllImages;
    // Array of strings listing the languages of the text present in 
    // the scanned document. "eng" is default.
    DocumentLanguages := TArray<string>.Create();
    SetLength(DocumentLanguages, 2);
    DocumentLanguages[0] := 'eng';
    DocumentLanguages[1] := 'deu';
    StarDocs.DocOperations.ConverterDigitizerSettings.DocumentLanguages 
      := DocumentLanguages;
    // The type of elements that need to be recognized and digitized.
    // Currently only text is supported
    StarDocs.DocOperations.ConverterDigitizerSettings.RecognizeElements 
      := [retText];
    // Which image enhancement techniques (if any) should be applied to 
    // the input image before attempting to recognize the elements
    StarDocs.DocOperations.ConverterDigitizerSettings.
      ImageEnhancementSettings.ImageEnhancementMode := iemAuto;
    // Whether any skew correction should be performed (default is true)
    StarDocs.DocOperations.ConverterDigitizerSettings.SkewCorrection := True;

    // Convert to searchable PDF
    LOutFiles := StarDocs.DocOperations.ConvertToPDF(LInFiles, nil, nil);
    DocObjectOutput := LOutFiles[0];

    // Do something with resultant document (DocObjectOutput)
    // ...

  finally
    // Free objects
    if Assigned(LOutFiles) then
      FreeAndNil(LOutFiles);
    if Assigned(LInFiles) then
      FreeAndNil(LInFiles);
    if Assigned(StarDocs) then
      FreeAndNil(StarDocs);
  end;
end;

Java:

// Set up connection details
StarDocs starDocs = new StarDocs(
  new ConnectionInfo(
    new java.net.URI("https://api.gnostice.com/stardocs/v1"),
    "<API Key>",
    "<API Secret>"), 
  new Preferences(
    // Force full permissions on PDF files protected 
    // with a permissions/owner/master password
    new DocPasswordSettings(true))
);

// Authenticate
starDocs.auth.loginApp();

// Input file
FileObject fileObjectInput = new FileObject("C:\\Documents\\Statement.pdf");
ArrayList<FileObject> fileObjectInputs = new 
  ArrayList<FileObject>(Arrays.asList(new FileObject[] {fileObjectInput}));

// Setup the digitizer settings
ConverterDigitizerSettings digitizerSettings = new 
  ConverterDigitizerSettings();
digitizerSettings.setDigitizationMode(DigitizationMode.AllImages);
// Array of strings listing the languages of the text present in 
// the scanned document. "eng" is default.
digitizerSettings.setDocumentLanguages(new String[] { "eng", "deu" });
// The type of elements that need to be recognized and digitized.
// Currently only text is supported
digitizerSettings.setRecognizeElements(
  EnumSet.of(RecognizableElementType.Text));
// Which image enhancement techniques (if any) should be applied to 
// the input image before attempting to recognize the elements
digitizerSettings.getImageEnhancementSettings().setImageEnhancementMode(
  ImageEnhancementMode.Auto);
// Whether any skew correction should be performed (default is true)
digitizerSettings.setSkewCorrection(true);

// Convert to searchable PDF
List<DocObject> outFiles = starDocs.docOperations.convertToPDF(
  fileObjectInputs, null, null, null, 
  ConversionMode.ConvertToSeparateFiles, digitizerSettings);

DocObject docObjectOutput = outFiles.get(0);

// Do something with resultant document (docObjectOutput)
// ...
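
The digitizer settings used in the JavaScript snippet can also be assembled by a small helper that fills in defaults and rejects unsupported values before the conversion call is made. The following is only a sketch: `buildDigitizerSettings` and its `options` argument are hypothetical and not part of the StarDocs SDK. The property names and supported values are the ones listed in the snippet comments above. Note that the helper defaults `digitizationMode` to "allImages", since that is what a searchable PDF needs, whereas the API default is "off".

```javascript
// Hypothetical helper for illustration only; not part of the StarDocs SDK.
// Supported values are taken from the comments in the snippet above.
var SUPPORTED_DIGITIZATION_MODES = ["off", "allImages"];
var SUPPORTED_ENHANCEMENT_MODES = ["off", "auto", "useSpecified"];

function buildDigitizerSettings(options) {
  options = options || {};
  var languages = options.documentLanguages || ["eng"]; // "eng" is the default
  var settings = {
    // Assumption: default to "allImages" here (the API default is "off")
    digitizationMode: options.digitizationMode || "allImages",
    documentLanguages: languages.slice(),
    // Currently only "text" is supported
    recognizeElements: ["text"],
    // Skew correction is on unless explicitly disabled
    skewCorrection: options.skewCorrection !== false,
    imageEnhancementSettings: {
      enhancementMode: options.enhancementMode || "off"
    }
  };

  if (SUPPORTED_DIGITIZATION_MODES.indexOf(settings.digitizationMode) === -1) {
    throw new Error("Unsupported digitizationMode: " + settings.digitizationMode);
  }
  var enhancementMode = settings.imageEnhancementSettings.enhancementMode;
  if (SUPPORTED_ENHANCEMENT_MODES.indexOf(enhancementMode) === -1) {
    throw new Error("Unsupported enhancementMode: " + enhancementMode);
  }
  if (settings.documentLanguages.length === 0) {
    throw new Error("At least one document language is required");
  }
  return settings;
}

// Same settings as in the JavaScript snippet above
var digitizerSettings = buildDigitizerSettings({
  documentLanguages: ["eng", "deu"],
  enhancementMode: "auto"
});
console.log(JSON.stringify(digitizerSettings, null, 2));
```

The resulting object has the same shape as the `digitizerSettings` literal in the JavaScript snippet, so it can be passed as the last argument of the `convertToPDF` call shown there.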

That's it! This article showed how to use the Gnostice StarDocs Document Converter API to convert scanned documents to searchable PDF files.

© 2002-2017 Gnostice Information Technologies Private Limited. All rights reserved.