PDFtoolkit VCL
Edit, enhance, secure, merge, split, view, print PDF and AcroForms documents
Compatibility: Delphi, C++Builder

Extraction of Structured Text Data From PDF Documents

Use PDFtoolkit VCL to extract text data occurring in specific locations in a PDF document.
By Mohammed Najeemudheen & Shine Babu

This article was inspired by a query from one of our customers.

The customer, a PDFtoolkit VCL user, received a large number of PDF documents containing demographic data, the output of a process over which he had no control. He had to extract the demographic data from the PDF files and feed it into another process.

The data was in a structured format and occurred in the same locations on the first page of all the documents. He wanted to know whether, given the location of the data, there was a way to extract it.

The following is a slightly abridged version of the code snippet we sent to the client.

// Body of a function that returns a string. PDFDoc is the PDFtoolkit
// document component, declared and created elsewhere.
var
  PageElements: TgtPDFPageElementList;
  PageItem: TgtPDFTextElement;
  JI: Integer;
  XCord, YCord: Double;
begin
  Result := '';
  // Initialized before "try" so that FreeAndNil is safe if loading fails
  PageElements := nil;
  try
    PDFDoc.LoadFromFile('input.pdf');

    // Gets text elements from page 1, with coordinates in pixels
    PageElements := PDFDoc.GetPageElements(1, [etText], muPixels);
    // Parses the text elements in page 1
    for JI := 0 to PageElements.Count - 1 do
    begin
      PageItem := TgtPDFTextElement(PageElements.Items[JI]);
      // Retrieves coordinates of the text element
      XCord := TgtPDFPageElement(PageItem).XCordOrigin;
      YCord := TgtPDFPageElement(PageItem).YCordOrigin;
      // Checks if the text element is at (100, 250),
      // ignoring the fractional part of each coordinate
      if (Trunc(XCord) = 100) and (Trunc(YCord) = 250) then
      begin
        Result := PageItem.Text;
        Break;
      end;
    end;
  finally
    FreeAndNil(PageElements);
  end;
end;
 

This method extracts the text located at coordinates (100, 250), measured in pixels, on page 1 of the PDF document input.pdf. It iterates over all text elements on page 1, checks the coordinates of each, and returns the text of the first element whose truncated coordinates equal (100, 250). If no element matches, the method returns an empty string.
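The matching rule itself can be tried out without a PDF file. The following is only a minimal sketch of the coordinate test, not part of PDFtoolkit VCL: TTextItem and FindTextAt are hypothetical names standing in for the text elements and the loop above, and the sample coordinates and strings are made up for illustration. Because Trunc discards the fractional part, a target of (100, 250) matches any positive origin from 100 up to, but not including, 101 horizontally and from 250 up to, but not including, 251 vertically.

```pascal
program CoordMatchSketch;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // Hypothetical stand-in for a text element: an origin and its text.
  TTextItem = record
    X, Y: Double;
    Text: string;
  end;

// Returns the text of the first item whose truncated origin equals
// (TargetX, TargetY), or an empty string if no item matches.
function FindTextAt(const Items: array of TTextItem;
  TargetX, TargetY: Integer): string;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Items) to High(Items) do
    if (Trunc(Items[I].X) = TargetX) and (Trunc(Items[I].Y) = TargetY) then
    begin
      Result := Items[I].Text;
      Break;
    end;
end;

var
  Page: array[0..2] of TTextItem;
begin
  // Made-up sample data: a label, a name and a date on one page.
  Page[0].X := 72.0;  Page[0].Y := 250.4; Page[0].Text := 'Name:';
  Page[1].X := 100.6; Page[1].Y := 250.4; Page[1].Text := 'Jane Doe';
  Page[2].X := 100.2; Page[2].Y := 270.9; Page[2].Text := '1980-01-01';

  WriteLn(FindTextAt(Page, 100, 250)); // prints: Jane Doe
  WriteLn(FindTextAt(Page, 100, 270)); // prints: 1980-01-01
  WriteLn(FindTextAt(Page, 101, 250)); // prints an empty line (no match)
end.
```

Passing the target coordinates as parameters, rather than hard-coding 100 and 250, lets the same routine be called once per field when a document carries several values at known positions.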

© 2002-2017 Gnostice Information Technologies Private Limited. All rights reserved.