PDFtoolkit VCL
Edit, enhance, secure, merge, split, view, print PDF and AcroForms documents
Compatibility: Delphi, C++Builder

Extraction of Structured Text Data From PDF Documents

Use PDFtoolkit VCL to extract text data occurring in specific locations in a PDF document.
By Mohammed Najeemudheen & Shine Babu

This article was inspired by a query from one of our customers.

The customer, a PDFtoolkit VCL user, was receiving a large number of PDF documents containing demographic data, the output of a process over which he had no control. He needed to extract the demographic data from the PDF files and feed it into another process.

The data was in a structured format and occurred at the same locations on the first page of every document. He wanted to know whether, given the location of the data, there was a way to extract it.

The following is a slightly abridged version of the code snippet we sent to the customer.

var
  PageElements: TgtPDFPageElementList;
  PageItem: TgtPDFTextElement;
  JI: Integer;
  XCord, YCord: Double;
begin
  Result := '';
  // Initialize so that the finally block is safe if loading fails
  PageElements := nil;
  try
    PDFDoc.LoadFromFile('input.pdf');

    // Gets the text elements from page 1
    PageElements := PDFDoc.GetPageElements(1, [etText], muPixels);
    // Iterates over the text elements on page 1
    for JI := 0 to PageElements.Count - 1 do
    begin
      PageItem := TgtPDFTextElement(PageElements.Items[JI]);
      // Retrieves the coordinates of the text element
      XCord := TgtPDFPageElement(PageItem).XCordOrigin;
      YCord := TgtPDFPageElement(PageItem).YCordOrigin;
      // Checks whether the text element is at (100, 250)
      if (Trunc(XCord) = 100) and (Trunc(YCord) = 250) then
      begin
        Result := PageItem.Text;
        Break;
      end;
    end;
  finally
    FreeAndNil(PageElements);
  end;
end;
 

This method extracts the text occurring at coordinates (100, 250) on page 1 of the PDF document input.pdf. It iterates over all text elements on page 1, checks the coordinates of each one, and returns the text of the first element whose truncated coordinates equal (100, 250). If no element matches, it returns an empty string.
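Because the snippet compares truncated coordinates, an element reported at, say, (99.7, 250.4) would be missed. A minimal sketch of a tolerance-based comparison follows. The function name IsNear and its Tolerance parameter are hypothetical helpers introduced here for illustration; they are not part of PDFtoolkit VCL, and the right tolerance depends on your documents.

```pascal
program NearMatchDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper (not a PDFtoolkit API): returns True when the point
// (X, Y) lies within Tolerance units of (TargetX, TargetY) on both axes.
function IsNear(X, Y, TargetX, TargetY, Tolerance: Double): Boolean;
begin
  Result := (Abs(X - TargetX) <= Tolerance) and
            (Abs(Y - TargetY) <= Tolerance);
end;

begin
  // (99.7, 250.4) is within 1 unit of (100, 250) on both axes,
  // whereas Trunc(99.7) = 99 would fail an exact comparison with 100.
  if IsNear(99.7, 250.4, 100, 250, 1.0) then
    WriteLn('match')
  else
    WriteLn('no match');
end.
```

In the extraction loop, the condition could then be written as IsNear(XCord, YCord, 100, 250, 1.0) in place of the two Trunc comparisons.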

© 2002-2010 Gnostice Information Technologies Private Limited. All rights reserved.
