PDFOne .NET
Powerful all-in-one PDF library for .NET
Compatibility: VS 2008, VS 2005, CLR 2.0

PDF Text Search and Extraction Using PDFOne .NET

Learn to search and extract text from PDF documents.
By V. Subhash

Version 4 of PDFOne .NET introduces methods for PDF text search. The Search() method has three overloads.

public ArrayList Search(
   // search string
   string searchString,
   // page number
   int pageNum,
   // literal or regular expression
   PDFSearchMode searchMode,
   // generous-match, case-sensitive, whole-word
   PDFSearchOptions searchOptions
)

public ArrayList Search(
   // search begins from
   int startPageNum,
   string searchString,
   PDFSearchMode searchMode,
   PDFSearchOptions searchOptions
)

public void Search(
   string searchString,
   PDFSearchMode searchMode,
   PDFSearchOptions searchOptions,
   // event handler to be called when a match is found
   SearchElementHandler pdfSearchHandler,
   int startPageNum
)

The first two overloads return an ArrayList of PDFSearchElement objects, one for each match. The third overload returns nothing. Instead, it calls the specified event handler whenever it finds a match, and the handler receives the search result through its parameters.
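The difference between the list-returning style and the event-handler style can be sketched without the library. The example below is only an illustration under stated assumptions: MatchInfo, MatchFoundHandler, and SearchPatterns are hypothetical stand-ins written for this sketch, not PDFOne .NET types, and the real SearchElementHandler signature may differ.

```csharp
using System;
using System.Collections;

// Hypothetical stand-in for a search result; NOT a PDFOne .NET type.
public class MatchInfo
{
    public string MatchString;
    public int PageNumber;
    public string Line;
}

// Hypothetical delegate; the real SearchElementHandler signature may differ.
public delegate void MatchFoundHandler(MatchInfo match);

public static class SearchPatterns
{
    // Style 1: collect every match and return the whole list to the caller.
    public static ArrayList FindAll(string[] pages, string word)
    {
        ArrayList results = new ArrayList();
        Find(pages, word, delegate(MatchInfo m) { results.Add(m); });
        return results;
    }

    // Style 2: report each match through a callback as soon as it is found.
    public static void Find(string[] pages, string word, MatchFoundHandler handler)
    {
        if (word.Length == 0) return; // nothing to search for

        for (int p = 0; p < pages.Length; p++)
        {
            foreach (string line in pages[p].Split('\n'))
            {
                int at = line.IndexOf(word, StringComparison.Ordinal);
                while (at >= 0)
                {
                    MatchInfo m = new MatchInfo();
                    m.MatchString = word;
                    m.PageNumber = p + 1; // assumes 1-based page numbers
                    m.Line = line;
                    handler(m);
                    at = line.IndexOf(word, at + word.Length, StringComparison.Ordinal);
                }
            }
        }
    }

    public static void Main()
    {
        // Made-up page text standing in for a loaded document.
        string[] pages = { "I ride a bike.\nThe bike is red.", "No match here." };

        ArrayList all = FindAll(pages, "bike");
        Console.WriteLine("Collected " + all.Count + " matches");

        Find(pages, "bike", delegate(MatchInfo m)
        {
            Console.WriteLine("Page " + m.PageNumber + ": " + m.Line);
        });
    }
}
```

The list style is convenient when you need all results before acting on them. The callback style lets you react to each match as it arrives, which can matter for long documents.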

These methods enable you to perform simple text searches using literal strings and advanced text searches using regular expressions.

Simple Text Search

The following code snippet performs a simple search with a literal string.

PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_doc.pdf");

// Obtain all instances of the word "bike" in page 1
ArrayList ArrayList1 =
      (ArrayList) PDFDocument1.Search("bike",
                                      1,
                                      PDFSearchMode.LITERAL,
                                      PDFSearchOptions.NONE);

// Iterate through all search results
PDFSearchElement PdfSearchElement1;
int n = ArrayList1.Count;
for (int i = 0; i < n; i++) {
  PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
  // Print search results to console output
  Console.WriteLine("Found \"" +
                         PdfSearchElement1.MatchString +
                         "\" in page #" +
                         PdfSearchElement1.PageNumber +
                         " text \"" +
                         PdfSearchElement1.LineContainingMatchString +
                         "\"" );
}

// Close the document
PDFDocument1.Close();
Console.ReadLine();

Here is the document we used for testing this code.

Sample Document

And here is the output.

Text Search Output

Advanced PDF Text Search

Regular expressions are productivity multipliers. With a well-crafted regular expression, you can eliminate several lines from your code. All the Search() overloads support regular expressions. The following code snippet shows how to use them.

PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_.pdf");


// Obtain all hyperlinks in page 2
ArrayList ArrayList1 =
      (ArrayList)PDFDocument1.Search(@"http://{1}",
                                      2,
                                      PDFSearchMode.REGEX,
                                      PDFSearchOptions.NONE);

// Iterate through all search results
PDFSearchElement PdfSearchElement1;
int n = ArrayList1.Count;
for (int i = 0; i < n; i++) {
  PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
  // Print search results to console output
  Console.WriteLine("Found \"" +
                         PdfSearchElement1.MatchString +
                         "\" in page #" +
                         PdfSearchElement1.PageNumber +
                         " text \"" +
                         PdfSearchElement1.LineContainingMatchString +
                         "\"" );
}

// Close the document
PDFDocument1.Close();
Console.ReadLine();

The above code snippet uses a simple regular expression that matches the "http://" prefix of web page links, so each match marks a line that contains a link. To test this code snippet, we used the following document.
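To see what such a pattern matches in isolation, the sketch below runs it with the standard .NET Regex class against a made-up line of text. It assumes that PDFSearchMode.REGEX follows .NET regular-expression syntax, which the article does not state. The second pattern, `https?://\S+`, is a suggestion for capturing whole links and does not come from the original code. UrlPatternDemo is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    // Returns every substring of text that the pattern matches.
    public static string[] Matches(string text, string pattern)
    {
        MatchCollection found = Regex.Matches(text, pattern);
        string[] result = new string[found.Count];
        for (int i = 0; i < found.Count; i++)
        {
            result[i] = found[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample line; not taken from the article's test document.
        string line = "Visit http://www.example.com/docs or https://example.org today.";

        // "/{1}" means "exactly one slash", so this pattern matches
        // only the text "http://".
        foreach (string m in Matches(line, @"http://{1}"))
            Console.WriteLine("prefix only: " + m);

        // Broader pattern (an assumption, not from the article): an optional
        // "s", then every non-whitespace character that follows.
        foreach (string m in Matches(line, @"https?://\S+"))
            Console.WriteLine("full link:   " + m);
    }
}
```

With the first pattern, the matched string is only the prefix, so the full link has to be read from the line containing the match. The broader pattern returns the link itself, although it may also pick up trailing punctuation.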

Sample Document

And here is the output. Note that the search has caught every hyperlink on the page.

Advanced Text Search Output

PDF Text Extraction

The search methods find text in the order in which it is stored in the document. This is not always the order in which a human reads a page, from top to bottom. If you need the text in reading order, first extract all text from the page and then search the extracted text. The following code snippet shows how to extract all text content from a PDF page.

// Create a PDF document object
PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_doc.pdf");

// Extract text from page 1
ArrayList aExtractedText = PDFDocument1.ExtractText(1);

// Save extracted text to file
using (StreamWriter StreamWriter1 = File.CreateText("extracted_content.txt")) {
  foreach (string sLine in aExtractedText) {
    StreamWriter1.Write(sLine);
  }
} // Leaving the using block flushes and closes the file

// Close the document
PDFDocument1.Close();

We tested this code snippet on a PDF document containing the license agreement of one of our products. Here is that document and the extracted text.

Original Document and Extracted Text
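Once the text has been extracted, searching it is ordinary string processing. The sketch below assumes the extracted text arrives as an ArrayList of strings, as in the snippet above. It uses a hard-coded list with made-up content in place of a real ExtractText() result, and ExtractedTextSearch is a hypothetical helper name.

```csharp
using System;
using System.Collections;
using System.Text.RegularExpressions;

public static class ExtractedTextSearch
{
    // Returns "lineNumber: line" for every line that matches the pattern.
    public static ArrayList FindLines(ArrayList lines, string pattern)
    {
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        ArrayList hits = new ArrayList();
        for (int i = 0; i < lines.Count; i++)
        {
            string line = (string) lines[i];
            if (regex.IsMatch(line))
            {
                hits.Add((i + 1) + ": " + line);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Stand-in for the list returned by ExtractText(1); the text is made up.
        ArrayList extracted = new ArrayList();
        extracted.Add("SOFTWARE LICENSE AGREEMENT");
        extracted.Add("1. Grant of license");
        extracted.Add("2. Restrictions on use");

        // Whole-word, case-insensitive search for "license".
        foreach (string hit in FindLines(extracted, @"\blicense\b"))
            Console.WriteLine(hit);
    }
}
```

Because the lines are searched in list order, the hits come back in the same order as the extracted text.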
