In Version 4 of PDFOne .NET, we have introduced methods to implement PDF text search.

    public ArrayList Search(
        // search string
        string searchString,
        // page number
        int pageNum,
        // literal or regular expression
        PDFSearchMode searchMode,
        // generous-match, case-sensitive, whole-word
        PDFSearchOptions searchOptions
    )

    public ArrayList Search(
        // search begins from
        int startPageNum,
        string searchString,
        PDFSearchMode searchMode,
        PDFSearchOptions searchOptions
    )

    public void Search(
        string searchString,
        PDFSearchMode searchMode,
        PDFSearchOptions searchOptions,
        // event handler to be called when a match is found
        SearchElementHandler pdfSearchHandler,
        int startPageNum
    )

The first two overloads return an ArrayList of PDFSearchElement objects, one for each match. The third overload returns nothing. Instead, it calls the specified event handler each time it finds a match, and the handler receives the search result through its parameters.
These methods enable you to perform simple text searches using literal strings and advanced text searches using regular expressions.
The following code snippet illustrates the former.

    // Requires: using System; using System.Collections; and the PDFOne .NET namespace
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

    // Load PDF document
    PDFDocument1.Load("sample_doc.pdf");

    // Obtain all instances of the word "bike" in page 1
    ArrayList ArrayList1 = PDFDocument1.Search(
        "bike", 1, PDFSearchMode.LITERAL, PDFSearchOptions.NONE);

    // Iterate through all search results
    PDFSearchElement PdfSearchElement1;
    int n = ArrayList1.Count;
    for (int i = 0; i < n; i++)
    {
        PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
        // Print search results to console output
        Console.WriteLine("Found \"" + PdfSearchElement1.MatchString
            + "\" in page #" + PdfSearchElement1.PageNumber
            + " text \"" + PdfSearchElement1.LineContainingMatchString + "\"");
    }

    // Close the document
    PDFDocument1.Close();

    Console.ReadLine();

Here is the document we used for testing this code.

And, here is the output.

Regular expressions are productivity multipliers. A well-crafted regular expression can replace several lines of your code. All the Search() methods support regular expressions. The following code snippet shows how to use them.

    // Requires: using System; using System.Collections; and the PDFOne .NET namespace
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

    // Load PDF document
    PDFDocument1.Load("sample_.pdf");

    // Obtain all hyperlinks in page 2
    ArrayList ArrayList1 = PDFDocument1.Search(
        @"http://{1}", 2, PDFSearchMode.REGEX, PDFSearchOptions.NONE);

    // Iterate through all search results
    PDFSearchElement PdfSearchElement1;
    int n = ArrayList1.Count;
    for (int i = 0; i < n; i++)
    {
        PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
        // Print search results to console output
        Console.WriteLine("Found \"" + PdfSearchElement1.MatchString
            + "\" in page #" + PdfSearchElement1.PageNumber
            + " text \"" + PdfSearchElement1.LineContainingMatchString + "\"");
    }

    // Close the document
    PDFDocument1.Close();

    Console.ReadLine();

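The above code snippet uses a simple regular expression that matches the "http://" prefix of web page links. (The {1} quantifier means "exactly once" and is redundant here.) Each search result therefore identifies a line of text that contains a link. To test this code snippet, we used the following document.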

And, here is the output. Note how all the hyperlinks have been neatly caught by the search.
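
The pattern used in the snippet matches only the "http://" prefix, so the matched string itself is not the complete link. If you need the full address, a longer pattern can help. The following is a minimal sketch that uses the standard .NET Regex class on a made-up line of text. The pattern, the class name LinkPatternDemo, and the sample line are illustrative assumptions, and we have not verified here that the REGEX search mode of PDFOne .NET accepts exactly the same syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPatternDemo
{
    // Hypothetical pattern: "http://" or "https://" followed by a run of
    // characters that are not whitespace, double quotes, or angle brackets.
    public const string LinkPattern = @"https?://[^\s""<>]+";

    // Returns every substring of the line that matches LinkPattern.
    public static List<string> FindLinks(string line)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(line, LinkPattern))
        {
            links.Add(m.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample line, not taken from the test document.
        string line = "Visit http://www.example.com/docs or https://example.org today.";
        foreach (string link in FindLinks(line))
        {
            Console.WriteLine(link);
        }
    }
}
```

Note that this pattern keeps any punctuation that directly follows a link, such as a full stop at the end of a sentence, so you may want to trim the matches before using them.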

The search methods find text in the order in which it is stored in the document. This is not always the order in which a person reads the page, from top to bottom. If you need the results in reading order, first extract all text from the page and then search the extracted text. The following code snippet shows how to extract all text content from a PDF page.

    // Requires: using System.Collections; using System.IO; and the PDFOne .NET namespace
    // Create a PDF document object
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

    // Load PDF document
    PDFDocument1.Load("sample_doc.pdf");

    // Extract text from page 1
    ArrayList aExtractedText = PDFDocument1.ExtractText(1);

    // Save extracted text to file; the using block closes the writer
    using (StreamWriter StreamWriter1 = File.CreateText("extracted_content.txt"))
    {
        foreach (string sLine in aExtractedText)
        {
            StreamWriter1.Write(sLine);
        }
    }

    // Close the document
    PDFDocument1.Close();

We tested this code snippet on a PDF document containing the license agreement of one of our products. Here is that document and the extracted text.
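
Once the text has been extracted, searching it is ordinary string processing. The following is a minimal sketch of that second step, assuming the extracted lines are strings held in an ArrayList, as in the extraction snippet above. The class ExtractedTextSearch, the method FindLines, and the sample lines are hypothetical names and data used only for illustration. They are not part of PDFOne .NET.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtractedTextSearch
{
    // Returns the zero-based indexes of the lines that contain the term,
    // in the same order as the lines appear in the list.
    public static List<int> FindLines(ArrayList lines, string term)
    {
        var hits = new List<int>();
        // Escape the term so it is treated as a literal; ignore case.
        var regex = new Regex(Regex.Escape(term), RegexOptions.IgnoreCase);
        for (int i = 0; i < lines.Count; i++)
        {
            string line = lines[i] as string;
            if (line != null && regex.IsMatch(line))
            {
                hits.Add(i);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Stand-in for the list of extracted lines; the text is made up.
        var lines = new ArrayList
        {
            "This License Agreement is a legal agreement.",
            "You may install one copy of the software.",
            "The license is not transferable."
        };
        foreach (int i in FindLines(lines, "license"))
        {
            Console.WriteLine("Line " + (i + 1) + ": " + lines[i]);
        }
    }
}
```

Because the loop walks the list from first to last, the hits come back in the order of the extracted lines. To search with a regular expression instead of a literal, pass the pattern to the Regex constructor without calling Regex.Escape.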

© 2002-2013 Gnostice Information Technologies Private Limited. All rights reserved.