In Version 4 of PDFOne .NET, we have introduced methods to implement PDF text search.
    public ArrayList Search(
        // search string
        string searchString,
        // page number
        int pageNum,
        // literal or regular expression
        PDFSearchMode searchMode,
        // generous-match, case-sensitive, whole-word
        PDFSearchOptions searchOptions
    )

    public ArrayList Search(
        // search begins from
        int startPageNum,
        string searchString,
        PDFSearchMode searchMode,
        PDFSearchOptions searchOptions
    )

    public void Search(
        string searchString,
        PDFSearchMode searchMode,
        PDFSearchOptions searchOptions,
        // event handler to be called when a match is found
        SearchElementHandler pdfSearchHandler,
        int startPageNum
    )
The first two overloads return an ArrayList of PDFSearchElement objects, one for each match. The third overload does not return anything. Instead, it calls the specified event handler each time it finds a match. Inside the event handler, you can access the search result through the handler's parameters.
These methods enable you to perform simple text searches using literal strings and advanced text searches using regular expressions.
The following code snippet illustrates the former.
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");
    // Load PDF document
    PDFDocument1.Load("sample_doc.pdf");
    // Obtain all instances of the word "bike" in page 1
    ArrayList ArrayList1 =
        (ArrayList) PDFDocument1.Search("bike",
                                        1,
                                        PDFSearchMode.LITERAL,
                                        PDFSearchOptions.NONE);

    // Iterate through all search results
    PDFSearchElement PdfSearchElement1;
    int n = ArrayList1.Count;
    for (int i = 0; i < n; i++)
    {
        PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
        // Print search results to console output
        Console.WriteLine("Found \"" +
                          PdfSearchElement1.MatchString +
                          "\" in page #" +
                          PdfSearchElement1.PageNumber +
                          " text \"" +
                          PdfSearchElement1.LineContainingMatchString +
                          "\"");
    }

    // Close the document
    PDFDocument1.Close();
    Console.ReadLine();
Here is the document we used for testing this code.
And, here is the output.
Regular expressions are productivity multipliers. With a carefully crafted regular expression, you can eliminate several lines from your code. All the Search() overloads support regular expressions. The following code snippet shows how to use them.
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");
    // Load PDF document
    PDFDocument1.Load("sample_.pdf");
    // Obtain all hyperlinks in page 2
    ArrayList ArrayList1 =
        (ArrayList) PDFDocument1.Search(@"http://{1}",
                                        2,
                                        PDFSearchMode.REGEX,
                                        PDFSearchOptions.NONE);

    // Iterate through all search results
    PDFSearchElement PdfSearchElement1;
    int n = ArrayList1.Count;
    for (int i = 0; i < n; i++)
    {
        PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
        // Print search results to console output
        Console.WriteLine("Found \"" +
                          PdfSearchElement1.MatchString +
                          "\" in page #" +
                          PdfSearchElement1.PageNumber +
                          " text \"" +
                          PdfSearchElement1.LineContainingMatchString +
                          "\"");
    }

    // Close the document
    PDFDocument1.Close();
    Console.ReadLine();
The above code snippet uses a simple regular expression, http://{1}, which matches the "http://" prefix that begins a web page link. To test this code snippet, we used the following document.
And, here is the output. Note how all the hyperlinks have been neatly caught by the search.
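To see what the pattern itself captures, here is a minimal sketch that runs it with the standard .NET Regex class rather than PDFOne. It assumes that PDFSearchMode.REGEX follows .NET regular expression syntax, which this article does not confirm. The class name LinkPatternDemo, the sample line, and the broader pattern https?://\S+ are illustrative assumptions and are not part of PDFOne.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPatternDemo
{
    // Returns every substring of 'line' matched by 'pattern'.
    public static List<string> FindAll(string line, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(line, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical line of page text, used only for illustration.
        string line = "Visit http://www.example.com/docs or http://example.org today.";

        // Pattern from the snippet above: "/{1}" means exactly one "/",
        // so each match is only the "http://" prefix.
        foreach (string s in FindAll(line, @"http://{1}"))
        {
            Console.WriteLine("prefix match: " + s);
        }

        // Broader pattern (an assumption, not from the article) that
        // continues up to the next whitespace character.
        foreach (string s in FindAll(line, @"https?://\S+"))
        {
            Console.WriteLine("full match:   " + s);
        }
    }
}
```

With the sample line, the first pattern yields "http://" twice, while the second yields "http://www.example.com/docs" and "http://example.org". If the search engine behaves the same way, the complete link is found in the line that contains the match rather than in the match string itself.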
The search methods find text in the order in which it is stored in the document. This is not always the order in which a person reads a page, from top to bottom. If you need the results in reading order, first extract all text from the page and then search the extracted text. The following code snippet shows how to extract all text content from a PDF page.
    // Create a PDF document object
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");
    // Load PDF document
    PDFDocument1.Load("sample_doc.pdf");
    // Extract text from page 1
    ArrayList aExtractedText = PDFDocument1.ExtractText(1);
    // Save extracted text to file; the using block closes the writer
    using (StreamWriter StreamWriter1 = File.CreateText("extracted_content.txt"))
    {
        foreach (string sLine in aExtractedText)
        {
            StreamWriter1.Write(sLine);
        }
    }
    // Close the document
    PDFDocument1.Close();
We tested this code snippet on a PDF document containing the license agreement of one of our products. Here is that document and the extracted text.
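Once the text has been extracted, searching it is ordinary string handling. The following is a minimal sketch of that second step. It uses an in-memory ArrayList of strings as a stand-in for the list returned by ExtractText(1), so it runs without PDFOne. The class name ExtractedTextSearch, the method FindLines, and the sample lines are hypothetical and serve only as illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class ExtractedTextSearch
{
    // Returns the 1-based positions of the entries in 'lines' that contain
    // 'term', ignoring case. Entries that are not strings are skipped.
    public static List<int> FindLines(ArrayList lines, string term)
    {
        var hits = new List<int>();
        for (int i = 0; i < lines.Count; i++)
        {
            string line = lines[i] as string;
            if (line != null &&
                line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(i + 1);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Stand-in for the extracted text of a page (assumed sample data).
        var extracted = new ArrayList
        {
            "Mountain bikes are built for rough trails.",
            "Road models favour speed over comfort.",
            "Every Bike ships with a repair kit."
        };

        // Report matches in the order the lines appear in the list.
        foreach (int n in FindLines(extracted, "bike"))
        {
            Console.WriteLine("Match on line " + n + ": " + extracted[n - 1]);
        }
    }
}
```

Because the loop walks the list from the first entry to the last, matches are reported in the same order as the extracted lines. For the sample data, it reports lines 1 and 3.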
© 2002-2023 Gnostice Information Technologies Private Limited. All rights reserved.