PDFOne (for Java)
Create, edit, view, print & enhance PDF documents and forms in Java SE/EE
Compatibility: J2SE, J2EE, Windows, Linux, Mac (OS X)

PDF Text Search And PDF Text Extraction Using PDFOne (for Java)

Learn to search and extract text from PDF documents.
By V. Subhash

One of the new features that we introduced in Version 4 of PDFOne (for Java™) was text extraction. It was included because there were numerous requests for that feature from existing customers and trial users.

One of the methods that you can use to extract text is PdfDocument.search(), which is available in the following overloads.

List search(int startPageNum,
            String searchString,
            int searchMode,
            int searchOptions)

List search(String searchString,
            int pageNum,
            int searchMode,
            int searchOptions)

void search(String searchString,
            int searchMode,
            int searchOptions,
            PdfSearchHandler pdfSearchHandler,
            int startPageNum)

Simple Text Search

The search() method finds all instances of the search text and returns a list containing the results. The following code snippet demonstrates how to use this method.

import java.io.IOException;
import java.util.ArrayList;

import com.gnostice.pdfone.PdfDocument;
import com.gnostice.pdfone.PdfException;
import com.gnostice.pdfone.PdfSearchElement;
import com.gnostice.pdfone.PdfSearchMode;
import com.gnostice.pdfone.PdfSearchOptions;

public class Text_Search_Demo
{
    public static void main(String[] args) throws IOException, PdfException, Exception {

        // Load a PDF document
        PdfDocument doc = new PdfDocument();
        doc.load("Input_Docs\\input_doc.pdf");

        // Obtain all instances of the word "alcohol" in page 4
        ArrayList lstSearchResults1 =
           (ArrayList) doc.search("alcohol",              // search text
                                  4,                      // page number
                                  PdfSearchMode.LITERAL,
                                  PdfSearchOptions.NONE);
        // Close the document
        doc.close();

        // Iterate through all search results
        int n = lstSearchResults1.size();
        for (int i = 0; i < n; i++) {
            // Each item in the list is a PdfSearchElement
            PdfSearchElement pseResult = (PdfSearchElement) lstSearchResults1.get(i);
            // Print the matched text, its page number, and the line
            // that contains it to console output
            System.out.println("Found \"" +
                               pseResult.getMatchString() +
                               "\" in page #" +
                               pseResult.getPageNum() +
                               " text \"" +
                               pseResult.getLineContainingMatchString() +
                               "\"");
        }
    }
}

For testing this code snippet, we used this document.

And, here is the output.

PDF Text Search Using Regular Expressions

You can also perform advanced text searches using regular expressions. In the above code snippet, we can modify the call to the search() method so that it uses a regex that finds all text elements containing a hyperlink.

// Obtain all website addresses in page 2
ArrayList lstSearchResults1 =
              (ArrayList) doc.search("http://{1}",  // regular expression
                                     2,             // page number
                                     PdfSearchMode.REGEX,
                                     PdfSearchOptions.NONE);

Here is the output when we perform the text search using the regular expression.

Here is the document where the search was performed. Please note that a text element that contains multiple hyperlinks is printed once for each hyperlink it contains.

Original PDF Document
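
To see why a line with several hyperlinks produces several results, here is a minimal sketch that uses only the standard java.util.regex package and does not call PDFOne at all. It assumes that the REGEX search mode follows the same pattern syntax as java.util.regex. The class name, the findAll() helper, the sample line, and the second pattern are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchSketch {

    // Returns every substring of 'line' that matches 'regex', in order
    static List<String> findAll(String regex, String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical line of page text that contains two hyperlinks
        String line = "See http://www.example.com and http://www.example.org for details.";

        // In java.util.regex, the quantifier {1} applies only to the
        // preceding '/', so this pattern matches the literal text "http://"
        System.out.println(findAll("http://{1}", line));
        // prints [http://, http://]

        // A hypothetical pattern that matches each address in full
        System.out.println(findAll("https?://\\S+", line));
        // prints [http://www.example.com, http://www.example.org]
    }
}
```

Each match is reported separately, so one line of text can appear in the results more than once.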

PDF Text Extraction - All Text From A PDF Page

You may have noticed that the list returned by search() contains text elements in the order in which they were found in the PDF document. If you would like to preserve the order in which a human would read the text, then you need to use the saveAsText() method.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import com.gnostice.pdfone.PdfDocument;
import com.gnostice.pdfone.PdfException;

public class Text_Export_Demo {

  public static void main(String[] args) throws IOException, PdfException, Exception {

    // Create a writer that saves text to a file in UTF-8 encoding
    FileOutputStream fos = new FileOutputStream("Output_Docs\\extracted_text.txt");
    OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8");

    // Load a PDF document
    PdfDocument doc = new PdfDocument();
    doc.load("Input_Docs\\sample_doc.pdf");

    // Extract text from page 1 of the document
    // and save it to the file writer
    doc.saveAsText(1, osw);
    osw.close();

    // Close the PDF document
    doc.close();

  }
}
Original PDF Document and Extracted Text
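
Once the text is in a file, it can be processed with standard Java classes. The following is a minimal sketch, not part of PDFOne, that reads a UTF-8 text file and reports the lines that contain a given word. The class name, the linesContaining() helper, and the sample content are made up for illustration. A temporary file stands in for the extracted text so that the sketch can run on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractedTextSketch {

    // Returns the 1-based numbers of the lines in a UTF-8 text file
    // that contain 'word'
    static List<Integer> linesContaining(Path textFile, String word) throws IOException {
        List<Integer> hits = new ArrayList<>();
        List<String> lines = Files.readAllLines(textFile, StandardCharsets.UTF_8);
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(word)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file produced by saveAsText(); the content is made up
        Path textFile = Files.createTempFile("extracted_text", ".txt");
        Files.write(textFile,
                    Arrays.asList("First line of text",
                                  "Second line mentions alcohol",
                                  "Third line"),
                    StandardCharsets.UTF_8);

        System.out.println(linesContaining(textFile, "alcohol")); // prints [2]

        // Remove the temporary file
        Files.delete(textFile);
    }
}
```

The file is read with the same encoding that the writer used (UTF-8), so that non-ASCII characters in the extracted text are read back correctly.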

---o0O0o---
