Text extraction and text search

The document engines provide APIs for retrieving the text of a loaded document and for searching through that text. These APIs are available through the IDocument interface, so you can load a document and then extract or search its text without worrying about the document type. Text extraction and text search work on all supported document formats.

Text Extraction

The text present in a document can be retrieved in two ways:

1. As the entire text of the document, unformatted.

2. As paginated text, laid out the way a viewer would show it. Paginated text can be drilled down to each page, line, word, and character, and the bounding box occupied by the text can be retrieved at each level.

The code below retrieves the text of a document, first as a whole and then at each level of the paginated view.

    // Input document
    string inputFile = @"proposal.docx";
 
    // Load the document
    IDocument doc = DocumentManager.LoadDocument(inputFile, null);
 
    // Wait for loading to complete
    doc.LoadCompletedNotifier.WaitOne();
 
    // Get the document text object
    DocumentText docText = doc.GetDocumentText();
 
    // Complete unformatted text of the document
    string text = docText.Text;
 
    // Layout view of the text at each level (pages -> lines -> words -> characters)
 
    // Pages
    for (int pageIndex = 0; pageIndex < docText.Pages.Count; ++pageIndex)
    {
        // Text of page number (pageIndex + 1)
        PageText pageText = docText.Pages[pageIndex];
        text = pageText.Text;
 
        // Bounding box of text on page
        SimpleRect rect = pageText.BoundingRect;
 
        // Lines
        for (int lineIndex = 0; lineIndex < pageText.Lines.Count; ++lineIndex)
        {
            // Text of line number (lineIndex + 1)
            LineText lineText = pageText.Lines[lineIndex];
            text = lineText.Text;
 
            // Bounding box of line on page
            rect = lineText.BoundingRect;
 
            // Words
            for (int wordIndex = 0; wordIndex < lineText.Words.Count; ++wordIndex)
            {
                // Text of word number (wordIndex + 1)
                WordText wordText = lineText.Words[wordIndex];
                text = wordText.Text;
 
                // Bounding box of word on page
                rect = wordText.BoundingRect;
 
                // Characters
                for (int charIndex = 0; charIndex < wordText.Chars.Count; ++charIndex)
                {
                    // Text of char number (charIndex + 1)
                    CharText charText = wordText.Chars[charIndex];
                    text = charText.Text;
 
                    // Bounding box of character on page
                    rect = charText.BoundingRect;
                }
            }
        }
    }
 
    // Close the document
    doc.CloseDocument();

Text Search

The text search API supports the following features:

Searching for exact occurrences using a literal search string.

Searching for a text pattern by specifying the search term as a regular expression.

Starting the search from a specific position in the document by specifying the character position.

Searching in both forward and backward directions, starting at a specified position, the default position, or the last search position.

Specifying whether the search stops after going through the entire document or wraps around the document boundary and continues.

 

For literal text search, additional options such as whole word and case sensitivity can be specified.
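
To make the two literal-search options concrete, the sketch below uses only the standard .NET Regex class, not the Document Studio API. The LiteralSearchSketch class and its CountOccurrences method are hypothetical names for illustration. How the document engines implement these options is not described here, so treat the behavior shown as an assumption about the usual meaning of "whole word" and "case sensitive".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Document Studio .NET.
public static class LiteralSearchSketch
{
    // Counts occurrences of a literal term in plain text.
    public static int CountOccurrences(string text, string term, bool wholeWord, bool caseSensitive)
    {
        // Escape the term so it is matched literally, not as a pattern
        string pattern = Regex.Escape(term);

        // Whole word: require word boundaries on both sides
        // (assumes the term itself starts and ends with a word character)
        if (wholeWord)
        {
            pattern = @"\b" + pattern + @"\b";
        }

        RegexOptions options = caseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase;
        return Regex.Matches(text, pattern, options).Count;
    }

    public static void Main()
    {
        string sample = "The cat sat on the Category list; CAT scans cost more.";

        // Any occurrence, ignoring case: "cat", "Cat" (in "Category"), "CAT"
        Console.WriteLine(CountOccurrences(sample, "cat", false, false));

        // Whole words only, ignoring case: "cat", "CAT"
        Console.WriteLine(CountOccurrences(sample, "cat", true, false));

        // Any occurrence, case sensitive: "cat"
        Console.WriteLine(CountOccurrences(sample, "cat", false, true));
    }
}
```

With whole-word matching on, the "Cat" inside "Category" no longer counts. With case sensitivity on, only the lowercase "cat" does.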

 

Each occurrence of the specified search term is returned as a text search result. The result contains the following details:

The exact location of the occurrence in the document in terms of character indices.

The location of the occurrence in terms of the paginated view of the document such as the page index, the line index, the word index, and the character index within the word(s).

 

The code below searches a document for email addresses using a regular expression.

    // Input document
    string inputFile = @"proposal.docx";
 
    // Load the document
    IDocument doc = DocumentManager.LoadDocument(inputFile, null);
 
    // Wait for loading to complete
    doc.LoadCompletedNotifier.WaitOne();
 
    // Get the document text object
    DocumentText docText = doc.GetDocumentText();
 
    // Start the search at character position 10 in the document
    doc.CursorPosition = 10;
 
    // Search for email address by doing a pattern search
    TextSearchMode searchMode = TextSearchMode.Regex;
 
    // Regex for email address
    string regex = "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";
 
    // No additional search options
    TextSearchOptions searchOptions = TextSearchOptions.None;
 
    TextSearchResult searchResult = null;
    while ((searchResult = docText.FindNext(regex, searchMode, searchOptions, searchResult)) != null)
    {
        // The email address that was found
        string emailAddress = searchResult.SearchText;
 
        // The start character index of the search text within the entire document text
        int startIndex = searchResult.DocumentTextIndex;
 
        // The starting page index within the document where the search text was found
        int pageBeginIndex = searchResult.PageBeginIndex;
 
        // The ending page index within the document where the search text was found
        int pageEndIndex = searchResult.PageEndIndex;
 
        // The starting line index within the page (PageBeginIndex) where the search text was found
        int lineBeginIndex = searchResult.LineBeginIndex;
 
        // The ending line index within the page (PageEndIndex) where the search text was found
        int lineEndIndex = searchResult.LineEndIndex;
 
        // The starting word index within the line (LineBeginIndex) where the search text was found
        int wordBeginIndex = searchResult.WordBeginIndex;
 
        // The ending word index within the line (LineEndIndex) where the search text was found
        int wordEndIndex = searchResult.WordEndIndex;
 
        // The starting character index within the word (WordBeginIndex) where the search text was found
        int charBeginIndex = searchResult.CharBeginIndex;
 
        // The ending character index within the word (WordEndIndex) where the search text was found
        int charEndIndex = searchResult.CharEndIndex;
 
        // The index position where the search first started (in this case 10)
        int searchBeginIndex = searchResult.SearchBeginIndex;
 
        // Flag that indicates if the occurrence was found after a wraparound of the document boundary
        // In this example it will always be false since the Wraparound text search option is not set
        bool documentBoundaryCrossed = searchResult.DocumentBoundaryCrossed;
    }
 
    // Close the document
    doc.CloseDocument();
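
The email pattern can be tried on a plain string before running it against a document. The sketch below uses only the standard .NET Regex class. EmailPatternCheck and FindEmails are hypothetical names, and the sketch assumes the search API interprets the pattern the same way .NET regular expressions do, which this page does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Document Studio .NET.
public static class EmailPatternCheck
{
    // Same pattern as in the search example, written as a verbatim string
    // so the backslashes do not need to be doubled
    private const string EmailPattern =
        @"[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})";

    // Returns every substring of the text that matches the email pattern,
    // in the order the matches occur
    public static List<string> FindEmails(string text)
    {
        var found = new List<string>();
        foreach (Match match in Regex.Matches(text, EmailPattern))
        {
            found.Add(match.Value);
        }
        return found;
    }

    public static void Main()
    {
        string sample = "Contact sales@example.com or first.last@example.org today.";

        foreach (string email in FindEmails(sample))
        {
            Console.WriteLine(email);
        }
    }
}
```

Checking the pattern in isolation like this makes it easier to tell a pattern problem from a document-loading or search-option problem when a search returns no results.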