The document engines provide APIs for retrieving text from a loaded document and for searching through that text. These APIs are exposed on the IDocument interface, so you can load a document and fetch or search its text without worrying about the type of the document. Text extraction and text search can be performed on all supported document formats.
Text Extraction
The text present in a document can be retrieved in the following two ways:
1. The entire text of the document as unformatted text.
2. The text as a viewer would display it, that is, as paginated text. When text is retrieved this way, the API allows drilling down to each page, line, word, and character. At each level, the bounding box occupied by the text can also be retrieved.
The code below shows how to retrieve text from a document.
// Input document
string inputFile = @"proposal.docx";
// Load the document
IDocument doc = DocumentManager.LoadDocument(inputFile, null);
// Wait for loading to complete
doc.LoadCompletedNotifier.WaitOne();
// Get the document text object
DocumentText docText = doc.GetDocumentText();
// Complete unformatted text of the document
string text = docText.Text;
// Layout view of the text at each level (pages -> lines -> words -> characters)
// Pages
for (int pageIndex = 0; pageIndex < docText.Pages.Count; ++pageIndex)
{
    // Text of page number (pageIndex + 1)
    PageText pageText = docText.Pages[pageIndex];
    text = pageText.Text;
    // Bounding box of text on page
    SimpleRect rect = pageText.BoundingRect;
    // Lines
    for (int lineIndex = 0; lineIndex < pageText.Lines.Count; ++lineIndex)
    {
        // Text of line number (lineIndex + 1)
        LineText lineText = pageText.Lines[lineIndex];
        text = lineText.Text;
        // Bounding box of line on page
        rect = lineText.BoundingRect;
        // Words
        for (int wordIndex = 0; wordIndex < lineText.Words.Count; ++wordIndex)
        {
            // Text of word number (wordIndex + 1)
            WordText wordText = lineText.Words[wordIndex];
            text = wordText.Text;
            // Bounding box of word on page
            rect = wordText.BoundingRect;
            // Characters
            for (int charIndex = 0; charIndex < wordText.Chars.Count; ++charIndex)
            {
                // Text of char number (charIndex + 1)
                CharText charText = wordText.Chars[charIndex];
                text = charText.Text;
                // Bounding box of character on page
                rect = charText.BoundingRect;
            }
        }
    }
}
Text Search
The text search API supports the following features:
• Searching for exact occurrences using a literal search string.
• Searching for a text pattern by specifying the search term as a regular expression.
• Starting the search from a specific position in the document by specifying the character position.
• Searching in both forward and backward directions, starting at a specified position, the default position, or the last search position.
• Choosing whether the search stops after going through the entire document or continues by wrapping around the document.
For literal text search, additional options such as whole word and case sensitivity can be specified.
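To make the whole word and case sensitivity options concrete, the sketch below reproduces their usual meaning on a plain string with the standard .NET Regex class. It does not call the document engine: the class LiteralSearchSketch and its method CountMatches are hypothetical names used only for illustration, and the engine's own rules (for example, what it treats as a word boundary) may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the document engine API
public static class LiteralSearchSketch
{
    // Counts occurrences of a literal term in text.
    // wholeWord: count only matches delimited by regex word boundaries (\b);
    //            this assumes the term starts and ends with a word character.
    // matchCase: when false, letter case is ignored.
    public static int CountMatches(string text, string term, bool wholeWord, bool matchCase)
    {
        // Escape the term so regex metacharacters are treated literally
        string pattern = Regex.Escape(term);
        if (wholeWord)
        {
            pattern = @"\b" + pattern + @"\b";
        }
        RegexOptions options = matchCase ? RegexOptions.None : RegexOptions.IgnoreCase;
        return Regex.Matches(text, pattern, options).Count;
    }

    public static void Main()
    {
        string text = "The cat sat. Cats concatenate; the CAT naps.";
        Console.WriteLine(CountMatches(text, "cat", false, true));  // 2: "cat" and "con-cat-enate"
        Console.WriteLine(CountMatches(text, "cat", true, true));   // 1: only the standalone "cat"
        Console.WriteLine(CountMatches(text, "cat", false, false)); // 4: adds "Cat(s)" and "CAT"
        Console.WriteLine(CountMatches(text, "cat", true, false));  // 2: "cat" and "CAT"
    }
}
```

The four calls show how the two options narrow or widen the result set independently of each other.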
Each occurrence of the specified search term is returned as a text search result. The result contains the following details:
• The exact location of the occurrence in the document in terms of character indices.
• The location of the occurrence in terms of the paginated view of the document, that is, the page index, the line index, the word index, and the character index within the word(s).
The code below shows how to search for email addresses in a document.
// Input document
string inputFile = @"proposal.docx";
// Load the document
IDocument doc = DocumentManager.LoadDocument(inputFile, null);
// Wait for loading to complete
doc.LoadCompletedNotifier.WaitOne();
// Get the document text object
DocumentText docText = doc.GetDocumentText();
// Start the search at character position 10 in the document
doc.CursorPosition = 10;
// Search for email address by doing a pattern search
TextSearchMode searchMode = TextSearchMode.Regex;
// Regex for email address
string regex = "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";
// No additional search options
TextSearchOptions searchOptions = TextSearchOptions.None;
TextSearchResult searchResult = null;
// Pass the previous result back in so that each call continues from the last occurrence
while ((searchResult = docText.FindNext(regex, searchMode, searchOptions, searchResult)) != null)
{
    // The email address that was found
    string emailAddress = searchResult.SearchText;
    // The start character index of the search text within the entire document text
    int startIndex = searchResult.DocumentTextIndex;
    // The starting page index within the document where the search text was found
    int pageBeginIndex = searchResult.PageBeginIndex;
    // The ending page index within the document where the search text was found
    int pageEndIndex = searchResult.PageEndIndex;
    // The starting line index within the page (PageBeginIndex) where the search text was found
    int lineBeginIndex = searchResult.LineBeginIndex;
    // The ending line index within the page (PageEndIndex) where the search text was found
    int lineEndIndex = searchResult.LineEndIndex;
    // The starting word index within the line (LineBeginIndex) where the search text was found
    int wordBeginIndex = searchResult.WordBeginIndex;
    // The ending word index within the line (LineEndIndex) where the search text was found
    int wordEndIndex = searchResult.WordEndIndex;
    // The starting character index within the word (WordBeginIndex) where the search text was found
    int charBeginIndex = searchResult.CharBeginIndex;
    // The ending character index within the word (WordEndIndex) where the search text was found
    int charEndIndex = searchResult.CharEndIndex;
    // The index position where the search first started (in this case 10)
    int searchBeginIndex = searchResult.SearchBeginIndex;
    // Flag that indicates if the occurrence was found after a wraparound of the document boundary
    // In this example it will always be false since the Wraparound text search option is not set
    bool documentBoundaryCrossed = searchResult.DocumentBoundaryCrossed;
}
// Close the document
doc.CloseDocument();
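Before passing a pattern to FindNext, it can help to try it on a plain string first. The sketch below runs the same email pattern through the standard .NET Regex class. It assumes the engine's regular expression syntax is compatible with .NET's, which this document does not state, and the class EmailPatternSketch and its method FindEmails are hypothetical names that are not part of the document engine API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out a search pattern outside the engine
public static class EmailPatternSketch
{
    // Same pattern as in the search example above
    public const string EmailPattern =
        "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";

    // Returns each match as (start character index, matched text)
    public static List<KeyValuePair<int, string>> FindEmails(string text)
    {
        var results = new List<KeyValuePair<int, string>>();
        foreach (Match m in Regex.Matches(text, EmailPattern))
        {
            results.Add(new KeyValuePair<int, string>(m.Index, m.Value));
        }
        return results;
    }

    public static void Main()
    {
        string text = "Contact jane.doe@example.com or sales@example.co.uk today.";
        foreach (KeyValuePair<int, string> hit in FindEmails(text))
        {
            // Prints the start index and the address for each match
            Console.WriteLine(hit.Key + ": " + hit.Value);
        }
    }
}
```

The start index returned here plays a role similar to DocumentTextIndex in the search result, although the engine reports its index against the full document text rather than an arbitrary string.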