
How to Extract Text from PDF Documents Based on Columns inside .NET Apps

by sher azam   on Jan 27, 2016   Category: C#


This technical tip explains how to extract text from PDF documents based on columns inside .NET applications. A PDF file may contain text, images, annotations, attachments, graphs and other elements, and Aspose.Pdf for .NET can add and manipulate all of them. The API is particularly useful for adding and extracting text. When a PDF document has more than one column (a multi-column layout) and the page contents must be extracted while honoring that layout, Aspose.Pdf for .NET can accomplish this requirement.

One approach is to reduce the font size of the contents inside the PDF document and then perform text extraction. The first pair of code snippets below shows this approach.

Another approach uses ScaleFactor. Several improvements have been introduced in TextAbsorber and in the internal text formatting mechanism, so when extracting text in ‘Pure’ mode you may now specify the ScaleFactor option as an alternative to reducing the font size. The scale factor adjusts the grid that the internal text formatting mechanism uses during text extraction. Specifying ScaleFactor values between 1 and 0.1 (including 0.1) has the same effect as reducing the font size.

//The following code snippet shows the steps to reduce text size and then try extracting text from PDF document.

//[C# Code Sample]


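using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

string path = "D:\\Temp\\";
// InitLicense() is not defined in this snippet; presumably a helper that applies the Aspose.Pdf license
InitLicense();
// Load the source PDF document
Document pdfDocument = new Document(path + "net_New-age NED's.pdf");

// Collect every text fragment in the document
TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
TextFragmentCollection tfc = tfa.TextFragments;
foreach (TextFragment tf in tfc)
{
    // Reduce the font size to 70% of its original value (the tip recommends at least this much reduction)
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7f;
}
// Save the PDF with the reduced font size to a temporary stream and reload it
Stream st = new MemoryStream();
pdfDocument.Save(st);
pdfDocument = new Document(st);

// Extract text from the reloaded document; Pages.Accept already visits every page,
// so no further Visit call is needed once Text has been read
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;

System.IO.File.WriteAllText(path + "Extracted.txt", extractedText);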

// [VB.NET Code Sample]


Imports System.IO
Imports Aspose.Pdf

' VB.NET string literals do not use backslash escapes, so single backslashes are correct here
Dim path As String = "D:\Temp\"
' Instantiate Document object
Dim pdfDocument As Document = New Document(path + "net_New-age NED's.pdf")

' Collect every text fragment in the document
Dim tfa As Aspose.Pdf.Text.TextFragmentAbsorber = New Aspose.Pdf.Text.TextFragmentAbsorber()
pdfDocument.Pages.Accept(tfa)
Dim tfc As Aspose.Pdf.Text.TextFragmentCollection = tfa.TextFragments
For Each tf As Aspose.Pdf.Text.TextFragment In tfc
    ' Reduce the font size to 70% of its original value (the tip recommends at least this much reduction)
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7F
Next
' Create temporary stream object
Dim st As Stream = New MemoryStream()
' Save PDF file with reduced font size
pdfDocument.Save(st)
' Instantiate Document object with stream instance
pdfDocument = New Document(st)

' Extract text from the reloaded document; Pages.Accept already visits every page,
' so no further Visit call is needed once Text has been read
Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text

System.IO.File.WriteAllText(path + "Extracted.txt", extractedText)

//Second approach - Using ScaleFactor

//[C# Code Sample]


// inputFile and outFile are placeholders for the source PDF path and the output text file path
Document pdfDocument = new Document(inputFile);

TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
// Setting the scale factor to 0.5 is enough to split columns in the majority of documents
// Setting it to zero lets the algorithm choose the scale factor automatically
textAbsorber.ExtractionOptions.ScaleFactor = 0.5; /* or 0 */
pdfDocument.Pages.Accept(textAbsorber);
String extractedText = textAbsorber.Text;

System.IO.File.WriteAllText(outFile, extractedText);

// [VB.NET Code Sample]


' inputFile and outFile are placeholders for the source PDF path and the output text file path
Dim pdfDocument As Document = New Document(inputFile)

Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
textAbsorber.ExtractionOptions = New TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure)
' Setting the scale factor to 0.5 is enough to split columns in the majority of documents
' Setting it to zero lets the algorithm choose the scale factor automatically
textAbsorber.ExtractionOptions.ScaleFactor = 0.5 ' or 0
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text

System.IO.File.WriteAllText(outFile, extractedText)
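
Because the tip only says that 0.5 works for the majority of documents, one way to choose a value for a particular file is to extract with several ScaleFactor settings and compare the output files by eye. The following is a minimal sketch of that idea, not part of the original tip. It assumes the same TextAbsorber and TextExtractionOptions members used in the samples above; the file names and the list of candidate values are hypothetical, and the `using` lines assume the types live in `Aspose.Pdf` and `Aspose.Pdf.Text`, which may differ between library versions.

```csharp
using System.Globalization;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ScaleFactorComparison
{
    static void Main()
    {
        // Hypothetical input path; replace with a real multi-column PDF
        string inputFile = "input.pdf";

        // Candidate values (an assumption for illustration):
        // 0 = automatic choice, 1.0 = no reduction, 0.5 and 0.1 = stronger reduction
        double[] candidates = { 0, 1.0, 0.5, 0.1 };

        Document pdfDocument = new Document(inputFile);

        foreach (double scale in candidates)
        {
            // A fresh absorber per run so the text of earlier runs is not carried over
            TextAbsorber textAbsorber = new TextAbsorber();
            textAbsorber.ExtractionOptions =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
            textAbsorber.ExtractionOptions.ScaleFactor = scale;
            pdfDocument.Pages.Accept(textAbsorber);

            // One output file per candidate, e.g. Extracted_0.5.txt
            string suffix = scale.ToString("0.0", CultureInfo.InvariantCulture);
            File.WriteAllText("Extracted_" + suffix + ".txt", textAbsorber.Text);
        }
    }
}
```

Opening the resulting text files side by side shows which setting keeps the columns apart for that document; the chosen value can then be hard-coded as in the samples above.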

More about Aspose.Pdf for .NET

- Homepage of Aspose.Pdf for .NET: http://www.aspose.com/.net/pdf-component.aspx

- Read More about Working with Document Conversion: http://www.aspose.com/docs/display/pdfnet/Working+with+Document+Conversion


