Getting Started with PDF to Text Converter for .NET
The PDF to Text Converter can be used in any type of .NET application to extract the text from a PDF document. There are two ways to install the ExpertPdf PdfToText library:
Using a NuGet package (.NET Framework and .NET Core)
Downloading assemblies
The easiest way to install ExpertPdf PdfToText is to use a NuGet package. Create your project, open the NuGet Package Manager and install one of the following packages:
.NET Framework and .NET Core version - AnyCPU - ExpertPdf.PdfToText: https://www.nuget.org/packages/ExpertPdf.PdfToText/.
.NET Framework and .NET Core version - x64 optimized version - ExpertPdf.PdfToText.x64: https://www.nuget.org/packages/ExpertPdf.PdfToText.x64/.
Note 1: If the x64 version is used, the application must target the x64 platform and must run in an x64 worker process.
Note 2: The .NET Core version targets .NET Core 2.0 or above through .NET Standard 2.0. It currently requires a Windows system to run. It does not work on Linux, Mac or Xamarin.
The product archive contains the development libraries for .NET 2.0 and .NET 4.0 and a ready-to-use Windows Forms application. The full C# and VB.NET source code for the sample is available in the Samples folder.
Here are the steps needed to get started with the library when referencing the assembly directly:
Add eppdftotext.dll and eptools.dep to the bin folder of your application.
Add a reference in the project to eppdftotext.dll.
Write your code (see the samples and the API reference for help).
The PDF to Text Converter development library 'eppdftotext.dll' is a strong-named .NET 2.0 or .NET 4.0 assembly for 32-bit or 64-bit operating systems. It can be linked into any .NET application: Windows Forms and console applications as well as ASP.NET 2.0 or 4.0 web sites.
The main class in the assembly is PdfToTextConverter. This class exposes an overloaded ConvertToText() method that you can use to extract text from a PDF stream or from a PDF file. Each overload returns a .NET String object that you can, for example, save to a file on disk, as the command line sample does.
The PdfToTextConverter class also defines a few properties that control the text extraction process. For example, the StartPageNumber and EndPageNumber properties specify the range of PDF pages to extract, and the UserPassword property supplies the user password needed to open a password-protected PDF document. You can also instruct the converter to wrap the extracted text in an HTML document whose meta tags carry the data from the PDF document description, such as title, subject, keywords and author name (the AddHtmlMetaTags property used in the command line sample).
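The properties described above could be combined as in the following minimal sketch. It assumes that the library's types live in the ExpertPdf.PdfToText namespace and that StartPageNumber and EndPageNumber are 1-based integers. Neither detail is stated on this page, so check the API reference. The file path and password are placeholders.

```csharp
using System;
using ExpertPdf.PdfToText; // assumed namespace; verify against the API reference

class PageRangeExample
{
    static void Main()
    {
        PdfToTextConverter converter = new PdfToTextConverter();

        // Extract only pages 2 through 4 (assumes 1-based page numbers)
        converter.StartPageNumber = 2;
        converter.EndPageNumber = 4;

        // Password used to open a protected document (placeholder value)
        converter.UserPassword = "user-password";

        // Wrap the text in an HTML document that carries the PDF metadata as meta tags
        converter.AddHtmlMetaTags = true;

        // Placeholder path to the source PDF
        string text = converter.ConvertToText(@"C:\Docs\Sample.pdf");
        Console.WriteLine(text);
    }
}
```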
With the Layout property, the converter can be instructed to output the text either in the original PDF layout (the default) or in reading order. If a PDF page has, for example, two columns of text, the original layout keeps the two columns side by side in the resulting text, while reading order produces two blocks of text, one after the other, in a flow layout.
Another useful option is to mark the page breaks in the resulting text with a special character by setting the MarkPageBreaks property. The character used to mark the page breaks is given by the PAGE_BREAK_MARK static property of the PdfToTextConverter class.
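Once page breaks are marked, the extracted text can be split back into per-page chunks. The sketch below is library-independent: SplitPages is a hypothetical helper, and the form feed character in the demo is only a stand-in for the real mark. In an application you would pass PdfToTextConverter.PAGE_BREAK_MARK, assuming it is exposed as a char as the description above suggests.

```csharp
using System;

static class PageSplitter
{
    // Hypothetical helper: splits extracted text into chunks at each page break mark.
    public static string[] SplitPages(string extractedText, char pageBreakMark)
    {
        return extractedText.Split(pageBreakMark);
    }
}

class PageSplitDemo
{
    static void Main()
    {
        // Stand-in for converter output; '\f' (form feed) is only an assumed mark.
        // With the library, pass PdfToTextConverter.PAGE_BREAK_MARK instead.
        string extractedText = "Page one text\fPage two text\fPage three text";

        string[] pages = PageSplitter.SplitPages(extractedText, '\f');

        for (int i = 0; i < pages.Length; i++)
        {
            Console.WriteLine("Chunk {0}: {1}", i + 1, pages[i].Trim());
        }
    }
}
```

Whether the converter also writes a mark after the last page is not stated here. If it does, the final chunk will be empty and can be dropped.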
Below is an example of code taken from the command line sample application:
    // create pdf to text converter
    PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

    // set converter options
    pdfToTextConverter.Layout = layout;
    pdfToTextConverter.MarkPageBreaks = markPageBreaks;
    pdfToTextConverter.AddHtmlMetaTags = addHtmlMetaTags;
    pdfToTextConverter.UserPassword = userPassword;

    // get output file path
    string outFileName = System.IO.Path.Combine(
        System.IO.Path.GetDirectoryName(srcPdfFile),
        System.IO.Path.GetFileNameWithoutExtension(srcPdfFile));
    if (addHtmlMetaTags)
    {
        outFileName += ".html";
    }
    else
    {
        outFileName += ".txt";
    }

    // extract text from PDF
    string extractedText = pdfToTextConverter.ConvertToText(srcPdfFile);

    // write the resulted string into an output file in the working directory using UTF-8 encoding
    System.IO.File.WriteAllText(outFileName, extractedText, System.Text.Encoding.UTF8);
In this sample, an instance of the PdfToTextConverter class is constructed and the converter properties are set from the command line arguments. The ConvertToText() method is then called to extract the text from the source PDF document, and the resulting text is saved to a file on disk in text or HTML format using the UTF-8 encoding.
The PDF to Text Converter can also be used to search for text in a PDF document. To do this, use the ExtractTextPositions() methods of the PdfToTextConverter class.
The ExtractTextPositions() methods take several parameters that customize the search: the text to search for, whether the search is case sensitive, and whether only full words are matched.
The ExtractTextPositions() methods return an array of text positions. Each position gives the page, X, Y, width and height of a location where the text was found.
Below is an example of code that extracts text positions from a PDF document and then uses those positions and the ExpertPdf Html To Pdf Converter tool to highlight the found text in the PDF document:
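    // find all occurrences of the search text (both boolean search options are off)
    TextPosition[] positions = pdfToTextConverter.ExtractTextPositions(
        pdfFileName, txtTextToSearch.Text, false, false);

    // open the same PDF with the ExpertPdf Html To Pdf Converter document API
    ExpertPdf.HtmlToPdf.PdfDocument.Document doc =
        new ExpertPdf.HtmlToPdf.PdfDocument.Document(pdfFileName);

    for (int i = 0; i < positions.Length; i++)
    {
        // draw a semi-transparent yellow rectangle over the found text
        ExpertPdf.HtmlToPdf.PdfDocument.RectangleElement rect =
            new ExpertPdf.HtmlToPdf.PdfDocument.RectangleElement(
                positions[i].X, positions[i].Y,
                positions[i].Width, positions[i].Height);
        rect.BackColor = Color.Yellow; // Color is presumably System.Drawing.Color
        rect.Transparency = 25;

        // PageNumber - 1 converts the page number into an index into doc.Pages
        doc.Pages[positions[i].PageNumber - 1].AddElement(rect);
    }

    // save the highlighted document
    outputFileName = outputFileName + ".pdf";
    doc.Save(outputFileName);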
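If you only need to report where the text occurs, without highlighting it, the returned positions can simply be listed. This is a minimal sketch that reuses the four-argument call from the sample above. The ExpertPdf.PdfToText namespace, the file path and the search text are assumptions, and the order of the two boolean options is not confirmed on this page, so both are left false as in the sample.

```csharp
using System;
using ExpertPdf.PdfToText; // assumed namespace; verify against the API reference

class SearchExample
{
    static void Main()
    {
        PdfToTextConverter converter = new PdfToTextConverter();

        // Arguments: source file, text to search, and two boolean search options
        // (placeholder path and search text)
        TextPosition[] positions = converter.ExtractTextPositions(
            @"C:\Docs\Sample.pdf", "invoice", false, false);

        Console.WriteLine("Found {0} occurrence(s)", positions.Length);

        // Print the page and bounding box of every match
        foreach (TextPosition position in positions)
        {
            Console.WriteLine("Page {0}: x={1}, y={2}, width={3}, height={4}",
                position.PageNumber, position.X, position.Y,
                position.Width, position.Height);
        }
    }
}
```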
The PDF To Text Converter Command Line Tool is a simple application built on the development libraries. The command line syntax is:
pdftotextcmd.exe /pdf:source_pdf_file [/layout:reading|original] [/pagebreaks] [/html] [/userpswd:user_password]
With the command line syntax you can specify the PDF file to be converted and the text extraction options described above. The full path of the source file must be specified in the argument list. If the PDF file path contains spaces, quote the file name (e.g. /pdf:"C:\My Documents\MyPdfFile.pdf").
Command line example:
pdftotextcmd.exe /pdf:source_file.pdf
The LicenseKey property of the PdfToTextConverter class should be set to the license key string you received after purchasing the product.
The trial version of the PdfToText tool extracts text only from the first 5 pages of the PDF document. If you need a fully featured evaluation for a limited period of time, you can use the demo license key available on the product download page or request a new one by email from the product support team.
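Setting the license key might look like the following minimal sketch. It assumes that LicenseKey is an instance property set before conversion and that the types live in the ExpertPdf.PdfToText namespace. The key and the file path are placeholders.

```csharp
using System;
using ExpertPdf.PdfToText; // assumed namespace; verify against the API reference

class LicenseExample
{
    static void Main()
    {
        PdfToTextConverter converter = new PdfToTextConverter();

        // Placeholder; replace with your purchased or demo license key
        converter.LicenseKey = "your-license-key";

        // Placeholder path to the source PDF
        string text = converter.ConvertToText(@"C:\Docs\Sample.pdf");
        Console.WriteLine("Extracted {0} characters", text.Length);
    }
}
```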