Manipulating PDF files with iTextSharp and VB.NET 2012

Introduction

Recently, I had to write a VB.NET program that reads the contents of a PDF file and replaces them with customized text. Unfortunately, VB.NET doesn’t have a built-in PDF reader object, so I had to use a third-party library called iTextSharp. From the moment I started using it, I fell in love with it. In this article I will demonstrate how to use iTextSharp with VB.NET to manipulate PDF files.

PDF files

A detailed explanation of PDF files can be found here.

iTextSharp

A detailed explanation of iTextSharp, along with the download, can be found here. As you can see, most of the material on iTextSharp targets C# and Java; hence this Visual Basic.NET article.

I suggest that you go through the documentation properly before proceeding with our project. I cannot do everything for you; you need to put in some effort as well.

Our Project

Purpose

Our project’s aim is to read from a PDF file, change some of the contents, and then add a watermark to the PDF document’s pages. Sounds easy enough? With the help of the iTextSharp library, you will see how simple it is.

Design

Our project doesn’t have much of a design. All we need is a progress bar and a button. Name the progress bar pbProgress and the button Start, because the code refers to them by those names. Mine looks like Figure 1:

Figure 1: Our Design

Code

Before we can jump in and code, you need to make sure that you have downloaded the iTextSharp libraries. Once that is done, we need to add a reference to it by clicking Project->Add Reference->iTextSharp.dll. Once we have the project reference set up, we need to reference the iTextSharp libraries in our code. Add the following Imports statements:

Imports System.IO 'Working With Files
Imports System.Text 'Working With Text

'iTextSharp Libraries
Imports iTextSharp.text 'Core PDF Text Functionalities
Imports iTextSharp.text.pdf 'PDF Content
Imports iTextSharp.text.pdf.parser 'Content Parser

This imports all the needed capabilities for our little program. Now the fun starts! Add the following Sub Procedure:

    Public Sub ReplacePDFText(ByVal strSearch As String, ByVal scCase As StringComparison, ByVal strSource As String, ByVal strDest As String)

        Dim psStamp As PdfStamper = Nothing 'PDF Stamper Object
        Dim pcbContent As PdfContentByte = Nothing 'Read PDF Content

        If File.Exists(strSource) Then 'Check If File Exists

            Dim pdfFileReader As New PdfReader(strSource) 'Read Our File

            psStamp = New PdfStamper(pdfFileReader, New FileStream(strDest, FileMode.Create)) 'Read Underlying Content of PDF File

            pbProgress.Value = 0 'Set Progressbar Minimum Value
            pbProgress.Maximum = pdfFileReader.NumberOfPages 'Set Progressbar Maximum Value

            For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop Through All Pages

                Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy 'Read PDF File Content Blocks

                pcbContent = psStamp.GetUnderContent(intCurrPage) 'Look At Current Block

                'Determine Spacing of Block To See If It Matches Our Search String
                lteStrategy.UndercontentCharacterSpacing = pcbContent.CharacterSpacing
                lteStrategy.UndercontentHorizontalScaling = pcbContent.HorizontalScaling

                'Trigger The Block Reading Process
                Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

                'Determine Match(es)
                Dim lstMatches As List(Of iTextSharp.text.Rectangle) = lteStrategy.GetTextLocations(strSearch, scCase)

                Dim pdLayer As PdfLayer 'Create New Layer
                pdLayer = New PdfLayer("Overwrite", psStamp.Writer) 'Layer That Will Hold The Replacement Text

                'Set Fill Colour Of Replacing Layer
                pcbContent.SetColorFill(BaseColor.BLACK)

                For Each rctRect As Rectangle In lstMatches 'Loop Through Each Match

                    pcbContent.Rectangle(rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height) 'Create New Rectangle For Replacing Layer

                    pcbContent.Fill() 'Fill With Colour Specified

                    pcbContent.BeginLayer(pdLayer) 'Create Layer

                    pcbContent.SetColorFill(BaseColor.BLACK) 'Fill Layer

                    pcbContent.Fill() 'Fill Underlying Content

                    Dim pgState As PdfGState 'Create GState Object
                    pgState = New PdfGState()

                    pcbContent.SetGState(pgState) 'Set Current State

                    pcbContent.SetColorFill(BaseColor.WHITE) 'Fill Letters

                    pcbContent.BeginText() 'Start Text Replace Procedure

                    pcbContent.SetTextMatrix(rctRect.Left, rctRect.Bottom) 'Get Text Location

                    'Set New Font And Size
                    pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)

                    pcbContent.ShowText("AMAZING!!!!") 'Replacing Text

                    pcbContent.EndText() 'Stop Text Replace Procedure

                    pcbContent.EndLayer() 'Stop Layer replace Procedure

                Next

                pbProgress.Value = pbProgress.Value + 1 'Increase Progressbar Value

            Next

            psStamp.Close() 'Close Stamp Object (Writes The Output File)

            pdfFileReader.Close() 'Close File Only After All Pages Have Been Processed

        End If

        'Add Watermark To The File We Have Just Written
        AddPDFWatermark(strDest, "C:\test_Watermarked_and_Replaced.pdf", Application.StartupPath & "\Anuba.jpg")

    End Sub

Oye! What a mouthful!

Before you freak out, this code is actually not so bad. Let’s have a look at it step by step:

  1. We create a Stamper object and a Content object. The Stamper object enables us to write our content onto the PDF file. The Content object helps us identify the content in the file that we need to replace.
  2. We determine whether the PDF file exists, and read its underlying content. We also set up our ProgressBar so that its maximum matches the number of pages in the PDF document.
  3. We commence our For loop (to loop through each page) and create a LocTextExtractionStrategy object, which is based on iTextSharp’s LocationTextExtractionStrategy class. This object enables us to extract our desired text. We need to add this class to our project, but we’ll do that a bit later.
  4. Once we know what text we need, and what dimensions the text uses, we look for matches on the current page and continue through all the pages. We store each match and create a new layer to hold the replacement text.
  5. We then cover the found text with a filled rectangle and write our new text over it, in order to highlight our change. The trick here is to use the exact dimensions of the original text. A PDF file does not work like a Word document, where we could just find and replace text. Why? Because each little word or phrase is actually a block, or a layer; so, to replace that particular block, we need its exact dimensions. If we do not have the exact dimensions, the new text will not appear in exactly the same place.
  6. Lastly, we include a call to the AddPDFWatermark sub (which we will create now) to add a watermark on each page. The file that is written will be stored on the C:\ drive.

Make sense now?

Add the next Sub procedure:

    Public Shared Sub AddPDFWatermark(ByVal strSource As String, ByVal strDest As String, ByVal imgSource As String)

        Dim pdfFileReader As PdfReader = Nothing 'Read File
        Dim psStamp As PdfStamper = Nothing 'PDF Stamper Object
        Dim imgWaterMark As Image = Nothing 'Watermark Image

        Dim pcbContent As PdfContentByte = Nothing 'Read PDF Content
        Dim rctRect As Rectangle = Nothing 'Create New Rectangle To Host Image

        Dim sngX, sngY As Single 'Page Dimensions

        Dim intPageCount As Integer = 0 'Page Count

        Try
            pdfFileReader = New PdfReader(strSource) 'Read File

            rctRect = pdfFileReader.GetPageSizeWithRotation(1) 'Store Page Size

            psStamp = New PdfStamper(pdfFileReader, New System.IO.FileStream(strDest, IO.FileMode.Create)) 'Create new Stamper Object

            imgWaterMark = Image.GetInstance(imgSource) 'Get Image To Be Used For The Watermark

            If imgWaterMark.Width > rctRect.Width OrElse imgWaterMark.Height > rctRect.Height Then 'Make Sure Image Can Fit On Page

                imgWaterMark.ScaleToFit(rctRect.Width, rctRect.Height)
                sngX = (rctRect.Width - imgWaterMark.ScaledWidth) / 2
                sngY = (rctRect.Height - imgWaterMark.ScaledHeight) / 2

            Else 'Put In Center Of Page

                sngX = (rctRect.Width - imgWaterMark.Width) / 2
                sngY = (rctRect.Height - imgWaterMark.Height) / 2

            End If

            imgWaterMark.SetAbsolutePosition(sngX, sngY)

            intPageCount = pdfFileReader.NumberOfPages() 'Apply To All Pages

            For i As Integer = 1 To intPageCount
                pcbContent = psStamp.GetUnderContent(i)
                pcbContent.AddImage(imgWaterMark)
            Next

            psStamp.Close()
            pdfFileReader.Close()

        Catch ex As Exception

            Throw 'Something Went Wrong; Rethrow Without Resetting The Stack Trace

        End Try

    End Sub

This sub adds a watermark to each PDF page. You will notice that we do almost the same here as we did in the previous sub. The only difference is that we add an image to the undercontent of each page, instead of replacing text layers.
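The placement arithmetic in AddPDFWatermark can be tried on its own, without iTextSharp. The sketch below is only an illustration: CenterOnPage is a hypothetical helper that is not part of the project, and it assumes that scaling to fit means shrinking the image by the smaller of the two width and height ratios so that the aspect ratio is preserved.

```vb
Imports System

Module WatermarkPlacementSketch

    'Hypothetical helper (not part of iTextSharp): returns x, y, width and height
    'of an image centered on a page, shrinking the image first if it does not fit.
    Public Function CenterOnPage(ByVal pageWidth As Single, ByVal pageHeight As Single,
                                 ByVal imgWidth As Single, ByVal imgHeight As Single) As Single()

        Dim factor As Single = 1.0F

        If imgWidth > pageWidth OrElse imgHeight > pageHeight Then
            'Assumption: keep the aspect ratio by using the smaller ratio
            factor = Math.Min(pageWidth / imgWidth, pageHeight / imgHeight)
        End If

        Dim w As Single = imgWidth * factor
        Dim h As Single = imgHeight * factor

        'Bottom-left corner that centers the (possibly scaled) image
        Return New Single() {(pageWidth - w) / 2.0F, (pageHeight - h) / 2.0F, w, h}

    End Function

    Sub Main()
        'A 595 x 842 point page and a 1190 x 842 image: the factor is 0.5,
        'so the image becomes 595 x 421 and sits at x = 0, y = 210.5
        Dim r As Single() = CenterOnPage(595.0F, 842.0F, 1190.0F, 842.0F)
        Console.WriteLine("x={0}, y={1}, width={2}, height={3}", r(0), r(1), r(2), r(3))
    End Sub

End Module
```

Note that the sub above measures only the first page (GetPageSizeWithRotation(1)), so the watermark is centered for that page size on every page.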

The last piece of code we need to add for this form is the call to the ReplacePDFText sub from our start button:

    Private Sub Start_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Start.Click

        ReplacePDFText("just a simple test", _
                       StringComparison.CurrentCultureIgnoreCase, _
                       Application.StartupPath & "\test.pdf", _
                       "C:\test_words_replaced.pdf") 'Do Everything

    End Sub

This calls the sub to replace the PDF content, and writes the new PDF file to a location on C:\. Because the watermark step writes a file of its own, we now have two files. Obviously, this is just an example, and it would be easy to combine all of the changes into one file.
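One possible way to end up with a single output file is to stamp the watermark through the PdfStamper that ReplacePDFText already has open. This is only a sketch under assumptions: AddWatermarkToStamper is a hypothetical helper, not part of iTextSharp or of the project above, and it uses only the iTextSharp calls that already appear in AddPDFWatermark.

```vb
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Public Module SinglePassSketch

    'Hypothetical helper: adds the watermark through a stamper that is already open,
    'so no intermediate file is needed. The caller still closes stamper and reader.
    Public Sub AddWatermarkToStamper(ByVal pdfFileReader As PdfReader, ByVal psStamp As PdfStamper, ByVal imgSource As String)

        Dim imgWaterMark As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imgSource)
        Dim rctRect As iTextSharp.text.Rectangle = pdfFileReader.GetPageSizeWithRotation(1)

        Dim sngX, sngY As Single

        If imgWaterMark.Width > rctRect.Width OrElse imgWaterMark.Height > rctRect.Height Then
            'Shrink the image so that it fits on the page, then center it
            imgWaterMark.ScaleToFit(rctRect.Width, rctRect.Height)
            sngX = (rctRect.Width - imgWaterMark.ScaledWidth) / 2.0F
            sngY = (rctRect.Height - imgWaterMark.ScaledHeight) / 2.0F
        Else
            sngX = (rctRect.Width - imgWaterMark.Width) / 2.0F
            sngY = (rctRect.Height - imgWaterMark.Height) / 2.0F
        End If

        imgWaterMark.SetAbsolutePosition(sngX, sngY)

        For i As Integer = 1 To pdfFileReader.NumberOfPages
            psStamp.GetUnderContent(i).AddImage(imgWaterMark)
        Next

    End Sub

End Module
```

With a helper like this, ReplacePDFText could call AddWatermarkToStamper(pdfFileReader, psStamp, Application.StartupPath & "\Anuba.jpg") just before psStamp.Close() and drop the separate AddPDFWatermark call. Because the replacement rectangles are also drawn on the undercontent, the order of the two steps decides which one ends up on top.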

LocationTextExtractionStrategy

A full explanation can be found here.

This file forms part of the iTextSharp download I mentioned earlier. We need to add this file, as is, to our project. Remember, neither you nor I created this file or its logic. But without this file we will not be able to identify the content strings we are looking for. This demonstrates the real power of iTextSharp, and this is why iTextSharp is my preferred choice when it comes to doing any PDF manipulation.

Add a new class and add the following to it (in case you didn’t download the iTextSharp files at the location I’ve mentioned):

Imports System
Imports System.Collections.Generic
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

''
'' * $Id$
'' *
'' * This file is part of the iText project.
'' * Copyright (c) 1998-2009 1T3XT BVBA
'' * Authors: Kevin Day, Bruno Lowagie, Paulo Soares, et al.
'' *
'' * This program is free software; you can redistribute it and/or modify
'' * it under the terms of the GNU Affero General Public License version 3
'' * as published by the Free Software Foundation with the addition of the
'' * following permission added to Section 15 as permitted in Section 7(a):
'' * FOR ANY PART OF THE COVERED WORK IN WHICH THE COPYRIGHT IS OWNED BY 1T3XT,
'' * 1T3XT DISCLAIMS THE WARRANTY OF NON INFRINGEMENT OF THIRD PARTY RIGHTS.
'' *
'' * This program is distributed in the hope that it will be useful, but
'' * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
'' * or FITNESS FOR A PARTICULAR PURPOSE.
'' * See the GNU Affero General Public License for more details.
'' * You should have received a copy of the GNU Affero General Public License
'' * along with this program; if not, see http://www.gnu.org/licenses or write to
'' * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
'' * Boston, MA, 02110-1301 USA, or download the license from the following URL:
'' * http://itextpdf.com/terms-of-use/
'' *
'' * The interactive user interfaces in modified source and object code versions
'' * of this program must display Appropriate Legal Notices, as required under
'' * Section 5 of the GNU Affero General Public License.
'' *
'' * In accordance with Section 7(b) of the GNU Affero General Public License,
'' * you must retain the producer line in every PDF that is created or manipulated
'' * using iText.
'' *
'' * You can be released from the requirements of the license by purchasing
'' * a commercial license. Buying such a license is mandatory as soon as you
'' * develop commercial activities involving the iText software without
'' * disclosing the source code of your own applications.
'' * These activities include: offering paid services to customers as an ASP,
'' * serving PDFs on the fly in a web application, shipping iText with a closed
'' * source product.
'' *
'' * For more information, please contact iText Software Corp. at this
'' * address: sales@itextpdf.com
''

''*
'' * Development preview - this class (and all of the parser classes) are still experiencing
'' * heavy development, and are subject to change both behavior and interface.
'' *
'' * A text extraction renderer that keeps track of relative position of text on page
'' * The resultant text will be relatively consistent with the physical layout that most
'' * PDF files have on screen.
'' *
'' * This renderer keeps track of the orientation and distance (both perpendicular
'' * and parallel) to the unit vector of the orientation. Text is ordered by
'' * orientation, then perpendicular, then parallel distance. Text with the same
'' * perpendicular distance, but different parallel distance is treated as being on
'' * the same line.
'' *
'' * This renderer also uses a simple strategy based on the font metrics to determine if
'' * a blank space should be inserted into the output.
'' *
'' * @since 5.0.2
''
Namespace LocTextExtraction

    Public Class LocTextExtractionStrategy
        Implements ITextExtractionStrategy

        Private _UndercontentCharacterSpacing = 0
        Private _UndercontentHorizontalScaling = 0
        Private ThisPdfDocFonts As SortedList(Of String, DocumentFont)

        '* set to true for debugging
        Public Shared DUMP_STATE As Boolean = False

        '* a summary of all found text
        Private locationalResult As New List(Of TextChunk)()

        '*
        ' * Creates a new text extraction renderer.
        '
        Public Sub New()
            ThisPdfDocFonts = New SortedList(Of String, DocumentFont)
        End Sub

        '*
        ' * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
        '
        Public Overridable Sub BeginTextBlock() Implements ITextExtractionStrategy.BeginTextBlock
        End Sub

        '*
        ' * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
        '
        Public Overridable Sub EndTextBlock() Implements ITextExtractionStrategy.EndTextBlock
        End Sub

        '*
        ' * @param str
        ' * @return true if the string starts with a space character, false if the string is empty or starts with a non-space character
        '
        Private Function StartsWithSpace(ByVal str As [String]) As Boolean
            If str.Length = 0 Then
                Return False
            End If
            Return str(0) = " "c
        End Function

        '*
        ' * @param str
        ' * @return true if the string ends with a space character, false if the string is empty or ends with a non-space character
        '
        Private Function EndsWithSpace(ByVal str As [String]) As Boolean
            If str.Length = 0 Then
                Return False
            End If
            Return str(str.Length - 1) = " "c
        End Function

        Public Property UndercontentCharacterSpacing
            Get
                Return _UndercontentCharacterSpacing
            End Get
            Set(ByVal value)
                _UndercontentCharacterSpacing = value
            End Set
        End Property

        Public Property UndercontentHorizontalScaling
            Get
                Return _UndercontentHorizontalScaling
            End Get
            Set(ByVal value)
                _UndercontentHorizontalScaling = value
            End Set
        End Property

        Public Overridable Function GetResultantText() As [String] Implements ITextExtractionStrategy.GetResultantText
            If DUMP_STATE Then
                DumpState()
            End If

            locationalResult.Sort()

            Dim sb As New StringBuilder()
            Dim lastChunk As TextChunk = Nothing

            For Each chunk As TextChunk In locationalResult
                If lastChunk Is Nothing Then
                    sb.Append(chunk.text)
                Else
                    If chunk.SameLine(lastChunk) Then
                        Dim dist As Single = chunk.DistanceFromEndOf(lastChunk)

                        If dist < -chunk.charSpaceWidth Then
                            sb.Append(" "c)
                            ' we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        ElseIf dist > chunk.charSpaceWidth / 2.0F AndAlso Not StartsWithSpace(chunk.text) AndAlso Not EndsWithSpace(lastChunk.text) Then
                            sb.Append(" "c)
                        End If

                        sb.Append(chunk.text)
                    Else
                        sb.Append(ControlChars.Lf)
                        sb.Append(chunk.text)
                    End If
                End If
                lastChunk = chunk
            Next

            Return sb.ToString()
        End Function

        Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
            Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
            Dim sb As New StringBuilder()
            Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
            Dim bStart As Boolean, bEnd As Boolean
            Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
            Dim sTextInUsedChunks As String = vbNullString

            For Each chunk As TextChunk In locationalResult
                If ThisLineChunks.Count > 0 AndAlso Not chunk.SameLine(ThisLineChunks.Last) Then
                    If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
                        Dim sLine As String = sb.ToString

                        'Check how many times the Search String is present in this line:
                        Dim iCount As Integer = 0
                        Dim lPos As Integer
                        lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
                        Do While lPos > -1
                            iCount += 1
                            If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
                            lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
                        Loop

                        'Process each match found in this Text line:
                        Dim curPos As Integer = 0
                        For i As Integer = 1 To iCount
                            Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer

                            iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
                            curPos = iFromChar
                            iToChar = iFromChar + pSearchString.Length - 1
                            sCurrentText = vbNullString
                            sTextInUsedChunks = vbNullString
                            FirstChunk = Nothing
                            LastChunk = Nothing

                            'Get first and last Chunks corresponding to this match found, from all Chunks in this line
                            For Each chk As TextChunk In ThisLineChunks
                                sCurrentText = sCurrentText & chk.text

                                'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
                                If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
                                    FirstChunk = chk
                                    bStart = True
                                End If

                                'Keep getting Text from Chunks while we are in the part where the matching String had been found
                                If bStart And Not bEnd Then
                                    sTextInUsedChunks = sTextInUsedChunks & chk.text
                                End If

                                'If we get out the matching String part then get this Chunk (last Chunk)
                                If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
                                    LastChunk = chk
                                    bEnd = True
                                End If

                                'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
                                'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
                                If bStart And bEnd Then
                                    FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
                                    curPos = curPos + pSearchString.Length
                                    bStart = False : bEnd = False
                                    Exit For
                                End If
                            Next
                        Next
                    End If
                    sb.Clear()
                    ThisLineChunks.Clear()
                End If
                ThisLineChunks.Add(chunk)
                sb.Append(chunk.text)
            Next

            Return FoundMatches
        End Function

        Private Function GetRectangleFromText(ByVal FirstChunk As TextChunk, ByVal LastChunk As TextChunk, ByVal pSearchString As String, _
                                              ByVal sTextinChunks As String, ByVal iFromChar As Integer, ByVal iToChar As Integer, ByVal pStrComp As System.StringComparison) As iTextSharp.text.Rectangle

            'There are cases where Chunk contains extra text at begining and end, we don't want this text locations, we need to extract the pSearchString location inside
            'for these cases we need to crop this String (left and Right), and measure this excedent at left and right, at this point we don't have any direct way to make a
            'Transformation from text space points to User Space units, the matrix for making this transformation is not accesible from here, so for these special cases when
            'the String needs to be cropped (Left/Right) We'll interpolate between the width from Text in Chunk (we have this value in User Space units), then i'll measure Text corresponding
            'to the same String but in Text Space units, finally from the relation between these 2 values I get the TransformationValue I need to use for all cases

            'Text Width in User Space Units
            Dim LineRealWidth As Single = LastChunk.PosRight - FirstChunk.PosLeft

            'Text Width in Text Units
            Dim LineTextWidth As Single = GetStringWidth(sTextinChunks, LastChunk.curFontSize, _
                                                         LastChunk.charSpaceWidth, _
                                                         ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))

            'TransformationValue value for Interpolation
            Dim TransformationValue As Single = LineRealWidth / LineTextWidth

            'In the worst case, we'll need to crop left and right:
            Dim iStart As Integer = sTextinChunks.IndexOf(pSearchString, pStrComp)
            Dim iEnd As Integer = iStart + pSearchString.Length - 1

            Dim sLeft As String
            If iStart = 0 Then sLeft = vbNullString Else sLeft = sTextinChunks.Substring(0, iStart)

            Dim sRight As String
            If iEnd = sTextinChunks.Length - 1 Then sRight = vbNullString Else sRight = sTextinChunks.Substring(iEnd + 1, sTextinChunks.Length - iEnd - 1)

            'Measure cropped Text at left:
            Dim LeftWidth As Single = 0
            If iStart > 0 Then
                LeftWidth = GetStringWidth(sLeft, LastChunk.curFontSize, _
                                           LastChunk.charSpaceWidth, _
                                           ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
                LeftWidth = LeftWidth * TransformationValue
            End If

            'Measure cropped Text at right:
            Dim RightWidth As Single = 0
            If iEnd < sTextinChunks.Length - 1 Then
                RightWidth = GetStringWidth(sRight, LastChunk.curFontSize, _
                                            LastChunk.charSpaceWidth, _
                                            ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
                RightWidth = RightWidth * TransformationValue
            End If

            'LeftWidth is the text width at left we need to exclude, FirstChunk.distParallelStart is the distance to left margin, both together will give us this LeftOffset
            Dim LeftOffset As Single = FirstChunk.distParallelStart + LeftWidth

            'RightWidth is the text width at right we need to exclude, FirstChunk.distParallelEnd is the distance to right margin, we substract RightWidth from distParallelEnd to get RightOffset
            Dim RightOffset As Single = LastChunk.distParallelEnd - RightWidth

            'Return this Rectangle
            Return New iTextSharp.text.Rectangle(LeftOffset, FirstChunk.PosBottom, RightOffset, FirstChunk.PosTop)
        End Function

        Private Function GetStringWidth(ByVal str As String, ByVal curFontSize As Single, ByVal pSingleSpaceWidth As Single, ByVal pFont As DocumentFont) As Single
            Dim chars() As Char = str.ToCharArray()
            Dim totalWidth As Single = 0
            Dim w As Single = 0

            For Each c As Char In chars
                w = pFont.GetWidth(c) / 1000
                totalWidth += (w * curFontSize + Me.UndercontentCharacterSpacing) * Me.UndercontentHorizontalScaling / 100
            Next

            Return totalWidth
        End Function

        Private Sub DumpState()
            For Each location As TextChunk In locationalResult
                location.PrintDiagnostics()
                Console.WriteLine()
            Next
        End Sub

        Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
            Dim segment As LineSegment = renderInfo.GetBaseline()
            Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

            With location
                'Chunk Location:
                Debug.Print(renderInfo.GetText)
                .PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
                .PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
                .PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
                .PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)

                'Chunk Font Size: (Height)
                .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)

                'Use Font name and Size as Key in the SortedList
                Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString

                'Add this font to ThisPdfDocFonts SortedList if it's not already present
                If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)

                'Store the SortedList index in this Chunk, so we can get it later
                .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
            End With

            locationalResult.Add(location)
        End Sub

        '*
        ' * Represents a chunk of text, it's orientation, and location relative to the orientation vector
        '
        Public Class TextChunk
            Implements IComparable(Of TextChunk)

            '* the text of the chunk
            Friend text As [String]
            '* the starting location of the chunk
            Friend startLocation As Vector
            '* the ending location of the chunk
            Friend endLocation As Vector
            '* unit vector in the orientation of the chunk
            Friend orientationVector As Vector
            '* the orientation as a scalar for quick sorting
            Friend orientationMagnitude As Integer
            '* perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
            ' * we round to the nearest integer to handle the fuzziness of comparing floats
            Friend distPerpendicular As Integer
            '* distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)
            Friend distParallelStart As Single
            '* distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)
            Friend distParallelEnd As Single
            '* the width of a single space character in the font of the chunk
            Friend charSpaceWidth As Single

            Private _PosLeft As Single
            Private _PosRight As Single
            Private _PosTop As Single
            Private _PosBottom As Single
            Private _curFontSize As Single
            Private _FontIndex As Integer

            Public Property FontIndex As Integer
                Get
                    Return _FontIndex
                End Get
                Set(ByVal value As Integer)
                    _FontIndex = value
                End Set
            End Property

            Public Property PosLeft As Single
                Get
                    Return _PosLeft
                End Get
                Set(ByVal value As Single)
                    _PosLeft = value
                End Set
            End Property

            Public Property PosRight As Single
                Get
                    Return _PosRight
                End Get
                Set(ByVal value As Single)
                    _PosRight = value
                End Set
            End Property

            Public Property PosTop As Single
                Get
                    Return _PosTop
                End Get
                Set(ByVal value As Single)
                    _PosTop = value
                End Set
            End Property

            Public Property PosBottom As Single
                Get
                    Return _PosBottom
                End Get
                Set(ByVal value As Single)
                    _PosBottom = value
                End Set
            End Property

            Public Property curFontSize As Single
                Get
                    Return _curFontSize
                End Get
                Set(ByVal value As Single)
                    _curFontSize = value
                End Set
            End Property

            Public Sub New(ByVal str As [String], ByVal startLocation As Vector, ByVal endLocation As Vector, ByVal charSpaceWidth As Single)
                Me.text = str
                Me.startLocation = startLocation
                Me.endLocation = endLocation
                Me.charSpaceWidth = charSpaceWidth

                Dim oVector As Vector = endLocation.Subtract(startLocation)
                If oVector.Length = 0 Then
                    oVector = New Vector(1, 0, 0)
                End If
                orientationVector = oVector.Normalize()
                orientationMagnitude = CInt(Math.Truncate(Math.Atan2(orientationVector(Vector.I2), orientationVector(Vector.I1)) * 1000))

                Dim origin As New Vector(0, 0, 1)
                distPerpendicular = CInt((startLocation.Subtract(origin)).Cross(orientationVector)(Vector.I3))

                distParallelStart = orientationVector.Dot(startLocation)
                distParallelEnd = orientationVector.Dot(endLocation)
            End Sub

            Public Sub PrintDiagnostics()
                Console.WriteLine("Text (@" & Convert.ToString(startLocation) & " -> " & Convert.ToString(endLocation) & "): " & text)
                Console.WriteLine("orientationMagnitude: " & orientationMagnitude)
                Console.WriteLine("distPerpendicular: " & distPerpendicular)
                Console.WriteLine("distParallel: " & distParallelStart)
            End Sub

            '*
            ' * @param as the location to compare to
            ' * @return true is this location is on the the same line as the other
            '
            Public Function SameLine(ByVal a As TextChunk) As Boolean
                If orientationMagnitude <> a.orientationMagnitude Then
                    Return False
                End If
                If distPerpendicular <> a.distPerpendicular Then
                    Return False
                End If
                Return True
            End Function

            '*
            ' * Computes the distance between the end of 'other' and the beginning of this chunk
            ' * in the direction of this chunk's orientation vector. Note that it's a bad idea
            ' * to call this for chunks that aren't on the same line and orientation, but we don't
            ' * explicitly check for that condition for performance reasons.
            ' * @param other
            ' * @return the number of spaces between the end of 'other' and the beginning of this chunk
            '
            Public Function DistanceFromEndOf(ByVal other As TextChunk) As Single
                Dim distance As Single = distParallelStart - other.distParallelEnd
                Return distance
            End Function

            '*
            ' * Compares based on orientation, perpendicular distance, then parallel distance
            ' * @see java.lang.Comparable#compareTo(java.lang.Object)
            '
            Public Function CompareTo(ByVal rhs As TextChunk) As Integer Implements System.IComparable(Of TextChunk).CompareTo
                If Me Is rhs Then
                    Return 0
                End If
                ' not really needed, but just in case
                Dim rslt As Integer
                rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude)
                If rslt <> 0 Then
                    Return rslt
                End If

                rslt = CompareInts(distPerpendicular, rhs.distPerpendicular)
                If rslt <> 0 Then
                    Return rslt
                End If

                ' note: it's never safe to check floating point numbers for equality, and if two chunks
                ' are truly right on top of each other, which one comes first or second just doesn't matter
                ' so we arbitrarily choose this way.
                rslt = If(distParallelStart < rhs.distParallelStart, -1, 1)

                Return rslt
            End Function

            '*
            ' *
            ' * @param int1
            ' * @param int2
            ' * @return comparison of the two integers
            '
            Private Shared Function CompareInts(ByVal int1 As Integer, ByVal int2 As Integer) As Integer
                Return If(int1 = int2, 0, If(int1 < int2, -1, 1))
            End Function

        End Class

        '*
        ' * no-op method - this renderer isn't interested in image events
        ' * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
        ' * @since 5.0.1
        '
        Public Sub RenderImage(ByVal renderInfo As ImageRenderInfo) Implements IRenderListener.RenderImage
            ' do nothing
        End Sub

    End Class

End Namespace
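
To make the matching step in GetTextLocations easier to follow, here is a small stand-alone sketch of its counting loop. It needs no iTextSharp. CountOccurrences is a hypothetical helper written for this illustration only; the guard against an empty search string is my own addition, because without it a loop like this would never advance.

```vb
Imports System

Module MatchCountSketch

    'Hypothetical helper: counts non-overlapping occurrences of search in line,
    'in the same way the counting loop in GetTextLocations walks through a line.
    Public Function CountOccurrences(ByVal line As String, ByVal search As String,
                                     ByVal comparison As StringComparison) As Integer

        'Assumption: an empty line or search string simply yields no matches
        If String.IsNullOrEmpty(line) OrElse String.IsNullOrEmpty(search) Then Return 0

        Dim found As Integer = 0
        Dim pos As Integer = line.IndexOf(search, 0, comparison)

        Do While pos > -1
            found += 1
            pos += search.Length 'Skip past this match
            If pos >= line.Length Then Exit Do
            pos = line.IndexOf(search, pos, comparison)
        Loop

        Return found

    End Function

    Sub Main()
        Dim line As String = "Just a simple test. JUST A SIMPLE TEST again."

        'Ignoring case, both phrases match: prints 2
        Console.WriteLine(CountOccurrences(line, "just a simple test", StringComparison.OrdinalIgnoreCase))

        'Case-sensitive, neither phrase matches: prints 0
        Console.WriteLine(CountOccurrences(line, "just a simple test", StringComparison.Ordinal))
    End Sub

End Module
```

This is also why the StringComparison value passed to ReplacePDFText matters: it decides whether text that differs only in case is replaced.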

All we need to do now is to import this namespace into our form. Add the following Imports statement to your form’s code:

Imports PDF_Play.LocTextExtraction 'Import LocationTextExtractionStrategy Capabilities

If we run our project now, it will work as intended.

I am including my project below for you to download. Sadly, the iTextSharp.dll is quite big, and unfortunately too big to include here; so you need to download it through the steps I have outlined for you.

Conclusion

Thank you for reading my article. Obviously, I am only human (don’t be so surprised!), and I can only do so much; but I couldn’t have written this article if it wasn’t for some help I received from a gentleman called jcis. Thank you – sometimes I bite off more than I can chew…

I hope you have enjoyed this article, and actually learned a thing or two from it. Now I’m off to see what new projects I can do and why VB.NET always seems to be second choice and C# first choice for real hardcore complicated projects…

Hannes DuPreez
Ockert J. du Preez is a passionate coder and always willing to learn. He has written hundreds of developer articles over the years detailing his programming quests and adventures. He has written the following books: Visual Studio 2019 In-Depth (BpB Publications) and JavaScript for Gurus (BpB Publications). He was the Technical Editor for Professional C++, 5th Edition (Wiley). He was a Microsoft Most Valuable Professional for .NET (2008–2017).
