Filling out a PDF form with iTextSharp

I needed to make an aspx page that would allow me to pull a PDF from disk, fill out fields in it, then stream the completed PDF to a browser. After searching around, I found that the consensus seems to be that iTextSharp is the way to go in C#. I grabbed version 4.1.2.

While the iText API docs are useful, they're a hard place to start from. I found that this example proved the most informative and useful of those that popped up on Google. Even with that, though, only the source code really answered all of my questions.

Extra things I found noteworthy:
  • All of the PdfReader constructors ultimately use the RandomAccessFileOrArray object to access the source PDF data.
  • The PdfReader(RandomAccessFileOrArray...) constructor results in a partial read (ReadPdfPartial()), whereas all the others read & parse the entire PDF (ReadPdf()) during construction. It appears, from this post, that the former uses less RAM, but the other constructors result in faster performance. (I suspect most of the difference is hard-drive related.)
  • You only need to call PdfReader.Close() when a partial read was requested (i.e. you used PdfReader(RandomAccessFileOrArray...)); in that case, it will also call Close on the RandomAccessFileOrArray that was passed in. In all other cases, there's nothing unmanaged that needs to be dealt with since everything was taken care of during construction. (See the sketch after this list.)
  • PdfStamper doesn't implement IDisposable, so you have to put in an explicit try...finally instead of a using block.
  • You must set PdfStamper.Writer.CloseStream to false if you want to use the output Stream (the one you passed into the constructor) after the call to PdfStamper.Close(), otherwise it gets closed.
  • This feels like a "duh" bit of info, but the output Stream's position will be at the end when PdfStamper is done with it. Stream.Seek or Stream.Position should be able to fix that for you.
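
As a minimal sketch of that partial-read pattern (assuming the same placeholder file path I use below; the null argument just means "no owner password"):

RandomAccessFileOrArray raf = new RandomAccessFileOrArray(@"c:\path\to\file.pdf");
PdfReader reader = new PdfReader(raf, null); // partial read (ReadPdfPartial())
try
{
    // ...hand the reader to a PdfStamper, pull field names, etc...
}
finally
{
    reader.Close(); // also closes the RandomAccessFileOrArray we passed in
}
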
So here's something that works. This will load the entire source PDF into memory then write a filled-out copy directly to the browser. However, you cannot tell how big the output file is because Response.OutputStream.Length throws an exception (it's a write-only stream). So, since you can't set "Content-Length", the browser can't estimate how long the download's going to take.

// Requires: using iTextSharp.text.pdf;
string fileName = @"c:\path\to\file.pdf";
try
{
    // PdfReader(string) reads the whole PDF during construction, so no Close() is needed on it.
    PdfStamper stamper = new PdfStamper(new PdfReader(fileName), Response.OutputStream);
    try
    {
        AcroFields af = stamper.AcroFields;
        af.SetField("field-name", "value");
        stamper.FormFlattening = true; // flatten the fields so the values become regular page content
    }
    finally
    {
        // PdfStamper isn't IDisposable, hence the explicit try...finally.
        // CloseStream defaults to true, so this also closes Response.OutputStream.
        stamper.Close();
    }
}
catch (Exception ex)
{
    throw new ApplicationException("Unable to fill out PDF: " + fileName, ex);
}


So, since I want to be nice to the end user, it's back to a MemoryStream instead of Response.OutputStream. The problem there is that Response.BinaryWrite() and Response.OutputStream.Write() both take a byte[], and MemoryStream.ToArray() returns a deep copy. That would give me two copies of the download and one copy of the original in RAM, all at once; lame. So the only solution (short of learning more about Response.Filter) is to use a temporary byte[] and "chunk it out". Now I'm doing this (basically):

MemoryStream pdfOut = new MemoryStream(256000); // 256KB seems like a good starting point
try
{
    PdfStamper stamper = new PdfStamper(new PdfReader(fileName), pdfOut);
    try
    {
        // Keep pdfOut open after stamper.Close() so we can read it back out below.
        stamper.Writer.CloseStream = false;
        AcroFields af = stamper.AcroFields;
        af.SetField("field-name", "value");
        stamper.FormFlattening = true;
    }
    finally
    {
        stamper.Close();
    }
}
catch (Exception ex)
{
    throw new ApplicationException("Unable to fill out PDF: " + fileName, ex);
}

// ...among other headers, tell the browser how much to expect.
Response.AppendHeader("Content-Length", pdfOut.Length.ToString());
pdfOut.Seek(0, SeekOrigin.Begin); // make sure we start at the beginning of the PDF

// "chunk" out the PDF
byte[] buffer = new byte[102400]; // 100KB seems like a good size
int bytesRead = 0;

for (long totalBytesRead = 0; Response.IsClientConnected && totalBytesRead < pdfOut.Length; totalBytesRead += bytesRead)
{
    bytesRead = pdfOut.Read(buffer, 0, buffer.Length);
    if (bytesRead < buffer.Length)
    {
        // We must do this because BinaryWrite always writes out the entire array
        // (and we didn't fill our buffer on that last Read()).
        byte[] endBuffer = new byte[bytesRead];
        Array.Copy(buffer, 0, endBuffer, 0, bytesRead);
        Response.BinaryWrite(endBuffer);
    }
    else
    {
        Response.BinaryWrite(buffer);
    }
}
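
The "other headers" I gloss over above are whatever your page needs. As a purely hypothetical example (the filename is just a placeholder), something like this would go before the Content-Length line:

Response.ContentType = "application/pdf";
Response.AppendHeader("Content-Disposition", "attachment; filename=file.pdf");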

With the basic equivalent of the above, I now have what I need. While I don't have a solution that uses the least possible RAM, I do have one that ensures any problems with the PDF are encountered before the headers are written, and that the Content-Length can be set accurately.
