Generating a PDF from a web page

Guy Waldman

Basic goal

In a project I'm working on, I had a simple problem that required taking some HTML (generated by React, but that doesn't really matter), and generating a PDF from it.

This, apparently, was not as straightforward as I would have liked, as confirmed by Daniel's tweet here:

So, for the sake of Daniel and other interested parties, I decided to write something about my experience, in the hopes it would benefit someone.

What is a PDF anyway?

The Portable Document Format needs no introduction - it's a ubiquitous format used to display documents independent of the software, hardware, or operating system.
Our end goal would be to take a web page and generate a PDF from it.

An important property of the PDF is actually hinted at by its name - portable; a PDF needs to be displayed consistently across platforms (and across PDF readers like Adobe Acrobat or Foxit).
What this means is that a PDF needs to contain all the information needed to display inside the actual file. This includes things like fonts, images, vector graphics etc.
Consider if Bob creates a cool document with a cool font like Inter and wants to send it to Alice. Bob creates a Word document if he's on Windows or a Pages document if he's on a Mac, right? Well, when Alice opens the document, she can't see the font that Bob used! She may even get a warning that the font used in the document is not installed on her system. To accurately represent the content that Bob wants to send, he would need to use a file format which encapsulates all the information needed to display it; that is what the PDF is used for.

To be precise:

PDF combines three technologies:

  • A subset of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.

(from https://en.wikipedia.org/wiki/PDF)

PDF is magic! Or at least, as Arthur C. Clarke puts it, "Any sufficiently advanced technology is indistinguishable from magic."
It is also a very inconsistent format but let's not get into that here.

Hey, there's a reason why it's the most popular religion:

Requirements

Let's quickly establish the basic premise of this blog post - I want to take some arbitrary HTML and generate a high quality PDF that meets the following requirements:

Note that the PDF you want to generate may not have all these requirements (for example, you may be content with a single "screenshot" of the DOM element as an image inside the PDF), so consider whether these are important for you too. Later we'll go over some common approaches which do not guarantee these requirements, but may be fine for what you are trying to achieve!

  1. Consistency - The DOM element (for the uninitiated, the DOM is essentially the in-memory representation of a web page) may not look the same across browsers, or even operating systems.
    If you want your PDF to look the same no matter how your users access your website, this may be of high significance.
  2. Fidelity - The PDF should retain as much of the quality of the content in the DOM element as possible, so the text or SVGs will retain the vector information, images will preserve their original quality, etc.
    This also means that the CSS applied on the web page will need to look the same in the generated PDF. When I present the different approaches, you'll see why this is not trivial.
  3. Generic - The approach should be applicable to any arbitrary HTML, so we would not need to set anything up ahead of time. We would want no special handling for embedding certain types of content (such as images or SVGs) or special handling for different types of HTML tags. Everything should Just Work™.
    This is especially important since, in my project, I need to be able to quickly set up a PDF conversion on changing content.

Key takeaways

We want to consistently generate a PDF from a web page that should:

  1. Be high-fidelity and preserve all information required to render the web page (CSS, fonts, images, SVGs)
  2. Be generated on the fly for any arbitrary HTML
  3. Look the same on whichever platform the users are running from

First approaches

So, let's begin by looking around the interwebs for some popular JS-to-PDF converters.
Mhm, seems like jspdf or pdfkit are pretty popular.

Alas, it seems like jspdf's main usage is building a PDF programmatically.

e.g.

import { jsPDF } from "jspdf";

const doc = new jsPDF();
doc.text("Hello world!", 10, 10);
doc.save("a4.pdf");

With this library, it seems like we need to manually specify the elements we require and build the PDF ahead of time. This does not scale well and does not fulfill our "Generic" requirement since we want to generate a PDF dynamically based on a dynamic web page.

Looking at pdfkit, which is another great library I considered, it seems to have the same drawback.

Mhm, no luck so far. But I was never one to back down easily!

So, coming back to jspdf - it also has a .html method which, at first glance, appeared to be what we want:

import { jsPDF } from "jspdf";

const doc = new jsPDF();
const source = document.getElementById("convert-me-to-pdf");
doc.html(source, {
  callback: (doc) => {
    // This callback is invoked after the DOM is converted to a PDF.
    // We can now save the generated PDF.
    doc.save();
  },
});

Here is a small proof of concept for doing this: https://codesandbox.io/s/convert-to-pdf-jspdf-p9cnk

I used a <div> with an <svg>, an <img>, a heading and some text, with various styles (made easy with TailwindCSS). Try playing around with it and changing the content inside.

I tried a few libraries similar to that, but soon realized that this wasn't the direction I needed to go in. You would notice, if you tinkered around with it, that layout was often not preserved, images/SVGs were often not correctly rendered (if at all), etc.

This did not fulfill the "Fidelity" requirement (not to mention that the PDF could still look different across browsers and operating systems, which fails "Consistency" too). Another important note for using jspdf or similar libraries - all the CSS you use needs to be included in the DOM you render. That means one of these two options (or a combination of both):

  • Inlining the CSS with style attributes
  • Including all the CSS stylesheets you use. Note that this means actually including the CSS itself in <style> tags, you can't link to external CSS stylesheets from a CDN!
    In my case, using TailwindCSS, this meant downloading the stylesheet from their CDN and plopping it into <style> tags (alternatively I could build my CSS and include that)
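
The second option can be automated with a bit of string handling. Here is a minimal sketch, assuming you have already downloaded the CSS text for each stylesheet yourself. The helper name `inlineStylesheets` and the example CDN URL are made up for illustration, and the regular expressions only cover simple, well-formed `<link>` tags.

```typescript
// Hypothetical helper: swaps <link rel="stylesheet" href="..."> tags for
// <style> tags holding the CSS text the caller supplies for that href.
function inlineStylesheets(
  html: string,
  cssByHref: ReadonlyMap<string, string>
): string {
  return html.replace(/<link\b[^>]*>/gi, (tag) => {
    const isStylesheet = /\brel\s*=\s*["']?stylesheet["']?/i.test(tag);
    const hrefMatch = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
    if (!isStylesheet || hrefMatch === null) return tag;

    const css = cssByHref.get(hrefMatch[1]);
    // Leave the tag untouched when no CSS was provided for this href.
    return css === undefined ? tag : `<style>${css}</style>`;
  });
}

// Example: the CSS text would normally come from the downloaded stylesheet.
const exampleCss = new Map([["https://cdn.example.com/app.css", "h1{color:red}"]]);
const inlinedHtml = inlineStylesheets(
  '<head><link rel="stylesheet" href="https://cdn.example.com/app.css"></head>',
  exampleCss
);
```

The result can then be rendered into the element that gets handed to the PDF library.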

Keeping optimistic, I stumbled across react-to-pdf (I used React for this project). This seemed promising at first glance. However, like other such libraries I found, it seems to use jspdf under the hood. Note the final bullet in the "important notes" section of react-to-pdf:

Not vectorized - the pdf is created from a screenshot of the component and therefore is not vectorized. If you are looking for something more advanced for generating pdf using React components, please check out other popular alternatives packages listed below.

What this means is that the generated PDF is a plain image converted to PDF. The last suggestion on checking the alternative packages seems to be PDF renderers. I saw several solutions like these (which, if I recall correctly, often take advantage of HTML <canvas>) but they do not fulfill the "Fidelity" requirement.
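
To put a rough number on what "not vectorized" costs, here is a small back-of-the-envelope sketch. It assumes the screenshot is taken at CSS's 96 pixels per inch with a device pixel ratio of 1, and compares that to a typical 300 DPI print resolution. Both figures are my assumptions for illustration, and `a4PixelSize` is a made-up helper.

```typescript
const MM_PER_INCH = 25.4;

// Pixel dimensions needed to cover an A4 page (210mm x 297mm) at a given DPI.
function a4PixelSize(dpi: number): { width: number; height: number } {
  return {
    width: Math.round((210 / MM_PER_INCH) * dpi),
    height: Math.round((297 / MM_PER_INCH) * dpi),
  };
}

const screenshotSize = a4PixelSize(96); // { width: 794, height: 1123 }
const printSize = a4PixelSize(300); // { width: 2480, height: 3508 }
```

Under these assumptions, a screenshot-based PDF has roughly a third of the linear resolution a printer would like, so text gets blurry when zoomed or printed. Vector text and SVGs have no such limit.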

But wait... though doing this on the client-side seemed great at first, maybe that is not the correct way to go about it.

Generating the PDF on the client-side would mean:

  1. It would generate the PDF based on the user's platform and browser. So someone using Chrome on macOS may not see the same PDF as someone using Firefox on Windows.
    This does not fulfill the "Consistency" requirement.
  2. It would run on the user's machine, so if we were to have a very rich DOM, a low-end device (especially mobile) could struggle with this, or take a long time.

Key takeaways

The solutions I found to generate a PDF on the client-side are usually one (or several) of the following:

  1. Meant for programmatically generating a PDF (not what I needed, does not fulfill the "Generic" requirement)
  2. Do not generate a high-fidelity PDF (with the CSS, fonts, vector information), thus not fulfilling the "Fidelity" requirement
  3. Inconsistent or buggy, especially with rich content

Even if I were to find such a library - it would still be insufficient, due to the fact it runs on the user's machine, and thus may look different on different platforms, or may be computationally intensive on low-end devices.

Server-side - the best side?

So, as I established at the end of the previous section, generating the PDF on the user's machine has some drawbacks.

I then thought - in the browser, when we "print to PDF", we always get a very high quality PDF, right? It looks the same as the web page (you can target print specifically with CSS @media queries with great browser support), and preserves the fonts, images and vector graphics. So, simulating something like this on the server-side seemed like it might be a good approach. Upon some further research, I discovered that this was not an original idea (which is always a good sign in our line of work, to be honest), and it might just be crazy enough to work.
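
As a quick illustration of targeting print with CSS, here is a sketch that injects a few print-only rules into an HTML string. The rules themselves, the `.no-print` class and the `withPrintStyles` helper are hypothetical examples, not something from my project.

```typescript
// Hypothetical print rules; the .no-print class is made up for illustration.
const PRINT_CSS = `
@page { size: A4; margin: 20mm; }
@media print {
  .no-print { display: none; }
  h2 { break-after: avoid; }
}`;

// Inserts a <style> tag with the given CSS right before </head>.
function withPrintStyles(html: string, css: string = PRINT_CSS): string {
  const styleTag = `<style>${css}</style>`;
  const headEnd = html.indexOf("</head>");
  // Fall back to prepending when the document has no <head>.
  return headEnd === -1
    ? styleTag + html
    : html.slice(0, headEnd) + styleTag + html.slice(headEnd);
}
```

Rules inside `@media print` only apply when the page is printed (or printed to PDF), so the on-screen version of the page is unaffected.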

So, how do we simulate a "print to PDF"? Well, it essentially boils down to the question of simulating a browser, and this I know! Let's use headless browsers. For those who are unfamiliar, headless browsers are real browser instances that run without the UI. This type of browser automation comes up a lot in testing, but is also useful for other things like scraping or, in our case, generating PDFs!

I have experience with puppeteer so that is what I chose to use.
There are other very nice alternatives such as playwright, which would also do very nicely here (Playwright also has the added benefit of supporting other browsers such as Firefox, but we don't actually need that here).

Puppeteer drives a real Chrome/Chromium browser, and we get some nice benefits:

  1. Always using the same browser and running this PDF generator on the same platform (e.g. Ubuntu) means we get a consistent-looking UI between renders
  2. Puppeteer uses a recent version of Chrome which provides the ability to use the latest and greatest of web technologies
  3. Puppeteer also has a "non-headless" mode, which does the same thing but displays the actual UI when it's running. This is useful for testing it out when things go wrong
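
For the third point, switching between the two modes can be as small as a flag. This is a sketch of one way to do it: `launchOptionsFor` is a made-up helper, and the 250ms delay is an arbitrary value. `headless` and `slowMo` are launch options that Puppeteer accepts.

```typescript
interface DebuggableLaunchOptions {
  headless: boolean; // false shows the actual browser window
  slowMo: number; // delay (in ms) added to each Puppeteer operation
}

// Hypothetical helper: show the browser and slow things down when debugging.
function launchOptionsFor(debug: boolean): DebuggableLaunchOptions {
  return { headless: !debug, slowMo: debug ? 250 : 0 };
}
```

The returned object can be passed straight to `puppeteer.launch(...)`, for example driven by an environment variable of your choosing.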

Okay, so let's draw up a game plan:

  1. Spin up a headless browser in Puppeteer
  2. Generate an HTML data URI for our web page (with all its dependencies, such as fonts or stylesheets)
  3. Print the page to an A4 PDF
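
One detail about step 2 that is easy to trip over: raw HTML usually contains characters like "#" (think `href="#top"` or a hex color), and in a URI everything after the first "#" is treated as a fragment. So it is safer to percent-encode the HTML before building the data URI. A minimal sketch, with helper names of my own choosing:

```typescript
// Builds a data URI whose payload is the percent-encoded HTML.
function toHtmlDataUri(html: string): string {
  return "data:text/html;charset=utf-8," + encodeURIComponent(html);
}

// Inverse of toHtmlDataUri, handy for checking that nothing was lost.
function fromHtmlDataUri(uri: string): string {
  return decodeURIComponent(uri.slice(uri.indexOf(",") + 1));
}

const sampleHtml = '<a href="#top" style="color:#ff0000">Back to top</a>';
const sampleUri = toHtmlDataUri(sampleHtml);
```

Keep in mind that very large pages make for very long URIs. Puppeteer's `page.setContent` is an alternative that skips the URI altogether.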

Indeed, this turned out great in practice. I used Vercel serverless functions in my Next.js application.

However, note that if you simply tried this, it wouldn't immediately work - Puppeteer needs a Chromium binary (Chromium is the open-source browser that Google Chrome is built on) and the server probably does not have it installed. Vercel functions, which I used for this, run on AWS Lambda under the hood, so fortunately there is a handy NPM package for this called chrome-aws-lambda which ships with the correct Chromium binary for Puppeteer to use.

Some sample code for a Next.js API endpoint could look like this (using TypeScript; since it contains JSX, the file needs a .tsx extension):

import type { NextApiRequest, NextApiResponse } from "next";
import puppeteer from "puppeteer";
import chrome from "chrome-aws-lambda";
import React from "react";
import ReactDOMServer from "react-dom/server";
import ComponentToRenderToPdf from "..."; // Example

async function printPDF(name: string) {
  const browser = await puppeteer.launch(
    process.env.NODE_ENV === "production"
      ? {
          args: chrome.args,
          executablePath: await chrome.executablePath,
          headless: chrome.headless,
        }
      : { headless: true }
  );

  try {
    const page = await browser.newPage();

    // Render the desired React component to a string.
    const html = ReactDOMServer.renderToString(
      <html>
        <head>
          {/* Optionally include external stylesheets here. */}
        </head>
        <body>
          <ComponentToRenderToPdf name={name} />
        </body>
      </html>
    );

    // Navigate to an HTML data URI with the HTML we generated before.
    // The HTML is percent-encoded so that characters such as "#" do not cut the URI short.
    const response = await page.goto(
      "data:text/html;charset=utf-8," + encodeURIComponent(html),
      { waitUntil: ["networkidle0"] }
    );
    if (!response || !response.ok()) {
      // Handle error appropriately.
    }

    await page.emulateMediaType("print");
    return await page.pdf({ format: "a4", printBackground: true });
  } finally {
    // Clean up after ourselves like good citizens, even if rendering failed.
    await browser.close();
  }
}

export default async (req: NextApiRequest, res: NextApiResponse<Buffer>) => {
  // Optionally use the request and pass parameters to render the component dynamically.
  // A query parameter may be missing or repeated, so normalize it to a single string.
  const rawName = req.query.name;
  const name = Array.isArray(rawName) ? rawName[0] : rawName ?? "";
  const pdfBuffer = await printPDF(name);

  // Return the result with a MIME type of `application/pdf`.
  res.setHeader("Content-Type", "application/pdf");
  res.setHeader("Content-Length", pdfBuffer.length);
  res.status(200).send(pdfBuffer);
};

And this works like a charm. Navigating to my website with the relative URL /api/renderPdf?name=Guy renders the desired PDF, and the performance is great for my needs.
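
If the name comes from user input, it is worth encoding it when building that URL, since spaces or characters like "&" would otherwise break the query string. A tiny sketch (the helper name `renderPdfUrl` is mine):

```typescript
// Builds the relative URL for the PDF endpoint with a safely encoded name.
function renderPdfUrl(personName: string): string {
  return "/api/renderPdf?name=" + encodeURIComponent(personName);
}

const simpleUrl = renderPdfUrl("Guy"); // "/api/renderPdf?name=Guy"
const encodedUrl = renderPdfUrl("Guy & Co"); // "/api/renderPdf?name=Guy%20%26%20Co"
```

The resulting URL can be used directly as the href of a download link.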

Another benefit of this approach is that the PDF also preserves the functionality of links 🥳

Note that there is still a drawback to this method - if you want to show a preview of the PDF in the browser using the same React component, and the user's platform differs from the server's, the preview may look different from the generated PDF. However, that is a trade-off that I was willing to live with for my particular project.

Key takeaways

We built a serverless function that, on-demand:

  1. Spins up a Google Chrome headless browser instance (by leveraging Puppeteer)
  2. Opens an HTML data URI that we created from our HTML (or React component in this case)
  3. Prints the web page to a PDF that fulfills all our original requirements

Conclusion

So, my friends, in our journey to becoming PDF-generation masters, we realized that generating PDFs on the client-side is not straightforward and is also not the ideal solution to fulfill our requirements. So we went with a server-side approach with a headless browser.

This was not a challenge that I needed to solve in the best possible way and I settled on a "good enough" solution for my use-case.
Therefore, if you feel that I missed something or if there are better approaches out there, please do not hesitate to reach out!