Python and PDFs

Real Python has a tutorial on How to Work With a PDF in Python. I subscribe to Real Python because I find their tutorials well-written or, in the case of video tutorials, well-presented. The focus of this tutorial is the PythonPDF module, which can get metadata from a PDF, rotate pages, merge or split a PDF, and/or encrypt it. While the tutorial mentions “extract information” it does not mean PythonPDF can get text from a PDF that does not have a text layer already embedded on its pages — you could argue that the unintuitive nature of PDFs reveals their brokenness but that’s for another time. If you want to get text where there is no text layer, but you still want to use Python, it looks like you have to turn to PDFMiner — though a quick skim of its GH page doesn’t reveal if it has OCR capabilities backed in. Sigh.

HTML to PDF

In an ideal setup, my workflow would have me writing in some version of plain text — a flavor of markdown in all probability — that could be quickly and easily outputted to a variety of formats and media. In most instances, that output gets printed, or at least paginated, which means it probably has to, at least for a moment, be instantiated as a PDF. (If I remember correctly, this is essentially how the macOS display and printing system work.) What that would mean would be a collection of CSS files that transformed the generated HTML into the various kinds of documents I regularly produce: essays, reports, letters, lectures, etc.

This function is what the Marked app does and does well — it’s also functionality built into the Ulysses app if I remember. Neither of those apps, I believe, offer pagination, which is often critical to what I output. And so, I have continued to search for my own solution in hopes of building it into a workflow — for the record, when I am working on long-form plain text, my editor of choice is FoldingText because it does a brilliant job of hiding the markdown unless you are working on that sentence and, as the name implies, it makes it possible to hide all but the section of the document on which you are working. It’s brilliant. (To be clear, I am a fan of all the apps mentioned here and of their developers.)

Getting from plain text via markdown or MultiMarkdown to HTML and then pairing that HTML with a page-media aware CSS file and then outputting to PDF is not as easy as it should be. The one app of which I have been aware up until recently was PrinceXML, which its creators have made free for non-commercial use, but with the imposition of a small watermark. That’s very generous, but it’s not quite what I want and I don’t have the kind of money to afford a desktop license.

And so it was a delightful surprise to discover that there are free software options to explore:

  • wkhtmltopdf is an “open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely headless and do not require a display or display service.”
  • **WeasyPrint is a “visual rendering engine for HTML and CSS that can export to PDF. … It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.”

Next up … trying WeasyPrint and an update/report here.