A More Detailed Look at PyMuPDF's Performance

We have stated several times that MuPDF, and therefore also our Python binding PyMuPDF, ranks at the top when it comes to performance.

I have been wondering how these bold statements could be proved, or at least underpinned with some quantitative data. A full comparison of all the many PDF tools on the market is nearly impossible: the differences in functionality, scope, intended use, platform (in)dependence, pricing, openness and so forth are just too large.

So I decided to start with a minimal approach. I just want to illustrate how fast MuPDF can read in and interpret PDF files and thus make them available for the actual processing desired.

Most of what I am writing here is also contained in the PyMuPDF documentation, which already has a chapter on performance and resource requirements for the text extraction methods.

I felt that simply opening a PDF and immediately saving it again as a new PDF should cover the complete code responsible for interpreting a PDF's data and re-arranging it to form a new PDF. At the same time, every tool should at least be able to do this basic task.
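To make the idea concrete, here is a minimal sketch of such an open-and-save measurement. It is not the script behind the reported figures. The helper names time_copy and copy_with_pymupdf, the repeat count and the choice of keeping the best run are my own assumptions for illustration. The only PyMuPDF calls used are fitz.open, Document.save and Document.close.

```python
import time


def copy_with_pymupdf(src, dst):
    """Open a PDF with PyMuPDF and immediately save it under a new name."""
    # PyMuPDF is imported lazily so the timing helper below can also be
    # used on machines where the package is not installed.
    import fitz

    doc = fitz.open(src)  # read and interpret the input file
    doc.save(dst)         # write the document out as a new PDF
    doc.close()


def time_copy(copy_func, src, dst, repeats=3):
    """Return the best wall-clock time (seconds) of `repeats` copy runs.

    `copy_func` is any callable taking (src, dst). Taking the minimum is
    an assumption made here to reduce noise from other system activity.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_func(src, dst)
        best = min(best, time.perf_counter() - start)
    return best
```

A call such as time_copy(copy_with_pymupdf, "input.pdf", "output.pdf") would then give one run time per file; both file names are placeholders.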

We have chosen the following tools for the comparison.

PyMuPDF: appears as "fitz" in the reports
pdfrw: a pure Python tool to read and write PDFs
PyPDF2: a pure Python PDF tool with a large feature set
pdftk: a CLI PDF tool kit for cleaning, joining, splitting etc.

If anyone knows another tool worth adding to this set - welcome! It should, however, be platform-independent or multi-platform, like the ones listed here.
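For orientation, the same open-and-save task might look roughly as follows with the other three tools. This is a sketch under assumptions, not the code used for the measurements: the function names, the COPY_FUNCS table and available_tools are hypothetical, the PyPDF2 variant assumes a release that offers the PdfReader / PdfWriter names (older releases use PdfFileReader / PdfFileWriter instead), and the pdftk variant assumes the pdftk executable is on the PATH. Note also that PyPDF2 copies page by page here, which is not necessarily equivalent to rewriting the whole document.

```python
import importlib.util
import shutil
import subprocess


def copy_with_pdfrw(src, dst):
    """Read a PDF with pdfrw and write it back out unchanged."""
    from pdfrw import PdfReader, PdfWriter

    PdfWriter(dst, trailer=PdfReader(src)).write()


def copy_with_pypdf2(src, dst):
    """Copy all pages of a PDF into a new file with PyPDF2."""
    # Assumes a PyPDF2 release providing PdfReader / PdfWriter.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(dst, "wb") as out:
        writer.write(out)


def copy_with_pdftk(src, dst):
    """Let the pdftk command line tool rewrite the PDF."""
    subprocess.run(["pdftk", src, "output", dst], check=True)


# Hypothetical lookup table: tool name -> copy function.
COPY_FUNCS = {
    "pdfrw": copy_with_pdfrw,
    "PyPDF2": copy_with_pypdf2,
    "pdftk": copy_with_pdftk,
}


def available_tools():
    """Return the names of the tools that are usable on this machine."""
    found = []
    for module in ("pdfrw", "PyPDF2"):
        if importlib.util.find_spec(module) is not None:
            found.append(module)
    if shutil.which("pdftk") is not None:
        found.append("pdftk")
    return found
```

Each function takes an input and an output file name, so all tools can be driven by the same timing loop.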

This is the set of files to test with:

Adobe PDF Reference: the manual (sixth edition, version 1.7, November 2006) with 1310 pages and over 30 MB
Scientific American: a complete year of the German version (10 to 25 MB each)
PyMuPDF documentation: the PDF version (0.5 MB)

The results can be seen in the following table ... (tbc)

Run Times