A More Detailed Look at PyMuPDF's Performance

We have stated on several occasions that MuPDF, and therefore also our Python binding PyMuPDF, ranks at the top when it comes to performance.

I have been wondering how these bold statements could be proved, or at least underpinned with some quantitative data. A full comparison of all the many PDF tools on the market is nearly impossible: the differences in functionality, scope, intended use, platform (in)dependence, pricing, openness and so forth are just too large.

So I decided to start with a minimal approach. I just want to illustrate how fast MuPDF can read in and interpret PDF files and thus make them available for the actual processing desired.

Most of what I am writing here is also contained in the PyMuPDF documentation, which already has a chapter on performance and resource requirements for the text extraction methods.

I felt that simply opening a PDF and immediately saving it again as a new PDF should exercise all of the code responsible for interpreting a PDF's data and rearranging it to form a new PDF. At the same time, every tool should at least be able to do this basic task.
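To make the open-and-save test concrete, here is a minimal sketch of how such a measurement could look in Python. It is an illustration only, not the script used for the comparison: the helper names copy_with_pymupdf and time_copy are made up for this example, and taking the best of a few repeated runs is just one possible way to reduce timing noise. The PyMuPDF calls used are fitz.open, Document.save and Document.close.

```python
import time


def copy_with_pymupdf(src, dst):
    """Open src with PyMuPDF and immediately save it again as dst."""
    # Imported here so that the timing helper below also works for other tools.
    import fitz  # PyMuPDF

    doc = fitz.open(src)  # read and interpret the PDF
    doc.save(dst)         # write it out as a new PDF
    doc.close()


def time_copy(copy_func, src, dst, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs.

    copy_func is any callable taking (src, dst), so the same harness
    can wrap each tool in the comparison.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_func(src, dst)
        best = min(best, time.perf_counter() - start)
    return best
```

With placeholder file names, a call such as time_copy(copy_with_pymupdf, "input.pdf", "output.pdf") would return the fastest of three runs in seconds. Any other tool can be measured the same way by passing a different copy function.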

We have chosen the following tools for the comparison.

tbc