[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

pierrechopin · 2019-07-19T14:24:32Z

Some valid jpeg files are not detected by imghdr.what

The proposed fix implements the solution proposed at
https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete.

We detect the JPEG file by checking it starts with the Start of image
sequence xFFxD8, and ends with the end of image sequence xFFxD9.

In order to do so, we load the entire file in memory.

https://bugs.python.org/issue16512

Some valid jpeg files are not detected by imghdr.what The proposed fix implements the solution proposed at https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete. We detect the JPEG file by checking it starts with the Start of image sequence xFFxD8, and ends with the end of image sequence xFFxD9. In order to do so, we load the entire file in memory

the-knights-who-say-ni · 2019-07-19T14:24:36Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

pierrechopin · 2019-07-19T14:50:16Z

I signed the CLA

Jorilx · 2019-07-19T15:03:29Z

I'm sorry, wouldn't it be nicer to seek 32 bytes from the end of the file, read JUST them and pass them along with the header to the test_XXX functions?

pierrechopin · 2019-07-19T15:09:04Z

Yes, I agree, I was worried about breaking backwards compatibility due to the test at line 53 of test_imghdr. What do you think ?

Jorilx · 2019-07-19T15:24:18Z

Do you mean test_pathlike_filename?

pierrechopin · 2019-07-19T15:28:22Z

no, test_register_test sorry for the typo

pierrechopin · 2019-07-19T15:28:42Z

Alternatively, we could only check for the first two bytes... I am unsure of the reason why this wasn't done like that in the first place

aeros

@pierrechopin Welcome and thanks for the contribution!

In order to verify that this patch fully addresses the issue, it would be necessary to add a new test file. This test file would preferably be a jpeg that begins with an ICC profile. After adding the test file to the repository, the entry in test files may look something like this:

TEST_FILES = (
# ...
    ('python.jpg', 'jpeg'),
    ('pythonICC.jpg', 'jpeg'), # addresses bpo-16512
# ...
)

The current tests only verify that the same functionality exists as before. From my understanding, checking the beginning of the file for FF D8 and the end for FF D9 is a more reliable way of determining whether or not something is a jpeg, so replacing that with the current method of detection might be more reliable.

However, it's not necessary to read the entire file into memory. This adds a significant amount of additional overhead, especially if users want to check larger image files. Instead, you can just read the first 2 bytes and last 2 bytes like this:

f = open(file, 'rb')
head = f.read(2) # b'\xff\xd8'
f.seek(-2, io.SEEK_END) 
tail = f.read(2) # b'\xff\xd9

This would not be the literal implementation since it would not be compatible with the existing other file tests, the point of that was simply to show that it's not needed to read all of the bytes in the file. I would recommend starting with figuring out how many bytes in total the other files require and adjust the arguments of test_jpeg() to take in both the head and the tail as args.

Also, is it needed to read the closing FF D9 or is it distinguishable as a jpeg simply from the opening FF D8? If it's the latter, the solution would be even more simple. You could leave the current behavior of collecting the first 32 bytes, and remove and h.endswith(b'\xFF\xD9') from your conditional.

The last part was purely speculation on my part and just an idea to simplify things a bit. I'm not sure if it would actually work properly.

I would recommend starting with adding the ICC file to test and then worrying about the optimization later. In this case, the lack of optimization could be quite detrimental for users though, especially if they use imghdr for detecting the file types of a large volume of uploaded images of varying sizes.

aeros

Also as a more minor suggestion, if your current method of reading the entire file is approved and the optimization is considered to be not necessary, it would be helpful to change the name of the variable on lines 17 and 20 from h to bytes. Currently, h implies head or header, since it's only reading the first 32 bytes. If you're reading all of the bytes, this name would be quite misleading.

In general, I'm also a bit opposed to one character variables when they can be avoided. f being short for file is okay, but even with it's current functionality I'd rather see h be head. Doing so makes the code a lot easier to follow, especially for those who are less familiar with IO.

aeros · 2019-07-20T04:23:10Z

Also, as a minor fix, I would recommend adjusting the title to one of the following:
[bpo-16512](https://bugs.python.org/issue16512): Recognize more jpeg formats
[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr

pierrechopin · 2019-07-31T16:11:15Z

Hi, guys, I implemented a less strict version, only checking for the SOI tags, as disscussed in the other PR addressing the issue that was opened a year ago.

AlexWaygood · 2022-04-12T06:25:02Z

Thanks for the PR, @pierrechopin, and I'm sorry that it's been open for so long without being either merged or rejected.

Unfortunately, the imghdr module is now deprecated following the acceptance of PEP 594 by the Steering Council. Bugfixes and improvements to the module are therefore now considered to be very low priority. Given the large backlog of PRs in the CPython repo, and the fact that this module has no active maintainer in the core dev team, I am therefore going to close this PR as per the policy laid out in the dev guide.

I hope that this does not dissuade you from contributing to CPython in the future 🙂

the-knights-who-say-ni added the CLA not signed label Jul 19, 2019

bedevere-bot added the awaiting review label Jul 19, 2019

aeros reviewed Jul 20, 2019

View reviewed changes

toyama0919 mentioned this pull request Jul 30, 2019

support binary by NoneSplitter. aws/sagemaker-python-sdk#954

Merged

4 tasks

pierrechopin changed the title ~~bpo-16512: Recongnise more jpeg file~~ [bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr Jul 31, 2019

only look for header

0586cf3

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Jul 31, 2019

csabella mentioned this pull request Feb 6, 2020

bpo-16512: one jpeg image added with different profile, code modified to detect all types of jpg #8322

Closed

AlexWaygood closed this Apr 12, 2022

joril mannequin mentioned this pull request Apr 12, 2022

imghdr doesn't recognize variant jpeg formats #60716

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

Uh oh!

pierrechopin commented Jul 19, 2019 •

edited by bedevere-bot

Loading

Uh oh!

the-knights-who-say-ni commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

Jorilx commented Jul 19, 2019 •

edited

Loading

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

Jorilx commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019 •

edited

Loading

Uh oh!

aeros left a comment •

edited

Loading

Uh oh!

aeros left a comment

Uh oh!

aeros commented Jul 20, 2019 •

edited by bedevere-bot

Loading

Uh oh!

pierrechopin commented Jul 31, 2019

Uh oh!

AlexWaygood commented Apr 12, 2022

Uh oh!

Uh oh!

Uh oh!

[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

Uh oh!

Conversation

pierrechopin commented Jul 19, 2019 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

the-knights-who-say-ni commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

Jorilx commented Jul 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

Jorilx commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019

Uh oh!

pierrechopin commented Jul 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aeros left a comment

Choose a reason for hiding this comment

Uh oh!

aeros commented Jul 20, 2019 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pierrechopin commented Jul 31, 2019

Uh oh!

AlexWaygood commented Apr 12, 2022

Uh oh!

Uh oh!

pierrechopin commented Jul 19, 2019 •

edited by bedevere-bot

Loading

Jorilx commented Jul 19, 2019 •

edited

Loading

pierrechopin commented Jul 19, 2019 •

edited

Loading

aeros left a comment •

edited

Loading

aeros commented Jul 20, 2019 •

edited by bedevere-bot

Loading