Skip to content

[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

pierrechopin
Copy link

@pierrechopin pierrechopin commented Jul 19, 2019

Some valid jpeg files are not detected by imghdr.what

The proposed fix implements the solution proposed at
https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete.

We detect the JPEG file by checking it starts with the Start of image
sequence xFFxD8, and ends with the end of image sequence xFFxD9.

In order to do so, we load the entire file in memory.

https://bugs.python.org/issue16512

Some valid jpeg files are not detected by imghdr.what

The proposed fix implements the solution proposed at
https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete.

We detect the JPEG file by checking it starts with the Start of image
sequence xFFxD8, and ends with the end of image sequence xFFxD9.

In order to do so, we load the entire file in memory
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@pierrechopin
Copy link
Author

I signed the CLA

@Jorilx
Copy link

Jorilx commented Jul 19, 2019

I'm sorry, wouldn't it be nicer to seek 32 bytes from the end of the file, read JUST them and pass them along with the header to the test_XXX functions?

@pierrechopin
Copy link
Author

Yes, I agree, I was worried about breaking backwards compatibility due to the test at line 53 of test_imghdr. What do you think ?

@Jorilx
Copy link

Jorilx commented Jul 19, 2019

Do you mean test_pathlike_filename?

@pierrechopin
Copy link
Author

no, test_register_test sorry for the typo

@pierrechopin
Copy link
Author

pierrechopin commented Jul 19, 2019

Alternatively, we could only check for the first two bytes... I am unsure of the reason why this wasn't done like that in the first place

Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierrechopin Welcome and thanks for the contribution!

In order to verify that this patch fully addresses the issue, it would be necessary to add a new test file. This test file would preferably be a jpeg that begins with an ICC profile. After adding the test file to the repository, the entry in test files may look something like this:

TEST_FILES = (
# ...
    ('python.jpg', 'jpeg'),
    ('pythonICC.jpg', 'jpeg'), # addresses bpo-16512
# ...
)

The current tests only verify that the same functionality exists as before. From my understanding, checking the beginning of the file for FF D8 and the end for FF D9 is a more reliable way of determining whether or not something is a jpeg, so replacing that with the current method of detection might be more reliable.

However, it's not necessary to read the entire file into memory. This adds a significant amount of additional overhead, especially if users want to check larger image files. Instead, you can just read the first 2 bytes and last 2 bytes like this:

f = open(file, 'rb')
head = f.read(2) # b'\xff\xd8'
f.seek(-2, io.SEEK_END) 
tail = f.read(2) # b'\xff\xd9

This would not be the literal implementation since it would not be compatible with the existing other file tests, the point of that was simply to show that it's not needed to read all of the bytes in the file. I would recommend starting with figuring out how many bytes in total the other files require and adjust the arguments of test_jpeg() to take in both the head and the tail as args.

Also, is it needed to read the closing FF D9 or is it distinguishable as a jpeg simply from the opening FF D8? If it's the latter, the solution would be even more simple. You could leave the current behavior of collecting the first 32 bytes, and remove and h.endswith(b'\xFF\xD9') from your conditional.

The last part was purely speculation on my part and just an idea to simplify things a bit. I'm not sure if it would actually work properly.

I would recommend starting with adding the ICC file to test and then worrying about the optimization later. In this case, the lack of optimization could be quite detrimental for users though, especially if they use imghdr for detecting the file types of a large volume of uploaded images of varying sizes.

Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as a more minor suggestion, if your current method of reading the entire file is approved and the optimization is considered to be not necessary, it would be helpful to change the name of the variable on lines 17 and 20 from h to bytes. Currently, h implies head or header, since it's only reading the first 32 bytes. If you're reading all of the bytes, this name would be quite misleading.

In general, I'm also a bit opposed to one character variables when they can be avoided. f being short for file is okay, but even with it's current functionality I'd rather see h be head. Doing so makes the code a lot easier to follow, especially for those who are less familiar with IO.

@aeros
Copy link
Contributor

aeros commented Jul 20, 2019

Also, as a minor fix, I would recommend adjusting the title to one of the following:
[bpo-16512](https://bugs.python.org/issue16512): Recognize more jpeg formats
[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr

@pierrechopin pierrechopin changed the title bpo-16512: Recongnise more jpeg file [bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr Jul 31, 2019
@pierrechopin
Copy link
Author

Hi, guys, I implemented a less strict version, only checking for the SOI tags, as disscussed in the other PR addressing the issue that was opened a year ago.

@AlexWaygood
Copy link
Member

Thanks for the PR, @pierrechopin, and I'm sorry that it's been open for so long without being either merged or rejected.

Unfortunately, the imghdr module is now deprecated following the acceptance of PEP 594 by the Steering Council. Bugfixes and improvements to the module are therefore now considered to be very low priority. Given the large backlog of PRs in the CPython repo, and the fact that this module has no active maintainer in the core dev team, I am therefore going to close this PR as per the policy laid out in the dev guide.

I hope that this does not dissuade you from contributing to CPython in the future 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants