-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
[bpo-16512](https://bugs.python.org/issue16512): Improve jpeg detection in imghdr #14862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Some valid jpeg files are not detected by imghdr.what The proposed fix implements the solution proposed at https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete. We detect the JPEG file by checking it starts with the Start of image sequence xFFxD8, and ends with the end of image sequence xFFxD9. In order to do so, we load the entire file in memory
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it! |
I signed the CLA |
I'm sorry, wouldn't it be nicer to |
Yes, I agree, I was worried about breaking backwards compatibility due to the test at line 53 of test_imghdr. What do you think ? |
Do you mean |
no, test_register_test sorry for the typo |
Alternatively, we could only check for the first two bytes... I am unsure of the reason why this wasn't done like that in the first place |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierrechopin Welcome and thanks for the contribution!
In order to verify that this patch fully addresses the issue, it would be necessary to add a new test file. This test file would preferably be a jpeg that begins with an ICC profile. After adding the test file to the repository, the entry in test files may look something like this:
TEST_FILES = (
# ...
('python.jpg', 'jpeg'),
('pythonICC.jpg', 'jpeg'), # addresses bpo-16512
# ...
)
The current tests only verify that the same functionality exists as before. From my understanding, checking the beginning of the file for FF D8 and the end for FF D9 is a more reliable way of determining whether or not something is a jpeg, so replacing that with the current method of detection might be more reliable.
However, it's not necessary to read the entire file into memory. This adds a significant amount of additional overhead, especially if users want to check larger image files. Instead, you can just read the first 2 bytes and last 2 bytes like this:
f = open(file, 'rb')
head = f.read(2) # b'\xff\xd8'
f.seek(-2, io.SEEK_END)
tail = f.read(2) # b'\xff\xd9
This would not be the literal implementation since it would not be compatible with the existing other file tests, the point of that was simply to show that it's not needed to read all of the bytes in the file. I would recommend starting with figuring out how many bytes in total the other files require and adjust the arguments of test_jpeg() to take in both the head and the tail as args.
Also, is it needed to read the closing FF D9 or is it distinguishable as a jpeg simply from the opening FF D8? If it's the latter, the solution would be even more simple. You could leave the current behavior of collecting the first 32 bytes, and remove and h.endswith(b'\xFF\xD9')
from your conditional.
The last part was purely speculation on my part and just an idea to simplify things a bit. I'm not sure if it would actually work properly.
I would recommend starting with adding the ICC file to test and then worrying about the optimization later. In this case, the lack of optimization could be quite detrimental for users though, especially if they use imghdr
for detecting the file types of a large volume of uploaded images of varying sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as a more minor suggestion, if your current method of reading the entire file is approved and the optimization is considered to be not necessary, it would be helpful to change the name of the variable on lines 17 and 20 from h
to bytes
. Currently, h
implies head or header, since it's only reading the first 32 bytes. If you're reading all of the bytes, this name would be quite misleading.
In general, I'm also a bit opposed to one character variables when they can be avoided. f
being short for file
is okay, but even with it's current functionality I'd rather see h
be head
. Doing so makes the code a lot easier to follow, especially for those who are less familiar with IO.
Also, as a minor fix, I would recommend adjusting the title to one of the following: |
Hi, guys, I implemented a less strict version, only checking for the SOI tags, as disscussed in the other PR addressing the issue that was opened a year ago. |
Thanks for the PR, @pierrechopin, and I'm sorry that it's been open for so long without being either merged or rejected. Unfortunately, the I hope that this does not dissuade you from contributing to CPython in the future 🙂 |
Some valid jpeg files are not detected by imghdr.what
The proposed fix implements the solution proposed at
https://stackoverflow.com/questions/46802866/how-to-detect-if-the-jpg-jpeg-image-file-is-corruptedincomplete.
We detect the JPEG file by checking it starts with the Start of image
sequence xFFxD8, and ends with the end of image sequence xFFxD9.
In order to do so, we load the entire file in memory.
https://bugs.python.org/issue16512