Skip to content

bpo-44289: Keep argument file object's current position in tarfile.is_tarfile #26488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 9, 2022

Conversation

amateja
Copy link
Contributor

@amateja amateja commented Jun 2, 2021

Since Python 3.9 tarfile.is_tarfile accepts not only paths but also files and file-like objects #18090.

Verification if a file or file-like object is a tar file modifies file object's current position.

https://bugs.python.org/issue44289

https://bugs.python.org/issue44289

@amateja amateja requested a review from ethanfurman as a code owner June 2, 2021 15:23
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

Recognized GitHub username

We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames:

@amateja

This might be simply due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@akulakov
Copy link
Contributor

@amateja hi, you need to sign the CLA..

Other than that, I've looked at the patch and it looks good to me. I've also confirmed that the fix does what it's supposed to by running test with & w/out the fix.

Since this is a (minor) backwards compatibility change, I'm not sure if it needs to also be added to "what's new", I hope @ethanfurman can comment.

@amateja
Copy link
Contributor Author

amateja commented Jun 21, 2021

Hi @akulakov, thanks for reviewing.

I believe I've already signed CLA. I confirmed that with https://check-python-cla.herokuapp.com/.

Regarding "what's new" I entirely base on your judgment. Just let me know what should be done to unblock this change from being accepted.

@akulakov
Copy link
Contributor

@amateja You are right, I somehow missed the CLA accepted message. You're good as far as that goes :-)

@akulakov
Copy link
Contributor

akulakov commented Dec 9, 2021

I looked at it again and I think this should be fixed in open() rather than in is_tarfile(), because is_tarfile uses open() so it should be fixed at the source; and because there's already logic in open() that partly addresses this issue:

if fileobj is not None:

So the fix would be (a few lines above the linked line; I tested it & it works):

                    fp = func(name, "r", fileobj, **kwargs)
                    if fileobj is not None:
                        fileobj.seek(saved_pos)
                    return fp

@akulakov
Copy link
Contributor

akulakov commented Dec 9, 2021

Note also that the fix I suggested covers more issues: it also covers calling open() more than a single time, e.g.:

kwargs = {'fileobj' if hasattr(archive, 'read') else 'name': archive}
t = tarfile.open(**kwargs)
t = tarfile.open(**kwargs)
return [member.name for member in t.getmembers()]

@amateja
Copy link
Contributor Author

amateja commented Dec 9, 2021

Thanks for your review @akulakov. You are right that this should be fixed at the source. I focused myself on is_tarfile() because new functionality for file-like objects handling was introduced only there with #18090. But I'm happy to change this approach.

@amateja
Copy link
Contributor Author

amateja commented Dec 9, 2021

Here are my further suggestions:

  1. rename this PR to "Keep argument file object's current position in tarfile.open" because tarfile.is_tarfile issue is now fixed elsewhere
  2. remove CommonReadTest.test_is_tarfile_keeps_position from Lib/test/test_tarfile.py added in my initial commit because same functionality is covered in CommonReadTest.test_open_keeps_position now
  3. drop or update "what's new" entry accordingly

Do you want me to write a test for calling open multiple times?

@akulakov
Copy link
Contributor

Thanks for making the changes!

I will do a bit of testing tomorrow and look over it again; I think it's enough to have the test that checks that open keeps the same position as you have it, because both bugs are effectively covered by this test.

I will try to post an update tomorrow or in the next 2 days.

@akulakov
Copy link
Contributor

I did some profiling to make sure an extra seek doesn't affect performance, I did it by opening 1000 small tar files; here's what I got:

with patch: 415, 417, 418 msec per loop
without patch: 414, 419, 417 msec

As expected, it doesn't make noticeable difference.

2850 (~/temp2/temp) % ~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 415 msec per loop
2850 (~/temp2/temp) % ~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 417 msec per loop
2850 (~/temp2/temp) % ~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 418 msec per loop
2850 (~/temp2/temp) % r                                      --INS--
~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 414 msec per loop
2852 (~/temp2/temp) % r                                      --INS--
~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 419 msec per loop
2854 (~/temp2/temp) % r                                      --INS--
~/opensource/cpython2/python.exe -m timeit -c 'import tarfile' 'for x in range(1000): tarfile.open("./b%s.tar"%x)'
1 loop, best of 5: 417 msec per loop

@akulakov
Copy link
Contributor

Please update the PR title and the news entry.

For the unit tests, keep both of them for now. A core developer will need to review the PR, so I would wait for them to make this call.

@akulakov
Copy link
Contributor

I'm not a bit concerned that users may be relying on the initial seek of open(). I will read about the tar spec, perhaps there's a clean way to keep the initial seek but leave it for additional open()'s.

@akulakov
Copy link
Contributor

I do think it's plausible that users may be relying on open() seeking to 512 bytes, so to minimize the backwards incompatibility change, it's best to allow the first open() to go to 512 bytes, but subsequent open() will reset to 0 first and then go to 512 again.

If users are relying on multiple opens to seek further, they are relying on buggy behaviour and IMO it's ok to fix that.

Note that when open fails, we still always want to restore original position. It's only on success that special casing of 512 position comes into play.

I did some testing and the following logic works:

                try:
                    if fileobj is not None and saved_pos==BLOCKSIZE:
                        fileobj.seek(0)
                    return func(name, "r", fileobj, **kwargs)

It's a bit ugly to hardcode the initial seek position like this, it may be someone more familiar with module internals will suggest a better fix.

@amateja amateja changed the title bpo-44289: Keep argument file object's current position in tarfile.is_tarifle bpo-44289: Keep argument file object's current position in tarfile.open Dec 13, 2021
@akulakov
Copy link
Contributor

Andrzej: I have some spare time right now, if you prefer I can finish the PR and list you as a co-author (because I would have to make it into a new PR). If you are still planning to finish this, no rush.

@amateja
Copy link
Contributor Author

amateja commented Jan 24, 2022

I'm more than happy to work further on this case. I thought we are waiting for someone more familiar with tar internals to comment on your proposal.

@amateja
Copy link
Contributor Author

amateja commented Jan 26, 2022

Hi @akulakov,

I've read some GNU tar manual to understand your proposal better and followed your suggestion with some cosmetic tuning.

Your proposal

                if fileobj is not None:
                    saved_pos = fileobj.tell()
                try:
                    if fileobj is not None and saved_pos == BLOCKSIZE:
                        fileobj.seek(0)
                    return func(name, "r", fileobj, **kwargs)

and mine

                if fileobj is not None:
                    saved_pos = fileobj.tell()
                    if saved_pos == BLOCKSIZE:
                        fileobj.seek(0)
                try:
                    return func(name, "r", fileobj, **kwargs)

Your fix is just great for uncompressed archives but it does not work for any supported compression methods. Each compression method changes fileobj position to value other than 512 bytes (in my tests it was in range 121-529904 depending on compression type and archived files size).

Following case brought me here:

>>> import tarfile
>>> 
>>> with open('spam', 'rb') as fileobj:
...     # here fileobj.tell() == 0
...     if tarfile.is_tarfile(fileobj):
...         # here fileobj.tell() != 0
...         archive = tarfile.open(fileobj=fileobj)
...         # therefore listing archive fails
...         print(archive.getmembers())
... 
[]

and I have some gut feeling that we should not modify fileobj position or even restore it because compressors make a lot of harm to fileobj position making above scenario to fail. So I totally understand your reasoning and yet I think we should stick to current fix.

Please use my hacky test method for CommonReadTest class if you'd like to verify this on your own.

    def test_is_tarfile_affects_getmembers(self):
        with tarfile.open(name=self.tarname) as f:
            refer = len(f.getmembers())
        with open(self.tarname, "rb") as fileobj:
            self.assertTrue(tarfile.is_tarfile(fileobj))
            pos = fileobj.tell()
            with tarfile.open(fileobj=fileobj) as tar:
                num_files = len(tar.getmembers())
                self.assertEqual(num_files, refer, pos)

I'm open for some other proposals and really looking forward to your feedback.

@akulakov
Copy link
Contributor

Andrzej: good catch about compressed files, my fix won't work for them.

I think it's not clear if breaking backwards compatibility here is worth it.

One other idea we can explore is setting the saved position on a fileobj and then checking for it on open when there's a fileobj. hasattr is an expensive operation but here we deal with IO of files which should be much more expensive. One thing I'm not sure if it's possible to set attributes on all types of file objects.

What do you think?

@akulakov
Copy link
Contributor

Andrzej: can you also check if on subsequent opens, the seek is always a multiple of first seek? That was the case with uncompressed.

If so, we can do logic like: x=current pos; open(); y=current pos; if y==x*2: seek to x

@amateja
Copy link
Contributor Author

amateja commented Jan 31, 2022

Hi Andrei,

Let me start with answering your question. It turns out that calling subsequently open moves pointer forward by random number of bytes:

>>> import tarfile
>>> name = 'spam.tar.gz'
>>> f = open(name, 'rb')
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b8d4fd0>
>>> f.tell()
8202
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b9eb0a0>
>>> f.tell()
8714
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b8d4fd0>
>>> f.tell()
9226
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7bb86d60>
>>> f.tell()
9738
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b9eb220>
>>> f.tell()
10250
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b9d4fa0>
>>> f.tell()
10762
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b8d4fd0>
>>> f.tell()
11274
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7bb86d60>
>>> f.tell()
11786
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7b9eb0a0>
>>> f.tell()
12298
>>> tarfile.open(fileobj=f)
<tarfile.TarFile object at 0x7fae7bb86ac0>
>>> f.tell()
12810

So for now we have few options here:

  1. look for unclear and hacky way to fix this issue keeping most of backward compatibility
    This is not very easy as compressed files behaves oddly and we are making this module over-complicated.
  2. fall back to my original proposal with fix in is_tarfile function only
    Not the best but at least it covers quite reasonable scenario - check if a file-like object is a tar file and open if it is.
  3. disregard this issue completely
    I can live without it and most of our community will. We are dealing with checking / opening tar files differently depending if they are described by name as a string of file-like object. This is an inconsistency to me but we can cover this with some mention in the documentation.
  4. fix it completely with relevant documentation
    Slightly rough but I don't see those use cases you wrote about. I cannot imagine why someone might rely on modified state of file-like object argument. I'm sure there might be some good reason for that. You wrote that multiple opens to seek further is buggy behavior and I agree. Knowing that compressed files behaves even more unexpectedly I would say no one should use it that way. I verified that comparing pointer's position in multiple opens vs. calling Tarfile.next() and is unrelated so this is not it.

I like option 4 and option 2 is not too shabby either. I would also back option 3 up if there is no other way and I'm not very fond of option 1 in it's current state.

@amateja
Copy link
Contributor Author

amateja commented Jan 31, 2022

Oh wait... I did some more math. First open moves pointer to some location but second and next opens moves it forward by 512 bytes again. So at least we know this "unclear and hacky way".

@akulakov
Copy link
Contributor

akulakov commented Feb 2, 2022

In my testing the pointer doesn't always advance, with two tiny text files, gzip format, it advanced from 0 to 202 but with subsequent open()'s, it stayed at 202.

One additional option we have is to wait for more user feedback and for somebody to come up with a better fix. I will think more about this later tonight / tomorrow.

@akulakov
Copy link
Contributor

akulakov commented Feb 2, 2022

Andrzej: after some discussion on discord, I think the option 2) (your original solution), is the best way to address this.

@amateja amateja changed the title bpo-44289: Keep argument file object's current position in tarfile.open bpo-44289: Keep argument file object's current position in tarfile.is_tarifle Feb 4, 2022
@amateja
Copy link
Contributor Author

amateja commented Feb 4, 2022

Aye aye. Please find recent commit reverted.

@@ -0,0 +1 @@
Keep argument file object's current position in tarfile.is_tarifle method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

News entry can be more descriptive, so that users can easily tell if they need to update their codebase;
something along these lines:

Fix an issue with :meth:~tarfile.is_tarfile method when using fileobj argument: position
in the fileobj was advanced forward which made it unreadable with :meth:tarfile.TarFile.open.

Copy link
Contributor Author

@amateja amateja Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to come up with something for new entry but your version is just right. I hope you don't mind me using it, do you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, go ahead!

@@ -0,0 +1 @@
Fix an issue with :meth:`~tarfile.is_tarfile` method when using _fileobj_ argument: position in the _fileobj_ was advanced forward which made it unreadable with :meth:`tarfile.TarFile.open`.
Copy link
Contributor

@akulakov akulakov Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In documentation, bold highlight (with asterisks) is generally used for parameter names, for example see https://github.com/python/cpython/edit/main/Doc/whatsnew/3.10.rst line 913

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thanks. I've also noticed that here and there monospace is used and I'm not sure why.

https://github.com/python/cpython/blame/main/Doc/whatsnew/3.10.rst#L994
https://github.com/python/cpython/blame/main/Doc/whatsnew/3.10.rst#L1015

Fixed already. Happy to follow guidelines.

@akulakov
Copy link
Contributor

akulakov commented Feb 8, 2022

Aside from one small nitpick in the news file, LGTM!
I also tested with a tarfile locally and it works.

@gvanrossum As you looked at it a couple of days ago, maybe you can review?

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is what I expected it would look like. Thanks for the thorough tests! I will merge this now.

@gvanrossum
Copy link
Member

(Whoops, let's run the tests before merging rashly. :-)

Copy link
Contributor

@kumaraditya303 kumaraditya303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the PR title has a typo which should be fixed before merging.

@gvanrossum gvanrossum changed the title bpo-44289: Keep argument file object's current position in tarfile.is_tarifle bpo-44289: Keep argument file object's current position in tarfile.is_tarfile Feb 9, 2022
@gvanrossum
Copy link
Member

(And the tests need to run one more time.)

@amateja
Copy link
Contributor Author

amateja commented Feb 9, 2022

Many thanks to @akulakov for going together through this issue with me. Thank you @gvanrossum for reviewing and @kumaraditya303 for approving :)

Looking forward to working with you again.

@gvanrossum gvanrossum merged commit 128ab09 into python:main Feb 9, 2022
@akulakov
Copy link
Contributor

Andrzej: you're very welcome, thanks for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants