Skip to content

main/streams/streams: use copy_file_range() on Linux #8413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 23, 2022

Conversation

MaxKellermann
Copy link
Contributor

copy_file_range() is a Linux-specific system call which allows
efficient copying between two file descriptors, eliminating the need
to transfer data from the kernel to userspace and back. For
networking file systems like NFS and Ceph, it even eliminates copying
data to the client, and local filesystems like Btrfs and XFS can
create shared extents.

@@ -1555,6 +1555,81 @@ PHPAPI zend_result _php_stream_copy_to_stream_ex(php_stream *src, php_stream *de
return SUCCESS;
}

#ifdef __linux__
Copy link
Member

@devnexen devnexen Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to detect the support at configure time instead (ie freebsd supports it too). ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I thought about that, but feared it would break some obscure system where copy_file_range exists but does something else.
I've amended the patch to include a configure-time check. (And a fix for the CI EOVERFLOW failure, I hope.)

@MaxKellermann
Copy link
Contributor Author

After the amendmend which enabled a configure-time check for copy_file_range, the FreeBSD CI test fails:

001+ Fatal error: Uncaught UnexpectedValueException: phar error: signature cannot be verified: broken signature in zip-based phar "/tmp/cirrus-ci-build/ext/phar/tests/zip/largezip.2.phar.zip.php" in /tmp/cirrus-ci-build/ext/phar/tests/zip/largezip.php:12

It passes on my Linux box, but I don't have FreeBSD. Can somebody with FreeBSD please help and check why this fails?

@devnexen
Copy link
Member

if no one does, will look into it in the next hour.

@devnexen
Copy link
Member

seems on freebsd it can t copy a so large file, if I decrease by two the size the test pass.

php_stream_cast(dest, PHP_STREAM_AS_FD, (void*)&dest_fd, 0);

/* clamp to INT_MAX to avoid EOVERFLOW */
const size_t cfr_max = MIN(maxlen, (size_t)INT_MAX);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be SSIZE_MAX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure what the kernel's real limit was; I remember long ago I read that Linux's limit for all I/O was 2 GB which is INT_MAX. I'll find out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ... more complicated than that.
https://github.com/torvalds/linux/blob/v5.17/fs/read_write.c#L1434=

	/* Ensure offsets don't wrap. */
	if (pos_in + count < pos_in || pos_out + count < pos_out)
		return -EOVERFLOW;

(Note that this code is buggy because the compiler is allowed to optimize this check away.)
(Also note that this check is rather clumsy because the count variable is clamped to the remaining size of the file only after this check.)
SSIZE_MAX is 2^63-1 on 64 bit CPUs, which means it's safe until you encounter a file of 2^63 bytes, which will not happen for a very loooong time. Should be safe enough to say "copy as much as you can", I guess.

@MaxKellermann
Copy link
Contributor Author

seems on freebsd it can t copy a so large file, if I decrease by two the size the test pass.

As you suggested, SSIZE_MAX is the best value for Linux, but what is FreeBSD's limit?
It's so annoying that we need to deal with kernel-specific limits for "copy as much as you can"; I wonder why the EOVERFLOW condition exists at all - the kernel should just clamp the value instead of complaining?

@devnexen
Copy link
Member

it seems it s the same limit on freebsd however looking at the source most of code consumers use it as much as they can then fall back to the write/read workflow if it fails.

@MaxKellermann
Copy link
Contributor Author

MaxKellermann commented Apr 21, 2022

Then changing the limit to SSIZE_MAX would not fix the FreeBSD problem, because INT_MAX is never larger than that. The FreeBSD problem must be something else, or it has a smaller limit.

@MaxKellermann
Copy link
Contributor Author

Just for correctness, I have force-pushed changing INT_MAX to SSIZE_MAX.
My prediction was that this doesn't fix the FreeBSD problem - but I don't know what does, so I still need help.
(Thanks @devnexen so far)

@devnexen
Copy link
Member

apparently the zip header gets corrupted after the copy in the freebsd case, copy_file_range does not copy the full data in one shot thus needs to copy chunk by chunk maybe.

@MaxKellermann
Copy link
Contributor Author

apparently the zip header gets corrupted after the copy in the freebsd case, copy_file_range does not copy the full data in one shot thus needs to copy chunk by chunk maybe.

Hmmm, my change falls back to old-style copying if copy_file_range() is partial:

php-src/main/streams/streams.c

Lines 1595 to 1601 in a12d33e

if ((maxlen != PHP_STREAM_COPY_ALL && nbytes == maxlen) ||
php_stream_eof(src)) {
/* the whole request was satisfied or
end-of-file reached - done */
*len = haveread;
return SUCCESS;
}

I tried to reproduce the problem by clamping to very small valus; for example, clamping to just 1:

3521314 copy_file_range(4, NULL, 7, NULL, 1, 0) = 1
3521314 newfstatat(4, "", {st_mode=S_IFREG|0600, st_size=1157, ...}, AT_EMPTY_PATH) = 0
3521314 read(4, "!/usr/local/stow/php-was/bin/php"..., 8192) = 1156
3521314 write(7, "!/usr/local/stow/php-was/bin/php"..., 1083) = 1083
3521314 lseek(6, 0, SEEK_SET)           = 0

This one bytes is successfully copied, and the operation is continued with classic read/write. Unit tests succeed that way.

This still isn't the real problem, I think.

@devnexen
Copy link
Member

I m guessing flags must remain 0 for Linux but for FreeBSD can you give 0x01000000 value a go just to see ?

@MaxKellermann
Copy link
Contributor Author

I m guessing flags must remain 0 for Linux but for FreeBSD can you give 0x01000000 value a go just to see ?

What does flag 0x01000000 mean?
Both Linux and FreeBSD manpages of that system call say flags must be zero. There are no documented flags in either kernel.

@devnexen
Copy link
Member

devnexen commented Apr 21, 2022

Indeed not documented really but it is to return after 1 second. The reason I ask is it worked for me locally but wanted to confirm.

@MaxKellermann
Copy link
Contributor Author

OK, force-pushed with your suggested flag. Let's see if that makes the FreeBSD CI happy.

That still isn't a proper solution. I suggest disabling copy_file_range() support on FreeBSD for now and merge this PR (if everybody likes my code and there are no other complaints); after that, investigate&fix the FreeBSD-specific problem, and re-enable it.

@devnexen
Copy link
Member

Yes fair enough.

@MaxKellermann
Copy link
Contributor Author

"All checks have passed" with this kludge ... interesting!

PHP maintainers, I'm waiting for instructions on how to proceed to get this PR merged. Merge with this kludge? Exclude FreeBSD from this feature for now, fix FreeBSD later? First fix FreeBSD properly (I need help with that!), then merge?

@devnexen
Copy link
Member

Exclude FreeBSD from this feature for now, fix FreeBSD later
I would vote for this one.

@MaxKellermann
Copy link
Contributor Author

Exclude FreeBSD from this feature for now, fix FreeBSD later I would vote for this one.

Aye. Force-pushed with #ifndef __FreeBSD__ and a TODO comment.

@Girgias
Copy link
Member

Girgias commented Apr 23, 2022

Would it be possible to write a new test file for this case? I know some tests files are very large but that's something we try to prevent nowadays (as it makes troubleshooting harder see: https://qa.php.net/write-test.php#howbig).

Otherwise not an expert but it LGTM

Verify that the file offset has been moved by exactly the amount.
copy_file_range() is a Linux-specific system call which allows
efficient copying between two file descriptors, eliminating the need
to transfer data from the kernel to userspace and back.  For
networking file systems like NFS and Ceph, it even eliminates copying
data to the client, and local filesystems like Btrfs and XFS can
create shared extents.
@MaxKellermann
Copy link
Contributor Author

@Girgias moved the new "interleaved" tests to a new test file.

@Girgias
Copy link
Member

Girgias commented Apr 23, 2022

@Girgias moved the new "interleaved" tests to a new test file.

If CI is happy I'll merge it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants