file_get_contents() and file_put_contents() fail with data >=2GB on macOS & BSD #18753
The buggy behavior
macOS (arm64)

Running the following code produces an error:
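A minimal sketch of the failing call, assuming a regular file of roughly 5GB at a placeholder path (this is not the verbatim snippet from the report):

```php
<?php
// Placeholder path; any regular file >= 2GB reproduces the failure on macOS.
$path = '/path/to/arxiv-metadata.json';

$data = file_get_contents($path);

// On macOS this emits a read warning and $data comes back as a 0-byte string.
var_dump(gettype($data)); // string(6) "string"
var_dump(strlen($data));  // int(0)
```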
The function on macOS returns a 0-byte string, as verified by gettype(file_get_contents(...)) and strlen(file_get_contents(...)). The file is almost 5GB in size. Note the size reported: it appears that macOS tries to read exactly 8,192 bytes past the file size? This is probably not related; see below.

Comparing to Linux (x86_64)

On a fully updated Debian 13.0 the result is different:
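For comparison, a hedged sketch of the same call on Linux, where the full file is read without a warning:

```php
<?php
// Same placeholder path as above; on Debian the whole ~5GB file is returned.
$data = file_get_contents('/path/to/arxiv-metadata.json');

var_dump(gettype($data)); // string(6) "string"
var_dump(strlen($data));  // the full file size in bytes
```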
PHP Versions
macOS installed via Homebrew:
Linux:
Operating System
Looking for the culprit
How it fails?
While I am not a C developer, nor do I have great familiarity with the ZE codebase, I tried to take a crack at this. The error seems to be coming from php_stdiop_read() (php-src/main/streams/plain_wrapper.c, lines 446 to 448 at 359bb63).
Initially I thought it was about the 4GB size, or about the error reporting a size off by 8K from the real file size, but that doesn't seem to be the case. In fact, any read larger than or equal to 2GB will fail:
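A sketch of the boundary test, assuming two sparse files created beforehand (file names and the use of truncate are illustrative):

```php
<?php
// Illustrative boundary check, with files created beforehand via e.g.
//   truncate -s 2147483647 /tmp/under   (2GB - 1)
//   truncate -s 2147483648 /tmp/over    (exactly 2GB)
foreach (['/tmp/under', '/tmp/over'] as $path) {
    $data = @file_get_contents($path);
    printf("%s: read %d of %d bytes\n", $path, strlen((string) $data), filesize($path));
}
// On macOS the 2GB - 1 file is read fully, while the 2GB file fails.
```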
file_get_contents() fails only for regular files, regardless of the underlying filesystem (tested on regular APFS & an HFS+ ramdisk). The issue seems to be isolated to file_get_contents() only. My initial hunch of reads in chunks larger than SSIZE_MAX also led nowhere, as a single fread() is able to read the file as well:
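For instance, a single fread() of the whole file succeeds (a hedged sketch, not the exact snippet from the report):

```php
<?php
// Placeholder path to the same ~5GB file; one fread() call for the full length.
$path = '/path/to/arxiv-metadata.json';
$size = filesize($path);

$fp   = fopen($path, 'rb');
$data = fread($fp, $size); // reads the entire file without the warning
fclose($fp);

var_dump(strlen($data) === $size); // bool(true)
```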
The issue is also not related to an old bug 69824 of mine with variables >2GB, as on modern PHP versions creating a 5GB string (i.e. larger than the file) isn't a problem.
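A quick way to confirm that, assuming memory_limit permits a 5GB allocation:

```php
<?php
// Assumes memory_limit is lifted (e.g. -1); builds a 5GB string, larger than the file.
ini_set('memory_limit', '-1');

$s = str_repeat('a', 5 * 1024 ** 3);
var_dump(strlen($s)); // int(5368709120)
```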
I also couldn't replicate it using PHP code that doesn't use file_get_contents().

Why it fails?
If I'm reading the file_get_contents() implementation for files correctly, it will call _php_stream_copy_to_mem(), which then calls the universal _php_stream_read() that calls stream->ops->read() on the stream. I think that call on the stream is set to php_stdiop_read().
I suspected that read(3) is being called with the full $length, as passed to file_get_contents(). This points to the behavior of read(3) being different between Darwin and Linux.

I wrote a quick C reproducer and tested:
Linux accepts an arbitrary size to read(3) and simply returns the maximum amount possible (hmm, 2GB-4K??), which lets the stream logic handle the stitching. Darwin/XNU and BSD kernels instead immediately return EINVAL if the requested chunk size is larger than INT_MAX.

The same problem also affects file_put_contents() for the same reasons.

Possible fix?
This behavior appears to be known, as stream_set_chunk_size() errors out if the requested chunk size is > INT_MAX on all platforms (illustrated below). Moreover, while debugging I came full circle: php_stdiop_read() does clamp the max chunk/buffer to INT_MAX, but only on Windows. I think adding that clamping for macOS and BSD, in addition to Windows, is the simplest solution - a PR is provided.
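A quick illustration of that existing guard (the exact error type and wording may vary between PHP versions):

```php
<?php
// stream_set_chunk_size() refuses chunk sizes above INT_MAX on every platform,
// while the internal read path only clamps to INT_MAX on Windows.
$fp = fopen('php://temp', 'rb+');

stream_set_chunk_size($fp, 2 ** 31); // 2147483648 > INT_MAX: errors out
```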
Affected versions

The issue will only appear if the stream read buffer is set > INT_MAX, which in the case of file_get_contents() bisects to commit 6beee1a from #8547 that first landed in PHP 8.2.

Knowing this, I found this isn't a problem with just file_get_contents() but also fread(), as stream_set_read_buffer() doesn't guard this:
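A hedged sketch of what that looks like (path and sizes are illustrative):

```php
<?php
// Illustrative: raise the stream read buffer past INT_MAX, then request a 2GB read.
$fp = fopen('/path/to/big-file', 'rb');

stream_set_read_buffer($fp, 2 ** 31); // accepted; no INT_MAX guard here
$data = fread($fp, 2 ** 31);          // before the fix, this hits the same failing read()
fclose($fp);
```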
However, I don't think this needs to be guarded even for DX, as this is a case of a user shooting themselves in the foot. After the patch the code above will instead fail with Notice: fread(): Read of 2147483648 bytes failed with errno=9 Bad file descriptor.

Dataset
The exact file I encounter a problem with is available from Cornell University. You can get it directly via curl -L -o ~/Downloads/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv. However, after some digging I see it's not about this exact file, i.e. truncate -s 4694824521 big works too.