Commit b4b46e3

GH-45092: [C++][Parquet] Add GetReadRanges function to FileReader (#45093)
### Rationale for this change

For some consumers, it is convenient to expose a way to retrieve the necessary byte ranges of a parquet file to read specific column chunks from row groups without having to go through a full `ReadRangeCache`. Ultimately, it's a fairly simple function since we already have all the infrastructure implemented to compute these ranges.

### What changes are included in this PR?

A single function added to `parquet::FileReader` named `GetReadRanges` which computes and retrieves the coalesced read ranges for specified row groups and column indices.

* GitHub Issue: #45092

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
1 parent 19f0652 commit b4b46e3

2 files changed: +42 -0 lines changed


cpp/src/parquet/file_reader.cc

Lines changed: 16 additions & 0 deletions
@@ -29,6 +29,7 @@
 #include "arrow/io/caching.h"
 #include "arrow/io/file.h"
 #include "arrow/io/memory.h"
+#include "arrow/io/util_internal.h"
 #include "arrow/util/bit_util.h"
 #include "arrow/util/checked_cast.h"
 #include "arrow/util/future.h"
@@ -400,6 +401,21 @@ class SerializedFile : public ParquetFileReader::Contents {
     PARQUET_THROW_NOT_OK(cached_source_->Cache(ranges));
   }
 
+  ::arrow::Result<std::vector<::arrow::io::ReadRange>> GetReadRanges(
+      const std::vector<int>& row_groups, const std::vector<int>& column_indices,
+      int64_t hole_size_limit, int64_t range_size_limit) {
+    std::vector<::arrow::io::ReadRange> ranges;
+    for (int row_group : row_groups) {
+      for (int col : column_indices) {
+        ranges.push_back(
+            ComputeColumnChunkRange(file_metadata_.get(), source_size_, row_group, col));
+      }
+    }
+
+    return ::arrow::io::internal::CoalesceReadRanges(std::move(ranges), hole_size_limit,
+                                                     range_size_limit);
+  }
+
   ::arrow::Future<> WhenBuffered(const std::vector<int>& row_groups,
                                  const std::vector<int>& column_indices) const {
     if (!cached_source_) {
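
The new method collects one byte range per (row group, column) pair and hands the list to `CoalesceReadRanges`. To illustrate what the two limits control, here is a simplified, self-contained sketch of range coalescing. It is not Arrow's implementation: `ByteRange` and `CoalesceSketch` are hypothetical names introduced only for this example, and the merge rule is an assumption based on the parameter descriptions in this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ::arrow::io::ReadRange (offset and length in bytes).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Simplified coalescing sketch (an assumption, not Arrow's actual algorithm):
// after sorting by offset, a range is merged into the previous one when the
// gap between them is at most hole_size_limit and the merged range would not
// exceed range_size_limit.
std::vector<ByteRange> CoalesceSketch(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  assert(range_size_limit > hole_size_limit);

  // Drop empty ranges, then order the rest by starting offset.
  ranges.erase(std::remove_if(ranges.begin(), ranges.end(),
                              [](const ByteRange& r) { return r.length <= 0; }),
               ranges.end());
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) { return a.offset < b.offset; });

  std::vector<ByteRange> out;
  for (const ByteRange& r : ranges) {
    if (!out.empty()) {
      ByteRange& last = out.back();
      const int64_t last_end = last.offset + last.length;
      const int64_t merged_end = std::max(last_end, r.offset + r.length);
      const bool small_hole = r.offset - last_end <= hole_size_limit;
      const bool fits = merged_end - last.offset <= range_size_limit;
      if (small_hole && fits) {
        last.length = merged_end - last.offset;  // extend the previous range
        continue;
      }
    }
    out.push_back(r);  // start a new coalesced range
  }
  return out;
}
```

With a 64-byte hole limit and a 1024-byte range limit, the ranges `{0, 100}` and `{150, 100}` would merge into `{0, 250}` under this sketch, while a range starting at offset 10000 would stay separate.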

cpp/src/parquet/file_reader.h

Lines changed: 26 additions & 0 deletions
@@ -201,6 +201,32 @@ class PARQUET_EXPORT ParquetFileReader {
                  const ::arrow::io::IOContext& ctx,
                  const ::arrow::io::CacheOptions& options);
 
+  /// Retrieve the list of byte ranges that would need to be read to retrieve
+  /// the data for the specified row groups and column indices.
+  ///
+  /// A reader can optionally call this if they wish to handle their own
+  /// caching and management of file reads (or offload them to other readers).
+  /// Unlike PreBuffer, this method will not perform any actual caching or
+  /// reads, instead just using the file metadata to determine the byte ranges
+  /// that would need to be read if you were to consume the entirety of the column
+  /// chunks for the provided columns in the specified row groups.
+  ///
+  /// If row_groups or column_indices are empty, then the result of this will be empty.
+  ///
+  /// hole_size_limit represents the maximum distance, in bytes, between two
+  /// consecutive ranges; beyond this value, ranges will not be combined. The default
+  /// value is 1MB.
+  ///
+  /// range_size_limit is the maximum size in bytes of a combined range; if combining
+  /// two consecutive ranges would produce a range larger than this, they are not
+  /// combined. The default values is 64MB. This *must* be larger than hole_size_limit.
+  ///
+  /// This will not take into account page indexes or any other predicate push down
+  /// benefits that may be available.
+  ::arrow::Result<std::vector<::arrow::io::ReadRange>> GetReadRanges(
+      const std::vector<int>& row_groups, const std::vector<int>& column_indices,
+      int64_t hole_size_limit = 1024 * 1024, int64_t range_size_limit = 64 * 1024 * 1024);
+
   /// Wait for the specified row groups and column indices to be pre-buffered.
   ///
   /// After the returned Future completes, reading the specified row
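
Since `GetReadRanges` only reports byte ranges and performs no I/O, the caller is responsible for fetching the bytes. The following is a minimal sketch of what that consumer side might look like, assuming the ranges have already been obtained (for example from `reader->GetReadRanges(row_groups, column_indices)`). `ByteRange`, `FetchRanges`, and `Demo` are hypothetical names for illustration, and a `std::istream` stands in for whatever storage layer the consumer actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ::arrow::io::ReadRange (offset and length in bytes).
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Read each requested range from `source` into its own buffer.
// The caller owns all I/O and caching decisions; nothing here is Arrow API.
std::vector<std::string> FetchRanges(std::istream& source,
                                     const std::vector<ByteRange>& ranges) {
  std::vector<std::string> buffers;
  buffers.reserve(ranges.size());
  for (const ByteRange& r : ranges) {
    assert(r.offset >= 0 && r.length >= 0);
    std::string buf(static_cast<std::size_t>(r.length), '\0');
    source.clear();  // reset EOF/fail bits left by a previous short read
    source.seekg(r.offset);
    source.read(&buf[0], r.length);
    // Keep only the bytes actually read (a range may run past end of file).
    buf.resize(static_cast<std::size_t>(source.gcount()));
    buffers.push_back(std::move(buf));
  }
  return buffers;
}

// Example: fetch two ranges from an in-memory "file".
std::vector<std::string> Demo() {
  std::istringstream file("0123456789abcdef");
  std::vector<ByteRange> ranges = {{2, 3}, {10, 4}};
  return FetchRanges(file, ranges);
}
```

In a real consumer the stream would likely be replaced by object-store or network reads, possibly issued in parallel, which is the kind of custom read management the doc comment describes.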
