
Add support for glob string in datafusion-cli query #16332


Open: wants to merge 11 commits into main
Conversation

a-agmon

@a-agmon a-agmon commented Jun 8, 2025

Partly closes #16303

Introduces a glob() table function that allows running queries on multiple files, for example:

 SELECT id FROM glob('s3://tests/data/file-a*.csv');
 SELECT id FROM glob('s3://tests/*/*.csv');

Note that the latter statement includes two glob layers (two wildcards), which only work if you enable

SET datafusion.execution.listing_table_ignore_subdirectory = false;
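To illustrate why each `*` covers a single path segment (and hence why `s3://tests/*/*.csv` needs two wildcard layers and the subdirectory setting above), here is a minimal, illustrative matcher in plain Rust. This is not DataFusion's implementation; `glob_match` is a hypothetical helper written for this sketch only:

```rust
// Minimal sketch (not DataFusion's code): match a glob pattern against a
// path, where `*` matches any run of characters except `/`. This mirrors
// why `s3://tests/*/*.csv` needs two wildcards: each `*` stays within one
// path component.
fn glob_match(pattern: &str, path: &str) -> bool {
    fn inner(p: &[u8], s: &[u8]) -> bool {
        match (p.first(), s.first()) {
            // Both pattern and path consumed: match.
            (None, None) => true,
            (Some(&b'*'), _) => {
                // `*` matches zero characters, or consumes one non-`/`
                // character and keeps the `*` active.
                inner(&p[1..], s)
                    || s.first().map_or(false, |&c| c != b'/' && inner(p, &s[1..]))
            }
            // Literal characters must match exactly.
            (Some(&pc), Some(&sc)) if pc == sc => inner(&p[1..], &s[1..]),
            _ => false,
        }
    }
    inner(pattern.as_bytes(), path.as_bytes())
}

fn main() {
    // One wildcard layer: matches files directly under the prefix.
    assert!(glob_match("tests/data/file-a*.csv", "tests/data/file-a1.csv"));
    // Two wildcard layers: the first `*` matches the subdirectory name.
    assert!(glob_match("tests/*/*.csv", "tests/data/file.csv"));
    // `*` does not cross `/`, so a deeper file does not match.
    assert!(!glob_match("tests/*.csv", "tests/data/file.csv"));
}
```

Since a single `*` cannot cross a `/`, reaching files in subdirectories requires an extra wildcard segment per directory level, which is the case the `listing_table_ignore_subdirectory = false` setting enables.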

Integration tests were added to cover these scenarios.

a-agmon added 4 commits June 8, 2025 15:21
- Introduced a new `GlobFunc` to allow users to query files using glob patterns, enhancing the flexibility of file access in SQL queries.
- Updated `Cargo.toml` to include the `glob` dependency.
- Registered the `glob` function in the main execution context.
- Enhanced `functions.rs` with necessary imports and implementation details for glob pattern handling.
…function handling

- Renamed the `expr_to_literal` function to `as_utf8_literal` for clarity.
- Enhanced error messages for better user guidance.
- Streamlined URL handling in the `GlobFunc` implementation, improving the parsing logic for glob patterns.
- Simplified file format determination and schema inference processes.
- Removed debug print statement from the GlobFunc implementation.
- Added SQL integration tests for the glob function, covering various scenarios including CSV and JSON file queries, data aggregation, and glob pattern matching.
- Created snapshot files to validate the output of the glob function in different test cases.
@a-agmon a-agmon changed the title Add support for glob string in datafusion-cli Add support for glob string in datafusion-cli query Jun 8, 2025
SELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv');

-- Test 5: Glob pattern with wildcard - test actual glob functionality
SELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv');
Contributor


Should we introduce a new function? Can we reuse the current model?

what should be the behavior if there are mixed CSV/JSON/Parquet files in the folder?

Author

@a-agmon a-agmon Jun 8, 2025


We can use the current model, but I think it will require touching some core modules within datafusion.

Re the second question, I think mixed file types should not be supported.

Author

@a-agmon a-agmon Jun 8, 2025


By the way, the current implementation already supports the following, but only for local files (as it uses ::parse()):

CREATE EXTERNAL TABLE logs 
STORED AS CSV 
LOCATION '/data/*_small.csv';

Contributor


Another possibility would be to intercept the CREATE EXTERNAL TABLE command in datafusion-cli itself

For example, similarly to how it peeks here:

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &plan {
    // To support custom formats, treat error as None
    let format = config_file_type_from_str(&cmd.file_type);
    register_object_store_and_config_extensions(
        ctx,
        &cmd.location,
        &cmd.options,
        format,
    )
    .await?;
}

We could implement a special handler in datafusion-cli rather than use the default one in SessionContext:

DdlStatement::CreateExternalTable(cmd) => {
    (Box::pin(async move { self.create_external_table(&cmd).await })
        as std::pin::Pin<Box<dyn futures::Future<Output = _> + Send>>)
        .await
}

a-agmon and others added 6 commits June 8, 2025 22:18
- Updated error handling in the GlobFunc to use `plan_datafusion_err` for improved clarity and consistency in error messages.
- Streamlined URL parsing and glob pattern handling for better readability and maintainability.
…e .last() with .next_back() for better performance on DoubleEndedIterator - Remove unused DataFusionError import
Contributor

@alamb alamb left a comment


First of all, thank you so much @a-agmon -- this looks very cool.

In general I think in DataFusion we try to follow some existing implementation rather than innovate new syntax or dialect when possible.

One question I had was whether you considered adding a read_parquet style function as proposed in #16303 rather than a glob function.

I also think following another implementation like read_parquet might lead to a better user experience as:

  1. Some users will already know how it works
  2. Some other system has already designed out the kinks (e.g how to read multiple specific files)


fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
    match expr {
        Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
Contributor


Minor: maybe this could use the try_as_str function (which would also handle other literal types): https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
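To show the shape of that refactor, here is an illustrative sketch. The `ScalarValue` enum below is a simplified stand-in defined for this example (the real DataFusion type has many more variants); only the `Option<Option<&str>>` return shape mirrors the real `ScalarValue::try_as_str`:

```rust
// Simplified stand-in for DataFusion's ScalarValue, defined here only to
// illustrate the suggested refactor.
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Int64(Option<i64>),
}

impl ScalarValue {
    // Mirrors the shape of the real ScalarValue::try_as_str:
    // outer None = not a string type, inner None = NULL string.
    fn try_as_str(&self) -> Option<Option<&str>> {
        match self {
            ScalarValue::Utf8(s) | ScalarValue::LargeUtf8(s) => Some(s.as_deref()),
            _ => None,
        }
    }
}

// A try_as_str-based accessor accepts any string-typed variant instead of
// matching Utf8 alone, and distinguishes NULL from a wrong type.
fn as_utf8_literal<'a>(v: &'a ScalarValue, arg_name: &str) -> Result<&'a str, String> {
    match v.try_as_str() {
        Some(Some(s)) => Ok(s),
        Some(None) => Err(format!("argument '{arg_name}' must not be NULL")),
        None => Err(format!("argument '{arg_name}' must be a string literal")),
    }
}

fn main() {
    let v = ScalarValue::LargeUtf8(Some("s3://tests/*.csv".to_string()));
    assert_eq!(as_utf8_literal(&v, "pattern"), Ok("s3://tests/*.csv"));
    assert!(as_utf8_literal(&ScalarValue::Int64(Some(1)), "pattern").is_err());
    assert!(as_utf8_literal(&ScalarValue::Utf8(None), "pattern").is_err());
}
```

The benefit over matching `Utf8` directly is that `LargeUtf8` (and, in the real type, `Utf8View`) literals are accepted for free.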

SELECT COUNT(*) AS cars_count FROM glob('../datafusion/core/tests/data/cars.csv');

-- Test 2: Data aggregation from CSV file - verify actual data reading
SELECT car, COUNT(*) as count FROM glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
Contributor


I think another use case that @robtandy had was "a list of multiple files" -- is there some way to select exactly two files? Something like

glob(['../datafusion/core/tests/data/cars.csv', '../datafusion/core/tests/data/trucks.csv'])

Perhaps 🤔

@a-agmon
Author

a-agmon commented Jun 11, 2025

@alamb - thank you very much for the generous comments. I appreciate it.
Re naming: I completely agree. I was just wondering whether it's better to introduce one function that infers the file type (like read() or glob()) rather than a function for each file type (read_parquet, read_csv, etc.). You are correct that the latter is more common, so I will go with it.
Re the other comments: will review and handle. Thanks.

@alamb
Contributor

alamb commented Jun 13, 2025

> rather than a function for each file type (read_parquet, read_csv, etc). You are correct that the latter is more common so will for this.

I think the reason that DuckDB et al. use a function for each file type is that it simplifies option handling (there are many options that apply to Parquet but not to CSV).

That being said, adding a function like read_file(..) or read_data(...) that handles all file types might be a reasonable thing to do in datafusion-cli, as then you could probably reuse most of the ListingTable code.
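A single read_file(...)-style function would need to infer the format from the path before delegating to the listing machinery. The sketch below shows one plausible way to do that; `FileFormat` and `infer_format` are hypothetical names for this example, not DataFusion APIs:

```rust
// Hypothetical sketch: inferring a file format from a path or glob
// pattern's extension, as a read_file(...)-style table function might do
// before handing off to ListingTable. Not DataFusion's actual API.
#[derive(Debug, PartialEq)]
enum FileFormat {
    Csv,
    Json,
    Parquet,
}

fn infer_format(path: &str) -> Result<FileFormat, String> {
    // Take the final extension; a glob like "data/*.csv" still ends in ".csv".
    match path.rsplit('.').next() {
        Some("csv") => Ok(FileFormat::Csv),
        Some("json") | Some("ndjson") => Ok(FileFormat::Json),
        Some("parquet") => Ok(FileFormat::Parquet),
        _ => Err(format!("cannot infer file format from path '{path}'")),
    }
}

fn main() {
    assert_eq!(infer_format("s3://tests/data/file-a*.csv"), Ok(FileFormat::Csv));
    assert_eq!(infer_format("/data/logs/*.parquet"), Ok(FileFormat::Parquet));
    // Unknown extensions would need an explicit format option from the user.
    assert!(infer_format("/data/unknown.bin").is_err());
}
```

The per-format functions (read_parquet, read_csv) avoid the ambiguous-extension case entirely, which matches the option-handling point above.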

Successfully merging this pull request may close these issues.

Support reading multiple parquet files via datafusion-cli
3 participants