-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add support for glob string in datafusion-cli query #16332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Introduced a new `GlobFunc` to allow users to query files using glob patterns, enhancing the flexibility of file access in SQL queries. - Updated `Cargo.toml` to include the `glob` dependency. - Registered the `glob` function in the main execution context. - Enhanced `functions.rs` with necessary imports and implementation details for glob pattern handling.
…function handling - Renamed the `expr_to_literal` function to `as_utf8_literal` for clarity. - Enhanced error messages for better user guidance. - Streamlined URL handling in the `GlobFunc` implementation, improving the parsing logic for glob patterns. - Simplified file format determination and schema inference processes.
- Removed debug print statement from the GlobFunc implementation. - Added SQL integration tests for the glob function, covering various scenarios including CSV and JSON file queries, data aggregation, and glob pattern matching. - Created snapshot files to validate the output of the glob function in different test cases.
SELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv'); | ||
|
||
-- Test 5: Glob pattern with wildcard - test actual glob functionality | ||
SELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we introduce a new function? can we reuse current model?
what should be the behavior if there are mixed CSV/JSON/Parquet files in the folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the current model, but I think it will require touching some core modules within datafusion.
re the second question, I think that supporting multiple file types should not be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, the current implementation supports the following - but just for local files (as it uses ::parse()
)
CREATE EXTERNAL TABLE logs
STORED AS CSV
LOCATION '/data/*_small.csv';
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility would be to intercept the CREATE EXTERNAL TABLE
command in datafusion-cli
itself
For example, simliarly to how it peeks here:
datafusion/datafusion-cli/src/exec.rs
Lines 357 to 367 in 1d61f31
if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &plan { | |
// To support custom formats, treat error as None | |
let format = config_file_type_from_str(&cmd.file_type); | |
register_object_store_and_config_extensions( | |
ctx, | |
&cmd.location, | |
&cmd.options, | |
format, | |
) | |
.await?; | |
} |
We could implement a special handler in datafusion-cli rather than use the default one in SessionContext:
datafusion/datafusion/core/src/execution/context/mod.rs
Lines 669 to 672 in 1d61f31
DdlStatement::CreateExternalTable(cmd) => { | |
(Box::pin(async move { self.create_external_table(&cmd).await }) | |
as std::pin::Pin<Box<dyn futures::Future<Output = _> + Send>>) | |
.await |
- Updated error handling in the GlobFunc to use `plan_datafusion_err` for improved clarity and consistency in error messages. - Streamlined URL parsing and glob pattern handling for better readability and maintainability.
Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Oleks V <[email protected]>
…sion into feat-cli-sup-glob
…e .last() with .next_back() for better performance on DoubleEndedIterator - Remove unused DataFusionError import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you so much @a-agmon -- this looks very cool.
In general I think in DataFusion we try to follow some existing implementation rather than innovate new syntax or dialect when possible.
One question I had was if you considered adding a read_parquet
type function as proposed on #16303 rather than a glob
function.
I also think following another implementation like read_parquet
might lead to a better user experience as:
- Some users will already know how it works
- Some other system has already designed out the kinks (e.g how to read multiple specific files)
SELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv'); | ||
|
||
-- Test 5: Glob pattern with wildcard - test actual glob functionality | ||
SELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility would be to intercept the CREATE EXTERNAL TABLE
command in datafusion-cli
itself
For example, simliarly to how it peeks here:
datafusion/datafusion-cli/src/exec.rs
Lines 357 to 367 in 1d61f31
if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &plan { | |
// To support custom formats, treat error as None | |
let format = config_file_type_from_str(&cmd.file_type); | |
register_object_store_and_config_extensions( | |
ctx, | |
&cmd.location, | |
&cmd.options, | |
format, | |
) | |
.await?; | |
} |
We could implement a special handler in datafusion-cli rather than use the default one in SessionContext:
datafusion/datafusion/core/src/execution/context/mod.rs
Lines 669 to 672 in 1d61f31
DdlStatement::CreateExternalTable(cmd) => { | |
(Box::pin(async move { self.create_external_table(&cmd).await }) | |
as std::pin::Pin<Box<dyn futures::Future<Output = _> + Send>>) | |
.await |
|
||
fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> { | ||
match expr { | ||
Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Maybe thus could use he try_as_str
function (which would also handle other literal types) https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
SELECT COUNT(*) AS cars_count FROM glob('../datafusion/core/tests/data/cars.csv'); | ||
|
||
-- Test 2: Data aggregation from CSV file - verify actual data reading | ||
SELECT car, COUNT(*) as count FROM glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another usecase that @robtandy had was "a list of multiple files" -- like is there some way to select exactly two files? Something like
glob('[../datafusion/core/tests/data/cars.csv', '../datafusion/core/tests/data/trucks.csv', ])
Perhaps 🤔
@alamb - thank you very much for the generous comments. I appreciate it. |
I think the reason that DuckDB et al use a function for each file type is that it simplifies option handling (there are many options that are better suited for parquet that are not csv) That being said, adding a function like |
Partly closes #16303
Introduces glob() table function that allows running queries on multiple files, like:
note that the latter statement include 2 glob layers (2 wildcards) that work on only if you enable
Integration tests were added to test