
Add support for glob string in datafusion-cli query #16332


Open: wants to merge 11 commits into main
Conversation

a-agmon

@a-agmon a-agmon commented Jun 8, 2025

Partly closes #16303

Introduces a glob() table function that allows running queries on multiple files, for example:

 SELECT id FROM glob('s3://tests/data/file-a*.csv');
 SELECT id FROM glob('s3://tests/*/*.csv');

Note that the latter statement includes two glob layers (two wildcards), which only work if you enable

SET datafusion.execution.listing_table_ignore_subdirectory = false;
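To illustrate why each `*` covers a single path segment (and hence why `s3://tests/*/*.csv` needs two wildcard layers and the subdirectory setting above), here is a minimal, illustrative matcher in plain Rust. This is not DataFusion's implementation; `glob_match` is a hypothetical helper written for this sketch only:

```rust
// Minimal sketch (not DataFusion's code): match a glob pattern against a
// path, where `*` matches any run of characters except `/`. This mirrors
// why `s3://tests/*/*.csv` needs two wildcards: each `*` stays within one
// path component.
fn glob_match(pattern: &str, path: &str) -> bool {
    fn inner(p: &[u8], s: &[u8]) -> bool {
        match (p.first(), s.first()) {
            // Both pattern and path consumed: match.
            (None, None) => true,
            (Some(&b'*'), _) => {
                // `*` matches zero characters, or consumes one non-`/`
                // character and keeps the `*` active.
                inner(&p[1..], s)
                    || s.first().map_or(false, |&c| c != b'/' && inner(p, &s[1..]))
            }
            // Literal characters must match exactly.
            (Some(&pc), Some(&sc)) if pc == sc => inner(&p[1..], &s[1..]),
            _ => false,
        }
    }
    inner(pattern.as_bytes(), path.as_bytes())
}

fn main() {
    // One wildcard layer: matches files directly under the prefix.
    assert!(glob_match("tests/data/file-a*.csv", "tests/data/file-a1.csv"));
    // Two wildcard layers: the first `*` matches the subdirectory name.
    assert!(glob_match("tests/*/*.csv", "tests/data/file.csv"));
    // `*` does not cross `/`, so a deeper file does not match.
    assert!(!glob_match("tests/*.csv", "tests/data/file.csv"));
}
```

Since a single `*` cannot cross a `/`, reaching files in subdirectories requires an extra wildcard segment per directory level, which is the case the `listing_table_ignore_subdirectory = false` setting enables.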

Integration tests were added to cover these scenarios.

a-agmon added 4 commits June 8, 2025 15:21
- Introduced a new `GlobFunc` to allow users to query files using glob patterns, enhancing the flexibility of file access in SQL queries.
- Updated `Cargo.toml` to include the `glob` dependency.
- Registered the `glob` function in the main execution context.
- Enhanced `functions.rs` with necessary imports and implementation details for glob pattern handling.
…function handling

- Renamed the `expr_to_literal` function to `as_utf8_literal` for clarity.
- Enhanced error messages for better user guidance.
- Streamlined URL handling in the `GlobFunc` implementation, improving the parsing logic for glob patterns.
- Simplified file format determination and schema inference processes.
- Removed debug print statement from the GlobFunc implementation.
- Added SQL integration tests for the glob function, covering various scenarios including CSV and JSON file queries, data aggregation, and glob pattern matching.
- Created snapshot files to validate the output of the glob function in different test cases.
@a-agmon a-agmon changed the title Add support for glob string in datafusion-cli Add support for glob string in datafusion-cli query Jun 8, 2025
SELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv');

-- Test 5: Glob pattern with wildcard - test actual glob functionality
SELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv');
Contributor


Should we introduce a new function? Can we reuse the current model?

what should be the behavior if there are mixed CSV/JSON/Parquet files in the folder?

Author

@a-agmon a-agmon Jun 8, 2025


We can use the current model, but I think it will require touching some core modules within datafusion.

Re the second question, I think mixed file types should not be supported.

Author

@a-agmon a-agmon Jun 8, 2025


By the way, the current implementation already supports the following, but only for local files (as it uses ::parse()):

CREATE EXTERNAL TABLE logs 
STORED AS CSV 
LOCATION '/data/*_small.csv';

Contributor


Another possibility would be to intercept the CREATE EXTERNAL TABLE command in datafusion-cli itself

For example, similarly to how it peeks here:

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &plan {
    // To support custom formats, treat error as None
    let format = config_file_type_from_str(&cmd.file_type);
    register_object_store_and_config_extensions(
        ctx,
        &cmd.location,
        &cmd.options,
        format,
    )
    .await?;
}

We could implement a special handler in datafusion-cli rather than use the default one in SessionContext:

DdlStatement::CreateExternalTable(cmd) => {
    (Box::pin(async move { self.create_external_table(&cmd).await })
        as std::pin::Pin<Box<dyn futures::Future<Output = _> + Send>>)
        .await
}

a-agmon and others added 6 commits June 8, 2025 22:18
- Updated error handling in the GlobFunc to use `plan_datafusion_err` for improved clarity and consistency in error messages.
- Streamlined URL parsing and glob pattern handling for better readability and maintainability.
…e .last() with .next_back() for better performance on DoubleEndedIterator - Remove unused DataFusionError import
Contributor

@alamb alamb left a comment


First of all, thank you so much @a-agmon -- this looks very cool.

In general I think in DataFusion we try to follow some existing implementation rather than innovate new syntax or dialect when possible.

One question I had was whether you considered adding a read_parquet style function as proposed in #16303 rather than a glob function.

I also think following another implementation like read_parquet might lead to a better user experience as:

  1. Some users will already know how it works
  2. Some other system has already designed out the kinks (e.g how to read multiple specific files)


fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
    match expr {
        Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
Contributor


Minor: maybe this could use the try_as_str function (which would also handle other literal types): https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
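To show the shape of that refactor, here is an illustrative sketch. The `ScalarValue` enum below is a simplified stand-in defined for this example (the real DataFusion type has many more variants); only the `Option<Option<&str>>` return shape mirrors the real `ScalarValue::try_as_str`:

```rust
// Simplified stand-in for DataFusion's ScalarValue, defined here only to
// illustrate the suggested refactor.
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Int64(Option<i64>),
}

impl ScalarValue {
    // Mirrors the shape of the real ScalarValue::try_as_str:
    // outer None = not a string type, inner None = NULL string.
    fn try_as_str(&self) -> Option<Option<&str>> {
        match self {
            ScalarValue::Utf8(s) | ScalarValue::LargeUtf8(s) => Some(s.as_deref()),
            _ => None,
        }
    }
}

// A try_as_str-based accessor accepts any string-typed variant instead of
// matching Utf8 alone, and distinguishes NULL from a wrong type.
fn as_utf8_literal<'a>(v: &'a ScalarValue, arg_name: &str) -> Result<&'a str, String> {
    match v.try_as_str() {
        Some(Some(s)) => Ok(s),
        Some(None) => Err(format!("argument '{arg_name}' must not be NULL")),
        None => Err(format!("argument '{arg_name}' must be a string literal")),
    }
}

fn main() {
    let v = ScalarValue::LargeUtf8(Some("s3://tests/*.csv".to_string()));
    assert_eq!(as_utf8_literal(&v, "pattern"), Ok("s3://tests/*.csv"));
    assert!(as_utf8_literal(&ScalarValue::Int64(Some(1)), "pattern").is_err());
    assert!(as_utf8_literal(&ScalarValue::Utf8(None), "pattern").is_err());
}
```

The benefit over matching `Utf8` directly is that `LargeUtf8` (and, in the real type, `Utf8View`) literals are accepted for free.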

SELECT COUNT(*) AS cars_count FROM glob('../datafusion/core/tests/data/cars.csv');

-- Test 2: Data aggregation from CSV file - verify actual data reading
SELECT car, COUNT(*) as count FROM glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
Contributor


I think another use case that @robtandy had was "a list of multiple files" -- is there some way to select exactly two files? Something like

glob(['../datafusion/core/tests/data/cars.csv', '../datafusion/core/tests/data/trucks.csv'])

Perhaps 🤔

@a-agmon
Author

a-agmon commented Jun 11, 2025

@alamb - thank you very much for the generous comments. I appreciate it.
Re naming: I completely agree. I was just wondering whether it's better to introduce one function that infers the file type (like read() or glob()) rather than a function for each file type (read_parquet, read_csv, etc.). You are correct that the latter is more common, so I will go with it.
Re the other comments: will review and handle. Thanks.

@alamb
Contributor

alamb commented Jun 13, 2025

> rather than a function for each file type (read_parquet, read_csv, etc). You are correct that the latter is more common so will for this.

I think the reason that DuckDB et al. use a function for each file type is that it simplifies option handling (there are many options that apply to Parquet but not to CSV).

That being said, adding a function like read_file(..) or read_data(...) that handles all file types might be a reasonable thing to do in datafusion-cli, as then you could probably reuse most of the ListingTable code.
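A single read_file(...)-style function would need to infer the format from the path before delegating to the listing machinery. The sketch below shows one plausible way to do that; `FileFormat` and `infer_format` are hypothetical names for this example, not DataFusion APIs:

```rust
// Hypothetical sketch: inferring a file format from a path or glob
// pattern's extension, as a read_file(...)-style table function might do
// before handing off to ListingTable. Not DataFusion's actual API.
#[derive(Debug, PartialEq)]
enum FileFormat {
    Csv,
    Json,
    Parquet,
}

fn infer_format(path: &str) -> Result<FileFormat, String> {
    // Take the final extension; a glob like "data/*.csv" still ends in ".csv".
    match path.rsplit('.').next() {
        Some("csv") => Ok(FileFormat::Csv),
        Some("json") | Some("ndjson") => Ok(FileFormat::Json),
        Some("parquet") => Ok(FileFormat::Parquet),
        _ => Err(format!("cannot infer file format from path '{path}'")),
    }
}

fn main() {
    assert_eq!(infer_format("s3://tests/data/file-a*.csv"), Ok(FileFormat::Csv));
    assert_eq!(infer_format("/data/logs/*.parquet"), Ok(FileFormat::Parquet));
    // Unknown extensions would need an explicit format option from the user.
    assert!(infer_format("/data/unknown.bin").is_err());
}
```

The per-format functions (read_parquet, read_csv) avoid the ambiguous-extension case entirely, which matches the option-handling point above.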

Successfully merging this pull request may close these issues.

Support reading multiple parquet files via datafusion-cli
3 participants