Add support for glob string in datafusion-cli query #16332
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
--- | ||
source: datafusion-cli/tests/cli_integration.rs | ||
assertion_line: 173 | ||
info: | ||
program: datafusion-cli | ||
args: [] | ||
stdin: "-- Test glob function with files available in CI\n-- Test 1: Single CSV file - verify basic functionality\nSELECT COUNT(*) AS cars_count FROM glob('../datafusion/core/tests/data/cars.csv');\n\n-- Test 2: Data aggregation from CSV file - verify actual data reading\nSELECT car, COUNT(*) as count FROM glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;\n\n-- Test 3: JSON file with explicit format parameter - verify format specification\nSELECT COUNT(*) AS json_count FROM glob('../datafusion/core/tests/data/1.json', 'json');\n\n-- Test 4: Single specific CSV file - verify another CSV works\nSELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv');\n\n-- Test 5: Glob pattern with wildcard - test actual glob functionality\nSELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv'); " | ||
input_file: datafusion-cli/tests/sql/integration/glob_test.sql | ||
--- | ||
success: true | ||
exit_code: 0 | ||
----- stdout ----- | ||
[CLI_VERSION] | ||
+------------+ | ||
| cars_count | | ||
+------------+ | ||
| 25 | | ||
+------------+ | ||
1 row(s) fetched. | ||
[ELAPSED] | ||
|
||
+-------+-------+ | ||
| car | count | | ||
+-------+-------+ | ||
| green | 12 | | ||
| red | 13 | | ||
+-------+-------+ | ||
2 row(s) fetched. | ||
[ELAPSED] | ||
|
||
+------------+ | ||
| json_count | | ||
+------------+ | ||
| 4 | | ||
+------------+ | ||
1 row(s) fetched. | ||
[ELAPSED] | ||
|
||
+---------------+ | ||
| example_count | | ||
+---------------+ | ||
| 1 | | ||
+---------------+ | ||
1 row(s) fetched. | ||
[ELAPSED] | ||
|
||
+--------------------+ | ||
| glob_pattern_count | | ||
+--------------------+ | ||
| 4 | | ||
+--------------------+ | ||
1 row(s) fetched. | ||
[ELAPSED] | ||
|
||
\q | ||
|
||
----- stderr ----- |
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,15 @@ | ||||||||||||||||||||||||||||||||
-- Test glob function with files available in CI | ||||||||||||||||||||||||||||||||
-- Test 1: Single CSV file - verify basic functionality | ||||||||||||||||||||||||||||||||
SELECT COUNT(*) AS cars_count FROM glob('../datafusion/core/tests/data/cars.csv'); | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
-- Test 2: Data aggregation from CSV file - verify actual data reading | ||||||||||||||||||||||||||||||||
SELECT car, COUNT(*) as count FROM glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car; | ||||||||||||||||||||||||||||||||
I think another use case that @robtandy had was "a list of multiple files". Is there some way to select exactly two files? Something like glob(['../datafusion/core/tests/data/cars.csv', '../datafusion/core/tests/data/trucks.csv']) perhaps 🤔 |
||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
-- Test 3: JSON file with explicit format parameter - verify format specification | ||||||||||||||||||||||||||||||||
SELECT COUNT(*) AS json_count FROM glob('../datafusion/core/tests/data/1.json', 'json'); | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
-- Test 4: Single specific CSV file - verify another CSV works | ||||||||||||||||||||||||||||||||
SELECT COUNT(*) AS example_count FROM glob('../datafusion/core/tests/data/example.csv'); | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
-- Test 5: Glob pattern with wildcard - test actual glob functionality | ||||||||||||||||||||||||||||||||
SELECT COUNT(*) AS glob_pattern_count FROM glob('../datafusion/core/tests/data/exa*.csv'); | ||||||||||||||||||||||||||||||||
Should we introduce a new function? Can we reuse the current model? What should the behavior be if there are mixed CSV/JSON/Parquet files in the folder?

We can use the current model, but I think it will require touching some core modules within datafusion. Re the second question, I don't think multiple file types should be supported.

By the way, the current implementation supports the following - but just for local files (as it uses

Another possibility would be to intercept the For example, similarly to how it peeks here: datafusion/datafusion-cli/src/exec.rs Lines 357 to 367 in 1d61f31
We could implement a special handler in datafusion-cli rather than use the default one in SessionContext: datafusion/datafusion/core/src/execution/context/mod.rs Lines 669 to 672 in 1d61f31
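To make the "no mixed file types" position in this thread concrete, here is a minimal sketch of how a handler could reject a match set that mixes formats. It is not code from this PR: `infer_single_format` is a hypothetical helper, it infers the format purely from the file extension, and it uses only the Rust standard library.

```rust
use std::collections::BTreeSet;
use std::path::Path;

/// Hypothetical helper (not part of datafusion-cli): infer one file format
/// from the paths a pattern or explicit list resolved to.
/// Assumption: the format is taken from the file extension, case-insensitively.
fn infer_single_format(paths: &[&str]) -> Result<String, String> {
    let mut formats = BTreeSet::new();
    for p in paths {
        match Path::new(p).extension().and_then(|e| e.to_str()) {
            Some(ext) => {
                formats.insert(ext.to_ascii_lowercase());
            }
            None => return Err(format!("cannot infer format of '{p}': no file extension")),
        }
    }

    // Exactly one distinct extension is accepted; zero or several are errors.
    let mut iter = formats.into_iter();
    match (iter.next(), iter.next()) {
        (Some(only), None) => Ok(only),
        (None, _) => Err("no files matched".to_string()),
        (Some(a), Some(b)) => Err(format!(
            "mixed file formats are not supported (found at least '{a}' and '{b}')"
        )),
    }
}
```

With this sketch, `infer_single_format(&["data/cars.csv", "data/example.csv"])` yields `Ok("csv")`, while a list containing both `cars.csv` and `1.json` is rejected. An explicit format argument, as in Test 3 of the SQL file, would simply bypass the inference.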
|
Minor: Maybe this could use the `try_as_str` function (which would also handle other literal types): https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
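A rough illustration of why a `try_as_str`-style accessor helps when validating a string argument. The `Literal` enum below is a simplified stand-in written for this sketch, not DataFusion's `ScalarValue`, and `pattern_arg` is a hypothetical helper. The `Option<Option<&str>>` return shape (outer `None` for non-string types, inner `None` for a NULL string) is my understanding of the linked method; please check the docs before relying on it.

```rust
/// Simplified stand-in for a scalar literal. This is NOT DataFusion's
/// `ScalarValue`; it only models the variants needed for the sketch.
#[derive(Debug)]
enum Literal {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

impl Literal {
    /// Outer `None`: the literal is not a string type.
    /// Inner `None`: the literal is a string type but its value is NULL.
    fn try_as_str(&self) -> Option<Option<&str>> {
        match self {
            Literal::Utf8(v) | Literal::LargeUtf8(v) | Literal::Utf8View(v) => {
                Some(v.as_deref())
            }
            _ => None,
        }
    }
}

/// Hypothetical argument check: one accessor call covers every string
/// variant, so there is no need to match on each variant separately.
fn pattern_arg(lit: &Literal) -> Result<&str, String> {
    match lit.try_as_str() {
        Some(Some(s)) => Ok(s),
        Some(None) => Err("pattern must not be NULL".to_string()),
        None => Err(format!("pattern must be a string literal, got {lit:?}")),
    }
}
```

The point of the suggestion is visible in `pattern_arg`: matching on a single string variant would reject the other string encodings, whereas the accessor treats them uniformly.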