Error on bad lines pyarrow #45029

Merged
4 commits merged on Dec 29, 2021

phofl
Member

@phofl phofl commented Dec 23, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them

cc @lithomas1 Not sure if that was what the TODO was referring to

@phofl added the "Error Reporting" (incorrect or improved errors from pandas) and "IO CSV" (read_csv, to_csv) labels on Dec 23, 2021
@jbrockmendel
Member

> Not sure if that was what the TODO was referring to

I don't know the specific context of that comment, but many of the TODO(1.4) comments have to do with the pyarrow_skip mark being changed to pyarrow_xfail
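
The practical difference between a skip mark and an xfail mark can be sketched with the standard library. pandas itself uses pytest marks (`pyarrow_skip`, `pyarrow_xfail`); the example below substitutes `unittest.skip` and `unittest.expectedFailure` as stand-ins, and `read_with_option` is a hypothetical helper rather than pandas code. A skipped test never runs, so it cannot notice when the feature starts working, whereas an expected failure runs every time.

```python
import io
import unittest


def read_with_option(on_bad_lines=None):
    # Hypothetical stand-in for a parser that rejects an unsupported option.
    if on_bad_lines is not None:
        raise ValueError("option not supported")
    return "parsed"


class SkipVersusXfail(unittest.TestCase):
    @unittest.skip("never executed, so a later fix would go unnoticed")
    def test_skipped(self):
        read_with_option(on_bad_lines="warn")

    @unittest.expectedFailure
    def test_expected_failure(self):
        # Runs and fails today; it would be reported as an unexpected
        # success once the option starts working.
        self.assertEqual(read_with_option(on_bad_lines="warn"), "parsed")


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SkipVersusXfail)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite)


result = run_suite()
print(len(result.skipped), len(result.expectedFailures))
```

Both tests are reported without failing the run, but only the expected-failure test actually exercises the code path.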

@@ -135,6 +135,9 @@ def test_pyarrow_engine(self):
1,2,3,4,"""

        for default in pa_unsupported:
            if default == "on_bad_lines":
Member

Is there a reason that on_bad_lines is skipped here and tested separately?

Member Author

Good point, I missed that. The test was failing before, but I forgot to fix it. I have adjusted it accordingly now.

@lithomas1
Member

> cc @lithomas1 Not sure if that was what the TODO was referring to

The TODO was to uncomment that line from the set which I see you've done here.
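
A minimal sketch of the pattern under discussion, assuming a set of option names that the pyarrow engine rejects and a test that loops over that set. The names `PA_UNSUPPORTED`, `validate_options`, and `rejected_options`, the set's contents, and the error message wording are illustrative assumptions, not the pandas implementation.

```python
# Hypothetical set of options this sketch treats as unsupported by the
# "pyarrow" engine; "on_bad_lines" is the entry the TODO referred to.
PA_UNSUPPORTED = {"on_bad_lines", "skipfooter", "comment"}


def validate_options(engine, **kwds):
    """Raise ValueError if an option is not supported by the engine."""
    if engine == "pyarrow":
        for name in sorted(kwds):
            if name in PA_UNSUPPORTED:
                raise ValueError(
                    f"The {name!r} option is not supported with the 'pyarrow' engine"
                )
    return kwds


def rejected_options(engine, candidates):
    """Return the candidate option names that the engine rejects."""
    rejected = []
    for name in sorted(candidates):
        try:
            validate_options(engine, **{name: "x"})
        except ValueError:
            rejected.append(name)
    return rejected


print(rejected_options("pyarrow", PA_UNSUPPORTED))
```

With the option listed in the set, a single loop over the set covers it, which is why a separate test for the same option becomes redundant.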

@jreback jreback added this to the 1.4 milestone Dec 27, 2021
"kwds",
[{"on_bad_lines": "warn"}, {"error_bad_lines": True}, {"warn_bad_lines": True}],
)
def test_pyarrow_bad_lines_fails(self, pyarrow_parser_only, kwds):
Member

Can this test be removed now? It seems to duplicate the test_pyarrow_engine test above now.

Member Author

good point, thx

@jreback jreback merged commit d3f665d into pandas-dev:master Dec 29, 2021
@jreback
Contributor

jreback commented Dec 29, 2021

thanks @phofl and @lithomas1

@phofl phofl deleted the error_on_bad_lines_pyarrow branch December 29, 2021 15:20
@SysuJayce

Can we use pandas' read_csv with the pyarrow engine and skip bad lines that trigger errors like "ArrowInvalid: CSV parse error: Expected 2 columns"?

read_csv with the default engine is quite slow, while many of the original arguments are not supported by the pyarrow engine, which makes it inconvenient to process large data files.
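
One possible workaround, sketched here with only the standard library, is to drop rows with the wrong field count before handing the data to the fast parser. `drop_bad_lines` is a hypothetical helper, not part of pandas or pyarrow, and it assumes the header row defines the expected width unless a count is given.

```python
import csv
import io


def drop_bad_lines(text, expected_columns=None):
    """Return CSV text keeping only rows with the expected field count.

    If expected_columns is None, the width of the first non-empty row
    (normally the header) is used.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        if expected_columns is None:
            expected_columns = len(row)
        if len(row) == expected_columns:
            writer.writerow(row)
    return out.getvalue()


raw = "a,b\n1,2\n3,4,5\n6,7\n"
clean = drop_bad_lines(raw)
print(clean)
```

The cleaned text could then be written to a file or buffer and read with `engine="pyarrow"`. The filtering pass runs in pure Python, so it offsets part of the speed gain on large files. Later pandas and pyarrow releases may handle invalid rows natively, so it is worth checking the documentation for the installed versions first.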
