gracefully handle malformed .csv files #471

Open
@gdementen

When the first line of a .csv file contains fewer commas than the data rows have columns (see attached test_case.csv), pandas behaves very strangely: it silently builds a MultiIndex from the extra leading columns, and this apparently cannot be turned off short of ignoring the line via the header or skiprows argument. The exact same file with two more commas on the first line works fine (see below).

Pandas is the culprit here, so we should either report the bug there or work around it here.

>>> df = pd.read_csv('test_case.csv')
>>> df
           year earnings
NaN    all  men    women
2012.0 0      1        2
2013.0 3      4        5

>>> df.columns
Index(['year', 'earnings'], dtype='object')
>>> df.index
MultiIndex(levels=[[2012.0, 2013.0], ['0', '3', 'all']],
           labels=[[-1, 0, 1], [2, 0, 1]])

>>> # unsurprisingly... the same broken result
>>> pd.read_csv('test_case.csv', index_col=None)
           year earnings
NaN    all  men    women
2012.0 0      1        2
2013.0 3      4        5

>>> pd.read_csv('test_case.csv', index_col=0)
IndexError: list index out of range
>>> pd.read_csv('test_case.csv', index_col=[0])
IndexError: list index out of range
>>> pd.read_csv('test_case.csv', header=[0, 1], index_col=0)
IndexError: list index out of range
>>> pd.read_csv('test_case.csv', header=1)
   Unnamed: 0  all  men  women
0        2012    0    1      2
1        2013    3    4      5
>>> pd.read_csv('test_case.csv', header=1, index_col=0)
      all  men  women
2012    0    1      2
2013    3    4      5

When the first line has the correct number of commas (for example, after opening the file in Excel and saving it again), the result is what I expected:

>>> pd.read_csv('test_case.csv2')
     year earnings Unnamed: 2 Unnamed: 3
0     NaN      all        men      women
1  2012.0        0          1          2
2  2013.0        3          4          5
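
One possible workaround on our side, sketched under assumptions: pad every short row with empty fields up to the width of the widest row before handing the text to pandas. The helper name `pad_csv_text` is hypothetical, and the sample text is only a guess at the contents of test_case.csv, reconstructed from the output above.

```python
import csv
import io


def pad_csv_text(text):
    """Return CSV text in which every non-empty row has as many fields
    as the widest row (short rows are padded with empty fields).

    Hypothetical helper, not part of pandas or of this project.
    """
    # newline='' lets the csv module handle line endings itself
    rows = list(csv.reader(io.StringIO(text, newline='')))
    width = max((len(row) for row in rows), default=0)

    out = io.StringIO(newline='')
    writer = csv.writer(out, lineterminator='\n')
    for row in rows:
        # Leave blank lines untouched so they are not turned into rows
        # made up entirely of empty fields.
        if row:
            row = row + [''] * (width - len(row))
        writer.writerow(row)
    return out.getvalue()


# Assumed contents of test_case.csv (a reconstruction, not the real file)
sample = "year,earnings\n,all,men,women\n2012,0,1,2\n2013,3,4,5\n"
padded = pad_csv_text(sample)
```

The padded text could then be passed to `pd.read_csv(io.StringIO(padded))`, which should produce the same frame as the test_case.csv2 result above. Rewriting the whole file in memory is the simplest option, but it may be too costly for very large files.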

test_cases.zip
