Read excel nrows #16672


Closed · wants to merge 5 commits
Changes from 2 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -38,6 +38,7 @@ Other Enhancements
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`)
- :func:`DataFrame.clip()` and :func:`Series.clip()` have gained an ``inplace`` argument. (:issue:`15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`)
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`)
Member:

How about:

pd.read_excel() has gained the nrows parameter (...)
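For reference, a minimal usage sketch of the parameter being added (the file name below is just a placeholder):

import pandas as pd

# read only the first 5 data rows of the first sheet
df = pd.read_excel('report.xlsx', nrows=5)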

Contributor:

you can make another entry (with this PR number) in the api_breaking section noting that the kwargs are re-arranged to match pd.read_csv

also, use :func:`read_excel` (rather than ``pd.read_excel()``) in the note
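To see why the re-arrangement deserves an api_breaking entry: anyone passing these arguments positionally gets a silent change in meaning. A sketch based on the signatures in this diff (file name is a placeholder):

import pandas as pd

# before this PR the 5th positional argument was skip_footer
pd.read_excel('data.xlsx', 0, 0, None, 10)   # old meaning: skip_footer=10
# after this PR the 5th positional argument is nrows
pd.read_excel('data.xlsx', 0, 0, None, 10)   # new meaning: nrows=10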


.. _whatsnew_0210.api_breaking:

56 changes: 31 additions & 25 deletions pandas/io/excel.py
@@ -82,6 +82,8 @@
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
nrows : int, default None
Number of rows to parse
Contributor:

versionadded 0.21.0 tag
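Presumably something like the standard directive under the new parameter description, e.g.:

nrows : int, default None
    Number of rows to parse

    .. versionadded:: 0.21.0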

index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame.
Pass None if there is no such column. If a list is passed,
@@ -191,12 +193,12 @@ def get_writer(engine_name):


@Appender(_read_excel_doc)
def read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0,
index_col=None, names=None, parse_cols=None, parse_dates=False,
date_parser=None, na_values=None, thousands=None,
convert_float=True, converters=None, dtype=None,
true_values=None, false_values=None, engine=None,
squeeze=False, **kwds):
def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None,
skip_footer=0, index_col=None, names=None, parse_cols=None,
Contributor:

we normally don't like to shuffle parameters around in kwargs. Please align the ordering of these params as much as possible with how read_csv does it (obviously only include the current parameters).

Contributor Author:

Sure, not a problem. I was using read_csv as a guide and saw that nrows was directly after skiprows, hence the change.

I will go ahead and line up the read_excel kwargs as closely as possible with the read_csv kwargs. I see 3 or 4 out of order with just a quick glance.

Contributor:

ok great. prob need a slightly expanded whatsnew note to tell about this

Contributor Author:

I rearranged the kwargs and also updated docstrings to match parameter order (one of the things that always bugs me). Should I also update all internal function kwargs in excel.py to match the external API?

I'm adding the following note to the Other Enhancements section (just wanted to make sure it was the right spot...):
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()`

Thanks!

Contributor:

yes I would have the internal API match the external

parse_dates=False, date_parser=None, na_values=None,
thousands=None, convert_float=True, converters=None,
dtype=None, true_values=None, false_values=None,
engine=None, squeeze=False, **kwds):

# Can't use _deprecate_kwarg since sheetname=None has a special meaning
if is_integer(sheet_name) and sheet_name == 0 and 'sheetname' in kwds:
@@ -211,12 +213,13 @@ def read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0,
io = ExcelFile(io, engine=engine)

return io._parse_excel(
sheetname=sheet_name, header=header, skiprows=skiprows, names=names,
index_col=index_col, parse_cols=parse_cols, parse_dates=parse_dates,
date_parser=date_parser, na_values=na_values, thousands=thousands,
convert_float=convert_float, skip_footer=skip_footer,
converters=converters, dtype=dtype, true_values=true_values,
false_values=false_values, squeeze=squeeze, **kwds)
sheetname=sheet_name, header=header, skiprows=skiprows, nrows=nrows,
names=names, index_col=index_col, parse_cols=parse_cols,
parse_dates=parse_dates, date_parser=date_parser, na_values=na_values,
thousands=thousands, convert_float=convert_float,
skip_footer=skip_footer, converters=converters, dtype=dtype,
true_values=true_values, false_values=false_values,
squeeze=squeeze, **kwds)


class ExcelFile(object):
@@ -275,11 +278,11 @@ def __init__(self, io, **kwds):
def __fspath__(self):
return self._io

def parse(self, sheet_name=0, header=0, skiprows=None, skip_footer=0,
names=None, index_col=None, parse_cols=None, parse_dates=False,
date_parser=None, na_values=None, thousands=None,
convert_float=True, converters=None, true_values=None,
false_values=None, squeeze=False, **kwds):
def parse(self, sheet_name=0, header=0, skiprows=None, nrows=None,
skip_footer=0, names=None, index_col=None, parse_cols=None,
parse_dates=False, date_parser=None, na_values=None,
thousands=None, convert_float=True, converters=None,
true_values=None, false_values=None, squeeze=False, **kwds):
"""
Parse specified sheet(s) into a DataFrame

@@ -288,7 +291,9 @@ def parse(self, sheet_name=0, header=0, skiprows=None, skip_footer=0,
"""

return self._parse_excel(sheetname=sheet_name, header=header,
skiprows=skiprows, names=names,
skiprows=skiprows,
nrows=nrows,
names=names,
index_col=index_col,
parse_cols=parse_cols,
parse_dates=parse_dates,
@@ -335,12 +340,12 @@ def _excel2num(x):
else:
return i in parse_cols

def _parse_excel(self, sheetname=0, header=0, skiprows=None, names=None,
skip_footer=0, index_col=None, parse_cols=None,
parse_dates=False, date_parser=None, na_values=None,
thousands=None, convert_float=True, true_values=None,
false_values=None, verbose=False, dtype=None,
squeeze=False, **kwds):
def _parse_excel(self, sheetname=0, header=0, skiprows=None, nrows=None,
names=None, skip_footer=0, index_col=None,
parse_cols=None, parse_dates=False, date_parser=None,
na_values=None, thousands=None, convert_float=True,
true_values=None, false_values=None, verbose=False,
dtype=None, squeeze=False, **kwds):

skipfooter = kwds.pop('skipfooter', None)
if skipfooter is not None:
@@ -511,12 +516,13 @@ def _parse_cell(cell_contents, cell_typ):
true_values=true_values,
false_values=false_values,
skiprows=skiprows,
nrows=nrows,
skipfooter=skip_footer,
squeeze=squeeze,
dtype=dtype,
**kwds)

output[asheetname] = parser.read()
output[asheetname] = parser.read(nrows=nrows)
if names is not None:
output[asheetname].columns = names
if not squeeze or isinstance(output[asheetname], DataFrame):
4 changes: 4 additions & 0 deletions pandas/io/parsers.py
@@ -999,6 +999,8 @@ def _failover_to_python(self):

def read(self, nrows=None):
if nrows is not None:
nrows = _validate_integer('nrows', nrows)
gfyoung (Member), Jun 11, 2017:

Good catch (see my comment here). However, instead of littering our code with duplicate checks, here's what I think is best:

In your modified parsers.py, there should now be two places where nrows = _validate_integer(...) is called. Delete both of those.

Locate the read method for the TextFileReader in that same file, and add this validation check right before the if nrows is not None. That should work.
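A rough sketch of that placement (assuming _validate_integer passes a None value through unchanged, so the call is safe before the None check):

# sketch only: inside TextFileReader in pandas/io/parsers.py
def read(self, nrows=None):
    # validate once here, instead of repeating the check in each engine
    nrows = _validate_integer('nrows', nrows)
    if nrows is not None:
        if self.options.get('skipfooter'):
            raise ValueError('skipfooter not supported for iteration')
        ...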

Contributor Author:

I originally didn't have this line there, but then I was getting:
TypeError: '>=' not supported between instances of 'int' and 'str'
rather than the intended:
ValueError: 'nrows' must be an integer >=0

Contributor:

yes this change should be done in the parser itself. See if you can come up with an example that ONLY uses pd.read_csv directly.
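For instance, a read_csv-only repro could be as small as this (a sketch; the exact wording of the error is whatever _validate_integer raises):

import pandas as pd
from io import StringIO

data = StringIO('a,b\n1,2\n3,4\n')
# a string nrows should be rejected by the parser's own validation
pd.read_csv(data, nrows='5')   # expected to raise ValueError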

Contributor Author:

Where should I put this test? There are a bunch in `tests/io/parser`, but nothing for read_csv directly.


if self.options.get('skipfooter'):
raise ValueError('skipfooter not supported for iteration')

@@ -1893,6 +1895,8 @@ def TextParser(*args, **kwds):
date_parser : function, default None
skiprows : list of integers
Row numbers to skip
nrows : int, default None
Number of rows to parse
skipfooter : int
Number of lines at bottom of file to skip
converters : dict, default None
14 changes: 14 additions & 0 deletions pandas/tests/io/test_excel.py
@@ -1000,6 +1000,20 @@ def test_read_excel_skiprows_list(self):
'skiprows_list', skiprows=np.array([0, 2]))
tm.assert_frame_equal(actual, expected)

def test_read_excel_nrows(self):
# GH 16645
num_rows_to_pull = 5
actual = pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext),
nrows=num_rows_to_pull)
expected = pd.read_excel(os.path.join(self.dirpath,
'test1' + self.ext))
expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)
Contributor:

try to pull more rows than exist in the file as well.


with pytest.raises(ValueError):
Contributor:

@gfyoung don't we have an issue about nrows validation in the parser?

gfyoung (Member), Jun 11, 2017:

We do, but I think you can circumvent the check via calls other than read_csv and read_table (the check occurs in _read, which does not get hit by read_excel), which is annoying.

Member:

@alysivji: Put this as a separate test, and use tm.assert_raises_regex to check the specific error message raised in the ValueError (we want it to be that validation failed).

pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext),
nrows='5')
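A sketch of the two suggested additions (the test names, the extra row count, and the exact error message are assumptions, not part of this PR):

def test_read_excel_nrows_greater_than_nrows_in_file(self):
    # GH 16645: asking for more rows than the sheet has should return everything
    expected = pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext))
    actual = pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext),
                           nrows=len(expected) + 10)
    tm.assert_frame_equal(actual, expected)

def test_read_excel_nrows_non_integer_parameter(self):
    # GH 16645: a non-integer nrows should fail validation with a clear message
    msg = "'nrows' must be an integer"
    with tm.assert_raises_regex(ValueError, msg):
        pd.read_excel(os.path.join(self.dirpath, 'test1' + self.ext),
                      nrows='5')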

def test_read_excel_squeeze(self):
# GH 12157
f = os.path.join(self.dirpath, 'test_squeeze' + self.ext)