ENH: HDFStore.flush() to optionally perform fsync (GH5364) #5369

Merged
merged 2 commits into from
Oct 29, 2013
6 changes: 6 additions & 0 deletions doc/source/io.rst
@@ -2745,6 +2745,12 @@ Notes & Caveats
need to serialize these operations in a single thread in a single
process. You will corrupt your data otherwise. See the issue
(:issue:`2397`) for more information.
- If serializing all write operations via a single thread in a single
process is not an option, another alternative is to use an external
distributed lock manager to ensure there is only a single writer at a
time and all readers close the file during writes and re-open it after any
writes. In this case you should use ``store.flush(fsync=True)`` prior to
releasing any write locks. See the issue (:issue:`5364`) for more information.
Contributor

can you shorten this to:

If you use locks to manage write access between multiple processes, you may want to use :py:func:`~os.fsync` before releasing write locks. For convenience you can use ``store.flush(fsync=True)`` to do this for you.

- ``PyTables`` only supports fixed-width string columns in
``tables``. The sizes of a string based indexing column
(e.g. *columns* or *minor_axis*) are determined as the maximum size
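A minimal sketch of the write-side pattern that the io.rst note describes, using only the standard library. A plain file stands in for the HDF5 store and a `threading.Lock` stands in for the external lock manager; both stand-ins and the name `locked_durable_write` are assumptions for illustration, not part of pandas. With a real `HDFStore`, the flush-and-sync step would be the `store.flush(fsync=True)` call added in this PR.

```python
import os
import tempfile
import threading

# Hypothetical stand-in for an external (distributed) lock manager.
write_lock = threading.Lock()


def locked_durable_write(path, payload, lock=write_lock):
    """Append payload to path while holding lock, syncing before release."""
    with lock:
        with open(path, "ab") as handle:
            handle.write(payload)
            # Push the userspace buffer to the operating system ...
            handle.flush()
            # ... then ask the OS to commit the file to disk while the
            # write lock is still held, so readers re-open a synced file.
            os.fsync(handle.fileno())
    # The lock is released only after the sync has returned.


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "store.bin")
    locked_durable_write(target, b"first")
    locked_durable_write(target, b"second")
    with open(target, "rb") as reader:
        print(reader.read())
```

The ordering is the point of the sketch: the sync happens inside the locked region, so no reader can acquire the lock and re-open the file before the operating system has reported the write as complete.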
2 changes: 2 additions & 0 deletions doc/source/release.rst
@@ -275,6 +275,8 @@ API Changes
- store `datetime.date` objects as ordinals rather than timetuples to avoid
timezone issues (:issue:`2852`), thanks @tavistmorph and @numpand
- ``numexpr`` 2.2.2 fixes incompatibility in PyTables 2.4 (:issue:`4908`)
- ``flush`` now accepts an ``fsync`` parameter, which defaults to ``False``
(:issue:`5364`)
- ``JSON``

- added ``date_unit`` parameter to specify resolution of timestamps.
24 changes: 21 additions & 3 deletions pandas/io/pytables.py
@@ -10,6 +10,7 @@
 import copy
 import itertools
 import warnings
+import os

 import numpy as np
 from pandas import (Series, TimeSeries, DataFrame, Panel, Panel4D, Index,
@@ -525,12 +526,30 @@ def is_open(self):
             return False
         return bool(self._handle.isopen)

-    def flush(self):
+    def flush(self, fsync=False):
         """
-        Force all buffered modifications to be written to disk
+        Force all buffered modifications to be written to disk.
+
+        By default this method requests PyTables to flush, and PyTables in turn
+        requests the HDF5 library to flush any changes to the operating system.
+        There is no guarantee the operating system will actually commit writes
+        to disk.
+
+        To request the operating system to write the file to disk, pass
+        ``fsync=True``. The method will then block until the operating system
+        reports completion, although be aware there might be other caching
+        layers (eg disk controllers, disks themselves etc) which further delay
+        durability.
+
+        Parameters
+        ----------
+        fsync : boolean, invoke fsync for the file handle, default False
Contributor

I don't think you need all of the explanation above. Just put a short summary under fsync:

Parameters
----------
fsync : bool (default False)
    call ``os.fsync()`` on the file handle to force writing to disk.

Then at the end you could add:

Notes
-----
Without ``fsync=True``, flushing may not guarantee that the OS writes to disk. With fsync, the operation will block until the OS claims the file has been written; however, other caching layers may still interfere.

+
         """
         if self._handle is not None:
             self._handle.flush()
+            if fsync:
+                os.fsync(self._handle.fileno())

     def get(self, key):
         """
@@ -4072,5 +4091,4 @@ def timeit(key, df, fn=None, remove=True, **kwargs):
     store.close()

     if remove:
-        import os
         os.remove(fn)
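The docstring in this diff distinguishes two levels: flushing library buffers to the operating system, and asking the operating system to commit to disk. The following sketch illustrates the same two steps with a plain Python file object. Python's own write buffer plays the role of the PyTables/HDF5 buffers here, which is an analogy chosen for illustration and not how `HDFStore` is implemented.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.bin")

with open(path, "wb") as handle:
    handle.write(b"payload")
    # The bytes may still sit in the userspace buffer at this point, so
    # the size seen through the filesystem is typically smaller than the
    # amount written.
    size_buffered = os.path.getsize(path)

    # Step 1: hand the buffered bytes to the operating system.
    handle.flush()
    size_flushed = os.path.getsize(path)

    # Step 2: block until the OS reports the file as written to the
    # device. Caches below the OS (controllers, drives) may still apply.
    os.fsync(handle.fileno())

print(size_buffered, size_flushed)
```

After step 1 other processes can see the data through the filesystem, but only step 2 asks for durability, which is why the documentation change recommends `fsync=True` before releasing a write lock.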
6 changes: 6 additions & 0 deletions pandas/io/tests/test_pytables.py
@@ -466,6 +466,12 @@ def test_flush(self):
             store['a'] = tm.makeTimeSeries()
             store.flush()

+    def test_flush_fsync(self):
Contributor

just combine with the previous flush test. Not very different, especially since we're not mocking anything.


+        with ensure_clean(self.path) as store:
+            store['a'] = tm.makeTimeSeries()
+            store.flush(fsync=True)
+
     def test_get(self):

         with ensure_clean(self.path) as store:
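The review comment on `test_flush_fsync` points out that nothing is mocked, so the new test only checks that `flush(fsync=True)` does not raise. As a sketch of what a mocked check could look like, the example below uses `unittest.mock` from the modern standard library (an assumption; the PR itself does not use it) and a hypothetical `FileStore` stand-in instead of `HDFStore`, and verifies that `os.fsync` is called only when requested.

```python
import os
import tempfile
import unittest
from unittest import mock


class FileStore:
    """Hypothetical stand-in exposing the same flush(fsync=...) shape."""

    def __init__(self, path):
        self._handle = open(path, "wb")

    def flush(self, fsync=False):
        if self._handle is not None:
            self._handle.flush()
            if fsync:
                os.fsync(self._handle.fileno())

    def close(self):
        self._handle.close()
        self._handle = None


class TestFlush(unittest.TestCase):

    def setUp(self):
        self.path = os.path.join(tempfile.mkdtemp(), "store.bin")
        self.store = FileStore(self.path)

    def tearDown(self):
        if self.store._handle is not None:
            self.store.close()

    def test_flush_default_skips_fsync(self):
        # Replace os.fsync with a mock and confirm it is never reached.
        with mock.patch("os.fsync") as fake_fsync:
            self.store.flush()
        fake_fsync.assert_not_called()

    def test_flush_fsync_calls_os_fsync(self):
        # With fsync=True the mock should receive the file descriptor.
        with mock.patch("os.fsync") as fake_fsync:
            self.store.flush(fsync=True)
        fake_fsync.assert_called_once_with(self.store._handle.fileno())


if __name__ == "__main__":
    unittest.main()
```

Patching `os.fsync` keeps the test fast and independent of the filesystem's sync behaviour, while still pinning down the one branch that this PR adds.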