# Processing entire time period with Dask is slower than processing smaller time chunks in series. #8833
claytharrison asked this question in Q&A
## The process

I have many ordered satellite swath files (netCDF4) with just an observations (`obs`) dimension and a variable containing the ID of the grid point each observation is taken from. I am creating grid-point-wise aggregations of the files over chunks of time (e.g. a weekly mean). To do this, I open the files from the desired time range into a multifile dataset, then use flox's `xarray_reduce` to calculate the means for each grid point over the desired time chunks.

Single swath file footprint (truncated)

Multifile-dataset footprint (truncated)
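A minimal sketch of this aggregation step (the file paths, the `time` variable name, the set of grid-point IDs, and the date range are placeholder assumptions; the actual code is in the collapsed example further down):

```python
import glob

import numpy as np
import pandas as pd
import xarray as xr
from flox.xarray import xarray_reduce

# Hypothetical file list; by default each source file becomes one dask
# chunk along the "obs" dimension.
files = sorted(glob.glob("swaths/*.nc"))
ds = xr.open_mfdataset(files, combine="nested", concat_dim="obs", parallel=True)

# Label each observation with the start of its 7-day window ("time_chunks"
# mirrors the dimension name in the grouped footprint below).
ds["time_chunks"] = ds["time"].dt.floor("7D")

# flox needs the full set of expected groups up front when grouping dask
# arrays; both sets here are placeholder assumptions. The week labels are
# floored the same way as the per-observation labels so they line up.
location_ids = np.arange(244_000)
weeks = pd.date_range("2020-01-01", "2020-01-14").floor("7D").unique()

# One mean per (grid point, week) pair, computed in a single dask graph.
grouped = xarray_reduce(
    ds,
    "location_id",
    "time_chunks",
    func="mean",
    expected_groups=(location_ids, weeks),
)
```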
This results in a Dataset with a `time_chunks` dimension and a `location_id` dimension, holding the (e.g.) mean value of the desired variables for each location over each time chunk:

Grouped dataset footprint (two weeks)
I then use `groupby('time_chunks')` and `save_mfdataset` to save a file to disk for each time chunk.

Code example
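A minimal sketch of that save step (assuming the `grouped` Dataset from the sketch above; the output paths are made up):

```python
import xarray as xr

# Split the grouped dataset into one Dataset per time chunk.
datasets = [group for _, group in grouped.groupby("time_chunks")]

# Hypothetical naming scheme: one netCDF file per time chunk.
paths = [f"means/week_{i:03d}.nc" for i in range(len(datasets))]

# Write all files as a single dask computation.
xr.save_mfdataset(datasets, paths)
```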
## The problem

My problem is that when I use a huge mass of source data (e.g. several months at a time), the computation time blows up far beyond what is necessary.

For example, aggregating two weeks of data into week-long chunks takes about 5 minutes on my machine. At that rate, it should take about an hour to process six months of data by just tossing two-week chunks into the script in series. But if I toss in all six months at once, the process takes ten hours instead.

Surely I'm doing something wrong here. Is there anything I can do in terms of rechunking to make things more efficient?
I already tried a few different values for `chunks` in `open_mfdataset`, but nothing seemed to be much better than the default (where each source file is its own chunk). I also tried rechunking the grouped array, since by default the `time_chunks` dimension is all a single chunk (sorry, bad naming there...), but that didn't make much difference either.
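Both attempts, as a minimal sketch (the chunk sizes are arbitrary examples, and `files` and `grouped` are reused from the sketches above):

```python
import xarray as xr

# Attempt 1: override the default one-chunk-per-file behaviour at open time.
ds = xr.open_mfdataset(
    files, combine="nested", concat_dim="obs", chunks={"obs": 1_000_000}
)

# Attempt 2: rechunk the grouped result, which otherwise holds the whole
# "time_chunks" dimension in a single chunk.
grouped = grouped.chunk({"time_chunks": 1})
```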
Any pointers would be greatly appreciated!
---

**Reply:**

> Otherwise the approach is quite nice and exactly how I would write it! Nice work!