
Commit dae09ff

Authored by myronwyli and Wenqi Li
SlidingWindowInferer: option to adaptively stitch in cpu memory for large images (#5297)
This adds an option to provide a maximum input image volume (number of elements) above which stitching dynamically switches to cpu memory (to avoid gpu out-of-memory crashes). For example, with `cpu_thresh=400*400*400`, any input image with a larger volume will be stitched on cpu. At the moment, a user must decide beforehand whether to stitch ALL images on cpu or gpu (by specifying the `device` parameter). But in many datasets only a few large images require `device="cpu"`, and running inference on cpu for ALL of them would be unnecessarily slow.

Related to #4625 #4495 #3497 #4726 #4588

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

Signed-off-by: myron <[email protected]>
Co-authored-by: Wenqi Li <[email protected]>
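The device-selection rule described above can be sketched as a small standalone function. This is an illustration only: `choose_stitch_device` is a hypothetical helper, not part of MONAI; the real change embeds the same condition directly in `SlidingWindowInferer.__call__`.

```python
from math import prod
from typing import Optional, Tuple

def choose_stitch_device(
    spatial_shape: Tuple[int, ...],
    device: Optional[str],
    cpu_thresh: Optional[int],
) -> Optional[str]:
    """Hypothetical sketch of the PR's logic: fall back to cpu stitching
    only when no explicit device was given and the spatial volume of the
    input exceeds the threshold."""
    if device is None and cpu_thresh is not None and prod(spatial_shape) > cpu_thresh:
        return "cpu"  # stitch in cpu memory if the image is too large
    return device

# A small image keeps the default (device=None) path:
print(choose_stitch_device((128, 128, 128), None, 400 * 400 * 400))      # → None
# A large image is redirected to cpu stitching:
print(choose_stitch_device((512, 512, 512), None, 400 * 400 * 400))      # → cpu
# An explicitly requested device always wins, regardless of size:
print(choose_stitch_device((512, 512, 512), "cuda:0", 400 * 400 * 400))  # → cuda:0
```

This matches the commit's intent: only the few oversized images in a dataset pay the cpu-stitching cost, while everything else stays on gpu.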
1 parent 02a6a6d commit dae09ff

File tree

1 file changed: +12 −2 lines

monai/inferers/inferer.py

Lines changed: 12 additions & 2 deletions
```diff
@@ -122,6 +122,9 @@ class SlidingWindowInferer(Inferer):
             `inputs` and `roi_size`. Output is on the `device`.
         progress: whether to print a tqdm progress bar.
         cache_roi_weight_map: whether to precompute the ROI weight map.
+        cpu_thresh: when provided, dynamically switch to stitching on cpu (to save gpu memory)
+            when input image volume is larger than this threshold (in pixels/voxels).
+            Otherwise use ``"device"``. Thus, the output may end up on either cpu or gpu.
 
     Note:
         ``sw_batch_size`` denotes the max number of windows per network inference iteration,
@@ -142,8 +145,9 @@ def __init__(
         device: Union[torch.device, str, None] = None,
         progress: bool = False,
         cache_roi_weight_map: bool = False,
+        cpu_thresh: Optional[int] = None,
     ) -> None:
-        Inferer.__init__(self)
+        super().__init__()
         self.roi_size = roi_size
         self.sw_batch_size = sw_batch_size
         self.overlap = overlap
@@ -154,6 +158,7 @@ def __init__(
         self.sw_device = sw_device
         self.device = device
         self.progress = progress
+        self.cpu_thresh = cpu_thresh
 
         # compute_importance_map takes long time when computing on cpu. We thus
         # compute it once if it's static and then save it for future usage
@@ -189,6 +194,11 @@ def __call__(
             kwargs: optional keyword args to be passed to ``network``.
 
         """
+
+        device = self.device
+        if device is None and self.cpu_thresh is not None and inputs.shape[2:].numel() > self.cpu_thresh:
+            device = "cpu"  # stitch in cpu memory if image is too large
+
         return sliding_window_inference(
             inputs,
             self.roi_size,
@@ -200,7 +210,7 @@ def __call__(
             self.padding_mode,
             self.cval,
             self.sw_device,
-            self.device,
+            device,
             self.progress,
             self.roi_weight_map,
             *args,
```
0 commit comments
