
Commit 56187d8

committed
updates per comment
1 parent 8ee714b commit 56187d8


recipes_source/xeon_run_cpu.rst

Lines changed: 71 additions & 60 deletions
@@ -1,36 +1,35 @@
-torch.backends.xeon.run_cpu
-===========================
+Optimizing PyTorch Inference with Intel(R) Xeon(R) Scalable Processors
+======================================================================

-There are a set of configurations that would influence the performance of PyTorch inference running on Intel(R) Xeon(R) Scalable Processors.
+There are several configuration options that can impact the performance of PyTorch inference when executed on Intel(R) Xeon(R) Scalable Processors.
 To get peak performance, the ``torch.backends.xeon.run_cpu`` script is provided that optimizes the configuration of thread and memory management.
 For thread management, the script configures thread affinity and the preload of Intel(R) OMP library.
-For memory management, it configures NUMA binding and preloads optimized memory allocation libraries (e.g. TCMalloc, JeMalloc).
+For memory management, it configures NUMA binding and preloads optimized memory allocation libraries, such as TCMalloc and JeMalloc.
 In addition, the script provides tunable parameters for compute resource allocation in both single instance and multiple instance scenarios,
 helping the users try out an optimal coordination of resource utilization for the specific workloads.

-Prerequisites
--------------
+What you will learn
+-------------------

-NUMA Access Control
-~~~~~~~~~~~~~~~~~~~
+* How to utilize tools like ``numactl``, ``taskset``, Intel(R) OpenMP Runtime Library and optimized memory allocators such as TCMalloc and JeMalloc for enhanced performance.
+* How to configure CPU cores and memory management to maximize PyTorch inference performance on Intel(R) Xeon(R) processors.

-It is a good thing that more and more CPU cores are provided to users in one socket, as it brings in more computation resources.
-However, this also brings memory access competitions. Program can stall because memory is busy to visit.
+Introduction for Optimizations
+------------------------------
+
+Applying NUMA Access Control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is beneficial that an increasing number of CPU cores are being provided to users within a single socket, as this offers greater computational resources.
+However, this also leads to competition for memory access, which can cause programs to stall due to busy memory.
 To address this problem, Non-Uniform Memory Access (NUMA) was introduced.
-Comparing to Uniform Memory Access (UMA), in which scenario all the memories are connected to all cores equally,
-NUMA tells memories into multiple groups. Certain number of memories are directly attached to one socket's integrated memory controller to become local memory of this socket.
+Unlike Uniform Memory Access (UMA), where all memories are equally accessible to all cores,
+NUMA organizes memory into multiple groups. A certain amount of memory is directly attached to one socket's integrated memory controller, becoming the local memory of that socket.
 Local memory access is much faster than remote memory access.

-Users can get CPU information with ``lscpu`` command on Linux to learn how many cores, sockets are there on the machine.
-Also, NUMA information like how CPU cores are distributed can also be retrieved.
-The following is an example of ``lscpu`` execution on a machine with Intel (R) Xeon (R) CPU Max 9480.
-2 sockets were detected. Each socket has 56 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads.
-i.e. each socket has another 56 logical cores. Thus, there are 224 CPU cores on service.
-When indexing CPU cores, usually physical cores are indexed before logical core.
-In this case, the first 56 cores (0-55) are physical cores on the first NUMA socket (node), the second 56 cores (56-111) are physical cores on the second NUMA socket (node).
-Logical cores are indexed afterward. 112-167 are 56 logical cores on the first NUMA socket,
-168-223 are the second 56 logical cores on the second NUMA socket.
-Typically, running PyTorch programs with compute intense workloads should avoid using logical cores to get good performance.
+Users can get CPU information with the ``lscpu`` command on Linux to learn how many cores and sockets there are on the machine.
+Additionally, this command provides NUMA information, such as the distribution of CPU cores.
+Below is an example of executing ``lscpu`` on a machine equipped with an Intel(R) Xeon(R) CPU Max 9480:

 .. code-block:: console

@@ -52,46 +51,48 @@ Typically, running PyTorch programs with compute intense workloads should avoid
 NUMA node1 CPU(s): 56-111,168-223
 ...

-Linux provides a tool, ``numactl``, that allows user control of NUMA policy for processes or shared memory.
+* Two sockets were detected, each containing 56 physical cores. With Hyper-Threading enabled, each core can handle 2 threads, resulting in an additional 56 logical cores per socket. Therefore, the machine has a total of 224 CPU cores in service.
+* Typically, physical cores are indexed before logical cores. In this scenario, cores 0-55 are the physical cores on the first NUMA node, and cores 56-111 are the physical cores on the second NUMA node.
+* Logical cores are indexed subsequently: cores 112-167 correspond to the logical cores on the first NUMA node, and cores 168-223 to those on the second NUMA node.
+
+Typically, running PyTorch programs with compute-intensive workloads should avoid using logical cores to get good performance.
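
If you want to verify this mapping of logical CPUs to physical cores and NUMA nodes on your own machine, one option (not part of the original text) is the extended, one-row-per-CPU output of ``lscpu``; the exact columns depend on the ``lscpu`` version, but CPU, CORE, SOCKET and NODE are typically included, and logical CPUs sharing the same CORE value are Hyper-Threading siblings:

.. code-block:: console

$ lscpu --all --extended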
+
+Linux provides a tool called ``numactl`` that allows user control of NUMA policy for processes or shared memory.
 It runs processes with a specific NUMA scheduling or memory placement policy.
 As described above, cores share high-speed cache in one socket, thus it is a good idea to avoid cross socket computations.
 From a memory access perspective, bounding memory access locally is much faster than accessing remote memories.
-``numactl`` command should have been installed in recent Linux distributions. In case it is missing, we can install it manually with the installation command, like
+The ``numactl`` command comes pre-installed in most recent Linux distributions. In case it is missing, you can install it manually; on Ubuntu, run:

 .. code-block:: console

 $ apt-get install numactl

-on Ubuntu, or
+on CentOS you can run the following command:

 .. code-block:: console

 $ yum install numactl

-on CentOS.
-
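As a minimal sketch of what NUMA binding looks like in practice (a generic ``numactl`` invocation, not a command taken from this tutorial), the following runs a workload on the CPUs of the first NUMA node and allocates its memory from the same node, which avoids the remote memory accesses described above:

.. code-block:: console

$ numactl --cpunodebind=0 --membind=0 python <program.py> [program_args]
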
The ``taskset`` command in Linux is another powerful utility that allows you to set or retrieve the CPU affinity of a running process.
-``taskset`` are pre-installed in most Linux distributions and in case it's not, we can install it with command
+``taskset`` is pre-installed in most Linux distributions; in case it is not, on Ubuntu you can install it with the command:

 .. code-block:: console

 $ apt-get install util-linux

-on Ubuntu, or
+on CentOS you can run the following command:

 .. code-block:: console

 $ yum install util-linux

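As an example, to pin a run to a fixed set of cores, say the first 14 physical cores of the machine described above, a ``taskset`` invocation could look like this (a sketch; adjust the core list to your own topology):

.. code-block:: console

$ taskset -c 0-13 python <program.py> [program_args]
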
-on CentOS.
-
-OpenMP
-~~~~~~
+Using Intel(R) OpenMP Runtime Library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 OpenMP is an implementation of multithreading, a method of parallelizing where a primary thread (a series of instructions executed consecutively) forks a specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
 Users can control OpenMP behaviors with some environment variable settings to fit for their workloads, the settings are read and executed by OMP libraries. By default, PyTorch uses GNU OpenMP Library (GNU libgomp) for parallel computation. On Intel(R) platforms, Intel(R) OpenMP Runtime Library (libiomp) provides OpenMP API specification support. It usually brings more performance benefits compared to libgomp.

-The Intel(R) OpenMP Runtime Library can be installed via the command
+The Intel(R) OpenMP Runtime Library can be installed using one of these commands:

 .. code-block:: console

@@ -103,74 +104,69 @@ or

 $ conda install mkl

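To give a sense of what the launcher automates, the environment variables below show how libiomp can be preloaded and OpenMP threading pinned by hand. This is only a sketch: the ``libiomp5.so`` path is a placeholder, and the values (for example, 56 threads to match one socket of the machine above) should be adapted to your system:

.. code-block:: console

$ export LD_PRELOAD=<path-to>/libiomp5.so:$LD_PRELOAD
$ export OMP_NUM_THREADS=56
$ export KMP_AFFINITY=granularity=fine,compact,1,0
$ export KMP_BLOCKTIME=1
$ python <program.py> [program_args]
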
-Memory Allocator
-~~~~~~~~~~~~~~~~
+Choosing an Optimized Memory Allocator
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Memory allocator plays an important role from performance perspective as well. A more efficient memory usage reduces overhead on unnecessary memory allocations or destructions, and thus results in a faster execution. From practical experiences, for deep learning workloads, JeMalloc or TCMalloc can get better performance by reusing memory as much as possible than default malloc function.

-TCMalloc can be installed by
+You can install TCMalloc by running the following command on Ubuntu:

 .. code-block:: console

 $ apt-get install google-perftools

-on Ubuntu, or
+On CentOS, you can install it by running:

 .. code-block:: console

 $ yum install gperftools

-on CentOS.
-
-In conda environment, it can also be installed by
+In a conda environment, it can also be installed by running:

 .. code-block:: console

 $ conda install conda-forge::gperftools

-JeMalloc can be installed by
+On Ubuntu, JeMalloc can be installed with this command:

 .. code-block:: console

 $ apt-get install libjemalloc2

-on Ubuntu, or
+On CentOS it can be installed by running:

 .. code-block:: console

 $ yum install jemalloc

-on CentOS, or
+In a conda environment, it can also be installed by running:

 .. code-block:: console

 $ conda install conda-forge::jemalloc

-in conda environment.
-
-
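Outside of the launcher script, either allocator can be tried by preloading it before starting Python. This is a sketch only; the library path is a placeholder and should point to wherever your distribution or conda environment installed TCMalloc (or, equivalently, JeMalloc):

.. code-block:: console

$ export LD_PRELOAD=<path-to>/libtcmalloc.so:$LD_PRELOAD
$ python <program.py> [program_args]
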
 Quick Start Example Commands
 ----------------------------

-1. To run single-instance inference with 1 thread on 1 CPU core (only Core #0 would be used)
+1. To run single-instance inference with 1 thread on 1 CPU core (only Core #0 would be used):

 .. code-block:: console

 $ python -m torch.backends.xeon.run_cpu --ninstances 1 --ncores-per-instance 1 <program.py> [program_args]

-2. To run single-instance inference on a single CPU node (NUMA socket).
+2. To run single-instance inference on a single CPU node (NUMA socket):

 .. code-block:: console

 $ python -m torch.backends.xeon.run_cpu --node-id 0 <program.py> [program_args]

-3. To run multi-instance inference, 8 instances with 14 cores per instance on a 112-core CPU
+3. To run multi-instance inference, 8 instances with 14 cores per instance on a 112-core CPU:

 .. code-block:: console

 $ python -m torch.backends.xeon.run_cpu --ninstances 8 --ncores-per-instance 14 <program.py> [program_args]

-4. To run inference in throughput mode, in which all the cores in each CPU node set up an instance
+4. To run inference in throughput mode, in which all the cores in each CPU node set up an instance:

 .. code-block:: console

@@ -180,18 +176,17 @@ Quick Start Example Commands

 Term "instance" here doesn't refer to a cloud instance. This script is executed as a single process which invokes multiple "instances" which are formed from multiple threads. "Instance" is kind of group of threads in this context.

-Usage of torch.backends.xeon.run_cpu
-------------------------------------
+Using ``torch.backends.xeon.run_cpu``
+-------------------------------------

-The argument list and usage guidance can be shown with
+The argument list and usage guidance can be shown with the following command:

 .. code-block:: console

 $ python -m torch.backends.xeon.run_cpu -h
 usage: run_cpu.py [-h] [--multi-instance] [-m] [--no-python] [--enable-tcmalloc] [--enable-jemalloc] [--use-default-allocator] [--disable-iomp] [--ncores-per-instance] [--ninstances] [--skip-cross-node-cores] [--rank] [--latency-mode] [--throughput-mode] [--node-id] [--use-logical-core] [--disable-numactl] [--disable-taskset] [--core-list] [--log-path] [--log-file-prefix] <program> [program_args]

-positional arguments
-~~~~~~~~~~~~~~~~~~~~
+The command above has the following positional arguments:

 +------------------+---------------------------------------------------------+
 | knob | help |
@@ -204,14 +199,14 @@ positional arguments
 Explanation of the options
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-The generic option settings (knobs) are:
+The generic option settings (knobs) include the following:

 +----------------------+------+---------------+-------------------------------------------------------------------------------------------------------------------------+
 | knob | type | default value | help |
 +======================+======+===============+=========================================================================================================================+
 | ``-h``, ``--help`` | | | Show the help message and exit. |
 +----------------------+------+---------------+-------------------------------------------------------------------------------------------------------------------------+
-| ``-m``, ``--module`` | | | Changes each process to interpret the launch script as a python module, executing with the same behavior as 'python -m'.|
+| ``-m``, ``--module`` | | | Changes each process to interpret the launch script as a python module, executing with the same behavior as "python -m".|
 +----------------------+------+---------------+-------------------------------------------------------------------------------------------------------------------------+
 | ``--no-python`` | bool | False | Do not prepend the program with "python" - just exec it directly. Useful when the script is not a Python script. |
 +----------------------+------+---------------+-------------------------------------------------------------------------------------------------------------------------+
@@ -270,7 +265,7 @@ Knobs for controlling instance number and compute resource allocation are:

 .. note::

-Environment variables that will be set by this script include
+Environment variables that will be set by this script include the following:

 +------------------+-------------------------------------------------------------------------------------------------+
 | Environ Variable | Value |
@@ -288,4 +283,20 @@ Knobs for controlling instance number and compute resource allocation are:
 | | "oversize_threshold:1,background_thread:true,metadata_thp:auto". |
 +------------------+-------------------------------------------------------------------------------------------------+

-Please note that the script respects environment variables set preliminarily. i.e. If you have set the environment variables mentioned above before running the script, the values of the variables will not overwritten by the script.
+Please note that the script respects environment variables that are already set. For example, if you have set the environment variables mentioned above before running the script, the values of the variables will not be overwritten by the script.
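
For example, assuming ``OMP_NUM_THREADS`` is among the variables listed above, a value exported before invoking the launcher is kept rather than being derived from ``--ncores-per-instance``:

.. code-block:: console

$ export OMP_NUM_THREADS=4
$ python -m torch.backends.xeon.run_cpu --ninstances 1 --ncores-per-instance 14 <program.py> [program_args]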
+
+Conclusion
+----------
+
+In this tutorial, we explored a variety of advanced configurations and tools designed to optimize PyTorch inference performance on Intel(R) Xeon(R) Scalable Processors.
+By leveraging the ``torch.backends.xeon.run_cpu`` script, we demonstrated how to fine-tune thread and memory management to achieve peak performance.
+We covered essential concepts such as NUMA access control, optimized memory allocators like TCMalloc and JeMalloc, and the use of Intel(R) OpenMP for efficient multithreading.
+
+Additionally, we provided practical command-line examples to guide you through setting up single and multiple instance scenarios, ensuring optimal resource utilization tailored to specific workloads.
+By understanding and applying these techniques, users can significantly enhance the efficiency and speed of their PyTorch applications on Intel(R) Xeon(R) platforms.
+
+See also:
+
+* `PyTorch Performance Tuning Guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations>`__
+* `PyTorch Multiprocessing Best Practices <https://pytorch.org/docs/stable/notes/multiprocessing.html#cpu-in-multiprocessing>`__
+* Grokking PyTorch Intel CPU performance: `Part 1 <https://pytorch.org/tutorials/intermediate/torchserve_with_ipex>`__ `Part 2 <https://pytorch.org/tutorials/intermediate/torchserve_with_ipex_2>`__
