Commit 3769402

Merge branch 'main' into fix-pgs
2 parents 67d797c + 1fcb66e commit 3769402

16 files changed, +836 -100 lines changed

.ci/docker/Dockerfile

Lines changed: 5 additions & 9 deletions
@@ -15,15 +15,11 @@ RUN bash ./install_user.sh && rm install_user.sh
 COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
 RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh

-# Install conda and other packages
-ENV ANACONDA_PYTHON_VERSION=3.10
-ENV CONDA_CMAKE yes
-ENV DOCS yes
-ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
-COPY ./requirements.txt /opt/conda/
-COPY ./common/install_conda.sh install_conda.sh
-COPY ./common/common_utils.sh common_utils.sh
-RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements.txt
+COPY ./common/install_pip_requirements.sh install_pip_requirements.sh
+COPY ./requirements.txt requirements.txt
+RUN bash ./install_pip_requirements.sh && rm install_pip_requirements.sh
+
+RUN ln -s /usr/bin/python3 /usr/bin/python

 USER ci-user
 CMD ["bash"]
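Note: a quick local smoke test of the slimmer image could look like the following; the tutorials-ci:latest tag is only a hypothetical placeholder for whatever tag .ci/docker/build.sh assigns, not a name introduced by this commit.

    # Hypothetical image tag; substitute the tag produced by your local build.
    docker run --rm tutorials-ci:latest python --version   # resolves through the new /usr/bin/python symlink
    docker run --rm tutorials-ci:latest pip list            # lists the packages installed from requirements.txt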

.ci/docker/build.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ set -exu
 IMAGE_NAME="$1"
 shift

-export UBUNTU_VERSION="20.04"
+export UBUNTU_VERSION="22.04"
 export CUDA_VERSION="12.4.1"

 export BASE_IMAGE="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
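Note: with the bumped Ubuntu version, the exported variables compose into the base image tag as sketched below; this only spells out the substitution already present in build.sh, it is not an additional change.

    # Same variables as in the diff above, shown only to make the resolution explicit.
    export UBUNTU_VERSION="22.04"
    export CUDA_VERSION="12.4.1"
    export BASE_IMAGE="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
    echo "${BASE_IMAGE}"   # prints: nvidia/cuda:12.4.1-devel-ubuntu22.04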

.ci/docker/common/common_utils.sh

Lines changed: 0 additions & 26 deletions
This file was deleted.

.ci/docker/common/install_base.sh

Lines changed: 4 additions & 2 deletions
@@ -10,7 +10,7 @@ install_ubuntu() {
   apt-get install -y --no-install-recommends \
     build-essential \
     ca-certificates \
-    cmake=3.16* \
+    cmake=3.22* \
     curl \
     git \
     wget \
@@ -27,7 +27,9 @@ install_ubuntu() {
     libglfw3-dev \
     sox \
     libsox-dev \
-    libsox-fmt-all
+    libsox-fmt-all \
+    python3-pip \
+    python3-dev

   # Cleanup package manager
   apt-get autoclean && apt-get clean

.ci/docker/common/install_conda.sh

Lines changed: 0 additions & 55 deletions
This file was deleted.
.ci/docker/common/install_pip_requirements.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+set -ex
+
+# Install pip packages
+pip install --upgrade pip
+pip install -r ./requirements.txt
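Note: the script resolves ./requirements.txt against its working directory, which matches the Dockerfile above (both files are copied side by side before the RUN step). A rough local equivalent, assuming it is invoked from .ci/docker/ where requirements.txt lives:

    cd .ci/docker
    bash ./common/install_pip_requirements.sh   # upgrades pip, then installs ./requirements.txt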

.jenkins/metadata.json

Lines changed: 4 additions & 1 deletion
@@ -33,7 +33,7 @@
     },
     "recipes_source/torch_export_aoti_python.py": {
         "needs": "linux.g5.4xlarge.nvidia.gpu"
-    },
+    },
     "advanced_source/pendulum.py": {
         "needs": "linux.g5.4xlarge.nvidia.gpu",
         "_comment": "need to be here for the compiling_optimizer_lr_scheduler.py to run."
@@ -58,6 +58,9 @@
     "intermediate_source/scaled_dot_product_attention_tutorial.py": {
         "needs": "linux.g5.4xlarge.nvidia.gpu"
     },
+    "intermediate_source/transformer_building_blocks.py": {
+        "needs": "linux.g5.4xlarge.nvidia.gpu"
+    },
     "recipes_source/torch_compile_user_defined_triton_kernel_tutorial.py": {
         "needs": "linux.g5.4xlarge.nvidia.gpu"
     },

.jenkins/validate_tutorials_built.py

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@
     "intermediate_source/mnist_train_nas", # used by ax_multiobjective_nas_tutorial.py
     "intermediate_source/fx_conv_bn_fuser",
     "intermediate_source/_torch_export_nightly_tutorial", # does not work on release
+    "intermediate_source/transformer_building_blocks", # does not work on release
     "advanced_source/super_resolution_with_onnxruntime",
     "advanced_source/usb_semisup_learn", # fails with CUDA OOM error, should try on a different worker
     "prototype_source/fx_graph_mode_ptq_dynamic",

beginner_source/ddp_series_fault_tolerance.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 Fault-tolerant Distributed Training with ``torchrun``
 =====================================================

-Authors: `Suraj Subramanian <https://github.com/suraj813>`__
+Authors: `Suraj Subramanian <https://github.com/subramen>`__

 .. grid:: 2

beginner_source/ddp_series_theory.rst

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 What is Distributed Data Parallel (DDP)
 =======================================

-Authors: `Suraj Subramanian <https://github.com/suraj813>`__
+Authors: `Suraj Subramanian <https://github.com/subramen>`__

 .. grid:: 2

en-wordlist.txt

Lines changed: 19 additions & 0 deletions
@@ -1,5 +1,6 @@
 ACL
 ADI
+ALiBi
 AOT
 AOTInductor
 APIs
@@ -79,6 +80,7 @@ FX
 FX's
 FairSeq
 Fastpath
+FFN
 FloydHub
 FloydHub's
 Frobenius
@@ -127,6 +129,7 @@ Kihyuk
 Kiuk
 Kubernetes
 Kuei
+KV
 LRSchedulers
 LSTM
 LSTMs
@@ -162,6 +165,7 @@ NLP
 NTK
 NUMA
 NaN
+NaNs
 NanoGPT
 Netron
 NeurIPS
@@ -231,6 +235,7 @@ Sigmoid
 SoTA
 Sohn
 Spacy
+SwiGLU
 TCP
 THP
 TIAToolbox
@@ -276,6 +281,7 @@ Xcode
 Xeon
 Yidong
 YouTube
+Zipf
 accelerometer
 accuracies
 activations
@@ -305,6 +311,7 @@ bbAP
 benchmarked
 benchmarking
 bitwise
+bool
 boolean
 breakpoint
 broadcasted
@@ -333,6 +340,7 @@ csv
 cuDNN
 cuda
 customizable
+customizations
 datafile
 dataflow
 dataframe
@@ -377,6 +385,7 @@ fbgemm
 feedforward
 finetune
 finetuning
+FlexAttention
 fp
 frontend
 functionalized
@@ -431,6 +440,7 @@ mAP
 macos
 manualSeed
 matmul
+matmuls
 matplotlib
 memcpy
 memset
@@ -446,6 +456,7 @@ modularized
 mpp
 mucosa
 multihead
+MultiheadAttention
 multimodal
 multimodality
 multinode
@@ -456,7 +467,11 @@ multithreading
 namespace
 natively
 ndarrays
+nheads
 nightlies
+NJT
+NJTs
+NJT's
 num
 numericalize
 numpy
@@ -532,6 +547,7 @@ runtime
 runtime
 runtimes
 scalable
+SDPA
 sharded
 softmax
 sparsified
@@ -591,12 +607,14 @@ tradeoff
 tradeoffs
 triton
 uint
+UX
 umap
 uncomment
 uncommented
 underflowing
 unfused
 unimodal
+unigram
 unnormalized
 unoptimized
 unparametrized
@@ -618,6 +636,7 @@ warmstarted
 warmstarting
 warmup
 webp
+wikitext
 wsi
 wsis
 Meta's

index.rst

Lines changed: 8 additions & 0 deletions
@@ -664,6 +664,14 @@ Welcome to PyTorch Tutorials
    :link: beginner/knowledge_distillation_tutorial.html
    :tags: Model-Optimization,Image/Video

+
+.. customcarditem::
+   :header: Accelerating PyTorch Transformers by replacing nn.Transformer with Nested Tensors and torch.compile()
+   :card_description: This tutorial goes over recommended best practices for implementing Transformers with native PyTorch.
+   :image: _static/img/thumbnails/cropped/pytorch-logo.png
+   :link: intermediate/transformer_building_blocks.html
+   :tags: Transformer
+
 .. Parallel-and-Distributed-Training

intermediate_source/process_group_cpp_extension_tutorial.rst

Lines changed: 2 additions & 3 deletions
@@ -25,9 +25,8 @@ Basics

 PyTorch collective communications power several widely adopted distributed
 training features, including
-`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
-`ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer>`__,
-`FullyShardedDataParallel <https://github.com/pytorch/pytorch/blob/master/torch/distributed/_fsdp/fully_sharded_data_parallel.py>`__.
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ and
+`ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer>`__.
 In order to make the same collective communication API work with
 different communication backends, the distributed package abstracts collective
 communication operations into a

intermediate_source/rpc_async_execution.rst

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ speed.
 Batch-Processing CartPole Solver
 --------------------------------

-This section uses CartPole-v1 from `OpenAI Gym <https://gym.openai.com/>`__ as
+This section uses CartPole-v1 from OpenAI Gym as
 an example to show the performance impact of batch processing RPC. Please note
 that since the goal is to demonstrate the usage of
 `@rpc.functions.async_execution <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.functions.async_execution>`__
