Stable release of automatic mixed precision (AMP). New Beta features include a TensorPipe backend for RPC, memory profiler, and several improvements to distributed training for both RPC and DDP.
PyTorch 1.6.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
Highlights
The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.
A few of the highlights include:
- Automatic mixed precision (AMP) training is now natively supported and a stable feature - thanks to NVIDIA’s contributions;
- Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
- New profiling tools providing tensor-level memory consumption information; and
- Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.
Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here.
[Stable] Automatic Mixed Precision (AMP) Training
AMP allows users to easily enable automatic mixed precision training enabling higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
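As a rough illustration of the torch.cuda.amp API described above, the sketch below wraps the forward pass in autocast and scales the loss with GradScaler. The toy model, data, and hyperparameters are assumptions made for this example; on a machine without CUDA both helpers are disabled and the loop simply runs in float32.

```python
import torch

# AMP targets CUDA devices; without one, both helpers are disabled
# and the loop below runs in plain float32.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(8, 2).to(device)  # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(4, 8, device=device)
targets = torch.randn(4, 2, device=device)

for _ in range(3):
    optimizer.zero_grad()
    # Ops inside this context pick float16 or float32 per op.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so small float16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
    scaler.update()         # adjusts the scale factor for the next iteration

final_loss = loss.item()
```

Only the forward pass and loss computation sit inside the autocast context; the backward pass and optimizer step run outside it.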
[Beta] TensorPipe backend for RPC
PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.
# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)
# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)
[Beta] Memory Profiler
The torch.autograd.profiler API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.
Here is an example usage of the API:
import torch
import torchvision.models as models
import torch.autograd.profiler as profiler
model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# --------------------------- --------------- --------------- ---------------
# Name CPU Mem Self CPU Mem Number of Calls
# --------------------------- --------------- --------------- ---------------
# empty 94.79 Mb 94.79 Mb 123
# resize_ 11.48 Mb 11.48 Mb 2
# addmm 19.53 Kb 19.53 Kb 1
# empty_strided 4 b 4 b 1
# conv2d 47.37 Mb 0 b 20
# --------------------------- --------------- --------------- ---------------
Distributed and RPC Features and Improvements
[Beta] DDP+RPC
PyTorch Distributed supports two powerful paradigms: DDP for fully synchronous data parallel training of models and the RPC framework, which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match them to try out hybrid parallelism paradigms.
Starting with PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.
# On each trainer
remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)
for data in batch:
    with torch.distributed.autograd.context() as context_id:
        res = remote_emb(data)
        loss = ddp_model(res)
        torch.distributed.autograd.backward(context_id, [loss])
[Beta] RPC - Asynchronous User Functions
RPC Asynchronous User Functions supports the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processes a request, one RPC thread waits until the user function returns. If the user function contains IO (e.g., nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications have to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the @rpc.functions.async_execution decorator; and 2) Let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object. See below for an example:
import torch
import torch.distributed.rpc as rpc

@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1",
    async_add_chained,
    args=("worker2", torch.ones(2), 1, 1)
)
print(ret)  # prints tensor([3., 3.])
- Tutorial for performant batch RPC using Asynchronous User Functions | Link
- Documentation | Link
- Usage examples | Link
[Beta] Fork/Join Parallelism
This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and allows the ability to unlock the computational power of parallel architectures (e.g. many-core CPUs) for task level parallelism.
Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait. In the below example, we parallelize execution of foo:
import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))
- Documentation | Link
Backwards Incompatible Changes
Dropped support for Python <= 3.5 (#39879)
The minimum version of Python we support now is 3.6. Please upgrade your Python to match. If you use conda, instructions for setting up a new environment with Python >= 3.6 can be found here.
Throw a RuntimeError for deprecated torch.div and torch.addcdiv integer floor division behavior (#38762, #38620)
In 1.5.1 and older PyTorch releases, torch.div, torch.addcdiv, and the / operator perform integer floor division on integer tensors. In 1.6, attempting to perform integer division throws a RuntimeError, and in 1.7 the behavior will change so that these operations always perform true division (consistent with Python and NumPy division).
To floor divide integer tensors, please use torch.floor_divide instead.
Version 1.5.1:
>>> torch.tensor(3) / torch.tensor(2)
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer
division of tensors using div or / is deprecated, and in a future
release div will perform true division as in Python 3. Use true_divide
or floor_divide (// in Python) instead.
tensor(1)
Version 1.6.0:
>>> # NB: the following is equivalent to
>>> # torch.floor_divide(torch.tensor(3), torch.tensor(2))
>>> torch.tensor(3) // torch.tensor(2)
tensor(1)
The fix for torch.addcdiv is similar.
Version 1.5.1:
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> torch.addcdiv(input, tensor, other, value=value)
../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning:
Integer division with addcdiv is deprecated, and in a future
release addcdiv will perform a true division of tensor1 and
tensor2. The current addcdiv behavior can be replicated using
floor_divide for integral inputs (self + value * tensor1 // tensor2)
and division for float inputs (self + value * tensor1 / tensor2).
The new addcdiv behavior can be implemented with
true_divide (self + value * torch.true_divide(tensor1, tensor2).
tensor(0)
Version 1.6.0:
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> (input + torch.floor_divide(value * tensor, other))
tensor(0)
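For comparison, the two explicit division functions named in the warnings above can be called side by side. This is a small sketch with arbitrarily chosen integer inputs; it also spells out the true-division form of addcdiv using the formula quoted in the warning text.

```python
import torch

a = torch.tensor(3)
b = torch.tensor(2)

# Explicit true division: integer inputs produce a floating-point result.
true_quotient = torch.true_divide(a, b)
# Explicit floor division: the result stays an integer tensor.
floor_quotient = torch.floor_divide(a, b)

print(true_quotient)   # tensor(1.5000)
print(floor_quotient)  # tensor(1)

# True-division form of addcdiv: self + value * true_divide(tensor1, tensor2)
input, tensor1, tensor2, value = torch.tensor(0), torch.tensor(1), torch.tensor(3), 1
addcdiv_true = input + value * torch.true_divide(tensor1, tensor2)
```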
Prevent cross-device data movement for zero-dimension CUDA tensors in binary pointwise PyTorch operators (#38998)
In previous versions of PyTorch, zero dimensional CUDA tensors could be moved across devices implicitly while performing binary pointwise operations (e.g. addition, subtraction, multiplication, division, and others). For example,
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.
To perform binary pointwise operations on data of different devices, please cast the tensors to the correct device by using Tensor.to:
Version 1.5.1:
>>> torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
tensor([6, 6], device='cuda:1')
Version 1.6.0:
>>> torch.tensor(5, device='cuda:0').to('cuda:1') + torch.tensor((1, 1), device='cuda:1')
tensor([6, 6], device='cuda:1')
Dropped support for CUDA 9.2 on Windows
In previous versions of PyTorch, we provided an installation option for Windows environments running CUDA 9.2. Starting from PyTorch 1.6.0, we are no longer providing those binaries. Please upgrade your CUDA version to 10.1 or 10.2 and install a PyTorch binary for one of those CUDA versions instead.
PyTorch release binaries dropped dedicated bytecode for CUDA compute capability 6.1; removed PTX for CUDA compute capability 3.7
To check whether you are affected, please find your GPU in a table in this link.
If you are using an NVIDIA GPU with compute capability 6.1, you may notice a performance hit when using the release binaries (installed via pip or conda). We stopped building for CUDA compute capability 6.1, but PyTorch programs should still continue to work with those devices. If you do notice a performance hit, a workaround is to compile PyTorch from source.
If you are using an NVIDIA GPU with compute capability 3.7 and relied on PTX, we have dropped support for that in our release binaries (installed via pip or conda). Potential workarounds are to install a previous version of PyTorch or to compile PyTorch from source.
Changed how bool tensors are constructed from non-bool values to match Python, C++, and NumPy (#38392)
In previous versions of PyTorch, when a bool tensor was constructed from a floating-point tensor, we would first convert the tensor to a long tensor, then to a bool tensor. This is not consistent with how bools are interpreted in Python, C++, and NumPy (just to name a few), which interpret 0 floating-point values as False and everything else as True.
If you were relying on the previous behavior, the following code will achieve the same effect.
Version 1.5.1:
>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2], dtype=torch.bool)
tensor([ True, True, False, False, False, True, True])
Version 1.6.0:
>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2]).long().bool()
tensor([ True, True, False, False, False, True, True])
Throw RuntimeError when torch.full would infer a float dtype from a bool or integral fill value (#40364)
In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
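A minimal sketch of the explicit form: passing dtype alongside a bool or integral fill value gives the same result regardless of which default applies. The shapes and fill values here are arbitrary choices for illustration.

```python
import torch

# Integral and bool fill values: state the dtype explicitly.
int_fill = torch.full((2, 3), 7, dtype=torch.long)
bool_fill = torch.full((2,), True, dtype=torch.bool)

# Floating-point fill values are unaffected and use the default float dtype.
float_fill = torch.full((2, 3), 7.0)

print(int_fill.dtype, bool_fill.dtype, float_fill.dtype)
```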
Enabled thread parallelism for autograd on CPU (#33157)
In previous versions of PyTorch, running .backward() in multiple threads causes them to be serialized in a specific order, resulting in no parallelism on CPU. In PyTorch 1.6.0, running .backward() in multiple threads no longer serializes the execution and instead autograd will run those in parallel.
This is BC-breaking for the following two use cases:
- If any weights are shared among threads, gradient accumulation that was previously deterministic may become non-deterministic in 1.6 as two different threads will write to the .grad attribute in a non-deterministic order.
- If you use any C++ hooks, those are not guaranteed to be thread-safe. Please change them to be thread-safe.
In more detail, in 1.6.0, when you run backward() or grad() via python, TorchScript or the C++ API in multiple threads on CPU, you should expect to see extra concurrency. For example, you can manually write multithreaded Hogwild training code like:
import threading
import torch

# Define a train function to be used in different threads
def train_fn(model, input):
    # forward
    y = model(input)
    # backward
    y.sum().backward()
    # potential optimizer update

# define your model in python or in TorchScript
# (Model is a placeholder for your own module)
model = Model()
# Users write their own threading code to drive the train_fn
threads = []
for _ in range(10):
    # define or load the data
    input = torch.ones(5, 5, requires_grad=True)
    p = threading.Thread(target=train_fn, args=(model, input))
    p.start()
    threads.append(p)
for p in threads:
    p.join()
Note that when you use the same model and call backward() concurrently in multiple threads, model parameters are automatically shared across threads. The gradient accumulation might become non-deterministic as two backward calls might access and try to accumulate the same .grad attribute. Although we do proper locking to avoid data corruption, we don't guarantee the order in which the ops are executed, so non-determinism might arise, but this is an expected pattern in multithreaded training. You could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid the non-determinism.
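The functional alternative can be sketched as follows. Each thread calls torch.autograd.grad(), which returns the gradient instead of accumulating into the shared .grad attribute; the weight, the per-thread inputs, and the results list are assumptions made for this example.

```python
import threading
import torch

shared_weight = torch.ones(3, requires_grad=True)
results = [None] * 4  # one slot per thread, so no two threads write the same entry

def worker(index):
    x = torch.full((3,), float(index + 1))
    loss = (shared_weight * x).sum()
    # Returns the gradient as a value; shared_weight.grad is never touched.
    (grad,) = torch.autograd.grad(loss, [shared_weight])
    results[index] = grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every thread keeps its own gradient, how and in what order the gradients are combined afterwards is up to the caller.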
For thread safety:
- The custom Python/C++ Autograd Functions (both forward and backward) are properly protected and are guaranteed to be thread safe in 1.6.0.
- For hooks, both Python/C++ hooks will run concurrently. Note that in C++, just like in regular C++ threading, you will need to do proper locking when writing shared objects, so previous custom C++ hooks might not work nicely under a multithreaded environment in 1.6.0. In Python, just like in regular python threading, you can read/write objects safely but the order (and thus determinism) is not guaranteed.
Change autograd gradient accumulation logic to yield .grads that match the weights' memory layout (#40358)
In previous versions of PyTorch, autograd would yield contiguous gradients. Now, gradients have the same memory layout as their respective weights. This should result in silent performance improvements. Since PyTorch operators generally support non-contiguous tensors, this should have no functional effect on most PyTorch programs. A known exception is when accessing param.grad and performing an operation that requires a contiguous tensor, such as param.grad.view(-1). In this case, you will receive an error as follows:
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
If a user wants to force accumulation into a grad with a particular layout, they can preset param.grad to a zeroed tensor with the desired strides or manually set grad to have the desired strides (param.grad = param.grad.contiguous(desired format)).
See the below section on “Note: BC-breaking memory format changes” for more details.
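The view error described above can be reproduced on any non-contiguous tensor. The sketch below uses a transposed tensor as a stand-in for a gradient whose layout follows its weight; that stand-in is an assumption for illustration, not how autograd produces the layout.

```python
import torch

# A transposed tensor is non-contiguous, standing in for a non-contiguous .grad.
grad_like = torch.arange(6.0).reshape(2, 3).t()
assert not grad_like.is_contiguous()

try:
    grad_like.view(-1)  # view() requires compatible strides
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() copies when needed, so it works for any layout.
flat = grad_like.reshape(-1)
print(view_failed, flat)
```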
Change memory format promotion rules of pointwise operators (#37968)
In previous versions of PyTorch, performing a binary pointwise operation between a Contiguous and a Channels Last tensor produced a Channels Last tensor. In PyTorch 1.6, this now returns a tensor with the layout of the first operand.
See the below section on “Note: BC-breaking memory format changes” for more details.
Note: BC-breaking memory format changes
Operations that now return tensors in a different memory format generally should have no functional effect on most PyTorch programs because PyTorch operators generally support non-contiguous tensors.
The most common incompatibility with Python programs is with the view operator, which has specific stride requirements. If these requirements are no longer met as a result of this change, you will get an error message indicating that you should use reshape instead, i.e. "RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."
Another possible incompatibility is if you have a (usually) C++ operator implementation that works directly on memory (i.e. calls data_ptr and relies on the strides being contiguous).
nn.functional.interpolate: recompute_scale_factor default behavior changed from True to False (#39453)
In PyTorch 1.5.1 and older versions, nn.functional.interpolate(input, size, scale_factor, ..., recompute_scale_factor) has a default of recompute_scale_factor = True. In PyTorch 1.6, we’ve changed the default to recompute_scale_factor = False.
Depending on the precision of the scale_factor, this may result in an output tensor with different values than before. To retain the old behavior, simply change your code to use recompute_scale_factor = True.
More concretely, what recompute_scale_factor = True means is, if the user passes in a scale_factor:
- We will first compute the new output size; and
- Then, we will compute a new scale_factor by dividing the output size by the input size and sending it to an internal helper function.
- The new scale_factor is used in the interpolate computation but in some cases is different from the scale_factor the user passed in.
This behavior resulted in loss of precision so we deprecated it in PyTorch 1.5.0. In PyTorch 1.6 and onward, recompute_scale_factor has a default of False, which means that we pass the user-provided scale_factor directly to an internal helper function.
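The arithmetic behind the recomputation can be worked through in plain Python. This sketch assumes the output size along a dimension is floor(input_size * scale_factor), as documented for interpolate; the helper name and the sample numbers are hypothetical.

```python
import math

def recomputed_scale(input_size, scale_factor):
    """Mimic the recompute_scale_factor=True steps for a single dimension."""
    output_size = math.floor(input_size * scale_factor)  # step 1: new output size
    new_scale = output_size / input_size                 # step 2: recomputed scale
    return output_size, new_scale

output_size, effective_scale = recomputed_scale(input_size=5, scale_factor=1.5)
print(output_size, effective_scale)  # 7 1.4
```

Here the user asked for a scale of 1.5, but the recomputed value is 1.4, which is the kind of discrepancy the new default avoids.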
out= arguments of pointwise and reduction functions no longer participate in type promotion (#39655)
In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,
out = torch.add(a, b)
could produce a different result than
torch.add(a, b, out=out)
This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
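A short sketch of the 1.6 behavior, with tensors chosen for illustration: the computation dtype is determined by the inputs alone, and the out tensor only receives a copy of the result cast to its own dtype.

```python
import torch

a = torch.tensor([1, 2])          # int64
b = torch.tensor([0.5, 0.5])      # float32
out = torch.empty(2, dtype=torch.float64)

direct = torch.add(a, b)          # dtype follows the inputs
torch.add(a, b, out=out)          # same computation, result cast into out

print(torch.result_type(a, b))    # torch.float32
print(direct.dtype, out.dtype)
```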
Changed torch.quasirandom.SobolEngine(..., scramble=True, seed=None) to respect torch.manual_seed when a seed has not been provided (#36427)
In previous versions of PyTorch, SobolEngine(..., scramble=True, seed=None) did not respect any calls to torch.manual_seed. The expected behavior for random number generation functions is to respect the seed set by torch.manual_seed, so we’ve changed SobolEngine to match.
If you were relying on the old behavior where SobolEngine ignores torch.manual_seed, please explicitly pass a different seed to SobolEngine:
Version 1.5.1:
>>> torch.manual_seed(1337)
# SobolEngine ignores the manual_seed and instead uses its own.
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=None).draw(3)
Version 1.6.0:
>>> import time
>>> torch.manual_seed(1337)
# To replicate the old behavior, pass a seed to SobolEngine.
>>> ms_since_epoch = int(round(time.time() * 1000))
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=ms_since_epoch).draw(3)
Tensor.random_(to, from): Enforce check that from and to are within the bounds of the Tensor’s dtype (#37507)
In previous versions of PyTorch, to and from did not have to be within the bounds of the tensor’s dtype (this raised a warning). The behavior of random_ in that case can be unexpected. We are making this a hard error starting from PyTorch 1.6.0; please modify your code if you run into the error.
Version 1.5.1:
>>> tensor = torch.zeros(10, dtype=torch.uint8)
# 256 is the maximum value for `to` for `torch.uint8`
>>> tensor.random_(0, 257)
UserWarning: to - 1 is out of bounds for unsigned char.
Version 1.6.0:
>>> tensor = torch.zeros(10, dtype=torch.uint8)
# 256 is the maximum value for `to` for `torch.uint8`
>>> tensor.random_(0, 256)
Dropped support for CUDA < 9.2 for source builds (#38977, #36846)
If you build PyTorch from source, we’ve dropped support for using CUDA < 9.2 (run nvcc --version to check your CUDA version). Users who install PyTorch packages via conda and/or pip are unaffected.
DataLoader’s __len__ changed to return number of batches when holding an IterableDataset (#38925)
In previous versions of PyTorch, len(<instance of dataloader holding an IterableDataset>) would return the number of examples in the dataset. We’ve changed it to be the number of batches (e.g., the number of examples divided by the DataLoader’s batch_size) to be consistent with the computation of length when the DataLoader has a BatchSampler.
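The new length can be checked with a small IterableDataset that defines __len__. The dataset class below is a hypothetical example written for this sketch; with 10 examples and a batch size of 4, the loader reports 3 batches (the last one partial).

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class CountingDataset(IterableDataset):
    """Hypothetical dataset yielding the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))

    def __len__(self):
        return self.n

loader = DataLoader(CountingDataset(10), batch_size=4)
num_batches = len(loader)   # number of batches, not number of examples
batches = list(loader)      # batches of 4, 4, and 2 items
print(num_batches, [len(b) for b in batches])
```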
torch.backends.cudnn.flags: deleted unused verbose flag (#39228)
The verbose flag did nothing, so we deleted it. If you were passing a value to flags for verbose, please remove it.
RPC
RpcBackendOptions takes float instead of timedelta for timeout argument to stay consistent with timeout types in other TorchScriptable RPC APIs.
# v1.5
rpc.init_rpc(
    "worker1",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=20)
    )
)
# v1.6
rpc.init_rpc(
    "worker1",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=20  # seconds
    )
)
TorchScript
The Default Executor Is Rolled Back To Legacy (#41017)
We rolled back to the old fuser and the legacy executor in this release in order to recover some reported performance regressions. In future releases we plan to reach the same or better performance with a new redesigned executor and fuser.
In order to switch back to the executor used in the 1.5 release one could use the following API:
- in Python: call torch._C._jit_set_profiling_executor(True) before you call your model for the first time,
- in C++: include #include <torch/csrc/jit/runtime/graph_executor.h> and set getExecutorMode() = true before you invoke your model for the first time.
Added dynamic versioning (#40279)
Note: this isn’t actually BC-breaking but we are listing it here because it is BC-Improving.
The PyTorch Team recommends saving and loading modules with the same version of PyTorch. Older versions of PyTorch may not support newer modules, and newer versions may have removed or modified older behavior. These changes are explicitly described in PyTorch’s release notes, and modules relying on functionality that has changed may need to be updated to continue working properly.
In this release, the historic behavior of torch.div and torch.full is preserved for models saved via torch.jit.save in previous versions of PyTorch. Modules saved with the current version of PyTorch will use the latest torch.div and torch.full behavior. See the notes above for the BC changes to those operators.
Internals
The following is a list of BC-breaking changes to some of PyTorch’s internal components.
Dispatcher C++ API has had some spring cleaning. This is still considered an “internal” API, but it is becoming more public facing as it stabilizes.
- Renamed callUnboxed() to call() in Dispatcher, OperatorHandle, KernelFunction (#37999)
- The TensorId suffix has been removed from most DispatchKey enum entries (#36240)
- Removed ::callOp(); use Dispatcher::call instead (renamed in #37797, removed in #38351, #38742)
- Removed KernelFunction::makeFromUnboxedFunctorFactory; use makeFromUnboxedFunctor directly instead (#35488)
- Renamed boxing/unboxing files and utilities in ATen/core/boxing (#35411)
autograd.gradcheck and autograd.gradgradcheck: Added a new default-true argument check_undefined_grad (#39400)
Internally, in the autograd engine, we use a special undefined Tensor value to represent zero-filled gradients and expect backward functions and user-defined torch.autograd.Functions to gracefully handle those values. When check_undefined_grad is True (the default for PyTorch 1.6+), gradcheck/gradgradcheck test that the operation in question supports undefined output gradients. This may cause a previously succeeding gradcheck to fail.
You can turn the check off by setting check_undefined_grad to False. As long as autograd does not error out due to an undefined gradient in your model, then everything should be fine.
Version 1.5.1:
>>> torch.autograd.gradcheck(my_custom_function, inputs)
True
Version 1.6.0:
>>> # To keep the previous behavior
>>> torch.autograd.gradcheck(my_custom_function, inputs, check_undefined_grad=False)
True
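A self-contained version of the calls above, using a simple differentiable function in place of my_custom_function. The function and the input shape are assumptions for this sketch; gradcheck expects double-precision inputs that require grad.

```python
import torch

def scaled_square(x):
    # Stand-in for a user-defined function under test.
    return (x * x) * 0.5

inputs = (torch.randn(3, dtype=torch.double, requires_grad=True),)

# Default in 1.6: also checks that undefined output gradients are handled.
passed_default = torch.autograd.gradcheck(scaled_square, inputs)
# Previous behavior: skip the undefined-gradient check.
passed_relaxed = torch.autograd.gradcheck(
    scaled_square, inputs, check_undefined_grad=False
)
print(passed_default, passed_relaxed)
```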
[C++ API] Changed the TensorIterator API (#39803)
TensorIterator is an implementation detail for writing kernels that is exposed in our C++ API. We’ve modified how developers interact with TensorIterator; please see the Pull Request for more details.
Removed torch._min and torch._max (#38440)
torch._min and torch._max are undocumented and were intended to be implementation details; we expect very few users, if any at all, to be using them. We’ve deleted them in PyTorch 1.6.0. Please use torch.min/torch.max instead if you are using torch._min/torch._max.
Deprecations
Deprecated old torch.save serialization format (#39460, #39893, #40288, #40793)
We have switched torch.save to use a zip file-based format by default rather than the old Pickle-based format. torch.load has retained the ability to load the old format, but use of the new format is recommended. The new format is:
- more friendly for inspection and building tooling for manipulating the save files
- fixes a long-standing issue wherein serialization (__getstate__, __setstate__) functions on Modules that depended on serialized Tensor values were getting the wrong data
- the same as the TorchScript serialization format, making serialization more consistent across PyTorch
Usage is as follows:
m = MyMod()
torch.save(m.state_dict(), 'mymod.pt')  # Saves a zipfile to mymod.pt
To use the old format, pass the flag _use_new_zipfile_serialization=False
m = MyMod()
torch.save(m.state_dict(), 'mymod.pt', _use_new_zipfile_serialization=False)  # Saves pickle
Fixed missing deprecation warning for Tensor.nonzero() (#40187)
Calling torch.nonzero(tensor) with one argument or Tensor.nonzero() with no arguments, that is, without specifying as_tuple, is deprecated and will be removed in a future version of PyTorch. Please specify the as_tuple argument.
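The two explicit forms can be sketched as follows, with a small tensor chosen for illustration. as_tuple=False returns one row of indices per nonzero element, while as_tuple=True returns one index tensor per dimension, which can be used directly for indexing.

```python
import torch

mask = torch.tensor([[0, 1], [2, 0]])

# One row per nonzero element: shape (num_nonzero, ndim).
index_matrix = torch.nonzero(mask, as_tuple=False)
# One 1-D index tensor per dimension.
index_tuple = torch.nonzero(mask, as_tuple=True)

print(index_matrix)       # tensor([[0, 1], [1, 0]])
print(mask[index_tuple])  # tensor([1, 2])
```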
New Features
Python API
New Utilities
- Added global hooks to torch.nn.Module (#38972)
- Added option to enable cpp stack traces with TORCH_SHOW_CPP_STACKTRACES=1 (#38127)
- Added torch.utils.show_pickle for showing pickle contents in saved models (#35168)
New Operators
- torch.logcumsumexp added (#36308)
- torch.logaddexp added (#38384)
- torch.rad2deg, torch.deg2rad added (#38852)
- torch.arccosh, torch.arcsinh, torch.arctanh added (#38388)
- torch.flip{lr, ud} added (#38599)
- torch.bucketize, torch.searchsorted added (#34577)
- torch.istft (Inverse Short Time Fourier Transform) added (#35569)
- torch.vander: added support for generating Vandermonde matrices (#36725)
- torch.block_diag added (#33449)
- nn.Hardswish, nn.functional.hardswish added (#34747)
- torch.nn.init.trunc_normal_ (truncated normal initializer) added (#32397)
- Added Stochastic Weight Averaging. See torch.optim.swa_utils.AveragedModel and torch.optim.swa_utils.SWALR for more details. (#35032)
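A few of the operators listed above, tried on small inputs. The inputs are arbitrary and chosen so that the results are easy to check by hand; this is only a sampler, not a complete tour.

```python
import torch

# log(exp(0) + exp(0)) = log(2)
log_sum = torch.logaddexp(torch.tensor([0.0]), torch.tensor([0.0]))
# 180 degrees in radians
radians = torch.deg2rad(torch.tensor([180.0]))
# Index of the bucket each value falls into, given sorted boundaries [2, 6]
buckets = torch.bucketize(torch.tensor([1, 5, 9]), torch.tensor([2, 6]))
# Reverse the column order of a 2-D tensor
flipped = torch.fliplr(torch.tensor([[1, 2], [3, 4]]))
# Place the inputs along the diagonal of a larger zero-filled matrix
blocks = torch.block_diag(torch.ones(1, 1), torch.ones(2, 2))

print(log_sum, radians, buckets, flipped, blocks, sep="\n")
```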
C++ API
- Added Optimizer AdamW to C++ frontend (#40009)
- Custom C++ autograd function now supports c10::optional as parameters (#37700)
- torch::Tensor now supports bitwise NOT(!), AND(&), OR(|), XOR(^) operators (#38691)
- Cpp extension now supports load and load_inline under ROCm (#35897)
[Beta] Complex Tensor support
The PyTorch 1.6 release brings beta-level support for complex tensors. The UX is similar to existing PyTorch tensors and the new complex-specific functionality is compatible with NumPy’s complex arrays. In particular, you’ll be able to create and manipulate complex tensors, interop with previously existing code that represented complex tensors as tensors of size (..., 2), and more.
While this is an early version of this feature, and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd to better support the scientific computing and ML communities.
Please find the full documentation here.
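As a brief sketch of the interop mentioned above, a complex tensor can be viewed as a real tensor of size (..., 2) and back. The sample values are arbitrary; the calls used are among those listed in the Python API changes for this release.

```python
import torch

z = torch.tensor([3 + 4j, 1 - 2j])          # complex64 by default

# View as a real tensor whose last dimension holds (real, imag) pairs.
as_real = torch.view_as_real(z)
# And back again to a complex view of the same data.
round_trip = torch.view_as_complex(as_real)

# abs() of a complex tensor returns a float tensor of magnitudes.
magnitudes = torch.abs(z)

print(z.is_complex(), as_real.shape, magnitudes.dtype)
```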
Python API:
- Added
torch.is_signed()for complex tensors. (#33773) - Added dtype inference for complex tensors. (#33713)
- Added
torch.randn and torch.normal_ for complex tensors. (#34037, #35056)
- Added complex type inference for torch.full. (#34709)
- Added type promotion logic for complex numbers. (#34093)
- Added is_complex tensor attribute for complex numbers. (#34093)
- Added torch.fill for complex tensors. (#34973)
- Added torch.rand for complex dtypes. (#34924, #35585)
- Fixed complex conversions, used in torch.copy_, on cuda. (#35344)
- Added torch.from_numpy for complex dtypes. (#35531)
- Added a check to throw error for in place modification of non-complex tensors with complex number values. (#35883)
- Fixed torch.exp CPU implementation for complex tensors. (#35715)
- Added torch.masked_fill for complex tensors. (#36335)
- Updated torch.abs to return float tensors for complex tensors. (#35871)
- Added torch.isfinite and torch.isinf for complex tensors. (#36648)
- Added torch.isclose for complex tensors. (#36456)
- Updated torch.angle to return float tensors for complex tensors. (#36896)
- Enabled requires_grad for complex tensors. (#36932)
- Fixed reciprocal divide for complex tensors. (#37193)
- Added torch.reciprocal for complex tensors on CUDA. (#36749)
- Added Python API for Complex Storage. (#35771)
- Added torch.addmv for complex tensors. (#37924, #40238)
- Updated dtype inference for torch.tensor. (#38030)
- Added torch.pow for complex tensors on CUDA. (#36793)
- Added support for complex values as exponents in torch.pow. (#36793, #39117)
- Added torch.roll for complex tensors on CUDA. (#38664)
- Added torch.gather for complex tensors on CPU. (#36430)
- Added torch.tanh for complex tensors on CUDA. (#38786)
- Added complex dtypes to list of supported types in autograd. (#38325)
- Added torch.cumsum, torch.cumprod for complex tensors on CUDA. (#39063)
- Added real and imag views as tensor attributes. (#39033)
- Added torch.flip and torch.rot90 for complex tensors. (#37826)
- Added torch.view_as_real, torch.view_as_complex for complex tensors. (#39099)
- Added printing logic for complex tensors (#40513, #38031)
- Added torch.tan for complex tensors on CUDA (#38400)
- Added support for complex tensors for torch.tanh backward function (#37791, #38786)
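As a rough illustration of a few of the complex-tensor additions listed above (torch.abs returning float tensors, the real and imag attributes, and torch.view_as_real / torch.view_as_complex), here is a minimal CPU-only sketch. The tensor values are arbitrary and chosen only for the example.

```python
import torch

# Build a small complex tensor (complex64 stores two float32 values per element).
z = torch.tensor([3 + 4j, 1 - 1j], dtype=torch.complex64)

# torch.abs on a complex tensor yields a real (float32) tensor of magnitudes.
magnitude = torch.abs(z)

# view_as_real exposes the data as a (..., 2) float tensor of [real, imag] pairs;
# view_as_complex converts such a tensor back to a complex one.
pairs = torch.view_as_real(z)
roundtrip = torch.view_as_complex(pairs)

print(z.is_complex(), magnitude.dtype, tuple(pairs.shape))
print(z.real, z.imag)
```

The round trip through view_as_real and view_as_complex should give back the original values, which makes it a convenient way to pass complex data through code that only understands real tensors.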
C++ API:
- Added core of c10::complex. (#36626)
- Added overloads of std:: math functions in c10::complex (#37468, #37689)
- Added c10::complex as the C++ type for complex tensors (#37421, #39306)
- Added support for operations on c10::complex and integer scalars (#38418)
- Added overloads for complex math functions in both :: and std:: to fix ROCm bugs (#39829)
- Added at::tensor() and torch::tensor() for complex numbers (#39793)
Distributed
- torch.distributed: Add all_to_all API to the MPI backend in the distributed module (#32361).
- torch.distributed: Add c10d dynamic loading mechanism to support 3rd-party c10d implementations (#28068).
- torch.nn.parallel.DistributedDataParallel: Add distributed data parallel benchmark tool (#35198).
- torch.nn.parallel.DistributedDataParallel and torch.distributed.rpc: allow DDP to work with RPC (#37998, #39916, #40130, #40139, #40495).
Mobile
- Add torch.utils.mobile_optimizer.optimize_for_mobile to encapsulate several model optimizations appropriate for mobile models. (Note: currently broken on Windows.) (#35227) (#36357)
New operator registration API
PyTorch 1.6 has a new, pybind11-based operator registration API which replaces the torch::RegisterOperators() class.
Before:
static auto registry =
    torch::RegisterOperators("my_ops::warp_perspective", &warp_perspective);
After:
TORCH_LIBRARY(my_ops, m) {
  m.def("warp_perspective", warp_perspective);
}
You can read more about this API in the custom C++ operators tutorial or the reference documentation.
The new API was developed in PRs #35061, #35629, #35706, #36222, #36223, #36258, #36742, #37019. Internal code was ported to this API in #36799, #36800, #36389, #37834, #38014; you may find the code examples in these PRs helpful for your ports.
ONNX
In PyTorch 1.6, we have added support for ONNX Opset 12. We have also enhanced export of torchvision models, such as FasterRCNN, MaskRCNN, and KeypointRCNN, to support dynamic input image size. Export support for several new ops has also been added. A new operator export mode, ONNX_FALLTHROUGH, has been added to the export API; it allows exporting the model with non-standard ONNX operators. For large (> 2 GB) model export (using the external_data_format=True argument), we now support models with large tensor data in attributes (not just model parameters).
New ONNX operator support:
- Update Dropout Export (#37641)
- Update Argmin/Argmax ONNX Export (#38329)
- Fix pow op export (#38065)
- Export Support for Celu (#38243)
- Add GreaterOrEqual and LessOrEqual to opset 12 ONNX export (#38311)
- ONNX Export Support for CrossEntropyLoss (#34830)
- Adding 'numel' and 'to' export for script module (#36501)
- Support clamp_min and clamp_max (#37872)
- Quantization: Add aten::max_pool2d to onnx jit pass (#34912)
- Quantization: Mark upsample_nearest2d, sigmoid and reshape as no scale in onnx (#36325)
- Quantization: export of quantized models with new conv and linear API in onnx (#38736)
Quantization
New quantization operators:
- quantized Conv1d (#35093, #36352, #38248, #38283, #38438, #38449, #38749)
- quantized hardsigmoid (#34959, #36351, #36698, #36699)
- quantized hardswish (#34820, #36350, #36252, #36320, #36545)
- quantized layernorm (#36593, #36690, #35693)
- quantized groupnorm (#36835, #39090)
- quantized instancenorm (#36847, #39091)
- quantized reflection_pad1d (#37452)
- quantized adaptive avgpool. (#36813)
- channel shuffle op fp32 + quantized. (#36815)
- qnnpack path for hardtanh (#35779)
- Quantized Threshold (#39352)
RPC
- torch.distributed.rpc: Add TensorPipe RPC backend (#36197, #35483, #37839, #37918, #37919, #37850, #37851, #37852, #37980, #38052, #38265, #38266, #40162, #40389, #37910, #38448, #38818, #38819, #38926, #38931, #38930, #38933, #38934, #39010, #39011, #39397)
- torch.distributed.rpc: Support per-RPC timeouts for rpc_sync and rpc_async (#34650)
- torch.distributed.rpc.functions.async_execution: Add an @async_execution decorator to allow pause and resume executions in RPC target functions (#39216, #39267, #39485, #39486, #39758).
- torch.futures.Future: Expose a Future type to Python API (#39008, #37311, #39119, #39597, #39964, #39950)
- torch.distributed.rpc: Allow profiler to be enabled remotely with RPC (#38748, #40066)
- torch.distributed.rpc: Implement TorchScript-compatible RemoteModule API (#37139, #40173)
- torch.distributed.rpc.RRef: enable retrying RRef control messages on communication failures (#33636)
- torch.distributed.rpc: Let RPC use torch._C.Future instead of exposing a dedicated future type. No impact on user side (#35039)
- torch.distributed.autograd: Add profiler support for backward of the distributed autograd engine (#35261)
- torch.distributed.rpc.RRef: Add TorchScript support for RRef.local_value() (#35433)
- torch.distributed.rpc.WorkerInfo: Add TorchScript support for WorkerInfo (#35447)
- torch.distributed.rpc: Allow profiling RPC with TorchScript target functions (#36275)
- torch.distributed.rpc.RRef: Add RRef Python Helper to launch function on the remotely referenced object (#36619)
- torch.distributed.rpc: Add timeout argument to TorchScriptable rpc_async (#37884)
- torch.distributed.rpc: Enable RPC Server Global Profiler (#38847)
- torch.distributed.rpc: Implement timeout support for rpc.remote and RRef.to_here() (#38590)
- torch.distributed.rpc: Enable RRef timeout for TensorPipe (#39531)
- torch.distributed.rpc.WorkerInfo: Add WorkerInfo python __repr__ magic method (#40004)
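The Future type mentioned in the list above can be exercised without setting up RPC at all. The following is a minimal single-process sketch, assuming only the torch.futures module; it creates a future by hand, chains a callback with then(), and collects the results with torch.futures.wait_all. In real RPC code the futures would typically come from rpc_async rather than being constructed directly.

```python
import torch

# Create an unresolved future and chain a callback onto it.
# The callback receives the completed future and returns a new value.
fut = torch.futures.Future()
doubled = fut.then(lambda f: f.wait() * 2)

# Resolving the first future triggers the chained callback.
fut.set_result(torch.ones(2))

# wait_all blocks until every future in the list has completed
# and returns their values in order.
results = torch.futures.wait_all([fut, doubled])
print(results)
```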
TorchScript
- Fork / Join Async Parallelism (#40438)
- ScriptModule Freezing (#40409, #37044, #38830, #34786, #34787)
Improvements
Python API
- Added long description to wheel packages (#39676)
- torch.add: Prevent unbounded growth while adding sparse tensors (#36030)
- torch.mv: enabled for sparse tensors (#21782)
- torch.bmm: enabled for sparse x dense tensor operations (#33430)
- torch.cat: improved error message (#38978)
- torch.masked_select: enabled bfloat16 support (#36859)
- torch.absolute: added as an alias for torch.abs (#36597)
- torch.device: improved error message to include xla as an acceptable device (#36446)
- torch.linspace, torch.logspace: improved precision (#35461)
- Tensor.true_divide method variant added (#34794)
- Tensor.isnan(), Tensor.isinf(), Tensor.isfinite() method variants added (#37942)
- Tensor.is_nonzero: improved error message (#38150)
- Tensor.cauchy_, Tensor.log_normal_, Tensor.exponential_: added support for bfloat16 (#38427)
- Tensor.as_subclass method added. (#34369)
- collect_env.py: improved to detect relevant conda-installed numpy and cudatoolkit (#35646)
- collect_env.py: made it more robust on Windows (#39136)
- torch.utils.data: Add generator= kwarg for DataLoader & random samplers (#39737)
- torch.utils.data.DataLoader: properly diagnose exceeding file descriptor limit (#34768)
- torch.utils.data.DataLoader: added repr for WorkerInfo (#39975)
- torch.utils.data.random_split: added option to pass a generator for determinism (#34043)
- torch.utils.data.IterableDataset: make the warning for when a DataLoader holds an IterableDataset clearer (#41185)
- torch.nn: Added support for non-persistent buffers that do not show up in a Module’s state dict (#37191)
- nn.Fold, nn.Unfold: added double backwards support (#36379)
- nn.MultiheadAttention: added support for bool/byte attn_mask tensor (#33763)
- nn.functional.upsample: enabled uint8 sampling support (#35029)
- nn.functional.kl_div: added option to accept target in log space (#34586)
- nn.functional.softmax: added support for sparse tensors (CPU) (#36305)
- nn.Softmin, nn.Softmax: improved repr (#39084)
- warnings: Changed warnings generated in cpp to show point of Python origination (#36052)
- warnings: Improve warnings to actually point at user code (#39143)
- Extend some of the basic ops to kHalf (#37121)
- Added a warning to a known autograd issue on XLA backend. (#35449, #35543)
- torch.cuda: Change DeprecationWarning to FutureWarning (#32142)
- Added torch.utils.cmake_prefix_path pointing to share/cmake folder (#38559)
- torch.hub: Added file_name argument to load_state_dict_from_url (#39749)
- Disable autograd while preparing Tensor for printing (#39420)
- Improved CUDA error message for MSVC (#39987)
- Improved reentrant autograd error message (#38625)
- Let >> and << support half on CUDA (#37670)
- dockerfile: Update miniconda installer download location & remove unnecessary flag (#37082)
- torch.cuda.get_arch_list() and torch.cuda.get_gencode_flags() added. These return the architecture list and gencode flags PyTorch was compiled with. (#41212)
- torch.min, torch.max: significantly improved CUDA performance (#38440, #39029)
- torch.multinomial with replacement=False: significantly improved performance (#39742)
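Two of the Python API items above lend themselves to a short demonstration: non-persistent buffers and the generator argument of torch.utils.data.random_split. The sketch below is illustrative only; the Normalizer module, the buffer names, and the split_indices helper are hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import random_split


class Normalizer(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        # Regular buffer: saved in state_dict.
        self.register_buffer("mean", torch.zeros(3))
        # Non-persistent buffer: still a buffer, but left out of state_dict.
        self.register_buffer("scratch", torch.zeros(3), persistent=False)

    def forward(self, x):
        return x - self.mean


model = Normalizer()
saved_keys = list(model.state_dict().keys())
buffer_names = [name for name, _ in model.named_buffers()]


def split_indices(seed):
    # Passing a seeded generator makes the random split reproducible.
    gen = torch.Generator().manual_seed(seed)
    train, val = random_split(range(10), [8, 2], generator=gen)
    return list(train.indices), list(val.indices)


first = split_indices(0)
second = split_indices(0)
print(saved_keys, buffer_names, first == second)
```

Non-persistent buffers are a reasonable fit for cached or derived values that can be recomputed on load, since they still follow the module across devices but never end up in checkpoints.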
Python Type Annotations
- torch.autograd: add type hints in-line (#38080)
- torch.finfo, torch.iinfo type annotations added (#38220)
- Moved torch.cuda annotations inline (#40075)
- Add typing for torch.cuda._CudaStreamBase and torch.cuda._CudaEventBase classes (#40256)
- Introduced torch.types.Device and stubbed all torch._C functions comprehensively (#38173)
- Move all torch.nn modules type annotations inline (#38211)
- Fixes type annotations for named tensors (#36890)
- Fix minor issue in type stub for Optimizer (#38067)
- Fixed some miscellaneous type hints (#36584)
- Fix multiple issues with type annotations (#36358)
- torch.autograd.anomaly_mode: fixed type hints stub (#39324)
- torch.backends.cudnn added type annotations (#38947)
- torch.channels_last, torch.preserve_format: added annotations (#39120)
AMD/ROCm
- torch.topk: enabled support for BFloat16 type on ROCm. (#34849)
- torch.dot: enabled fp16 support on ROCm (#30431, #30432)
- torch.add: enabled support for BFloat16 type on ROCm for sparse tensors (#35978)
- Enabled bfloat16 for operators in BERT model (#37634)
- torch.log: improved ROCm support (#40079)
- torch.pow, torch.exp, torch.erf: enabled support for BFloat16 type on ROCm (#40236)
C++ API
- Eliminate warnings for cpp extensions on Windows (#37400)
- Disable C4251 when compiling cpp_extensions on Windows (#35272)
Note: The two PRs above eliminate unnecessary compile warnings in Windows builds, making the build log more readable.
Distributed
- torch.distributed: Enhance error message for MPI unavailability. (#36781).
- torch.distributed: Expose torch.distributed.is_available() API (#37021).
- torch.utils.data: Only create torch.Generator and seed in DistributedSampler when shuffling (#37604).
- ProcessGroup: Log incorrect device in ProcessGroupGloo (#38844).
- torch.utils.data: Improve DistributedSampler docs and add seed option (#39628).
- torch.cuda.comm.reduce: Avoid initializing unnecessary tensors in nccl.reduce (#39688).
- torch.nn.parallel.DistributedDataParallel: Remove obsolete warning message from DDP (#40190).
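The seed option on DistributedSampler and the is_available() check can be tried without launching a process group, as long as num_replicas and rank are passed explicitly. The following is a minimal sketch under that assumption; the shard helper and the toy dataset are hypothetical.

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(8))  # toy dataset for illustration


def shard(rank, seed):
    # Passing num_replicas and rank explicitly avoids needing an
    # initialized process group for this demonstration.
    sampler = DistributedSampler(
        dataset, num_replicas=2, rank=rank, shuffle=True, seed=seed
    )
    sampler.set_epoch(0)
    return list(sampler)


rank0 = shard(0, seed=123)
rank1 = shard(1, seed=123)

# is_available() reports whether the distributed package was built in.
print(dist.is_available(), rank0, rank1)
```

Because both ranks use the same seed and epoch, they shuffle identically and then take disjoint slices, so together they should cover the dataset exactly once.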
Distributions
- distributions.Cauchy: Implemented kl divergence (#36477)
- distributions.Transform: Add a .with_cache() method (#36882)
- distributions.Binomial: Implemented BTRS algorithm for fast/efficient binomial sampling (#36858)
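As a quick check of the Cauchy KL divergence support, the sketch below compares torch.distributions.kl_divergence against the known closed form for two Cauchy distributions, log(((s1 + s2)^2 + (m1 - m2)^2) / (4 * s1 * s2)), where m and s are the location and scale. The parameter values are arbitrary, and the closed form is stated here as background rather than taken from these release notes.

```python
import math

import torch
from torch.distributions import Cauchy, kl_divergence

# Two Cauchy distributions with arbitrary example parameters.
p = Cauchy(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
q = Cauchy(loc=torch.tensor(1.0), scale=torch.tensor(2.0))

kl_pq = kl_divergence(p, q)
kl_pp = kl_divergence(p, p)

# Closed-form value for the same parameters, computed by hand.
expected = math.log(((1.0 + 2.0) ** 2 + (0.0 - 1.0) ** 2) / (4 * 1.0 * 2.0))

print(kl_pq.item(), expected, kl_pp.item())
```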
Internals
- New macro TORCH_FN for passing in compile time function pointers as regular function arguments rather than template arguments (#39823, #40110)
- Improved support for more types in registered custom kernels
- Added FPGA DispatchKey, DeviceType, Backend for out-of-tree experimentation (#38938)
- Better type safety for calling the dispatcher; we now do a runtime test when casting OperatorHandle to TypedOperatorHandle that you’ve provided the correct type for kernels (#40251)
- OperatorHandle::callBoxed now works on all operators, you no longer need to manually go through JIT registry (#36010, #36850)
- Added Dispatcher::redispatch for performing a dispatch that bypasses the current key and all keys before it (#35476, subsequently renamed)
- More operators are fully supported by the dispatcher (#37273, #36564, #36398, #36666, #36838)
- Tracing is no longer done inside our autograd code; instead it has been factored into a separate Tracing dispatch key (#39514, #38467)
- DispatchKey computation no longer relies on TensorOptions; instead, factory functions and other functions with special dispatch key computation needs can register a BackendSelect kernel to compute the required key. (#36290, #36562, #37257)
ONNX
- Enable Constant Folding for ONNX Opset 12 (#34823)
- ONNX Update training ops and training amenable export API (#35567)
- Fix for constant folding: Slice, Added ReduceL1 and ReduceL2 (#35280)
- Added support for constant folding onnx::Add and onnx::Sub (#35869)
- Enable constant folding for Shape (#35386)
- Improve error checking for large model export (#37798)
- Remove Aten ops from ONNX export (#37239)
- Update pytorch/onnx doc (#39480)
- Update pytorch/onnx docs for new export API args (#39802)
- Support large attribute and subgraph for large model (#38793)
Operator Benchmark
- Added benchmark for quantized batchnorm (#35389)
- Added more quantized activation benchmarks and input sizes (#35729)
- Added __torch_function__ benchmarks (#36138)
- Aligned qconv benchmark to conv (#36673)
- Aligned the qlinear benchmark to linear (#36674)
- Added CUDA support for the observer benchmark (#39360)
Profiler
- torch.autograd.profiler: Make RecordFunction callbacks thread local and modernize interface (#37491)
- torch.autograd.profiler: Make profiler thread local (#36291)
Quantization
- Add ConvBn3d, ConvBnReLU3d, BNReLU2d, BNReLU3d to eager mode quantization (#33540)
- Enabled per channel quantized static linear/conv in QNNPACK (#37622)
- Enable per-channel quantization for LSTM Modules (#39666, #39041)
- Dynamic quantization support for LSTMCell, RNNCell and GRUCell (#40102)
- Quantization aware training now works with nn.DataParallel and nn.DistributedDataParallel
- Add quantized tensor support on CUDA (#37081)
- Add reduce_range params for quantized_lstm (#39604)
- Use TorchBind for ConvPackedParams (#35923)
- Use TorchBind for Linear PackedParams (#38101)
RPC
- torch.distributed.rpc.RRef: Throw an actionable error message when users call RRef.to_here() in TorchScript (#35369)
- torch.distributed.rpc.RRef: Handle exceptions returned via remote() calls (#35331)
- torch.distributed.rpc.RRef: Make RRef type hint mismatch exception message more actionable to users (#35943)
- torch.distributed.rpc: Allow aborting RecvWork::wait() in ProcessGroupAgent::listenLoop (#36084)
- torch.distributed.autograd: Appropriately handle exceptions in autograd engine. (#36019)
- torch.distributed.autograd: Catch exception in distributed engine callbacks. (#36118)
- torch.distributed.autograd: Avoid some future callback self-captures. (#36502)
- torch.distributed.rpc: Propagate error from RPC retries to the original attempt (#35263)
- torch.distributed.autograd: Ensure future is complete when exiting Engine::mark_graph_task_completed() (#36856)
- torch.distributed.autograd: Trigger pre/post hooks of output function nodes under distributed autograd (#34501)
- torch.distributed.rpc: Support creating an RPC gang of world size 1 (#32731)
- torch.distributed.autograd: Improve Error Message for Dist Autograd Context Cleanup Failure (#37255)
- torch.distributed.rpc: Guard against negative rpcTimeout being passed in to RpcBackendOptions (#38267)
- torch.distributed.rpc: Use infinite timeout for operations in ProcessGroup RPC backend (#38577)
- torch.distributed.rpc.WorkerInfo: Add stringify WorkerInfo (#39974)
- torch.distributed.rpc: Avoid using default process group in ProcessGroupAgent. (#39909)
- torch.distributed.rpc: Ignore expected errors in TensorPipe RPC backend (#39182)
- torch.distributed.rpc: Don't use separate heap allocation for metrics in TensorPipe RPC backend (#39183)
- torch.distributed.rpc: Bind to hostname's IP address instead of localhost in TensorPipe RPC backend (#39184)
- torch.distributed.rpc: Use PrefixStore to avoid conflicting keys in TensorPipe RPC backend (#39185)
TorchScript
Improvements
- Add id function (#34975)
- Add lazy script decorator (#34935)
- Make Future type annotation available in Python (#27637)
- Support converting str to float (#35352)
- Enable recording of TorchScript functions (#34710)
- Improve the error message when registering a custom class twice (#35568)
- Improve optimization of if statements with statically determinable predicates (#35834)
- Fix reporting of error message in toBool (#35570)
- Better error when types of default value and parameter do not match (#35888)
- Improve serialization for lists and dictionary (#35741)
- Add type hints on hardsigmoid, hardswish, and elu to make them scriptable (#35885)
- Add strict tracer flag to guard against risky behaviors (#36277)
- Add support of Dict as output when connecting script and tracing (#36265)
- Use current default dtype with torch.tensor when dtype is not specified (#36587)
- Add dictionary as output of tracer (#36696)
- Allowing casting str to int (#36016)
- Convert float Tensor argument to double in Tensor.tolist (#37465)
- Add a code_with_constants method to module printing (#37586)
- Support indexing using list literal as index (#37848)
- Support indexing using list variable as index (#37966)
- Support del statements with variables as targets in TorchScript (#37608)
- Recursively compile TorchScript class types (#38050)
- Better error message when missing init on custom C++ classes (#37474)
- Fix @staticmethod access from self on modules (#37702)
- Allow @torch.jit.unused to be used on TorchScript classes (#38522, #39336)
- Add support for %= operator in TorchScript (#38983)
- Provide error messages when JIT infers the type of an argument as Tensor (#38527)
- Allow self-referential type annotations in TorchScript classes (#39821)
- Support having a different forward method when not in scripting mode (#38158)
- Fix index_put_ error in subscript assignment (#38378)
- Refactor attributes to support buffers and parameters as first class citizens, add support for iterating over named_buffers() (#37905)
- Add ROCm-specific half_support_literal (#38899)
- Make torch.unique_consecutive compilable (#39339)
- Make deepcopy() of Objects call g/setstate if present (#39500)
- Allow slicing sequential container (fe45c2c)
- Support torch.Tensor subclasses (like Parameter) as inputs to functions (#39487)
- Add dtype as supported type annotation (#39741)
- Improve error message when type annotation Future without a contained type (#39751)
- Fix inconsistent results of string split func (#38772)
- Support pad_sequence/pack_sequence (#39844)
- Enable copy.deepcopy and copy.copy for RecursiveScriptModule (#32685)
- Fix zip serialization for file > 2GiB (0c90b6d)
- Fix dictConstruct ordering and enable dict mix (41816dc)
- Fix delegating to jit.load from torch.load (#41013)
- Add distributed backward support (#38494)
Bug Fixes
Python API
- torch.cat: fixed missing type promotion (#35030, #39777)
- torch.gather: fixed silently incorrect results when in-place gather tries to use incorrect shapes (#37102)
- torch.median: fixed NaN comparison (#38216)
- torch.cdist: fixed backward calculation for p=2 (#37337)
- torch.eig: fixed segfault when input has NaNs and infs (#37642)
- torch.irfft: stopped modifying the input in-place (#35219)
- torch.max, torch.min, torch.median: fixed incorrect backwards implementation (#36316)
- torch.fmod: fixed crash on division by zero (#38919)
- torch.multinomial: fixed support for tensors with empty batch (#39873)
- torch.einsum: fixed incorrect __torch_function__ handling (#38741)
- torch.remainder: fixed overflow when dividend is very large (#37758)
- torch.remainder: fixed precision issues for CPU tensors (#38293)
- torch.argmax, torch.argmin: fixed bug for big CPU tensors with dim=2 (#39576)
- torch.histc: fixed support when passed empty tensor (#38987)
- torch.as_strided: added error message when passed a negative stride (#39508)
- torch.argmax, torch.argmin: fixed bogus returns when called on a scalar tensor (#37214)
- torch.topk: Fixed bogus results with 4d+ input tensors with topk dimension >= 1024/2048 on CUDA (depending on GPU) (#40349)
- torch.mv: Fixed bug when grad has stride=0 on GPU in the backward pass (#38321)
- >>, << on CUDA changed to match the behavior on CPU for certain compiler variants (#35339)
- Tensor.exponential_(0) fixed to return a Tensor filled with inf (#36837)
- Tensor.to(..., non_blocking=True): fixed regression where non_blocking is ignored (#35144)
- Tensor.to: fixed CUDA negative float to uint8 cast to be consistent with CPU (#36832)
- Fixed incorrect binary pointwise operations when the first argument is a scalar (#39956)
- Tensor.copy_: Fixed error when used with AMD devices (#38003)
- torch.tensor: fix segfault in error checking in Tensor constructor (#40106)
- Fix overflow issues when constructing tensors with large numbers (#39140)
- Fixed regression in unary ops casting to output dtype (#41097)
- nn.Module: fixed AttributeError reporting for nn.Module's properties (#34324)
- nn.MaxPool2d: fix for returning wrong shape with return_indices=True on CUDA (#38992)
- nn.MaxPool2d: fix NCHW backward bug (#38953)
- nn.MaxPool2d: fixed dilated case (#36288)
- nn.MultiheadAttention: Removed weights from __constants__ to fix warnings when converting to TorchScript.
- nn.ConvTranspose2d: fixed error in backward pass for fp16 inputs. (#37569)
- nn.ConvTranspose3d: fixed index overflow (#39198)
- nn.RReLU: fixed memory leak (#39347)
- nn.PReLU: fixed stack overflow in backward pass (#36134)
- nn.MultiheadAttention: fixed assertion to support FP16 training (#37539)
- nn.MultiheadAttention: Updated assert to remove check on 3rd dim for MHA (#39402)
- nn.ModuleDict, nn.ParameterDict: fixed bug in updating with another ModuleDict/ParameterDict, respectively (#27814)
- nn.BatchNorm: fixed buffer update when track_running_stats is set to False (#38084)
- nn.MaxPool3d: fixed incorrect CUDA backward results for non-square output (#36820)
- nn.DataParallel: fixed support for empty tensors (#35965)
- nn.functional.grid_sample: fixed out of boundary bug when grid contains large numbers (#35506)
- nn.functional.max_pool2d, nn.functional.avg_pool2d: fixed issue when stride=None (#39221)
- nn.functional.max_pool2d: fixed erroneous dimension out of range on CUDA (#36095)
- nn.grad._grad_input_padding: fixed support for dilation argument (#33872)
- nn.functional.log_softmax: improved accuracy on CUDA (#38945)
- nn.utils.prune, nn.utils.weight_norm: fixed problems when used with RNNs (#34170)
- Fixed nan, inf in GPU {fractional,adaptive} max_pool{2,3}d (#39903)
- nn.functional.interpolate: nearest interpolation implementation fix for CUDA (#39055)
- torch.utils.mkldnn.to_mkldnn: cover nn.Conv1d in mkldnn model conversion logic (#38528)
- torch.utils.data.DataLoader: Relax sampler check in BatchSampler (#38403)
- torch.utils.data.DataLoader: The exception raised when RandomSampler.replacement is non-boolean should be TypeError (#36547)
- torch.utils.data.DataLoader: Correct a ValueError in dataloader to TypeError (#36244)
- torch.utils.data.DataLoader: Allow shuffle when auto-batching is disabled (#39865)
- torch.utils.data.DataLoader: Kill DataLoader workers when we can't join to clean up gracefully (#39869)
- torch.utils.data.DataLoader: Added error when using default_collate on lists of unequal size (#38492)
- Fixed crashes on import torch related to defining static data in Vec256 (#37767, #38088)
- For out= operations, preserve output tensor's strides if it is correctly sized (#38895)
- cuda: fixed a bug where it was possible to incorrectly access the CUDA device before it was initialized (#36714)
- torch.device: Added better device idx parse checks (#37376)
- torch.autograd: fixed init-shutdown race condition in autograd engine (#39194)
- torch.autograd: Fixed error when using hooks with no __name__ attribute
- torch.autograd: Fixed error message (#39729)
- torch.autograd: wait for non-reentrant threads to shutdown (#34529)
- torch.autograd: Add undefined tensor gradient support to all backward functions (#39400)
- torch.autograd: fixed engine flakiness (#35599)
- torch.autograd.Function: fixed ability to report error messages inside (#34845)
- torch.autograd: move scalar input to a different device when needed; fixes backward passes of binary-pointwise operators with scalar inputs (#35286)
- torch.autograd.gradcheck: Fixed behavior for stride=0 (#38774)
- torch.autograd.Function: prevent custom Functions from creating non differentiable type that requires grad (#38326)
- torch.no_grad: Fixed bad interaction between torch.no_grad and tensor.numpy() conversion (#38906)
- torch.optim.AdamW: fixed error message (#36088)
- torch.optim.Optimizer.state_dict() fixed non-determinism (#37347)
- torch.hub: added optional request headers to avoid “connection refused” errors (#39740)
- torch.hub.hub_dir: fixed inconsistencies (#38969)
- OpenMP: fixed memory leak for num_threads==1 with operations that use OpenMP (#39533)
- torch.multiprocessing: Fixed deadlock when sharing CUDA tensors (#40347)
- torch.distributions.Binomial: fix bug where there is a small chance of incorrectly returning -1 (#38456)
- torch.cuda.amp.GradScaler: fixed bug where GradScaler was not pickle-able (#38296)
- Fixed uninitialized value in helper function vec_reduce_all (#37853)
- Fixed potential memory corruption in helper function cpu_serial_kernel (#37869)
- Synchronize MAGMA functions with the current CUDA stream (#36605)
- Windows support: Fix openmp detection with the clang-cl compiler (#35365)
- Windows support: Use ProgramFiles environment variable on Windows for portability (#39707)
- Windows support: Fix AVX detection with clang-cl (#35653)
- Windows support: Delay loading the cuda library until it is necessary (#37811)
- Windows support: Fix _copysign is not a member of std (#35199)
- Windows support: Fix zip serialization for files > 2GiB (#40783)
- Windows support: Add runtime check for MSVC redist, fixed import torch errors (#39841)
- Windows support: More fixes about using Windows API through ctypes (#39376)
- Windows support: fixed import torch errors (#39334)
- Windows support: Fix wrong MSVC version constraint for CUDA 9.2 (#40794)
- Windows support: Use LoadLibraryEX, fix problems when loading dlls (#38302)
- Windows support: Fix dll load failure in virtual environments (#39622)
- Windows support: Make find_first_set work on x86 MSVC (#38637, #38706)
- Removes pickle deprecation warning (#39003)
- dockerfile: Sync submodules (#35423)
- Fix crashes in manywheels builds related to having special CUDNN search path rules for torch_python (#37349)
- torch._six.PY37 should be true for Python-3.8 as well (#40868)
AMD/ROCm
- Stopped erroneously warning about CUDA compute capabilities (#35949)
- Stopped using MIOpen for tensors with more than INT_MAX number of elements (#37110)
- Enable HgemmBatched for ROCm (#37483)
- Fix encoding errors for hipify tool (#37906)
- Added HIP version guard for occupancy API compatibility (#38551)
- Fix the processing logic of bernoulli (#40001)
- Use correct device type when exporting tensors to DLPack (#40124)
C++ API
- Fixed the crash problem when using BuildExtension.with_options (#40121)
- Fixed the directory permission denied problem when multiple users build cpp_ext on the same machine (#34239)
Distributed
- torch.nn.SyncBatchNorm: Fix batch size check. (#37133).
- torch.nn.parallel.DistributedDataParallel: Fix DDP error checking for unused parameters (#36054).
- torch.nn.DataParallel: Ensure DataParallel replicas can be pickled (#37307).
- torch.distributed: Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40249).
- torch.nn.SyncBatchNorm: Avoid blocking host thread when using SyncBatchNorm (#36659).
- torch.cuda.comm.gather: Fix Gather::apply to avoid accessing moved tensors (#39733).
- torch.nn.parallel.DistributedDataParallel: Add a guard to allow DDP’s autograd engine callback to work with non-default CUDA streams (#40115).
Internals
- Add missing mutex for listener removal (#35486)
- Add missing mutex for fallback register/deregister (#36628)
- Improved boxed dispatch performance (#33313)
- Refactored jit::Operator to more clearly distinguish the two possible states: c10 vs jit (#33905, #36634)
- Per device initialization now occurs in backend kernels via code generation, rather than during backend selection (#37402)
- Improved support for dispatcher on mobile
- Improved error messages
ONNX
- Fixes default dtype value for onnx hardtanh export (opset11) (#35467)
- disable size optimizations for onnx (#36243)
- Adding a pass to replace interpolate function with aten::__interpolate (#35744)
- fix provider_version and add consistency test (#36797)
- Fix numerical errors in softmax when dim is not last dimension (#37326)
- make onnx expect tests resilient to producer_version changes (#39002)
- Enable models tests (#38791)
- Enable Constant Folding Tests (#38751)
- Bump up ONNX submodule to a82c6a7010e2e332d8f74ad5b0c726fd47c85376 (#39372)
- Fix type casting for reduce ops (#38829)
- Fix ONNX export of RNNs with no bias (#36894)
- Fix regression disabling checker (#39073)
- Fix KeypointRCNN test (#39589)
- Fix bug in export of ops involving torch.bool type (#40006)
- Fix bug in export of cumsum operator (#40044)
- Set onnx opset version before model select (#37466)
- Enable tests for opset 12 (#37846)
- Enable tests in test_pytorch_onnx_onnxruntime (#37868)
- Enable tests in test_operators.py (#39431)
Operator Benchmark
- Fixed missing comma in activation benchmarks (#35104)
- Fixed bug where activation benchmarks didn’t run anything (#35731)
- Replaced import cpp_benchmark with torch.utils.cpp_benchmark (#38832)
Profiler
- torch.autograd.profiler: Use high_resolution_clock for profiling on Mac (#37280)
- torch.autograd.profiler: Fixes for profiling JIT code (#38453)
- torch.autograd.profiler: Destroy CUDA events after profiling (#39962)
Quantization
- Fix a bug for convolution bias in QAT Conv-BN (#36173)
- Ensure that histogram observers have zero-point of zero for post ReLU activations (#37107)
- Unify numerics between fakequant and quant/dequant (#37188)
- Release qnnpack original weights for conv/linear (#37595)
- Fix histogram observer with 0 input (#40191)
- Histogram observer bug fix with min == max (#40310)
- Add save/load state_dict to quantized dynamic RNNs (#39105)
- Ensure qconv doesn't assert with empty batch (#38252)
- Support empty batch input for quantized ops (#38508)
- Fixed CUDA memory pinning (#41139)
RPC
- torch.distributed.autograd: Respect dist autograd context in torch.jit._fork. (#34360)
- torch.distributed.autograd: Continue trying send() even if one send() failed when cleanup distributed autograd contexts (#34943)
- torch.distributed.rpc: In ProcessGroup RPC backend, avoid read-after-free (#35252)
- torch.distributed.rpc: Fix aten::wait for RPC futures (#35695)
- torch.distributed.rpc: Fix prim::rpc_async for RPC futures (#35994)
- torch.distributed.rpc: Only Schedule Retries before Agent Shutdown (#35554)
- torch.distributed.rpc: Call threadPool.waitWorkComplete after listenerThread.join() to fix graceful shutdown (#35394)
- torch.distributed.rpc: Fixing Potential TSAN issue with joining RPC helper threads (#36094)
- torch.distributed.rpc: Fix race during RPC shutdown. (#36113)
- torch.distributed.rpc: Fixing RPC shutdown and thread joining (#36239)
- torch.distributed.autograd: Capture global state, distributed autograd current context id, before thread switching triggered by JIT future.wait() (#36395)
- torch.distributed.autograd: Fix race in mark_graph_task_completed. (#36640)
- torch.distributed.rpc: Acquire GIL when constructing/destructing ConcretePyObjectHolder (#37870)
- torch.distributed.rpc: Explicitly decref py::object in ConcretePyObjectHolder and PythonFunctionGuard (#38364)
- torch.distributed.rpc: Explicitly decref py::object in PythonRpcHandler (#38366)
- torch.distributed.rpc: Keep py::object alive until jit::toIValue returns (#38348)
- torch.distributed.rpc: Use GIL to guard decref of jit::toPyObj return value in processRpc (#38376)
- torch.distributed.rpc: Use Future's then() API to make sure profiling logic is completed when the Future completes (#38352)
- torch.distributed.rpc: Fix timeout computation in TensorPipe agent (#38928)
- torch.distributed.rpc: Fix lock inversion upon response read error handling (#38929)
- torch.distributed.rpc: Acquire lock when adding message to timeout map to fix race in TensorPipe RPC backend (#39398)
- torch.distributed.rpc: Explicitly decref in UnpickledPythonCall dtor (#38398)
- torch.distributed.rpc: Fix possible deadlock in _wait_all_workers (#39535)
- torch.distributed.rpc: Release GIL when deleting users and unforked owners (#39555)
- torch.distributed.rpc: Fix error handling for rpc.remote (#39605)
- torch.distributed.rpc: Fix RRef alias annotation (#39933)
- torch.distributed.rpc: Fix TensorPipeAgent shutdown to ensure it drains all outstanding work. (#40060)
- torch.futures: Let torch.futures.wait_all() re-throw errors (#40291)
- torch.distributed.autograd: Add basic GPU support to distributed autograd. (#40312)
TensorBoard
- summary.hparams: Support None in hparams_dict (#36497)
- SummaryWriter.add_scalars(): Removed incorrect documentation (#36495)
- SummaryWriter.add_embedding: Fix error where NaN appears in some cases (#36496)
- SummaryWriter.add_hparams: Fix input parameters (#31301)
- SummaryWriter.add_image_with_boxes: Added option to add strings to image boxes (#30941)
- SummaryWriter.add_graph: Fixed missing documentation (#37504)
- SummaryWriter.add_hparams: Let hparam render values correctly (#31544)
- Enforce tensorboard minimum version as 1.15 (#35952)
TorchScript
- Fix scope of writes in comprehensions (#36105)
- Fix name collision during module loading (#35720)
- Fix NamedTuple resolution (#35409)
- Fix copying of bound method from Module to ScriptModule (#36546)
- Fix lifting bug in tracing module calls (#37189)
- Fix tracing of return types for modules that return heterogeneous tuples (#37190)
- Add type-hint check for default arguments in TorchScript C++ frontend (#39021)
- Fix recursive compilation of function annotated with `@torch.jit._script_if_tracing` (#40468)
- Fix parsing of subscript expressions using python resolver (#39269)
- Fix compilation error with gcc 5.5 (#38112)
- Fix handling of aten::masked_select, properly update type of the aten::unsqueeze's output in shape analysis (#40716)
- Fix handling of aten::unfold, properly handle default dtype, and fix a gradient thrashing issue in shape analysis (#41044)
- Fix a bug with incorrect handling of aten::view in autodiff graph construction (#42029)
- Fix a bug with incorrect handling of constructor operations with tensor inputs tensor properties based on an input tensor rather than defaults in shape analysis (#41016)
- Fix bug with incorrect handling of prim::grad operation for Undefined values in shape analysis (#41015)
- Fix the incorrect requires_grad property propagation on loop’s block inputs (#41014)
Performance
Misc
- F.avg_pool2d: added specialized kernel for channels-last (#35855)
- Relax cudnn conditions for channels-last convolutions (#38904)
- torch.cat: Enabled fast path for channels-last inputs (#39448)
- torch.index_put: parallelized accumulate CPU float path with cpu_atomic_add_float (#29705)
- Make discontiguous tensors also benefit from unrolling (#34708)
- torch.scatter, torch.gather: removed some redundant checks to achieve some speedups (#34690)
- torch.scatter, torch.gather: improved performance on CUDA (#36181)
- torch.min(tensor, dim), torch.max(tensor, dim): Optimize performance on CPU (#34875)
- torch.index_select: Optimize performance for 1D inputs (#35243)
- Vectorize (CPU) generic types for binary bitwise operators (#34338)
- torch.linspace: vectorized on CPU. (#27957, #34555, #35842, #37981, #38093)
- Set device only when device index are different (#35438)
- Don't replace TensorImpl for inplace min/max dim (#35591, #39696)
- torch.clamp: vectorized for bfloat16 (#35082)
- bfloat16: vectorized many unary ops (#35092)
- torch.bincount: optimized for CPU by removing extra size() calls (#35822)
- Improve reduction op performance on CUDA for large tensors (#35997, #36014)
- Vectorize in-place comparison operators (#35117)
- Vectorize reduction when reducing on fastest striding dimension (#36873)
- nn.EmbeddingBag: add a fast path that calls FBGEMM (#36679)
- nn.Conv3d: Optimized grouped Conv3d performance (#36355)
- Reduce overheads on several CPU kernels by avoiding restrides. (#36875)
- nn.EmbeddingBag: uninitialize output and bag_size in the fast path to save overhead (#36681)
- nn.SmoothL1Loss: vectorize forward (CPU) (#37114, #37115)
- nn.Unfold: optimized backward pass (#36612, #38871)
- Add per-device allocator object in CUDACachingAllocator, reducing lock contention between operations on different devices. (#37567)
- Lazily initialize thread local num_threads value (#37461)
- Vectorize non-persistent Softmax (#38557)
- nn.GroupNorm: performance optimized on CPU and CUDA (#28203, #28204)
- torch.cumsum, torch.cumprod: Restore thrust path for 1d tensors cumulative ops (#39180)
- TensorIterator: Remove unnecessary !op.is_read_write test (#39747)
- torch.multinomial: fast-path for replacement=False (#39742)
- Vectorize on output for reduction kernels (#37206)
- nn.Upsample: optimized performance for linear modes on CPU (#34864)
- Make dynamic casting case also benefit from unrolling (#34749)
- torch.sinh, torch.cosh: vectorized on CPU (#36396)
- Speed up sparse tensor gradient accumulation (#36292)
- torch.masked_select: sped up (#36539, #33269)
- torch.var, torch.std: sped up (#39967)
- torch.max(tensor, dim), torch.min(tensor, dim): sped up (#39029)
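Several of the entries above concern channels-last inputs and torch.multinomial without replacement. The following is a minimal sketch of calls that would reach those code paths, assuming a CPU build of PyTorch; the tensor shapes and probabilities are made up for illustration, and the sketch makes no claim about which internal kernel is actually selected.

```python
import torch
import torch.nn.functional as F

# A 4D tensor laid out in channels-last (NHWC) memory format.
x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)

# Ops that the notes list as having channels-last fast paths.
pooled = F.avg_pool2d(x, kernel_size=2)  # shape (2, 3, 4, 4)
joined = torch.cat([x, x], dim=1)        # shape (2, 6, 8, 8)

# Sampling without replacement: every drawn index is distinct.
probs = torch.tensor([0.1, 0.2, 0.3, 0.4])
picks = torch.multinomial(probs, num_samples=3, replacement=False)

print(pooled.shape, joined.shape, picks.tolist())
```

The calls are the same as for contiguous (NCHW) tensors; only the memory layout of the input changes.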
Distributed
- torch.nn.SyncBatchNorm: Speed up SyncBatchNorm by batching distributed communication (#38246).
- torch.nn.parallel.DistributedDataParallel: Dynamically adjust DDP bucketing order using the signals collected from the first iteration (#35137).
Mobile
- Use XNNPACK to improve performance for some instances of convolution and linear. (#35790) (#35791)
- Use a custom allocator on mobile to automatically include padding for {Q,X}NNPACK, reducing reallocation costs. (#36032)
- Use updated open-source pthreadpool to improve multi-threading performance. (#40951)
Quantization
- qmul and qadd should preserve input memory format (#34834)
- remove the slow path(NCHW) for avg_pool3d (#34994)
- Optimized qadd_scalar (#34925)
- Optimize qavg_pool3d_nhwc (#35740)
- Changes to qadd for perf improvement. (602b51e)
- improve the quantized batch_norm performance (#35639)
- Add vector path to copy kernel for quantized data types (#36189)
- Speed up calculate Qparams for per-channel observers (#30485)
- Enable float requantization for avgpool/gavgpool ops. (#37037)
- Move to using MemoryFormat::ChannelsLast for avgpool2d. (#36812)
- Use gpu_kernel in Affine Quantizer (#37312)
- Perf optimization for conv and gemm kernels. (#37626)
RPC
- torch.distributed.rpc: In RPC Server, handle TorchScript continuations asynchronously (#34109)
- torch.distributed.autograd: Avoid holding lock when completing GraphTask futureResult (#35101)
- torch.distributed.autograd: Lock optimizations for DistAutogradContainer (#36529)
- torch.distributed.rpc.RRef: Prevent RRef.to_here() to block an RPC thread on the callee using Future callbacks (#36805)
- torch.distributed.rpc.RRef: Prevent RRef unpickle to block waiting for OwnerRRef creation (#36785)
- torch.distributed.autograd: Remove spinning for dist engine (#36606)
- torch.distributed.rpc: Avoid Releasing, Reacquiring lock per iteration in RPC Retry Thread (#38521)
TorchScript
- Add vectorized load/store support for JIT generated CUDA kernel (#36555)
- Speed up alias analysis (#36345)
- Make new zip serialization for torch save/load significantly (~70%) faster (#38379)
- Run extra optimizations after inlining (#35562)
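For the zip serialization entry above, here is a small round-trip sketch through an in-memory buffer. It assumes the underscore-prefixed _use_new_zipfile_serialization keyword of torch.save, which is a private flag and may change; the state dictionary is invented for illustration.

```python
import io
import zipfile

import torch

# Illustrative payload; any picklable object containing tensors would do.
state = {"weight": torch.arange(6, dtype=torch.float32).reshape(2, 3)}

buffer = io.BytesIO()
# Explicitly request the zip-based format (private flag, an assumption here).
torch.save(state, buffer, _use_new_zipfile_serialization=True)

buffer.seek(0)
is_zip = zipfile.is_zipfile(buffer)  # the saved payload is a zip archive

buffer.seek(0)
restored = torch.load(buffer)

print(is_zip, torch.equal(restored["weight"], state["weight"]))
```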
Documentation
- Split up documentation into subpages, greatly improving performance and search-ability (#37419)
- Rename torch._C.Generator to torch.Generator (#38773)
- FAQ: Add note about recovering from OOM (#35214)
- torch.histc: Add a note on elements outside of given bounds (#34889)
- functional.hardswish, functional.hardsigmoid: improve docs (#35431)
- Tensor.is_complex doc fix (#35680)
- nn.KLDivLoss doc fix (#36137)
- torch.min, torch.max, torch.median: added note on deterministic/non-deterministic gradient (#36481)
- Amp gradient accumulation example (#36601)
- functional.softmax doc fix (#36600)
- Update contribution_guide.rst (#36438)
- Documentation LU Decomposition: deriving L, U, and P ([#36907](https://github.com/pytorch/pytor...