
[RFC] Autoload Device Extension #122468

@jgong5

Description


🚀 The feature, motivation and pitch

co-authors:
Intel: @jgong5 @bsochack @sujoysaraswati @gujinghui @xiaodonglin1
Huawei: @Yikun @hipudding

PyTorch has been continuously expanding its support for various hardware backends, extending its reach to out-of-the-tree devices, which are usually implemented as PyTorch extensions. However, a notable challenge arises with the current approach: users must explicitly load these out-of-the-tree device extensions to gain the new device support. This manual loading not only complicates the user experience but also introduces unnecessary cognitive overhead.

The core issue lies in the discrepancy between the handling of in-tree and out-of-the-tree devices. While in-tree devices integrate seamlessly into the PyTorch ecosystem, out-of-the-tree devices require additional loading steps on top of the standard PyTorch device programming model. Our goal with this proposal is to bridge this gap, ensuring that out-of-the-tree devices are as user-friendly and seamlessly integrated as their in-tree counterparts. (Note that we focus only on Python-based programs in this RFC. C++ applications can be linked with the native device extensions, and automatic loading is taken care of by the loader.)
Specifically, our objectives are twofold:

  1. Enable users to adhere to the familiar PyTorch device programming model without the need for explicit loading or importing of device-specific extensions.
  2. Facilitate effortless adoption of existing PyTorch applications with zero-code changes on out-of-the-tree devices.

Examples

In this section, we use Intel Gaudi and Huawei Ascend, two representative out-of-the-tree devices, to contrast the current usage with the expected one.

Intel Gaudi

Let’s take a look at the following PyTorch program that runs on Intel Gaudi via the PyTorch HPU device key. Users need to import “habana_frameworks.torch” to load the extension support for HPU.

import torch
import torchvision.models as models
import habana_frameworks.torch # <-- extra import
model = models.resnet50().eval().to("hpu")
input = torch.rand(128, 3, 224, 224).to("hpu")
output = model(input)

With the proposal, the additional import of “habana_frameworks.torch” is not necessary and the code follows the standard PyTorch device programming model, i.e., moving the input and model parameters to the device followed by the forward call etc.

import torch
import torchvision.models as models
model = models.resnet50().eval().to("hpu")
input = torch.rand(128, 3, 224, 224).to("hpu")
output = model(input)

Programming in the standard way brings the benefit that users do not even have to change the code that involves moving tensors to the target device in applications with good device abstraction, e.g., where the device name is passed as an argument to the program. This proposal allows users to adopt new devices without changing code.
Even in cases where applications have hard-coded calls to CUDA runtime functions, there are solutions for automatically migrating the code to the new devices. For example, the Habana PyTorch bridge supports "automatic GPU migration" so that PyTorch programs written for the NVIDIA CUDA device do not need to be changed to work with the Intel Gaudi HPU device. This is achieved by an additional import of the "gpu_migration" module.

import torch
import torchvision.models as models
import habana_frameworks.torch # <-- extra import
import habana_frameworks.torch.gpu_migration # <-- extra import
model = models.resnet50().eval().to("cuda")
input = torch.rand(128, 3, 224, 224).to("cuda")
output = model(input)

With the proposal, the device specific imports are not necessary and users can run the existing code without any code changes.

import torch
import torchvision.models as models
model = models.resnet50().eval().to("cuda")
input = torch.rand(128, 3, 224, 224).to("cuda")
output = model(input)

Huawei Ascend

Similar to Intel Gaudi, users need to import “torch_npu” to load the device extension support for Huawei Ascend. Internally, Huawei Ascend leverages the “PrivateUse1” device key and exposes the device name as “npu” to the end users. We use Ascend as an example for out-of-the-tree devices that don’t have in-tree device keys but leverage “PrivateUse1” device key for integration.

import torch
import torchvision.models as models
import torch_npu # <-- extra import (install npu to PrivateUse1 etc.)
model = models.resnet50().eval().to("npu")
input = torch.rand(128, 3, 224, 224).to("npu")
output = model(input)

With the proposal, the expected usage is similar – users don’t have to import “torch_npu” explicitly.

import torch
import torchvision.models as models
model = models.resnet50().eval().to("npu")
input = torch.rand(128, 3, 224, 224).to("npu")
output = model(input)

The comments on the zero-code change enabling also apply to Huawei Ascend.

Proposed Solution

We propose three options below. For the sake of simplicity, Option 3 is the most preferable.

Option 1: Module preloading

Inspired by the LD_PRELOAD mechanism in Unix-like operating systems, we propose to introduce a similar TORCH_PRELOAD environment variable to specify a list of Python modules to be loaded during the initialization of PyTorch core. The preloaded modules are specified as follows:
TORCH_PRELOAD=module1:module2:module3...
Taking Intel Gaudi as an example, one would specify:
TORCH_PRELOAD=habana_frameworks.torch
or
TORCH_PRELOAD=habana_frameworks.torch:habana_frameworks.torch.gpu_migration
For Huawei Ascend, one would do:
TORCH_PRELOAD=torch_npu
For situations where a configuration file is preferred over environment variables, we can have a JSON configuration file, torch_preload.json, like below:
["module1", "module2", "module3"]
The module preloading is triggered in torch/__init__.py. The configuration file can be placed in a pre-defined location (e.g., site-packages/torch_config/torch_preload.json), installed by the device extensions or overridden by users, for torch/__init__.py to discover.
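A minimal sketch of such a preloading hook is shown below. The function name is hypothetical; the real logic would live in torch/__init__.py, and swallowing ImportError is one possible policy choice so that a missing extension does not break `import torch`.

```python
import importlib
import os

def load_preload_modules(env_var="TORCH_PRELOAD"):
    """Import each module named in a colon-separated environment variable.

    Hypothetical helper sketching how torch/__init__.py could honor
    TORCH_PRELOAD; returns the list of module names that were requested.
    """
    names = [m for m in os.environ.get(env_var, "").split(":") if m]
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # A missing or broken extension should not break `import torch`.
            pass
    return names
```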

  • Pros: Intuitive for users experienced with Unix-like systems to understand and use. Allows loading multiple modules/extensions at once. Provides flexibility to users to load any desired module.
  • Cons: Users can use this mechanism for purposes other than loading device extensions, which might introduce conflicts.

Option 2: Device extension specific configuration

This option proposes a specialized configuration for loading the device extension via environment variable TORCH_DEVICE_EXT and configuration file torch_device_extension.json.
TORCH_DEVICE_EXT=device_key:device_name:module1:module2:module3...
torch_device_extension.json:

{
  "device_key": "device_name",
  "modules": ["module1", "module2", "module3"]
}

The module preloading is triggered in torch/__init__.py. The configuration file can be placed in a pre-defined location (e.g., site-packages/torch_config/torch_device_extension.json), installed by the device extensions or overridden by users, for torch/__init__.py to discover.
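For illustration, parsing the TORCH_DEVICE_EXT string into a dict analogous to the torch_device_extension.json layout might look like the following (the helper name is hypothetical):

```python
def parse_device_ext(spec):
    """Split a TORCH_DEVICE_EXT-style string into its parts.

    The format proposed above is device_key:device_name:module1:module2...;
    this hypothetical helper returns a dict in the same shape as the
    torch_device_extension.json file.
    """
    device_key, device_name, *modules = spec.split(":")
    return {
        "device_key": device_key,
        "device_name": device_name,
        "modules": modules,
    }
```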

  • Pros: A dedicated configuration for loading device extensions, without the potential conflicts of Option 1. Leaves flexibility for PyTorch core to take device-extension-specific actions when loading the modules.
  • Cons: More complex configuration compared to Option 1.

Option 3: Load from a pre-defined module as plugin

This option proposes using a pre-defined module, e.g., torch_plugins_device_extension, for loading out-of-the-tree device extensions. It leverages the Python plugin mechanism: https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata.
An entrypoint is added to the PyTorch package like below:

[project.entry-points.'torch.plugins']
device_extension = 'torch_plugins_device_extension'

The module is loaded from torch/__init__.py as illustrated below:

from importlib.metadata import entry_points

# Note: entry_points(group=...) requires Python >= 3.10.
discovered_plugins = entry_points(group='torch.plugins')
for plugin in discovered_plugins:
    try:
        plugin.load()
    except ImportError:
        # A failing extension should not break `import torch`.
        pass

This module is installed by the device vendor's Python package, e.g., habana_frameworks for Intel Gaudi or torch_npu for Huawei Ascend, into site-packages/torch_plugins_device_extension, and torch_plugins_device_extension/__init__.py imports the corresponding vendor-specific device extension modules.
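The vendor-installed __init__.py can stay tiny. One possible defensive pattern it could use around its import is sketched below (the helper name is hypothetical; the module names are only the examples from this RFC):

```python
import importlib

def safe_import(name):
    """Return the imported module, or None if it is unavailable.

    Sketch of a guard that a vendor's torch_plugins_device_extension
    __init__.py could wrap around its import (e.g. of
    habana_frameworks.torch or torch_npu), so that a broken or
    partially installed wheel does not break `import torch`.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None
```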

  • Pros: Flexible import and configuration via Python. Simpler configuration and interface to PyTorch.
  • Cons: Less flexibility for users to customize the loading process.


Labels: module: PrivateUse1, triaged
