Fix inference api & add more description on inference engine tutorial (deepspeedai#1711)

RezaYazdaniAminabadi · web-flow · commit 94de0229fb0d · 2022-01-19T15:27:51.000-08:00
diff --git a/deepspeed/inference/engine.py b/deepspeed/inference/engine.py
@@ -53,6 +53,8 @@ def __init__(self,
             replace_method: the injection method, this can be passed as auto if no injection-policy is defined, in which case the injection is automatic based on the available policies
             quantization_setting:
                 one of None, Tuple(mlp_extra_grouping, quantize_groups), quantize_groups
+            replace_with_kernel_inject: this flag needs to be set to True to inject inference kernels for models such as Bert, GPT2, GPT-Neo and GPT-J. Otherwise,
+            the injection_dict provides the names of two linear layers as a tuple: (attention output projection, transformer output projection)
         """
 
         super().__init__()
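
Note on the two linear layers named in the docstring above: the tutorial text in this change says they are where the all-reduce across model-parallel ranks is added. The pure-Python sketch below illustrates why an output projection needs that step. It assumes the projection's input dimension is sharded across ranks, so each rank only produces a partial sum. It is not DeepSpeed's implementation, and every function name in it is hypothetical.

```python
# Pure-Python sketch (no DeepSpeed, no GPUs) of a linear layer whose input
# dimension is sharded across ranks. All names here are hypothetical.

def matvec(weight, x):
    """Compute weight @ x for a list-of-rows matrix and a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def shard_inputs(weight, x, num_ranks):
    """Give each rank a column slice of weight and the matching slice of x."""
    assert len(x) % num_ranks == 0
    step = len(x) // num_ranks
    shards = []
    for rank in range(num_ranks):
        lo, hi = rank * step, (rank + 1) * step
        shards.append(([row[lo:hi] for row in weight], x[lo:hi]))
    return shards

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of every rank's output."""
    return [sum(values) for values in zip(*partials)]

weight = [[1, 2, 3, 4], [5, 6, 7, 8]]  # 2 outputs, 4 inputs
x = [1, 0, 2, 1]

# Each "rank" computes a partial result from its own shard only.
partials = [matvec(w, xs) for w, xs in shard_inputs(weight, x, num_ranks=2)]
# Summing the partial results recovers the unsharded output.
merged = all_reduce_sum(partials)
assert merged == matvec(weight, x)
```

No single rank holds the full output until the partial results are summed, which is the role the all-reduce plays after these layers.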
diff --git a/deepspeed/module_inject/replace_policy.py b/deepspeed/module_inject/replace_policy.py
@@ -1,6 +1,7 @@
 from abc import ABC
 
 import torch
+from torch.nn.parameter import Parameter
 
 
 class DSPolicy(ABC):
@@ -66,8 +67,8 @@ def attention(self):
         vw = self.client_module.attention.self.value.weight
         vb = self.client_module.attention.self.value.bias
 
-        qkvw = torch.cat((qw, kw, vw), dim=0)
-        qkvb = torch.cat((qb, kb, vb), dim=0)
+        qkvw = Parameter(torch.cat((qw, kw, vw), dim=0))
+        qkvb = Parameter(torch.cat((qb, kb, vb), dim=0))
 
         return self.linear_layer, \
                qkvw, \
@@ -120,7 +121,7 @@ def attention(self):
         kw = self.client_module.attn.attention.k_proj.weight
         vw = self.client_module.attn.attention.v_proj.weight
 
-        qkvw = torch.cat((qw, kw, vw), dim=0)
+        qkvw = Parameter(torch.cat((qw, kw, vw), dim=0))
 
         return self.linear_layer, \
                 qkvw, \
@@ -164,7 +165,7 @@ def attention(self):
         kw = self.client_module.attn.k_proj.weight
         vw = self.client_module.attn.v_proj.weight
 
-        qkvw = torch.cat((qw, kw, vw), dim=0)
+        qkvw = Parameter(torch.cat((qw, kw, vw), dim=0))
 
         return self.linear_layer, \
                 qkvw, \
diff --git a/docs/_tutorials/inference-tutorial.md b/docs/_tutorials/inference-tutorial.md
@@ -8,7 +8,9 @@ DeepSpeed provides a seamless inference mode for compatible transformer based mo
 
 ## Initializing for Inference
 
-For inference with DeepSpeed, use `init_inference` API to load the model for inference. Here, you can specify the MP degree, and if the model has not been loaded with the appropriate checkpoint, you can also provide the checkpoint description using a `json` file. To inject the high-performance kernels, you can pass int the `replace_method` as `'auto'` for the compatible models, or define a new policy in [replace_policy class](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_policy.py) and pass in the `injection_policy` that specifies the different parameters of a Transformer layer, such as attention and feed-forward parts. The `injection_policy` shows the mapping between the parameters of the original layer implementation with the inference-customized Transformer layer.
+For inference with DeepSpeed, use the `init_inference` API to load the model. Here, you can specify the model-parallel (MP) degree, and if the model has not already been loaded with the appropriate checkpoint, you can also provide the checkpoint description using a `json` file or the checkpoint path.
+
+To inject the high-performance kernels, set `replace_with_kernel_inject` to `True` and pass `replace_method` as `'auto'` for the compatible models. Alternatively, define a new policy in the [replace_policy class](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_policy.py) and pass in the `injection_policy` that specifies the different parameters of a Transformer layer, such as the attention and feed-forward parts. The `injection_policy` maps the parameters of the original layer implementation to those of the inference-customized Transformer layer.
 
 ```python
 # create the model
@@ -25,11 +27,33 @@ ds_engine = deepspeed.init_inference(model,
                                  mp_size=2,
                                  dtype=torch.half,
                                  checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
-                                 replace_method='auto')
+                                 replace_method='auto',
+                                 replace_with_kernel_inject=True)
 model = ds_engine.module
 output = model('Input String')
 ```
 
+To run inference with only model parallelism for models that do not have supported kernels, you can pass an injection policy that names the specific linear layers in a Transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM. DeepSpeed needs these parts of the layer to add the all-reduce communication between GPUs that merges the partial results across model-parallel ranks. The example below shows how to use DeepSpeed-Inference with a T5 model, whose policy lists the output projections of both the self-attention and the encoder-decoder attention:
+
+```python
+import os
+import torch
+import deepspeed
+from transformers import pipeline
+from transformers.models.t5.modeling_t5 import T5Block
+local_rank = int(os.getenv('LOCAL_RANK', '0'))  # assumed to be set by the launcher
+world_size = int(os.getenv('WORLD_SIZE', '1'))  # assumed to be set by the launcher
+# create the model, then initialize the DeepSpeed-Inference engine
+pipe = pipeline("text2text-generation", model="google/t5-v1_1-small", device=local_rank)
+pipe.model = deepspeed.init_inference(
+    pipe.model,
+    mp_size=world_size,
+    dtype=torch.float,
+    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
+)
+output = pipe('Input String')
+```
+
 ## Loading Checkpoints
 
 For the models trained using HuggingFace, the model checkpoint can be pre-loaded using the `from_pretrained` API as shown above. For Megatron-LM models trained with model parallelism, we require a list of all the model parallel checkpoints passed in JSON config. Below we show how to load a Megatron-LM checkpoint trained using MP=2.
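
A minimal sketch of what such a JSON description could look like for MP=2, written from Python. The key names (`type`, `version`, `checkpoints`) and the per-rank file names are assumptions made for illustration and are not taken from this change, so check them against the DeepSpeed documentation before relying on them.

```python
import json
import os
import tempfile

# Hypothetical checkpoint description for a Megatron-LM model trained with MP=2.
# Key names and file paths are assumptions made for this sketch.
checkpoint_description = {
    "type": "Megatron",
    "version": 0.0,
    "checkpoints": [
        "mp_rank_00/model_optim_rng.pt",
        "mp_rank_01/model_optim_rng.pt",
    ],
}

# Write the description to a JSON file in a temporary directory.
checkpoint_json = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(checkpoint_json, "w") as f:
    json.dump(checkpoint_description, f, indent=4)

# Read it back to confirm there is one entry per model-parallel rank.
with open(checkpoint_json) as f:
    loaded = json.load(f)
```

The resulting path plays the role of `args.checkpoint_json` in the first `init_inference` example, where it is passed as the `checkpoint` argument.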