{"id":17795,"date":"2021-07-15T09:00:47","date_gmt":"2021-07-15T16:00:47","guid":{"rendered":"https:\/\/engineering.fb.com\/?p=17795"},"modified":"2021-07-22T11:02:59","modified_gmt":"2021-07-22T18:02:59","slug":"fsdp","status":"publish","type":"post","link":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","title":{"rendered":"Fully Sharded Data Parallel: faster AI training with fewer GPUs"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Training AI models at a large scale isn\u2019t easy. Aside from the need for large amounts of computing power and resources, there is also considerable engineering complexity behind training very large models. At Facebook AI Research (FAIR) Engineering, we have been working on building tools and infrastructure to make training large AI models easier. Our recent work in areas such as <\/span><a href=\"https:\/\/github.com\/pytorch\/fairseq\/blob\/master\/examples\/megatron_11b\/README.md\"><span style=\"font-weight: 400;\">intra-layer model parallelism<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/pipeline_parallelism.html\"><span style=\"font-weight: 400;\">pipeline model parallelism<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale#optimizer-state-sharding-zero\"><span style=\"font-weight: 400;\">optimizer state+gradient sharding<\/span><\/a><span style=\"font-weight: 400;\">, and <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/moe\/moe_layer.py\"><span style=\"font-weight: 400;\">mixture of experts<\/span><\/a><span style=\"font-weight: 400;\"> is just part of our work to make training advanced AI models for any number of tasks more efficient.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Fully Sharded Data Parallel (FSDP) is the newest tool we\u2019re introducing. It <a href=\"https:\/\/engineering.fb.com\/2020\/08\/24\/production-engineering\/scaling-services-with-shard-manager\/\">shards<\/a> an AI model\u2019s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. As its name suggests, FSDP is a type of data-parallel training algorithm. Although the parameters are sharded to different <a href=\"https:\/\/engineering.fb.com\/2018\/03\/20\/ml-applications\/the-next-step-in-facebook-s-ai-hardware-infrastructure\/\">GPUs<\/a>, the computation for each microbatch of data is still local to each GPU worker. This conceptual simplicity makes FSDP easier to understand and more applicable to a wide range of usage scenarios (compared with intra-layer parallelism and pipeline parallelism). Compared with optimizer state+gradient sharding data parallel methods, FSDP shards parameters more uniformly and is capable of better performance via communication and computation overlapping during training.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With FSDP, it is now possible to more efficiently train models that are orders of magnitude larger using fewer GPUs. FSDP has been implemented in the <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale\"><span style=\"font-weight: 400;\">FairScale library<\/span><\/a><span style=\"font-weight: 400;\"> and allows engineers and developers to scale and optimize the training of their models with simple APIs. At Facebook, FSDP has already been integrated and tested for training some of our <\/span><a href=\"https:\/\/github.com\/pytorch\/fairseq\"><span style=\"font-weight: 400;\">NLP<\/span><\/a><span style=\"font-weight: 400;\"> and<\/span><a href=\"https:\/\/github.com\/facebookresearch\/vissl\"><span style=\"font-weight: 400;\"> Vision<\/span><\/a><span style=\"font-weight: 400;\"> models.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">The high computational cost of large-scale training<\/span><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2001.08361.pdf\"><span style=\"font-weight: 400;\">NLP research<\/span><\/a><span style=\"font-weight: 400;\"> is one particular area where we can see the importance of efficiently leveraging compute for training AI. Last year, OpenAI announced that they had trained <\/span><a href=\"https:\/\/neurips.cc\/virtual\/2020\/public\/poster_1457c0d6bfcb4967418bfb8ac142f64a.html\"><span style=\"font-weight: 400;\">GPT-3<\/span><\/a><span style=\"font-weight: 400;\">, the largest-ever neural language model, with 175 billion parameters. It is <\/span><a href=\"https:\/\/lambdalabs.com\/blog\/demystifying-gpt-3\/\"><span style=\"font-weight: 400;\">estimated<\/span><\/a><span style=\"font-weight: 400;\"> to have taken roughly 355 GPU years to train GPT-3, or the equivalent of 1,000 GPUs working continuously for more than four months.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Besides requiring a lot of compute and engineering resources, most approaches to scaling like this introduce additional communication costs and require engineers to carefully evaluate trade-offs between memory use and computational efficiency. For example, typical data parallel training requires maintaining redundant copies of the model on each GPU, and model parallel training introduces additional communication costs to move activations between workers (GPUs).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">FSDP is relatively free of trade-offs in comparison. It improves memory efficiency by sharding model parameters, gradients, and optimizer states across GPUs, and improves computational efficiency by decomposing the communication and overlapping it with both the forward and backward passes. FSDP produces identical results as standard distributed data parallel (DDP) training and is available in an easy-to-use interface that\u2019s a drop-in replacement for PyTorch\u2019s DistributedDataParallel module. Our early testing has shown that FSDP can enable scaling to trillions of parameters.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">How FSDP works<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In standard DDP training, every worker processes a separate batch and the gradients are summed across workers using an <\/span><a href=\"https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/collectives.html#allreduce\"><span style=\"font-weight: 400;\">all-reduce operation<\/span><\/a><span style=\"font-weight: 400;\">. While DDP has become very popular, it takes more GPU memory than it needs because the model weights and optimizer states are replicated across all DDP workers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">One method to reduce replications is to apply a process called full parameter sharding, where only a subset of the model parameters, gradients, and optimizers needed for a local computation is made available. An implementation of this method, ZeRO-3, has already been popularized by Microsoft.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The key insight to unlock full parameter sharding is that we can decompose the <\/span><a href=\"https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/collectives.html#allreduce\"><span style=\"font-weight: 400;\">all-reduce<\/span><\/a><span style=\"font-weight: 400;\"> operations in DDP into separate <\/span><a href=\"https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/collectives.html#reducescatter\"><span style=\"font-weight: 400;\">reduce-scatter<\/span><\/a><span style=\"font-weight: 400;\"> and <a href=\"https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/collectives.html#allgather\">all-gather<\/a> operations:<\/span><\/p>\n<figure id=\"attachment_17828\" aria-describedby=\"caption-attachment-17828\" style=\"width: 1024px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-17828\" src=\"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?w=1024\" alt=\"Full Sharded Data Parallel graph\" width=\"1024\" height=\"562\" srcset=\"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png 1264w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?resize=916,503 916w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?resize=768,422 768w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?resize=1024,562 1024w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?resize=96,53 96w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-graph-2a.png?resize=192,105 192w\" sizes=\"auto, (max-width: 992px) 100vw, 62vw\" \/><figcaption id=\"caption-attachment-17828\" class=\"wp-caption-text\">All-reduce as a combination of reduce-scatter and all-gather. The standard all-reduce operation to aggregate gradients can be decomposed into two separate phases: reduce-scatter and all-gather. During the reduce-scatter phase, the gradients are summed in equal blocks among ranks on each GPU based on their rank index. During the all-gather phase, the sharded portion of aggregated gradients available on each GPU are made available to all GPUs (see here for details on those operators).<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">We can then rearrange the reduce-scatter and all-gather so that each DDP worker needs to store only a single shard of parameters and optimizer states. The figure below illustrates standard DDP training (top) and FSDP training (bottom):<\/span><\/p>\n<figure id=\"attachment_17812\" aria-describedby=\"caption-attachment-17812\" style=\"width: 907px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-17812\" src=\"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?w=907\" alt=\"Full Sharded Data Parallel graph\" width=\"907\" height=\"1024\" srcset=\"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png 1566w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=811,916 811w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=768,867 768w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=907,1024 907w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=1361,1536 1361w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=96,108 96w, https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Graph-2.png?resize=192,217 192w\" sizes=\"auto, (max-width: 992px) 100vw, 62vw\" \/><figcaption id=\"caption-attachment-17812\" class=\"wp-caption-text\">A comparison of standard data parallel training and fully sharded data parallel training. In standard data parallel training methods, a copy of the model is present on each GPU and a sequence of forward and backward passes are evaluated on only a shard of the data. After these local computations, the parameters and optimizers for each local process are shared with the other GPUs in order to calculate the global weight update. In FSDP, only a shard of the model is present on a GPU. Then, locally, all weights are gathered from the other GPUs \u2014 by means of an all-gather step \u2014 to calculate the forward pass. This gathering of weights is then performed again before the backward pass. After that backward pass, the local gradients are averaged and sharded across the GPUs by means of a reduce-scatter step, which allows each GPU to update its local weight shard.<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">To maximize memory efficiency, we can discard the full weights after each layer\u2019s forward pass, saving memory for subsequent layers. This can be implemented by applying the FSDP wrapper to every layer in the network (with <\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">reshard_after_forward=True<\/span><span style=\"font-weight: 400;\">). <\/span><\/p>\n<p><span style=\"font-weight: 400;\">In pseudo-code:<\/span><\/p>\n<pre><span style=\"font-weight: 400;\">FSDP forward pass:<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0for layer_i in layers:<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0all-gather full weights for layer_i<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0forward pass for layer_i<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0discard full weights for layer_i<\/span>\r\n\r\n<span style=\"font-weight: 400;\">FSDP backward pass:<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0for layer_i in layers:<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0all-gather full weights for layer_i<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0backward pass for layer_i<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0discard full weights for layer_i<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0reduce-scatter gradients for layer_i<\/span><\/pre>\n<h2 class=\"line-numbers\"><span style=\"font-weight: 400;\">How to use FSDP<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">There are several ways to use FSDP in large-scale AI research.<\/span><span style=\"font-weight: 400;\"> At this time, we offer four solutions to adapt to different needs.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">1. Using FSDP in language models<\/span><\/h3>\n<p class=\"line-numbers\"><span style=\"font-weight: 400;\">For language models, FSDP is supported in the <\/span><a href=\"https:\/\/github.com\/pytorch\/fairseq\"><i><span style=\"font-weight: 400;\">fairseq<\/span><\/i><span style=\"font-weight: 400;\"> framework<\/span><\/a><span style=\"font-weight: 400;\"> via the following new arguments:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;ddp-backend=fully_sharded<\/span><span style=\"font-weight: 400;\">: enables full sharding via FSDP<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;cpu-offload<\/span><span style=\"font-weight: 400;\">: offloads the optimizer state and FP32 model copy to CPU (combine with<\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;optimizer=cpu_adam<\/span><span style=\"font-weight: 400;\">)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;no-reshard-after-forward<\/span><span style=\"font-weight: 400;\">: increases training speed for large models (1B+ params) and is similar to ZeRO stage 2<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\">Other popular options (<span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;fp16<\/span><span style=\"font-weight: 400;\">, <\/span><span style=\"font-weight: 400;\">&#8211;update-freq<\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">, <\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;checkpoint-activations<\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">, <\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">&#8211;offload-activations<\/span><span style=\"font-weight: 400;\">, etc.) continue to work as normal<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">See the <\/span><a href=\"https:\/\/github.com\/pytorch\/fairseq\/tree\/master\/examples\/fully_sharded_data_parallel\"><span style=\"font-weight: 400;\">fairseq tutorial<\/span><\/a><span style=\"font-weight: 400;\"> for instructions on using FSDP to train a 13B-parameter model on eight GPUs or on a single GPU with FSDP + CPU offloading.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">2. Using FSDP in computer vision models<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">For computer vision models, FSDP is supported in <\/span><a href=\"https:\/\/github.com\/facebookresearch\/vissl\"><span style=\"font-weight: 400;\">VISSL<\/span><\/a><span style=\"font-weight: 400;\"> and tested on RegNets architectures. Layers like BatchNorm and ReLU are seamlessly handled and tested for convergence.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Use the following options to enable FSDP:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">config.MODEL.FSDP_CONFIG.AUTO_SETUP_FSDP=True<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400; font-family: 'Courier New';\">config.MODEL.AMP_PARAMS.AMP_TYPE=pytorch<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">See <\/span><a href=\"https:\/\/github.com\/facebookresearch\/vissl\/blob\/40441123a6f7098500676ca8800025c1f02e28b3\/vissl\/config\/defaults.yaml#L498-L513\"><span style=\"font-weight: 400;\">this section<\/span><\/a><span style=\"font-weight: 400;\"> of the yaml config for additional options to config FSDP within VISSL.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">3. Using FSDP from PyTorch Lightning<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">For easier integration with more general use cases, FSDP is supported as a beta feature by PyTorch Lightning. <\/span><a href=\"https:\/\/pytorch-lightning.readthedocs.io\/en\/latest\/advanced\/advanced_gpu.html#fully-sharded-training\"><span style=\"font-weight: 400;\">This tutorial<\/span><\/a><span style=\"font-weight: 400;\"> contains a detailed example on how to use the FSDP plugin with PyTorch Lightning. At a high level, adding <\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">plugins=\u2019fsdp\u2019<\/span><span style=\"font-weight: 400;\"> below can activate it.<\/span><\/p>\n<pre><span style=\"font-weight: 400;\">model = MyModel()<\/span>\r\n<span style=\"font-weight: 400;\">trainer = Trainer(gpus=4, <\/span><b>plugins='fsdp'<\/b><span style=\"font-weight: 400;\">, precision=16)<\/span>\r\n<span style=\"font-weight: 400;\">trainer.fit(model)\r\n<\/span><span style=\"font-weight: 400;\">\r\ntrainer.test()<\/span>\r\n<span style=\"font-weight: 400;\">trainer.predict()<\/span><\/pre>\n<h3><span style=\"font-weight: 400;\">4. Using the FSDP library directly from FairScale<\/span><\/h3>\n<p class=\"line-numbers\"><span style=\"font-weight: 400;\">The main library where FSDP has been developed, and where you can find the latest updates, is <\/span><a href=\"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/oss_sdp_fsdp.html\"><span style=\"font-weight: 400;\">FairScale<\/span><\/a><span style=\"font-weight: 400;\">. You can directly use FSDP from FairScale with the below example by simply replacing the <\/span><span style=\"font-weight: 400; font-family: 'Courier New';\">DDP(my_module)<\/span><span style=\"font-weight: 400;\">:<\/span><\/p>\n<pre><span style=\"font-weight: 400;\">from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP<\/span>\r\n<span style=\"font-weight: 400;\">...<\/span>\r\n<span style=\"font-weight: 400;\">sharded_module = <\/span><span style=\"font-weight: 400;\"><del>DDP(my_module)<\/del><\/span><b>FSDP(my_module)<\/b>\r\n<span style=\"font-weight: 400;\">optim = torch.optim.Adam(sharded_module.parameters(), lr=0.0001)<\/span>\r\n<span style=\"font-weight: 400;\">for sample, label in dataload.next_batch:<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0out = sharded_module(x=sample, y=3, z=torch.Tensor([1]))<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0loss = criterion(out, label)<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0loss.backward()<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0optim.step()<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">The FSDP library in FairScale exposes the low-level options for many important aspects of large-scale training. Here are some few important areas to consider when you apply FSDP with its full power.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Model wrapping: <\/b><span style=\"font-weight: 400;\">In order to minimize the transient GPU memory needs, users need to wrap a model in a nested fashion. This introduces additional complexity. The <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/wrap\/auto_wrap.py\"><span style=\"font-weight: 400;\">auto_wrap<\/span><\/a><span style=\"font-weight: 400;\"> utility is useful in annotating existing PyTorch model code for nested wrapping purposes.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Model initialization:<\/b><span style=\"font-weight: 400;\"> Unlike DDP, FSDP does <\/span><b>not<\/b><span style=\"font-weight: 400;\"> automatically synchronize model weights between GPU workers. This means model initialization must be done carefully so that all GPU workers have the identical initial weights.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Optimizer settings:<\/b><span style=\"font-weight: 400;\"> Due to sharding and wrapping, only certain types of optimizer and optimizer settings are supported by FSDP. In particular, if a module is wrapped by FSDP and its parameters are flattened into a single tensor, users cannot use different hyperparameters for different parameter groups in such a module.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Mixed precision:<\/b><span style=\"font-weight: 400;\"> FSDP supports advanced mixed precision training with FP16 master weights, as well as FP16 reduce and scatter on the gradients. Certain parts of a model may converge only if full precision is used. In those cases, additional wrapping is needed to selectively run parts of a model in full precision.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>State checkpointing and inference:<\/b><span style=\"font-weight: 400;\"> When the model scale is large, saving and loading the model state can become challenging. FSDP supports several ways to make that task possible, but it is by no means trivial.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Finally, FSDP is often used together with <\/span><b>activation checkpointing<\/b><span style=\"font-weight: 400;\"> functions like <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/checkpoint\/checkpoint_activations.py\"><span style=\"font-weight: 400;\">checkpoint_wrapper<\/span><\/a><span style=\"font-weight: 400;\"> from FairScale. Users may need to carefully tune the activation checkpointing strategy to fit a large model within limited GPU memory space.<\/span><\/li>\n<\/ol>\n<h2><span style=\"font-weight: 400;\">Next steps<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">FSDP is open source, and early users have tried it and contributed to it. We think it can benefit the entire research community, and we look forward to working with everyone in making it better. In particular, these are some of the important areas.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Making FSDP more general.<\/b><span style=\"font-weight: 400;\"> So far, FSDP has been used on both NLP and vision models with SGD and Adam optimizers. As newer models and optimizers emerge, FSDP needs to continue supporting them. Being a purely data-parallel training scheme, FSDP has the greatest potential to be general in supporting a wide range of AI algorithms.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Making FSDP auto-tune. <\/b><span style=\"font-weight: 400;\">There are many knobs that users can tune today with FSDP for both scaling and performance. We look forward to developing algorithms for auto-tuning both GPU memory usage and training performance.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">In addition to training, more <\/span><b>scalable inference<\/b><span style=\"font-weight: 400;\"> and model serving is an important use case that FSDP might need to support.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Last but not least, refactoring and continuing to <\/span><b>modularize FSDP<\/b><span style=\"font-weight: 400;\"> and its core components is equally important to newer and better features.<\/span><\/li>\n<\/ol>\n<h2><span style=\"font-weight: 400;\">Try it out and contribute!<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">FSDP is currently available directly from the <\/span><a href=\"https:\/\/github.com\/facebookresearch\/fairscale\"><span style=\"font-weight: 400;\">FairScale library<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Thanks for sticking with us thus far. Please try FSDP in your research or production work. We would love to hear your feedback, and, as always, pull requests are welcome! <\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Training AI models at a large scale isn\u2019t easy. Aside from the need for large amounts of computing power and resources, there is also considerable engineering complexity behind training very large models. At Facebook AI Research (FAIR) Engineering, we have been working on building tools and infrastructure to make training large AI models easier. Our [&#8230;]<\/p>\n<p><a class=\"btn btn-secondary understrap-read-more-link\" href=\"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/\">Read More&#8230;<\/a><\/p>\n","protected":false},"author":51,"featured_media":17838,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[73,69,72,174,67],"tags":[],"coauthors":[1524,1742,1743,1744,1745,1746],"class_list":["post-17795","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-research","category-data-center-engineering","category-ml-applications","category-open-source","category-production-engineering","fb_content_type-article"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v19.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -<\/title>\n<meta name=\"description\" content=\"Fully Sharded Data Parallel (FSDP) makes training larger, more advanced AI models more efficiently than ever using fewer GPUs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Myle Ott, Sam Shleifer, Min Xu, Priya Goyal, Quentin Duval, Vittorio Caggiano\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/\"},\"author\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#author\",\"name\":\"\"},\"headline\":\"Fully Sharded Data Parallel: faster AI training with fewer GPUs\",\"datePublished\":\"2021-07-15T16:00:47+00:00\",\"dateModified\":\"2021-07-22T18:02:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/\"},\"wordCount\":1730,\"publisher\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/FSDP-Hero-FINAL-1.png\",\"articleSection\":[\"AI Research\",\"Data Center Engineering\",\"ML Applications\",\"Open Source\",\"Production Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/\",\"url\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/\",\"name\":\"Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/FSDP-Hero-FINAL-1.png\",\"datePublished\":\"2021-07-15T16:00:47+00:00\",\"dateModified\":\"2021-07-22T18:02:59+00:00\",\"description\":\"Fully Sharded Data Parallel (FSDP) makes training larger, more advanced AI models more efficiently than ever using fewer GPUs.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#primaryimage\",\"url\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/FSDP-Hero-FINAL-1.png\",\"contentUrl\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/FSDP-Hero-FINAL-1.png\",\"width\":7579,\"height\":4263,\"caption\":\"Full Sharded Data Parallel hero\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/2021\\\/07\\\/15\\\/open-source\\\/fsdp\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/engineering.fb.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fully Sharded Data Parallel: faster AI training with fewer GPUs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#website\",\"url\":\"https:\\\/\\\/engineering.fb.com\\\/\",\"name\":\"Engineering at Meta\",\"description\":\"Engineering at Meta Blog\",\"publisher\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/engineering.fb.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#organization\",\"name\":\"Meta\",\"url\":\"https:\\\/\\\/engineering.fb.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Meta_lockup_positive-primary_RGB.jpg\",\"contentUrl\":\"https:\\\/\\\/engineering.fb.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Meta_lockup_positive-primary_RGB.jpg\",\"width\":29011,\"height\":12501,\"caption\":\"Meta\"},\"image\":{\"@id\":\"https:\\\/\\\/engineering.fb.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Engineering\\\/\",\"https:\\\/\\\/x.com\\\/fb_engineering\"]},[]]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -","description":"Fully Sharded Data Parallel (FSDP) makes training larger, more advanced AI models more efficiently than ever using fewer GPUs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","twitter_misc":{"Written by":"Myle Ott, Sam Shleifer, Min Xu, Priya Goyal, Quentin Duval, Vittorio Caggiano","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#article","isPartOf":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/"},"author":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#author","name":""},"headline":"Fully Sharded Data Parallel: faster AI training with fewer GPUs","datePublished":"2021-07-15T16:00:47+00:00","dateModified":"2021-07-22T18:02:59+00:00","mainEntityOfPage":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/"},"wordCount":1730,"publisher":{"@id":"https:\/\/engineering.fb.com\/#organization"},"image":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#primaryimage"},"thumbnailUrl":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Hero-FINAL-1.png","articleSection":["AI Research","Data Center Engineering","ML Applications","Open Source","Production Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","url":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","name":"Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -","isPartOf":{"@id":"https:\/\/engineering.fb.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#primaryimage"},"image":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#primaryimage"},"thumbnailUrl":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Hero-FINAL-1.png","datePublished":"2021-07-15T16:00:47+00:00","dateModified":"2021-07-22T18:02:59+00:00","description":"Fully Sharded Data Parallel (FSDP) makes training larger, more advanced AI models more efficiently than ever using fewer GPUs.","breadcrumb":{"@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#primaryimage","url":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Hero-FINAL-1.png","contentUrl":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Hero-FINAL-1.png","width":7579,"height":4263,"caption":"Full Sharded Data Parallel hero"},{"@type":"BreadcrumbList","@id":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/engineering.fb.com\/"},{"@type":"ListItem","position":2,"name":"Fully Sharded Data Parallel: faster AI training with fewer GPUs"}]},{"@type":"WebSite","@id":"https:\/\/engineering.fb.com\/#website","url":"https:\/\/engineering.fb.com\/","name":"Engineering at Meta","description":"Engineering at Meta Blog","publisher":{"@id":"https:\/\/engineering.fb.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/engineering.fb.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/engineering.fb.com\/#organization","name":"Meta","url":"https:\/\/engineering.fb.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/engineering.fb.com\/#\/schema\/logo\/image\/","url":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2023\/08\/Meta_lockup_positive-primary_RGB.jpg","contentUrl":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2023\/08\/Meta_lockup_positive-primary_RGB.jpg","width":29011,"height":12501,"caption":"Meta"},"image":{"@id":"https:\/\/engineering.fb.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Engineering\/","https:\/\/x.com\/fb_engineering"]},[]]}},"jetpack_featured_media_url":"https:\/\/engineering.fb.com\/wp-content\/uploads\/2021\/07\/FSDP-Hero-FINAL-1.png","jetpack_shortlink":"https:\/\/wp.me\/sa0Lhq-fsdp","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/posts\/17795","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/users\/51"}],"replies":[{"embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/comments?post=17795"}],"version-history":[{"count":17,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/posts\/17795\/revisions"}],"predecessor-version":[{"id":17840,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/posts\/17795\/revisions\/17840"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/media\/17838"}],"wp:attachment":[{"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/media?parent=17795"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/categories?post=17795"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/tags?post=17795"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/engineering.fb.com\/wp-json\/wp\/v2\/coauthors?post=17795"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}