- Reference architecture for distributed training of deep learning models on GPUs: data-parallel distributed training with synchronous updates using Horovod.
  - https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-deep-learning
  - https://github.com/microsoft/DistributedDeepLearning/
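The core idea behind this architecture can be sketched without Horovod itself: each worker computes a gradient on its own data shard, the gradients are averaged across workers (what Horovod does with ring-allreduce), and every worker applies the same synchronous update. The names below (`grad`, `allreduce_mean`, `train_step`) are illustrative stand-ins, not Horovod's API.

```python
# Minimal pure-Python sketch of synchronous data-parallel training.
# In real Horovod, allreduce_mean would be a ring-allreduce over processes;
# here the "workers" are simulated in one process for illustration.

def grad(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: average one value across all workers."""
    return sum(values) / len(values)

def train_step(w, shards, lr=0.05):
    # Each worker computes a local gradient on its own shard ...
    local_grads = [grad(w, shard) for shard in shards]
    # ... the gradients are averaged across workers (allreduce) ...
    g = allreduce_mean(local_grads)
    # ... and every worker applies the identical synchronous update.
    return w - lr * g

# Data for y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(100):
    w = train_step(w, shards)
print(round(w, 3))  # converges toward the true weight 3.0
```

Because every worker sees the same averaged gradient, all replicas stay in lockstep; this is the property that distinguishes synchronous updates from asynchronous parameter-server schemes.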
- Intro material for Horovod:
  - Main paper: paper1
  - Videos: Video1, Video2
- A notebook example of training a word2vec model in TensorFlow using distributed training via Horovod & AMLS.
- Databricks notebook examples for distributed training of neural networks using Keras and Horovod (HorovodRunner): Notebook1, Notebook2
- Distributed TensorFlow can be used to define how neural-network training jobs are distributed across devices and workers.
- Azure VMs with support for RDMA & InfiniBand:
  - https://docs.microsoft.com/en-us/azure/batch/batch-pool-compute-intensive-sizes#main
  - HB and HC VM series: link
- Azure Batch intro.
- How to use AML pipeline steps
- Model interpretability: link
- Azure HPC environment repo + CycleCloud tutorials & Azure Batch examples
- Batch AI to AMLS migration example
- Native Apache Spark MLlib for distributed ML
- Databricks example notebooks
- mmlspark: example notebooks & video


