- Reference architecture for distributed training of deep learning models on GPUs: data-parallel distributed training with synchronous updates using Horovod.
  - https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-deep-learning
  - https://github.com/microsoft/DistributedDeepLearning/
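The core idea behind this architecture can be sketched without Horovod itself: each worker computes a gradient on its own data shard, the gradients are averaged across workers (what Horovod does with ring-allreduce), and every worker applies the same synchronous update. The names below (`grad`, `allreduce_mean`, `train_step`) are illustrative stand-ins, not Horovod's API.

```python
# Minimal pure-Python sketch of synchronous data-parallel training.
# In real Horovod, allreduce_mean would be a ring-allreduce over processes;
# here the "workers" are simulated in one process for illustration.

def grad(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: average one value across all workers."""
    return sum(values) / len(values)

def train_step(w, shards, lr=0.05):
    # Each worker computes a local gradient on its own shard ...
    local_grads = [grad(w, shard) for shard in shards]
    # ... the gradients are averaged across workers (allreduce) ...
    g = allreduce_mean(local_grads)
    # ... and every worker applies the identical synchronous update.
    return w - lr * g

# Data for y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(100):
    w = train_step(w, shards)
print(round(w, 3))  # converges toward the true weight 3.0
```

Because every worker sees the same averaged gradient, all replicas stay in lockstep; this is the property that distinguishes synchronous updates from asynchronous parameter-server schemes.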
- Intro material for Horovod:
  - Main paper: paper1
  - Videos: Video1, Video2
- A notebook example of training a word2vec model in TensorFlow using distributed training via Horovod & AMLS.
- Databricks notebook examples for distributed training of neural networks using Keras and Horovod (HorovodRunner): Notebook1, Notebook2
- Distributed TensorFlow can be used to define how neural-network training jobs are distributed across devices and workers.
- Azure VMs with support for RDMA & InfiniBand:
  - https://docs.microsoft.com/en-us/azure/batch/batch-pool-compute-intensive-sizes#main
  - HB and HC VM series: link
- Azure Batch intro.
- How to use AML pipeline steps
- Model interpretability: link
- Azure HPC environment repo + CycleCloud tutorials & Azure Batch examples
- Batch AI to AMLS migration example
- Native Apache Spark MLlib for distributed ML
- Databricks example notebooks
- mmlspark: example notebooks & video


