mozamani/hpc_ml

Distributed Training & HPC

Horovod

  1. Reference architecture for distributed training of deep learning models on GPUs: a data-parallel architecture with synchronous updates using Horovod.
    https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/ai/training-deep-learning
    https://github.com/microsoft/DistributedDeepLearning/
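The data-parallel, synchronous-update pattern in this architecture can be illustrated with a small, hypothetical NumPy simulation (toy linear model, four simulated workers; no Horovod required — the averaged-gradient step stands in for what an allreduce provides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: linear regression, 4 simulated workers.
# Each worker holds one shard of the data (data parallelism).
n_workers, n_features = 4, 3
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(400, n_features))
y = X @ true_w
shards = np.array_split(np.arange(400), n_workers)

w = np.zeros(n_features)
lr = 0.1
for step in range(200):
    # Each worker computes a gradient on its own shard...
    grads = []
    for idx in shards:
        err = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ err / len(idx))
    # ...then every worker applies the same averaged gradient —
    # the synchronous update that an allreduce provides in Horovod.
    avg_grad = np.mean(grads, axis=0)
    w -= lr * avg_grad

print(np.round(w, 3))  # recovers true_w
```

Because all workers see the identical averaged gradient each step, their model replicas never diverge — that is the essential property of synchronous data-parallel training.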


  2. Introductory material for Horovod:
    Main paper: paper1
    Videos: Video1, Video2
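The paper's central mechanism is ring-allreduce: each of N workers exchanges only 1/N of its gradient buffer per step, yet after 2·(N−1) steps every worker holds the full sum. A hypothetical pure-Python simulation of the algorithm (function name and setup are illustrative, not Horovod's API):

```python
def ring_allreduce(buffers):
    """Simulate ring-allreduce: every worker ends with the sum of all buffers.

    Assumes the buffer length is divisible by the number of workers.
    """
    n = len(buffers)
    size = len(buffers[0]) // n
    # Split each worker's buffer into n equal chunks.
    chunks = [[buf[j * size:(j + 1) * size] for j in range(n)] for buf in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully reduced (summed) chunk (i + 1) % n.
    for t in range(n - 1):
        # Snapshot sends first: in the real algorithm all transfers
        # within a step happen concurrently.
        sends = [(i, (i - t) % n, chunks[i][(i - t) % n]) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]

    # Phase 2: allgather. Circulate the reduced chunks around the ring
    # so every worker ends up with every fully reduced chunk.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n]) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    return [[x for chunk in cs for x in chunk] for cs in chunks]

grads = [[float(i)] * 4 for i in range(4)]  # worker i's gradient buffer
summed = ring_allreduce(grads)
print(summed[0])  # every worker ends with the element-wise sum
```

The bandwidth cost per worker is independent of N, which is why this scheme scales better than having a single node aggregate all gradients.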

  3. A notebook example of training a word2vec model in TensorFlow using distributed training via Horovod and AMLS.
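As background for that example, a hypothetical sketch of the (target, context) skip-gram pairs a word2vec model trains on (this is not the notebook's code, just an illustration of the training data it would construct):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs for skip-gram word2vec training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context words fall within `window` positions of the target.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["hpc", "makes", "training", "fast"], window=1))
```

Pair generation is embarrassingly parallel across the corpus, which is what makes word2vec a natural fit for data-parallel training with Horovod.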

  4. Databricks notebook examples for distributed training of neural networks using Keras and Horovod (HorovodRunner): Notebook1, Notebook2

  5. Distributed TensorFlow can be used to define how the NN training jobs are distributed across a cluster.
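Distributed TensorFlow learns the cluster layout from a cluster spec; one common form is the `TF_CONFIG` environment variable, which names every process and tells each one its own role. A sketch of such a spec (hostnames are hypothetical; the `ps` entry applies only to parameter-server-style training):

```json
{
  "cluster": {
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"]
  },
  "task": {"type": "worker", "index": 0}
}
```

Each process receives the same `cluster` block but a different `task` block, which is how it knows which gRPC server to start and which role to play.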

  6. Azure VMs with RDMA and InfiniBand support:
    https://docs.microsoft.com/en-us/azure/batch/batch-pool-compute-intensive-sizes#main

  7. HB and HC VM series: link

  8. Azure Batch intro.

  9. How to use AML pipeline steps.

  10. Model interpretability: link

  11. Azure HPC environment repo, plus CycleCloud tutorials and Azure Batch examples

  12. Batch AI to AMLS migration example

Spark

  1. Native Apache Spark MLlib for distributed ML
  2. Databricks example notebooks
  3. MMLSpark: example notebooks & video

About

Horovod + AMLS demo
