ECE-GY 9143 - High Performance Machine Learning Project (NYU)
The goal of this project is to optimize the High-resolution Convolutional Transformer (HCT) [1] model by leveraging High-Performance Computing (HPC) techniques, including distributed data parallelism and other optimization strategies. By improving the efficiency and scalability of HCT, the project aims to enable faster processing of high-resolution images in applications such as medical imaging and satellite image analysis.
| Milestone | Status |
|---|---|
| Write a data module to handle the dataset (load, pre-process, make it PyTorch compatible) | Completed |
| Write a basic trainer as per model paper to measure time and accuracy of model on single GPU. | Completed |
| Integrate Weights & Biases and PyTorch Profiler | Completed |
| Optimize training time by using PyTorch distributed data parallel | Completed |
| Make the model TorchScript compatible so it can be deployed in non-Python environments | Completed |
| Use additional quantization techniques to reduce inference time further | Completed |
├── ...
├── cpp
│   ├── native_hct.cpp (to load the TorchScript model using C++)
├── layer
│   ├── ac_layers.py (Attention-Convolution block for Transformer)
│   ├── performer_attention.py (linear self-attention)
│   ├── resnet_layers.py (convolution layers for early stages)
├── dataset
│   ├── image_labels.csv (for the PyTorch Dataset class)
│   ├── image_labels_valid.csv (for the PyTorch Dataset class)
├── hct_base.py (model starting point)
├── datascript.py (to process data and make it PyTorch compatible)
├── data.py (Dataset class for the PyTorch DataLoader)
├── trainer.py (single-device trainer)
├── DistributedTrainer.py (distributed data parallel trainer)
├── inference.py (inference engine)
├── train.py (run this file to train the model)
├── inf.py (use this file for inference)
├── params.py (command-line arguments)
├── submit.sh (to submit a job on HPC)
├── submit2.sh (to submit a job on HPC)
├── quantized
│   ├── main.py (static quantization)
├── quantized_dynamic
│   ├── main.py (dynamic quantization)
├── quantized_partial
│   ├── main.py (attempted partial static quantization; resulted in worse runtime)
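The data pipeline is driven by the csv index files listed above. As a rough illustration, a csv-indexed PyTorch Dataset along the lines of `data.py` could look like the sketch below; the column names `image_path`/`label` and the dummy tensor in place of real image decoding are assumptions, not the repo's actual implementation.

```python
import csv
import torch
from torch.utils.data import Dataset

class ImageLabelDataset(Dataset):
    """Hypothetical sketch: serves (image, label) pairs from a csv index file."""

    def __init__(self, csv_path, transform=None):
        # Assumed csv columns: image_path, label
        with open(csv_path, newline="") as f:
            self.rows = [(r["image_path"], int(r["label"])) for r in csv.DictReader(f)]
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, label = self.rows[idx]
        # Real code would decode the image at `path` here (e.g. with OpenCV);
        # this sketch returns a dummy tensor in its place.
        image = torch.zeros(3, 224, 224)
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```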
- PyTorch
- Weights & Biases (wandb)
- LibTorch (PyTorch C++ API, with CUDA support)
- OpenCV (C++)
- torch-tb-profiler
- einops (Python library for tensor manipulation)
sbatch submit.sh
python train.py
usage: train.py [-h] [--batch-size N] [--num-workers NUM_WORKERS] [--epochs N] [--lr LR] [--device DEVICE] [--optimizer OPTIMIZER] [--momentum N]
[--weight-decay N] [--dataset-path DATASET_PATH] [--wandb WANDB] [--project PROJECT] [--training-mode TRAINING_MODE] [--num-devices NUM_DEVICES]
[--log-path LOG_PATH]
High Resolution Convolutional Transformer
options:
-h, --help show this help message and exit
--batch-size N input batch size for training (default: 16)
--num-workers NUM_WORKERS
Number of I/O processes (default: 2)
--epochs N Number of epochs to train (default: 80)
--lr LR learning rate (default: 2e-5)
--device DEVICE Device to be used for training (cpu/cuda)
--optimizer OPTIMIZER
Optimizer to be used (default: Adam)
--momentum N Momentum Value (default: 0.9)
--weight-decay N Weight Decay value (default: 5e-4)
--dataset-path DATASET_PATH
Path of LIU4K Dataset (default: ./dataset)
--wandb WANDB Use Weights & Biases to track training progress (0/1)
--project PROJECT Project Name for W&B
--training-mode TRAINING_MODE
Training Mode (Parallel/Distributed/Single)
--num-devices NUM_DEVICES
Number of Learners/GPUs
--log-path LOG_PATH Path where logs will be captured
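For example, a distributed run on 4 GPUs with W&B logging enabled could be launched like this (the argument values are illustrative, not required defaults):

```shell
python train.py --training-mode Distributed --num-devices 4 \
    --batch-size 16 --epochs 80 --lr 2e-5 --wandb 1 --project hct-ddp
```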
sbatch submit2.sh
python inf.py
cd cpp
mkdir build
cd build
cmake ..
cmake --build .
./native_hct <model_path> <image_path> <device>
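The C++ binary expects a serialized TorchScript module as `<model_path>`. On the Python side, the export boils down to something like the sketch below; `TinyModel` is a stand-in for illustration (the repo scripts the actual HCT model):

```python
import torch

class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for the HCT model."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0

# torch.jit.script compiles the module to TorchScript so that it can be
# deserialized from C++ (torch::jit::load) without a Python runtime.
scripted = torch.jit.script(TinyModel())
scripted.save("model.pt")
```

`native_hct` then loads the saved file with `torch::jit::load` and runs the forward pass on the preprocessed image tensor.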
Static Quantization:
cd quantized
sbatch submit.sh
Dynamic Quantization:
cd quantized_dynamic
sbatch submit.sh
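Dynamic quantization swaps selected layer types (typically `nn.Linear`) for int8 kernels, quantizing weights once and activations on the fly. A minimal sketch on a toy model, assuming the standard `torch.ao.quantization.quantize_dynamic` API (the repo's `quantized_dynamic/main.py` applies the same idea to HCT):

```python
import torch
import torch.nn as nn

# Toy float model standing in for HCT's linear-heavy transformer blocks.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

# Replace every nn.Linear with a dynamically quantized int8 version:
# weights are quantized once, activations per forward pass.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(4, 16))  # inference runs on the int8 kernels
```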
| Per Epoch | Single Device | Distributed (4 GPUs) | Speedup |
|---|---|---|---|
| Data Loading | 1000 secs | 271 secs | 3.7 |
| Training | 160 secs | 90 secs | 1.78 |
| Running | 1170 secs | 360 secs | 3.25 |
Training => Data movement from CPU to GPU + forward pass + backward pass + metrics calculation
Running => Data loading + Training
Model evaluation is not included in the time measurements.
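The distributed numbers above come from PyTorch DistributedDataParallel. Stripped to its core, the setup in the trainer looks roughly like this single-process sketch; the `gloo` backend, single rank, and toy model are assumptions for illustration (the actual runs use 4 CUDA devices, one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process bootstrap; a real launcher (torchrun/srun) sets these per rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)   # toy stand-in for HCT
ddp_model = DDP(model)          # gradients are all-reduced across ranks

out = ddp_model(torch.randn(4, 8))
out.sum().backward()            # backward triggers the gradient all-reduce

dist.destroy_process_group()
```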
References:
[1] Taha, A., Truong Vu, Y.N., Mombourquette, B., Matthews, T.P., Su, J. and Singh, S., 2022, September. Deep is a Luxury We Don't Have. MICCAI 2022.
