GPU-specific training configs optimized for different hardware setups.
| GPU | Config File | VRAM | System RAM | Batch Size | Model | Data Cache | Notes |
|---|---|---|---|---|---|---|---|
| GTX 1080 Ti | gpu_1080ti_training.yaml | 11GB | 64GB | 4 | tiny | 40GB | Entry-level training |
| V100 | gpu_v100_training.yaml | 32GB | 128GB | 12 | base_plus | 80GB | Good for prototyping |
| A100 | gpu_a100_training.yaml | 80GB | 256GB | 24 | large | 180GB | Production training |
| H100 | h100_training_config.yaml | 80GB | 230GB | 16 | large | 172GB | Fast training |
| H200 | gpu_h200_training.yaml | 141GB | 512GB | 48 | large | 380GB | Maximum throughput |
| Config | Samples | Purpose |
|---|---|---|
| synthetic_test_100.yaml | 100 | Quick testing |
| synthetic_train_4k.yaml | 4,000 | Development training |
| synthetic_val_1k.yaml | 1,000 | Validation |
All GPU configs now use shared RAM preloading instead of per-worker LRU caching:
```yaml
training:
  data_cache_ram_gb: 172      # RAM budget for training data
  val_data_cache_ram_gb: 20   # RAM budget for validation data
```

Benefits:
- All workers share the same cached data (no duplication)
- Predictable memory usage
- Much faster than disk I/O once cached
- Configurable based on available system RAM
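The sketch below illustrates the general preloading pattern, assuming a PyTorch map-style dataset; the class name, array shapes, and sample count are illustrative stand-ins, not the repo's actual implementation. Data loaded in the main process before DataLoader workers are forked is shared across workers via copy-on-write, which is what avoids per-worker duplication.

```python
# Hypothetical sketch of shared-RAM preloading. Arrays are loaded once in the
# main process; workers forked later read the same memory pages, so the cache
# is not duplicated per worker (Linux fork / copy-on-write behaviour).
import numpy as np
import torch
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    def __init__(self, num_samples: int = 8):
        # The real configs would read batch files from disk up to
        # data_cache_ram_gb; small zero arrays stand in for them here.
        self.images = [np.zeros((1024, 1024, 3), dtype=np.float32)
                       for _ in range(num_samples)]
        self.masks = [np.zeros((1024, 1024), dtype=np.uint8)
                      for _ in range(num_samples)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return torch.from_numpy(self.images[idx]), torch.from_numpy(self.masks[idx])
```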
Per batch file: 1.36 GB (100 samples × 1024×1024×3 float32 + masks)
Formula:
batches_cached = data_cache_ram_gb / 1.36
percentage_cached = batches_cached / total_batches
Example (H100):
- Cache: 172 GB
- Batches: 172 / 1.36 = 126 batches
- For train_4000 (160 batches): 126/160 = 78% cached
- For train_10000 (400 batches): 126/400 = 31% cached
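The arithmetic above is easy to script; the 1.36 GB per-batch-file figure and the batch counts are the ones quoted in this section.

```python
# Cache coverage from the formula above: batches_cached = cache_gb / 1.36.
GB_PER_BATCH_FILE = 1.36

def cache_coverage(data_cache_ram_gb: float, total_batches: int):
    batches_cached = int(data_cache_ram_gb / GB_PER_BATCH_FILE)
    return batches_cached, batches_cached / total_batches

print(cache_coverage(172, 160))  # H100 vs train_4000  -> (126, 0.7875)
print(cache_coverage(172, 400))  # H100 vs train_10000 -> (126, 0.315)
```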
| Parameter | 1080 Ti | V100 | A100 | H100 | H200 |
|---|---|---|---|---|---|
| num_workers | 4 | 8 | 16 | 12 | 24 |
| prefetch_factor | 2 | 2 | 3 | 2 | 4 |
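These two knobs map directly onto PyTorch's DataLoader arguments; a minimal sketch using the H100 column (the stand-in dataset and the pin_memory flag are assumptions, not taken from the configs):

```python
# Illustrative only: the table's H100 values passed to a standard DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(64, 3, 32, 32))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=16,      # H100 batch size from the first table
    num_workers=12,     # H100 column above
    prefetch_factor=2,  # batches prefetched per worker
    pin_memory=True,    # assumed extra, not specified by the configs
    shuffle=True,
)
```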
```bash
# H100
python scripts/run_training.py --config configs/h100_training_config.yaml

# A100
python scripts/run_training.py --config configs/gpu_a100_training.yaml

# Skip dataset generation if it already exists
python scripts/run_training.py --config configs/h100_training_config.yaml --skip-generation
```

Training logs are saved to:

```
{output_dir}/training_{timestamp}.log
```
Tail the log in another terminal:

```bash
tail -f ./training_output/training_*.log
```

If you have different RAM than the config assumes:
- Check available RAM:

  ```bash
  free -h
  ```

- Adjust data_cache_ram_gb:
  - Use ~75% of available RAM for cache
  - Leave 25% for system/training overhead

- Example: a system with 128GB RAM:

  ```yaml
  data_cache_ram_gb: 96  # 75% of 128GB
  ```
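If you would rather compute the figure than eyeball `free -h`, a small Linux-only helper like the one below works; the function name is made up for illustration, and it uses total installed RAM as a stand-in for the available figure.

```python
# Suggest a data_cache_ram_gb value as ~75% of installed RAM (Linux only).
# Note: the rule of thumb above uses *available* RAM from `free -h`; total
# physical RAM is a close stand-in on a dedicated training box.
import os

def suggested_cache_gb(fraction: float = 0.75) -> int:
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(total_bytes / 1024**3 * fraction)

print(suggested_cache_gb())  # e.g. 96 on a 128GB machine
```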
| Model | Parameters | VRAM (approx) | Speed | Quality |
|---|---|---|---|---|
| tiny | ~38M | ~5GB | Fastest | Good |
| small | ~80M | ~8GB | Fast | Better |
| base_plus | ~152M | ~15GB | Medium | Great |
| large | ~224M | ~22GB | Slower | Best |
Symptom: `DataLoader worker (pid XXX) is killed by signal: Killed`

Solutions:
- Reduce data_cache_ram_gb
- Reduce num_workers
- Reduce batch_size
Symptom: GPU at 0% or low utilization
Solutions:
- Increase data_cache_ram_gb (more data in RAM)
- Increase num_workers
- Increase prefetch_factor
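One way to confirm the input pipeline is the culprit is to time how long each step waits on the DataLoader versus how long it computes; a generic diagnostic sketch, not part of the repo's scripts:

```python
# Rough input-pipeline check: if "data wait" dwarfs "compute", the loader is
# starving the GPU and raising the knobs above should help. For GPU training
# steps, call torch.cuda.synchronize() inside train_step for honest timings.
import time

def profile_loader(loader, train_step, num_batches: int = 50):
    wait_total = step_total = 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader):
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait_total += t1 - t0
        step_total += t2 - t1
        t0 = t2
        if i + 1 >= num_batches:
            break
    print(f"data wait: {wait_total:.2f}s  compute: {step_total:.2f}s")
```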
Symptom: Frequent disk I/O, slow training
Solutions:
- Increase data_cache_ram_gb to fit more of the dataset
- Use a smaller dataset that fits entirely in the cache
- Consider using faster storage (NVMe SSD)