GPU-specific training configs optimized for different hardware setups.
| GPU | Config File | VRAM | System RAM | Batch Size | Model | Data Cache | Notes |
|---|---|---|---|---|---|---|---|
| GTX 1080 Ti | gpu_1080ti_training.yaml | 11GB | 64GB | 4 | tiny | 40GB | Entry-level training |
| V100 | gpu_v100_training.yaml | 32GB | 128GB | 12 | base_plus | 80GB | Good for prototyping |
| A100 | gpu_a100_training.yaml | 80GB | 256GB | 24 | large | 180GB | Production training |
| H100 | h100_training_config.yaml | 80GB | 230GB | 16 | large | 172GB | Fast training |
| H200 | gpu_h200_training.yaml | 141GB | 512GB | 48 | large | 380GB | Maximum throughput |
| Config | Samples | Purpose |
|---|---|---|
| synthetic_test_100.yaml | 100 | Quick testing |
| synthetic_train_4k.yaml | 4,000 | Development training |
| synthetic_val_1k.yaml | 1,000 | Validation |
All GPU configs now use shared RAM preloading instead of per-worker LRU caching:
```yaml
training:
  data_cache_ram_gb: 172      # RAM budget for training data
  val_data_cache_ram_gb: 20   # RAM budget for validation data
```

Benefits:
- All workers share the same cached data (no duplication)
- Predictable memory usage
- Much faster than disk I/O once cached
- Configurable based on available system RAM
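The sketch below illustrates the general preloading pattern, assuming a PyTorch map-style dataset; the class name, array shapes, and sample count are illustrative stand-ins, not the repo's actual implementation. Data loaded in the main process before DataLoader workers are forked is shared across workers via copy-on-write, which is what avoids per-worker duplication.

```python
# Hypothetical sketch of shared-RAM preloading. Arrays are loaded once in the
# main process; workers forked later read the same memory pages, so the cache
# is not duplicated per worker (Linux fork / copy-on-write behaviour).
import numpy as np
import torch
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    def __init__(self, num_samples: int = 8):
        # The real configs would read batch files from disk up to
        # data_cache_ram_gb; small zero arrays stand in for them here.
        self.images = [np.zeros((1024, 1024, 3), dtype=np.float32)
                       for _ in range(num_samples)]
        self.masks = [np.zeros((1024, 1024), dtype=np.uint8)
                      for _ in range(num_samples)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return torch.from_numpy(self.images[idx]), torch.from_numpy(self.masks[idx])
```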
Per batch file: 1.36 GB (100 samples × 1024×1024×3 float32 + masks)
Formula:
batches_cached = data_cache_ram_gb / 1.36
percentage_cached = batches_cached / total_batches
Example (H100):
- Cache: 172 GB
- Batches: 172 / 1.36 = 126 batches
- For train_4000 (160 batches): 126/160 = 78% cached
- For train_10000 (400 batches): 126/400 = 31% cached
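The arithmetic above is easy to script; the 1.36 GB per-batch-file figure and the batch counts are the ones quoted in this section.

```python
# Cache coverage from the formula above: batches_cached = cache_gb / 1.36.
GB_PER_BATCH_FILE = 1.36

def cache_coverage(data_cache_ram_gb: float, total_batches: int):
    batches_cached = int(data_cache_ram_gb / GB_PER_BATCH_FILE)
    return batches_cached, batches_cached / total_batches

print(cache_coverage(172, 160))  # H100 vs train_4000  -> (126, 0.7875)
print(cache_coverage(172, 400))  # H100 vs train_10000 -> (126, 0.315)
```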
| Parameter | 1080 Ti | V100 | A100 | H100 | H200 |
|---|---|---|---|---|---|
| num_workers | 4 | 8 | 16 | 12 | 24 |
| prefetch_factor | 2 | 2 | 3 | 2 | 4 |
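These two knobs map directly onto PyTorch's DataLoader arguments; a minimal sketch using the H100 column (the stand-in dataset and the pin_memory flag are assumptions, not taken from the configs):

```python
# Illustrative only: the table's H100 values passed to a standard DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(64, 3, 32, 32))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=16,      # H100 batch size from the first table
    num_workers=12,     # H100 column above
    prefetch_factor=2,  # batches prefetched per worker
    pin_memory=True,    # assumed extra, not specified by the configs
    shuffle=True,
)
```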
```bash
# H100
python scripts/run_training.py --config configs/h100_training_config.yaml

# A100
python scripts/run_training.py --config configs/gpu_a100_training.yaml

# Skip dataset generation if it already exists
python scripts/run_training.py --config configs/h100_training_config.yaml --skip-generation
```

Training logs are saved to:

```
{output_dir}/training_{timestamp}.log
```
Tail the log in another terminal:

```bash
tail -f ./training_output/training_*.log
```

If you have different RAM than the config assumes:
- Check available RAM:

  ```bash
  free -h
  ```

- Adjust data_cache_ram_gb:
  - Use ~75% of available RAM for cache
  - Leave 25% for system/training overhead

- Example: a system with 128GB RAM:

  ```yaml
  data_cache_ram_gb: 96  # 75% of 128GB
  ```
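If you would rather compute the figure than eyeball `free -h`, a small Linux-only helper like the one below works; the function name is made up for illustration, and it uses total installed RAM as a stand-in for the available figure.

```python
# Suggest a data_cache_ram_gb value as ~75% of installed RAM (Linux only).
# Note: the rule of thumb above uses *available* RAM from `free -h`; total
# physical RAM is a close stand-in on a dedicated training box.
import os

def suggested_cache_gb(fraction: float = 0.75) -> int:
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(total_bytes / 1024**3 * fraction)

print(suggested_cache_gb())  # e.g. 96 on a 128GB machine
```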
| Model | Parameters | VRAM (approx) | Speed | Quality |
|---|---|---|---|---|
| tiny | ~38M | ~5GB | Fastest | Good |
| small | ~80M | ~8GB | Fast | Better |
| base_plus | ~152M | ~15GB | Medium | Great |
| large | ~224M | ~22GB | Slower | Best |
Symptom: `DataLoader worker (pid XXX) is killed by signal: Killed`

Solutions:
- Reduce data_cache_ram_gb
- Reduce num_workers
- Reduce batch_size
Symptom: GPU at 0% or low utilization
Solutions:
- Increase data_cache_ram_gb (more data in RAM)
- Increase num_workers
- Increase prefetch_factor
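One way to confirm the input pipeline is the culprit is to time how long each step waits on the DataLoader versus how long it computes; a generic diagnostic sketch, not part of the repo's scripts:

```python
# Rough input-pipeline check: if "data wait" dwarfs "compute", the loader is
# starving the GPU and raising the knobs above should help. For GPU training
# steps, call torch.cuda.synchronize() inside train_step for honest timings.
import time

def profile_loader(loader, train_step, num_batches: int = 50):
    wait_total = step_total = 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader):
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait_total += t1 - t0
        step_total += t2 - t1
        t0 = t2
        if i + 1 >= num_batches:
            break
    print(f"data wait: {wait_total:.2f}s  compute: {step_total:.2f}s")
```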
Symptom: Frequent disk I/O, slow training
Solutions:
- Increase data_cache_ram_gb to fit more of the dataset
- Use a smaller dataset that fits entirely in the cache
- Consider using faster storage (NVMe SSD)