1. Install dependencies and prepare the data
Before starting training, install the required dependencies, such as PyTorch, CUDA, and cuDNN, and prepare the training data. For details on these steps, see the official DDRNet documentation.
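Before moving on, you can quickly confirm that PyTorch sees the GPU and that CUDA and cuDNN are loaded. The following is a minimal sanity-check sketch, not part of the DDRNet repository:
import torch

# Minimal environment check: confirm PyTorch, CUDA and cuDNN are visible
print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('CUDA version:', torch.version.cuda)
    print('cuDNN version:', torch.backends.cudnn.version())
    print('GPU:', torch.cuda.get_device_name(0))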
2. Modify the configuration file. Specify parameters such as the dataset path, learning rate, batch size, and number of training iterations in the configuration file. For single-GPU training, set batch_size to a smaller value so that it fits in the memory of a single card, and set num_gpus to 1 to indicate that only one GPU will be used. A sample configuration file is shown below; remember to adjust parameters such as the dataset path:
# train config
DATASET:
  ROOT: './data'
  NAME: 'cityscapes'
  TRAIN_SET: 'train'
  TEST_SET: 'val'
  INPUT_SIZE: [769, 769]
  BASE_SIZE: 769
  SCALE_FACTOR: [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
  CROP_SIZE: [769, 769]
  IGNORE_LABEL: 255
  MEAN: [0.485, 0.456, 0.406]
  STD: [0.229, 0.224, 0.225]
  NUM_CLASSES: 19
TRAIN:
  MULTI_SCALE: True
  FLIP: True
  IGNORE_LABEL: 255
  BASE_SIZE: 769
  CROP_SIZE: [769, 769]
  BATCH_SIZE_PER_GPU: 2
  NUM_WORKERS: 4
  LEARNING_RATE: 0.001
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.0001
  POWER: 0.9
  MAX_ITER: 80000
  WARMUP_STEPS: 1000
  WARMUP_FACTOR: 0.3333  # 1/3; YAML does not evaluate the expression "1.0 / 3.0"
  SAVE_PRED_EVERY: 10000
  SNAPSHOT_DIR: './snapshots/'
  LOG_DIR: './logs/'
  DISPLAY_INTERVAL: 20
  GPU_ID: 0
  NUM_GPUS: 1
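Once the file is edited, it helps to load it back and confirm that the dataset path and GPU settings are what you expect. The snippet below is a minimal sketch that assumes the configuration is saved as config.yaml and parsed with PyYAML; the official DDRNet code may use its own configuration loader:
import yaml

# Load the configuration back to check the values (assumes it was saved as config.yaml)
with open('config.yaml') as f:
    cfg = yaml.safe_load(f)
print('Dataset root:', cfg['DATASET']['ROOT'])
print('Batch size per GPU:', cfg['TRAIN']['BATCH_SIZE_PER_GPU'])
print('Number of GPUs:', cfg['TRAIN']['NUM_GPUS'])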
3. Modify the train.py file. train.py is the DDRNet training script; add the following code to it:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
# Initialize the process group (env:// requires MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, e.g. set by torchrun)
dist.init_process_group('nccl', init_method='env://')
# Modify the model instantiation to use DistributedDataParallel
model = DDRNet(num_classes=args.num_classes, pretrained=True)
model = DistributedDataParallel(model.cuda())
These changes initialize the process group and wrap the model with DistributedDataParallel so that it can be trained with distributed data parallelism.
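When a model is wrapped in DistributedDataParallel, the DataLoader usually also needs a DistributedSampler so that each process reads its own shard of the data; with a single GPU there is only one shard, but the setup is the same. The sketch below illustrates the idea with a placeholder dataset and epoch count; it is not the exact DDRNet training loop:
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# torchrun normally sets these variables; the defaults below make a
# single-process, single-GPU run work when the script is launched directly.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')
dist.init_process_group('nccl', init_method='env://')

# Placeholder dataset standing in for the Cityscapes training set
train_dataset = TensorDataset(torch.randn(8, 3, 769, 769),
                              torch.zeros(8, 769, 769, dtype=torch.long))

# DistributedSampler splits the data across processes (a single shard here)
sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset,
                          batch_size=2,   # matches BATCH_SIZE_PER_GPU
                          sampler=sampler,
                          num_workers=4,
                          pin_memory=True)

for epoch in range(2):        # placeholder epoch count
    sampler.set_epoch(epoch)  # reshuffle the shard each epoch
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        # the forward/backward pass with the DDP-wrapped model goes here

dist.destroy_process_group()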