DDRNet network - single-GPU training

dnnyal224 Registered member
2023-02-27 13:40
DDRNet is a deep learning network for semantic segmentation, and training it requires a GPU. Training DDRNet on a single GPU requires a few changes to the code and settings. Here are the basic steps and code modification instructions.

1. Install dependency libraries and prepare data
Before starting training, install the required dependencies, such as PyTorch, CUDA, and cuDNN, and prepare the training data. For details on these steps, see the official DDRNet documentation.

2. Modify the configuration file
Specify parameters such as the dataset path, learning rate, batch size, and number of training epochs in the configuration file. For single-GPU training, set batch_size to a smaller value so that it fits in the memory of one GPU, and set num_gpus to 1 to indicate that only one GPU is used. Here is a sample configuration file; take care to modify parameters such as the dataset path:

# train config
  ROOT: './data'
  NAME: 'cityscapes'
  TRAIN_SET: 'train'
  TEST_SET: 'val'
  INPUT_SIZE: [769, 769]
  BASE_SIZE: 769
  SCALE_FACTOR: [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
  CROP_SIZE: [769, 769]
  MEAN: [0.485, 0.456, 0.406]
  STD: [0.229, 0.224, 0.225]

  FLIP: True
  WEIGHT_DECAY: 0.0001
  POWER: 0.9
  MAX_ITER: 80000
  WARMUP_FACTOR: 0.3333  # 1/3; assuming a YAML loader, "1.0 / 3.0" would be read as a string
  SNAPSHOT_DIR: './snapshots/'
  LOG_DIR: './logs/'
  GPU_ID: 0
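
The POWER and MAX_ITER entries look like the parameters of a polynomial ("poly") learning-rate decay, a schedule often used for semantic segmentation. The sketch below assumes that interpretation. The function name poly_lr and the base learning rate of 0.01 are illustrative and do not come from the configuration file.

```python
def poly_lr(base_lr, cur_iter, max_iter=80000, power=0.9):
    """Polynomial decay: base_lr * (1 - cur_iter / max_iter) ** power.

    max_iter and power default to the MAX_ITER and POWER values in the
    sample configuration; base_lr is an assumed value supplied by the caller.
    """
    if not 0 <= cur_iter <= max_iter:
        raise ValueError("cur_iter must be in [0, max_iter]")
    return base_lr * (1.0 - cur_iter / max_iter) ** power


if __name__ == "__main__":
    # Print the learning rate at a few points of the schedule.
    for it in (0, 20000, 40000, 80000):
        print(it, poly_lr(0.01, it))
```

The rate starts at base_lr and falls to zero at MAX_ITER, so shortening MAX_ITER for a quick single-GPU run also compresses the whole schedule.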

3. Modify the train.py file
The DDRNet training script is train.py. In train.py, add the following code:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Add this line to initialize the process group
dist.init_process_group('nccl', init_method='env://')

# Modify the model instantiation to use DistributedDataParallel
model = DDRNet(num_classes=args.num_classes, pretrained=True)
model = DistributedDataParallel(model.cuda())

These changes initialize the process group and wrap the model in DistributedDataParallel.
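
With init_method='env://', PyTorch reads the rendezvous settings from the environment variables MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE. A launcher normally sets them. If train.py is started directly with python, they have to be provided some other way. A minimal sketch, assuming a single process on one machine (the helper name single_process_env and the port 29500 are illustrative choices):

```python
import os


def single_process_env(port=29500):
    """Return the variables that init_method='env://' expects for one process."""
    return {
        "MASTER_ADDR": "127.0.0.1",  # rendezvous on the local machine
        "MASTER_PORT": str(port),    # any free TCP port
        "RANK": "0",                 # this is the only process
        "WORLD_SIZE": "1",           # one process in total
    }


# Keep values that a launcher has already exported; fill in the rest.
for key, value in single_process_env().items():
    os.environ.setdefault(key, value)

# After this point, the call from the answer can run in a single process:
# dist.init_process_group('nccl', init_method='env://')
```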

ddzx007 Registered member
2023-02-27 13:40

DDRNet is a deep learning model for image segmentation, and it can be trained with the PyTorch framework. Here are the steps and terminal commands for training DDRNet on a single GPU.

1. First, make sure the PyTorch framework and other required dependencies are installed. You can check that PyTorch is installed using the following command:

pip list | grep torch

2. Download the DDRNet source code and dataset. You can clone the DDRNet GitHub repository using the following command:

git clone https://github.com/megvii-model/DDRNet.git

3. Go to the DDRNet directory and install the required dependencies:

cd DDRNet
pip install -r requirements.txt

4. Modify the train.py file so that training runs on a single GPU. Open train.py and find the following line:

dist = init_dist(args.world_size, args.dist_backend)

Replace it with the following code:

dist = None

This disables distributed training and limits training to a single GPU.
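
Setting dist to None only works if every later use of dist in the script is guarded, because a call such as dist.barrier() on None raises an AttributeError. A minimal sketch of such a guard, assuming the object returned by init_dist exposes a barrier() method like torch.distributed does (the helper name sync_if_distributed is hypothetical and not part of the repository):

```python
def sync_if_distributed(dist):
    """Synchronize processes only when a distributed handle exists.

    Returns True if a barrier was executed, False on a single-GPU run.
    """
    if dist is None:
        return False  # single GPU: there is nothing to synchronize
    dist.barrier()    # assumed to match the torch.distributed.barrier() API
    return True


dist = None  # the single-GPU setting from the step above
sync_if_distributed(dist)  # safe no-op instead of an AttributeError
```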

5. Run the training. Start training on a single GPU using the following command:

python train.py --dataset cityscapes --train-split train --val-split val --backbone dlaup --lr 0.01 --batch-size 8 --base-size 1024 --crop-size 768 --workers 4 --epochs 240 --gpu-ids 0 --save-dir ./save/

The --gpu-ids parameter specifies the index of the GPU to use; here it is 0. You can change the other parameters, such as the learning rate and batch size, as needed.

6. Observe the training process. During training, the console prints the training and validation losses. You can also use TensorBoard to view training progress and visualize results. Run the following command to start TensorBoard:

tensorboard --logdir ./save/runs

7. Save the model. After training is complete, the model is saved in the specified directory (here, ./save/). You can convert the model to ONNX format using the following command:

python tools/convert_model.py --dataset cityscapes --model-dir ./save/runs/checkpoint_240.pth --output-file ./save/runs/ddrnet_cityscapes.onnx

Here, the --model-dir parameter specifies the path to the saved model file, and the --output-file parameter specifies the path of the output ONNX file.

Hopefully these steps and commands help you train DDRNet on a single GPU.

dishasi Registered member
2023-02-27 13:40

This answer quotes ChatGPT

DDRNet is a deep learning model for semantic segmentation tasks. To train DDRNet on a single GPU, you need to modify the code to select that GPU and to set the batch size for it.

Here are the steps to train DDRNet on a single GPU from an Ubuntu terminal:

1. Verify that CUDA and cuDNN are installed, and install the necessary Python libraries. You can install these libraries using the following command:

pip install torch torchvision opencv-python pillow tensorboard tensorboardX yacs cython matplotlib

2. Clone the DDRNet repository. You can clone the code repository using the following command:

git clone https://github.com/ydhongHIT/DDRNet.pytorch.git

3. Go to the DDRNet repository directory and the train folder:

cd DDRNet.pytorch
cd train

4. Open the train.py file and find the following code snippet:

train_cfg = cfg.train_cfg
train_dataset = build_dataset(cfg.data.train)
train_loader = build_dataloader(

5. Modify the train_loader parameters to set the batch size for a single GPU. For example, to train on a single GPU with a batch size of 2, change the code to:

train_loader = build_dataloader(

6. Save the train.py file and close it.

7. Open a terminal, go to the train folder, and run the following command:

CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/DDRNet/DDRNet23_slim.yaml

CUDA_VISIBLE_DEVICES=0 indicates that the first GPU is used, and configs/DDRNet/DDRNet23_slim.yaml is the configuration file of the model. Depending on your model and requirements, you can change the name and path of this configuration file.
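
The same restriction can be applied from inside Python instead of on the command line, as long as the variable is set before CUDA is initialized (in practice, before torch is imported). A minimal sketch, assuming you control the top of the training script (the helper name restrict_to_gpu is illustrative):

```python
import os


def restrict_to_gpu(index, env=None):
    """Expose only one physical GPU to the current process.

    Must run before any CUDA initialization; the chosen GPU is then
    visible inside the process as device 0.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env["CUDA_VISIBLE_DEVICES"]


restrict_to_gpu(0)  # equivalent to the CUDA_VISIBLE_DEVICES=0 prefix
# import torch      # import torch only after the variable is set
```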

cyf8448 Registered member
2023-02-27 13:40
dengwei99 Registered member
2023-02-27 13:40
To train the DDRNet network on a single GPU, modify the corresponding code and enter the corresponding command in the terminal.

First, in the training script of the DDRNet network, find the following code:

if __name__ == '__main__':
    # Train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

This code is used to determine whether to train on the GPU or CPU. Change it to:

if __name__ == '__main__':
    # Pin training to one specific GPU
    device = torch.device('cuda:1')  # changed to cuda:1, i.e. the second GPU

This makes DDRNet train on one specific GPU.
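
Hard-coding 'cuda:1' fails on a machine that has only one GPU, because index 1 does not exist there. A minimal sketch of a safer selection, assuming the GPU count is obtained from torch.cuda.device_count() (the helper name device_string is illustrative):

```python
def device_string(gpu_index, gpu_count):
    """Build the device name for torch.device, with basic validation.

    Falls back to the CPU when no GPU is available and rejects an
    index that does not exist on this machine.
    """
    if gpu_count <= 0:
        return "cpu"
    if not 0 <= gpu_index < gpu_count:
        raise ValueError(
            f"GPU index {gpu_index} is out of range for {gpu_count} GPU(s)"
        )
    return f"cuda:{gpu_index}"


# Intended use inside the training script:
# device = torch.device(device_string(1, torch.cuda.device_count()))
```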

Then in the terminal, type the following command:

python train.py --data-dir= --lr= --batch-size= --num-epochs= --gpu=1

In the command, fill in the values you want for the four arguments --data-dir, --lr, --batch-size, and --num-epochs. --gpu=1 indicates that the second GPU is used.

Finally, press Enter to run the command and start training the DDRNet network on the selected GPU.
If the answer is helpful, please accept it.

jingzaie Registered member
2023-02-27 13:40

1 Modify parameters in train.py
In the train.py file, modify some of the parameters in the train function, for example batch_size, learning_rate, momentum, and weight_decay. Adjust them as needed so that training fits on a single GPU.
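
When batch_size is reduced to fit one GPU, a common heuristic is to scale the learning rate by the same factor (the linear scaling rule). This is a general rule of thumb, not something taken from the DDRNet code, and the reference values below are illustrative assumptions:

```python
def scaled_lr(reference_lr, reference_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size."""
    if reference_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * new_batch_size / reference_batch_size


if __name__ == "__main__":
    # Example: a setup tuned for batch size 8 at lr 0.01, reduced to batch size 2.
    print(scaled_lr(0.01, 8, 2))
```

Treat the result as a starting point and check it against the validation loss, since the rule is only approximate for very small batches.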

2 Modify the train.sh file.
In the train.sh file, set the --world-size parameter to 1 to indicate that only a single GPU is used for training. Also set the --rank parameter to 0, since the only process has rank 0.

The modified train.sh file is as follows:

set -x

python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 \
    train.py --data-dir $DATA_DIR --save-dir $SAVE_DIR \
    --lr 0.01 --batch-size 2 --momentum 0.9 --weight-decay 0.0001 \
    --backbone resnet50 --output-stride 8 \
    --train-split train --val-split val --train-crop-size 512 \
    --epochs 50 --lr-scheduler poly \
    --tensorboard True --distributed False --world-size 1 --rank 0

3 Modify the torch.distributed.launch command
In the terminal, use the torch.distributed.launch command to start training. Set the --master_port parameter to an unused port number to avoid port conflicts.

The modified command is as follows:

python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 train.py ...

Finally, start training by running the modified train.sh file or executing the modified command in the terminal.

Note: To train DDRNet on a single GPU, make sure that PyTorch and the other necessary dependencies are installed on the machine. Also make sure the hardware, in particular the GPU memory, can handle the chosen settings so that single-GPU training stays stable and efficient.
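
When train.py is started through torch.distributed.launch as above, each process learns which GPU to use from its local rank. Older launcher versions pass it as a --local_rank argument (newer ones use --local-rank), while torchrun exports a LOCAL_RANK environment variable. A minimal sketch that accepts all three, assuming train.py does not already parse this value (the helper name resolve_local_rank is hypothetical):

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank from the command line or the environment.

    Falls back to 0, which is the right value for a single-GPU run.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # parse_known_args ignores the other training flags such as --lr.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(env.get("LOCAL_RANK", 0))
```

With --nproc_per_node=1 the result is always 0, so the script can use it directly as the GPU index.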

Question Info

Publish Time
2023-02-27 13:39
Update Time
2023-02-27 13:39
