DeepLabv3+ on your own dataset

This is my first attempt at writing a blog post in English. If you find grammar mistakes or are confused by some details, please forgive and correct me.

A Chinese version of this post is available on my blog.

Installation & Setting


Dataset Preprocessing

Our task is a three-class segmentation problem. Due to a confidentiality agreement, I cannot include any original images in this blog. For the convenience of presentation, I use another dataset, CamVid (download here), to demonstrate that the training process is correct.

The main steps of dataset preprocessing are as follows:

  • Data annotation
  • Index Creation
  • TFRecord data Generation

Data annotation

First I want to explain a basic concept, ignore_label, because I wasted a lot of time here.

ignore_label

Be careful not to confuse ignore_label with background. ignore_label marks pixels in an image that we do not care about. You would typically ignore labels for areas that mark delineations between classes, or areas where the class is undefined. As a general rule, background should not be ignored; pixels marked with ignore_label are simply excluded from the loss calculation.

grey value of annotation mask

The ground-truth label should contain only one channel (a grayscale image), and .png is the recommended format.

If your dataset has n classes including background, you should label all pixels from 0 to n-1. Do not label pixels with grey values such as 10, 20, 100, etc., because the TensorFlow code matches the grey value directly to the object class, and arbitrary values will interfere with the loss calculation.

Finally, here are some details about generating the mask:

  • set all background pixels to 0
  • set objects 1 to n-1 to grey values 1 to n-1
  • if your dataset includes an ignore_label, set it to 255

The CamVid dataset has 11 classes (no ignore_label), and my dataset has 3 classes (including background, no ignore_label).
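If your annotation tool produced masks with other grey values, you can remap them to contiguous indices before training. Below is a minimal sketch, assuming single-channel PNG masks and a hypothetical grey-value mapping for a three-class dataset:

import numpy as np
from PIL import Image

# Hypothetical mapping from original grey values to class indices 0..n-1.
GREY_TO_CLASS = {0: 0, 100: 1, 150: 2}  # background, object1, object2

def remap_mask(in_path, out_path):
    mask = np.array(Image.open(in_path))  # single-channel annotation mask
    remapped = np.full_like(mask, 255)    # unexpected values fall back to ignore_label
    for grey, cls in GREY_TO_CLASS.items():
        remapped[mask == grey] = cls
    Image.fromarray(remapped).save(out_path)  # save as an 8-bit grayscale .png

remap_mask('mask_raw.png', 'mask.png')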

Here is an example:

Original image:
[figure: original CamVid image]

Mask image (since CamVid has no ignore_label, the mask contains no white pixels):
[figure: CamVid mask]

Index Creation

Index creation means splitting the dataset into three parts (train/validation/test). We need to create three .txt files that describe the split.

First, put the original images and masks into dedicated folders; here is my setup:

  • /root/dataset/CamVid/image: contains all original images (701 images)
  • /root/dataset/CamVid/mask: contains all masks (701 masks, corresponding to the original images)

Then, we create three index files in the /root/dataset/CamVid/index folder:

  • train.txt: index of the training set
  • trainval.txt: index of the validation set
  • val.txt: index of the test set

All .txt files should contain only the names of the original images (without extensions). For the CamVid dataset, a template for the .txt files can be downloaded from here; you can then use Sublime Text or another text editor to modify the .txt files.
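If you prefer to generate the index files with a script, here is a rough sketch. It assumes all 701 images sit in /root/dataset/CamVid/image and uses a simple random split, so adjust it if you want to reproduce the official CamVid split:

import os
import random

image_dir = '/root/dataset/CamVid/image'
index_dir = '/root/dataset/CamVid/index'

# Collect file names without extensions, then shuffle and split them.
names = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir))
random.seed(0)
random.shuffle(names)

splits = {
    'train': names[:367],        # 367 images
    'trainval': names[367:600],  # 233 images
    'val': names[600:],          # 101 images
}
for split, items in splits.items():
    with open(os.path.join(index_dir, split + '.txt'), 'w') as f:
        f.write('\n'.join(items) + '\n')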

Here are excerpts from train.txt and val.txt:

# train.txt
0001TP_006690
0001TP_006720
...
0016E5_08640
# 367 lines
# val.txt
0016E5_07959
...
0016E5_08157
0016E5_08159
# 101 lines

TFRecord data Generation

We use build_voc2012_data.py together with the files generated above to create the TFRecord files; refer to the commands in download_and_convert_voc2012.sh:

python ./build_voc2012_data.py \
--image_folder="${IMAGE_FOLDER}" \
--semantic_segmentation_folder="${SEMANTIC_SEG_FOLDER}" \
--list_folder="${LIST_FOLDER}" \
--image_format="jpg" \
--output_dir="${OUTPUT_DIR}"
  • ${IMAGE_FOLDER}: the folder containing the original images
  • ${SEMANTIC_SEG_FOLDER}: the folder containing the masks
  • ${LIST_FOLDER}: the folder containing the three index files
  • image_format: the format of the original images (png for CamVid)
  • output_dir: the folder where the generated TFRecord files will be saved (create it yourself)

For the CamVid dataset, the command looks like this:

# from /research/deeplab/dataset
python ./build_voc2012_data.py \
--image_folder="/root/dataset/CamVid/image" \
--semantic_segmentation_folder="/root/dataset/CamVid/mask" \
--list_folder="/root/dataset/CamVid/index" \
--image_format="png" \
--output_dir="/root/dataset/CamVid/tfrecord"

Here is the output of the build_voc2012_data.py script:

#trainval.txt
>> Converting image 59/233 shard 0
>> Converting image 118/233 shard 1
>> Converting image 177/233 shard 2
>> Converting image 233/233 shard 3
#train.txt
>> Converting image 92/367 shard 0
>> Converting image 184/367 shard 1
>> Converting image 276/367 shard 2
>> Converting image 367/367 shard 3
#val.txt
>> Converting image 26/101 shard 0
>> Converting image 52/101 shard 1
>> Converting image 78/101 shard 2
>> Converting image 101/101 shard 3

All generated TFRecord files can then be found in /root/dataset/CamVid/tfrecord:

$: ls
train-00000-of-00004.tfrecord trainval-00002-of-00004.tfrecord
train-00001-of-00004.tfrecord trainval-00003-of-00004.tfrecord
train-00002-of-00004.tfrecord val-00000-of-00004.tfrecord
train-00003-of-00004.tfrecord val-00001-of-00004.tfrecord
trainval-00000-of-00004.tfrecord val-00002-of-00004.tfrecord
trainval-00001-of-00004.tfrecord val-00003-of-00004.tfrecord

By the way, you can also download the complete index files in index.zip.
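As a quick sanity check (not part of the official pipeline), you can count the records in the generated files and compare them with the index files; a minimal sketch assuming the TensorFlow 1.x API used by the DeepLab repo:

import glob
import tensorflow as tf  # TensorFlow 1.x

# Count examples per split; they should match the number of lines in the index files.
for split in ['train', 'trainval', 'val']:
    files = glob.glob('/root/dataset/CamVid/tfrecord/%s-*.tfrecord' % split)
    n = sum(1 for f in files for _ in tf.python_io.tf_record_iterator(f))
    print(split, n)  # expect 367 / 233 / 101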


Modify training script

Based on the DeepLab repo, we mainly need to modify the following files:

  • segmentation_dataset.py file
  • train_utils.py file

segmentation_dataset.py

Around line 110 of segmentation_dataset.py, we need to add a descriptor for our dataset.

For example, the descriptor for the CamVid dataset:

# segmentation_dataset.py, around line 110
_CAMVID_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 367,  # num of samples in images/training
        'val': 101,    # num of samples in images/validation
    },
    num_classes=12,  # classes (11) + ignore_label (1)
    ignore_label=255,
)

For my own dataset, we have three classes (background, object1, object2); adding the ignore_label gives num_classes=4:

_MYDATA_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 444,  # num of samples in images/training
        'val': 46,     # num of samples in images/validation
    },
    num_classes=4,
    ignore_label=255,
)

Register the dataset:

Furthermore, around line 112 of segmentation_dataset.py, we also need to register the new dataset descriptors by name:

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'mydata': _MYDATA_INFORMATION,  # my own dataset
    'camvid': _CAMVID_INFORMATION,  # CamVid dataset
}

train_utils.py

Since num_classes may differ from the pretrained model, we need to exclude the logits layer when restoring weights, around line 109 of train_utils.py:

# Variables that will not be restored.
exclude_list = ['global_step', 'logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

Sampling imbalance

Refer to the explanation by DeepLabv3+'s first author, aquariusjay. If the data samples are strongly biased toward one of the classes, we call this imbalance.

Because this problem can be ignored on the CamVid dataset, I take my own dataset as the example here: it is a three-class task (background, object1, object2) with a serious imbalance problem.

To handle this, it is suggested to use a larger loss weight for the undersampled classes, around line 70 of train_utils.py. In my task, background pixels account for the largest proportion and object1 pixels outnumber object2 pixels, so the weight ratio is 1:10:15:

ignore_weight = 0
label0_weight = 1    # background
label1_weight = 10   # object1
label2_weight = 15   # object2

not_ignore_mask = \
    tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + \
    tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + \
    tf.to_float(tf.equal(scaled_labels, 2)) * label2_weight + \
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight

tf.losses.softmax_cross_entropy(
    one_hot_labels,
    tf.reshape(logits, shape=[-1, num_classes]),
    weights=not_ignore_mask,
    scope=loss_scope)

In this step, I used to confuse ignore_label with background. In the end, I labeled the background as 0 with weight label0_weight=1, object1 as 1 with label1_weight=10, and so on.
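Before tuning the weights, it is worth verifying that the masks really contain only the expected values (0 to n-1, plus 255 if you use an ignore_label). A small check, assuming single-channel PNG masks (the folder path is just an example):

import glob
import numpy as np
from PIL import Image

# Example mask folder for a three-class dataset; adjust to your own layout.
for path in glob.glob('/root/dataset/mydata/mask/*.png'):
    values = np.unique(np.array(Image.open(path)))
    # Expect only class indices 0..2 (and 255 if ignore_label is used).
    assert set(values.tolist()) <= {0, 1, 2, 255}, (path, values)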


Training and Visualization

Refer to the explanation by aquariusjay on GitHub.

If you want to fine-tune DeepLab on your own dataset, you can adjust a few parameters in train.py. There are several options:

  • to re-use all the trained weights, set initialize_last_layer=True
  • to re-use only the network backbone, set initialize_last_layer=False and last_layers_contain_logits_only=False
  • to re-use all the trained weights except the logits (since num_classes may differ), set initialize_last_layer=False and last_layers_contain_logits_only=True

Finally, my setting is as follows (see the flag sketch after this list):

  • initialize_last_layer=False
  • last_layers_contain_logits_only=True
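In train.py these options are boolean flags, so one way to apply this setting is to pass them on the command line rather than editing the defaults; a sketch, assuming your version of train.py defines these flags under the same names:

python deeplab/train.py \
  --initialize_last_layer=false \
  --last_layers_contain_logits_only=true \
  ...  # plus the dataset, checkpoint and crop-size flags shown in the commands below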

Preliminary training

When training on CamVid, we do not consider the imbalance problem; if your task suffers from imbalance, please refer to the Troubleshooting chapter.

Following the demo in the DeepLab repo, there are some parameters we need to modify:

  • tf_initial_checkpoint: the path of the pretrained weights. Because CamVid is similar to Cityscapes, we use weights pretrained on Cityscapes
  • train_logdir: the path where the training checkpoints will be saved
  • dataset_dir: the path of the dataset's TFRecord files
  • dataset: the name of the dataset descriptor registered in segmentation_dataset.py

The training command for CamVid is as follows:

python deeplab/train.py \
--logtostderr \
--training_number_of_steps=300 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=2 \
--dataset="camvid" \
--tf_initial_checkpoint='/root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
--train_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train' \
--dataset_dir='/root/dataset/CamVid/tfrecord'

Here the number of training steps is set to only 300, with crop_size=513 and batch_size=2, just to test whether the training command runs correctly.

Here is the output:

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 2.7773 (0.550 sec/step)
INFO:tensorflow:global step 20: loss = 2.6438 (0.531 sec/step)
INFO:tensorflow:global step 30: loss = 2.4824 (0.555 sec/step)
INFO:tensorflow:global step 40: loss = 2.4652 (0.564 sec/step)
#...
INFO:tensorflow:global step 300: loss = 1.9276 (0.534 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

Visualization

The DeepLab repo also provides evaluation and visualization tools; here we test them with the CamVid setting. Because the image size of CamVid differs from Cityscapes, the following parameters need attention:

  • vis_split: which split of the TFRecord files to visualize
  • vis_crop_size: the size of the input images (360, 480)
  • dataset: the name of the dataset descriptor in segmentation_dataset.py
  • dataset_dir: the path of the dataset's TFRecord files
  • colormap_type: the colormap used for the visualized annotations

Finally, the visualization command for CamVid is as follows:

# From tensorflow/models/research/
python deeplab/vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=360 \
--vis_crop_size=480 \
--dataset="camvid" \
--colormap_type="pascal" \
--checkpoint_dir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train' \
--vis_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/vis' \
--dataset_dir='/root/dataset/CamVid/tfrecord'

Here is the output:

INFO:tensorflow:Restoring parameters from /root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train/model.ckpt-300
INFO:tensorflow:Visualizing batch 1 / 101
INFO:tensorflow:Visualizing batch 2 / 101
INFO:tensorflow:Visualizing batch 3 / 101
...
INFO:tensorflow:Visualizing batch 100 / 101
INFO:tensorflow:Visualizing batch 101 / 101

Some selected predictions:

[figure: visualization results after 300 training steps]

We can see that the model runs correctly, although the results are not good yet.

Evaluation

Some parameters in the eval command need to be modified:

  • eval_split: which split of the TFRecord files to evaluate
  • eval_crop_size: the size of the input images (360, 480)
  • dataset: the name of the dataset descriptor in segmentation_dataset.py
  • dataset_dir: the path of the dataset's TFRecord files

Finally, the eval command for CamVid is as follows:

python deeplab/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=360 \
--eval_crop_size=480 \
--dataset="camvid" \
--checkpoint_dir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train' \
--eval_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/eval' \
--dataset_dir='/root/dataset/CamVid/tfrecord'

Here is the output:

INFO:tensorflow:Restoring parameters from /root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train/model.ckpt-300
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-07-06-06:34:28
INFO:tensorflow:Evaluation [10/101]
INFO:tensorflow:Evaluation [20/101]
#...
INFO:tensorflow:Evaluation [101/101]
INFO:tensorflow:Finished evaluation at 2018-07-06-06:34:34
miou_1.0[0.149601415]

The result is not great (mIoU=0.149), but it proves there is no major problem in our training command.

Advanced training

After the preliminary training, we refine the training command. I deleted the checkpoints saved in train_logdir and modified some parameters as follows:

  • training_number_of_steps: set to 3000
  • train_crop_size: set to 321
  • train_batch_size: increased to 4

The new training command for CamVid is as follows:

python deeplab/train.py \
--logtostderr \
--training_number_of_steps=3000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=321 \
--train_crop_size=321 \
--train_batch_size=4 \
--dataset="camvid" \
--tf_initial_checkpoint='/root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
--train_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train' \
--dataset_dir='/root/dataset/CamVid/tfrecord'

Here is the output:

INFO:tensorflow:global step 2960: loss = 0.6281 (0.407 sec/step)
INFO:tensorflow:Saving checkpoint to path /root/newP/official_tf/models-master/research/deeplab/exp/camvid_train/train/model.ckpt
INFO:tensorflow:global_step/sec: 2.45499
INFO:tensorflow:Recording summary at step 2962.
INFO:tensorflow:global step 2970: loss = 0.8240 (0.439 sec/step)
INFO:tensorflow:global step 2980: loss = 0.9588 (0.385 sec/step)
INFO:tensorflow:global step 2990: loss = 0.8880 (0.412 sec/step)
INFO:tensorflow:global step 3000: loss = 0.8292 (0.392 sec/step)

Visualization and Evaluation

Reusing eval.py for testing:

INFO:tensorflow:Evaluation [101/101]
INFO:tensorflow:Finished evaluation at 2018-07-06-07:03:30
miou_1.0[0.4015598]

As we can see, the new result (mIoU=0.401) is a significant improvement.

Reusing vis.py for visualization:

[figure: visualization results after 3000 training steps]

The new predictions look much better.


Troubleshooting

The main mistakes I made on my own dataset were as follows:

  • 1: generating the mask
  • 2: confusion between ignore_label and background
  • 3: problems in setting the weights

Generating the mask

How to generate the mask, described in an earlier chapter, is a very important step. I made the mistake of setting the pixels of different objects (including background) to 0, 100, 150, etc., which led to wrong predictions and made the imbalance problem almost impossible to solve.

Confusion between ignore_label and background

I used to mistakenly set ignore_label to 0 in segmentation_dataset.py:

# segmentation_dataset.py
_CAMVID_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 367,  # num of samples in images/training
        'val': 101,    # num of samples in images/validation
    },
    num_classes=3,
    ignore_label=0,
)

I also mistakenly set num_classes to 3.

Here is the corresponding snippet of train_utils.py:

ignore_weight = 0
label0_weight = 20
label1_weight = 20

not_ignore_mask = \
    tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight + \
    tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight + \
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight

Because ignore_label is set to 0, the background is not involved in the loss calculation at all.

As we can see in the following image (only 200 training steps), the model has learned some information, but there are still problems:

[figure: prediction after 200 training steps with the wrong ignore_label setting]

Problems in setting the weights

Prediction is always the same color

Because of mistakes in the loss calculation, there were problems with the weights of the corresponding classes:

## all blue
ignore_weight = 0
label0_weight = 10
label1_weight = 20

## all black
ignore_weight = 10
label0_weight = 10
label1_weight = 10

## all green
ignore_weight = 0
label0_weight = 20
label1_weight = 10

The prediction comes out all blue, all black, or all green when the weight of the corresponding class (object2, background, or object1) is far too large and the other classes are effectively left out of the loss. The loss contribution of the other classes then becomes negligible, so the model can reach a fairly low loss simply by predicting everything as blue/black/green.

[figure: all-blue, all-black, and all-green predictions]

Successful training

We set the weight ratio to 1:10:15; a more accurate ratio can be obtained from per-class pixel statistics (see the counting sketch after the snippet below):

ignore_weight = 0
label0_weight = 1    # background
label1_weight = 10   # object1
label2_weight = 15   # object2

not_ignore_mask = \
    tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + \
    tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + \
    tf.to_float(tf.equal(scaled_labels, 2)) * label2_weight + \
    tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight
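One way to estimate the ratio is to count pixels per class over the training masks and weight each class roughly by inverse frequency. A rough sketch, assuming single-channel PNG masks (the folder path is only an example):

import glob
from collections import Counter
import numpy as np
from PIL import Image

counts = Counter()
for path in glob.glob('/root/dataset/mydata/mask/*.png'):  # example path
    values, freqs = np.unique(np.array(Image.open(path)), return_counts=True)
    counts.update(dict(zip(values.tolist(), freqs.tolist())))

counts.pop(255, None)  # drop ignore_label pixels, they get weight 0 anyway
max_count = float(max(counts.values()))
for cls, n in sorted(counts.items()):
    # Inverse-frequency weight, normalized so the most common class gets 1.
    print(cls, round(max_count / n, 2))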

Here is a comparison of the test results at 4000 steps and at 200 steps:

[figure: predictions at 4000 steps vs. 200 steps]


References

aquariusjay's explanation of the training parameters:

[screenshot]

aquariusjay's explanation of the imbalance problem:

[screenshot]

GitHub user @zhaolewen's explanation of the annotation mask:

[screenshot]

Thanks for your support!