Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/gzj2013/article/details/82425954
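How are the per-class counts in the COCO dataset actually distributed?
If the per-class counts in a dataset differ widely, does that affect the training of a deep learning model, and why?
If it does, how should it be handled?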
Preliminary analysis
The following code prints, for each of the 80 classes in COCO val2017, the number of images and the number of annotation boxes.

    from pycocotools.coco import COCO

    dataDir = '/path/to/your/cocoDataset'
    dataType = 'val2017'
    annFile = '{}/annotations/instances_{}.json'.format(dataDir, dataType)

    # Initialize the COCO API for instance annotations
    coco = COCO(annFile)

    # Display the COCO category names
    cats = coco.loadCats(coco.getCatIds())
    cat_nms = [cat['name'] for cat in cats]
    print('COCO categories: \n{}\n'.format(' '.join(cat_nms)))

    # Count the images and annotation boxes for each category
    for cat_name in cat_nms:
        # Pass the name in a list: a bare string may be matched by substring
        # under Python 3 (e.g. 'hot dog' would also select 'dog').
        catId = coco.getCatIds(catNms=[cat_name])
        imgId = coco.getImgIds(catIds=catId)
        annId = coco.getAnnIds(imgIds=imgId, catIds=catId, iscrowd=None)
        print("{:<15} {:<6d} {:<10d}".format(cat_name, len(imgId), len(annId)))
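How the category name is passed to the lookup matters in a loop like this one. The sketch below does not call pycocotools. It uses a hypothetical helper, `match_names`, that assumes the lookup filters names with a test of the form `name in catNms`. Under that assumption, a bare string turns the test into a substring check, while a one-element list keeps it exact.

```python
# Hypothetical stand-in for a name filter of the form `name in catNms`.
# This is an assumption about the lookup, not pycocotools code.
def match_names(all_names, cat_nms):
    return [name for name in all_names if name in cat_nms]

# A few COCO category names where one name contains another.
names = ['car', 'carrot', 'dog', 'hot dog', 'bear', 'teddy bear']

# Bare string: `in` is a substring test, so shorter names also match.
loose = match_names(names, 'hot dog')

# One-element list: `in` is an exact membership test.
exact = match_names(names, ['hot dog'])

print(loose)
print(exact)
```

If the real lookup behaves like this helper, a query for 'hot dog' would also pull in 'dog', and the later image and box counts would mix the two categories.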
The run printed the following counts:
category | images | annotation boxes | category | images | annotation boxes |
---|---|---|---|---|---|
person | 2693 | 11004 | bicycle | 149 | 316 |
car | 535 | 1932 | motorcycle | 159 | 371 |
airplane | 97 | 143 | bus | 189 | 285 |
train | 157 | 190 | truck | 250 | 415 |
boat | 121 | 430 | traffic light | 191 | 637 |
fire hydrant | 86 | 101 | stop sign | 69 | 75 |
parking meter | 37 | 60 | bench | 235 | 413 |
bird | 125 | 440 | cat | 184 | 202 |
dog | 177 | 218 | horse | 128 | 273 |
sheep | 65 | 361 | cow | 87 | 380 |
elephant | 89 | 255 | bear | 49 | 71 |
zebra | 85 | 268 | giraffe | 101 | 232 |
backpack | 228 | 371 | umbrella | 174 | 413 |
handbag | 292 | 540 | tie | 145 | 254 |
suitcase | 105 | 303 | frisbee | 84 | 115 |
skis | 120 | 241 | snowboard | 49 | 69 |
sports ball | 169 | 263 | kite | 91 | 336 |
baseball bat | 97 | 146 | baseball glove | 100 | 148 |
skateboard | 127 | 179 | surfboard | 149 | 269 |
tennis racket | 167 | 225 | bottle | 379 | 1025 |
wine glass | 110 | 343 | cup | 390 | 899 |
fork | 155 | 215 | knife | 181 | 326 |
spoon | 153 | 253 | bowl | 314 | 626 |
banana | 103 | 379 | apple | 76 | 239 |
sandwich | 98 | 177 | orange | 85 | 287 |
broccoli | 71 | 316 | carrot | 3 | 17 |
hot dog | 0 | 345 | pizza | 153 | 285 |
donut | 62 | 338 | cake | 124 | 316 |
chair | 580 | 1791 | couch | 195 | 261 |
potted plant | 172 | 343 | bed | 149 | 163 |
dining table | 501 | 697 | toilet | 149 | 179 |
tv | 207 | 288 | laptop | 183 | 231 |
mouse | 88 | 106 | remote | 145 | 283 |
keyboard | 106 | 153 | cell phone | 214 | 262 |
microwave | 54 | 55 | oven | 115 | 143 |
toaster | 8 | 9 | sink | 187 | 225 |
refrigerator | 101 | 126 | book | 230 | 1161 |
clock | 204 | 267 | vase | 137 | 277 |
scissors | 28 | 36 | teddy bear | 0 | 262 |
hair drier | 9 | 11 | toothbrush | 34 | 57 |
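As the table shows, both the number of images and the number of annotation boxes vary widely across classes. The 'person' class stands out: with 11004 boxes it has about 30 to 50 times as many as most classes, which typically have between 200 and 400.

Three rows should not be taken at face value: carrot (3 images, 17 boxes), hot dog (0 images, 345 boxes), and teddy bear (0 images, 262 boxes). Each of these names contains another category name (car, dog, bear), and the numbers are consistent with the name having been passed to `getCatIds` as a bare string (`catNms=cat_name`), in which case the lookup likely matched both categories. Passing a one-element list (`catNms=[cat_name]`) should give exact matches.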
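The size of the gap can be checked directly from the table. The sketch below copies the box counts of a hand-picked subset of val2017 rows (the choice of rows is arbitrary and only for illustration) and prints how many times larger the 'person' count is than each of the others.

```python
# Box counts copied from a hand-picked subset of the val2017 table.
box_counts = {
    'person': 11004, 'car': 1932, 'chair': 1791, 'book': 1161,
    'bottle': 1025, 'cup': 899, 'dog': 218, 'cat': 202,
    'stop sign': 75, 'microwave': 55, 'hair drier': 11, 'toaster': 9,
}

largest = max(box_counts, key=box_counts.get)
smallest = min(box_counts, key=box_counts.get)

# How many times more boxes the largest class has than each other class.
ratios = {name: box_counts[largest] / n
          for name, n in box_counts.items() if name != largest}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print('{:<12} {:>8.1f}x'.format(name, ratio))
```

For mid-sized classes such as dog and cat the ratio is around 50, while for the rarest classes in this subset it exceeds 1000.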
Class distribution in train2017
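The table below gives the number of images and annotation boxes for each of the 80 classes in COCO train2017. The carrot, hot dog, and teddy bear rows are again far below comparable classes and are likely affected by name matching in the category lookup rather than reflecting the real counts.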
category | images | annotation boxes | category | images | annotation boxes |
---|---|---|---|---|---|
person | 64115 | 262465 | bicycle | 3252 | 7113 |
car | 12251 | 43867 | motorcycle | 3502 | 8725 |
airplane | 2986 | 5135 | bus | 3952 | 6069 |
train | 3588 | 4571 | truck | 6127 | 9973 |
boat | 3025 | 10759 | traffic light | 4139 | 12884 |
fire hydrant | 1711 | 1865 | stop sign | 1734 | 1983 |
parking meter | 705 | 1285 | bench | 5570 | 9838 |
bird | 3237 | 10806 | cat | 4114 | 4768 |
dog | 4385 | 5508 | horse | 2941 | 6587 |
sheep | 1529 | 9509 | cow | 1968 | 8147 |
elephant | 2143 | 5513 | bear | 960 | 1294 |
zebra | 1916 | 5303 | giraffe | 2546 | 5131 |
backpack | 5528 | 8720 | umbrella | 3968 | 11431 |
handbag | 6841 | 12354 | tie | 3810 | 6496 |
suitcase | 2402 | 6192 | frisbee | 2184 | 2682 |
skis | 3082 | 6646 | snowboard | 1654 | 2685 |
sports ball | 4262 | 6347 | kite | 2261 | 9076 |
baseball bat | 2506 | 3276 | baseball glove | 2629 | 3747 |
skateboard | 3476 | 5543 | surfboard | 3486 | 6126 |
tennis racket | 3394 | 4812 | bottle | 8501 | 24342 |
wine glass | 2533 | 7913 | cup | 9189 | 20650 |
fork | 3555 | 5479 | knife | 4326 | 7770 |
spoon | 3529 | 6165 | bowl | 7111 | 14358 |
banana | 2243 | 9458 | apple | 1586 | 5851 |
sandwich | 2365 | 4373 | orange | 1699 | 6399 |
broccoli | 1939 | 7308 | carrot | 24 | 142 |
hot dog | 11 | 29 | pizza | 3166 | 5821 |
donut | 1523 | 7179 | cake | 2925 | 6353 |
chair | 12774 | 38491 | couch | 4423 | 5779 |
potted plant | 4452 | 8652 | bed | 3682 | 4192 |
dining table | 11837 | 15714 | toilet | 3353 | 4157 |
tv | 4561 | 5805 | laptop | 3524 | 4970 |
mouse | 1876 | 2262 | remote | 3076 | 5703 |
keyboard | 2115 | 2855 | cell phone | 4803 | 6434 |
microwave | 1547 | 1673 | oven | 2877 | 3334 |
toaster | 217 | 225 | sink | 4678 | 5610 |
refrigerator | 2360 | 2637 | book | 5332 | 24715 |
clock | 4659 | 6334 | vase | 3593 | 6613 |
scissors | 947 | 1481 | teddy bear | 16 | 92 |
hair drier | 189 | 198 | toothbrush | 1007 | 1954 |
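Both tables can be cross-checked without any lookup by name. The sketch below counts boxes and distinct images per `category_id` straight from a COCO-style annotation dictionary (the `categories` and `annotations` keys of the instances JSON file). The function name `count_per_category` and the toy input are made up for illustration; the toy data is not real COCO data.

```python
from collections import Counter, defaultdict

def count_per_category(dataset):
    """Return {category name: (image count, box count)} for a COCO-style dict."""
    id_to_name = {cat['id']: cat['name'] for cat in dataset['categories']}
    box_counts = Counter()
    image_ids = defaultdict(set)
    for ann in dataset['annotations']:
        box_counts[ann['category_id']] += 1
        image_ids[ann['category_id']].add(ann['image_id'])
    return {name: (len(image_ids[cid]), box_counts[cid])
            for cid, name in id_to_name.items()}

# Tiny made-up dataset (not real COCO data) to show the shape of the input.
toy = {
    'categories': [{'id': 18, 'name': 'dog'}, {'id': 58, 'name': 'hot dog'}],
    'annotations': [
        {'image_id': 1, 'category_id': 18},
        {'image_id': 1, 'category_id': 18},
        {'image_id': 2, 'category_id': 58},
    ],
}

counts = count_per_category(toy)
for name, (n_images, n_boxes) in counts.items():
    print('{:<15} {:<6d} {:<10d}'.format(name, n_images, n_boxes))
```

To run it on the real data, load the annotation file with `json.load` and pass the resulting dictionary to the function. Because categories are keyed by id, names that contain other names (such as 'hot dog' and 'dog') stay separate.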