
VOCSegmentation, VOCDetection, linting passing, examples. #663

Merged · 8 commits · Dec 6, 2018

Conversation

@bpinaya (Contributor) commented Nov 19, 2018

Hey there guys, I hope this PR can revive some of the efforts that past PRs implemented, namely #37 and #86.

This PR supports VOCSegmentation and VOCDetection for multiple years (2007 to 2012); 2005 and 2006 are not supported due to their very different format.
I tried to address many of the comments from the past two PRs. For the VOCDetection part, __getitem__ returns the image and a list of bounding boxes in the format [[xmin, ymin, xmax, ymax, ind], [...], ...] directly, as in PR #37, instead of an ET Element.
I think VOC is a very iconic dataset for both detection and segmentation, so it'd be very useful to have it ready to use in pytorch/vision.

If there are any comments or suggestions, let me know; I can put some time into implementing them.
There are two examples (Jupyter notebook gists) of the dataset being loaded:

I've tested all the years. Regarding the transforms, I'm using the same structure other datasets use. I've seen implementations where a joint_transform is passed as an argument and applied to both input and target; I guess that'd work for segmentation but not so much for detection, so it's not used here (see the sketch below).
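For context, a joint transform of the kind described above typically looks like this minimal sketch (a hypothetical class, not part of this PR): one random decision drives the transform on both the image and its mask, so the segmentation labels stay aligned with the pixels.

    import random
    from PIL import Image

    class JointRandomHorizontalFlip:
        """Hypothetical joint transform: flip image and mask together."""
        def __init__(self, p=0.5):
            self.p = p

        def __call__(self, img, mask):
            # A single random draw flips both, keeping labels aligned.
            if random.random() < self.p:
                img = img.transpose(Image.FLIP_LEFT_RIGHT)
                mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
            return img, mask

For detection, the same flip would also have to rewrite the box x-coordinates, which is why a plain joint transform does not carry over cleanly.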

@adamjstewart (Contributor) left a comment

This looks great! I was just about to try to revive #37 and #86 myself. Let me know if there's anything I can do in terms of testing to help get this PR merged sooner!

    Args:
        root (string): Root directory of the VOC Dataset.
        year (string, optional): The dataset year, supports years 2007 to 2012.
        image_set (string, optional): Select the image_set to use, ``train``, ``trainval`` or ``val``


@fmassa (Member) commented Nov 20, 2018

Hi,

I'm just about to get facebookresearch/maskrcnn-benchmark#131 merged, and I'll be pushing a more general version of this here. I can coordinate with you guys so that we can get this finally merged.

@bpinaya (Contributor, Author) commented Nov 20, 2018

Hi @fmassa, just checked the PR, seems really robust. Keep me in the loop if you need help with testing or implementation.

@fmassa (Member) commented Nov 23, 2018

I've just finished testing it and got some reasonable results with the codebase. I'll look into merging the PR todayish.

@bpinaya (Contributor, Author) commented Nov 23, 2018

Looking forward to it!

@ellisbrown left a comment

A small suggestion to make usage of DATASET_YEAR_DICT more straightforward.

ellisbrown and others added 2 commits on November 28, 2018:
Add suggestions from @ellisbrown, using a dict of dicts instead of array indexing.
Co-Authored-By: bpinaya <bpg_92@hotmail.com>
@bpinaya (Contributor, Author) commented Dec 4, 2018

Hey @fmassa , any updates?

@fmassa (Member) left a comment

Hi,

I've left a few more comments.

The most relevant one is that I think we should give the user the possibility to obtain all the information that is available in the XML.

What do you think?

        relative coordinates like ``[[xmin, ymin, xmax, ymax, ind], [...], ...]``.
        """
        _img = Image.open(self.images[index]).convert('RGB')
        _target = self._get_bboxes(ET.parse(self.annotations[index]).getroot())
@fmassa (Member) commented on the diff:

Can we instead return the raw result from ET.parse().getroot()? Or maybe a dict with the full parsing of the xml?
It's up to the user to decide what they actually want for their target, and we are forcing them to use one format now (which doesn't hold all the information from the dataset, such as truncated / occluded / etc).

    def __len__(self):
        return len(self.images)

    def _get_bboxes(self, target):
@fmassa (Member) commented on the diff:

I'd probably let the user write this themselves. Maybe the most user-friendly option would be to parse the ET and return a nested dict?

@bpinaya (Contributor, Author) commented Dec 5, 2018

Indeed, I thought about it; maybe a dict would be better, but I saw here that they pass a Bb class. I think whatever is easiest for the end user. Regarding the iteration of the ET, any ideas to make it recursive and elegant? I implemented a function but it's way too hacky.

@fmassa (Member):

In maskrcnn-benchmark, we have a dedicated class BoxList which is used everywhere in the codebase.
My original plan was to move BoxList to torchvision, but it needs to mature a bit more before we move it here.

About the ET, I suppose there is a function we can use to walk it recursively? Probably something like this that I wrote for Lua.


        _splits_dir = os.path.join(_voc_root, 'ImageSets/Main')

        _split_f = os.path.join(_splits_dir, image_set.rstrip('\n') + '.txt')
@fmassa (Member):

do we need the rstrip? just out of curiosity

        image_set='train',
        download=False,
        class_to_ind=None,
        keep_difficult=False,
@fmassa (Member):

I think keep_difficult could be part of what the user passes to the target_transform.

        self.images = []
        self.annotations = []
        with open(os.path.join(_split_f), "r") as lines:
            for line in lines:
@fmassa (Member):

Can't we instead do

with open(os.path.join(_split_f), "r") as f:
    image_names = f.readlines()

? I believe this strips out the \n in the end, and is a bit faster than the current version.

@bpinaya (Contributor, Author):

I've checked, and it doesn't strip the \n. The value of image_names is:

['2008_000008\n', '2008_000015\n', '2008_000019\n', '2008_000023\n', '2008_000028\n', '2008_000033\n', '2008_000036\n', '2008_000037\n', '2008_000041\n', '2008_000045\n', '2008_000053\n', '2008_000060\n', '2008_000066\n', '2008_000070\n', '2008_000074\n', '2008_000085\n', '2008_000089\n', '2008_000093\n', '2008_000095\n', '2008_000096\n', '2008_000097\n', '2008_000099\n', '2008_000103\n', '2008_000105\n', '2008_000109\n', '2008_000112\n'...

@fmassa (Member):

OK, I think what I had done then was something like

with open(os.path.join(_split_f), "r") as f:
    image_names = [x.strip() for x in f.readlines()]

but do it however you think is best.

@adamjstewart (Contributor) commented:

Can you also update docs/source/datasets.rst so that this new Dataset appears in the documentation?

@bpinaya (Contributor, Author) commented Dec 5, 2018

I think the better approach would be to pass a dict of the parsed ET.parse().getroot(). I'm struggling a bit to implement a recursive function to parse it correctly, so if anyone has tips, they're appreciated.
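For reference, one possible shape for such a recursive parser — a minimal sketch (hypothetical name voc_xml_to_dict), not necessarily the exact code this PR ended up merging:

    import collections
    import xml.etree.ElementTree as ET

    def voc_xml_to_dict(node):
        # Leaf node: return its text content.
        children = list(node)
        if not children:
            return {node.tag: node.text.strip() if node.text else ''}
        # Group repeated child tags (e.g. multiple <object>) into lists.
        grouped = collections.defaultdict(list)
        for child in children:
            for tag, value in voc_xml_to_dict(child).items():
                grouped[tag].append(value)
        # A tag appearing once stays a plain value, so a lone <object>
        # comes back as a dict rather than a one-element list.
        return {node.tag: {tag: values[0] if len(values) == 1 else values
                           for tag, values in grouped.items()}}

    # Usage: voc_xml_to_dict(ET.parse(annotation_path).getroot())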

@bpinaya (Contributor, Author) commented Dec 6, 2018

@fmassa I followed your recommendations and added this instead for loading:

        with open(os.path.join(_split_f), "r") as f:
            file_names = [x.strip() for x in f.readlines()]

        self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
        self.annotations = [os.path.join(_annotation_dir, x + ".xml") for x in file_names]

That goes in the detection part, and this in the segmentation part:

        with open(os.path.join(_split_f), "r") as f:
            file_names = [x.strip() for x in f.readlines()]

        self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
        self.masks = [os.path.join(_mask_dir, x + ".png") for x in file_names]

I removed keep_difficult and _get_bboxes() and replaced them with parse_voc_xml, which converts the XML to a dictionary like this (formatted):

{
    'annotation': {
        'filename': '2010_001870.jpg',
        'folder': 'VOC2010',
        'object': [{
            'name': 'person',
            'bndbox': {
                'xmax': '325',
                'xmin': '110',
                'ymax': '375',
                'ymin': '72'
            },
            'difficult': '0',
            'occluded': '0',
            'pose': 'Frontal',
            'truncated': '1',
            'part': [{
                'name': 'head',
                'bndbox': {
                    'xmin': '180',
                    'ymin': '75',
                    'xmax': '286',
                    'ymax': '211'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '225',
                    'ymin': '284',
                    'xmax': '292',
                    'ymax': '342'
                }
            }]
        }, {
            'name': 'person',
            'bndbox': {
                'xmax': '490',
                'xmin': '57',
                'ymax': '327',
                'ymin': '73'
            },
            'difficult': '0',
            'occluded': '1',
            'pose': 'Frontal',
            'truncated': '1',
            'part': [{
                'name': 'head',
                'bndbox': {
                    'xmin': '283',
                    'ymin': '88',
                    'xmax': '360',
                    'ymax': '226'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '58',
                    'ymin': '235',
                    'xmax': '116',
                    'ymax': '329'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '356',
                    'ymin': '136',
                    'xmax': '413',
                    'ymax': '215'
                }
            }]
        }],
        'segmented': '0',
        'size': {
            'depth': '3',
            'height': '375',
            'width': '500'
        },
        'source': {
            'annotation': 'PASCAL VOC2010',
            'database': 'The VOC2010 Database',
            'image': 'flickr'
        }
    }
}

As you can see, it handles parts and objects correctly.
The only downside is that the target won't be directly usable unless the user passes a target_transform. I can write a transform and add it to transforms, or shall we leave that to the user?

@fmassa (Member) commented Dec 6, 2018

That looks awesome, thanks!
I think we can leave it as is; this is similar to how CocoDataset returns target data.
I don't think we need to provide any transforms for now.
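To illustrate what such a user-side transform could look like — a sketch against the dict format shown above, where class_to_ind stands for whatever name-to-index mapping the user supplies:

    def dict_to_boxes(target, class_to_ind):
        # Flatten the parsed VOC dict into [[xmin, ymin, xmax, ymax, ind], ...].
        objects = target['annotation']['object']
        if isinstance(objects, dict):
            # A lone object may come back as a dict rather than a list.
            objects = [objects]
        boxes = []
        for obj in objects:
            bb = obj['bndbox']
            boxes.append([int(bb['xmin']), int(bb['ymin']),
                          int(bb['xmax']), int(bb['ymax']),
                          class_to_ind[obj['name']]])
        return boxes

This could be passed as, e.g., target_transform=lambda t: dict_to_boxes(t, class_to_ind).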

@fmassa (Member) left a comment

This is almost ready for merge, one last comment and then this is good to go!

Thanks a lot for the awesome work!

        self.transform = transform
        self.target_transform = target_transform
        self.image_set = image_set
        self.class_to_ind = class_to_ind or dict(
@fmassa (Member):

Can we remove class_to_ind? It's not used anymore.

    def __len__(self):
        return len(self.images)

    def parse_voc_xml(self, node):
@fmassa (Member):

This looks great, thanks!
Just to double-check: did you try parsing all the images in, say, VOC2012? I know that some images have a single object, and that might require some special handling; I just want to verify that this is indeed being taken care of here.

@bpinaya (Contributor, Author) commented Dec 6, 2018

Oh indeed, I was not using class_to_ind anymore; I removed it, and removed VOC_CLASSES too. And I just checked with VOC2012: it works for parts that have a single object.
This gist is the data loading of VOC2010; for target_transform I just passed a function that prints the target and returns 0.
Also tested VOC2012 in this gist; as you can see, it works whether an object has one or multiple bounding boxes. Picture below for example, but you might just check the gists.
[screenshot of the parsed VOC2012 output]

@fmassa (Member) left a comment

Thanks a lot, this is awesome!

@feixiangdekaka commented:

How can I load trainval 2007 and trainval 2012 at the same time? gluon provides this:

    def __init__(self, root=os.path.join('~', '.mxnet', 'datasets', 'voc'),
                 splits=((2007, 'trainval'), (2012, 'trainval'))):

Does torchvision provide only one year?

@fmassa (Member) commented Jan 8, 2019

@feixiangdekaka you can use torch.utils.data.ConcatDataset to concatenate two datasets created from different years, or even COCO + Pascal.
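A minimal sketch of that suggestion, assuming both years are downloaded under the same root:

    import torch
    import torchvision

    voc07 = torchvision.datasets.VOCDetection(root="~/.torch", year="2007",
                                              image_set="trainval", download=True)
    voc12 = torchvision.datasets.VOCDetection(root="~/.torch", year="2012",
                                              image_set="trainval", download=True)
    trainval = torch.utils.data.ConcatDataset([voc07, voc12])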

@feixiangdekaka commented:

    data_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
    trainloader = torch.utils.data.DataLoader(dataset1, batch_size=64,
                                              shuffle=True, num_workers=2)

How do I correctly iterate the dataset?
(1) next(iter(trainloader))
(2) for i, images, target in enumerate(trainloader):
But I get an error; any hint?

@adamjstewart (Contributor) commented:

@feixiangdekaka Two things look wrong here. (1) You define dataset but refer to it as dataset1 later in the code; I'm going to assume this is a typo. (2) You need parentheses around (images, target), as they are returned as a single tuple. It should look like:

for i, (images, target) in enumerate(trainloader):

@feixiangdekaka commented:

When I use CIFAR-10 it gets the right result! When using VOC with batch_size=1 it is right, but when batch_size > 1 it returns the error:

    def _process_next_batch(self, batch):
        self.rcvd_idx += 1
        self._put_indices()
        if isinstance(batch, ExceptionWrapper):
            raise batch.exc_type(batch.exc_msg)
        return batch

It raises an ExceptionWrapper error!

@feixiangdekaka commented:

    import torch
    import torchvision
    import torchvision.transforms as transforms

    data_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=4,
                                              shuffle=True, num_workers=1)

    dataiter = iter(trainloader)
    images, target = dataiter.next()
    print(len(target))

It cannot run! And when written like below:

    for i, (images, target) in enumerate(trainloader):
        print(images.shape)

it still does not work!

@fmassa (Member) commented Jan 16, 2019

@feixiangdekaka can you run with num_workers=0 so that the error message is clearer, and paste the results here?
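For readers who hit the same error: with batch_size > 1, the default collate function tries to batch the nested annotation dicts, whose structure varies from image to image (different numbers of objects), which is a likely cause of the failure here. One common workaround, sketched below rather than taken from this thread, is a custom collate_fn that leaves the targets as a plain list (reusing the dataset defined above):

    import torch

    def detection_collate(batch):
        # Stack the equally-sized image tensors, but keep the
        # variable-structure annotation dicts as a plain Python list.
        images = torch.stack([item[0] for item in batch])
        targets = [item[1] for item in batch]
        return images, targets

    trainloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True,
                                              num_workers=2, collate_fn=detection_collate)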
