
VOCSegmentation, VOCDetection, linting passing, examples. #663

Merged · 8 commits · Dec 6, 2018

Conversation

@bpinaya (Contributor) commented Nov 19, 2018

Hey there guys, I hope this PR can revive some of the efforts that past PRs implemented, namely #37 and #86.

This PR supports VOCSegmentation and VOCDetection for multiple years (2007 to 2012); 2005 and 2006 are not supported due to their very different format.
I tried to address many of the comments from the past two PRs. For the VOCDetection part, __getitem__ returns the image and a list of bounding boxes in the format [[xmin, ymin, xmax, ymax, ind], [...], ...] directly, as in PR #37, instead of an ET Element.
I think VOC is a very iconic dataset for both detection and segmentation, so it'd be very useful to have it ready to use in pytorch/vision.

If there are any comments or suggestions, let me know; I can put some time into implementing them.
There are two examples (Jupyter notebook gists) of the dataset being loaded:

I've tested all the years. Regarding the transforms, I'm using the same structure other datasets use. I've seen implementations where a joint_transform is passed as an argument and applied to both input and target; I guess that'd work for segmentation but not so much for detection, so it's not used here (see the sketch below).
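For context, a joint transform of the kind described above typically looks like this minimal sketch (a hypothetical class, not part of this PR): one random decision drives the transform on both the image and its mask, so the segmentation labels stay aligned with the pixels.

    import random
    from PIL import Image

    class JointRandomHorizontalFlip:
        """Hypothetical joint transform: flip image and mask together."""
        def __init__(self, p=0.5):
            self.p = p

        def __call__(self, img, mask):
            # A single random draw flips both, keeping labels aligned.
            if random.random() < self.p:
                img = img.transpose(Image.FLIP_LEFT_RIGHT)
                mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
            return img, mask

For detection, the same flip would also have to rewrite the box x-coordinates, which is why a plain joint transform does not carry over cleanly.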

@adamjstewart (Contributor) left a comment

This looks great! I was just about to try to revive #37 and #86 myself. Let me know if there's anything I can do in terms of testing to help get this PR merged sooner!

    Args:
        root (string): Root directory of the VOC Dataset.
        year (string, optional): The dataset year, supports years 2007 to 2012.
        image_set (string, optional): Select the image_set to use, ``train``, ``trainval`` or ``val``


@fmassa (Member) commented Nov 20, 2018

Hi,

I'm just about to get facebookresearch/maskrcnn-benchmark#131 merged, and I'll be pushing a more general version of this here. I can coordinate with you guys so that we can get this finally merged.

@bpinaya (Contributor, Author) commented Nov 20, 2018

Hi @fmassa, just checked the PR, seems really robust. Keep me in the loop if you need help with testing or implementation.

@fmassa (Member) commented Nov 23, 2018

I've just finished testing it and got some reasonable results with the codebase. I'll look into merging the PR todayish.

@bpinaya (Contributor, Author) commented Nov 23, 2018

Looking forward to it!

@ellisbrown left a comment

A small suggestion to make usage of DATASET_YEAR_DICT more straightforward.

ellisbrown and others added 2 commits on November 28, 2018:
Add suggestions from @ellisbrown, using a dict of dicts instead of array indexing.
Co-Authored-By: bpinaya <bpg_92@hotmail.com>
@bpinaya (Contributor, Author) commented Dec 4, 2018

Hey @fmassa , any updates?

@fmassa (Member) left a comment

Hi,

I've left a few more comments.

The most relevant one is that I think we should give the user the possibility to obtain all the information that is available in the XML.

What do you think?

        relative coordinates like ``[[xmin, ymin, xmax, ymax, ind], [...], ...]``.
        """
        _img = Image.open(self.images[index]).convert('RGB')
        _target = self._get_bboxes(ET.parse(self.annotations[index]).getroot())
@fmassa (Member) commented on the diff:

Can we instead return the raw result from ET.parse().getroot()? Or maybe a dict with the full parsing of the xml?
It's up to the user to decide what they actually want for their target, and we are forcing them to use one format now (which doesn't hold all the information from the dataset, such as truncated / occluded / etc).

    def __len__(self):
        return len(self.images)

    def _get_bboxes(self, target):
@fmassa (Member) commented on the diff:

I'd probably let the user write this themselves. Maybe the most user-friendly option would be to parse the ET and return a nested dict?

@bpinaya (Contributor, Author) commented Dec 5, 2018

Indeed, I thought about it; maybe a dict would be better, but I saw here that they pass a Bb class. I think whatever is easiest for the end user. Regarding the iteration of the ET, any ideas to make it recursive and elegant? I implemented a function but it's way too hacky.

@fmassa (Member):

In maskrcnn-benchmark, we have a dedicated class BoxList which is used everywhere in the codebase.
My original plan was to move BoxList to torchvision, but it needs to mature a bit more before we move it here.

About the ET, I suppose there is a function we can use to walk it recursively? Probably something like this that I wrote for Lua.


        _splits_dir = os.path.join(_voc_root, 'ImageSets/Main')

        _split_f = os.path.join(_splits_dir, image_set.rstrip('\n') + '.txt')
@fmassa (Member):

do we need the rstrip? just out of curiosity

        image_set='train',
        download=False,
        class_to_ind=None,
        keep_difficult=False,
@fmassa (Member):

I think keep_difficult could be part of what the user passes to the target_transform.

        self.images = []
        self.annotations = []
        with open(os.path.join(_split_f), "r") as lines:
            for line in lines:
@fmassa (Member):

Can't we instead do

with open(os.path.join(_split_f), "r") as f:
    image_names = f.readlines()

? I believe this strips out the \n in the end, and is a bit faster than the current version.

@bpinaya (Contributor, Author):

I've checked, and it doesn't strip the \n. The value of image_names is:

['2008_000008\n', '2008_000015\n', '2008_000019\n', '2008_000023\n', '2008_000028\n', '2008_000033\n', '2008_000036\n', '2008_000037\n', '2008_000041\n', '2008_000045\n', '2008_000053\n', '2008_000060\n', '2008_000066\n', '2008_000070\n', '2008_000074\n', '2008_000085\n', '2008_000089\n', '2008_000093\n', '2008_000095\n', '2008_000096\n', '2008_000097\n', '2008_000099\n', '2008_000103\n', '2008_000105\n', '2008_000109\n', '2008_000112\n'...

@fmassa (Member):

OK, I think what I had done then was something like

with open(os.path.join(_split_f), "r") as f:
    image_names = [x.strip() for x in f.readlines()]

but do it however you think is best.

@adamjstewart (Contributor) commented:

Can you also update docs/source/datasets.rst so that this new Dataset appears in the documentation?

@bpinaya (Contributor, Author) commented Dec 5, 2018

I think the better approach would be to pass a dict of the parsed ET.parse().getroot(). I'm struggling a bit to implement a recursive function to parse it correctly, so if anyone has tips, they're appreciated.
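For reference, one possible shape for such a recursive parser — a minimal sketch (hypothetical name voc_xml_to_dict), not necessarily the exact code this PR ended up merging:

    import collections
    import xml.etree.ElementTree as ET

    def voc_xml_to_dict(node):
        # Leaf node: return its text content.
        children = list(node)
        if not children:
            return {node.tag: node.text.strip() if node.text else ''}
        # Group repeated child tags (e.g. multiple <object>) into lists.
        grouped = collections.defaultdict(list)
        for child in children:
            for tag, value in voc_xml_to_dict(child).items():
                grouped[tag].append(value)
        # A tag appearing once stays a plain value, so a lone <object>
        # comes back as a dict rather than a one-element list.
        return {node.tag: {tag: values[0] if len(values) == 1 else values
                           for tag, values in grouped.items()}}

    # Usage: voc_xml_to_dict(ET.parse(annotation_path).getroot())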

@bpinaya (Contributor, Author) commented Dec 6, 2018

@fmassa I followed your recommendations and added this instead for loading:

        with open(os.path.join(_split_f), "r") as f:
            file_names = [x.strip() for x in f.readlines()]

        self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
        self.annotations = [os.path.join(_annotation_dir, x + ".xml") for x in file_names]

That goes in the detection part, and this in the segmentation part:

        with open(os.path.join(_split_f), "r") as f:
            file_names = [x.strip() for x in f.readlines()]

        self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
        self.masks = [os.path.join(_mask_dir, x + ".png") for x in file_names]

I removed keep_difficult and _get_bboxes() and replaced them with parse_voc_xml, which converts the XML to a dictionary like this (formatted):

{
    'annotation': {
        'filename': '2010_001870.jpg',
        'folder': 'VOC2010',
        'object': [{
            'name': 'person',
            'bndbox': {
                'xmax': '325',
                'xmin': '110',
                'ymax': '375',
                'ymin': '72'
            },
            'difficult': '0',
            'occluded': '0',
            'pose': 'Frontal',
            'truncated': '1',
            'part': [{
                'name': 'head',
                'bndbox': {
                    'xmin': '180',
                    'ymin': '75',
                    'xmax': '286',
                    'ymax': '211'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '225',
                    'ymin': '284',
                    'xmax': '292',
                    'ymax': '342'
                }
            }]
        }, {
            'name': 'person',
            'bndbox': {
                'xmax': '490',
                'xmin': '57',
                'ymax': '327',
                'ymin': '73'
            },
            'difficult': '0',
            'occluded': '1',
            'pose': 'Frontal',
            'truncated': '1',
            'part': [{
                'name': 'head',
                'bndbox': {
                    'xmin': '283',
                    'ymin': '88',
                    'xmax': '360',
                    'ymax': '226'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '58',
                    'ymin': '235',
                    'xmax': '116',
                    'ymax': '329'
                }
            }, {
                'name': 'hand',
                'bndbox': {
                    'xmin': '356',
                    'ymin': '136',
                    'xmax': '413',
                    'ymax': '215'
                }
            }]
        }],
        'segmented': '0',
        'size': {
            'depth': '3',
            'height': '375',
            'width': '500'
        },
        'source': {
            'annotation': 'PASCAL VOC2010',
            'database': 'The VOC2010 Database',
            'image': 'flickr'
        }
    }
}

As you can see, it handles parts and objects correctly.
The only downside is that the target won't be directly usable unless the user passes a target_transform. I can write a transform and add it to transforms, or shall we leave that to the user?

@fmassa (Member) commented Dec 6, 2018

That looks awesome, thanks!
I think we can leave it as is; this is similar to how CocoDataset returns target data.
I don't think we need to provide any transforms for now.
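To illustrate what such a user-side transform could look like — a sketch against the dict format shown above, where class_to_ind stands for whatever name-to-index mapping the user supplies:

    def dict_to_boxes(target, class_to_ind):
        # Flatten the parsed VOC dict into [[xmin, ymin, xmax, ymax, ind], ...].
        objects = target['annotation']['object']
        if isinstance(objects, dict):
            # A lone object may come back as a dict rather than a list.
            objects = [objects]
        boxes = []
        for obj in objects:
            bb = obj['bndbox']
            boxes.append([int(bb['xmin']), int(bb['ymin']),
                          int(bb['xmax']), int(bb['ymax']),
                          class_to_ind[obj['name']]])
        return boxes

This could be passed as, e.g., target_transform=lambda t: dict_to_boxes(t, class_to_ind).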

@fmassa (Member) left a comment

This is almost ready for merge, one last comment and then this is good to go!

Thanks a lot for the awesome work!

        self.transform = transform
        self.target_transform = target_transform
        self.image_set = image_set
        self.class_to_ind = class_to_ind or dict(
@fmassa (Member):

Can we remove class_to_ind? It's not used anymore.

    def __len__(self):
        return len(self.images)

    def parse_voc_xml(self, node):
@fmassa (Member):

This looks great, thanks!
Just to double-check: did you try parsing all the images in, say, VOC2012? I know that some images have a single object, and that might require some special handling; I just want to verify that this is indeed being taken care of here.

@bpinaya (Contributor, Author) commented Dec 6, 2018

Oh indeed, I was not using class_to_ind anymore; I removed it, and removed VOC_CLASSES too. And I just checked with VOC2012: it works for parts that have a single object.
This gist is the data loading of VOC2010; for target_transform I just passed a function that prints the target and returns 0.
Also tested VOC2012 in this gist; as you can see, it works whether an object has one or multiple bounding boxes. Picture below for example, but you might just check the gists.
[screenshot of the parsed VOC2012 output]

@fmassa (Member) left a comment

Thanks a lot, this is awesome!

@feixiangdekaka commented:

How can I load trainval 2007 and trainval 2012 at the same time? gluon provides this:

    def __init__(self, root=os.path.join('~', '.mxnet', 'datasets', 'voc'),
                 splits=((2007, 'trainval'), (2012, 'trainval'))):

Does torchvision provide only one year?

@fmassa (Member) commented Jan 8, 2019

@feixiangdekaka you can use torch.utils.data.ConcatDataset to concatenate two datasets created from different years, or even COCO + Pascal.
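A minimal sketch of that suggestion, assuming both years are downloaded under the same root:

    import torch
    import torchvision

    voc07 = torchvision.datasets.VOCDetection(root="~/.torch", year="2007",
                                              image_set="trainval", download=True)
    voc12 = torchvision.datasets.VOCDetection(root="~/.torch", year="2012",
                                              image_set="trainval", download=True)
    trainval = torch.utils.data.ConcatDataset([voc07, voc12])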

@feixiangdekaka commented:

    data_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
    trainloader = torch.utils.data.DataLoader(dataset1, batch_size=64,
                                              shuffle=True, num_workers=2)

How do I correctly iterate the dataset?
(1) next(iter(trainloader))
(2) for i, images, target in enumerate(trainloader):
But I get an error; any hint?

@adamjstewart (Contributor) commented:

@feixiangdekaka Two things look wrong here. (1) You define dataset but refer to it as dataset1 later in the code; I'm going to assume this is a typo. (2) You need parentheses around (images, target), as they are returned as a single tuple. It should look like:

for i, (images, target) in enumerate(trainloader):

@feixiangdekaka commented:

When I use CIFAR-10 it gets the right result! When using VOC with batch_size=1 it is right, but when batch_size > 1 it returns the error:

    def _process_next_batch(self, batch):
        self.rcvd_idx += 1
        self._put_indices()
        if isinstance(batch, ExceptionWrapper):
            raise batch.exc_type(batch.exc_msg)
        return batch

It raises an ExceptionWrapper error!

@feixiangdekaka commented:

    import torch
    import torchvision
    import torchvision.transforms as transforms

    data_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=4,
                                              shuffle=True, num_workers=1)

    dataiter = iter(trainloader)
    images, target = dataiter.next()
    print(len(target))

It cannot run! And when written like below:

    for i, (images, target) in enumerate(trainloader):
        print(images.shape)

it still does not work!

@fmassa (Member) commented Jan 16, 2019

@feixiangdekaka can you run with num_workers=0 so that the error message is clearer, and paste the results here?
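For readers who hit the same error: with batch_size > 1, the default collate function tries to batch the nested annotation dicts, whose structure varies from image to image (different numbers of objects), which is a likely cause of the failure here. One common workaround, sketched below rather than taken from this thread, is a custom collate_fn that leaves the targets as a plain list (reusing the dataset defined above):

    import torch

    def detection_collate(batch):
        # Stack the equally-sized image tensors, but keep the
        # variable-structure annotation dicts as a plain Python list.
        images = torch.stack([item[0] for item in batch])
        targets = [item[1] for item in batch]
        return images, targets

    trainloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True,
                                              num_workers=2, collate_fn=detection_collate)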
