VOCSegmentation, VOCDetection, linting passing, examples. #663
Conversation
torchvision/datasets/voc.py
Outdated

```python
Args:
    root (string): Root directory of the VOC Dataset.
    year (string, optional): The dataset year, supports years 2007 to 2012.
    image_set (string, optional): Select the image_set to use, ``train, trainval or val``
```
torchvision/datasets/voc.py
Outdated

```python
    relative coordinates like ``[[xmin, ymin, xmax, ymax, ind], [...], ...]``.
    """
    _img = Image.open(self.images[index]).convert('RGB')
    _target = self._get_bboxes(ET.parse(self.annotations[index]).getroot())
```
Can we instead return the raw result from `ET.parse().getroot()`? Or maybe a dict with the full parsing of the XML?
It's up to the user to decide what they actually want for their target, and we are currently forcing them into one format (which doesn't hold all the information from the dataset, such as truncated / occluded / etc.).
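For illustration, a minimal sketch of what returning the raw root could look like (hypothetical, assuming the dataset keeps `self.images` / `self.annotations` lists; not the merged implementation):

```python
import xml.etree.ElementTree as ET
from PIL import Image

def __getitem__(self, index):
    # Method sketch (lives on the dataset class): hand back the PIL image
    # and the raw parsed XML root; the user turns the root into whatever
    # target format they need via target_transform.
    img = Image.open(self.images[index]).convert('RGB')
    target = ET.parse(self.annotations[index]).getroot()
    if self.transform is not None:
        img = self.transform(img)
    if self.target_transform is not None:
        target = self.target_transform(target)
    return img, target
```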
Hi, I'm just about to get facebookresearch/maskrcnn-benchmark#131 merged, and I'll be pushing a more general version of this here. I can coordinate with you guys so that we can get this finally merged.
Hi @fmassa, just checked the PR, seems really robust. Keep me in the loop if you need help with testing or implementation.
I've just finished testing it and got some reasonable results with the codebase. I'll look into merging the PR todayish.
Looking forward to it!
small suggestion to make usage of the DATASET_YEAR_DICT more straightforward
Add suggestions from @ellisbrown, using dict of dicts instead of array index. Co-Authored-By: bpinaya <bpg_92@hotmail.com>
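For context, the dict-of-dicts shape being suggested here is roughly the following (a sketch; the URLs match the usual PASCAL VOC hosting, the checksums are placeholders, and the merged file may differ):

```python
DATASET_YEAR_DICT = {
    '2007': {
        'url': 'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar',
        'filename': 'VOCtrainval_06-Nov-2007.tar',
        'md5': '<checksum>',  # placeholder, not the real hash
        'base_dir': 'VOCdevkit/VOC2007',
    },
    '2012': {
        'url': 'http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar',
        'filename': 'VOCtrainval_11-May-2012.tar',
        'md5': '<checksum>',  # placeholder, not the real hash
        'base_dir': 'VOCdevkit/VOC2012',
    },
    # ... other years follow the same shape
}

# One lookup per year instead of parallel arrays indexed by position:
info = DATASET_YEAR_DICT['2012']
url, base_dir = info['url'], info['base_dir']
```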
Hey @fmassa, any updates?
Hi,
I've left a few more comments.
The most relevant one is that I think we should give the user the possibility to obtain all the information that is available in the XML.
What do you think?
torchvision/datasets/voc.py
Outdated

```python
def __len__(self):
    return len(self.images)

def _get_bboxes(self, target):
```
I'd probably let the user write this down. Maybe what would be the most user-friendly would be to parse the ET and return a nested dict?
Indeed, I thought about it; maybe a dict would be better, but I saw here that they pass a Bb class. I think whatever is easier for the end user, maybe. Regarding the iteration of the ET, any ideas to make it recursive and elegant? I implemented a function, but it's way too hacky.
In maskrcnn-benchmark, we have a dedicated class `BoxList` which is used everywhere in the codebase. My original plan was to move `BoxList` to torchvision, but it needs to mature a bit more before we move it here.
About the ET, I suppose there is a function call that we can use that would enable us to get it recursively? Probably something like this that I wrote for Lua.
torchvision/datasets/voc.py
Outdated

```python
_splits_dir = os.path.join(_voc_root, 'ImageSets/Main')

_split_f = os.path.join(_splits_dir, image_set.rstrip('\n') + '.txt')
```
Do we need the `rstrip`? Just out of curiosity.
torchvision/datasets/voc.py
Outdated

```python
image_set='train',
download=False,
class_to_ind=None,
keep_difficult=False,
```
I think `keep_difficult` could be part of what the user passes to the `target_transform`.
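For example (a hedged sketch against the nested-dict target discussed later in this thread; the function name is made up):

```python
def drop_difficult(target):
    # Remove objects flagged difficult from a parsed VOC annotation dict.
    objs = target['annotation']['object']
    if isinstance(objs, dict):  # a lone <object> may parse to a dict, not a list
        objs = [objs]
    target['annotation']['object'] = [o for o in objs if o['difficult'] == '0']
    return target

# dataset = VOCDetection(root, target_transform=drop_difficult)
```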
torchvision/datasets/voc.py
Outdated

```python
self.images = []
self.annotations = []
with open(os.path.join(_split_f), "r") as lines:
    for line in lines:
```
Can't we instead do

```python
with open(os.path.join(_split_f), "r") as f:
    image_names = f.readlines()
```

? I believe this strips out the `\n` at the end, and is a bit faster than the current version.
I've checked, and it doesn't strip the `\n`; the value of `image_names` is:

```python
['2008_000008\n', '2008_000015\n', '2008_000019\n', '2008_000023\n', '2008_000028\n', '2008_000033\n', '2008_000036\n', '2008_000037\n', '2008_000041\n', '2008_000045\n', '2008_000053\n', '2008_000060\n', '2008_000066\n', '2008_000070\n', '2008_000074\n', '2008_000085\n', '2008_000089\n', '2008_000093\n', '2008_000095\n', '2008_000096\n', '2008_000097\n', '2008_000099\n', '2008_000103\n', '2008_000105\n', '2008_000109\n', '2008_000112\n', ...]
```
OK, I think what I had done then was something like

```python
with open(os.path.join(_split_f), "r") as f:
    image_names = [x.strip() for x in f.readlines()]
```

but do it however you think is better.
Can you also update …

I think the better approach would be to pass a dict of the parsed …
@fmassa I followed your recommendations and added this instead for loading:

```python
with open(os.path.join(_split_f), "r") as f:
    file_names = [x.strip() for x in f.readlines()]

self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
self.annotations = [os.path.join(_annotation_dir, x + ".xml") for x in file_names]
```

in the detection part, and this in the segmentation part:

```python
with open(os.path.join(_split_f), "r") as f:
    file_names = [x.strip() for x in f.readlines()]

self.images = [os.path.join(_image_dir, x + ".jpg") for x in file_names]
self.masks = [os.path.join(_mask_dir, x + ".png") for x in file_names]
```

Removed … Here is an example of the parsed annotation:

```python
{
'annotation': {
'filename': '2010_001870.jpg',
'folder': 'VOC2010',
'object': [{
'name': 'person',
'bndbox': {
'xmax': '325',
'xmin': '110',
'ymax': '375',
'ymin': '72'
},
'difficult': '0',
'occluded': '0',
'pose': 'Frontal',
'truncated': '1',
'part': [{
'name': 'head',
'bndbox': {
'xmin': '180',
'ymin': '75',
'xmax': '286',
'ymax': '211'
}
}, {
'name': 'hand',
'bndbox': {
'xmin': '225',
'ymin': '284',
'xmax': '292',
'ymax': '342'
}
}]
}, {
'name': 'person',
'bndbox': {
'xmax': '490',
'xmin': '57',
'ymax': '327',
'ymin': '73'
},
'difficult': '0',
'occluded': '1',
'pose': 'Frontal',
'truncated': '1',
'part': [{
'name': 'head',
'bndbox': {
'xmin': '283',
'ymin': '88',
'xmax': '360',
'ymax': '226'
}
}, {
'name': 'hand',
'bndbox': {
'xmin': '58',
'ymin': '235',
'xmax': '116',
'ymax': '329'
}
}, {
'name': 'hand',
'bndbox': {
'xmin': '356',
'ymin': '136',
'xmax': '413',
'ymax': '215'
}
}]
}],
'segmented': '0',
'size': {
'depth': '3',
'height': '375',
'width': '500'
},
'source': {
'annotation': 'PASCAL VOC2010',
'database': 'The VOC2010 Database',
'image': 'flickr'
}
}
}
```

As you can see, it handles `part` and multiple `object` entries too.
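Given a target of that shape, pulling boxes out is straightforward; a usage sketch (note that all values are strings, as in the example above):

```python
img, target = dataset[0]  # assuming a VOCDetection instance

objects = target['annotation']['object']
for obj in objects:
    box = obj['bndbox']
    xmin, ymin, xmax, ymax = (int(box[k]) for k in ('xmin', 'ymin', 'xmax', 'ymax'))
    print(obj['name'], xmin, ymin, xmax, ymax)
```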
That looks awesome, thanks!
This is almost ready for merge, one last comment and then this is good to go!
Thanks a lot for the awesome work!
torchvision/datasets/voc.py
Outdated

```python
self.transform = transform
self.target_transform = target_transform
self.image_set = image_set
self.class_to_ind = class_to_ind or dict(
```
Can we remove `class_to_ind`? It's not used anymore.
```python
def __len__(self):
    return len(self.images)

def parse_voc_xml(self, node):
```
This looks great, thanks!
Just to double-check, did you try parsing all the images in, say, VOC2012? I know that some images have a single object, and that might require some special handling; I just want to verify that this is indeed being taken care of here.
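For reference, a recursive parser along the lines discussed would look roughly like this (a sketch, not necessarily the merged code); the single-child collapse at the end is exactly the edge case being asked about, since an image with one `<object>` yields a dict where multi-object images yield a list:

```python
import collections

def parse_voc_xml(node):
    # Recursively turn an ElementTree node into a nested dict;
    # repeated child tags (e.g. several <object> elements) become lists.
    children = list(node)
    if not children:
        return {node.tag: node.text}
    grouped = collections.defaultdict(list)
    for child in children:
        for tag, value in parse_voc_xml(child).items():
            grouped[tag].append(value)
    # A tag appearing only once collapses to its bare value, which is why
    # a single <object> produces a dict instead of a one-element list.
    return {node.tag: {tag: v[0] if len(v) == 1 else v for tag, v in grouped.items()}}
```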
Oh, indeed, I was not using …
Thanks a lot, this is awesome!
How can I load trainval 2007 and trainval 2012 at the same time? Does torchvision provide only one year?
@feixiangdekaka you can use …
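The suggestion is truncated here; one standard way to combine the two splits (an assumption about what was meant, not a quote from the thread) is `torch.utils.data.ConcatDataset`:

```python
import torch
import torchvision

# Build one dataset per year, then chain them into a single dataset.
voc2007 = torchvision.datasets.VOCDetection(root="~/.torch", year="2007",
                                            image_set="trainval", download=True)
voc2012 = torchvision.datasets.VOCDetection(root="~/.torch", year="2012",
                                            image_set="trainval", download=True)
trainval = torch.utils.data.ConcatDataset([voc2007, voc2012])
```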
```python
data_transforms = ( …
dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
```

How do I correctly iterate the dataset?
@feixiangdekaka Two things look wrong here. (1) you define …

```python
for i, (images, target) in enumerate(trainloader):
```
When I use CIFAR-10 it gets the right result!

```python
def _process_next_batch(self, batch):
```
```python
import torch

dataset = torchvision.datasets.VOCDetection(root="~/.torch", year="2007", download=True, transform=data_transforms)
dataiter = iter(trainloader)
```

This cannot run! When written like below: …
@feixiangdekaka can you run with …
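The suggestion above is cut off. A common cause of `DataLoader` crashes with `VOCDetection` (an assumption about this user's error, which fits the truncated traceback) is that the default collate function cannot batch variable-length annotation dicts; a custom `collate_fn` sidesteps that:

```python
def voc_collate(batch):
    # Keep images and annotation dicts as plain lists instead of
    # trying to stack variable-sized targets into tensors.
    images, targets = zip(*batch)
    return list(images), list(targets)

trainloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True,
                                          num_workers=0, collate_fn=voc_collate)
```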
Hey there guys, I hope this PR can revive some of the efforts from past PRs; I'm referring to: …

This PR supports VOCSegmentation and VOCDetection for multiple years (2007 to 2012). 2006 and 2005 are not supported due to their very different format.
I tried to address many of the comments from the past two PRs. For the VOCDetection part, `__getitem__` will return the image and a list of bounding boxes in the format `[[xmin, ymin, xmax, ymax, ind], [...], ...]`, as in PR 37, directly instead of an ET Element. I think VOC is a very iconic dataset for both detection and segmentation, so it'd be very useful to have it ready to use in pytorch/vision.
If there are any comments or suggestions, let me know; I can put some time into implementing them.
There are two examples (Jupyter notebook gists) of the dataset being loaded: …
I've tested all the years. Regarding the transforms, I'm using the same structure the other datasets use. I've seen implementations where a `joint_transforms` argument is passed and applied to both input and target; I guess that'd work for segmentation but not so much for detection, so that's why it's not used here.
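For completeness, the `joint_transforms` idea mentioned above usually looks something like this for segmentation (a sketch with assumed names, not part of this PR):

```python
import random
import torchvision.transforms.functional as F

class JointRandomHorizontalFlip:
    """Flip image and mask together so pixel-level labels stay aligned."""

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, image, mask):
        # Draw one coin flip and apply the same decision to both inputs.
        if random.random() < self.p:
            image = F.hflip(image)
            mask = F.hflip(mask)
        return image, mask
```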