Sample order and ID #2516

Closed
wangjiawen2013 opened this issue Nov 20, 2024 · 10 comments

wangjiawen2013 commented Nov 20, 2024

Hi,
When training or testing on a custom dataset, how does Burn determine the sample order? For example, here is the directory tree of cifar10:
[image: cifar10 directory tree]

When testing a model with this custom dataset (png format for teaching & tutorials), I can get the predicted results, but how do I pair the results with the samples? Can Burn output both the sample directory and the predicted results together?
For Hugging Face datasets such as mnist, we don't even know the sample IDs and cannot see the images, so it's also necessary to output the sample names!

laggui commented Nov 20, 2024

The ImageFolderDataset items are collected by globbing the directory to find the files that match the image extension pattern. As indicated by the globwalk doc, the order of elements yielded is unspecified. This means the order is not guaranteed.
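
To illustrate why an unspecified walk order need not lead to an unstable dataset order, here is a minimal sketch that collects image files and then sorts them. It uses only the standard library rather than globwalk, and the function name `collect_sorted_image_paths` and the extension list are assumptions made for this example, not Burn's actual code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Extensions treated as images in this sketch (an assumption, not Burn's list).
const IMAGE_EXTENSIONS: [&str; 3] = ["png", "jpg", "jpeg"];

/// Returns true when the file extension matches one of `IMAGE_EXTENSIONS`,
/// ignoring ASCII case.
fn has_image_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| IMAGE_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

/// Recursively collect image files under `root`, then sort them so the
/// result does not depend on the order the filesystem yields entries.
fn collect_sorted_image_paths(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    let mut pending = vec![root.to_path_buf()];

    // Iterative directory walk; `read_dir` makes no ordering promise.
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if has_image_extension(&path) {
                paths.push(path);
            }
        }
    }

    // Sorting is what makes the final order reproducible across runs.
    paths.sort();
    Ok(paths)
}
```

Calling the function twice on an unchanged directory should yield the same list, whatever order the underlying walk returns entries in.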

The current implementation also discards the original image path for an ImageDatasetItem (mapper implementation here). The item only contains the parsed data (i.e., image data and label for the image classification case).

If you'd like to keep this information for the use case you described, the ImageDatasetItem would have to be modified to keep the image path in an additional field. And then to keep this information within the batch, the batcher implementation would have to be adapted as well.

Perhaps the changes to ImageDatasetItem could be incorporated into Burn, though I'm not sure how important that is.

@wangjiawen2013
Contributor Author

What does "the order is not guaranteed" mean? Does it mean that the order of elements yielded by globwalk is random and may differ when I re-run it?
I think it is important to output the item paths. Since the modification is complicated, is there a workaround that lets me output them myself (for example, running globwalk again and printing the paths)?

laggui commented Nov 21, 2024

Ahhh sorry I totally forgot one implementation detail: we actually sort the paths, so the items will be in sorted order (as long as you don't shuffle with the dataloader).

You could double check that the data is always the same to validate. But the paths will still not be accessible.
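
As a rough sketch of the workaround discussed here, predictions can be paired with a separately collected path list by index. This assumes the paths are in the same sorted order the dataset used and that the dataloader did not shuffle. The function name `pair_predictions` and the use of `usize` class indices are assumptions made for this example.

```rust
use std::path::PathBuf;

/// Pair each prediction with the path at the same index.
///
/// This is only valid if `paths` is in the same order the dataset used and
/// the dataloader did not shuffle. Returns `None` when the lengths differ,
/// which signals that the two lists cannot describe the same samples.
fn pair_predictions(paths: &[PathBuf], predictions: &[usize]) -> Option<Vec<(PathBuf, usize)>> {
    if paths.len() != predictions.len() {
        return None;
    }
    Some(
        paths
            .iter()
            .cloned()
            .zip(predictions.iter().copied())
            .collect(),
    )
}
```

The length check catches only the most obvious mismatch. It cannot detect a shuffled dataloader, so the ordering assumption still has to be verified separately.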

@wangjiawen2013
Contributor Author

I copied the globbing code and printed the image paths. Don't you need the image paths? If not, how do you pair your predictions with the images?

laggui commented Nov 22, 2024

Yeah, the image paths are used to get the corresponding image data and ground-truth label! The list of image/annotation pairs is stored in a vec of ImageDatasetItemRaw items. And as I mentioned before, each item is mapped to an ImageDatasetItem that contains the image data (read from the path) and annotation. So the original image path is not preserved further.

wangjiawen2013 commented Nov 23, 2024 via email

wangjiawen2013 commented Nov 23, 2024 via email

laggui commented Nov 25, 2024

As I said earlier, the image id (i.e., path to the source) is discarded when reading the image data.

This happens specifically within the mapper that transforms an ImageDatasetItemRaw, which contains the image_path and "raw" annotation fields, to an ImageDatasetItem containing the image and annotation data.

There is no way with the current implementation to preserve that information because it is not currently kept as a dataset item field. But you could easily adapt the code to simply have something like this in your implementation:

/// Modified image dataset item that preserves the image source field.
#[derive(Debug, Clone, PartialEq)]
pub struct ImageDatasetItem {
    /// Image as a vector with a valid image type.
    pub image: Vec<PixelDepth>,

    /// Annotation for the image.
    pub annotation: Annotation,

    /// Original image source.
    pub image_path: String,
}

impl Mapper<ImageDatasetItemRaw, ImageDatasetItem> for PathToImageDatasetItem {
    /// Convert a raw image dataset item (path-like) to a 3D image array with a target label.
    fn map(&self, item: &ImageDatasetItemRaw) -> ImageDatasetItem {
        let annotation = parse_image_annotation(&item.annotation, &self.classes);

        // Load image from disk
        let image = image::open(&item.image_path).unwrap();

        // Image as Vec<PixelDepth>
        let img_vec = match image.color() {
           // ...
        };

        ImageDatasetItem {
            image: img_vec,
            annotation,
            // Keep the image source as a field
            image_path: item.image_path.display().to_string(),
        }
    }
}
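
To keep the path available after batching as well, the batching step would need to carry it along. The following is a simplified, self-contained sketch of that idea. It deliberately avoids Burn's `Batcher` trait and tensor types, and `ItemWithPath`, `BatchWithPaths`, and `batch_with_paths` are hypothetical names standing in for the modified item and batch.

```rust
/// Hypothetical, simplified stand-in for the modified dataset item above:
/// raw bytes instead of `Vec<PixelDepth>`, a class index instead of `Annotation`.
#[derive(Debug, Clone, PartialEq)]
struct ItemWithPath {
    image: Vec<u8>,
    label: usize,
    image_path: String,
}

/// Hypothetical batch type. A real batcher would stack the images into a
/// tensor; plain vectors are used here to keep the sketch dependency-free.
#[derive(Debug, Clone, PartialEq)]
struct BatchWithPaths {
    images: Vec<Vec<u8>>,
    targets: Vec<usize>,
    /// Source path of each sample, in the same order as `images` and `targets`.
    image_paths: Vec<String>,
}

/// Build a batch while preserving each item's source path at the same index
/// as its image and target.
fn batch_with_paths(items: Vec<ItemWithPath>) -> BatchWithPaths {
    let mut batch = BatchWithPaths {
        images: Vec::with_capacity(items.len()),
        targets: Vec::with_capacity(items.len()),
        image_paths: Vec::with_capacity(items.len()),
    };

    for item in items {
        batch.images.push(item.image);
        batch.targets.push(item.label);
        batch.image_paths.push(item.image_path);
    }

    batch
}
```

Because all three vectors are filled in the same loop, the prediction for index `i` of a batch can be reported next to `image_paths[i]`, even when the dataloader shuffles.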

@wangjiawen2013
Contributor Author

Yes, you're right. I forked Burn and modified ImageDatasetItem and the batch. The output looks good, and this is exactly what I want!
Why not add this field to Burn? It would be a great convenience for users.
[image]

laggui commented Nov 27, 2024

I don't have any opposition to this addition.

If you already made the changes you could make a PR from your fork 🙂

laggui mentioned this issue Nov 28, 2024
laggui closed this as completed Nov 28, 2024