Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat dataset #1354

Merged
merged 7 commits into from
Jul 16, 2024
Merged

concat dataset #1354

merged 7 commits into from
Jul 16, 2024

Conversation

yueyinqiu
Copy link
Contributor

@yueyinqiu yueyinqiu commented Jul 14, 2024

#1348 (comment)

  • add torch.utils.data.ConcatDataset
    • The parameter of DataLoaders has been relaxed to IDataset<out T> (a new covariant interface), thus accepting the concated dataset
      • now Dataset<T> implements IDataset<T>
      • Dataset implements both IDataset<Dictionary<string, Tensor>> and IDataset<IReadOnlyDictionary<string, Tensor>>
      • IterableDataset implements IDataset<IList<string, Tensor>> and IDataset<IEnumerable<string, Tensor>>
  • parameter of collate functions has been relaxed to IReadOnlyList

@@ -13,21 +14,29 @@ public static partial class data
/// <summary>
/// Map-style data set
/// </summary>
public abstract class Dataset : Dataset<Dictionary<string, torch.Tensor>>
public abstract class Dataset : Dataset<Dictionary<string, Tensor>>,
IDataset<IReadOnlyDictionary<string, Tensor>>
Copy link
Contributor Author

@yueyinqiu yueyinqiu Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented in the codes below, due to covariation, Dataset should naturally be IDataset<IReadOnlyDictionary<string, Tensor>> (because it is IDataset<Dictionary<string, Tensor>>). However FSharp.Examples cannot be complied without this. I don't know why this happens.

I would suggest to remove this line, because it influences the detection of torch.utils.data.ConcatDataset (IReadOnlyDictionary or Dictionary are both ok, so it can't automatically choose one), but I failed to make the fsharp program work.

Copy link
Contributor Author

@yueyinqiu yueyinqiu Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more radical idea is to remove Dataset and IterableDataset at all. Actually we don't need them. DataLoaders could work on any IDataset<IReadOnlyDictionary<string, torch.Tensor>> and IDataset<IEnumerable<Tensor>>.

By the way, our IterableDataset is occupying the position of PyTorch IterableDataset. #1353 That's also part of the reason why I suggest to remove them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F# samples have to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So. just to check -- with the current commits, the F# examples build, correct? I don't see any errors in the latest build.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I could be built by adding : IDataset<IReadOnlyDictionary<string, Tensor>>.

@yueyinqiu yueyinqiu changed the title (WIP) concat dataset concat dataset Jul 14, 2024
@yueyinqiu yueyinqiu marked this pull request as ready for review July 14, 2024 09:34
/// <param name="index">Index for tensor</param>
/// <returns>Tensors of index. DataLoader will catenate these tensors into batches.</returns>
[IndexerName("DatasetItems")]
T this[long index] { get; }
Copy link
Contributor Author

@yueyinqiu yueyinqiu Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's using indexers instead of GetTensor in the new interface. I'm not really sure about its IndexerName.

And... shall we remove GetTensor at the same time? Currently Dataset<T> is keeping an abstract GetTensor to ensure compatibility. (And its indexer is calling GetTensor.)

@NiklasGustafsson
Copy link
Contributor

Ready to merge?

@yueyinqiu
Copy link
Contributor Author

yes

@NiklasGustafsson NiklasGustafsson merged commit 35f4c7d into dotnet:main Jul 16, 2024
2 checks passed
@NiklasGustafsson
Copy link
Contributor

@NiklasGustafsson
Copy link
Contributor

This was reverted due to breaking changes preventing a build. We're eager to make a new release with support for Ubuntu 20.04.

@yueyinqiu yueyinqiu deleted the ConcatDataset branch September 27, 2024 02:05
@yueyinqiu yueyinqiu mentioned this pull request Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants