
GitHub Roadmap 2024 #5320

Open
JanuszL opened this issue Feb 14, 2024 · 11 comments

@JanuszL
Contributor

JanuszL commented Feb 14, 2024

The following represents a high-level overview of our 2024 plan. Please be aware that this roadmap may change at any time and the order below does not reflect the priority of our future efforts.

We strongly encourage you to comment on our roadmap and provide us with feedback.

Some of the items mentioned below are a continuation of the 2023 effort (#4578).

Improving Usability:

  • more Pythonic user experience - function prototypes, improved Python exception messages
  • flexible execution model - data can move from the CPU to the GPU and back to the CPU in a single pipeline, which can utilize the NVIDIA Grace Hopper Superchip's fast memory transfers
  • improved checkpointing - the ability to preserve and restore the DALI processing state at the pipeline and deep learning framework iterator level
  • tighter integration with the NVIDIA ecosystem - extending function coverage by exposing functionality from libraries like CV-CUDA or RAPIDS

Extending input format support:

  • Extending support for video formats and containers, including variable frame rate videos
  • Adding GPU acceleration for more image formats, like TIFF, or new profiles of existing ones

Performance:

  • optimizing memory consumption
  • operators performance optimizations

New transformations:

We are constantly extending the set of operations supported by DALI. This section lists the most notable additions planned for this year in our areas of interest. The list is not exhaustive, and we plan to expand the set of operators as needs or requests arise.

  • new transformations for general data processing
  • new transformations for image processing
  • new transformations for video processing
@JanuszL JanuszL pinned this issue Feb 14, 2024
@JanuszL JanuszL closed this as completed Feb 14, 2024
@JanuszL JanuszL reopened this Feb 14, 2024
@JanuszL JanuszL mentioned this issue Feb 14, 2024
@5had3z
Contributor

5had3z commented Mar 4, 2024

I've recently come across a few gaps in datatype compatibility.

For example, I have an int32 mask and I want to perform the usual random rotate -> scale -> flip -> ... pipeline for augmentation. Scale does not support int32, but does support uint16; however, flip does not support uint16. FP16 support is also relatively weak in places.

I think it would be worth auditing the datatypes currently accepted by some of the operations, checking why support is missing (maybe an NPPI implementation isn't available), and, if there is no particular reason, seeing how it can be unblocked.
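
In the meantime, one possible workaround is to cast the mask to a dtype that every operator in the chain supports and cast back at the end. This is a hedged sketch, not an official recommendation: fn.cast, fn.rotate, fn.flip, and fn.random are real DALI operators, while the concrete pipeline below is purely illustrative.

```python
# Sketch: work around per-operator dtype gaps by casting an int32 mask to
# float32 for augmentation and back to int32 afterwards. The operators are
# real DALI APIs; the pipeline itself is illustrative only.
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def mask_augmentation():
    mask = fn.external_source(name="mask", device="gpu")   # int32 masks fed in
    mask = fn.cast(mask, dtype=types.FLOAT)                # widely supported dtype
    mask = fn.rotate(mask,
                     angle=fn.random.uniform(range=(-15.0, 15.0)),
                     interp_type=types.INTERP_NN)          # NN keeps label values intact
    mask = fn.flip(mask, horizontal=fn.random.coin_flip())
    return fn.cast(mask, dtype=types.INT32)                # restore the original dtype
```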

@mzient
Contributor

mzient commented Mar 4, 2024

@5had3z Thanks for pointing this out. Some of the types were probably overlooked and we can take a look at what we can do. Some others were trimmed to limit the binary size. Each additional type typically translates into a kernel instantiation. If there are multiple types (multiple inputs, outputs) the size quickly gets out of control. Finally, as you pointed out, in some cases we depend on external libraries and are limited by what they support.

@5had3z
Contributor

5had3z commented Mar 11, 2024

> @5had3z Some others were trimmed to limit the binary size. Each additional type typically translates into a kernel instantiation. If there are multiple types (multiple inputs, outputs) the size quickly gets out of control.

I have zero experience with this, so I have no idea about the feasibility, but would there perhaps be a way to JIT-compile operations that have these large parameter spaces? Common cases could still be compiled ahead of time for faster pipeline start-up, and uninstantiated ones compiled on the fly.

Another catch-22 I have is that I want my PyTorch DALIGenericIterator to know the "size" of my dataset, but if I use an external_source that is a Python function, I don't get to use reader_name, and I get a bunch of complaints about using size and last_batch_padded (the pipeline itself seemingly works fine anyway). I would be comfortable writing my custom source dataset in C++, but I haven't looked into it deeply since there isn't a pre-written tutorial yet.

@JanuszL
Contributor Author

JanuszL commented Mar 11, 2024

@5had3z

> I have zero experience with this, so I have no idea about the feasibility, but would there perhaps be a way to JIT-compile operations that have these large parameter spaces? Common cases could still be compiled ahead of time for faster pipeline start-up, and uninstantiated ones compiled on the fly.

I think you hit the nail on the head. Runtime compilation is becoming more and more established in the industry, and DALI is also looking into different implementations of such capabilities. For now, you can use CuPy or Numba to define an operation in Python and have it compiled; see the sketch below. We are also exploring approaches for the native operations DALI currently supports.
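
For illustration, a minimal sketch of the CuPy route via fn.python_function, which requires building the pipeline with exec_async=False and exec_pipelined=False; the gamma-correction callback is a hypothetical stand-in for an operation DALI lacks natively.

```python
# Hedged sketch: running a CuPy-compiled operation inside a DALI pipeline
# through fn.python_function. On GPU, the callback receives arrays that
# CuPy can wrap without a copy; CuPy JIT-compiles its kernels on first use.
import cupy as cp
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

def gamma_correction(images):
    arr = cp.asarray(images)                  # zero-copy wrap of GPU data
    out = cp.power(arr / 255.0, 0.8) * 255.0  # illustrative custom op
    return out.astype(cp.uint8)

# python_function currently requires synchronous, non-pipelined execution.
@pipeline_def(batch_size=4, num_threads=2, device_id=0,
              exec_async=False, exec_pipelined=False)
def cupy_pipeline():
    images = fn.external_source(name="images", device="gpu")
    return fn.python_function(images, function=gamma_correction, device="gpu")
```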

> Another catch-22 I have is that I want my PyTorch DALIGenericIterator to know the "size" of my dataset, but if I use an external_source that is a Python function, I don't get to use reader_name, and I get a bunch of complaints about using size and last_batch_padded (the pipeline itself seemingly works fine anyway). I would be comfortable writing my custom source dataset in C++, but I haven't looked into it deeply since there isn't a pre-written tutorial yet.

I'm afraid that is one of the limitations of the external source operator: while it gives you the freedom to return any number of batches, it deprives the iterator of any knowledge of how many samples/batches it may expect. The most flexible approach, in this case, is to always return full batches from the external source and pad the last one with either a duplicated sample or samples randomly selected from the whole data set, as in the sketch below.
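
A minimal sketch of that padding approach in per-sample external_source mode: SampleInfo and the batch=False protocol are real DALI APIs, while file_list and load_sample are hypothetical stand-ins for a dataset.

```python
# Hedged sketch: a source that always yields full batches, padding the tail
# of the epoch by repeating the last sample, so the iterator can be driven
# for a known number of iterations. `load_sample` is a hypothetical loader.
import math

class PaddedSource:
    def __init__(self, file_list, batch_size):
        self.file_list = file_list
        self.iterations = math.ceil(len(file_list) / batch_size)

    def __call__(self, sample_info):
        # In per-sample mode DALI passes a types.SampleInfo; raising
        # StopIteration ends the epoch after the padded final batch.
        if sample_info.iteration >= self.iterations:
            raise StopIteration
        idx = min(sample_info.idx_in_epoch, len(self.file_list) - 1)
        return load_sample(self.file_list[idx])

# Inside the pipeline definition:
#   data = fn.external_source(source=PaddedSource(files, batch_size=8), batch=False)
```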

@idobenamram

idobenamram commented Nov 4, 2024

Are you guys thinking of working on DALI for Rust? It would help when we don't want to use the Triton server.

@JanuszL
Contributor Author

JanuszL commented Nov 4, 2024

Hi @idobenamram,

Thank you for reaching out.

> Are you guys thinking of working on DALI for Rust?

What exactly do you have in mind? Is it a native API (similar to our C API) to run the pipeline, or something similar to our Python one?

@idobenamram

idobenamram commented Nov 4, 2024

Wow, thanks for the quick response @JanuszL. Yes, sorry, maybe I should clarify. Right now we have two separate ways of working with our models: the first is the "production" way using Triton's inference server, and the second is using ONNX Runtime when we want to run locally or in any other setting that doesn't have the Triton server up and running. We work mainly in Rust, so for most of our models we call ort directly from Rust. Currently we don't have a good solution for our DALI pipelines, which are written in Python and executed via Triton. This means we implement the DALI part twice for when we want to use ONNX. I was thinking it could be cool to have the ability to run DALI from Rust like you can in Python. Thanks again!

@JanuszL
Contributor Author

JanuszL commented Nov 4, 2024

@idobenamram would https://github.com/rust-lang/rust-bindgen applied to https://github.com/NVIDIA/DALI/blob/main/include/dali/c_api.h do in your case? You can use daliCreatePipeline to create the pipeline, daliRun to run it, and daliOutput to get the data. You can refer to https://github.com/NVIDIA/DALI/blob/main/dali/c_api/c_api_test.cc or https://github.com/NVIDIA/DALI/blob/main/dali_tf_plugin/daliop.cc to see how it is used.

@klecki
Contributor

klecki commented Nov 4, 2024

The C API is meant to run a Pipeline that was already defined in Python. You can use this method to serialize the Pipeline to a file, which can later be loaded through the C API: https://docs.nvidia.com/deeplearning/dali/main-user-guide/docs/pipeline.html#nvidia.dali.Pipeline.serialize
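
For example, a minimal sketch: the reader and its file_root path are placeholders, while Pipeline.serialize(filename=...) is the documented call.

```python
# Sketch: define a pipeline in Python, then serialize it to a file that can
# later be loaded from the C API (daliCreatePipeline) - e.g. from Rust
# bindings. The file_root path is a placeholder.
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def preprocess():
    encoded, labels = fn.readers.file(file_root="data/")
    images = fn.decoders.image(encoded, device="mixed")
    return images, labels

pipe = preprocess()
pipe.build()
pipe.serialize(filename="pipeline.dali")  # load this file via the C API
```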

@idobenamram

idobenamram commented Nov 4, 2024

Oh wow, that's actually really cool, I didn't know that existed. So all I really need to do is create a small wrapper for the C API in Rust.
Is this something you guys would be interested in? I can do it on my side, but then no one else would be able to use it.

@JanuszL
Contributor Author

JanuszL commented Nov 4, 2024

> Is this something you guys would be interested in? I can do it on my side, but then no one else would be able to use it.

If you can contribute the necessary automation for it to DALI, that would be great. We would be more than happy to guide you.
