Impossible to make non-C++ DeviceApi #2515
It's possible to register new APIs from other languages by registering a global `device_api.my_device`, but it breaks down because the actual implementation depends on being a C++ `DeviceApi` with methods like `AllocDataSpace`. One solution is to create a "wrapper" DeviceAPI which calls out to globals like `device_api.ext_dev.alloc_data_space`.

Comments
I think we should put some thought into this. The main reason is that this will increase the size of the C++ runtime, which we otherwise would not need. The minimalist nature of the runtime makes me wonder whether it is a good idea to mix languages on the runtime side, or whether we should simply restrict the runtime backend to one language. If we decide that we want this, we should make it an optional dependency in the build. In terms of implementation, perhaps it is better to use `Module` to expose all the functions.
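To make the registry suggestion concrete, here is a minimal sketch of what exposing device primitives as globally registered functions (rather than as a C++ `DeviceAPI` subclass) could look like. It assumes TVM's C++ `TVM_REGISTER_GLOBAL` API; the `my_device.*` names and the `malloc`-backed primitives are hypothetical placeholders, not anything in the codebase.

```cpp
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <cstdlib>

using tvm::runtime::TVMArgs;
using tvm::runtime::TVMRetValue;

namespace {

// Hypothetical device primitives. A real backend would call into driver
// code -- or, per this thread, into another language's runtime via FFI.
void* MyDeviceAlloc(std::size_t nbytes) { return std::malloc(nbytes); }
void MyDeviceFree(void* ptr) { std::free(ptr); }

}  // namespace

// Expose each primitive as a globally registered function instead of a
// C++ DeviceAPI subclass. Any language binding that can reach the global
// function table can then call these.
TVM_REGISTER_GLOBAL("my_device.alloc_data_space")
    .set_body([](TVMArgs args, TVMRetValue* rv) {
      int64_t nbytes = args[0];
      *rv = MyDeviceAlloc(static_cast<std::size_t>(nbytes));
    });

TVM_REGISTER_GLOBAL("my_device.free_data_space")
    .set_body([](TVMArgs args, TVMRetValue* rv) {
      void* ptr = args[0];
      MyDeviceFree(ptr);
    });
```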
Trying to get my brain around the request. The abstraction of a device is implemented as an abstract C++ class, with actual implementations provided for each environment. @nhynes, what are you trying to do that couldn't be supported by a Rust `DeviceAPI` class implementation that calls out to your Rust device API?
(@Ravenwater) The issue is that one cannot make a device API from non-C++ languages, because creating one requires generating a pointer to a C++ `DeviceAPI` object. Of course, what the PR does is independent of why it exists. The reason is that I wanted to write a VTA simulator in Rust (and eventually Scala) for easier integration testing. @tqchen is right, though: the runtime should be kept minimal. I'll batch these changes until either 1) the other languages become self-sufficient or 2) I accumulate enough of these changes to make a cohesive PR. I'll close this issue for now.
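For reference, the "wrapper" approach from the issue description could look roughly like the following sketch. It is deliberately self-contained rather than using TVM's actual classes: the registry and the one-method `DeviceAPI` slice are simplified stand-ins, and only the `device_api.ext_dev.alloc_data_space` key follows the naming convention mentioned above.

```cpp
// Simplified, self-contained illustration of the "wrapper" idea: a C++ shim
// that satisfies the virtual interface but forwards every call to a function
// registered from another language.
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

using Global = std::function<void*(int device_id, std::size_t nbytes)>;

// Toy global-function registry (TVM's real one stores PackedFuncs).
std::map<std::string, Global>& Registry() {
  static std::map<std::string, Global> reg;
  return reg;
}

class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(int device_id, std::size_t nbytes) = 0;
};

// Each virtual method looks up and invokes a global that a Rust/Go/Scala
// runtime could have registered through the C FFI.
class ExtDevWrapper : public DeviceAPI {
 public:
  void* AllocDataSpace(int device_id, std::size_t nbytes) final {
    auto it = Registry().find("device_api.ext_dev.alloc_data_space");
    if (it == Registry().end()) {
      throw std::runtime_error("ext_dev did not register alloc_data_space");
    }
    return it->second(device_id, nbytes);
  }
  // FreeDataSpace, CopyDataFromTo, StreamSync, ... would forward the same way.
};
```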
@nhynes, turns out that we have a lot in common! I am looking into how to connect our Golang simulation environment, which models our tensor processor, and masquerade it as a VTA widget. Since our sim environment can run at scale, modeling hundreds of thousands of processors, we went with a gRPC/HTTPS interface that marshals the command and data streams of the processor. So I was planning to create a C++ stub that can create and manage that connection. This simplifies the integration point and decouples the build and run environments, keeping TVM/VTA pure C++ while being able to talk to a remote cluster of simulation capability. I would think that is the way you want to go as well, particularly if you want to push the modeling into a Scala universe.
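A rough sketch of the kind of C++ stub described here, i.e. marshaling a command/data stream to a remote simulator. Everything below is hypothetical; the transport is left abstract where a gRPC/HTTPS channel would go.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Cmd : uint8_t { kLoad, kStore, kCompute, kSync };

// Fixed-size command header; `len` counts payload bytes that follow it.
struct CommandPacket {
  Cmd op;
  uint64_t addr;  // accelerator-side address
  uint64_t len;   // payload length in bytes
};

class Transport {  // stand-in for a gRPC channel or HTTPS session
 public:
  virtual ~Transport() = default;
  virtual void Send(const uint8_t* buf, std::size_t n) = 0;
};

class RemoteSimStub {
 public:
  explicit RemoteSimStub(Transport* t) : t_(t) {}

  // Marshal one command (and optional payload) into a flat byte buffer
  // and push it over the connection to the remote simulator.
  void Issue(const CommandPacket& pkt, const uint8_t* payload = nullptr) {
    std::vector<uint8_t> buf(sizeof(pkt) + (payload ? pkt.len : 0));
    std::memcpy(buf.data(), &pkt, sizeof(pkt));
    if (payload) std::memcpy(buf.data() + sizeof(pkt), payload, pkt.len);
    t_->Send(buf.data(), buf.size());
  }

 private:
  Transport* t_;  // not owned
};
```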
Iiinteresting. I'm actually also trying to extract the commands and data from VTA programs. I tried extracting the cmds and uops directly from the IR, but it's easier to just run the compiled program.
Right, I see. If it's C++, you could just register your own `DeviceAPI`.
Shall we make that architecture pattern a GitHub tracking issue? IMHO, the first goal would be to reproduce the simple VTA that we currently have. We are also lacking an architecture register definition so that we can drive hw, architecture, and performance simulators. I wonder if we can get Wind River interested in building an architecture stub for their SIMICS functional computer simulator so that we can boot OSes and drivers.
Quite possibly. There's definitely some common functionality here. Also, once the VTA Chisel port is ready, we'll need a way to streamline building TVM modules, extracting their data, pushing the memories into the (Chisel) simulator stack, and running the test harness. Right now I'm trying to get the data extraction rolling. If I understand correctly, you're working on getting VTA programs into your simulation cluster. It seems like the common bit of functionality is flat-packing the VTA module and design into a common format that can be shipped to an arbitrary simulator. In that case, the current sw sim, Vivado xsim, the eventual Treadle sim, and your setup can all consume the same underlying format. If the above is true, what beyond the initial memories are you trying to extract? It might be worth putting together a proper RFC to see what else people need.
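To make "flat-packing" concrete, here is one hypothetical shape such a container could take. None of this is an existing VTA format; the section names and layout are invented for illustration.

```cpp
// Hypothetical flat-pack writer for a compiled module's initial memories.
// Each section is written as [u32 name_len][name][u64 base][u64 size][bytes].
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Section {
  std::string name;            // e.g. "insn", "uop", "inp", "wgt"
  uint64_t base;               // load address in the accelerator's memory
  std::vector<uint8_t> bytes;  // initial contents
};

void WriteFlatPack(const std::string& path, const std::vector<Section>& secs) {
  std::ofstream os(path, std::ios::binary);
  for (const auto& s : secs) {
    const uint32_t name_len = static_cast<uint32_t>(s.name.size());
    const uint64_t size = s.bytes.size();
    os.write(reinterpret_cast<const char*>(&name_len), sizeof(name_len));
    os.write(s.name.data(), name_len);
    os.write(reinterpret_cast<const char*>(&s.base), sizeof(s.base));
    os.write(reinterpret_cast<const char*>(&size), sizeof(size));
    os.write(reinterpret_cast<const char*>(s.bytes.data()),
             static_cast<std::streamsize>(size));
  }
}
```

Any consumer (sw sim, xsim, Treadle, a remote cluster) would then only need a reader for this one layout.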
In addition to memory maps, data and program management, and notifications, there are two more dimensions you will need: 1) concurrency, and 2) failure handling and observability.
Regarding 1): the driver of a machine like a GPU or a tensor processor will need to manage concurrency between the CPU address space and the accelerator address space, as both the CPU and the remote accelerator might have multiple kernels running, possibly under a time-shared resource manager with multiple users. That concurrency will have timing complexity, as execution on either side can be long-lived and is likely to be protected by semaphores and/or monitors.

Regarding 2): you will need a mechanism to deal decisively with failures and observability. Numerical problems, such as division by zero, underflow/overflow, and NaN/NaR, need to be dealt with, as well as resource management problems such as deadlock, livelock, infinite loops, etc.

IMHO, what the DeviceAPI object needs to expand into is the definition of all these attributes of an execution machine for the language presented by the IR. The TVM release should ALWAYS include a full software emulation of that object so that folks can always run a full stack. When a driver finds a specific target that offers acceleration, it should connect to the accelerated hardware. It is typically not pleasant to rely on a real hw driver, like OpenGL or Vulkan, for the actual software, as the complexity of keeping that stuff consistent is very labor intensive.

We need to elevate this to a real discussion topic. This discussion is about the definition of what the 'virtual' accelerator's attributes need to be to execute the IR. DeviceAPI is slightly 'bigger' in scope, as it represents the least common denominator that unifies all the possible hardware accelerators.
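As a toy illustration of those two dimensions (assuming nothing about TVM itself): a monitor-guarded command queue for the concurrency side, with an explicit fault channel so that failures drain waiters decisively instead of leaving them hung. The `Fault` taxonomy simply mirrors the problems listed above.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

enum class Fault { kNone, kDivByZero, kOverflow, kNaN, kDeadlock, kTimeout };

template <typename Cmd>
class DeviceQueue {
 public:
  void Push(Cmd c) {
    std::lock_guard<std::mutex> lk(mu_);
    q_.push(std::move(c));
    cv_.notify_one();
  }

  // Blocks until a command arrives or a fault is raised; a fault wakes all
  // waiters and returns nullopt instead of letting callers hang forever.
  std::optional<Cmd> Pop() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return !q_.empty() || fault_ != Fault::kNone; });
    if (fault_ != Fault::kNone) return std::nullopt;
    Cmd c = std::move(q_.front());
    q_.pop();
    return c;
  }

  // Decisive failure: record the fault and wake everyone blocked on Pop().
  void RaiseFault(Fault f) {
    std::lock_guard<std::mutex> lk(mu_);
    fault_ = f;
    cv_.notify_all();
  }

  Fault fault() const {
    std::lock_guard<std::mutex> lk(mu_);
    return fault_;
  }

 private:
  mutable std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Cmd> q_;
  Fault fault_ = Fault::kNone;
};
```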
Yes, this should be done now before too much real discussion gets lost in this thread.
I think that the first dimension is something that can reasonably be expected of a `DeviceAPI`.

**Dimension 1: Concurrency**

Such functionality would be amazing to see and is certainly a logical goal. My only concern would be that the scheduler implementation is highly specific to each deployment (e.g., multi-tenancy on an F1 instance vs. time/user sharing of a device fleet in an on-prem datacenter) and would be better suited to a collection of externally maintained libraries. If anything, the core project would only have stubs that allow such extensions.

**Dimension 2: Device Control**

I agree that a DeviceAPI should be able to fully control the device.
Not quite sure I agree here. Part of the issue is that the TVM IR can be wildly different before and after scheduling for a given platform. Emulation wouldn't offer a reliable estimate of performance unless the program is fully lowered. In that case, you'd be stuck writing an emulator for each and every platform (which, indeed, is what we see with the bespoke VTA C++ simulator). I don't think that I fully understand the use case, but if the goal is simply to test the computation, then there's always LLVM/lli as the common denominator.
Would something in-band and generic like
Do you mean specifically for simulation, or in general?