
Impossible to make non-C++ DeviceApi #2515

Closed · nhynes opened this issue Jan 28, 2019 · 9 comments

nhynes (Member) commented Jan 28, 2019

It's possible to register new device APIs from other languages by registering a global device_api.my_device, but this breaks down because the runtime expects the actual implementation to be a C++ DeviceAPI with methods like AllocDataSpace. One solution is to create a "wrapper" DeviceAPI which calls out to globals like device_api.ext_dev.alloc_data_space.
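To make that concrete, here is a minimal sketch of such a wrapper. The class name ExtDevWrapper and the device_api.ext_dev.* global names are hypothetical, only the allocation hooks are shown, and the 2019-era DeviceAPI interface is assumed; the registered functions could be implemented in Rust, Python, etc.:

```cpp
#include <tvm/runtime/device_api.h>
#include <tvm/runtime/registry.h>

namespace tvm {
namespace runtime {

// Sketch: a DeviceAPI whose methods forward to PackedFuncs registered by a
// non-C++ frontend under agreed-upon global names. Names are hypothetical.
class ExtDevWrapper : public DeviceAPI {
 public:
  void* AllocDataSpace(TVMContext ctx, size_t nbytes, size_t alignment,
                       TVMType type_hint) final {
    const PackedFunc* f = Registry::Get("device_api.ext_dev.alloc_data_space");
    CHECK(f != nullptr) << "device_api.ext_dev.alloc_data_space not registered";
    // The foreign implementation only sees POD arguments across the C ABI.
    void* ptr = (*f)(ctx.device_id, static_cast<int64_t>(nbytes),
                     static_cast<int64_t>(alignment));
    return ptr;
  }

  void FreeDataSpace(TVMContext ctx, void* ptr) final {
    const PackedFunc* f = Registry::Get("device_api.ext_dev.free_data_space");
    CHECK(f != nullptr) << "device_api.ext_dev.free_data_space not registered";
    (*f)(ctx.device_id, ptr);
  }

  // SetDevice, GetAttr, CopyDataFromTo, StreamSync, ... would forward the
  // same way; they are elided from this sketch.
};

}  // namespace runtime
}  // namespace tvm
```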

tqchen (Member) commented Jan 31, 2019

I think we should put some thought into this. The main concern is that this will increase the size of the C++ runtime, which we otherwise would not need.

The minimalist nature of the runtime makes me wonder whether it is a good idea to mix languages on the runtime side, or whether we should simply restrict the runtime backend to one language. If we decide we want this, we should make it an optional dependency in the build. In terms of implementation, it may be better to use Module to expose all the functions.
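As a rough sketch of the Module-based route (ExtDevModuleNode is a hypothetical name, assuming the current ModuleNode interface), the runtime would look up one PackedFunc per operation instead of holding a C++ DeviceAPI pointer:

```cpp
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>

namespace tvm {
namespace runtime {

// Sketch: expose the device hooks as a Module. Each DeviceAPI-style
// operation becomes a named PackedFunc retrievable via GetFunction.
class ExtDevModuleNode : public ModuleNode {
 public:
  const char* type_key() const final { return "ext_dev"; }

  PackedFunc GetFunction(const std::string& name,
                         const std::shared_ptr<ModuleNode>& sptr_to_self) final {
    if (name == "alloc_data_space") {
      return PackedFunc([](TVMArgs args, TVMRetValue* rv) {
        // forward to the foreign-language allocator here
      });
    }
    return PackedFunc();  // unknown name: return a null PackedFunc
  }
};

}  // namespace runtime
}  // namespace tvm
```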

Ravenwater commented

Trying to get my brain around the request. The abstraction of a device is implemented as an abstract C++ class, with actual implementations provided for each environment. @nhynes, what are you trying to do that couldn't be supported by a Rust DeviceAPI class implementation that calls out to your Rust device API?

nhynes (Member, Author) commented Feb 1, 2019

what are you trying to do that couldn't be supported by a Rust DeviceAPI

(@Ravenwater) The issue is that one cannot make a device API from non-C++ languages, because creating one requires producing a pointer to a C++ DeviceAPI object, which is used within C++ (and isn't C-ABI-compatible POD). This PR just makes it possible to create a device API in any language (incl. Python).

Of course, what the PR does is independent of why it exists. The reason is that I wanted to write a VTA simulator in Rust (and eventually Scala) for easier integration testing. @tqchen is right, though: the runtime should be kept minimal. I'll batch these changes until either 1) the other languages become self-sufficient or 2) I accumulate enough of these changes to make a cohesive contrib library that can be hidden behind compiler flags.

I'll close this issue for now.

nhynes closed this as completed Feb 1, 2019
Ravenwater commented

@nhynes turns out that we have a lot in common! I am looking into how to connect our Golang simulation environment, which models our tensor processor, and masquerade it as a VTA widget.

Since our sim environment can run at scale, modeling hundreds of thousands of processors, we went with a gRPC/HTTPS interface that marshals the command and data streams of the processor. So I was planning to create a C++ stub that can create and manage that connection. This simplifies the integration point and decouples the build and run environments, keeping TVM/VTA pure C++ while being able to talk to a remote cluster of simulation capability.

I would think that is the way you want to go as well, particularly if you want to push the modeling into a Scala universe.

nhynes (Member, Author) commented Feb 1, 2019

Iiinteresting. I'm actually also trying to extract the commands and data from VTA programs. I tried extracting the cmds and uops directly from the IR, but it's easier to just run the compiled program.

C++ stub that can create and manage that connection

Right, I see. If it's C++, you could just register your own device_api.ext_dev and replace the VTA simulator with your own high-performance, distributed implementation. It's an extra layer of indirection, but it keeps TVM simple and focused, so that makes complete sense.
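For reference, the registration itself is just a global under device_api.&lt;device&gt; that hands the runtime a DeviceAPI* handle. A minimal sketch, assuming a hypothetical gRPC-backed subclass MyClusterDeviceAPI:

```cpp
#include <tvm/runtime/device_api.h>
#include <tvm/runtime/registry.h>

using namespace tvm::runtime;

// Hypothetical DeviceAPI subclass that marshals AllocDataSpace,
// CopyDataFromTo, StreamSync, etc. over a gRPC connection to the
// remote simulation cluster (overrides elided from this sketch).
class MyClusterDeviceAPI final : public DeviceAPI {
  // ... forward each method over the wire ...
};

// The runtime resolves "device_api.ext_dev" when a module targets the
// ext_dev device type and expects a DeviceAPI* handle back.
TVM_REGISTER_GLOBAL("device_api.ext_dev")
.set_body([](TVMArgs args, TVMRetValue* rv) {
  static MyClusterDeviceAPI inst;
  *rv = static_cast<void*>(&inst);
});
```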

Ravenwater commented

Shall we make that architecture pattern a GitHub tracking issue? IMHO, the first goal would be to reproduce the simple VTA that we currently have. We are also lacking an architecture register definition so that we can drive hw, architecture, and performance simulators. I wonder if we can get Wind River interested in building an architecture stub for their SIMICS functional computer simulator so that we can boot OSes and drivers.

nhynes (Member, Author) commented Feb 2, 2019

architecture pattern a GitHub tracking issue

Quite possibly. There's definitely some common functionality here. Also, once the VTA Chisel port is ready, we'll need a way to streamline building TVM modules, extracting their data, pushing the memories into the (Chisel) simulator stack, and running the test harness.

Right now I'm trying to get the data extraction rolling. If I understand correctly, you're working on getting VTA programs into your simulation cluster. It seems like the common bit of functionality is flat-packing the VTA module and design into a common format that can be shipped to an arbitrary simulator. In that case, the current sw sim, Vivado xsim, the eventual Treadle sim, and your setup can all consume the same underlying format.

If the above is true, what beyond the initial memories are you trying to extract? It might be worth putting together a proper RFC to see what else people need.

Ravenwater commented

In addition to memory maps, data and program management, and notifications, there are two more dimensions you will need:

  1. concurrency between the CPU and the accelerator
  2. interrupts/panics/debug/single-step control

Regarding 1), the driver of a machine like a GPU or a tensor processor will need to manage concurrency between the CPU address space and the accelerator address space, as both the CPU and the remote accelerator might have multiple kernels running, possibly within a time-shared resource manager and multiple users. That concurrency has timing complexity, as the execution on either side can be long-lived and is likely to be protected by semaphores and/or monitors.

Regarding 2), you will need a mechanism to deal decisively with failures and observability. Numerical problems, such as division by zero, underflow/overflow, and NaN/NaR, need to be dealt with, as do resource management problems such as deadlock, livelock, infinite loops, etc.

IMHO, what the DeviceAPI object needs to expand into is the definition of all these attributes of an execution machine for the language presented by the IR. The TVM release should ALWAYS include a full software emulation of that object so that folks can always run a full stack. When a driver finds a specific target that offers acceleration, it should connect to the accelerated hardware.

It is typically not pleasant to rely on a real hw driver, like OpenGL or Vulkan, for the actual software, as keeping that stuff consistent is very labor-intensive.

We need to elevate this to a real discussion topic. This discussion amounts to defining what the 'virtual' accelerator's attributes need to be in order to execute the IR. DeviceAPI is slightly 'bigger' in scope, as it represents the least common denominator that unifies all possible hardware accelerators.

nhynes (Member, Author) commented Feb 3, 2019

We need to elevate this to a real discussion topic

Yes, this should be done now before too much real discussion gets lost in this thread.

there are two more dimensions you will need

I think that the first dimension is something that can reasonably be expected of a DeviceAPI, since synchronization comes up in any context where (host cpu) != (target cpu). I'm not so sure about the second, though.

Dimension 1: Concurrency

multiple kernels running, possibly within a time-shared resource manager and multiple users

Such functionality would be amazing to see and is certainly a logical goal. My only concern would be that the scheduler implementation is highly specific to each deployment (e.g., multi-tenancy on an F1 instance vs time/user sharing a device fleet in an on-prem datacenter) and would be more suited to a collection of externally maintained libraries. If anything, the core project would only have stubs that allow such extensions.

Dimension 2: Device Control

I agree that a DeviceAPI should be able to fully control the device.

include a full software emulation of that object

Not quite sure I agree here. Part of the issue is that the TVM IR can be wildly different before and after scheduling for a given platform. Emulation wouldn't offer a reliable estimate of performance unless the program is fully lowered, in which case you'd be stuck writing an emulator for each and every platform (and, indeed, that is what we see with the bespoke VTA C++ simulator). I don't think I fully understand the use case, but if the goal is simply to test the computation, then there's always LLVM/lli as the common denominator.

deal decisively with failures and observability

Would something in-band and generic like tf.add_check_numerics_ops not suffice? If so, then this would probably fall under the purview of concurrency and synchronization.

not pleasant to rely on a real hw driver

Do you mean specifically for simulation or in general?
