
Accelerated inference #147

Merged
merged 37 commits into from
Aug 30, 2022

Conversation

alecgunny
Collaborator

@alecgunny alecgunny commented Aug 18, 2022

Implementing batched inference to increase throughput

  • Triton stateful backend hermes#21
  • Perform response concatenation in client callback thread
  • Add inference_rate kwarg to control sleep between requests
  • Braid requests from concurrent sequences to increase overall request rate (see the sketch after this list)
  • Attempt to use TensorRT (both FP32 and FP16 precision)
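
A generic sketch of the rate-control and "braiding" items above (none of these names come from the hermes client; braid, inference_rate, and the dummy sequences are purely illustrative):

# Minimal sketch, not the hermes client API: interleave ("braid") requests
# from several concurrent sequences and throttle the overall request rate.
import time
from typing import Iterator, Sequence

def braid(sequences: Sequence[Iterator], inference_rate: float) -> Iterator:
    """Yield (sequence_id, request) pairs round-robin across sequences,
    sleeping 1 / inference_rate between requests to cap the request rate."""
    for requests in zip(*sequences):  # one request from each sequence per cycle
        for seq_id, request in enumerate(requests):
            yield seq_id, request
            time.sleep(1 / inference_rate)

# usage with dummy requests: three sequences, capped at ~100 requests per second
seqs = [iter(range(10)), iter(range(100, 110)), iter(range(200, 210))]
for seq_id, request in braid(seqs, inference_rate=100):
    ...  # send `request` to the inference server, tagged with `seq_id`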

This was linked to issues Aug 18, 2022
@alecgunny alecgunny added this to the Pipeline Iteration 2 milestone Aug 18, 2022
@alecgunny
Collaborator Author

TensorRT export/inference works, but there needs to be close coordination with the versions of both TensorRT and CUDA that the corresponding Triton container is running. I have the TRT version currently hard-pinned to the version of TRT used by the container on LDG (8.2.3), but the CUDA version is trickier.

You have to run the export with a version of CUDA <11.7 to be compatible with the version used in the container (I was able to get 11.2 to work, but I think 11.6 should as well). This is tricky because the CUDA version is essentially controlled by the LD_LIBRARY_PATH environment variable (which you point at the directory of the CUDA libraries you want to use), and that variable has to be set before you do any imports, which precludes passing it as e.g. a command line arg. That isn't a big deal if you're running export as a one-off, but it's a bigger problem when you want to automate the pipeline.

One solution here would be to build this in at the pinto level, since pinto can set the environment variable and then call the script as a subprocess. I do get the impression that this is something we're going to want more control over in all these projects going forward, just because of how brittle TensorRT is to these things. I can envision adding a cuda_version entry to the tool.pinto table of a project's pyproject.toml that, if it's set, will have pinto set LD_LIBRARY_PATH to start with "/usr/local/cuda-{cuda_version}/lib64", or raise an error if that path doesn't exist (sketched below).

(FWIW, the way this works right now on LDG is that they keep multiple versions of CUDA available, but link the latest one to /usr/local/cuda which is used as the first search location in LD_LIBRARY_PATH).
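
A rough sketch of how that pinto behavior could look (hypothetical: the cuda_version key and this helper don't exist in pinto; the toml package, key name, function name, and export.py script are assumptions for illustration):

# Hypothetical sketch of the proposed behavior: read a cuda_version entry from
# [tool.pinto] in pyproject.toml, prepend the matching CUDA lib directory to
# LD_LIBRARY_PATH, and run the command as a subprocess with that environment.
import os
import subprocess
import sys

import toml  # assumes the `toml` package is available

def run_with_cuda(pyproject_path: str, command: list) -> None:
    config = toml.load(pyproject_path)
    cuda_version = config.get("tool", {}).get("pinto", {}).get("cuda_version")

    env = os.environ.copy()
    if cuda_version is not None:
        lib_dir = f"/usr/local/cuda-{cuda_version}/lib64"
        if not os.path.isdir(lib_dir):
            raise ValueError(f"No CUDA libraries found at {lib_dir}")
        # prepend so these libraries get found before any system default
        env["LD_LIBRARY_PATH"] = lib_dir + ":" + env.get("LD_LIBRARY_PATH", "")

    # the variable is set before the child process performs any imports,
    # which setting it from inside an already-running script can't do
    subprocess.run(command, env=env, check=True)

# e.g. run a (hypothetical) export script with the pinned CUDA libraries
run_with_cuda("pyproject.toml", [sys.executable, "export.py"])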

@alecgunny
Collaborator Author

Had to fix an issue with pinto where importing torch was throwing off lower-priority entries in the library search path. This is merged as of ML4GW/pinto#21, so once that container has been pushed by the workflow this will be good to retest.

@alecgunny
Collaborator Author

Marking as ready for review now that tests are passing. Notable changes:

  1. Including TensorRT dependency in export project and setting that as the default inference platform in projects/sandbox/pyproject.toml. Defaults to using FP16 inference when TensorRT is selected.
  2. Whitening transform has been updated to better support TensorRT. In particular, slice indices and other small scalar constants are no longer parameters of the WhiteningTransform module and are just baked into the ONNX graph at export time. The slice indices in particular were causing a problem because TensorRT doesn't support dynamic slicing (i.e. slicing with indices that are functions of some other input); it only supports slicing with constants. The main reasons to keep the time domain filter as a parameter are that a) it gets saved on export and b) calling WhiteningTransform.to moves it to the appropriate device. I don't include the window as a parameter either, since it's really just a vector constant in the graph and isn't fit to the data at all. (See the sketch after this list.)
  3. Supporting batched snapshotter updates at both export and inference time, and including tests for different batch sizes.
  4. Implementing the same prior_file.is_abs() check in generate_waveforms as we do in timeslide_injections for consistency.
  5. hermes submodule now points to my local repo. Figured it's not worth holding this up until I figure out how to implement the aggregator model using Triton implicit state, so developers will need to add my repo as a remote for the time being.
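
A toy version of the arrangement described in item 2 (this is not the actual WhiteningTransform; the module below is just a sketch of which pieces are parameters versus constants baked into the exported graph):

# Sketch only: the time domain filter is a (non-trainable) parameter, so it is
# saved with the model and follows .to(device); the window and slice indices
# are constants that get baked into the ONNX graph at export time.
import torch

class ToyWhiten(torch.nn.Module):
    def __init__(self, num_channels: int, kernel_size: int, crop: int):
        super().__init__()
        self.time_domain_filter = torch.nn.Parameter(
            torch.randn(num_channels, 1, kernel_size), requires_grad=False
        )
        # the window is just a vector constant in the graph, not a fit parameter
        self.register_buffer("window", torch.hann_window(kernel_size), persistent=False)
        # plain Python ints are traced as constants, so TensorRT sees static slices
        self.crop = crop
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # window the filter, then convolve each channel with its own filter
        tdf = self.time_domain_filter * self.window
        x = torch.nn.functional.conv1d(x, tdf, groups=x.shape[1], padding=self.pad)
        # static slice: crop is a constant at export time, which TensorRT supports
        return x[:, :, self.crop : -self.crop]

model = ToyWhiten(num_channels=2, kernel_size=8, crop=4)
torch.onnx.export(model, torch.randn(1, 2, 4096), "toy_whiten.onnx", opset_version=13)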

@alecgunny alecgunny marked this pull request as ready for review August 30, 2022 04:05
Comment on lines +146 to +156
# TODO: hardcoding these kwargs for now, but worth
# thinking about a more robust way to handle this
kwargs = {}
if platform == qv.Platform.ONNX:
    kwargs["opset_version"] = 13

    # turn off graph optimization because of this error
    # https://github.com/triton-inference-server/server/issues/3418
    bbhnet.config.optimization.graph.level = -1
elif platform == qv.Platform.TENSORRT:
    kwargs["use_fp16"] = True
Collaborator

Maybe the **kwargs of the main function are assumed to be platform-specific variables? I'm not sure if typeo has the ability to parse variables that aren't directly in the function signature into **kwargs.

Collaborator Author

Unfortunately it doesn't right now, since it relies on annotations to infer the type the parser needs to map to. A few markup parsers (yaml, toml) have functionality for applying heuristics that we could maybe lift, but no idea how complicated that might turn out to be. Off the top of my head, the real headaches would be

  • reimplementing argparse's logic in a less structured format for iterating through the unused arguments and grouping them into (arg_name, value) pairs, accounting for booleans, which would just be formatted as --flag-name with no associated values (see the sketch below this list)
  • handling arguments which are misspelled, silently giving you their default behavior (if they don't have a default, this will luckily raise a missing argument error)
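
For concreteness, a naive sketch of that first bullet (not typeo code; the grouping function and example flags are made up). It also shows the underlying type problem: without annotations, every parsed value comes back as a string:

# Hypothetical sketch: group leftover command-line tokens into (arg_name, value)
# pairs, treating bare --flags as booleans. Values stay as strings because there
# is no annotation to tell the parser what type to map them to.
import argparse

def group_extras(extras):
    kwargs = {}
    name = None
    for token in extras:
        if token.startswith("--"):
            # new flag; if the previous flag never got a value, treat it as boolean
            if name is not None:
                kwargs[name] = True
            name = token.lstrip("-").replace("-", "_")
        elif name is not None:
            kwargs[name] = token
            name = None
        else:
            raise ValueError(f"Value '{token}' has no preceding flag")
    if name is not None:
        kwargs[name] = True
    return kwargs

parser = argparse.ArgumentParser()
parser.add_argument("--platform", type=str)
args, extras = parser.parse_known_args(
    ["--platform", "tensorrt", "--use-fp16", "--opset-version", "13"]
)
print(group_extras(extras))  # {'use_fp16': True, 'opset_version': '13'}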

Collaborator

@EthanMarx EthanMarx left a comment

That LD_LIBRARY_PATH issue was giving me headaches; I ended up adding my base conda libs directory to it in my bash_profile. Glad that it's solved. Looks cool, excited to see it in action.

@alecgunny alecgunny merged commit 7b0e31b into ML4GW:main Aug 30, 2022
@alecgunny alecgunny deleted the accelerated-inference branch August 30, 2022 21:40

Successfully merging this pull request may close these issues.

Start batching inference requests
Use TensorRT in model export