
Accelerated inference #147

Merged
merged 37 commits into from
Aug 30, 2022

Conversation

alecgunny
Collaborator

@alecgunny alecgunny commented Aug 18, 2022

Implementing batched inference to increase throughput

  • Triton stateful backend hermes#21
  • Perform response concatenation in client callback thread
  • Add inference_rate kwarg to control sleep between requests
  • Braid requests from concurrent sequences to increase overall request rate (see the sketch after this list)
  • Attempt to use TensorRT (both FP32 and FP16 precision)
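
A generic sketch of the rate-control and "braiding" items above (none of these names come from the hermes client; braid, inference_rate, and the dummy sequences are purely illustrative):

# Minimal sketch, not the hermes client API: interleave ("braid") requests
# from several concurrent sequences and throttle the overall request rate.
import time
from typing import Iterator, Sequence

def braid(sequences: Sequence[Iterator], inference_rate: float) -> Iterator:
    """Yield (sequence_id, request) pairs round-robin across sequences,
    sleeping 1 / inference_rate between requests to cap the request rate."""
    for requests in zip(*sequences):  # one request from each sequence per cycle
        for seq_id, request in enumerate(requests):
            yield seq_id, request
            time.sleep(1 / inference_rate)

# usage with dummy requests: three sequences, capped at ~100 requests per second
seqs = [iter(range(10)), iter(range(100, 110)), iter(range(200, 210))]
for seq_id, request in braid(seqs, inference_rate=100):
    ...  # send `request` to the inference server, tagged with `seq_id`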

This was linked to issues Aug 18, 2022
@alecgunny alecgunny added this to the Pipeline Iteration 2 milestone Aug 18, 2022
@alecgunny
Collaborator Author

TensorRT export/inference works, but there needs to be close coordination with the versions of both TensorRT and CUDA that the corresponding Triton container is running. I have the TRT version currently hard-pinned to the version of TRT used by the container on LDG (8.2.3), but the CUDA version is trickier.

You have to run the export with a version of CUDA <11.7 to be compatible with the version used in the container (I was able to get 11.2 to work, but I think 11.6 should as well). This is tricky because the CUDA version is essentially controlled by the LD_LIBRARY_PATH environment variable (which you point at the directory of the CUDA libraries you want to use), and that variable has to be set before you do any imports, which precludes passing it as e.g. a command line arg. That isn't a big deal if you're running export as a one-off, but it's a bigger problem when you want to automate the pipeline.

One solution here would be to build this in at the pinto level, since pinto can set the environment variable and then call the script as a subprocess. I do get the impression that this is something we're going to want more control over in all these projects going forward, just because of how brittle TensorRT is to these things. I can envision adding a cuda_version entry to the tool.pinto table of a project's pyproject.toml that, if it's set, will have pinto set LD_LIBRARY_PATH to start with "/usr/local/cuda-{cuda_version}/lib64", or raise an error if that path doesn't exist (sketched below).

(FWIW, the way this works right now on LDG is that they keep multiple versions of CUDA available, but link the latest one to /usr/local/cuda which is used as the first search location in LD_LIBRARY_PATH).
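
A rough sketch of how that pinto behavior could look (hypothetical: the cuda_version key and this helper don't exist in pinto; the toml package, key name, function name, and export.py script are assumptions for illustration):

# Hypothetical sketch of the proposed behavior: read a cuda_version entry from
# [tool.pinto] in pyproject.toml, prepend the matching CUDA lib directory to
# LD_LIBRARY_PATH, and run the command as a subprocess with that environment.
import os
import subprocess
import sys

import toml  # assumes the `toml` package is available

def run_with_cuda(pyproject_path: str, command: list) -> None:
    config = toml.load(pyproject_path)
    cuda_version = config.get("tool", {}).get("pinto", {}).get("cuda_version")

    env = os.environ.copy()
    if cuda_version is not None:
        lib_dir = f"/usr/local/cuda-{cuda_version}/lib64"
        if not os.path.isdir(lib_dir):
            raise ValueError(f"No CUDA libraries found at {lib_dir}")
        # prepend so these libraries get found before any system default
        env["LD_LIBRARY_PATH"] = lib_dir + ":" + env.get("LD_LIBRARY_PATH", "")

    # the variable is set before the child process performs any imports,
    # which setting it from inside an already-running script can't do
    subprocess.run(command, env=env, check=True)

# e.g. run a (hypothetical) export script with the pinned CUDA libraries
run_with_cuda("pyproject.toml", [sys.executable, "export.py"])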

@alecgunny
Collaborator Author

Had to fix an issue with pinto where importing torch was throwing off lower-priority entries in the library search path. This is merged as of ML4GW/pinto#21, so once that container has been pushed by the workflow this will be good to retest.

@alecgunny
Collaborator Author

Marking as ready for review now that tests are passing. Notable changes:

  1. Including TensorRT dependency in export project and setting that as the default inference platform in projects/sandbox/pyproject.toml. Defaults to using FP16 inference when TensorRT is selected.
  2. Whitening transform has been updated to better support TensorRT. In particular, slice indices and other small scalar constants are no longer parameters of the WhiteningTransform module and are just baked into the ONNX graph at export time. The slice indices in particular were causing a problem because TensorRT doesn't support dynamic slicing (i.e. slicing with indices that are functions of some other input); it only supports slicing with constants. The main reasons to keep the time domain filter as a parameter are that a) it gets saved on export and b) calling WhiteningTransform.to moves it to the appropriate device. I don't include the window as a parameter either, since it's really just a vector constant in the graph and isn't fit to the data at all. (See the sketch after this list.)
  3. Supporting batched snapshotter updates at both export and inference time, and including tests for different batch sizes.
  4. Implementing the same prior_file.is_abs() check in generate_waveforms as we do in timeslide_injections for consistency.
  5. hermes submodule now points to my local repo. Figured it's not worth holding this up until I figure out how to implement the aggregator model using Triton implicit state, so developers will need to add my repo as a remote for the time being.
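
A toy version of the arrangement described in item 2 (this is not the actual WhiteningTransform; the module below is just a sketch of which pieces are parameters versus constants baked into the exported graph):

# Sketch only: the time domain filter is a (non-trainable) parameter, so it is
# saved with the model and follows .to(device); the window and slice indices
# are constants that get baked into the ONNX graph at export time.
import torch

class ToyWhiten(torch.nn.Module):
    def __init__(self, num_channels: int, kernel_size: int, crop: int):
        super().__init__()
        self.time_domain_filter = torch.nn.Parameter(
            torch.randn(num_channels, 1, kernel_size), requires_grad=False
        )
        # the window is just a vector constant in the graph, not a fit parameter
        self.register_buffer("window", torch.hann_window(kernel_size), persistent=False)
        # plain Python ints are traced as constants, so TensorRT sees static slices
        self.crop = crop
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # window the filter, then convolve each channel with its own filter
        tdf = self.time_domain_filter * self.window
        x = torch.nn.functional.conv1d(x, tdf, groups=x.shape[1], padding=self.pad)
        # static slice: crop is a constant at export time, which TensorRT supports
        return x[:, :, self.crop : -self.crop]

model = ToyWhiten(num_channels=2, kernel_size=8, crop=4)
torch.onnx.export(model, torch.randn(1, 2, 4096), "toy_whiten.onnx", opset_version=13)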

@alecgunny alecgunny marked this pull request as ready for review August 30, 2022 04:05
Comment on lines +146 to +156
# TODO: hardcoding these kwargs for now, but worth
# thinking about a more robust way to handle this
kwargs = {}
if platform == qv.Platform.ONNX:
    kwargs["opset_version"] = 13

    # turn off graph optimization because of this error
    # https://github.com/triton-inference-server/server/issues/3418
    bbhnet.config.optimization.graph.level = -1
elif platform == qv.Platform.TENSORRT:
    kwargs["use_fp16"] = True
Collaborator

Maybe the **kwargs of the main function are assumed to be platform-specific variables? I'm not sure if typeo has the ability to parse variables that aren't directly in the function signature into **kwargs.

Collaborator Author

Unfortunately it doesn't right now, since it relies on annotations to infer the type the parser needs to map to. A few markup parsers (yaml, toml) have functionality for applying heuristics that we could maybe lift, but no idea how complicated that might turn out to be. Off the top of my head, the real headaches would be

  • reimplementing argparse's logic in a less structured format for iterating through the unused arguments and grouping them into (arg_name, value) pairs, accounting for booleans, which would just be formatted as --flag-name with no associated values (see the sketch below this list)
  • handling arguments which are misspelled, silently giving you their default behavior (if they don't have a default, this will luckily raise a missing argument error)
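
For concreteness, a naive sketch of that first bullet (not typeo code; the grouping function and example flags are made up). It also shows the underlying type problem: without annotations, every parsed value comes back as a string:

# Hypothetical sketch: group leftover command-line tokens into (arg_name, value)
# pairs, treating bare --flags as booleans. Values stay as strings because there
# is no annotation to tell the parser what type to map them to.
import argparse

def group_extras(extras):
    kwargs = {}
    name = None
    for token in extras:
        if token.startswith("--"):
            # new flag; if the previous flag never got a value, treat it as boolean
            if name is not None:
                kwargs[name] = True
            name = token.lstrip("-").replace("-", "_")
        elif name is not None:
            kwargs[name] = token
            name = None
        else:
            raise ValueError(f"Value '{token}' has no preceding flag")
    if name is not None:
        kwargs[name] = True
    return kwargs

parser = argparse.ArgumentParser()
parser.add_argument("--platform", type=str)
args, extras = parser.parse_known_args(
    ["--platform", "tensorrt", "--use-fp16", "--opset-version", "13"]
)
print(group_extras(extras))  # {'use_fp16': True, 'opset_version': '13'}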

Collaborator

@EthanMarx EthanMarx left a comment

That LD_LIBRARY_PATH issue was giving me headaches; I ended up adding my base conda libs directory to it in my bash_profile. Glad that it's solved. Looks cool, excited to see it in action.

@alecgunny alecgunny merged commit 7b0e31b into ML4GW:main Aug 30, 2022
@alecgunny alecgunny deleted the accelerated-inference branch August 30, 2022 21:40

Successfully merging this pull request may close these issues.

Start batching inference requests
Use TensorRT in model export