
Add event tracing and ETDumps to executor_runner #5027

Open
wants to merge 13 commits into base: main

Conversation

benkli01
Collaborator

@benkli01 benkli01 commented Sep 2, 2024

  • Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
  • Add flag 'etdump_path' to specify the file path for the ETDump file
  • Add flag 'num_executions' for number of iterations to run
  • Create and pass event tracer 'ETDumpGen'
  • Save ETDump to disk
  • Update docs to reflect the changes

Re-upload of #4502 to discuss with @GregoryComer.


pytorch-bot bot commented Sep 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5027

Note: Links to docs will display an error until the docs builds have been completed.

❌ 31 New Failures

As of commit fde5862 with merge base 01d526f:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) on Sep 2, 2024
@benkli01
Collaborator Author

benkli01 commented Sep 2, 2024

@pytorchbot label 'partner: arm'

@pytorch-bot added the partner: arm label (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm) on Sep 2, 2024
@benkli01
Collaborator Author

benkli01 commented Sep 2, 2024

@pytorchbot label ciflow/trunk


pytorch-bot bot commented Sep 2, 2024

Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.

@benkli01
Collaborator Author

benkli01 commented Sep 4, 2024

Hi @GregoryComer. Would it be possible to run the CI on your side to see if the issue from the previous PR is still occurring? I'm having a hard time understanding where this comes from.

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

2 similar comments

- Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
- Add flag 'etdump_path' to specify the file path for the ETDump file
- Add flag 'num_executions' for number of iterations to run
- Create and pass event tracer 'ETDumpGen'
- Save ETDump to disk
- Update docs to reflect the changes

Signed-off-by: Benjamin Klimczak <benjamin.klimczak@arm.com>
Change-Id: I7e8e8b7f21453bb8d88fa2b9c2ef66c532f3ea46
@benkli01 force-pushed the add-profiling-to-xnn-executor-runner-2 branch from 3288eda to b09d09e on September 23, 2024
@benkli01
Collaborator Author

Hi @dbort. Sorry for dragging you into this, but I saw your comment on EXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT in the code, so I thought you might be able to help with resolving the failing test here. Any idea how to fix this?

@digantdesai
Contributor

I don't see a CI failure anymore

@benkli01
Collaborator Author

I don't see a CI failure anymore

@digantdesai To me pull / test-llama-runner-qnn-linux (fp32, cmake, qnn) / linux-job (pull_request) is showing up as failing after my latest update. The CI run for the previous version you imported did not finish for me, i.e. I could not see any results, but it did not seem to have this test included anyway.

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai
Contributor

Yeah, I see: it comes from the main CMakeLists.txt, and the qnn build does have EXECUTORCH_ENABLE_EVENT_TRACER=ON

@digantdesai
Contributor

Any update on this?

@benkli01
Collaborator Author

Hi @digantdesai, I'm still hoping for some pointers from @dbort or you, as I'm struggling to reproduce it locally and can't really make sense of the error.

@benkli01 added the release notes: examples label (Changes to any of our example LLMs integrations, such as Llama3 and Llava) on Nov 25, 2024
@freddan80
Collaborator

@digantdesai, will you have a look at this one, since it touches code outside the Arm delegate? Thx!

@cccclai
Contributor

cccclai commented Nov 30, 2024

The error shows up when running this script https://github.com/pytorch/executorch/blob/main/backends/qualcomm/scripts/build.sh based on the log.

If you have a linux machine, can you follow https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html and see if the script fails?

@benkli01
Collaborator Author

benkli01 commented Dec 2, 2024

@cccclai I finally managed to reproduce the issue by running script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2 (not my first guess). If I remove the parameter entirely, defaulting to --job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.

@cccclai
Contributor

cccclai commented Dec 4, 2024

@cccclai I finally managed to reproduce the issue by running script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2 (not my first guess). If I remove the parameter entirely, defaulting to --job_number 16, the issue disappears (not sure this is an acceptable solution and/or would work in the CI). I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.

Ah I remember that. @dbort and @Olivia-liu, any thought on this?

Contributor

@dbort left a comment


Sorry that I missed your mentions of me for so long! Thanks @cccclai for pointing me to this.

Just to double check, is this the error you're seeing?

https://github.com/pytorch/executorch/actions/runs/12161392244/job/33915905755#step:14:1039

gmake[2]: *** No rule to make target '../third-party/flatcc/lib/libflatccrt.a', needed by 'executor_runner'.  Stop.

I think I remember @Olivia-liu looking into this and working around it by building in release mode (hence -DCMAKE_BUILD_TYPE=Release), which is still something we need to figure out. Olivia, do you know if there was a GitHub issue tracking this problem?


Result<Method> method = program->load_method(method_name, &memory_manager);
EventTracer* event_tracer_ptr = nullptr;
#ifdef ET_EVENT_TRACER_ENABLED
Contributor


This function is already so long and complex, I'd like to factor out these ifdefs if possible.

You could create a class to encapsulate the event tracing, like

class TraceManager {
 public:
  TraceManager();
  EventTracer* get_event_tracer();
  Error write_etdump_to_file(const char* filename);
};

If tracing is enabled, the ctor could create the ETDump (a field), get_event_tracer can return a pointer to it, and write_etdump_to_file can open the file and write the contents. If tracing is disabled, the class is basically empty, returning a null tracer and just returning Error::NotSupported when asked to write.

Then main() can say

TraceManager tracer;
program->load_method(..., tracer.get_event_tracer());
...
if (tracer.get_event_tracer() != nullptr) {
  status = tracer.write_etdump_to_file(FLAGS_etdump_path.c_str());
  ET_CHECK_MSG(status == Error::Ok, ...);
}

Collaborator Author


Good idea. Let me know if the implementation looks ok.

Resolved (outdated) review threads on:
examples/portable/executor_runner/executor_runner.cpp (3 threads)
backends/xnnpack/CMakeLists.txt
CMakeLists.txt
docs/source/tutorial-xnnpack-delegate-lowering.md
@dbort
Contributor

dbort commented Dec 5, 2024

Also for what it's worth, I'm trying to merge dvidelabs/flatcc#306 into upstream flatcc to let us remove -DEXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT and similar hacks

@benkli01
Collaborator Author

benkli01 commented Dec 5, 2024

gmake[2]: *** No rule to make target '../third-party/flatcc/lib/libflatccrt.a', needed by 'executor_runner'.  Stop.

Thanks @dbort , this is the error exactly.

I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release. As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.

@dbort
Contributor

dbort commented Dec 7, 2024

I'm not sure about the workaround to use release mode. The command used in the ci script is already using --release.

Ok, thanks for looking into that.

As mentioned above, removing --num_jobs 2 seems to work locally, but it's a strange workaround and might not work in CI.

That means that there's some kind of race condition.

Based on your PR, it looks like executor_runner has a proper dependency on libflatccrt; otherwise I would have expected a use-without-dependency situation.

...except maybe it doesn't. This PR adds a dep on "${FLATCCRT_LIB}" from the top-level CMakeLists.txt, but when I look for a place that sets FLATCCRT_LIB I only see it in the cmake config file at

if(CMAKE_BUILD_TYPE MATCHES "Debug")
  set(FLATCCRT_LIB flatccrt_d)
else()
  set(FLATCCRT_LIB flatccrt)
endif()

afaik, that config isn't included in the top-level cmake system. That file is used to point to an already-built version of the core ET libs from external projects, like

find_package(executorch CONFIG REQUIRED)

If FLATCCRT_LIB is empty in this PR at https://github.com/pytorch/executorch/pull/5027/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR817-R819

list(APPEND _executor_runner_libs etdump ${FLATCCRT_LIB})

then executor_runner wouldn't properly depend on libflatccrt.a. A parallel build could cause that lib to be coincidentally built earlier with -j16, "fixing" the problem, while -j2 would be less likely to do so.

Could you try printing the value of FLATCCRT_LIB from the top-level CMakeLists.txt to see if it's empty?
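For reference, a one-line configure-time debug print (placed in the top-level CMakeLists.txt near where the variable is used) could look like this:

```cmake
# Prints the value, or an empty string if the variable is unset.
message(STATUS "FLATCCRT_LIB = '${FLATCCRT_LIB}'")
```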

@dbort
Contributor

dbort commented Dec 7, 2024

Though theoretically executor_runner shouldn't even need to know about libflatccrt: it should inherit the dep from the PUBLIC section of

target_link_libraries(
  etdump
  PUBLIC etdump_schema flatccrt
  PRIVATE executorch
)

But in this case, you could try updating this PR to use flatccrt as the dep instead of using ${FLATCCRT_LIB}. Even if FLATCCRT_LIB were defined, I think it's actually wrong to use it as the dep name -- I believe that the target is always called flatccrt even if, in debug mode, the file that it generates is called libflatccrt_d.a.
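Concretely, the suggested change to the PR's list in the top-level CMakeLists.txt would be something like the following sketch (_executor_runner_libs is the variable from the PR diff linked above):

```cmake
# Depend on the flatccrt target directly instead of the FLATCCRT_LIB
# variable, which may be empty outside the export config file.
list(APPEND _executor_runner_libs etdump flatccrt)
```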

@benkli01
Collaborator Author

Hi @dbort. I tried fixing the flatccrt dependency as suggested, but without any effect:

  1. Replace ${FLATCCRT_LIB} with flatccrt
  2. Remove flatccrt dependency completely and rely on inherited dependency from etdump

(I did this both in CMakeLists.txt and examples/qualcomm/executor_runner/CMakeLists.txt)

I did find a new workaround though, which should be more stable than just removing --num_jobs 2:
In .ci/scripts/build-qnn-sdk.sh, run the command to build the QNN SDK twice, first with a clean build and then without cleaning.

I feel like the flatccrt issue is not related to my change so I will open an issue for it. I can push the above workaround in a separate PR. I will be off from tomorrow until next year, but I really hope we can find a solution together to get this PR merged.

- Raise a CMake error if event tracing is enabled without the devtools
- Re-factoring of the changes in the portable executor_runner
- Minor fix in docs

Change-Id: Ia50fef8172f678f9cbe2b33e2178780ff983f335
Signed-off-by: Benjamin Klimczak <benjamin.klimczak@arm.com>
Collaborator Author

@benkli01 left a comment


Thanks for the review! All issues should be fixed now.

Resolved (outdated) review threads on:
examples/portable/executor_runner/executor_runner.cpp (3 threads)
docs/source/tutorial-xnnpack-delegate-lowering.md
CMakeLists.txt
backends/xnnpack/CMakeLists.txt