Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(trim-paths): trim SO and DW_AT_comp_dir symbols for root DI node #118518

Closed
wants to merge 3 commits into from

Conversation

weihanglo
Copy link
Member

@weihanglo weihanglo commented Dec 1, 2023

This is one way to fix #117652.

The issue

When --remap-path-scope=object is specified, user expect that there is
no local path embedded in final executables. Under object scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
split-debuginfo=packed|unpacked is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work.

  • On Linux, the root DW_AT_comp_dir of a compile unit seems to go into the binary executables.
  • On macOS, SO symbols are embedded in binary executables and libraries regardless a split-debuginfo file is built. Each SO symbol contains a path to the root source file of a debug info compile unit.

The approach

Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of unsplit-debuginfo
is present, as if it is kinda a "unsplit" debuginfo.

Unresolved issue

@rustbot
Copy link
Collaborator

rustbot commented Dec 1, 2023

r? @b-naber

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 1, 2023
@weihanglo
Copy link
Member Author

cc @Urgau if you're interested.

@weihanglo weihanglo added O-macos Operating system: macOS A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) F-trim-paths Feature: trim-paths labels Dec 1, 2023
@weihanglo weihanglo marked this pull request as draft December 4, 2023 19:51
@weihanglo
Copy link
Member Author

Not the right fix…

@weihanglo
Copy link
Member Author

r? @pnkfelix

as you're one member of debugging wg
Thanks!

@rustbot rustbot assigned pnkfelix and unassigned b-naber Dec 5, 2023
@weihanglo weihanglo marked this pull request as ready for review December 5, 2023 14:48
@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols when unsplit-debuginfo specified fix(trim-paths): remove SO symbols when object scope specified Dec 5, 2023
@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols when object scope specified fix(trim-paths): remove SO symbols for object scope Dec 5, 2023
@weihanglo weihanglo marked this pull request as draft December 7, 2023 23:41
@weihanglo
Copy link
Member Author

Find a way that can fix issue on both Linux and Darwin.

@weihanglo weihanglo changed the title fix(trim-paths): remove SO symbols for object scope fix(trim-paths): trim SO and DW_AT_comp_dir symbols for root DI node Dec 8, 2023
@weihanglo
Copy link
Member Author

@bors try

@bors
Copy link
Contributor

bors commented Dec 8, 2023

⌛ Trying commit 4fa22f1 with merge 4311f2e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 8, 2023
fix(trim-paths): trim `SO` and `DW_AT_comp_dir` symbols for root DI node

This is one way to fix <rust-lang#117652>.

## The issue

When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work.

* On Linux, the root `DW_AT_comp_dir` of a compile unit seems to go into the binary executables.
* On macOS, `SO` symbols are embedded in binary executables and libraries regardless a split-debuginfo file is built. Each `SO` symbol contains a path to the root source file of a debug info compile unit.

## The approach

Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of `unsplit-debuginfo`
is present, as if it is kinda a "unsplit" debuginfo.

## Unresolved issue

* Not sure where to add more test to consolidate it.
* Haven't investigate if we should apply the same logic to cranelift [here](https://github.com/rust-lang/rust/blob/64d7e0d0b61c460fbc882ae37c0f236756dd9c39/compiler/rustc_codegen_cranelift/src/debuginfo/mod.rs#L68-L80).
* Not sure if there is any other consequence doing this, but AFAIK debugging still works on macOS and Linux  with `profile.dev.trim-paths="object"` fairly (with some extra tweaks in cargo).
@bors
Copy link
Contributor

bors commented Dec 8, 2023

☀️ Try build successful - checks-actions
Build commit: 4311f2e (4311f2ea0060c6b2a4a6abaaf3199c9e4eed9f7a)

@weihanglo weihanglo marked this pull request as ready for review December 8, 2023 21:02
@weihanglo
Copy link
Member Author

This is ready for review again :)

@michaelwoerister
Copy link
Member

don't think so, the goal the for_scope function is to return a local or remapped filename if the scope passed was enabled not if some part of the scope was enabled.

Here is my reasoning for why I think this should be .intersect():

  1. We have the following method:

    fn for_scope(&self, sess: &Session, scopes: RemapPathScopeComponents) -> Self::Output<'_> {
          if sess.opts.unstable_opts.remap_path_scope.contains(scopes) {
              self.prefer_remapped_unconditionaly()
          } else {
              self.prefer_local()
          }
    }

    where sess.opts.unstable_opts.remap_path_scope denotes which scopes should be remapped and scopes denotes all the scopes a given filepath belongs too. For example, if the path will be emitted to split-debuginfo, then scopes will be SPLIT_DEBUGINFO, if the path will end up in both in split-debuginfo and unsplit-debuginfo (for example because LLVM will just emit that single string into both artifacts), then scopes will be SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO.

  2. If the given path will only end up in a single scope (which is usually the case), then .contains() and .intersects() will yield the same result, so it does not matter.

  3. If the path ends up in multiple scopes, some of which are enabled and some of which aren't, then we have to decide how to deal with that. We can either only apply the remapping if all scopes are enabled (= .contains()) or apply the remapping if any of the scopes are enabled (= .intersects()). In both cases we don't have a perfect match. However, I would argue that we should definitely err on the side of applying the remapping too much rather than too little, because remapping is used for privacy reasons. Otherwise we could end up with private information being embedded even though a seemingly correct remapping is specified.

Alternatively, we could assert that scopes only has a single bit set. I'm not sure if there are cases where a given path ends up in two different places (I thought that would be the case for the DW_AT_comp_dir but that turned out to be wrong).

@Urgau
Copy link
Member

Urgau commented Jan 16, 2024

I'm not sure I follow your logic for the for_scope function. In particular this part:

scopes denotes all the scopes a given filepath belongs too.

Does this mean that you associate a filepath with scopes? Because that not how I see things.

When I implemented the RFC in #115214 I just made all the places that remapped paths to be conditional on the scope(s) they applied to (when emitting a diagnostic it's DIAGNOSTICS, for macro it's MACROS, ...). A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

For example, if the path will be emitted to split-debuginfo, then scopes will be SPLIT_DEBUGINFO

The variable scopes is not user-defined, it's defined in the code for each specific location. Only sess.opts.unstable_opts.remap_path_scope is user-defined.

if the path will end up in both in split-debuginfo and unsplit-debuginfo (for example because LLVM will just emit that single string into both artifacts), then scopes will be SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO.

scopes cannot currently ever be that combination. (well technically yes, as nothing prevents another dev to test for it but it's not the case currently)

Alternatively, we could assert that scopes only has a single bit set.

Well that's currently the case, it's only ever called with the MACRO and DIAGNOSTICS scopes. So we could assert this in the function.

@michaelwoerister
Copy link
Member

Does this mean that you associate a filepath with scopes? Because that not how I see things.

When I implemented the RFC in #115214 I just made all the places that remapped paths to be conditional on the scope(s) they applied to (when emitting a diagnostic it's DIAGNOSTICS, for macro it's MACROS, ...). A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

I think I might have formulated things in a confusing way. With "filepath" I just mean the path that is being remapped, that is, the thing that implements RemapFileNameExt.

A filepath is not associated with any scopes, it's remapped based on the scope of the area that wants to "emit it".

I'm not sure I agree with that (although the distinction is subtle). In my mind, we use the concept of "scopes" to model where a given path will be emitted to. For example, if we are about to emit some path as part of a diagnostic message, the we use .for_scope(DIAGNOSTICS). It's not DIAGNOSTICS because it is being emitted as part of some diagnostics machinery -- it is DIAGNOSTICS because it will show up in diagnostic output.

By extension, if we are storing a path to some place (like debuginfo in LLVM IR) that might then end up in multiple output artifacts and these output artifacts are in different scopes (e.g. LLVM might write it both to the object file and to the .dwo), then the corresponding .for_scope() needs to take the union of all scopes that the path will end up in. This is the reason why something like SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO could validly occur.

@Urgau
Copy link
Member

Urgau commented Jan 16, 2024

I think I might have formulated things in a confusing way. With "filepath" I just mean the path that is being remapped, that is, the thing that implements RemapFileNameExt.

I see, using the same word for a (subtle) different meaning. Thanks for clearing that up.

It's not DIAGNOSTICS because it is being emitted as part of some diagnostics machinery -- it is DIAGNOSTICS because it will show up in diagnostic output.

👍 Agree.

By extension, if we are storing a path to some place (like debuginfo in LLVM IR) that might then end up in multiple output artifacts and these output artifacts are in different scopes (e.g. LLVM might write it both to the object file and to the .dwo), then the corresponding .for_scope() needs to take the union of all scopes that the path will end up in. This is the reason why something like SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO could validly occur.

I think we already have the logic for handling this in should_prefer_remapped_for_codegen() called by RemapFileNameExt::for_codegen.

I also wouldn't expect the need to call for_scope with the SPLIT_DEBUGINFO | UNSPLIT_DEBUGINFO scopes, but I see the principle. I'm not sure we'll ever run into it, but if we do, we can change the contains to intersect.

# - Debuginfo in binary file
# - `.o` deleted
# - `.dSYM` present
# - in binary, paths from `N_SO` (source files) and `N_OSO` (object files) shouldn be remapped
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @michaelwoerister. I've move #118518 (review) to this thread so we can follow the discussion better.

I'm not quite clear yet on the situation on macOS. Do you have a link to some documentation on the various SO symbols?

It's also unclear to me. I have no idea if there is an official doc for Mach-O debug symbols. Here is what I've found:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to read up on those links by the end of the week. If we can't get to a clear conclusion, I think we should merge the PR nonetheless, for the improved behavior it provides on Linux.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, from reading through those links it sounds to me like:

  • SO symbols we can handle fine by remapping the paths generated by debuginfo.
  • OSO symbols are added by the linker. They are not needed after dsymutil has built the dSYM bundle. So it would be fine to remove them after building the dSYM bundle.

I think the best option would be to use the -oso_prefix linker option. My guess is that will not be a problem even on older macOS versions. But I don't know if we have a minimum supported XCode version (in addition to a minimum support macOS version).

Other than that, rewriting/removing the OSO symbols in a postprocessing step seems to be our only option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weihanglo, do you know how -Cstrip=debuginfo behaves on macOS? Does that remove the OSO symbols?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It runs strip -S on macOS and removes OSO symbols successfully, and the .dSYM bundle can successfully find sources for a simple hello-world.rs example.

Should we always pass -Cstrip=debuginfo to deal with OSO symbols? I am not sure…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like -Cstrip=debuginfo is now the default unless debuginfo is requested. We could also do a similar thing for when trim-paths is enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good, but only when Cargo is involved. OSO is still there if using rustc directly.

(granted, people invoking rustc directly should be able to set -Cstript=debuginfo)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or do you mean that rustc should implicitly strip debuginfo under some trim-paths circumstances?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking about the Cargo case. For direct rustc invocations, the defaults are less important, I think, because that's more of an advanced use case. The test cases here could simulate what we expect Cargo to do.

@michaelwoerister
Copy link
Member

Well that's currently the case, it's only ever called with the MACRO and DIAGNOSTICS scopes. So we could assert this in the function.

@Urgau, if you are interested in opening a PR that asserts that only a single scope is passed to for_scope, and that also adds a docstring saying that only a single scope is supported, you can r? me. Since for_scope currently accepts a set of scopes, I think it is a bit confusing at the moment. But a bit of documentation and an assertion, would be an easy fix for that.

@rustbot
Copy link
Collaborator

rustbot commented Jan 22, 2024

Failed to set assignee to me: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work on macOS. On macOS, `SO` symbols are embedded
in binary executables and libraries regardless a split-debuginfo file is
built. Each `SO` symbol contains a path to the root source file of a
debug info compile unit.

This commit demonstrates the case, and hope there is a fix soon.
When `--remap-path-scope=object` is specified, user expect that there is
no local path embedded in final executables. Under `object` scope, the
current implementation only remap debug symbols if debug info is
splitted into its own file. In other words, when
`split-debuginfo=packed|unpacked` is set, rustc assumes there is no
embedded path in the final executable needing to be remapped.

However, this doesn't work on Linux. On Linux, the root `DW_AT_comp_dir`
of a compile unit seems to go into the binary binary executables.

This commit demonstrates the case, and hope there is a fix soon.
Path of working directory in the root DI node seems to be embedded in
executables. Hence, we trim them when the scope of `unsplit-debuginfo`
is present, as if it is kinda a "unsplit" debuginfo.
@michaelwoerister michaelwoerister added S-blocked Status: Blocked on something else such as an RFC or other implementation work. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 24, 2024
@michaelwoerister
Copy link
Member

Marking the PR as blocked until we've discussed whether the RFC can be implemented at all in its current form. See #118518 (comment)

fmease added a commit to fmease/rust that referenced this pull request Jan 24, 2024
…chaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? `@michaelwoerister`
@weihanglo weihanglo marked this pull request as draft January 24, 2024 15:36
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jan 25, 2024
…chaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? ``@michaelwoerister``
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 25, 2024
Rollup merge of rust-lang#120230 - Urgau:for_scope-single-scope, r=michaelwoerister

Assert that a single scope is passed to `for_scope`

Addresses rust-lang#118518 (comment)

r? ``@michaelwoerister``
@bors
Copy link
Contributor

bors commented Mar 29, 2024

☔ The latest upstream changes (presumably #122450) made this pull request unmergeable. Please resolve the merge conflicts.

@weihanglo
Copy link
Member Author

I am a bit lost. @michaelwoerister, this is still worth pursuing, or at least I should verify the current situation again, right?

@Urgau
Copy link
Member

Urgau commented Mar 29, 2024

Some of the test changes look worth landing, even if the logic change is no longer necessary.

Better have more tests than not enough.

@michaelwoerister
Copy link
Member

It's definitely worth having tests for the trim-paths related functionality. Maybe its better to create a number of smaller test cases. It might also be possible to use regular UI tests for this. I recently saw something pretty clever in another PR where a UI test looks for its own PDB file:
https://github.com/rust-lang/rust/blob/95adb6d7b520230971522c8806cc6ab518e419c5/tests/ui/debuginfo/msvc-strip-debuginfo.rs

We could do the same, i.e. just load binaries and separate debuginfo files into memory and then treat them as byte slices where we look for the text we want to find (or don't want to find). That should be a lot more readable than run-make tests. If compiletest allows it, we can also put some shared code into a common aux-crate.

Regarding what to test: I suggest picking the most relevant configurations from the table in the tracking issue and then have a test for linux, macOS, and MSVC. For each useful (platform, trim-paths, split-debuginfo) combination, we compile with the commandline options that we expect Cargo to pass for that combination and then make sure the binaries and separate debuginfo files are either sanitized or not sanitized. I don't think we need to test the less useful combinations.

It might be good to split this into a number of smaller PRs, starting with a simple case just for Linux or macOS. Feel free to assign to me for review.

@weihanglo
Copy link
Member Author

Currently don't have time working on this. It seems that we need a complete rewrite because runmake is under the gigantic porting. So close.

@weihanglo weihanglo closed this Jun 19, 2024
@weihanglo weihanglo deleted the trim-paths-macos branch June 19, 2024 20:16
@michaelwoerister
Copy link
Member

Thanks, @weihanglo! This PR is triggered a lot of valuable work around clarifying the RFC. We'll get there 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) F-trim-paths Feature: trim-paths O-macos Operating system: macOS S-blocked Status: Blocked on something else such as an RFC or other implementation work. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only with all or split-debuginfo can -Zremap-path-scope remap SO symbols on macOS
7 participants