Determine how different sorts of file names should be normalized #28

Open
nightlark opened this issue Dec 22, 2024 · 0 comments
File names need to be normalized to have good odds of finding a match in our datasets. Generally, the normalization centers on two things:

  • Removing identifiers for a specific version from file names
  • Removing platform/architecture specific information from file/folder names
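
As a rough illustration of the two points above, here is a minimal Python sketch. The function name `normalize_file_name` and both regular expressions are hypothetical and deliberately narrow: they only handle a trailing numeric version after `.so` and a handful of architecture tokens, so treat this as a starting point rather than a proposed implementation.

```python
import re

# Hypothetical patterns for illustration only; real data will need a broader set.
# Matches a trailing numeric version after ".so", e.g. ".so.1.1" or ".so.6".
_SO_VERSION = re.compile(r"(\.so)(\.\d+)+$")
# Matches a few architecture tokens when delimited by "-", "_" or ".".
_ARCH_TOKEN = re.compile(r"[-_.](x86_64|amd64|aarch64|arm64|i[3-6]86)(?=[-_.]|$)")


def normalize_file_name(name: str) -> str:
    """Drop version and architecture identifiers from a single file name."""
    name = _SO_VERSION.sub(r"\1", name)  # keep ".so", drop the numeric suffix
    name = _ARCH_TOKEN.sub("", name)     # drop the delimiter together with the token
    return name


print(normalize_file_name("libssl.so.1.1"))       # libssl.so
print(normalize_file_name("libfoo-x86_64.so.2"))  # libfoo.so
```

Note that a version embedded in the base name (for example `libpython3.11.so.1.0`) is left in place by this sketch; whether that part should also be stripped is one of the open questions here.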

What needs to get done may vary based on the type of file:

  • Linux shared libraries (file name needs normalizing to remove architecture and version identifiers)
  • C/C++ headers, pkgconfig/CMake Config files (need to recognize common include file paths, including ones that specify a multiarch triplet or GNU triplet)
  • Linux binaries (need to recognize common bin folder paths, including ones that specify a multiarch triplet and maybe a GNU triplet)
  • Language/ecosystem specific - languages or ecosystems such as Python, Windows, NuGet, and macOS could require handling the above differently than on Linux
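
For the path-based cases in the list above (headers, pkgconfig/CMake files, binaries), one possible approach is to drop any directory component that looks like a multiarch or GNU triplet before matching. The sketch below assumes a loose, hypothetical pattern (`_TRIPLET`) that only covers Linux-style triplets such as `x86_64-linux-gnu` or `x86_64-pc-linux-gnu`; it can produce false positives on unrelated directory names that happen to end in `-linux`.

```python
import re
from pathlib import PurePosixPath

# Hypothetical, loose pattern for a triplet-like directory name, e.g.
# "x86_64-linux-gnu", "arm-linux-gnueabihf", or "x86_64-pc-linux-gnu".
_TRIPLET = re.compile(r"^[a-z0-9_]+(-[a-z0-9_]+)?-linux(-[a-z0-9_]+)?$")


def strip_triplet_dirs(path: str) -> str:
    """Remove directory components that look like a multiarch/GNU triplet."""
    parts = PurePosixPath(path).parts
    kept = [part for part in parts if not _TRIPLET.match(part)]
    return str(PurePosixPath(*kept))


print(strip_triplet_dirs("/usr/include/x86_64-linux-gnu/openssl/ssl.h"))
# /usr/include/openssl/ssl.h
print(strip_triplet_dirs("/usr/lib/aarch64-linux-gnu/pkgconfig/zlib.pc"))
# /usr/lib/pkgconfig/zlib.pc
```

Paths without a triplet component, such as `/usr/bin/ls`, pass through unchanged.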
nightlark self-assigned this Dec 22, 2024