Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg_tar: Need archive name and package_dir to be configurable #193

Closed
konste opened this issue Jun 24, 2020 · 27 comments
Closed

pkg_tar: Need archive name and package_dir to be configurable #193

konste opened this issue Jun 24, 2020 · 27 comments
Assignees
Labels
feature-request P1 An issue that must be resolved. Must have an assignee

Comments

@konste
Copy link
Contributor

konste commented Jun 24, 2020

We use pkg_tar rule to package build artifacts in a specific way requested by consumers. That requires the archive name and package_dir attribute to be calculated by a custom Bazel rule, in other words we need those parameters configurable. We would gladly contribute necessary changes to pkg_tar, but before jumping on the coding we would like to make sure the approach we take is acceptable.
So far I only found two ways to make parameters configurable. Please let me know if I am missing something.
One way is to supplement the attribute of type "string" with another attribute of type "label" which would point to the file containing the value to use. It is exactly how package_dir_file attribute is implemented, but we would need to add another one for the archive name.
Another possible way is to introduce a new attribute pkg_tar_config of type "label" which would expect new PkgTarInfo provider. Our custom rule would calculate required strings, stick them into PkgTarInfo and return it. pkg_tar_config attribute then points to that custom rule and pkg_tar extracts the strings from the provider.

Please let us know if either of the above two ways sounds like an acceptable solution and we can work on PR. If neither is acceptable then please explain how else parameters can be made configurable.

Workaround with renaming the archive after it is created does not work because .tgz contains zipped TAR and while we can rename .tgz we cannot rename inner .TAR archive and it is a blocker for us.

@aiuto
Copy link
Collaborator

aiuto commented Jun 24, 2020

What is wrong with making a macro that wraps the pkg_tar call so that it can calculate the name and package_dir attributes?
Also for renaming the extension of the dependency, why can't you set the extension file on that target which is the dependency?

On a different tack, I don't want to add a Provider specific to tar. nacl@ is working on an enhanced pkg_filegroup provider that will handle things like renaming files, re-treeing other packages, and assigning the attributes desired in the package. This is essentially all the things needed to define the layout of the package, without being specific to any package format. Then rpm, tar, deb, zip, and whatever will reuse that shared provider.

@konste
Copy link
Contributor Author

konste commented Jun 24, 2020

-- What is wrong with making a macro that wraps the pkg_tar call so that it can calculate the name and package_dir attributes?

We need name and package_dir to depend on build configuration, for example, target platform, bitness, debug/release, workspace status, custom command line parameters. All that data are not available at the Loading phase when macro is expanded. We can only access it in a custom rule at the Analysis phase and custom rule calculates the final name and package_dir based on many parameters.

-- Also for renaming the extension of the dependency, why can't you set the extension file on that target which is the dependency?

I don't understand, sorry. Just for clarity - those archives are consumed outside of the Bazel build and we don't have much control over it.

-- I don't want to add a Provider specific to tar.

I absolutely agree that such a provider would be mostly applicable to all packaging formats and therefore only one should be implemented for all packaging rules. I used pkg_tar as an example and because it is the one we are immediately interested in.

-- @nacl is working on an enhanced pkg_filegroup provider...

This is very interesting! May be it is what we are looking for. Can we take a look at it somewhere? I wonder if it supports the two attributes we need configurable. We also would be glad to contribute if needed.

@aiuto
Copy link
Collaborator

aiuto commented Jun 25, 2020

-- What is wrong with making a macro that wraps the pkg_tar call so that it can calculate the name and package_dir attributes?

We need name and package_dir to depend on build configuration, for example, target platform, bitness, debug/release, workspace status, custom command line parameters. All that data are not available at the Loading phase when macro is expanded. We can only access it in a custom rule at the Analysis phase and custom rule calculates the final name and package_dir based on many parameters.

Now I understand. Yes. This does require effort at analysis time.

-- Also for renaming the extension of the dependency, why can't you set the extension file on that target which is the dependency?

I don't understand, sorry. Just for clarity - those archives are consumed outside of the Bazel build and we don't have much control over it.

I was trying to understand what you meant about

Workaround with renaming the archive after it is created does not work because .tgz contains zipped TAR and
while we can rename .tgz we cannot rename inner .TAR archive and it is a blocker for us.

I though you wanted to rename the extensions .tgz to .tar. But now I realize you want to rename the full file name.

-- I don't want to add a Provider specific to tar.

I absolutely agree that such a provider would be mostly applicable to all packaging formats and therefore only one should be implemented for all packaging rules. I used pkg_tar as an example and because it is the one we are immediately interested in.

Two approaches come to mind. Both share the drawback that the output file name can not be known until after analysis. They will have to declare the output file dynamically. You could mitigate that by having pkg_tar write a manifest of where the real output is in the default output. But this won't work when the tar file is a dependency of another rule. How will you point to it. Perhaps the output could be a symlink to the real one. Whatever... this needs design work.

  1. Create a rule which just writes the custom filename to a file
    Add output_filename_file attribute to pkg_tar. The type is label list and you point to a target which generates the name.
  • Sort of ugly
  • Not flexible
  1. A modified version of your pkgtarInfo:
  • Create a generic "ConfigurationInfo" provider that contains a map of names to values.
  • Add a rule of your own which returns one of those. It can contain whatever you want
  • Modify pkg_tar:
    • add naming_info as an optional attribute which must be of type ConfigurationInfo
    • add output_file_name as an attribute which takes a string with ${var} replacements in it. The
      replacements have to come from naming_info provider value.

Now we have something which could be reused in other places. I could see someone wanting
different output names for a .so file depending on the cpu.

-- @nacl is working on an enhanced pkg_filegroup provider...

This is very interesting! May be it is what we are looking for. Can we take a look at it somewhere? I wonder if it supports the two attributes we need configurable. We also would be glad to contribute if needed.

It is currently in the experimental folder, and only being used for rpm. I would like to move the pkg_filegroup up a level soon and then iterate on the other platforms.

@konste
Copy link
Contributor Author

konste commented Jun 26, 2020

Great! We are on the same page now. Your approach #2 is very close to what we have in mind.

-- generic "ConfigurationInfo" provider that contains a map of names to values

That map of names to values I see as a map from parameter name to parameter value, where parameters are whatever attributes specific rules may need, such as archive name, package_dir, etc.

-- add output_file_name as an attribute which takes a string with ${var} replacements in it.

I think it can be simplified. As soon as somebody provides naming_info (in your naming scheme) it implies the intent to configure at least some parameters through ConfigurationInfo provider. In this case we can simply request that ALL parameters should be supplied this way. I don't insist on that, it just seems easier to implement and understand.
We don't really need ${var} template expansion - as soon as there is the way to supply parameters programmatically (from the rule) it is enough flexibility for people to implement it any way they deem reasonable.
Would it help if we find time to work on it?

@aiuto
Copy link
Collaborator

aiuto commented Jun 26, 2020

With the simplified way for the naming_info rule would have to be written for each target, because you include the archive name.

The dict approach is more like

cpu: x86_64
target_os: qnx
copt: debug
...  anything else

Then you have

pkg_tar(
  name = "tarball",
  package_name = "my_package-${target_os}-${cpu}-${opt}",
  naming_info = "//shared:naming_info"
  ...
)

You could write one instance of the rule to provide the naming, and point every pkg_tar in your build at a single shared rule.

@konste
Copy link
Contributor Author

konste commented Jun 26, 2020

I see your point. Let me discuss with colleagues.

@aiuto
Copy link
Collaborator

aiuto commented Jun 26, 2020

Both our solutions still have a big problem that that the output file name is not known in advance. That breaks so much.
I think I have an idea. I have to prototype it first. That might not happen for a week or two.

@konste
Copy link
Contributor Author

konste commented Jun 26, 2020

After thinking about your design with the parameters template expansion overnight it kind of settles in my head and now I agree with it and consider it a way to go.

So what exactly is a problem with the output file name not known in advance? ctx.actions.declare_file always saved our bacon before.

@aiuto
Copy link
Collaborator

aiuto commented Jun 27, 2020

Proof of concept here: https://github.com/aiuto/rules_pkg/tree/naming

The problem shows up when you try to use the output file in any situation where you need the name.
There is a block in tests/build_test_tar.sh marked with !!!! that highlights that.

@konste
Copy link
Contributor Author

konste commented Jun 27, 2020

I cloned your PoC and look at block in tests/build_test_tar.sh marked with !!!!. Let me understand - the problem is specifically for the test code or the test is just a demo for the bigger issue which may affect actual users of the packaging rules? Test can use predefined hardcoded names. If the issue is with the rule usability, then I still don't see it, sorry. This maybe because I am not very familiar with the shell scripts. Can you describe the problematic scenario please?

@aiuto
Copy link
Collaborator

aiuto commented Jun 28, 2020

It would happen whenever you need another rule to use that .tar file. Right now, we only see that in the tar test.
If I changed the test cases to have them include a tar file that had a custom name, there is no way I could reasonably write that test. If I wanted a rule that took a tar file and untarred it to do something with the content I wild run into the same problem. That is why I started on the PackageArtifactInfo. A consuming rule could use this to find the actual filename produced. For shell scripts we would also have to write the value of the provider to a file, so the script could read that to find the actual file name.

It is sort of obvious when you think about it.

  • you want to be able to dynamically name the file at build time based on some flag
  • how can you write the filename in a script that needs to find it if you don't know the name

We could almost solve the problem by also writing the content to a file named by name. But that is ugly becuase you have two copies of the output when you could have one. This is wasteful.

@konste
Copy link
Contributor Author

konste commented Jun 28, 2020

So it seems to me that the problem is specific to shell script consumers rather than consuming rules. For rules we have at least three options in place: 1. Add produced file to DefaultInfo and let consuming rule discover the file properties; 2. Return a provider (same as the one we use for input or another one) from the packaging rule, letting consuming rule discover all the details it may need. 3. Same provider consumed by the packaging rule can be consumed by the other rules too, letting them discover the same input.
For scripts... I don't know scripting well enough. Where can scripts read the data from? Environment does not propagate between targets. It seems the data file is the only option. But packaging rules should not write it out always, only when requested in the input configuration provider. BTW naming of that output configuration file after the name of the packaging rule seems logical. Oh, wait a second! This file is already written - it is .args file created for the packaging rule internal needs. The only problem it is named the same as the archive and we don't know it upfront. But if we name it after the rule, then it will be known upfront. Do you see any problem with that approach?

@konste
Copy link
Contributor Author

konste commented Jun 28, 2020

If the script can be executed by ctx.actions.run_shell, then we can pass the file name as a parameter to the script if the script executing rule depends on the packaging rule and receives a custom or default provider from it.

@aiuto
Copy link
Collaborator

aiuto commented Jun 29, 2020 via email

@konste
Copy link
Contributor Author

konste commented Jun 29, 2020

In an attempt to grasp the problem I have created a tiny "consuming" repo here.

To get it out of the way you have a typo here. Instead of Label("//:build_tar") it should be Label("//pkg:build_tar").

Now I tried to visualize the problem when the rules down the stream from packaging may need to know the file name.
When we have

pkg_tar(name = "A", ...)
pkg_tar(name = "B", 
    deps = [":A"],
    ...
)

The name of the file for the rule A is conducted to rule B through the default provider. No need to fiddle with the custom providers.

Now when the next rule needs to know the name of the archive file it can do this. Again not a problem.

Now deliberating between the option with PackageNamingInfo carrying the dictionary of substitutions and the option when it carries plain parameter name to parameter value table. Your argument is With the simplified way for the naming_info rule would have to be written for each target, because you include the archive name. Yes in order to use "dynamic naming" one would need to write one naming rule for one packaging rule, but from my practical experience it is Ok! Typically those rules are the result of macro expansion, like that:

def package_cc_target(rule_name_prefix , srcs, **kwargs):
    # Introspect the build graph to find direct and indirect dependencies of the given targets.
    collect_deps(
        name = rule_name_prefix + "_deps",
        srcs = srcs,
    )
    prepare_binaries(
        name = rule_name_prefix + "_binaries",
        srcs = [rule_name_prefix + "_deps"],
    )
    pkg_tar(
        name = rule_name_prefix + "_tar",
        srcs = [
            rule_name_prefix + "_binaries", 
        ],
    ...
    )

And then this macro is invoked for every package to be created. So it is perfectly normal to add a naming rule here before pkg_tar. Same rule implementation will be reused, but the name of the archive it produces can be different, based on the rule name, configuration or other attributes.

@aiuto
Copy link
Collaborator

aiuto commented Jul 6, 2020

How are you going to compute rule_name_prefix? Your example with the macro would require knowing the value at load time.

@konste
Copy link
Contributor Author

konste commented Jul 6, 2020

Yes, in my example rule_name_prefix corresponds to undecorated component name "Foo" which is known at the place where the packaging macro is invoked. Each component uses its own BUILD.bazel authoring and as part of it invoked packaging macro. So the "base" name is known at loading just fine. But then packaging macro supposed to take that base name and decorate it with a bunch of configuration dependent prefixes and suffixes (CPU in your example) and use the resulted decorated name as the name for produced archive file for the given component Foo. I see this pattern often and consider it well established best practice.

@aiuto
Copy link
Collaborator

aiuto commented Jul 7, 2020 via email

@konste
Copy link
Contributor Author

konste commented Jul 7, 2020

Yes, with the current implementation of pkg_tar that is what we are getting. My goal is to extend the implementation so that the described scenario would work.

@aiuto
Copy link
Collaborator

aiuto commented Jul 9, 2020

I just made pr #198 for a design discussion. I am strongly in favor of PackageNamingInfo returning a map of values rather than a single string. That will allow a single rule to provide the naming information for all pkg_ types.

@aiuto aiuto added feature-request P1 An issue that must be resolved. Must have an assignee labels Aug 7, 2020
@aiuto
Copy link
Collaborator

aiuto commented Sep 25, 2020

#198 is now ready for review. I think it will allow you to rename archives any way you want.
In the simple form, you just add another attribute with the name you want.
In the more complex usage, you also point to a rule providing a map of name/value pairs you can substitute into the filename. (e.g. {cpu}). cpu is implemented. Other variables can follow.
Once that gets submitted, I can do the same thing for deb, rpm and zip.

@aiuto
Copy link
Collaborator

aiuto commented Feb 24, 2021

Status update: We can rename the package file in many ways now, but not package_dir.
It looks like that would be easy

  • at pkg.bzl:line61 we would do variable substitutions on package_dir in the same way we do to package_file_name
  • there is no need to handle substitutions in package_dir_file. AFAICT, package_dir_file is a hack to work around the inability to paramaterize package_dir.

The bigger change is

  • adding a test for it
  • adding another example to examples/naming_package_files.

@konste
You offered to help when you created the issue. Is that offer still on the table?

@konste
Copy link
Contributor Author

konste commented Feb 24, 2021

Yes, the offer still stands and I will be looking into it later today.

@aiuto
Copy link
Collaborator

aiuto commented Feb 24, 2021 via email

@konste
Copy link
Contributor Author

konste commented Feb 26, 2021

Of course, work emergencies prevented me from working on it right away, but I am still on it and determined to finish it soon.

@konste
Copy link
Contributor Author

konste commented Feb 27, 2021

Here you go: #305

@aiuto
Copy link
Collaborator

aiuto commented May 3, 2021

Closed by #305

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request P1 An issue that must be resolved. Must have an assignee
Projects
None yet
Development

No branches or pull requests

2 participants