Redesign of DataSet API #45
Thanks @hughleat, I think that a python-side dataset API makes a lot of sense, and I like the class hierarchy you're proposing. I think a prerequisite for this is to figure out the role of the backend service in managing datasets, as at the moment the frontend python code and backend service have slightly jumbled and overlapping roles. We could shift the responsibility of managing benchmarks from the service to the frontend, for example by having the frontend own the benchmark definitions and send `Benchmark` protos to the service on demand.

Cheers,
Add semantics validation for cBench benchmarks. This is achieved by adding a new validation callback mechanism that, when invoked, compiles the given cBench benchmark to a binary and executes it using prepared datasets. The output of the program, along with any generated output files, is differentially tested against a copy of the program compiled without optimizations. Any change in program behavior detected by this mechanism is reported. Calling `compiler_gym.validate_state()` on a benchmark that supports semantics validation will automatically run it.

The core of the implementation is in compiler_gym/envs/llvm/dataset.py. It defines a set of library functions so that these validation callbacks can be defined ad-hoc for cBench in quite a succinct form, e.g.:

    validator(
        benchmark="benchmark://cBench-v0/ghostscript",
        cmd="$BIN -sDEVICE=ppm -dNOPAUSE -dQUIET -sOutputFile=output.ppm -- 1.ps",
        data=["office_data/1.ps"],
        outs=["output.ppm"],
        linkopts=["-lm", "-lz"],
        pre_execution_callback=setup_ghostscript_library_files,
    )

As part of #45 we may want to make a public API similar to this and move it into the dataset definitions. Multiple validation callbacks can be defined for a single benchmark. Where a benchmark matches multiple validators, they are executed in parallel. Compiling binaries from cBench benchmarks requires that the bitcodes be compiled against the system-specific standard library, so this patch also splits the cBench dataset into macOS and Linux versions.
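If this were made a public API, one way the registry could be structured is sketched below. This is a minimal sketch only: the module-level `_VALIDATORS` table and the `run_validators` helper are illustrative names, not part of the actual patch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Registry mapping benchmark URI -> validation callbacks registered for it.
ValidationCallback = Callable[[object], Optional[str]]  # env -> error message, or None on success
_VALIDATORS: Dict[str, List[ValidationCallback]] = defaultdict(list)


def validator(benchmark: str, cmd: str, data=None, outs=None, linkopts=None,
              pre_execution_callback=None) -> None:
    """Register a semantics-validation callback for `benchmark`."""

    def run(env) -> Optional[str]:
        # The real callback would compile the benchmark with `linkopts`, stage
        # `data`, run `cmd`, and difftest `outs` against an unoptimized build,
        # returning an error string on any behavioral difference.
        if pre_execution_callback:
            pre_execution_callback()
        return None  # None means validation passed.

    _VALIDATORS[benchmark].append(run)


def run_validators(env, benchmark: str) -> List[Optional[str]]:
    """Run all callbacks registered for `benchmark` in parallel."""
    callbacks = _VALIDATORS[benchmark]
    with ThreadPoolExecutor() as executor:
        return list(executor.map(lambda cb: cb(env), callbacks))
```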
I think it would be a good idea also to wrap the Benchmark proto with a python class that can add extra functionality like the ability to validate benchmark behavior etc. To begin with, something simple like:

    class Benchmark(object):
        def __init__(self, proto: BenchmarkProto)
        def sha1(self) -> bytes  # Name for caching benchmark + any other attributes
        def program_data(self) -> BenchmarkProto  # The data that the service needs
        def program_data_sha1(self) -> bytes  # Used for caching benchmarks on the service side
        def is_validatable(self) -> bool
        def validation_callbacks(self) -> List[Callable[[CompilerEnv], Optional[str]]]  # Run any ad-hoc validation, e.g. difftest, valgrind, etc.
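As a rough, runnable sketch of how such a wrapper might be fleshed out. The stand-in types and the `add_validation_callback` helper are illustrative, not part of the proposal:

```python
import hashlib
from typing import Any, Callable, List, Optional

BenchmarkProto = Any  # stand-in for the real Benchmark protocol buffer
CompilerEnv = Any     # stand-in for compiler_gym.envs.CompilerEnv

ValidationCallback = Callable[[CompilerEnv], Optional[str]]


class Benchmark:
    """Python wrapper around a Benchmark proto."""

    def __init__(self, proto: BenchmarkProto):
        self._proto = proto
        self._validators: List[ValidationCallback] = []

    def program_data(self) -> BenchmarkProto:
        # The data that the service needs.
        return self._proto

    def program_data_sha1(self) -> bytes:
        # Cache key for the service side; assumes a real proto that provides
        # SerializeToString().
        return hashlib.sha1(self._proto.SerializeToString()).digest()

    def is_validatable(self) -> bool:
        return bool(self._validators)

    def validation_callbacks(self) -> List[ValidationCallback]:
        # Ad-hoc validation, e.g. difftest, valgrind, etc.
        return list(self._validators)

    def add_validation_callback(self, cb: ValidationCallback) -> None:
        self._validators.append(cb)
```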
Add a new `CompilerEnv.validate()` method that replaces the previous `validate_state(env, state)` call. This is a stepping stone to enabling a more flexible API for custom benchmark validation routines. Issue facebookresearch#45.
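A hedged usage sketch of the renamed entry point; the benchmark name is taken from elsewhere in this thread, and the printed result format is an assumption rather than documented behavior:

```python
import compiler_gym

# Create an LLVM environment on a validatable benchmark.
env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/crc32")
env.reset()
env.step(env.action_space.sample())

# Previously: compiler_gym.validate_state(env, state)
result = env.validate()
print(result)
env.close()
```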
In preparation for introducing a new Dataset class. Issue #45.
With the new dataset API, enumerating the benchmarks is not advised (the list may be infinite), and there is now no need to install datasets ahead of time. Issue facebookresearch#45.
This test is flaky, and the functionality tested here will be removed in facebookresearch#45.
A benchmark represents the particular program that is being compiled. Issue facebookresearch#45.
This extends the LLVM data archive to include the following additional binaries: bin/llc, bin/llvm-as, bin/llvm-bcanalyzer, bin/llvm-config, bin/llvm-dis, bin/llvm-mca. This also moves the location of the unpacked archive to llvm-v0 (with a version suffix) and fixes a race condition in the download logic. Issue facebookresearch#45.
Decode the binary data from the manifest. Issue facebookresearch#45.
This adds python operator overloads that alias to existing methods to make the Dataset class "feel" more like a regular python dictionary:

    >>> len(dataset)  # equivalent to dataset.n
    23
    >>> for benchmark in dataset:  # iterate over the class directly
    ...     pass
    >>> dataset["cbench-v1/crc32"]  # key a benchmark

This also renames Dataset.n to Dataset.size for consistency with other containers like np.ndarray, and makes it return math.inf if the number of benchmarks is infinite, rather than a negative integer. The advantage of math.inf is that it will poison any integer arithmetic, e.g.

    >>> sum(d.size for d in datasets)
    inf

if any one of the datasets has an infinite size. With a negative number, this would instead compute a regular integer value. Issue facebookresearch#45.
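For illustration, the dictionary-like behavior described above could be implemented with dunder methods along these lines. This is a toy sketch, not the compiler_gym source:

```python
import math
from typing import Any, Dict, Iterator


class Dataset:
    def __init__(self, benchmarks: Dict[str, Any], infinite: bool = False):
        self._benchmarks = benchmarks
        self._infinite = infinite

    @property
    def size(self) -> float:
        # math.inf (rather than a negative sentinel) poisons integer
        # arithmetic such as sum(d.size for d in datasets).
        return math.inf if self._infinite else len(self._benchmarks)

    def benchmark(self, uri: str) -> Any:
        return self._benchmarks[uri]

    def __len__(self) -> int:
        return len(self._benchmarks)            # len(dataset), for finite datasets

    def __iter__(self) -> Iterator[Any]:
        return iter(self._benchmarks.values())  # for benchmark in dataset

    def __getitem__(self, uri: str) -> Any:
        return self.benchmark(uri)              # dataset["cbench-v1/crc32"]
```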
This patch makes two simplifications to the Datasets API:

1) It removes the random-benchmark selection logic from `Dataset.benchmark()`. Now, calling `benchmark()` requires a URI. If you wish to select a benchmark randomly, you can implement this random selection yourself (see the sketch below). The idea is that random benchmark selection is quite a minor use case that introduces quite a bit of complexity into the implementation.

2) It removes the `Union[str, Dataset]` argument types from the `Datasets` methods. Now, only a string is permitted. This is to make it easier to understand the argument types. If the user has a `Dataset` instance that they would like to use, they can explicitly pass in `dataset.name`.

Issue facebookresearch#45.
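A minimal sketch of rolling your own random selection under the new API, assuming the `Dataset.benchmark_uris()` iterator mentioned later in this thread; since the iterator may be very large or infinite, it samples from a bounded prefix:

```python
import random
from itertools import islice


def random_benchmark(dataset, limit: int = 1000):
    """Pick a benchmark at random from the first `limit` URIs of a dataset."""
    uris = list(islice(dataset.benchmark_uris(), limit))
    return dataset.benchmark(random.choice(uris))
```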
This is to start the transition from the LegacyDatasets to the new Datasets API. Issue facebookresearch#45.
This adds new Dataset class implementations of some of the LLVM datasets. The original LegacyDatasets are still used for now; they will be migrated once everything is in place. Issue facebookresearch#45.
This differs from the previous version in that it downloads the original C++ sources and compiles them on-demand, rather than downloading prepared bitcodes. Issue facebookresearch#45.
This adds two new datasets, csmith-v0 and llvm-stress-v0, that are parametrized program generators. csmith-v0 uses Csmith to generate C99 programs that are then lowered to bitcode. llvm-stress-v0 generates random LLVM-IR. Both generators were developed to stress test compilers, so they have an above-average chance that a generated benchmark will cause the compiler to enter an unexpected state. Issue facebookresearch#45.
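For illustration only, a generator-backed dataset might look something like the sketch below; the class interface and the command line are assumptions, not the shipped implementation:

```python
import subprocess
from typing import Callable, List


class GeneratorDataset:
    """A dataset whose benchmarks are produced on demand from a seed."""

    def __init__(self, name: str, make_cmd: Callable[[str], List[str]]):
        self.name = name
        self.size = float("inf")  # one benchmark per seed, not enumerable
        self._make_cmd = make_cmd

    def benchmark(self, uri: str) -> bytes:
        # URIs take the form "<dataset>/<seed>", e.g. "llvm-stress-v0/42".
        seed = uri.rsplit("/", 1)[-1]
        return subprocess.check_output(self._make_cmd(seed))


# Hypothetical usage, assuming llvm-stress is on $PATH:
# llvm_stress = GeneratorDataset(
#     "llvm-stress-v0", lambda seed: ["llvm-stress", "-seed", seed])
```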
This adds a dataset of 1k OpenCL kernels that were used in the paper: Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. "Synthesizing benchmarks for predictive modeling." In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017. The OpenCL kernels are compiled on-demand. Issue facebookresearch#45.
The dataset is from: da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. "ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction." In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021. Issue facebookresearch#45.
This adds the new Dataset implementation of the cBench dataset. The validation logic isn't especially tidy and could be cleaned up a bit; it's just copied over from //compiler_gym/envs/llvm:legacy_datasets. Issue facebookresearch#45.
A benchmark URI is no longer considered well formed if the benchmark name is missing, as we no longer support dataset-only URIs. Issue facebookresearch#45.
We no longer require running compiler_gym.bin.datasets to download a dataset for testing. Issue facebookresearch#45.
This updates the documentation (getting started guide, tutorials, API reference, etc.) to the new dataset API. In general, this means simplifying things, as we no longer need to explain how to download and manage datasets. Issue facebookresearch#45.
This switches over the `CompilerEnv` environment to use the new dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol buffer has been used to provide a serializable representation of benchmarks that can be passed back and forth between the service and the frontend. Initially, it was up to the compiler service to maintain the set of available benchmarks, exposing the available benchmarks with a `GetBenchmarks()` RPC method, and allowing new benchmarks to be added using an `AddBenchmarks()` method. This was fine for the initial use case of shipping a handful of benchmarks and allowing ad-hoc new benchmarks to be added, but for managing larger sets of benchmarks, a *datasets* abstraction was added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a [Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20) tuple was added that describes a set of programs, along with a link to a tarball containing those programs. The tarball is required to have a JSON file containing metadata, and a directory containing the benchmarks, one file per benchmark. A set of operations were added to the frontend command line to make downloading and unpacking these tarballs easier: https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction.** Both the environment and backend service have to know about datasets. This means redundant duplicated logic, and adds a maintenance burden of keeping the C++/python logic in sync.

(2) **Inflexible.** Only supports environments in which a single file represents a benchmark. No support for multi-file benchmarks, benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n* is the total number of benchmarks. At init time, each service needs to recursively scan a directory tree to build a list of available benchmarks. This list must be kept in memory. This adds startup time, and also causes cache invalidation issues when multiple environment instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns the benchmarks and datasets, not the service. This uses the new `Dataset` class hierarchy that has been added in previous pull requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead, the service simply keeps a small cache of benchmarks that it has seen. If a session request has a benchmark URI that is not in this cache, the service returns a "resource not found" error and the frontend logic can then respond by sending it a copy of the benchmark as a `Benchmark` proto. The service is free to cache this for future use, and can empty the cache whenever it wants.

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes much easier for users to define their own datasets.

(2) Reduces compiler service startup time, as it removes the need for each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully enumerable, allowing for program generators that can produce a theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can now be installed on-demand.

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark` instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is provided at init time, one is chosen deterministically. If you wish to select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a random benchmark. If you would like a random benchmark, you must now roll your own random picker using `env.datasets.benchmark_uris()` and similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior to no-ops, and removing the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case naming convention of gym. The old `cBench` datasets are kept around but are marked deprecated to encourage migration.

Migrating to the new interface
------------------------------

To migrate existing code to the new interface (a sketch follows this list):

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a benchmark. Previously, calling `env.reset()` would select a random benchmark. Now, `env.reset()` always selects the last used benchmark, or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select benchmarks randomly. Now, `env.benchmark` is always set to the previously used benchmark, or a predetermined default benchmark if none has been provided.

(5) Remove calls to `env.require_dataset()`.

Issue facebookresearch#45.
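A hedged sketch of the first few migration steps applied to a typical script; the benchmark name is an example only:

```python
import compiler_gym

env = compiler_gym.make("llvm-v0")

# (1) cBench-v0 -> cbench-v1 naming, and (3)/(4) pass a benchmark explicitly
# rather than relying on env.reset() to pick one at random. For random
# selection, roll your own picker over env.datasets.benchmark_uris(), as in
# the earlier sketch.
env.reset(benchmark="cbench-v1/qsort")

# (2) env.benchmark is now a Benchmark instance; use .uri where a string
# name is required.
print(env.benchmark.uri)

env.close()
```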
This replaces the boolean `hidden` value with a `deprecated` message, which is emitted automatically on a call to `install()`. Issue facebookresearch#45. Fixes facebookresearch#219.
Redesign the Dataset class to not depend on tarballs and particular data structures.
Currently the datasets are hard-coded as tarballs - https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/envs/llvm/datasets.py - which later get unpacked into a particular format where the directory structure is very important.
This means that we have to curate them. E.g. we can't pull benchmarks from Anghabench directly; we have to host a tarball somewhere.
Also, if we had random program generators, e.g. CSmith, CLGen, we couldn't really work with them.
Instead, add install and extract methods to the Dataset class, something like the sketch below:
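The concrete list of proposed methods is not preserved in this copy of the issue; purely as an illustration, such a class might look like this (the names `install`, `benchmark`, and `site_data_path` are mine, not the proposal's):

```python
import abc
from pathlib import Path


class Dataset(abc.ABC):
    """A named collection of benchmarks that manages its own storage."""

    def __init__(self, name: str, site_data_path: Path):
        self.name = name
        self.site_data_path = site_data_path  # e.g. ~/.compiler_gym/datasets/<name>

    @abc.abstractmethod
    def install(self) -> None:
        """Fetch whatever this dataset needs (a tarball, a git checkout, or
        nothing at all for a program generator) into site_data_path."""

    @abc.abstractmethod
    def benchmark(self, uri: str):
        """Materialize a single benchmark on demand, extracting or generating
        it as needed."""
```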
At the same time, something like this would free datasets to have their own directory structure. This might be useful when considering input data for correctness and performance. Sometimes multiple benchmarks will share things.
At init time, the gym could look in the dataset dir (e.g. ~/.compiler_gym/datasets or [ENV COMPILER_GYM_DIR]). Any python scripts in there could be run to register datasets.
Programmatically, people could register their own datasets, outside of that common mechanism.
A command line tool, `install_dataset url`, could fetch a script from the url, drop it in the dataset dir, then run install.