[llvm] Add the AnghaBench dataset. #210

ChrisCummins · 2021-04-23T17:21:58Z

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de
Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira
Guimaraes, and Fernando Magno Quinão Pereira. "ANGHABENCH: A Suite
with One Million Compilable C Benchmarks for Code-Size Reduction."
In 2021 IEEE/ACM International Symposium on Code Generation and
Optimization (CGO), pp. 378-390. IEEE, 2021.

Issue #45.

JD-ETH

lgtm, some questions I had.

This dataset will be really nice for training!

hughleat

LGTM

This release introduces significant changes to the way that benchmarks are managed, including a new dataset API. This enabled us to add support for millions of new benchmarks and a more efficient implementation for the LLVM environment, but it requires migrating old code to the new interfaces (see “Migration Checklist” below). Some of the key changes of this release are:

- [Core API change] We have added a Python Benchmark class (#190). The env.benchmark attribute is now an instance of this class rather than a string (#222).
- [Core behavior change] Environments no longer select benchmarks randomly. env.reset() now always selects the last-used benchmark, unless the benchmark argument is provided or env.benchmark has been set. If no benchmark is specified, a default is used.
- [API deprecations] We have added a new Dataset class hierarchy (#191, #192). All datasets are now available without needing to be downloaded first, and a new Datasets class can be used to iterate over them (#200). We have deprecated the old dataset management operations and the compiler_gym.bin.datasets script, and removed the --dataset and --ls_benchmark flags from the command line tools.
- [RPC interface change] The StartSession RPC endpoint now accepts a list of initial observations to compute. This removes the need for an immediate call to Step, reducing environment reset time by 15-21% (#189).
- [LLVM] We have added several new datasets of benchmarks, including the Csmith and llvm-stress program generators (#207), a dataset of OpenCL kernels (#208), and a dataset of compilable C functions (#210). See the docs for an overview.
- CompilerEnv now takes an optional Logger instance at construction time for fine-grained control over logging output (#187).
- [LLVM] The ModuleID and source_filename of LLVM-IR modules are now anonymized to prevent unintentional overfitting to benchmarks by name (#171).
- [docs] We have added a Feature Stability section to the documentation (#196).
- Numerous bug fixes and improvements.

Please use this checklist when updating code written for the previous CompilerGym release:

- Review code that accesses the env.benchmark property and update to env.benchmark.uri if a string name is required. Setting this attribute by string (env.benchmark = "benchmark://a-v0/b") and comparison to string types (env.benchmark == "benchmark://a-v0/b") still work.
- Review code that calls env.reset() without first setting a benchmark. Previously, calling env.reset() would select a random benchmark. Now, env.reset() always selects the last-used benchmark, or a predetermined default if none is specified.
- Review code that relies on env.benchmark being None to select benchmarks randomly. Now, env.benchmark is always set to the previously used benchmark, or a predetermined default benchmark if none has been specified. Setting env.benchmark = None will raise an error. Select a benchmark randomly by sampling from the env.datasets.benchmark_uris() iterator.
- Remove calls to env.require_dataset() and related operations. These are no longer required.
- Remove accesses to env.benchmarks. An iterator over available benchmark URIs is now available at env.datasets.benchmark_uris(), but the list of URIs cannot be relied on to be fully enumerable (the LLVM environments have over 2^32 URIs).
- Review code that accesses env.observation_space and update to env.observation_space_spec where necessary (#228).
- Update compiler service implementations to support the updated RPC interface by removing the deprecated GetBenchmarks RPC endpoint and replacing it with Dataset classes. See the example service for details.
- [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
- [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
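The checklist suggests picking a random benchmark by sampling from the env.datasets.benchmark_uris() iterator, while warning that the iterator may not be fully enumerable. One way to do that is to sample from a bounded prefix of the stream. The sketch below is illustrative only: fake_benchmark_uris() is a hypothetical stand-in for the real iterator so that the example runs without CompilerGym installed, and sample_uris() and its window parameter are names invented here, not part of the CompilerGym API.

```python
import itertools
import random
from typing import Iterable, Iterator, List


def fake_benchmark_uris() -> Iterator[str]:
    """Hypothetical stand-in for env.datasets.benchmark_uris().

    Yields an endless stream of made-up URIs to mimic an iterator
    that cannot be fully enumerated.
    """
    for i in itertools.count():
        yield f"benchmark://example-v0/{i}"


def sample_uris(
    uris: Iterable[str], k: int, window: int, rng: random.Random
) -> List[str]:
    """Pick up to k distinct URIs from the first `window` items of `uris`.

    Only a bounded prefix is materialized, so this is safe to call on an
    endless iterator. The sample is uniform over that prefix, not over the
    whole (possibly infinite) stream.
    """
    head = list(itertools.islice(uris, window))
    return rng.sample(head, min(k, len(head)))


rng = random.Random(0)  # seeded so that runs are repeatable
picked = sample_uris(fake_benchmark_uris(), k=3, window=1000, rng=rng)
print(picked)
```

With a real environment, each sampled URI would presumably be passed through the benchmark argument of env.reset() mentioned in the release notes. Note that sampling from a prefix biases selection toward whichever benchmarks the iterator yields first, so a larger window trades memory for coverage.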

ChrisCummins added this to the v0.1.8 milestone Apr 23, 2021

facebook-github-bot added the CLA Signed label Apr 23, 2021

ChrisCummins requested review from hughleat and JD-ETH April 23, 2021 17:22

ChrisCummins force-pushed the llvm-datasets-4 branch 2 times, most recently from 598e7a2 to 4e57d83 April 26, 2021 21:38

ChrisCummins force-pushed the llvm-datasets-5 branch 2 times, most recently from c43019f to c6de6fb April 26, 2021 21:42

JD-ETH approved these changes Apr 27, 2021

compiler_gym/envs/llvm/datasets/anghabench.py (resolved)

tests/llvm/datasets/anghabench_test.py (resolved)

compiler_gym/envs/llvm/datasets/anghabench.py (outdated, resolved)

ChrisCummins force-pushed the llvm-datasets-4 branch from 4e57d83 to 26b6001 April 27, 2021 07:22

hughleat approved these changes Apr 27, 2021

ChrisCummins force-pushed the llvm-datasets-4 branch from 26b6001 to c2f6ab7 April 27, 2021 07:51

ChrisCummins force-pushed the llvm-datasets-5 branch from c6de6fb to 31c2b3f April 27, 2021 08:15

ChrisCummins force-pushed the llvm-datasets-4 branch from c2f6ab7 to a7e80c1 April 27, 2021 08:16

ChrisCummins force-pushed the llvm-datasets-5 branch from 31c2b3f to 9c15186 April 27, 2021 08:17

ChrisCummins force-pushed the llvm-datasets-4 branch from a7e80c1 to 0989103 April 27, 2021 09:48

ChrisCummins force-pushed the llvm-datasets-5 branch from 9c15186 to 554eed1 April 27, 2021 09:49

Base automatically changed from llvm-datasets-4 to development April 27, 2021 11:47

ChrisCummins merged commit 6a2740c into development Apr 27, 2021

ChrisCummins deleted the llvm-datasets-5 branch April 27, 2021 11:47

ChrisCummins mentioned this pull request Apr 30, 2021

Release v0.1.8 #238

Merged

9 tasks

ChrisCummins mentioned this pull request Apr 30, 2021

Release v0.1.8 #241

Merged

9 tasks
