[llvm] Add the AnghaBench dataset. #210

Merged 1 commit on Apr 27, 2021


ChrisCummins

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de
Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira
Guimaraes, and Fernando Magno Quinão Pereira. "ANGHABENCH: A Suite
with One Million Compilable C Benchmarks for Code-Size Reduction."
In 2021 IEEE/ACM International Symposium on Code Generation and
Optimization (CGO), pp. 378-390. IEEE, 2021.

Issue #45.

ChrisCummins added this to the v0.1.8 milestone Apr 23, 2021
facebook-github-bot added the CLA Signed label Apr 23, 2021
ChrisCummins requested review from hughleat and JD-ETH April 23, 2021 17:22
ChrisCummins force-pushed the llvm-datasets-4 branch 2 times, most recently from 598e7a2 to 4e57d83 April 26, 2021 21:38
ChrisCummins force-pushed the llvm-datasets-5 branch 2 times, most recently from c43019f to c6de6fb April 26, 2021 21:42
JD-ETH left a comment

LGTM, though I had some questions.

This dataset will be really nice for training!

Review threads (all resolved):
compiler_gym/envs/llvm/datasets/anghabench.py
tests/llvm/datasets/anghabench_test.py
compiler_gym/envs/llvm/datasets/anghabench.py (outdated)
hughleat left a comment

LGTM

Base automatically changed from llvm-datasets-4 to development April 27, 2021 11:47
ChrisCummins merged commit 6a2740c into development Apr 27, 2021
ChrisCummins deleted the llvm-datasets-5 branch April 27, 2021 11:47
ChrisCummins mentioned this pull request Apr 30, 2021
ChrisCummins added a commit that referenced this pull request Apr 30, 2021
This release significantly changes the way that benchmarks are
managed by introducing a new dataset API. This enabled us to add
support for millions of new benchmarks and a more efficient
implementation of the LLVM environment, but it requires migrating
old code to the new interfaces (see “Migration Checklist” below).
The key changes in this release are:

-   [Core API change] We have added a Python Benchmark class (#190). The
    env.benchmark attribute is now an instance of this class rather than
    a string (#222).
-   [Core behavior change] Environments no longer select benchmarks
    randomly. env.reset() now always selects the last-used benchmark,
    unless the benchmark argument is provided or env.benchmark has
    been set. If no benchmark has been specified, a default is used.
-   [API deprecations] We have added a new Dataset class hierarchy
    (#191, #192). All datasets are now available without needing to be
    downloaded first, and a new Datasets class can be used to iterate
    over them (#200). We have deprecated the old dataset management
    operations and the compiler_gym.bin.datasets script, and removed
    the --dataset and --ls_benchmark flags from the command line tools.
-   [RPC interface change] The StartSession RPC endpoint now accepts a
    list of initial observations to compute. This removes the need for
    an immediate call to Step, reducing environment reset time by 15-21%
    (#189).
-   [LLVM] We have added several new datasets of benchmarks, including
    the Csmith and llvm-stress program generators (#207), a dataset of
    OpenCL kernels (#208), and a dataset of compilable C functions
    (#210). See the docs for an overview.
-   CompilerEnv now takes an optional Logger instance at construction
    time for fine-grained control over logging output (#187).
-   [LLVM] The ModuleID and source_filename of LLVM-IR modules are now
    anonymized to prevent unintentional overfitting to benchmarks by
    name (#171).
-   [docs] We have added a Feature Stability section to the
    documentation (#196).
-   Numerous bug fixes and improvements.
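
The new env.reset() behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyEnv, DEFAULT_BENCHMARK, and the example-v0 URIs are hypothetical stand-ins invented for illustration, not CompilerGym classes, and benchmarks are modeled as plain strings for brevity.

```python
from typing import Optional


class ToyEnv:
    """Hypothetical stand-in for an environment; NOT the CompilerGym code."""

    # Hypothetical default, used when no benchmark has ever been specified.
    DEFAULT_BENCHMARK = "benchmark://example-v0/default"

    def __init__(self) -> None:
        self.benchmark: str = self.DEFAULT_BENCHMARK

    def reset(self, benchmark: Optional[str] = None) -> str:
        # An explicit argument wins and becomes the last-used benchmark.
        if benchmark is not None:
            self.benchmark = benchmark
        # With no argument, the last-used (or default) benchmark is reused;
        # nothing is ever chosen at random.
        return self.benchmark


env = ToyEnv()
first = env.reset()                                       # default benchmark
second = env.reset(benchmark="benchmark://example-v0/a")  # explicit benchmark
third = env.reset()                                       # reuses the last-used one
```

In this model, repeated calls to reset() are deterministic, which is the property that code written against the old random-selection behavior needs to account for.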

Migration Checklist. Please use this checklist when updating code
written for the previous CompilerGym release:

-   Review code that accesses the env.benchmark property and update to
    env.benchmark.uri if a string name is required. Setting this
    attribute by string (env.benchmark = "benchmark://a-v0/b") and
    comparison to string types (env.benchmark == "benchmark://a-v0/b")
    still work.
-   Review code that calls env.reset() without first setting a
    benchmark. Previously, calling env.reset() would select a random
    benchmark. Now, env.reset() always selects the last used benchmark,
    or a predetermined default if none is specified.
-   Review code that relies on env.benchmark being None to select
    benchmarks randomly. Now, env.benchmark is always set to the
    previously used benchmark, or a predetermined default benchmark if
    none has been specified. Setting env.benchmark = None will raise an
    error. Select a benchmark randomly by sampling from the
    env.datasets.benchmark_uris() iterator.
-   Remove calls to env.require_dataset() and related operations. These
    are no longer required.
-   Remove accesses to env.benchmarks. An iterator over available
    benchmark URIs is now available at env.datasets.benchmark_uris(),
    but the list of URIs cannot be relied on to be fully enumerable (the
    LLVM environments have over 2^32 URIs).
-   Review code that accesses env.observation_space and update to
    env.observation_space_spec where necessary (#228).
-   Update compiler service implementations to support the updated RPC
    interface by removing the deprecated GetBenchmarks RPC endpoint and
    replacing it with Dataset classes. See the example service for
    details.
-   [LLVM] Update references to the poj104-v0 dataset to poj104-v1.
-   [LLVM] Update references to the cBench-v1 dataset to cbench-v1.
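
The checklist item on random selection can be sketched as follows. Because the URI iterator cannot be relied on to be fully enumerable, this sketch samples from a bounded prefix of it rather than materializing the whole stream. The generator fake_benchmark_uris() is a hypothetical stand-in for env.datasets.benchmark_uris(), and sample_uri() and pool_size are illustrative names, not part of CompilerGym.

```python
import itertools
import random
from typing import Iterable, Iterator


def fake_benchmark_uris() -> Iterator[str]:
    """Hypothetical stand-in for env.datasets.benchmark_uris(): an unbounded stream."""
    for i in itertools.count():
        yield f"benchmark://example-v0/{i}"


def sample_uri(uris: Iterable[str], pool_size: int, rng: random.Random) -> str:
    """Pick one URI uniformly from the first `pool_size` items of the stream."""
    # islice stops after pool_size items, so an unbounded iterator is safe here.
    pool = list(itertools.islice(uris, pool_size))
    if not pool:
        raise ValueError("no benchmark URIs available")
    return rng.choice(pool)


rng = random.Random(0)  # seeded so that runs are reproducible
uri = sample_uri(fake_benchmark_uris(), pool_size=100, rng=rng)
```

Sampling from a prefix is biased toward benchmarks that appear early in the iteration order; whether that matters depends on how the datasets are ordered, which this sketch does not assume.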
bwasti pushed a commit to bwasti/CompilerGym that referenced this pull request Aug 3, 2021