[hailtop.batch] add default_regions to hb.Batch, improve docs #14224

Merged: 7 commits, Feb 2, 2024. Showing changes from all commits.

110 changes: 86 additions & 24 deletions hail/python/hailtop/batch/backend.py
@@ -413,42 +413,100 @@ async def _async_close(self):


class ServiceBackend(Backend[bc.Batch]):
ANY_REGION: ClassVar[List[str]] = ['any_region']

"""Backend that executes batches on Hail's Batch Service on Google Cloud.
Review comment (Contributor Author): This was actually the doc-string for the ANY_REGION class variable. By moving the class variable down, everything started rendering properly again.
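
A minimal sketch of the Sphinx convention at issue, with hypothetical names (Example, LIMIT): autodoc attaches a string literal to the attribute whose assignment immediately precedes it, so the docstring must follow the assignment rather than precede it.

    from typing import ClassVar

    class Example:
        """Class docstring: the first string literal in the class body."""

        LIMIT: ClassVar[int] = 10
        """Attribute docstring: attached to LIMIT because it directly follows
        the assignment; a string placed before the assignment is a no-op
        expression and documents nothing."""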


Examples
--------

>>> service_backend = ServiceBackend(billing_project='my-billing-account', remote_tmpdir='gs://my-bucket/temporary-files/') # doctest: +SKIP
>>> b = Batch(backend=service_backend) # doctest: +SKIP
Create and use a backend that bills to the Hail Batch billing project named "my-billing-account"
and stores temporary intermediate files in "gs://my-bucket/temporary-files".

>>> import hailtop.batch as hb
>>> service_backend = hb.ServiceBackend(
... billing_project='my-billing-account',
... remote_tmpdir='gs://my-bucket/temporary-files/'
... ) # doctest: +SKIP
>>> b = hb.Batch(backend=service_backend) # doctest: +SKIP
>>> j = b.new_job() # doctest: +SKIP
>>> j.command('echo hello world!') # doctest: +SKIP
>>> b.run() # doctest: +SKIP
>>> service_backend.close() # doctest: +SKIP

If the Hail configuration parameters batch/billing_project and
batch/remote_tmpdir were previously set with ``hailctl config set``, then
one may elide the `billing_project` and `remote_tmpdir` parameters.
Same as above, but set the billing project and temporary intermediate folders via a
configuration file::

>>> service_backend = ServiceBackend()
>>> b = Batch(backend=service_backend)
>>> b.run() # doctest: +SKIP
>>> service_backend.close()
cat >my-batch-script.py <<EOF
import hailtop.batch as hb
b = hb.Batch(backend=hb.ServiceBackend())
j = b.new_job()
j.command('echo hello world!')
b.run()
EOF
hailctl config set batch/billing_project my-billing-account
hailctl config set batch/remote_tmpdir gs://my-bucket/temporary-files/
python3 my-batch-script.py

Same as above, but also specify the use of the :class:`.ServiceBackend` via configuration file::

cat >my-batch-script.py <<EOF
import hailtop.batch as hb
b = hb.Batch()
j = b.new_job()
j.command('echo hello world!')
b.run()
EOF
hailctl config set batch/billing_project my-billing-account
hailctl config set batch/remote_tmpdir gs://my-bucket/temporary-files/
hailctl config set batch/backend service
python3 my-batch-script.py

Create a backend which stores temporary intermediate files in
"https://my-account.blob.core.windows.net/my-container/tempdir".

>>> service_backend = hb.ServiceBackend(
... billing_project='my-billing-account',
... remote_tmpdir='https://my-account.blob.core.windows.net/my-container/tempdir'
... ) # doctest: +SKIP

Require all jobs in all batches using this backend to execute in us-central1::

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1']))

Same as above, but using a configuration file::

hailctl config set batch/regions us-central1
python3 my-batch-script.py

Same as above, but using the ``HAIL_BATCH_REGIONS`` environment variable::

export HAIL_BATCH_REGIONS=us-central1
python3 my-batch-script.py

Permit jobs to execute in *either* us-central1 or us-east1::

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1', 'us-east1']))

Same as above, but using a configuration file::

hailctl config set batch/regions us-central1,us-east1

Allow reading or writing to buckets even though they are "cold" storage:

>>> b = hb.Batch(
... backend=hb.ServiceBackend(
... gcs_bucket_allow_list=['cold-bucket', 'cold-bucket2'],
... ),
Review comment (Contributor) — suggested change: replace ``... ),`` with ``... )``, i.e. drop the trailing comma.

Review comment (Contributor Author): This is valid Python and the preferred style of the ruff formatter we're now using.
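
A minimal sketch of the style point, assuming ruff's default "magic trailing comma" behavior (compatible with Black): a trailing comma after the last argument keeps the call expanded when reformatted.

    import hailtop.batch as hb

    # With the trailing comma, the formatter keeps one argument per line:
    backend = hb.ServiceBackend(
        gcs_bucket_allow_list=['cold-bucket', 'cold-bucket2'],
    )

    # Without it, the formatter may collapse the call onto a single line:
    backend = hb.ServiceBackend(gcs_bucket_allow_list=['cold-bucket', 'cold-bucket2'])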

... )

Parameters
----------
billing_project:
Name of billing project to use.
bucket:
Name of bucket to use. Should not include the ``gs://`` prefix. Cannot be used with
`remote_tmpdir`. Temporary data will be stored in the "/batch" folder of this
bucket. This argument is deprecated. Use `remote_tmpdir` instead.
This argument is deprecated. Use `remote_tmpdir` instead.
remote_tmpdir:
Temporary data will be stored in this cloud storage folder. Cannot be used with deprecated
argument `bucket`. Paths should match a GCS URI like gs://<BUCKET_NAME>/<PATH> or an ABS
URI of the form https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH>.
Temporary data will be stored in this cloud storage folder.
google_project:
DEPRECATED. Please use gcs_requester_pays_configuration.
This argument is deprecated. Use `gcs_requester_pays_configuration` instead.
gcs_requester_pays_configuration : either :class:`str` or :class:`tuple` of :class:`str` and :class:`list` of :class:`str`, optional
If a string is provided, configure the Google Cloud Storage file system to bill usage to the
project identified by that string. If a tuple is provided, configure the Google Cloud
@@ -458,15 +516,19 @@ class ServiceBackend(Backend[bc.Batch]):
The authorization token to pass to the batch client.
Should only be set for user delegation purposes.
regions:
Cloud region(s) to run jobs in. Use py:staticmethod:`.ServiceBackend.supported_regions` to list the
available regions to choose from. Use py:attribute:`.ServiceBackend.ANY_REGION` to signify the default is jobs
can run in any available region. The default is jobs can run in any region unless a default value has
been set with hailctl. An example invocation is `hailctl config set batch/regions "us-central1,us-east1"`.
Cloud regions in which jobs may run. :attr:`.ServiceBackend.ANY_REGION` indicates jobs may
run in any region. If unspecified or ``None``, the ``batch/regions`` Hail configuration
variable is consulted. See examples above. If none of these variables are set, then jobs may
run in any region. :meth:`.ServiceBackend.supported_regions` lists the available regions.
Review comment (Contributor Author): The py:attribute: stuff didn't seem to have any effect; it just rendered as text. These generate correct links.
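
A minimal sketch of the difference in reStructuredText (the first form is not a valid role reference, so Sphinx renders it as plain text):

    py:attribute:`.ServiceBackend.ANY_REGION`    (missing leading colon; renders as literal text)
    :attr:`.ServiceBackend.ANY_REGION`           (links to the attribute)
    :meth:`.ServiceBackend.supported_regions`    (links to the method)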

gcs_bucket_allow_list:
A list of buckets that the :class:`.ServiceBackend` should be permitted to read from or write to, even if their
default policy is to use "cold" storage. Should look like ``["bucket1", "bucket2"]``.
default policy is to use "cold" storage.

"""

ANY_REGION: ClassVar[List[str]] = ['any_region']
"""A special value that indicates a job may run in any region."""

@staticmethod
def supported_regions():
"""
24 changes: 17 additions & 7 deletions hail/python/hailtop/batch/batch.py
@@ -24,7 +24,8 @@ class Batch:
--------
Create a batch object:

>>> p = Batch()
>>> import hailtop.batch as hb
>>> p = hb.Batch()

Create a new job that prints "hello":

@@ -35,6 +36,10 @@

>>> p.run()

Require all jobs in this batch to execute in us-central1:

>>> b = hb.Batch(backend=hb.ServiceBackend(), default_regions=['us-central1'])

Notes
-----

@@ -77,6 +82,9 @@ class Batch:
default_storage:
Storage setting to use by default if not specified by a job. Only
applicable for the :class:`.ServiceBackend`. See :meth:`.Job.storage`.
default_regions:
Cloud regions in which jobs may run. When unspecified or ``None``, use the regions attribute of
:class:`.ServiceBackend`. See :class:`.ServiceBackend` for details.
default_timeout:
Maximum time in seconds for a job to run before being killed. Only
applicable for the :class:`.ServiceBackend`. If `None`, there is no
@@ -157,6 +165,7 @@ def __init__(
default_memory: Optional[Union[int, str]] = None,
default_cpu: Optional[Union[float, int, str]] = None,
default_storage: Optional[Union[int, str]] = None,
default_regions: Optional[List[str]] = None,
default_timeout: Optional[Union[float, int]] = None,
default_shell: Optional[str] = None,
default_python_image: Optional[str] = None,
@@ -195,6 +204,9 @@
self._default_memory = default_memory
self._default_cpu = default_cpu
self._default_storage = default_storage
self._default_regions = default_regions
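# Fall back to the backend-level regions when the batch does not set its own default.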
if self._default_regions is None and isinstance(self._backend, _backend.ServiceBackend):
self._default_regions = self._backend.regions
self._default_timeout = default_timeout
self._default_shell = default_shell
self._default_python_image = default_python_image
@@ -316,14 +328,13 @@ def new_bash_job(
j.cpu(self._default_cpu)
if self._default_storage is not None:
j.storage(self._default_storage)
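# Apply the batch-level default regions; a subsequent Job.regions call overrides it per job.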
if self._default_regions is not None:
j.regions(self._default_regions)
if self._default_timeout is not None:
j.timeout(self._default_timeout)
if self._default_spot is not None:
j.spot(self._default_spot)

if isinstance(self._backend, _backend.ServiceBackend):
j.regions(self._backend.regions)

self._jobs.append(j)
return j

@@ -388,14 +399,13 @@ def hello(name):
j.cpu(self._default_cpu)
if self._default_storage is not None:
j.storage(self._default_storage)
if self._default_regions is not None:
j.regions(self._default_regions)
if self._default_timeout is not None:
j.timeout(self._default_timeout)
if self._default_spot is not None:
j.spot(self._default_spot)

if isinstance(self._backend, _backend.ServiceBackend):
j.regions(self._backend.regions)

self._jobs.append(j)
return j

83 changes: 71 additions & 12 deletions hail/python/hailtop/batch/docs/service.rst
@@ -227,22 +227,15 @@
Submitting a Batch to the Service
---------------------------------

.. warning::

To avoid substantial network costs, ensure your jobs and data reside in the same `region`_.

To execute a batch on the Batch service rather than locally, first
construct a :class:`.ServiceBackend` object with a billing project and
bucket for storing intermediate files. Your service account must have read
and write access to the bucket.

.. warning::

By default, the Batch Service runs jobs in any region in the US. Make sure you have considered additional `ingress and
egress fees <https://cloud.google.com/storage/pricing>`_ when using regional buckets and container or artifact
registries. Multi-regional buckets also have additional replication fees when writing data. A good rule of thumb is to use
a multi-regional artifact registry for Docker images and regional buckets for data. You can then specify which region(s)
you want your job to run in with :meth:`.Job.regions`. To set the default region(s) for all jobs, you can set the input
regions argument to :class:`.ServiceBackend` or use hailctl to set the default value. An example invocation is
`hailctl config set batch/regions "us-central1,us-east1"`. You can also get the full list of supported regions
with py:staticmethod:`.ServiceBackend.supported_regions`.

Next, pass the :class:`.ServiceBackend` object to the :class:`.Batch` constructor
with the parameter name `backend`.

@@ -252,7 +245,7 @@

.. code-block:: python

>>> import hailtop.batch as hb # doctest: +SKIP
>>> import hailtop.batch as hb
>>> backend = hb.ServiceBackend('my-billing-project', remote_tmpdir='gs://my-bucket/batch/tmp/') # doctest: +SKIP
>>> b = hb.Batch(backend=backend, name='test') # doctest: +SKIP
>>> j = b.new_job(name='hello') # doctest: +SKIP
@@ -271,6 +264,72 @@ have previously set them with ``hailctl``:

A trial billing project is automatically created for you with the name {USERNAME}-trial

.. _region:

Regions
-------

Data and compute both reside in a physical location. In Google Cloud Platform, the location of data
is controlled by the location of the containing bucket. ``gcloud`` can determine the location of a
bucket::

gcloud storage buckets describe gs://my-bucket

If your compute resides in a different location from the data it reads or writes, then you will
accrue substantial `network charges <https://cloud.google.com/storage/pricing#network-pricing>`__.
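
For scripting, the location alone can be extracted (a sketch assuming the standard ``--format``
projection syntax of ``gcloud``)::

    gcloud storage buckets describe gs://my-bucket --format="value(location)"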

To avoid network charges, ensure all your data is in one region and specify that region in one of the
following five ways. As a running example, we consider data stored in `us-central1`. The options are
listed from highest to lowest precedence.

1. :meth:`.Job.regions`:

.. code-block:: python

>>> b = hb.Batch(backend=hb.ServiceBackend())
>>> j = b.new_job()
>>> j.regions(['us-central1'])

2. The ``default_regions`` parameter of :class:`.Batch`:

.. code-block:: python

>>> b = hb.Batch(backend=hb.ServiceBackend(), default_regions=['us-central1'])


3. The ``regions`` parameter of :class:`.ServiceBackend`:

.. code-block:: python

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1']))

4. The ``HAIL_BATCH_REGIONS`` environment variable:

.. code-block:: sh

export HAIL_BATCH_REGIONS=us-central1
python3 my-batch-script.py

5. The ``batch/regions`` configuration variable:

.. code-block:: sh

hailctl config set batch/regions us-central1
python3 my-batch-script.py

.. warning::

If none of the five options above are specified, your job may run in *any* region!
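
A minimal sketch combining options 1 and 2 above (the job-level call wins, per the precedence
ordering):

.. code-block:: python

    >>> b = hb.Batch(backend=hb.ServiceBackend(), default_regions=['us-central1'])
    >>> j = b.new_job()
    >>> j.regions(['us-east1'])  # this job runs in us-east1, overriding the batch default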

In Google Cloud Platform, the location of a multi-region bucket is considered *different* from any
region within that multi-region. For example, if a VM in the `us-central1` region reads data from a
bucket in the `us` multi-region, this incurs network charges because `us` is not considered equal to
`us-central1`.

Container (aka Docker) images are a form of data. In Google Cloud Platform, we recommend storing
your images in a multi-regional artifact registry, which, at the time of writing, despite being
"multi-regional", does not incur network charges in the manner described above.


Using the UI
------------
30 changes: 29 additions & 1 deletion hail/python/test/hailtop/batch/test_batch_service_backend.py
@@ -798,7 +798,7 @@ async def foo(i, j):


def test_specify_job_region(backend: ServiceBackend):
b = batch(backend, cancel_after_n_failures=1)
b = batch(backend)
j = b.new_job('region')
possible_regions = backend.supported_regions()
j.regions(possible_regions)
@@ -809,6 +809,34 @@ def test_specify_job_region(backend: ServiceBackend):
assert res_status['state'] == 'success', str((res_status, res.debug_info()))


def test_job_regions_controls_job_execution_region(backend: ServiceBackend):
the_region = backend.supported_regions()[0]

b = batch(backend)
j = b.new_job()
j.regions([the_region])
j.command('true')
res = b.run()

assert res
job_status = res.get_job(1).status()
assert job_status['status']['region'] == the_region, str((job_status, res.debug_info()))


def test_job_regions_overrides_batch_regions(backend: ServiceBackend):
the_region = backend.supported_regions()[0]

b = batch(backend, default_regions=['some-other-region'])
j = b.new_job()
j.regions([the_region])
j.command('true')
res = b.run()

assert res
job_status = res.get_job(1).status()
assert job_status['status']['region'] == the_region, str((job_status, res.debug_info()))


def test_always_copy_output(backend: ServiceBackend, output_tmpdir: str):
output_path = os.path.join(output_tmpdir, 'test_always_copy_output.txt')
