Parquet to BigQuery import for GCP-backed AnVIL snapshots (#6355) #6392

Open
wants to merge 4 commits into
base: develop

nadove-ucsc
Contributor

@nadove-ucsc nadove-ucsc commented Jul 10, 2024

Connected issues: #6355

Checklist

Author

  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • On ZenHub, PR is connected to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches¹ that of a connected issue or comment in PR explains why they're different
  • PR title references all connected issues
  • For each connected issue, there is at least one commit whose title references that issue

¹ When the issue title describes a problem, the corresponding PR title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all connected issues
  • This PR partially resolves each of the connected issues or does not have the partial label

Author (chains)

  • This PR is blocked by previous PR in the chain or is not chained to another PR
  • The blocking PR is labeled base or this PR is not chained to another PR
  • This PR is labeled chained or is not chained to another PR

Author (reindex, API changes)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod
  • This PR and its connected issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make image_manifests.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify image_manifests.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any connected issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues connected to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed old fixups
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile and Dockerfile
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome

Peer reviewer (after approval)

  • PR is not a draft
  • Ticket is in Review requested column
  • PR is awaiting requested review from system administrator
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled connected issues as demo or no demo
  • Commented on connected issues about demo expectations or all connected issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Moved ticket to Approved column
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all connected issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub
  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Moved connected issues to Merged lower column in ZenHub
  • Moved blocked issues to Triage or no issues are blocked on the connected issues
  • Pushed merge commit to GitHub

Operator (chain shortening)

  • Changed the target branch of the blocked PR to develop or this PR is not labeled base
  • Removed the chained label from the blocked PR or this PR is not labeled base
  • Removed the blocking relationship from the blocked PR or this PR is not labeled base
  • Removed the base label from this PR or this PR is not labeled base

Operator (after pushing the merge commit)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@github-actions github-actions bot added the orange [process] Done by the Azul team label Jul 10, 2024
coveralls commented Jul 10, 2024

Coverage Status

coverage: 85.139% (-0.3%) from 85.398%
when pulling 78545b6 on issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots
into 34a12fb on develop.

codecov bot commented Jul 10, 2024

Codecov Report

Attention: Patch coverage is 10.66667% with 67 lines in your changes missing coverage. Please review.

Project coverage is 85.12%. Comparing base (34a12fb) to head (78545b6).
Report is 437 commits behind head on develop.

Files with missing lines Patch % Lines
src/azul/terra.py 9.83% 55 Missing ⚠️
src/azul/plugins/repository/tdr_anvil/__init__.py 14.28% 12 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6392      +/-   ##
===========================================
- Coverage    85.38%   85.12%   -0.26%     
===========================================
  Files          155      155              
  Lines        20754    20823      +69     
===========================================
+ Hits         17720    17725       +5     
- Misses        3034     3098      +64     


@nadove-ucsc nadove-ucsc added the reqs [process] PR includes commit requiring ``make requirements`` label Jul 11, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch 2 times, most recently from ccd127d to 09c7a41 Compare July 11, 2024 03:37
 'Actual Google project of TDR source differs from configured one',
-source.project, source_spec.project)
+source.project, source_spec.project, config.google_project())
Contributor

@dsotirho-ucsc dsotirho-ucsc Jul 12, 2024

Suggested change
-source.project, source_spec.project, config.google_project())
+source_spec.project, source.project, config.google_project())

to match the order of these variables in the condition part of the require()

Contributor Author

I decided on a different approach here to make the diff clearer and hopefully make the intent more apparent.

Makefile (resolved)
@dsotirho-ucsc dsotirho-ucsc removed their assignment Jul 12, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from 09c7a41 to 0d1c70f Compare July 12, 2024 01:24
@nadove-ucsc
Contributor Author

I also moved some changes from the first commit to the 2nd, hence the other fixup.

dsotirho-ucsc
dsotirho-ucsc previously approved these changes Jul 12, 2024
Contributor

@dsotirho-ucsc dsotirho-ucsc left a comment

Approved.

@dsotirho-ucsc dsotirho-ucsc marked this pull request as ready for review July 12, 2024 16:02
Makefile Outdated
Comment on lines 105 to 106
python $(project_root)/scripts/reindex.py --import --sources "tdr:${GOOGLE_PROJECT}:snapshot/*"
python $(project_root)/scripts/verify_tdr_sources.py
Member

I think these two commands should be extracted to a separate make target called import, and a corresponding GitLab job. We also need to think about sandbox and personal deployments that typically share the sources with a main deployment. I think #6426 will help with this which is why I've added it as a blocker of #6355.

@@ -0,0 +1,103 @@
"""
Export parquet files from TDR and download them to local storage.
Member

Suggested change
-Export parquet files from TDR and download them to local storage.
+Export Parquet files from TDR and download them to local storage.

Here and elsewhere in the documentation.

@hannes-ucsc hannes-ucsc added the 0 reviews [process] Lead didn't request any changes label Jul 22, 2024
@hannes-ucsc hannes-ucsc removed their assignment Jul 22, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from 0d1c70f to b3d454f Compare July 23, 2024 00:38
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch 2 times, most recently from 6c01bba to 2168ed5 Compare July 31, 2024 21:30
@nadove-ucsc nadove-ucsc added the chained [process] PR needs to based of develop before merging label Jul 31, 2024
@nadove-ucsc nadove-ucsc changed the base branch from develop to issues/nadove-ucsc/6426-cleanup-generalize-tdr-source-spec July 31, 2024 21:31
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from 2168ed5 to aae4509 Compare July 31, 2024 21:32
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6426-cleanup-generalize-tdr-source-spec branch from 4589e62 to 1a0a33a Compare August 1, 2024 00:49
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from aae4509 to 5195059 Compare August 1, 2024 04:19
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6426-cleanup-generalize-tdr-source-spec branch from 192d97a to a5f7692 Compare August 1, 2024 23:48
@achave11-ucsc achave11-ucsc force-pushed the issues/nadove-ucsc/6426-cleanup-generalize-tdr-source-spec branch from 6e7481e to be2dfe3 Compare August 12, 2024 21:48
@achave11-ucsc achave11-ucsc changed the base branch from issues/nadove-ucsc/6426-cleanup-generalize-tdr-source-spec to develop August 13, 2024 14:49
@achave11-ucsc achave11-ucsc removed the chained [process] PR needs to based of develop before merging label Aug 13, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch 4 times, most recently from 084a127 to 14df96c Compare August 28, 2024 23:18
Member

@hannes-ucsc hannes-ucsc left a comment

Something's amiss here. It appears that this adds code that isn't actually exercised because none of the sources are configured to use Parquet. We should switch one source in dev, sandbox, anvildev and anvilbox to use Parquet. Let's also discuss in PL what to do about personal deployments.

Subject: [PATCH] make fo
---
Index: src/azul/terra.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/terra.py b/src/azul/terra.py
--- a/src/azul/terra.py	(revision 14df96cd41cd529093df1520abd73d725b44aa28)
+++ b/src/azul/terra.py	(date 1724981432852)
@@ -541,6 +541,7 @@
                         endpoint: furl,
                         response: urllib3.HTTPResponse
                         ) -> MutableJSON:
+        # REVIEW: Short comment here explaining when we'd expect a 202
         if response.status in (200, 202):
             return json.loads(response.data)
         # FIXME: Azul sometimes conflates 401 and 403
@@ -661,8 +662,8 @@
 
     def create_dataset(self, dataset_name: str):
         """
-        Create a BigQuery dataset in the project and region configured for the
-        current deployment.
+        Create a BigQuery dataset in the GCP project associated with the current
+        credentials and the GCP region configured for the current deployment.
 
         :param dataset_name: Unqualified name of the dataset to create.
                              `google.cloud.exceptions.Conflict` will be raised
@@ -670,10 +671,23 @@
         """
         bigquery = self._bigquery(self.credentials.project_id)
         ref = DatasetReference(bigquery.project, dataset_name)
+        # We get a false warning from PyCharm here, probably because of
+        #
+        # https://youtrack.jetbrains.com/issue/PY-23400/regression-PEP484-type-annotations-in-docstrings-nearly-completely-broken
+        #
+        # Google uses the docstring syntax to annotate types in its BQ client.
+        #
+        # noinspection PyTypeChecker
         dataset = Dataset(ref)
+        # REVIEW: This changes the meaning of AZUL_TDR_SOURCE_LOCATION somewhat.
+        #         While I don't think we need to introduce a new variable, we
+        #         should document the new semantics so that someone modifying
+        #         it is aware of the implications.
         dataset.location = config.tdr_source_location
         log.info('Creating BigQuery dataset %r in region %r',
                  dataset.dataset_id, dataset.location)
+        # REVIEW: This method returns something. Let's assert key aspects of the
+        #         return value.
         bigquery.create_dataset(dataset)
 
     def create_table(self,
@@ -692,8 +706,11 @@
 
         :param table_name: Unqualified name of the new table
 
+        REVIEW: Technically gs://… is a URI. If "URL" is TDR lingo I'm happy to
+                adopt it, otherwise we should use "URI".
+
         :param import_urls: URLs of Parquet file(s) to populate the table. These
-                            must be `gs://` URLS and the GCS bucket's region
+                            must be `gs://` URLs and the GCS bucket's region
                             must be compatible with the target dataset's. See
                             https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#limitations
 
@@ -705,7 +722,7 @@
                                   https://cloud.google.com/bigquery/docs/clustered-tables
         """
         for url in import_urls:
-            require(url.scheme == 'gs', url)
+            require(url.scheme == 'gs', 'Expected gs:// URI', url)
         table_id = f'{dataset_name}.{table_name}'
         bigquery = self._bigquery(self.credentials.project_id)
         write_disposition = (
@@ -716,25 +733,26 @@
             clustering_fields=clustering_fields,
             source_format=SourceFormat.PARQUET,
             # Avoids convoluted data types for array fields
+            # REVIEW: Please elaborate
             parquet_options=ParquetOptions.from_api_repr(dict(enable_list_inference=True))
         )
-        log.info('Creating BigQuery table %r',
-                 f'{bigquery.project}.{dataset_name}.{table_name}')
+        table_ref = f'{bigquery.project}.{table_id}'
+        log.info('Creating BigQuery table %r', table_ref)
         load_job = bigquery.load_table_from_uri(source_uris=list(map(str, import_urls)),
                                                 destination=table_id,
                                                 job_config=job_config)
         load_job.result()
-        log.info('Table created successfully')
+        log.info('Table %r created successfully', table_ref)
 
     def export_parquet_urls(self,
                             snapshot_id: str
                             ) -> Optional[dict[str, list[mutable_furl]]]:
         """
         Obtain URLs of Parquet files for the data tables of the specified
-        snapshot. This is an time-consuming operation that usually takes on the
-        order of 1 minute to complete.
+        snapshot. This is a time-consuming operation that usually takes on the
+        order of one minute to complete.
 
-        :param snapshot_id: The UUID of the snapshot.
+        :param snapshot_id: The UUID of the snapshot
 
         :return: A mapping of table names to lists of Parquet file download
                  URLs, or `None` if if no Parquet downloads are available for
@@ -744,6 +762,8 @@
         url = self._repository_endpoint('snapshots', snapshot_id, 'export')
         # Required for Azure-backed snapshots
         url.args.add('validatePrimaryKeyUniqueness', False)
+        # REVIEW: We should apply a timeout here. I suggest five times the
+        #         longest observed duration
         while True:
             response = self._request('GET', url)
             response_body = self._check_response(url, response)
@@ -752,6 +772,8 @@
             if jobs_status == 'running':
                 url = self._repository_endpoint('jobs', job_id)
                 log.info('Waiting for job %r ...', job_id)
+                # REVIEW: What's this choice of two seconds based on? It seems
+                # rather short considering everything about TDR's robustness.
                 time.sleep(2)
             elif jobs_status == 'succeeded':
                 break
Index: .gitlab-ci.yml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
--- a/.gitlab-ci.yml	(revision 14df96cd41cd529093df1520abd73d725b44aa28)
+++ b/.gitlab-ci.yml	(date 1724981655861)
@@ -100,6 +100,8 @@
 import:
   extends: .base_on_push
   stage: deploy
+  # REVIEW: Assume all sources need to be imported, estimate the expected
+  #         running time and double that. Remove the comment.
   timeout: 5m  # probably needs to be extended
   needs:
     - build_image
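
To illustrate the two REVIEW notes on the polling loop in `export_parquet_urls` (apply a timeout, reconsider the two-second sleep), here is a minimal sketch of a deadline-bounded poll with exponential backoff. It is not the PR's code: `poll_until_done`, `fetch_status`, and all parameter names and defaults are hypothetical, and the injected `clock` and `sleep` exist only so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], str],
                    timeout: float,
                    initial_delay: float = 2.0,
                    max_delay: float = 30.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> str:
    """
    Call `fetch_status` until it returns something other than 'running'.
    Raise TimeoutError if the job is still running once `timeout` seconds
    have elapsed. All names here are hypothetical stand-ins.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status != 'running':
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f'Job still running after {timeout} s')
        # Never sleep past the deadline
        sleep(min(delay, remaining))
        # Back off exponentially, up to a cap
        delay = min(delay * 2, max_delay)
```

If the longest observed export were about one minute, the "five times the longest observed duration" suggestion would translate to something like `timeout=300`; the actual value should come from measurements.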

@hannes-ucsc hannes-ucsc removed their assignment Aug 30, 2024
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch 2 times, most recently from 2b088c8 to d7ebee7 Compare September 4, 2024 04:16
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from d7ebee7 to 65fc638 Compare September 18, 2024 05:18
@nadove-ucsc
Contributor Author

nadove-ucsc commented Sep 18, 2024

@hannes-ucsc: "Regarding personal deployments, the import should only be performed on shared deployments. By the time developers upgrade their personal deployments to mirror the respective sandbox, the GitLab build for that sandbox will have already imported the snapshot. If someone really needs to import a snapshot for a personal deployment, they can temporarily enable the import for personal deployments."

@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/6355-parquet-bigquery-import-gcp-anvil-snapshots branch from 65fc638 to 78545b6 Compare September 18, 2024 22:05
@nadove-ucsc nadove-ucsc added the reindex:anvildev [process] PR requires reindexing anvildev label Sep 18, 2024
Member

@hannes-ucsc hannes-ucsc left a comment

Conflicts.

import:
  extends: .base_on_push
  stage: deploy
  # The 1000G snapshot on `anvildev` takes about 3.5 minutes to import. There
Member

I'm not sure that assuming every snapshot is as big as 1000G leads to a practical timeout.

A timeout is a heuristic defense against hung workloads, i.e., workloads that stop making significant progress. We don't want to constantly update the timeout, we don't want it to prematurely kill workloads that are progressing at the average rate, and we don't want the workload to be in the hung state for > 80% of its running time. A 5min timeout goes against the first rule, a 30h timeout goes against the last.
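
One way to read the last rule as arithmetic, offered as an interpretation rather than something stated in this thread: if a workload normally needs about T of running time and a hung run is only killed at timeout X, then roughly (X - T) / X of that run is spent hung. Requiring this to stay at or below 80% gives X <= T / (1 - 0.8) = 5T, which lines up with the earlier suggestion of five times the longest observed duration. The helper name `max_timeout` and its parameters are hypothetical.

```python
def max_timeout(typical_duration: float,
                max_hung_fraction: float = 0.8
                ) -> float:
    """
    Largest timeout for which a run that hangs after `typical_duration`
    spends at most `max_hung_fraction` of its total running time hung.
    This is an assumed reading of the rule, not an agreed formula.
    """
    if not 0 <= max_hung_fraction < 1:
        raise ValueError('max_hung_fraction must be in [0, 1)')
    return typical_duration / (1 - max_hung_fraction)


# The 1000G snapshot on anvildev reportedly takes about 3.5 minutes
upper_bound_minutes = max_timeout(3.5)
```

Under this reading, the 3.5 minute import would allow a timeout of up to about 17.5 minutes, provided the typical duration reflects all sources that are actually imported and not just one snapshot.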

@@ -73,7 +73,7 @@ def mkdict(previous_catalog: dict[str, str],


anvil_sources = mkdict({}, 3, mkdelta([
-mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
Member

The source spec should state where the source is, not where it will be when it is imported. The logic should be to import every parquet source. So I think this should read

Suggested change
-mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+mksrc('bigquery', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),

@@ -64,7 +64,7 @@ def mkdict(previous_catalog: dict[str, str],


anvil_sources = mkdict({}, 3, mkdelta([
-mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
Member

Suggested change
-mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
+mksrc('parquet', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),

endef

$(eval $(call deploy,))
$(eval $(call deploy,auto_))

.PHONY: import
import: check_python
python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:${GOOGLE_PROJECT}:*"
Member

Suggested change
-python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:${GOOGLE_PROJECT}:*"
+python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:*"
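
A rough sketch of why the wider glob matters once a source spec names the project where the snapshot lives instead of the project it is imported into. How `reindex.py` actually matches `--sources` is not shown in this thread, so the sketch uses plain `fnmatch` as a stand-in; the spec string layout, the second spec, `select_sources`, and the assumption that `GOOGLE_PROJECT` expands to `platform-anvil-dev` are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical source spec strings. Only the project and snapshot names
# of the first entry are taken from this PR; the layout is assumed.
sources = [
    'tdr:parquet:gcp:datarepo-dev-e53e74aa:snapshot/'
    'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732',
    'tdr:bigquery:gcp:platform-anvil-dev:snapshot/some_other_snapshot',
]


def select_sources(specs: list[str], pattern: str) -> list[str]:
    """
    Return the specs matching a shell-style glob, case-sensitively.
    """
    return [spec for spec in specs if fnmatchcase(spec, pattern)]


# Scoped to the deployment's own project: misses the Parquet source
narrow = select_sources(sources, 'tdr:parquet:gcp:platform-anvil-dev:*')

# Every Parquet source, regardless of project
wide = select_sources(sources, 'tdr:parquet:gcp:*')
```

With these assumed specs, the project-scoped pattern selects nothing, while the wider pattern selects the one Parquet source.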

@hannes-ucsc hannes-ucsc removed their assignment Sep 23, 2024
@hannes-ucsc hannes-ucsc added the iceboxed [process] not planned in the near future label Dec 21, 2024
Labels
  • 0 reviews [process] Lead didn't request any changes
  • iceboxed [process] not planned in the near future
  • orange [process] Done by the Azul team
  • reindex:anvildev [process] PR requires reindexing anvildev
  • reqs [process] PR includes commit requiring ``make requirements``
5 participants