Update metadata for old staging bucket files #403
Can I just check this is not the same work as described in https://app.zenhub.com/workspaces/dcp-ingest-product-development-5f71ca62a3cb47326bdc1b5c/issues/ebi-ait/dcp-ingest-central/246 ?

I think so.
I actually used the python client to identify 4,989 files that:

- end with .json
- were last updated before 2021-03-12
- do not have export_completed set in their metadata

out of 130,989 total files in the bucket. This took approximately a minute to run (in find mode): Screen.Recording.2021-04-12.at.14.00.55.mov

Using the following snippet:

import datetime
import json
import sys
from datetime import timezone
from google.cloud import storage
from google.oauth2.service_account import Credentials

class CloudFinder:
    """Find (and optionally patch) staging-bucket JSON files missing the export_completed flag."""

    def __init__(self, credentials_file_path: str, bucket_name: str):
        with open(credentials_file_path) as credentials_file:
            credentials_json = json.load(credentials_file)
        storage_project = credentials_json.get('project_id', '')
        storage_credentials = Credentials.from_service_account_info(credentials_json)
        self.storage_client = storage.Client(project=storage_project, credentials=storage_credentials)
        self.bucket = bucket_name

    def list_files(self, prefix='prod/'):
        # Find mode: report matching files without touching them.
        return self.__iterate_files(prefix=prefix, update=False)

    def update_file_meta(self, prefix='prod/'):
        # Update mode: patch export_completed=True onto matching files.
        return self.__iterate_files(prefix=prefix, update=True)

    def __iterate_files(self, prefix='prod/', update=False):
        blobs = self.storage_client.list_blobs(self.bucket, prefix=prefix)
        files = []
        searched = 0
        found = 0
        updated = 0
        for blob in blobs:
            searched += 1
            # Match JSON files last updated before 2021-03-12 that lack the export_completed flag.
            if (blob.name.endswith('.json') and
                    blob.updated < datetime.datetime(2021, 3, 12, tzinfo=timezone.utc) and
                    not (blob.metadata and blob.metadata.get('export_completed', False))):
                found += 1
                files.append(blob.name)
                if update:
                    if blob.metadata:
                        blob.metadata['export_completed'] = True
                    else:
                        blob.metadata = {'export_completed': True}
                    blob.patch()
                    updated += 1
            line = f'\rFiles inspected:\t{searched:,}\tFiles found:\t{found:,}'
            if update:
                line += f'\tFiles updated:\t{updated:,}'
            sys.stdout.write(line)
            sys.stdout.flush()
        return files

The list of files that need updates:
@clairerye @aaclan-ebi:
@aaclan-ebi
Speed of updates seems acceptable: Screen.Recording.2021-04-12.at.15.03.34.mov
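
For context, a minimal sketch of how the snippet above might be invoked; the credentials path and bucket name here are placeholders, not values from this issue:

# Hypothetical driver for the CloudFinder snippet above.
# 'service-account.json' and 'example-staging-bucket' are placeholder values.
finder = CloudFinder('service-account.json', 'example-staging-bucket')

# Find mode: list matching files without changing anything.
files = finder.list_files(prefix='prod/')
print(f'\n{len(files):,} files are missing the export_completed flag')

# Update mode: patch export_completed=True onto each matching blob.
finder.update_file_meta(prefix='prod/')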
UUIDs for Projects that are affected by this issue:
@MightyAx that script looks good. Cool, I thought it would be very slow. Looks like not. @clairerye, Alexie is fixing a different issue. That ticket needs to fix the filenames of the data files so that they won't be re-exported again. Alexie is fixing the issue from when we migrated the DCP2 datasets into the project subdirectory structure, item no. 2 in this comment ebi-ait/dcp-ingest-central#175 (comment)
Running Overnight Remediation to update the file-metadata on previously exported metadata files to ensure the exporter can verify that they exist. |
Files inspected: 140,007
Fix for: ebi-ait/hca-ebi-wrangler-central#72
Due to issue detailed here: ebi-ait/dcp-ingest-central#175
Write a script to detect metadata files that were exported successfully by a previous version of the exporter and that are now missing the crucial piece of metadata the current exporter needs in order to detect them:
- gsutil ls to identify json files
- check export_completed metadata using the command gsutil stat
- set the metadata using gsutil setmeta -h "x-goog-meta-export_completed:True"
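
A rough sketch of this gsutil-based workflow driven from Python; the bucket name and prefix are placeholders rather than values from this issue, and the actual remediation used the CloudFinder snippet above instead of gsutil:

# Sketch only: wraps the gsutil commands listed above with subprocess.
# BUCKET and PREFIX are hypothetical placeholders.
import subprocess

BUCKET = 'example-staging-bucket'
PREFIX = 'prod/'

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# 1. gsutil ls to identify json files under the prefix.
listing = run(['gsutil', 'ls', '-r', f'gs://{BUCKET}/{PREFIX}'])
json_files = [line for line in listing.splitlines() if line.endswith('.json')]

for url in json_files:
    # 2. gsutil stat to inspect the object's metadata for export_completed.
    stat_output = run(['gsutil', 'stat', url])
    if 'export_completed' not in stat_output:
        # 3. gsutil setmeta to add the custom export_completed metadata.
        run(['gsutil', 'setmeta', '-h', 'x-goog-meta-export_completed:True', url])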