
Update metadata for old staging bucket files #403

Closed
3 tasks done
MightyAx opened this issue Apr 9, 2021 · 9 comments
Labels: operations This issue is an operational task
@MightyAx
Contributor

MightyAx commented Apr 9, 2021

Fix for: ebi-ait/hca-ebi-wrangler-central#72
Due to issue detailed here: ebi-ait/dcp-ingest-central#175

Write a script to detect metadata files that were exported successfully by a previous version of the exporter and that are now lack a crucial piece of metadata needed for them to be recognised by the current exporter.

  • Walk the folder structure of the staging bucket using gsutil ls to identify JSON files
  • Detect JSON files that are both old (based on update time) and missing the export_completed metadata, using gsutil stat
  • Remediate these old files using gsutil setmeta -h "x-goog-meta-export_completed:True"
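The three steps above can be sketched as a small shell script. This is a sketch only: the bucket name and the stat record below are illustrative stand-ins (a real run would pipe the output of gsutil stat for each file listed by gsutil ls), used here so the filtering logic can be shown and exercised locally.

```shell
# Stand-in for one record from:
#   gsutil stat gs://staging-bucket/prod/path/to/file.json
# (bucket and path are hypothetical, not from the issue)
stat_record='gs://staging-bucket/prod/some-project/metadata/file1.json:
    Update time:   Fri, 05 Mar 2021 10:00:00 GMT
    Content-Type:  application/json'

# A file needs remediation when its stat record lacks the
# x-goog-meta-export_completed header.
if printf '%s\n' "$stat_record" | grep -q 'x-goog-meta-export_completed'; then
  echo 'already marked as exported'
else
  echo 'needs remediation'
  # Remediation step (shown, not executed here):
  # gsutil setmeta -h "x-goog-meta-export_completed:True" \
  #   gs://staging-bucket/prod/some-project/metadata/file1.json
fi
```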
@MightyAx MightyAx added the operations This issue is an operational task label Apr 9, 2021
@MightyAx MightyAx self-assigned this Apr 9, 2021
@clairerye

@MightyAx
Contributor Author

I think so.
ebi-ait/dcp-ingest-central#246 doesn't mention that exporting a project that was previously exported will crash exporting for all other projects, which is what makes it an operational issue.

@MightyAx
Contributor Author

MightyAx commented Apr 12, 2021

I actually used the Python client to identify 4,989 files that:

  • live in the prod/ folder
  • have .json extension
  • were updated before 2021-03-12
  • either have no metadata or are missing the export_completed metadata

out of 130,989 total files in the prod/ folder.

This took approximately a minute to run (in find mode):

Screen.Recording.2021-04-12.at.14.00.55.mov

Using the following snippet:

import datetime
import json
import sys
from datetime import timezone
from google.cloud import storage
from google.oauth2.service_account import Credentials


class CloudFinder:
    def __init__(self, credentials_file_path: str, bucket_name: str):
        # Build a storage client from a service-account key file
        with open(credentials_file_path) as credentials_file:
            credentials_json = json.load(credentials_file)
            storage_project = credentials_json.get('project_id', '')
            storage_credentials = Credentials.from_service_account_info(credentials_json)
            self.storage_client = storage.Client(project=storage_project, credentials=storage_credentials)
        self.bucket = bucket_name

    def list_files(self, prefix='prod/'):
        return self.__iterate_files(prefix=prefix, update=False)

    def update_file_meta(self, prefix='prod/'):
        return self.__iterate_files(prefix=prefix, update=True)

    def __iterate_files(self, prefix='prod/', update=False):
        blobs = self.storage_client.list_blobs(self.bucket, prefix=prefix)
        files = []
        searched = 0
        found = 0
        updated = 0
        for blob in blobs:
            searched += 1
            # Match JSON files updated before the cut-off that have no
            # truthy export_completed metadata entry
            if (blob.name.endswith('.json') and
                    blob.updated < datetime.datetime(2021, 3, 12, tzinfo=timezone.utc) and
                    not (blob.metadata and blob.metadata.get('export_completed', False))):
                found += 1
                files.append(blob.name)
                if update:
                    if blob.metadata:
                        blob.metadata['export_completed'] = True
                    else:
                        blob.metadata = {'export_completed': True}
                    blob.patch()
                    updated += 1
            # Progress line, rewritten in place via carriage return
            line = f'\rFiles inspected:\t{searched:,}\tFiles found:\t{found:,}'
            if update:
                line += f'\tFiles updated:\t{updated:,}'
            sys.stdout.write(line)
            sys.stdout.flush()
        return files

The list of files that need updates:
meta-files.txt
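For reference, the selection logic from the snippet can also be lifted into a standalone predicate that is easy to unit-test without touching the bucket. This is a sketch, not part of the script above; the function name needs_remediation is mine.

```python
import datetime
from datetime import timezone
from typing import Optional

# Cut-off date used in the snippet above
CUTOFF = datetime.datetime(2021, 3, 12, tzinfo=timezone.utc)


def needs_remediation(name: str, updated: datetime.datetime,
                      metadata: Optional[dict]) -> bool:
    # A blob qualifies when it is a JSON file, was last updated before
    # the cut-off, and has no truthy export_completed metadata entry.
    return (name.endswith('.json')
            and updated < CUTOFF
            and not (metadata and metadata.get('export_completed', False)))
```

Separating the predicate from the GCS iteration makes the criteria checkable against hand-built cases before running anything against prod/.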

@MightyAx
Contributor Author

MightyAx commented Apr 12, 2021

@clairerye @aaclan-ebi:
Are these the right files to be updating metadata for?

  • files in the prod/ folder
  • that have .json extension
  • that were updated before 12 March 2021
  • that either have no metadata or are missing the export_completed metadata

@aaclan-ebi
Do you agree that the above code snippet reflects those requirements?

@MightyAx
Contributor Author

Speed of updates seems acceptable:

Screen.Recording.2021-04-12.at.15.03.34.mov

@MightyAx
Contributor Author

UUIDs for Projects that are affected by this issue:

2086eb05-10b9-432b-b7f0-169ccc49d270
2ef3655a-973d-4d69-9b41-21fa4041eed7
3089d311-f9ed-44dd-bb10-397059bad4dc
38449aea-70b5-40db-84b3-1e08f32efe34
42d4f8d4-5422-4b78-adae-e7c3c2ef511c
5b5f05b7-2482-468d-b76d-8f68c04a7a47
7027adc6-c9c9-46f3-84ee-9badc3a4f53b
83f5188e-3bf7-4956-9544-cea4f8997756
95f07e6e-6a73-4e1b-a880-c83996b3aa5c
b176d756-62d8-4933-83a4-8b026380262f
b4a7d12f-6c2f-40a3-9e35-9756997857e3
c1a9a93d-d9de-4e65-9619-a9cec1052eaa
c41dffbf-ad83-447c-a0e1-13e689d9b258
f2fe82f0-4454-4d84-b416-a885f3121e59
f48e7c39-cc67-4055-9d79-bc437892840c
455b46e6-d8ea-4611-861e-de720a562ada

@aaclan-ebi

@MightyAx that script looks good. Cool, I thought it would be very slow. Looks like not.

@clairerye, Alexie is fixing a different issue. That ticket for updates needs the filenames of the data files fixed so that they won't be re-exported again. Alexie is fixing the issue from when we migrated the DCP2 datasets into the project subdirectory structure, the no. 2 item in this comment: ebi-ait/dcp-ingest-central#175 (comment)

@MightyAx
Contributor Author

Running overnight remediation to update the file metadata on previously exported metadata files, to ensure the exporter can verify that they exist.

@MightyAx
Contributor Author

MightyAx commented Apr 13, 2021

Files inspected: 140,007
Files found: 4,797
Files updated: 4,797
Looking at the file update times for the first and last file, this update took approximately 25 minutes.

updated-files.txt
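As a quick back-of-the-envelope check of those numbers (my arithmetic, not from the comment), 4,797 updates in roughly 25 minutes works out to about 3 updates per second:

```python
updated_files = 4_797
minutes = 25

# Sustained rate of blob.patch() calls over the whole run
rate_per_second = updated_files / (minutes * 60)
print(f'{rate_per_second:.1f} updates/second')  # roughly 3.2
```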
