
Update metadata for old staging bucket files #403

Closed
3 tasks done
MightyAx opened this issue Apr 9, 2021 · 9 comments
Labels: operations This issue is an operational task
@MightyAx
Contributor

MightyAx commented Apr 9, 2021

Fix for: ebi-ait/hca-ebi-wrangler-central#72
Due to issue detailed here: ebi-ait/dcp-ingest-central#175

Write a script to detect metadata files that were exported successfully by a previous version of the exporter and that are now lack a crucial piece of metadata needed for them to be recognised by the current exporter.

  • Walk the folder structure of the staging bucket using gsutil ls to identify JSON files
  • Detect JSON files that are both old (based on update time) and missing the export_completed metadata, using gsutil stat
  • Remediate these old files using gsutil setmeta -h "x-goog-meta-export_completed:True"
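The three steps above can be sketched as a small shell script. This is a sketch only: the bucket name and the stat record below are illustrative stand-ins (a real run would pipe the output of gsutil stat for each file listed by gsutil ls), used here so the filtering logic can be shown and exercised locally.

```shell
# Stand-in for one record from:
#   gsutil stat gs://staging-bucket/prod/path/to/file.json
# (bucket and path are hypothetical, not from the issue)
stat_record='gs://staging-bucket/prod/some-project/metadata/file1.json:
    Update time:   Fri, 05 Mar 2021 10:00:00 GMT
    Content-Type:  application/json'

# A file needs remediation when its stat record lacks the
# x-goog-meta-export_completed header.
if printf '%s\n' "$stat_record" | grep -q 'x-goog-meta-export_completed'; then
  echo 'already marked as exported'
else
  echo 'needs remediation'
  # Remediation step (shown, not executed here):
  # gsutil setmeta -h "x-goog-meta-export_completed:True" \
  #   gs://staging-bucket/prod/some-project/metadata/file1.json
fi
```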
@MightyAx MightyAx added the operations This issue is an operational task label Apr 9, 2021
@MightyAx MightyAx self-assigned this Apr 9, 2021
@clairerye

@MightyAx
Contributor Author

I think so.
ebi-ait/dcp-ingest-central#246 doesn't mention that exporting a project that was previously exported will crash exporting for all other projects, which is what makes it an operational issue.

@MightyAx
Contributor Author

MightyAx commented Apr 12, 2021

I actually used the Python client to identify 4,989 files that:

  • live in the prod/ folder
  • have .json extension
  • were updated before 2021-03-12
  • either have no metadata or are missing the export_completed metadata

out of 130,989 total files in the prod/ folder.

This took approximately a minute to run (in find mode):

Screen.Recording.2021-04-12.at.14.00.55.mov

Using the following snippet:

import datetime
import json
import sys
from datetime import timezone
from google.cloud import storage
from google.oauth2.service_account import Credentials


class CloudFinder:
    def __init__(self, credentials_file_path: str, bucket_name: str):
        # Build a storage client from a service-account key file
        with open(credentials_file_path) as credentials_file:
            credentials_json = json.load(credentials_file)
            storage_project = credentials_json.get('project_id', '')
            storage_credentials = Credentials.from_service_account_info(credentials_json)
            self.storage_client = storage.Client(project=storage_project, credentials=storage_credentials)
        self.bucket = bucket_name

    def list_files(self, prefix='prod/'):
        return self.__iterate_files(prefix=prefix, update=False)

    def update_file_meta(self, prefix='prod/'):
        return self.__iterate_files(prefix=prefix, update=True)

    def __iterate_files(self, prefix='prod/', update=False):
        blobs = self.storage_client.list_blobs(self.bucket, prefix=prefix)
        files = []
        searched = 0
        found = 0
        updated = 0
        for blob in blobs:
            searched += 1
            # Match JSON files updated before the cut-off that have no
            # truthy export_completed metadata entry
            if (blob.name.endswith('.json') and
                    blob.updated < datetime.datetime(2021, 3, 12, tzinfo=timezone.utc) and
                    not (blob.metadata and blob.metadata.get('export_completed', False))):
                found += 1
                files.append(blob.name)
                if update:
                    if blob.metadata:
                        blob.metadata['export_completed'] = True
                    else:
                        blob.metadata = {'export_completed': True}
                    blob.patch()
                    updated += 1
            # Progress line, rewritten in place via carriage return
            line = f'\rFiles inspected:\t{searched:,}\tFiles found:\t{found:,}'
            if update:
                line += f'\tFiles updated:\t{updated:,}'
            sys.stdout.write(line)
            sys.stdout.flush()
        return files

The list of files that need updates:
meta-files.txt
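For reference, the selection logic from the snippet can also be lifted into a standalone predicate that is easy to unit-test without touching the bucket. This is a sketch, not part of the script above; the function name needs_remediation is mine.

```python
import datetime
from datetime import timezone
from typing import Optional

# Cut-off date used in the snippet above
CUTOFF = datetime.datetime(2021, 3, 12, tzinfo=timezone.utc)


def needs_remediation(name: str, updated: datetime.datetime,
                      metadata: Optional[dict]) -> bool:
    # A blob qualifies when it is a JSON file, was last updated before
    # the cut-off, and has no truthy export_completed metadata entry.
    return (name.endswith('.json')
            and updated < CUTOFF
            and not (metadata and metadata.get('export_completed', False)))
```

Separating the predicate from the GCS iteration makes the criteria checkable against hand-built cases before running anything against prod/.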

@MightyAx
Contributor Author

MightyAx commented Apr 12, 2021

@clairerye @aaclan-ebi:
Are these the right files to be updating metadata for?

  • files in the prod/ folder
  • that have .json extension
  • that were updated before 12 March 2021
  • that either have no metadata or are missing the export_completed metadata

@aaclan-ebi
Do you agree that the above code snippet reflects those requirements?

@MightyAx
Contributor Author

Speed of updates seems acceptable:

Screen.Recording.2021-04-12.at.15.03.34.mov

@MightyAx
Contributor Author

UUIDs for Projects that are affected by this issue:

2086eb05-10b9-432b-b7f0-169ccc49d270
2ef3655a-973d-4d69-9b41-21fa4041eed7
3089d311-f9ed-44dd-bb10-397059bad4dc
38449aea-70b5-40db-84b3-1e08f32efe34
42d4f8d4-5422-4b78-adae-e7c3c2ef511c
5b5f05b7-2482-468d-b76d-8f68c04a7a47
7027adc6-c9c9-46f3-84ee-9badc3a4f53b
83f5188e-3bf7-4956-9544-cea4f8997756
95f07e6e-6a73-4e1b-a880-c83996b3aa5c
b176d756-62d8-4933-83a4-8b026380262f
b4a7d12f-6c2f-40a3-9e35-9756997857e3
c1a9a93d-d9de-4e65-9619-a9cec1052eaa
c41dffbf-ad83-447c-a0e1-13e689d9b258
f2fe82f0-4454-4d84-b416-a885f3121e59
f48e7c39-cc67-4055-9d79-bc437892840c
455b46e6-d8ea-4611-861e-de720a562ada

@aaclan-ebi

@MightyAx that script looks good. Cool, I thought it would be very slow. Looks like not.

@clairerye, Alexie is fixing a different issue. That ticket for updates needs the filenames of the data files fixed so that they won't be re-exported again. Alexie is fixing the issue from when we migrated the DCP2 datasets into the project subdirectory structure, the no. 2 item in this comment: ebi-ait/dcp-ingest-central#175 (comment)

@MightyAx
Contributor Author

Running overnight remediation to update the file metadata on previously exported metadata files, to ensure the exporter can verify that they exist.

@MightyAx
Contributor Author

MightyAx commented Apr 13, 2021

Files inspected: 140,007
Files found: 4,797
Files updated: 4,797
Looking at the file update times for the first and last file, this update took approximately 25 minutes.

updated-files.txt
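As a quick back-of-the-envelope check of those numbers (my arithmetic, not from the comment), 4,797 updates in roughly 25 minutes works out to about 3 updates per second:

```python
updated_files = 4_797
minutes = 25

# Sustained rate of blob.patch() calls over the whole run
rate_per_second = updated_files / (minutes * 60)
print(f'{rate_per_second:.1f} updates/second')  # roughly 3.2
```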
