Data to purge from the repository for rights #100

Closed
4 tasks done
griff-rees opened this issue May 4, 2023 · 27 comments

@griff-rees
Collaborator

griff-rees commented May 4, 2023

This may require purging the git history, and it is worth checking with @claireaustin01.

@griff-rees changed the title from "Concern of including any of this data in the repository" to "Data to purge from the repository for rights" on May 5, 2023
@griff-rees
Collaborator Author

Removed (but not purged) fixture-files/mitchells_db [v1].csv

@kallewesterling
Collaborator

Upon review, I don't think we need to remove the census data. It is available open-access through the UK Data Service… I believe we are able to re-share it (I wouldn’t have added it to the repo otherwise), and upon revisiting CC BY 4.0, it states that we “are free to . . . copy and redistribute the material in any medium or format” (see here).

Looping in @claireaustin01 might be good regarding this bit, however.

@griff-rees
Collaborator Author

Great thanks @kallewesterling. Perhaps the safest option would be to automatically download that link in a local deploy? Arguably that's applicable to many of these.
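
A minimal sketch of that download-on-deploy idea, assuming curl is available on the deploy machine. The function name fetch_if_missing, the destination path and the URL in the usage line are all hypothetical; the real dataset links are not reproduced here. It skips the download when the file is already present, so repeat deploys do not depend on the remote service.

```shell
# fetch_if_missing URL DEST
# Download URL to DEST unless a non-empty DEST already exists.
# (Hypothetical helper; not part of lwmdb.)
fetch_if_missing() {
  local url="$1" dest="$2"
  if [ -s "$dest" ]; then
    echo "already present: $dest"
    return 0
  fi
  mkdir -p "$(dirname "$dest")" || return 1
  # Write to a temporary name first, so an interrupted transfer
  # never leaves a truncated file at DEST.
  curl --fail --location --silent --show-error --output "$dest.part" "$url" &&
    mv "$dest.part" "$dest"
}

# Usage (placeholder URL and path):
#   fetch_if_missing "https://example.org/census.csv" "census/data/census.csv"
```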

@griff-rees
Collaborator Author

A potential structure for managing the workflow, where each data folder holds the source files (csv etc.) and each fixtures folder holds the JSON generated for the respective models:

newspapers
├── data
└── fixtures
mitchells
├── data
└── fixtures
gazetteer
├── data
└── fixtures
census
├── data
└── fixtures
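
A small sketch of how that layout could be created in one go. The function name make_layout is made up for illustration, and the app names are passed as arguments rather than hard-coded, so nothing here assumes how the apps are finally named.

```shell
# make_layout ROOT APP...
# Create ROOT/APP/data and ROOT/APP/fixtures for every APP given.
make_layout() {
  local root="$1"
  shift
  local app
  for app in "$@"; do
    mkdir -p "$root/$app/data" "$root/$app/fixtures" || return 1
  done
}

# Usage, for two of the apps listed above:
#   make_layout . newspapers census
```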

@kallewesterling
Collaborator

> Great thanks @kallewesterling. Perhaps the safest option would be to automatically download that link in a local deploy? Arguably that's applicable to many of these.

Sounds like a good idea to me. As far as I can see, it would apply to the two publicly available datasets that are used here (if we're sticking with keeping census data in there for now):

The scary thing about downloading files is obviously that the links depend on the services that provide them, long term, etc. You know all this, of course! :)

@griff-rees
Collaborator Author

Well done, I was having a quick peek at those links and was annoyed at having to figure out the js involved; thanks for sorting that.

> The scary thing about downloading files is obviously that the links depend on the services that provide them, long term, etc. You know all this, of course! :)

Yeah it's hard to maintain. I guess I'm thinking: maybe that addresses that concern for now, and we can return to the issue of having a final version of these included in the repository when we've had enough time to decide what's ok.

Any thoughts on all this would be much appreciated, @claireaustin01.

@kallewesterling
Collaborator

I agree with that, @griff-rees!

@claireaustin01

  • I think the UK Data Service will be available for the foreseeable future so we don't need to worry about longevity.
  • The majority of the Mitchell's directories are already on the BL's repository so also no concerns.
  • The gazetteer will depend on the source of the material included in it, so we will wait for a reply from @mcollardanuy.

@mcollardanuy
Collaborator

Hi @griff-rees, @kallewesterling, @claireaustin01,

The following files in this folder contain data from Wikidata and Geonames:

  • dict_admin_counties.json
  • dict_countries.json
  • dict_historic_counties.json
  • nlp_loc_wikidata_concat.csv
  • wikidata_gazetteer_selected_columns.csv
  • wikidata_ids_publication_mitchells.txt

Wikidata: according to https://dumps.wikimedia.org/legal.html:

> Copyrights of structured data in the main, Property, Lexeme, and EntitySchema namespaces are waived using the Creative Commons Zero (CC0) public domain dedication. All unstructured content in other namespaces is licensed under the Creative Commons Attribution-Share-Alike 3.0 License.

Geonames: according to http://download.geonames.org/export/dump/:

> This work is licensed under a Creative Commons Attribution 4.0 License, see https://creativecommons.org/licenses/by/4.0/
> The Data is provided "as is" without warranty or any representation of accuracy, timeliness or completeness.

So, as far as I can see, it should be fine.

@griff-rees
Collaborator Author

Have backed up all the fixture files. First attempt to purge via https://rtyley.github.io/bfg-repo-cleaner/ has raised the following errors:

$ git push
Enumerating objects: 43, done.
Counting objects: 100% (40/40), done.
Delta compression using up to 4 threads
Compressing objects: 100% (15/15), done.
Writing objects: 100% (24/24), 16.97 KiB | 8.48 MiB/s, done.
Total 24 (delta 18), reused 15 (delta 9), pack-reused 0                                      
remote: Resolving deltas: 100% (18/18), completed with 9 local objects.
To github.com:Living-with-machines/lwmdb
 ! [remote rejected] refs/pull/101/head -> refs/pull/101/head (deny updating a hidden ref)    
 ! [remote rejected] refs/pull/102/head -> refs/pull/102/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/107/head -> refs/pull/107/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/107/merge -> refs/pull/107/merge (deny updating a hidden ref) 
 ! [remote rejected] refs/pull/11/head -> refs/pull/11/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/12/head -> refs/pull/12/head (deny updating a hidden ref)     
 ! [remote rejected] refs/pull/13/head -> refs/pull/13/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/15/head -> refs/pull/15/head (deny updating a hidden ref)     
 ! [remote rejected] refs/pull/18/head -> refs/pull/18/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/19/head -> refs/pull/19/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/2/head -> refs/pull/2/head (deny updating a hidden ref)      
 ! [remote rejected] refs/pull/20/head -> refs/pull/20/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/27/head -> refs/pull/27/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/28/head -> refs/pull/28/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/30/head -> refs/pull/30/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/33/head -> refs/pull/33/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/38/head -> refs/pull/38/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/39/head -> refs/pull/39/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/40/head -> refs/pull/40/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/41/head -> refs/pull/41/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/42/head -> refs/pull/42/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/43/head -> refs/pull/43/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/44/head -> refs/pull/44/head (deny updating a hidden ref)      
 ! [remote rejected] refs/pull/46/head -> refs/pull/46/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/5/head -> refs/pull/5/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/57/head -> refs/pull/57/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/58/head -> refs/pull/58/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/59/head -> refs/pull/59/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/62/head -> refs/pull/62/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/63/head -> refs/pull/63/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/67/head -> refs/pull/67/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/68/head -> refs/pull/68/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/69/head -> refs/pull/69/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/7/head -> refs/pull/7/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/72/head -> refs/pull/72/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/73/head -> refs/pull/73/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/74/head -> refs/pull/74/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/77/head -> refs/pull/77/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/78/head -> refs/pull/78/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/8/head -> refs/pull/8/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/85/head -> refs/pull/85/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:Living-with-machines/lwmdb'
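
Every rejected ref above sits under refs/pull/. GitHub manages those refs itself and rejects pushes to them; a mirror clone, which the BFG documentation suggests, fetches them along with the branches, so a mirror push tries to write them back. A small sketch for separating the two groups in the output of git show-ref follows. The function names are made up for illustration, and the code only lists refs; it does not push or delete anything.

```shell
# Both functions read `git show-ref` style lines ("<sha> <refname>") on stdin.

# Print the names of the refs GitHub manages for pull requests.
pull_request_refs() {
  awk '$2 ~ /^refs\/pull\// { print $2 }'
}

# Print the names of all other refs (branches, tags, ...).
pushable_refs() {
  awk '$2 !~ /^refs\/pull\// { print $2 }'
}

# Usage inside a clone:
#   git show-ref | pull_request_refs
#   git show-ref | pushable_refs
```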

@kallewesterling
Collaborator

This looks like a good place to start troubleshooting... It looks like it might be an issue with dropping files in a repo with open pull requests :/

@AoifeHughes
Collaborator

@griff-rees do you have the commands you tried with bfg, just so I don't redo exactly what you tried?

@griff-rees
Collaborator Author

Thanks @AoifeHughes, pretty sure this is what I found worked best:

$ bfg --delete-files fixture-files lwmdb.git

@griff-rees
Collaborator Author

For reference: I installed bfg via:

$ sudo snap install bfg-repo-cleaner --beta

on an Azure VM.

@AoifeHughes
Collaborator

Just tried it with a slightly different command:

(playground) ➜  erase git clone git@github.com:Living-with-machines/lwmdb.git
Cloning into 'lwmdb'...
remote: Enumerating objects: 2319, done.
remote: Counting objects: 100% (351/351), done.
remote: Compressing objects: 100% (263/263), done.
remote: Total 2319 (delta 135), reused 167 (delta 82), pack-reused 1968
Receiving objects: 100% (2319/2319), 29.95 MiB | 4.80 MiB/s, done.
Resolving deltas: 100% (1358/1358), done.
(playground) ➜  erase cd lwmdb
(playground) ➜  lwmdb git:(main) java -jar ~/Downloads/bfg-1.14.0.jar --delete-folders fixture-files --delete-files fixture-files --private

Using repo : /Users/ahughes/erase/lwmdb/.git

Found 134 objects to protect
Found 17 commit-pointing refs : HEAD, refs/heads/main, refs/remotes/origin/HEAD, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 63f18ff4 (protected by 'HEAD') - contains 17 dirty files :
	- fixture-files/JISC papers.csv (14.2 KB)
	- fixture-files/UKDA-8613-csv/1851_rsd_data.csv (1.4 MB)
	- ...

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/Users/ahughes/erase/lwmdb.bfg-report/2023-06-30/11-23-15/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.


Cleaning
--------

Found 370 commits
Cleaning commits:       100% (370/370)
Cleaning commits completed in 163 ms.

Updating 13 Refs
----------------

	Ref                                              Before     After
	--------------------------------------------------------------------
	refs/heads/main                                | 63f18ff4 | a1649c52
	refs/remotes/origin/asmith-review-docs         | e8196742 | d8a0bed9
	refs/remotes/origin/fix-mitchells-import       | c9032006 | 9dc8c58b
	refs/remotes/origin/geocensus                  | dd31fd0f | 5bf21c44
	refs/remotes/origin/improve-load-json-fixtures | 513738d3 | 56e47072
	refs/remotes/origin/item-max-title-field       | 6339b3e3 | b9e2e8c9
	refs/remotes/origin/jupyterhub                 | 9e716305 | 6d7cd451
	refs/remotes/origin/kallewesterling/issue35    | c8429d77 | aec87a1c
	refs/remotes/origin/kallewesterling/issue56    | ebf57d41 | 6e04d95a
	refs/remotes/origin/main                       | 63f18ff4 | a1649c52
	refs/remotes/origin/mkdocs                     | 29b13aec | f8d69bfb
	refs/remotes/origin/production-deploy          | 738bfbab | dc84a5de
	refs/remotes/origin/thobson/issue47            | 0fed749d | 31999d4d

Updating references:    100% (13/13)
...Ref update completed in 30 ms.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	......................DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After
	-------------------------------------------
	First modified commit | ce708d9f | e16706f4
	Last dirty commit     | c9032006 | 9dc8c58b


In total, 489 object ids were changed. Full details are logged here:

	/Users/ahughes/erase/lwmdb.bfg-report/2023-06-30/11-23-15

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
(playground) ➜  lwmdb git:(main) git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 2299, done.
Counting objects: 100% (2299/2299), done.
Delta compression using up to 10 threads
Compressing objects: 100% (2184/2184), done.
Writing objects: 100% (2299/2299), done.
Total 2299 (delta 1400), reused 589 (delta 0), pack-reused 0
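
The BFG output above protects HEAD, so the 17 files still in commit 63f18ff4 survive the run; its own advice is to commit the removal first and then run the BFG on a fresh copy. A sketch of that order of operations follows. The function and argument names are invented, every path is a placeholder supplied by the caller, and by default the script only prints the commands; set EXECUTE=1 to actually run them.

```shell
# Print each command; run it only when EXECUTE=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# purge_with_bfg URL FOLDER JAR WORKDIR MIRRORDIR
# 1. remove FOLDER in an ordinary clone (WORKDIR) and push that commit,
# 2. take a fresh mirror clone (MIRRORDIR) and run the BFG on it,
# 3. expire reflogs and garbage-collect, as the BFG output recommends.
purge_with_bfg() {
  local url="$1" folder="$2" jar="$3" work="$4" mirror="$5"
  run git -C "$work" rm -r "$folder" &&
  run git -C "$work" commit -m "Remove $folder" &&
  run git -C "$work" push &&
  run git clone --mirror "$url" "$mirror" &&
  run java -jar "$jar" --delete-folders "$folder" "$mirror" &&
  run git -C "$mirror" reflog expire --expire=now --all &&
  run git -C "$mirror" gc --prune=now --aggressive
}
```

Pushing the rewritten history is left out on purpose, since that is the step that failed earlier in this thread.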

@AoifeHughes
Collaborator

I don't have write permissions, but does this look like what you had, @griff-rees? I used the jar file directly from the linked site.

@griff-rees
Collaborator Author

Cool! I think I got that far; it was the push to main that failed.

@griff-rees
Collaborator Author

I need to sort out your permissions. I'm also going to make another merge to main, so it'll be one more checkout and then another go.

@AoifeHughes
Collaborator

rtyley/bfg-repo-cleaner#36 (comment) - see this comment

@griff-rees
Collaborator Author

Yeah, I saw that when I hit this before. Had other urgent stuff, so I left it.

@griff-rees
Collaborator Author

@AoifeHughes you've got admin rights. With great power... ;)

@AoifeHughes
Collaborator

Okay, just for reference: I got the same errors as @griff-rees. I tried removing branch protections and also running git push -f --set-upstream origin main, but couldn't get it to budge.

@griff-rees
Collaborator Author

Thanks so much @AoifeHughes: it really helps to reproduce that (and to know I didn't miss something obvious!). There are other routes that don't use bfg... but they're hard.

@griff-rees
Collaborator Author

@AoifeHughes
Collaborator

AoifeHughes commented Jun 30, 2023

@griff-rees can you check whether this has been done? I think I got it working.
FYI, git-filter-repo --invert-paths --path fixture-files was used for this.
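
For reference, a sketch of how a git-filter-repo run like this is often wrapped. Only the filter-repo arguments come from the comment above; the clone, remote and push steps are assumptions about the surrounding workflow, not a record of what was actually run. git-filter-repo expects a fresh clone and removes the origin remote when it finishes, so the remote is added back before pushing. Commands are only printed unless EXECUTE=1 is set, and all names are hypothetical.

```shell
# Print each command; run it only when EXECUTE=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# purge_with_filter_repo URL FOLDER CLONEDIR
# Rewrite history in a fresh clone so that FOLDER never existed,
# then force-push branches and tags back to URL.
purge_with_filter_repo() {
  local url="$1" folder="$2" clone="$3"
  run git clone "$url" "$clone" &&
  run git -C "$clone" filter-repo --invert-paths --path "$folder" &&
  run git -C "$clone" remote add origin "$url" &&
  run git -C "$clone" push --force --all origin &&
  run git -C "$clone" push --force --tags origin
}
```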

@griff-rees
Collaborator Author

Ah lovely! I think we need to check the history to be sure. We probably need to add fixture-files to .gitignore to be safe, but I think the hardest part's done. Lovely, lovely work.
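
Both follow-ups mentioned here can be sketched in a few lines. history_mentions prints every commit, on any ref, that still touches a path, so empty output means the path is gone from the local history; ensure_ignored appends a pattern to .gitignore only if that exact line is not already there. Both names are made up for this sketch, and the history check only covers the local clone, not anything cached on the hosting side.

```shell
# history_mentions PATH [REPO]
# Print the hash of every commit reachable from any ref that touches PATH.
history_mentions() {
  git -C "${2:-.}" log --all --format=%H -- "$1"
}

# ensure_ignored PATTERN [FILE]
# Append PATTERN to FILE (default .gitignore) unless that exact line exists.
ensure_ignored() {
  local pattern="$1" file="${2:-.gitignore}"
  touch "$file" || return 1
  if grep -qxF -e "$pattern" "$file"; then
    return 0
  fi
  # Make sure the last existing line ends with a newline before appending.
  if [ -n "$(tail -c 1 "$file")" ]; then
    echo >> "$file"
  fi
  printf '%s\n' "$pattern" >> "$file"
}

# Usage:
#   [ -z "$(history_mentions fixture-files)" ] && echo "history is clean"
#   ensure_ignored "fixture-files/"
```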

@AoifeHughes
Collaborator

closing as data is gone 😄
