Figure out which images we are missing on AWS and copy them up #158

Closed
hackartisan opened this issue Oct 2, 2023 · 10 comments

Comments

@hackartisan
Member

hackartisan commented Oct 2, 2023

The readme says there are ~1.5 million images. That is probably just the images for the subguide cards. Figure out how many there actually are and update the readme. As part of this, validate that we successfully got all the images.

Do all of the following; the numbers should align:

  • Run a command-line file count on the machine that originally stored them. (They were rsynced to the imagecat staging box and that's all we have access to now, so count them there.)
  • Run some kind of aws list command and count the results.
  • Wait for the card image load to finish and then run CardImage.count.

If we do all 3 of these, we can validate that we successfully copied everything to AWS and then successfully loaded all of those into the prod database.

If the 3 counts don't match, figure out what's missing and get it copied / loaded.

@hackartisan
Member Author

I did a very rough estimate based on the progress bar currently loading CardImages for guide cards, and I got 4,375,000. I think the original number reflects the number of images for subguide cards, so the total may be something like 5.5 - 6 million.

@hackartisan hackartisan self-assigned this Oct 10, 2023
@hackartisan
Member Author

@imagecat-staging1:/var/tmp/imagecat# find . -type f -name "*.tiff" | wc -l
Gives 5,786,727

@hackartisan
Member Author

hackartisan commented Oct 10, 2023

% aws s3 ls s3://puliiif-production/imagecat-disk | wc -l
Gives 5,780,170.
Takes about 35 min to run.
So we have 6,557 fewer on AWS than in the directory on staging that was rsynced from the original machine (5,786,727 - 5,780,170 = 6,557).

@hackartisan hackartisan changed the title from "Get an accurate count of the total number of images" to "Figure out which images we are missing on AWS and copy them up" Oct 10, 2023
@hackartisan
Member Author

deploy@imagecat-staging1:/var/tmp/imagecat$ for i in {1..22}; do echo disk${i}; find disk${i} -type f -name "*.tiff" | wc -l ; done
disk1
277228
disk2
275260
disk3
261405
disk4
276321
disk5
257479
disk6
264898
disk7
257121
disk8
274647
disk9
283706
disk10
263320
disk11
278984
disk12
252882
disk13
277719
disk14
279365
disk15
284386
disk16
262902
disk17
269623
disk18
252032
disk19
253633
disk20
193619
disk21
242875
disk22
247322

@hackartisan
Member Author

from aws:
I, [2023-10-10T15:22:44.202695 #77846] INFO -- : Fetching disk 1 file list
277227
I, [2023-10-10T15:24:17.118391 #77846] INFO -- : Fetching disk 2 file list
275259
I, [2023-10-10T15:25:55.267232 #77846] INFO -- : Fetching disk 3 file list
261405
I, [2023-10-10T15:27:26.129265 #77846] INFO -- : Fetching disk 4 file list
276216
I, [2023-10-10T15:29:05.928898 #77846] INFO -- : Fetching disk 5 file list
257479
I, [2023-10-10T15:30:36.616389 #77846] INFO -- : Fetching disk 6 file list
264898
I, [2023-10-10T15:32:12.400218 #77846] INFO -- : Fetching disk 7 file list
257121
I, [2023-10-10T15:33:44.087002 #77846] INFO -- : Fetching disk 8 file list
274647
I, [2023-10-10T15:35:21.771344 #77846] INFO -- : Fetching disk 9 file list
283706
I, [2023-10-10T15:36:58.456716 #77846] INFO -- : Fetching disk 10 file list
263320
I, [2023-10-10T15:38:28.430788 #77846] INFO -- : Fetching disk 11 file list
278984
I, [2023-10-10T15:40:06.554338 #77846] INFO -- : Fetching disk 12 file list
252882
I, [2023-10-10T15:41:35.291431 #77846] INFO -- : Fetching disk 13 file list
277719
I, [2023-10-10T15:43:11.833596 #77846] INFO -- : Fetching disk 14 file list
279365
I, [2023-10-10T15:44:53.257739 #77846] INFO -- : Fetching disk 15 file list
284386
I, [2023-10-10T15:46:36.283309 #77846] INFO -- : Fetching disk 16 file list
262902
I, [2023-10-10T15:48:07.166952 #77846] INFO -- : Fetching disk 17 file list
269623
I, [2023-10-10T15:49:42.001741 #77846] INFO -- : Fetching disk 18 file list
252032
I, [2023-10-10T15:51:09.708023 #77846] INFO -- : Fetching disk 19 file list
253633
I, [2023-10-10T15:52:41.115495 #77846] INFO -- : Fetching disk 20 file list
187169
I, [2023-10-10T15:53:49.957428 #77846] INFO -- : Fetching disk 21 file list
242875
I, [2023-10-10T15:55:12.637343 #77846] INFO -- : Fetching disk 22 file list
247322
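
To see where the 6,557 gap comes from, the two per-disk listings above can be compared directly. This is a minimal Ruby sketch using only the counts posted in this thread; the variable names are illustrative.

```ruby
# Per-disk *.tiff counts for disk1..disk22, copied from the two listings
# in this thread: the staging box (find) and AWS.
staging = [277228, 275260, 261405, 276321, 257479, 264898, 257121, 274647,
           283706, 263320, 278984, 252882, 277719, 279365, 284386, 262902,
           269623, 252032, 253633, 193619, 242875, 247322]
aws     = [277227, 275259, 261405, 276216, 257479, 264898, 257121, 274647,
           283706, 263320, 278984, 252882, 277719, 279365, 284386, 262902,
           269623, 252032, 253633, 187169, 242875, 247322]

# Keep only the disks where the counts disagree, keyed by disk name.
pairs = staging.zip(aws).each_with_index.filter_map do |(local, remote), i|
  ["disk#{i + 1}", local - remote] if local != remote
end
gaps = pairs.to_h

gaps.each { |disk, missing| puts "#{disk}: #{missing} fewer on AWS" }
puts "total: #{gaps.values.sum}"
```

Only four disks differ: disk1 (1), disk2 (1), disk4 (105), and disk20 (6,450), which together account for all 6,557.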

@tpendragon
Contributor

There's a report of all the content that's in the S3 bucket which will get generated to the dls-transfer bucket, but it might take up to two days to generate. After that it's daily.

@tpendragon
Contributor

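Ran the following in IRB; ignore the ; nil at the end of lines, which only keeps IRB from echoing large return values:

require "set"

# CSV report files downloaded to tmp/.
files = Dir.glob("tmp/*.csv")
# Take the last column of each row (treated here as the object key) and
# rewrite it to look like a relative path from the local file list.
all_lines = files.flat_map do |f|
  File.read(f).split("\n").map do |entry|
    corrected = entry.split(",").last.gsub("imagecat-","").gsub('"', "").gsub("-", "/").gsub(/tif$/, "tiff")
    "./#{corrected}"
  end
end; nil
set = Set.new(all_lines); nil
# Local file list to compare against.
local_files = File.read("tmp/file_list.txt").split("\n"); nil
local_files = Set.new(local_files); nil
# Local paths with no matching key in the report.
diff = local_files - set; nil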

6,557 missed files.

missing_files.txt
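
For readers following the snippet above, here is what the key normalization does to a single report row. The row below is made up for illustration (the real report's columns and key names are not shown in this thread), and `local_path_for` is a hypothetical helper name; only the chain of `gsub` calls is taken from the snippet.

```ruby
# Hypothetical report row: bucket name, then object key. Not real data.
row = '"puliiif-production","imagecat-disk20-0048-A2380-0000.0045.tif"'

# Apply the same substitutions as the snippet, in the same order.
def local_path_for(row)
  key = row.split(",").last
  key = key.gsub("imagecat-", "") # drop the key prefix
  key = key.gsub('"', "")         # drop CSV quoting
  key = key.gsub("-", "/")        # dashes stand in for path separators
  key = key.gsub(/tif$/, "tiff")  # local files use the .tiff extension
  "./#{key}"
end

puts local_path_for(row)
# => ./disk20/0048/A2380/0000.0045.tiff
```

The result has the same shape as the paths printed by `find .` on the staging box, which is what makes the set difference meaningful.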

@tpendragon
Contributor

tpendragon commented Oct 12, 2023

Those 6,557 have been uploaded to puliiif-production and puliiif-staging via #192. The files in tmp were owned by pulsys but I had to run commands as deploy, so I had to adjust some permissions to get the upload to work, but they're there now.

@tpendragon
Contributor

Ah - the files weren't missing; they just aren't named imagecat-disk. They're named imagecat-temp, imagecat-temp2, or imagecat-temp3, so the s3 prefix in the listing command above wasn't right.
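
The effect of the prefix can be sketched with a few made-up keys. The key names below are hypothetical stand-ins that follow the naming described in this comment; the point is only that a listing restricted to the `imagecat-disk` prefix never sees keys starting with `imagecat-temp`.

```ruby
# Hypothetical object keys, not taken from the real bucket.
keys = [
  "imagecat-disk20-0048-A2380-0000.0045.tif",
  "imagecat-temp-0001-A0001-0000.0001.tif",
  "imagecat-temp2-0048-A2380-0000.0045.tif",
  "imagecat-temp3-0002-A0002-0000.0002.tif"
]

# Split keys by whether a listing limited to the prefix would return them.
listed, skipped = keys.partition { |k| k.start_with?("imagecat-disk") }

puts "listed with prefix imagecat-disk: #{listed.size}"
puts "skipped by that prefix: #{skipped.size}"

# A wider prefix such as "imagecat-" would cover both groups.
puts "listed with prefix imagecat-: #{keys.count { |k| k.start_with?("imagecat-") }}"
```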

@tpendragon
Contributor

The reason is there are temp directories that look like old staging directories. For example: /var/tmp/imagecat/disk20/temp2/0048/A2380/0000.0045.tiff.

There's also a /var/tmp/imagecat/disk20/0048/A2380/ and they're exactly the same. I think we can safely ignore the temp dirs:

diff <(ls /var/tmp/imagecat/disk20/temp2/0048/A2380/) <(ls)
(empty)
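
Comparing `ls` output shows that the two directories hold the same file names. If a content check is also wanted, a minimal Ruby sketch that compares SHA-256 digests is below. The helper names `digests_for` and `same_files?` are made up for illustration and were not part of the work done in this thread.

```ruby
require "digest"

# Map each regular file directly inside dir to the SHA-256 digest of its contents.
def digests_for(dir)
  Dir.children(dir).sort.each_with_object({}) do |name, acc|
    path = File.join(dir, name)
    acc[name] = Digest::SHA256.file(path).hexdigest if File.file?(path)
  end
end

# True when both directories hold the same file names with identical contents.
def same_files?(dir_a, dir_b)
  digests_for(dir_a) == digests_for(dir_b)
end
```

For the example above this could be called as `same_files?("/var/tmp/imagecat/disk20/temp2/0048/A2380", "/var/tmp/imagecat/disk20/0048/A2380")`. It only looks at files directly inside each directory, not at subdirectories.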
