
Bug/Crash #22

Closed
grigzy28 opened this issue Aug 24, 2024 · 22 comments
Labels
bug (Something isn't working) · stale · windows (tool was used with windows machine)

Comments

@grigzy28

Windows 11 OS

Just tried this and received the following error. Empty output directory.


./waybackup -d --csv -u http://wuarchive.wustl.edu/pub/ -o .\test12 -f --workers 1 --skip --delay 1

No CSV-file or content found to load skipable URLs

Querying snapshots...
---> wuarchive.wustl.edu/pub/*

!-- Exception: UNCAUGHT EXCEPTION
!-- File: ..............\Program Files\Python312\Lib\json\decoder.py
!-- Function: raw_decode
!-- Line: 355
!-- Segment: raise JSONDecodeError("Expecting value", s, err.value) from None
!-- Description: Expecting value: line 1 column 1 (char 0)

Exception log: .\test12\waybackup_error.log

Full traceback:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "waybackup.exe\__main__.py", line 7, in <module>
sys.exit(main())
^^^^^^
File "..\site-packages\pywaybackup\main.py", line 22, in main
archive.query_list(config.range, config.start, config.end, config.explicit, config.mode, config.cdxbackup, config.cdxinject)
File "..\site-packages\pywaybackup\archive.py", line 158, in query_list
cdxResult = json.loads(cdxResult)
^^^^^^^^^^^^^^^^^^^^^
File "..............\Program Files\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "..............\Program Files\Python312\Lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "..............\Program Files\Python312\Lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


waybackup_error.log
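The uncaught exception above happens because json.loads() received a body that is not JSON, likely an empty or HTML error response from the CDX server. A minimal defensive sketch that surfaces the raw body instead of crashing (parse_cdx_response is a hypothetical helper, not the tool's actual code):

```python
import json

def parse_cdx_response(cdx_text: str):
    """Parse a CDX API response, raising a descriptive error on a
    non-JSON body instead of a bare JSONDecodeError."""
    if not cdx_text.strip():
        raise ValueError("CDX server returned an empty response")
    try:
        return json.loads(cdx_text)
    except json.JSONDecodeError as e:
        # Include a snippet of the body so the offending response is visible
        snippet = cdx_text[:200]
        raise ValueError(f"CDX response was not JSON: {snippet!r}") from e
```

With a check like this, the error log would show what the server actually sent back rather than only "Expecting value: line 1 column 1 (char 0)".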

@grigzy28 (Author)

grigzy28 commented Aug 24, 2024

Subsequent runs (which actually worked) produced the following error...


command:
./waybackup -d --csv -u http://wuarchive.wustl.edu/pub/ -o .\test13 -f --workers 1 --skip --delay 1


-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [3880/670904] - Worker: 1
INCOMPLETEREAD -> (1/2): reconnect in 50 seconds...

!-- Exception: Worker: 1 - Exception
!-- File: ..............\Program Files\Python312\Lib\ssl.py
!-- Function: send
!-- Line: 1180
!-- Segment: return self._sslobj.write(data)
!-- Description: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Exception log: .\test13\waybackup_error.log

Full traceback:
Traceback (most recent call last):
File "..\site-packages\pywaybackup\archive.py", line 231, in download_loop
download_status = download(output, snapshot, connection, status_message, no_redirect)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "..\site-packages\pywaybackup\archive.py", line 271, in download
response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "..\site-packages\pywaybackup\archive.py", line 343, in download_response
connection.request("GET", encoded_download_url, headers=headers)
File "..............\Program Files\Python312\Lib\http\client.py", line 1336, in request
self._send_request(method, url, body, headers, encode_chunked)
File "..............\Program Files\Python312\Lib\http\client.py", line 1382, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "..............\Program Files\Python312\Lib\http\client.py", line 1331, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "..............\Program Files\Python312\Lib\http\client.py", line 1091, in _send_output
self.send(msg)
File "..............\Program Files\Python312\Lib\http\client.py", line 1055, in send
self.sock.sendall(data)
File "..............\Program Files\Python312\Lib\ssl.py", line 1211, in sendall
v = self.send(byte_view[count:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "..............\Program Files\Python312\Lib\ssl.py", line 1180, in send
return self._sslobj.write(data)
^^^^^^^^^^^^^^^^^^^^^^^^
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Files downloaded: 2655
Not downloaded: 668249

waybackup_error.log
waybackup_http.wuarchive.wustl.edu.pub.csv

the following is from the test12 run that worked after the very first error in the first post:

waybackup_error.log
waybackup_http.wuarchive.wustl.edu.pub.csv

@grigzy28 (Author)

Actually, upon inspection it appears that both runs ended at the same spot; the files are identical.

@bitdruid (Owner)

bitdruid commented Aug 25, 2024

Hm, ssl.SSLZeroReturnError doesn't seem like a problem within the code... Can you get the exact snapshot URL which causes this error? Then I could dig into it a bit.

Also, maybe try updating via pip to 1.5.0.

@grigzy28 (Author)

OK, I looked to see if I can find the URL, but it's not in the CSV nor in the window.

Also, why does it download the files twice? The CSV shows that each file is downloaded twice with a status of 200 (OK)...

And I will try the newest version that you just put out.

@grigzy28 (Author)

grigzy28 commented Aug 25, 2024

I also just noticed that the delay isn't applied to the failed 404/301 URLs; it appears to only work for the 200-status ones.


Not Working Delay:

-----> Attempt: [1/1] Snapshot [801/670904] - Worker: 1
UNEXPECTED -> HTTP : 301 - Moved Permanently
-> URL : https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme
FAILED -> : append to failed_urls: https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme

-----> Attempt: [1/1] Snapshot [802/670904] - Worker: 1
UNEXPECTED -> HTTP : 301 - Moved Permanently
-> URL : https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme
FAILED -> : append to failed_urls: https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme


Working Delay:

-----> Attempt: [1/1] Snapshot [1137/670904] - Worker: 1
SUCCESS -> HTTP : 200 - OK
-> URL : https://web.archive.org/web/20100214085623id_/http://wuarchive.wustl.edu:80/pub/fedora10/media.repo
-> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test14\wuarchive.wustl.edu\20100214085623\pub\fedora10\media.repo

-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [1138/670904] - Worker: 1
EXISTING -> HTTP : 200 - OK
-> URL : https://web.archive.org/web/20100214085623id_/http://wuarchive.wustl.edu:80/pub/fedora10/media.repo
-> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test14\wuarchive.wustl.edu\20100214085623\pub\fedora10\media.repo

-----> Worker: 1 - Delay: 1 seconds


@bitdruid (Owner)

To the delay: currently the logic is that there is a 15-second timeout anyway for a retry. That's why I left the delay only for successful downloads. Do you think it would be better to include it for any status?
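The alternative being discussed, applying the delay after every attempt rather than only after successful downloads, can be sketched like this (process_snapshots and download are hypothetical names, not the tool's actual code):

```python
import time

def process_snapshots(snapshots, download, delay=1.0):
    """Run downloads sequentially, sleeping after every attempt:
    success, redirect, and failure alike."""
    results = []
    for snap in snapshots:
        results.append(download(snap))
        time.sleep(delay)  # unconditional: applies to 200, 301, and 404 alike
    return results
```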

For the duplicate downloads:

Check the CDX response manually:

https://web.archive.org/cdx/search/cdx?output=json&url=wuarchive.wustl.edu/pub/*&fl=timestamp,digest,mimetype,statuscode,original&limit=5&filter!=statuscode:200

So for timestamp 19980123002752 there are 2 digests (the archive thinks they are not the same).

For timestamp 19970101083806, however, the digests are the same, so this seems to be a problem with the CDX response. Funnily, the param showDupeCount=true advised by archive.org to remove duplicates from the result does not work...

https://web.archive.org/cdx/search/cdx?output=json&url=wuarchive.wustl.edu/pub/*&fl=timestamp,digest,mimetype,statuscode,original&limit=5&showDupeCount=true&filter!=statuscode:200
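The duplicate check described above can also be done programmatically. A sketch, assuming the JSON row layout from the query's fl parameter (timestamp, digest, mimetype, statuscode, original); find_duplicates and the sample rows are illustrative, not the tool's code:

```python
from collections import Counter

def find_duplicates(rows):
    """rows: CDX JSON rows, where the first row is the header.
    Returns (timestamp, original URL) pairs that appear more than once."""
    header, data = rows[0], rows[1:]
    ts_idx = header.index("timestamp")
    url_idx = header.index("original")
    counts = Counter((r[ts_idx], r[url_idx]) for r in data)
    return {key: n for key, n in counts.items() if n > 1}

# Example shaped like the CDX output discussed above: two rows share
# a timestamp and URL, which is exactly the duplicate-download case.
sample = [
    ["timestamp", "digest", "mimetype", "statuscode", "original"],
    ["19980123002752", "AAA", "text/html", "200", "http://wuarchive.wustl.edu/pub/"],
    ["19980123002752", "BBB", "text/html", "200", "http://wuarchive.wustl.edu/pub/"],
]
```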

@bitdruid (Owner)

So I added a filter:

If a snapshot has the same TIMESTAMP & URL, duplicates are removed.

However, I don't know why the CDX server responds with duplicates...
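The filter described above could look roughly like this (dedupe_snapshots and the dict keys are hypothetical names, not the actual patch):

```python
def dedupe_snapshots(rows):
    """Drop CDX rows sharing the same timestamp and URL,
    keeping the first occurrence of each pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["timestamp"], row["url"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```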

@grigzy28 (Author)

To the delay: currently the logic is that there is a 15-second timeout anyway for a retry. That's why I left the delay only for successful downloads. Do you think it would be better to include it for any status?

When I was watching it to get the failing URL for you earlier, it wasn't pausing the 15 seconds, because there was no timeout on those 301/404 codes; they were immediate responses, not timeouts. I didn't really mean to inform you about the duplicates; that just happened. I honestly didn't even know I had pasted that, because I was showing the delay function. :)

Thanks, will try the latest commit now.

@grigzy28 (Author)

Also, I added --debug to the command, but it still didn't give the URL of the original problem noted initially (the SSL/TLS EOF error)... I'm going to see if the latest commit happens to have corrected it.

@bitdruid (Owner)

bitdruid commented Aug 25, 2024

The --debug command was removed in 1.5.0, just so you know

@grigzy28 (Author)

Thanks. I also just updated the install; however, it isn't putting waybackup.exe in the Scripts folder like it used to. Is that something in your installer, or something else?

@bitdruid (Owner)

Sorry, I'm not on Windows, but when I was debugging on Windows I just created a virtual env and installed it inside that via pip.

@grigzy28 (Author)

That's what I did/do, but for some reason it's not creating waybackup.exe this time.

Oh... I just found out it moved from the appdata folder to the Program Files Scripts folder... strange.

@grigzy28 (Author)

Okay, running 1.5.1, I get this now:


PS C:\users\shawn\appdata\roaming\python\Python312\Scripts> ./waybackup.exe --csv -u http://wuarchive.wustl.edu/pub/ -o
.\test15 -f --workers 1 --skip --delay 1

No CSV-file or content found to load skipable URLs

Querying snapshots...
-----> wuarchive.wustl.edu/pub/*
-----> Downloading CDX result: 12.6MB [03:00, 69.7kB/s]

!-- Exception: UNCAUGHT EXCEPTION
!-- File: ..\site-packages\requests\models.py
!-- Function: generate
!-- Line: 818
!-- Segment: raise ChunkedEncodingError(e)
!-- Description: ('Connection broken: IncompleteRead(7451 bytes read, 741 more expected)', IncompleteRead(7451 bytes read, 741 more expected))

Exception log: .\test15\waybackup_error.log

waybackup_error.log
waybackup_http.wuarchive.wustl.edu.pub.cdx.txt

@grigzy28 (Author)

Updated Python from 3.12.4 to 3.12.5 and it started working correctly so far. I think my internet may also have been slow, which could explain the error above. I will keep you updated when this last test finishes.

@grigzy28 (Author)

Okay, here are the results: same TLS/SSL issue, but the data files are attached.


-----> Attempt: [1/1] Snapshot [1936/335537] - Worker: 1
SUCCESS -> HTTP : 200 - OK
-> URL : https://web.archive.org/web/20081011175454id_/http://wuarchive.wustl.edu/pub/aminet/comm/xeno/frqsta11.readme
-> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test16\wuarchive.wustl.edu\20081011175454\pub\aminet\comm\xeno\frqsta11.readme

-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [1937/335537] - Worker: 1
INCOMPLETEREAD -> (1/2): reconnect in 50 seconds...

!-- Exception: Worker: 1 - Exception
!-- File: ..............\Program Files\Python312\Lib\ssl.py
!-- Function: send
!-- Line: 1180
!-- Segment: return self._sslobj.write(data)
!-- Description: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Exception log: .\test16\waybackup_error.log

Files downloaded: 1312
Not downloaded: 334225


test16.zip

@bitdruid (Owner)

bitdruid commented Aug 26, 2024

And the reconnect does not work? The exception http.client.IncompleteRead should be a subclass of the already-caught http.client.HTTPException.

I tried in a Windows VM and had no issues so far; downloads worked without any problems.

@grigzy28 (Author)

I just restarted it on the test16 folder with the CSV; it picked up at file 1937, tried to download it, and did the same thing. I disabled all AV just in case that was causing a connection issue. Not really sure what's going on.

@bitdruid (Owner)

Strange. But okay, give me some time. I decided to redesign the whole retry logic to include such strange exceptions and create a new connection when they occur. This MAY solve one or two issues...

However, retry is not working as intended since I implemented the queue.
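A retry loop along those lines, discarding the connection and opening a fresh one when a transport-level exception such as ssl.SSLZeroReturnError or http.client.IncompleteRead occurs, might be sketched as follows (download_with_retry, make_connection, and download_one are hypothetical names, not the actual patch):

```python
import http.client
import ssl
import time

# Transport-level failures worth retrying with a fresh connection.
RETRYABLE = (ssl.SSLError, http.client.HTTPException, ConnectionError, TimeoutError)

def download_with_retry(make_connection, download_one, url, retries=2, backoff=50):
    """Try a download; on a transport error, discard the (possibly dead
    TLS) connection, wait, open a fresh one, and retry."""
    conn = make_connection()
    for attempt in range(1, retries + 1):
        try:
            return download_one(conn, url)
        except RETRYABLE:
            conn.close()
            if attempt == retries:
                raise  # out of retries, surface the error
            time.sleep(backoff)
            conn = make_connection()  # the stale TLS session is replaced
```

The key design point is that a closed TLS session (the SSLZeroReturnError seen above) cannot be reused, so the retry must rebuild the connection rather than re-issue the request on the old one.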

@bitdruid (Owner)

I patched dev. You could build it from dev and try whether your exception gets caught and retried properly. Still BETA, of course :)

@bitdruid added the bug (Something isn't working) and windows (tool was used with windows machine) labels on Aug 30, 2024

This issue is marked as stale because there was no activity for 30 days.

@github-actions bot added the stale label on Sep 30, 2024

This issue has been closed because there has been no activity for 14 days while it was marked as stale.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 15, 2024