Not able to scrape anything earlier than 2024 #1135

Closed
maxkostas opened this issue Oct 29, 2024 · 13 comments

@maxkostas

Hello!

First of all, great application!
I managed to get it running and was able to successfully download my invoices for orders from 2024.

However I am not able to download anything older than that.

I am getting this error:

docudigger-1  | [0] [info] [2024-10-29 01:10:41] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 01:10:42] [scrape:amazon]:       Only invoices since order 303-4172506-XXXXX will be gathered.
docudigger-1  | [0] [info] [2024-10-29 01:10:45] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       First possible year: 2023
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       Last possible year: 2023
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       Selecting start year 2023
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

I also tried other years, and the outcome is the same for every year except the current one.

What I would like to do is download ALL invoices regardless of year. I tried disabling the year value, but that did not work. Is there another option to have docudigger just download everything?

This is what I am using:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2023
            - AMAZON_PAGE_FILTER=1
            - LOG_LEVEL=info
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger
volumes:
    C:
        external: true
        name: C

Thank you in advance and all the best!

@maxkostas changed the title from "Not able to scrape anything beyond 2024" to "Not able to scrape anything earlier than 2024" on Oct 29, 2024
@Disane87
Owner

Disane87 commented Oct 29, 2024

Thank you for this issue, @maxkostas. Please try the ONLY_NEW flag mentioned in the docs. It should deactivate AMAZON_YEAR_FILTER and AMAZON_PAGE_FILTER and fetch all pages of all years. Once that run is done, the last scraped date is saved, and subsequent runs start from that saved date so that you only get the new invoices.
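
For readers following along, here is a minimal sketch of how such an incremental run could work. It is not docudigger's actual code: `ProcessState`, `lastOrderNumber`, `selectOrdersToProcess`, and `nextState` are hypothetical names, the saved marker is assumed to be an order number (as the "Only invoices since order ..." log line suggests), and orders are assumed to be listed newest first.

```typescript
// Hypothetical shape of the state saved after a run
// (an assumption, not the real process.json schema).
interface ProcessState {
  lastOrderNumber: string;
}

interface Order {
  orderNumber: string;
  date: string;
}

// Returns the orders that still need processing.
// `orders` is assumed to be sorted newest first.
// Without saved state, or with onlyNew disabled, every order is returned (a full run).
function selectOrdersToProcess(
  orders: Order[],
  state: ProcessState | null,
  onlyNew: boolean,
): Order[] {
  if (!onlyNew || state === null) {
    return orders;
  }
  const index = orders.findIndex(
    (o) => o.orderNumber === state.lastOrderNumber,
  );
  // If the saved order is not in the list, treat everything as new.
  return index === -1 ? orders : orders.slice(0, index);
}

// After a run, remember the newest order so the next run can stop there.
function nextState(
  orders: Order[],
  previous: ProcessState | null,
): ProcessState | null {
  return orders.length > 0
    ? { lastOrderNumber: orders[0].orderNumber }
    : previous;
}
```

In this sketch, a missing state (`null`) forces a full run even when `onlyNew` is true, so deleting the saved state is one way to trigger a complete scrape.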

I am running this with the following compose file:

services:
    docudigger:
        container_name: Docudigger
        networks:
            - homelab
        environment:
            - TZ=Europe/Berlin
            - AMAZON_USERNAME=
            - AMAZON_PASSWORD=
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2024
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        labels:
            - net.unraid.docker.managed=dockerman
        volumes:
            - /mnt/user/appdata/paperless-ngx/consume/:/home/node/docudigger/data:rw
            - /mnt/user/appdata/docudigger/logs:/home/node/docudigger/logs:rw
        image: "ghcr.io/disane87/docudigger:dev"
networks:
    homelab:
        external: true
        name: br3.11
        

If that doesn't work correctly, please use one of the dev versions:
https://github.com/Disane87/docudigger/tags

@maxkostas
Author

Hello!

Thank you very much for the quick response. I have just tried it with the following configuration:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2008
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger:dev
volumes:
    C:
        external: true
        name: C

However this is the output:

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 11:10:40] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 11:10:40] [scrape:amazon]:       Only invoices since order 303-4172506-8435524 will be gathered.
docudigger-1  | [0] [info] [2024-10-29 11:10:44] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       First possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       Last possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       Selecting start year 2008
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

@Disane87
Owner

Do you have a process.json file in the volume? If so, please delete it; the next run should then do a complete scrape.

@maxkostas
Author

Hello! I just tried it and deleted the process.json.

It seems that it disables OnlyNew?

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 11:10:16] [scrape:all]:  runAll
docudigger-1  | [0] [warn] [2024-10-29 11:10:16] [scrape:amazon]:       process.json not found. Full run needed. OnlyNew deactivated.
docudigger-1  | [0] [info] [2024-10-29 11:10:20] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       First possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Last possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Determined order pages 57
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Page count: 57
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Processing year "2024" page 0
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Checking page 0 for orders

and in the end this is the output:

docudigger-1  | [0] [info] [2024-10-29 11:10:35] [scrape:amazon]:       Page "56" done. Skipping to next page.
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Year "2024" done. Skipping next years
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Processing done. Processed 777 invoices in 0.00 minutes.
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Recurring activated. Not terminating. Waiting for next run.
docudigger-1  | [0] [info] [2024-10-29 11:10:00] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 11:10:00] [scrape:amazon]:       Only invoices since order 303-4172506-8435524 will be gathered.
docudigger-1  | [0] [info] [2024-10-29 11:10:05] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       First possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Last possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Determined order pages 57
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Page count: 57
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Processing year "2024" page 0
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Checking page 0 for orders
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Got 10 orders. Processing...
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Order number: 303-4172506-8435524
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Order date: 27. Oktober 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Order 303-4172506-8435524 already handled. Exiting.
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       3 invoices found 📃
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Processing "1" orders
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Processing done. Processed 3 invoices in 0.00 minutes.
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Recurring activated. Not terminating. Waiting for next run.

Thank you again for the quick help! I will donate to you!

@maxkostas
Author

I have also tried on a different machine. I think a change to the year selector on Amazon is causing the issue?

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 12:10:51] [scrape:all]:  runAll
docudigger-1  | [0] [warn] [2024-10-29 12:10:52] [scrape:amazon]:       process.json not found. Full run needed. OnlyNew deactivated.
docudigger-1  | [0] [info] [2024-10-29 12:10:56] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       First possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       Last possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       Selecting start year 2008
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

@Disane87
Owner

Yeah, it seems so. I get that error too. I will investigate it.

@maxkostas
Author

Great to hear, thank you very much! If there is any way I can assist please let me know!

@Disane87
Owner

That's pretty weird. The timeFilter select box is there and selectable:
image

I need some time to track down this issue. Sorry!

@Disane87
Owner

Disane87 commented Oct 29, 2024

I believe I've identified the issue. It appears to be a race condition combined with some rate limiting on Amazon's end. I now wait for the yearFilter selector to be present and have implemented debounce mechanisms for other actions (such as retrieving invoices). These adjustments seem to prevent some issues that previously occurred under certain, as yet unclear, circumstances. Currently, a full run with onlyNew=true runs through.

Because of the debouncing, gathering the invoices as PDFs from Amazon can be a bit slow. But that should only affect full runs.
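
To make the two ideas concrete, here is a minimal, self-contained sketch: wait until something is present before using it, and space out repeated actions. It is not the code from the fix. `sleep`, `waitUntil`, and `runSpaced` are hypothetical helpers, and the fixed pause between tasks is closer to simple throttling than to a classic debounce. In Puppeteer itself, `page.waitForSelector(selector)` covers the waiting part.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `check` until it returns a non-null value or the timeout elapses.
// This avoids acting on an element before the page has rendered it.
async function waitUntil<T>(
  check: () => T | null | Promise<T | null>,
  timeoutMs: number = 30000,
  intervalMs: number = 100,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await check();
    if (result !== null) {
      return result;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
}

// Runs tasks strictly one after another with a pause between them,
// so the remote side sees a slow, steady stream of requests.
async function runSpaced<T>(
  tasks: Array<() => Promise<T>>,
  delayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) {
      await sleep(delayMs);
    }
    results.push(await tasks[i]());
  }
  return results;
}
```

The trade-off is the one described above: with a pause between downloads, total run time grows roughly linearly with the number of invoices, which matters mainly for full runs.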

I will create a new dev version; please check whether it fixes your problem.

@maxkostas
Author

Brilliant! Once the new DEV version is out I will test it right away and give further feedback!

Thank you again for your quick reactions to this issue!

@Disane87
Owner

Disane87 commented Oct 29, 2024

Please check out the latest dev version. Since you run this on Windows, it should work; my tests ran properly. On Linux servers the tooling throws some random exceptions that I need to investigate.

I guess I have to update Puppeteer to the latest version, but that will break some other stuff. So hopefully it's working on your end now, so that I can focus on getting Linux servers ready.

@maxkostas
Author

maxkostas commented Oct 30, 2024

Hello!

Great work! I managed to run it and was able to download everything I needed!
For some reason I wasn't able to use the docker file. However, installing it with npm and running it directly did work.

I am running all of this on Windows.

I have just tried DEV8, but I get this issue when running docker compose up:

[+] Running 1/0
 ✔ Container amazon-docudigger-1  Created                                                                                      0.0s
Attaching to docudigger-1
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:all.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (Use `node --trace-warnings ...` to show where the warning was created)
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:index.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:auth.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:file.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:selectors.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] Platform:  linux
docudigger-1  | [0] [error] [2024-10-30 09:10:42] [scrape:amazon]:      unhandledRejection: Could not find Chrome (ver. 127.0.6533.88). This can occur if either
docudigger-1  | [0]  1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome`) or
docudigger-1  | [0]  2. your cache path is incorrectly configured (which is: /home/node/.cache/puppeteer).
docudigger-1  | [0] For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
docudigger-1  | [0] Error: Could not find Chrome (ver. 127.0.6533.88). This can occur if either
docudigger-1  | [0]  1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome`) or
docudigger-1  | [0]  2. your cache path is incorrectly configured (which is: /home/node/.cache/puppeteer).
docudigger-1  | [0] For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
docudigger-1  | [0]     at ChromeLauncher.resolveExecutablePath (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:291:27)
docudigger-1  | [0]     at ChromeLauncher.executablePath (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:209:25)
docudigger-1  | [0]     at ChromeLauncher.computeLaunchArguments (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:89:37)
docudigger-1  | [0]     at async ChromeLauncher.launch (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:70:28)
docudigger-1  | [0]     at async Puppeteer.setup (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/dist/classes/puppeteer.class.js:16:24)
docudigger-1  | [0]     at async Amazon.init (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/dist/classes/scrape-command.class.js:36:24)
docudigger-1  | [0]     at async Amazon._run (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/command.js:301:13)
docudigger-1  | [0]     at async Config.runCommand (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/config/config.js:425:25)
docudigger-1  | [0]     at async Object.run (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/main.js:95:16)
docudigger-1  | [0] docudigger scrape amazon exited with code 1
docudigger-1 exited with code 1

I am not an expert, but would it be possible to include the Chrome files in the image itself?

This is the docker file that I am using:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2008
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger:2.0.6-dev.8
volumes:
    C:
        external: true
        name: C


@Disane87
Owner

Disane87 commented Oct 30, 2024

Glad npm works for you. It seems something broke the Docker image. It's pretty strange, because the debug Docker image works flawlessly.

But I'm working on it. I'm going to close this, as the main issue here seems to be resolved.
