Not able to scrape anything earlier than 2024 #1135

maxkostas · 2024-10-29T01:34:08Z

Hello!

First of all, great application!
I managed to get it running and was able to successfully download my invoices for orders from 2024.

However I am not able to download anything older than that.

I am getting this error:

docudigger-1  | [0] [info] [2024-10-29 01:10:41] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 01:10:42] [scrape:amazon]:       Only invoices since order 303-4172506-XXXXX will be gathered.
docudigger-1  | [0] [info] [2024-10-29 01:10:45] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       First possible year: 2023
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       Last possible year: 2023
docudigger-1  | [0] [info] [2024-10-29 01:10:49] [scrape:amazon]:       Selecting start year 2023
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

I also tried other years, and the outcome is the same for every year except the current one.

What I would like to do is download ALL invoices, no matter the year. I tried disabling the year value, but that did not work. Is there another option to have docudigger just download everything?

This is what I am using:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2023
            - AMAZON_PAGE_FILTER=1
            - LOG_LEVEL=info
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger
volumes:
    C:
        external: true
        name: C

Thank you in advance and all the best!


Disane87 · 2024-10-29T10:58:09Z

Thank you for this issue, @maxkostas. Please try the ONLY_NEW flag mentioned in the docs. It should deactivate AMAZON_YEAR_FILTER and AMAZON_PAGE_FILTER and fetch all pages of all years. Once that run is done, the last scraped date is saved, and subsequent runs start from that saved date so that you only get the new invoices.
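As a rough illustration of the behaviour described above, the decision between a filtered run, a full run, and an incremental run could be sketched like this. This is not docudigger's actual code: `ProcessState`, `RunPlan`, and `planRun` are hypothetical names, and the shape of the saved state is an assumption (the logs in this thread suggest an order number is stored, but the real layout of process.json is not shown here).

```typescript
// Hypothetical shape of the saved state (assumption; the real process.json is not shown in this thread).
interface ProcessState {
  lastOrderNumber: string;
}

// The three kinds of run discussed in this thread.
type RunPlan =
  | { mode: "filtered"; year: number; page: number } // ONLY_NEW off: year/page filters apply
  | { mode: "full" } // ONLY_NEW on, but no saved state yet
  | { mode: "incremental"; sinceOrder: string }; // ONLY_NEW on and saved state present

// Decide how a run should behave, following the description above.
function planRun(
  onlyNew: boolean,
  state: ProcessState | null,
  yearFilter: number,
  pageFilter: number,
): RunPlan {
  if (!onlyNew) {
    return { mode: "filtered", year: yearFilter, page: pageFilter };
  }
  if (state === null) {
    // Nothing saved yet, so everything has to be scraped once.
    return { mode: "full" };
  }
  return { mode: "incremental", sinceOrder: state.lastOrderNumber };
}
```

For example, `planRun(true, null, 2023, 1)` yields a full run, while passing a saved state yields an incremental run that starts from the stored order.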

I am running it with this compose file:

services:
    docudigger:
        container_name: Docudigger
        networks:
            - homelab
        environment:
            - TZ=Europe/Berlin
            - AMAZON_USERNAME=
            - AMAZON_PASSWORD=
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2024
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        labels:
            - net.unraid.docker.managed=dockerman
        volumes:
            - /mnt/user/appdata/paperless-ngx/consume/:/home/node/docudigger/data:rw
            - /mnt/user/appdata/docudigger/logs:/home/node/docudigger/logs:rw
        image: "ghcr.io/disane87/docudigger:dev"
networks:
    homelab:
        external: true
        name: br3.11

If that doesn't work correctly, please try one of the dev versions:
https://github.com/Disane87/docudigger/tags

maxkostas · 2024-10-29T11:06:39Z

Hello!

Thank you very much for the quick response. I have just tried it with the following configuration:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2008
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger:dev
volumes:
    C:
        external: true
        name: C

However this is the output:

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 11:10:40] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 11:10:40] [scrape:amazon]:       Only invoices since order 303-4172506-8435524 will be gathered.
docudigger-1  | [0] [info] [2024-10-29 11:10:44] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       First possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       Last possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 11:10:47] [scrape:amazon]:       Selecting start year 2008
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

Disane87 · 2024-10-29T11:15:05Z

Do you have a process.json file in the volume? If so, please delete it; the next run should then perform a complete scrape.

maxkostas · 2024-10-29T11:32:28Z

Hello! I just tried it and deleted the process.json.

It seems that it disables OnlyNew?

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 11:10:16] [scrape:all]:  runAll
docudigger-1  | [0] [warn] [2024-10-29 11:10:16] [scrape:amazon]:       process.json not found. Full run needed. OnlyNew deactivated.
docudigger-1  | [0] [info] [2024-10-29 11:10:20] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       First possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Last possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Determined order pages 57
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Page count: 57
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Processing year "2024" page 0
docudigger-1  | [0] [info] [2024-10-29 11:10:23] [scrape:amazon]:       Checking page 0 for orders

and in the end this is the output:

docudigger-1  | [0] [info] [2024-10-29 11:10:35] [scrape:amazon]:       Page "56" done. Skipping to next page.
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Year "2024" done. Skipping next years
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Processing done. Processed 777 invoices in 0.00 minutes.
docudigger-1  | [0] [info] [2024-10-29 11:10:37] [scrape:amazon]:       Recurring activated. Not terminating. Waiting for next run.
docudigger-1  | [0] [info] [2024-10-29 11:10:00] [scrape:all]:  runAll
docudigger-1  | [0] [info] [2024-10-29 11:10:00] [scrape:amazon]:       Only invoices since order 303-4172506-8435524 will be gathered.
docudigger-1  | [0] [info] [2024-10-29 11:10:05] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       First possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Last possible year: 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Determined order pages 57
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Page count: 57
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Processing year "2024" page 0
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Checking page 0 for orders
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Got 10 orders. Processing...
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Order number: 303-4172506-8435524
docudigger-1  | [0] [info] [2024-10-29 11:10:08] [scrape:amazon]:       Order date: 27. Oktober 2024
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Order 303-4172506-8435524 already handled. Exiting.
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       3 invoices found 📃
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Processing "1" orders
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Processing done. Processed 3 invoices in 0.00 minutes.
docudigger-1  | [0] [info] [2024-10-29 11:10:10] [scrape:amazon]:       Recurring activated. Not terminating. Waiting for next run.
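The second run in the log above stops as soon as it meets an order that was already handled. A minimal sketch of that idea, assuming orders are listed newest first (as the log suggests) and using a hypothetical helper name `collectNewOrders` with placeholder order numbers, might look like this. It is an illustration only, not the project's implementation.

```typescript
// Walk the order pages newest-first and stop at the order a previous run already handled.
// `pages` holds order numbers per page; `lastHandledOrder` is null when nothing was saved.
function collectNewOrders(
  pages: string[][],
  lastHandledOrder: string | null,
): string[] {
  const fresh: string[] = [];
  for (const page of pages) {
    for (const orderNumber of page) {
      if (orderNumber === lastHandledOrder) {
        // Everything from here on was scraped by an earlier run.
        return fresh;
      }
      fresh.push(orderNumber);
    }
  }
  return fresh;
}
```

With no saved order (`null`), every order on every page is returned, which corresponds to the full run shown first.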

Thank you again for the quick help! I will donate to you!

maxkostas · 2024-10-29T12:02:19Z

I have also tried on a different machine. I think a change to the year selector on Amazon is causing the issue?

Attaching to docudigger-1
docudigger-1  | [0] [info] [2024-10-29 12:10:51] [scrape:all]:  runAll
docudigger-1  | [0] [warn] [2024-10-29 12:10:52] [scrape:amazon]:       process.json not found. Full run needed. OnlyNew deactivated.
docudigger-1  | [0] [info] [2024-10-29 12:10:56] [scrape:amazon]:       Logged in
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       First possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       Last possible year: 2008
docudigger-1  | [0] [info] [2024-10-29 12:10:59] [scrape:amazon]:       Selecting start year 2008
docudigger-1  | [0]     Error: No element found for selector: select[name="timeFilter"]
docudigger-1  | [0] docudigger scrape all exited with code 1
docudigger-1 exited with code 1

Disane87 · 2024-10-29T12:06:55Z

Yes, it seems so. I get that error too. I will investigate it.

maxkostas · 2024-10-29T12:08:05Z

Great to hear, thank you very much! If there is any way I can assist please let me know!

Disane87 · 2024-10-29T12:24:35Z

That's pretty weird. The timeFilter select box is there and selectable.

I need some time to track down this issue. Sorry!

Disane87 · 2024-10-29T14:12:27Z

I believe I’ve identified the issue. It appears to be a race condition combined with some rate limiting on Amazon's end. I now wait for the yearFilter selector to be present and have implemented debounce mechanisms for other actions (like retrieving invoices). These adjustments seem to prevent some issues that previously occurred under certain, as yet unclear, circumstances. Currently, a full run with onlyNew=true runs.
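The "wait until the element is present" part of such a fix can be illustrated with a small, dependency-free polling helper. In Puppeteer itself, `page.waitForSelector(selector)` serves this purpose before calling `page.select(...)`; the `waitFor` function below is a hypothetical stand-in for illustration, not the change that went into docudigger.

```typescript
// Poll `probe` until it returns a non-null value, or fail once `timeoutMs` has elapsed.
async function waitFor<T>(
  probe: () => T | null,
  timeoutMs: number = 2000,
  intervalMs: number = 50,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const found = probe();
    if (found !== null) {
      return found;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs} ms`);
    }
    // Yield to the event loop before probing again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Acting on the element only after `waitFor` resolves avoids the race in which the scraper tries to select a year before the select box has been rendered.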

Because invoice retrieval is now debounced, downloading the invoices as PDFs from Amazon can be a bit slow. But that should only affect full runs.
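One way to read "debounced" here is that requests are issued one at a time with a pause in between, which is also why full runs get slower. A minimal sketch under that assumption follows; `sleep` and `runSpaced` are hypothetical names, and the actual mechanism in docudigger may differ.

```typescript
// Resolve after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `task` for each item sequentially, pausing `gapMs` between consecutive calls.
async function runSpaced<T, R>(
  items: T[],
  task: (item: T) => Promise<R>,
  gapMs: number,
): Promise<R[]> {
  const results: R[] = [];
  let first = true;
  for (const item of items) {
    if (!first) {
      // Space requests out so the remote side is less likely to rate-limit them.
      await sleep(gapMs);
    }
    first = false;
    results.push(await task(item));
  }
  return results;
}
```

The total run time grows by roughly `gapMs` per item, so the cost is only noticeable when many invoices are fetched in one go.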

I will create a new dev version; please check whether it fixes your problem.

maxkostas · 2024-10-29T14:32:03Z

Brilliant! Once the new DEV version is out I will test it right away and give further feedback!

Thank you again for your quick reactions to this issue!

Disane87 · 2024-10-29T23:40:40Z

Please check out the latest dev version. Since you run this on Windows, it should work; my tests ran properly. On Linux servers, the tooling throws some random exceptions that I need to investigate.

I guess I have to update Puppeteer to the latest version, but that will break some other stuff. So hopefully it's working on your end now, so that I can focus on getting Linux servers ready.

maxkostas · 2024-10-30T09:45:37Z

Hello!

Great work! I managed to run it and was able to download everything I needed!
For some reason I wasn't able to use the docker file. However, installing it with NPM and running it directly did work.

I am running all of this on Windows.

I have just tried DEV8, but I get this issue when running docker compose up:

[+] Running 1/0
 ✔ Container amazon-docudigger-1  Created                                                                                      0.0s
Attaching to docudigger-1
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:all.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (Use `node --trace-warnings ...` to show where the warning was created)
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:index.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:auth.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:file.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] (node:15) Warning: SyntaxError
docudigger-1  | [0] module: @oclif/core@4.0.31
docudigger-1  | [0] task: findCommand (scrape:amazon:helpers:selectors.helper.d)
docudigger-1  | [0] plugin: @disane-dev/docudigger
docudigger-1  | [0] root: /home/node/.npm-global/lib/node_modules/@disane-dev/docudigger
docudigger-1  | [0] message: Cannot use import statement outside a module
docudigger-1  | [0] See more details with DEBUG=*
docudigger-1  | [0] Platform:  linux
docudigger-1  | [0] [error] [2024-10-30 09:10:42] [scrape:amazon]:      unhandledRejection: Could not find Chrome (ver. 127.0.6533.88). This can occur if either
docudigger-1  | [0]  1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome`) or
docudigger-1  | [0]  2. your cache path is incorrectly configured (which is: /home/node/.cache/puppeteer).
docudigger-1  | [0] For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
docudigger-1  | [0] Error: Could not find Chrome (ver. 127.0.6533.88). This can occur if either
docudigger-1  | [0]  1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome`) or
docudigger-1  | [0]  2. your cache path is incorrectly configured (which is: /home/node/.cache/puppeteer).
docudigger-1  | [0] For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
docudigger-1  | [0]     at ChromeLauncher.resolveExecutablePath (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:291:27)
docudigger-1  | [0]     at ChromeLauncher.executablePath (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:209:25)
docudigger-1  | [0]     at ChromeLauncher.computeLaunchArguments (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:89:37)
docudigger-1  | [0]     at async ChromeLauncher.launch (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:70:28)
docudigger-1  | [0]     at async Puppeteer.setup (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/dist/classes/puppeteer.class.js:16:24)
docudigger-1  | [0]     at async Amazon.init (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/dist/classes/scrape-command.class.js:36:24)
docudigger-1  | [0]     at async Amazon._run (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/command.js:301:13)
docudigger-1  | [0]     at async Config.runCommand (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/config/config.js:425:25)
docudigger-1  | [0]     at async Object.run (/home/node/.npm-global/lib/node_modules/@disane-dev/docudigger/node_modules/@oclif/core/lib/main.js:95:16)
docudigger-1  | [0] docudigger scrape amazon exited with code 1
docudigger-1 exited with code 1

I am not an expert - but would it be possible to include the chrome files in the image itself?

This is the compose file that I am using:

name: amazon
services:
    docudigger:
        environment:
            - AMAZON_USERNAME=XXXXX
            - AMAZON_PASSWORD=XXXXX
            - AMAZON_TLD=de
            - AMAZON_YEAR_FILTER=2008
            - AMAZON_PAGE_FILTER=1
            - ONLY_NEW=true
        volumes:
            - C:/Users/XXXX/:/home/node/docudigger
        image: ghcr.io/disane87/docudigger:2.0.6-dev.8
volumes:
    C:
        external: true
        name: C

Disane87 · 2024-10-30T09:48:24Z

Glad NPM works for you. It seems something broke the Docker image. It's pretty strange, because the debug Docker image works flawlessly.

But I'm working on it. I'm going to close this, as the main issue here seems to be resolved.

maxkostas changed the title ~~Not able to scrape anything beyond 2024~~ Not able to scrape anything earlier than 2024 Oct 29, 2024

Disane87 self-assigned this Oct 29, 2024

Disane87 added the test needed label Oct 29, 2024

Disane87 assigned maxkostas Oct 29, 2024

Disane87 mentioned this issue Oct 29, 2024

"Couldn't get popover #a-popover-content-3 within 2000ms" #997

Open

Disane87 closed this as completed Oct 30, 2024
