
fetch command #77

Closed
jqnatividad opened this issue Oct 12, 2021 · 20 comments
Labels
enhancement New feature or request. Once marked with this label, it's in the backlog.

Comments

@jqnatividad
Collaborator

jqnatividad commented Oct 12, 2021

fetch will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.)

It will support authentication, concurrent requests, thresholds, etc.

Reminiscent of OpenRefine's "Add column by fetching URLs" (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls), but optimized for the command line.

@jqnatividad added the enhancement label Oct 12, 2021
@jqnatividad changed the title apifetch command → fetch command Oct 13, 2021
@mhuang74
Contributor

I would like to take a crack at this. Please let me know if you have specifics in mind already.

@jqnatividad
Collaborator Author

jqnatividad commented Nov 30, 2021

Hi @mhuang74, the thing I like about Docopt is that the helptext becomes an API contract definition of sorts.

Let's start by drafting the helptext first, and once we lock that down, you can then work on the code implementing it.

Can you take a first pass at the helptext?

Also, we need to think about the underlying crates to use. Should we use reqwest, actix, or awc? Their capabilities will determine the options we can expose on fetch.

@mhuang74
Contributor

mhuang74 commented Dec 3, 2021

Thanks @jqnatividad. Here's my first pass.

Fetch values via a URL column, and optionally store them in a new column.

This command fetches HTML/data from web pages or web services for every row in the URL column, 
and optionally stores them in a new column.

The URL column must contain fully qualified, valid URLs, which can be constructed via the 'lua' command.

Usage:
    qsv fetch [options] <column> [<input>]

fetch options:
    -c, --new-column <name>    Put the fetched values in a new column instead.

Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Collaborator Author

jqnatividad commented Dec 3, 2021

Great @mhuang74 !

Now we start thinking about the implementation, which will ultimately result in additional options we need to expose on the CLI.

The first one is throttling; qsv cannot just hammer an endpoint, as most APIs have rate limits, and your IP can even be blacklisted if your queries are mistaken for a DoS attack. There should be a conservative default throttle delay (say 5000 ms, like OpenRefine) that the user can override.
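A fixed-delay throttle like the one described above can be sketched in plain Rust (illustrative only; the struct and method names are hypothetical, not qsv's actual implementation):

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical fixed-delay throttle: guarantees at least `delay`
/// between consecutive requests by sleeping off any remainder.
struct Throttle {
    delay: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(delay_ms: u64) -> Self {
        Throttle { delay: Duration::from_millis(delay_ms), last: None }
    }

    /// Call before each request; sleeps only if the previous request
    /// started less than `delay` ago.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.delay {
                thread::sleep(self.delay - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}
```

A `wait()` call before each HTTP fetch would then enforce the user-supplied (or default 5000 ms) delay.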

Also, what user agent should qsv present? It should have a default that identifies qsv, but the user should have an option to supply an alternate user-agent string.

Another one is response caching. You may want to check out the cached crate. I used it with apply geocode and it doubled performance just by prefixing the function with a one-line cached! macro (before rustfmt, that is).

https://github.com/jqnatividad/qsv/blob/ead871b5dd65610e3cff25135434d1c240a54b63/src/cmd/apply.rs#L563-L569

Also, there's some work being done to allow cached to leverage a REDIS instance (jaemk/cached#68), and to store the cache to disk (jaemk/cached#20), that we can take advantage of in the future.

Finally, cached supports multi-threading with its sync_writes option, so in a future fetch version, you can really have very fast fetch jobs.
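The memoization idea behind cached can be sketched with a plain HashMap (a hypothetical, simplified version of what the crate's macro generates; the real crate adds TTL, size limits, and the sync_writes behavior mentioned above):

```rust
use std::collections::HashMap;

/// Hypothetical response cache: repeated URLs hit the network only once.
struct ResponseCache {
    map: HashMap<String, String>,
    hits: usize,
}

impl ResponseCache {
    fn new() -> Self {
        ResponseCache { map: HashMap::new(), hits: 0 }
    }

    /// Return the cached body for `url`, or compute it with `fetch`
    /// and remember it for next time.
    fn get_or_fetch<F: FnOnce(&str) -> String>(&mut self, url: &str, fetch: F) -> String {
        if let Some(body) = self.map.get(url) {
            self.hits += 1;
            return body.clone();
        }
        let body = fetch(url);
        self.map.insert(url.to_string(), body.clone());
        body
    }
}
```

Since CSVs often repeat URLs (e.g. the same zip code on many rows), even this simple scheme can cut network round-trips substantially.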

As for the underlying HTTP client library to use - https://blog.logrocket.com/the-state-of-rust-http-clients/

@mhuang74
Contributor

mhuang74 commented Dec 4, 2021

Thanks @jqnatividad for detailed feedback. Here's my 2nd pass.

  • The need for a >1000 ms (1 sec) throttle delay seems fundamentally at odds with a high number of concurrent requests. But in case some servers do allow higher burst rate limits, we probably want to support multi-threading off the bat.
  • Storing Cookie seems important for auth
  • On error defaults to returning blank, but can store error
  • Support response caching
  • HTTP Header can be fully controlled via file

More importantly, since we are really looking at fairly low requests/sec throughput due to server rate limits, the choice of library is more about practicality than performance.

Probably would just go with reqwest, since:

  • supports cookie storage
  • supports system proxies
  • supports HTTPS/TLS
  • and sits on top of production tested hyper crate

Updated helptext draft:

Fetch values via a URL column, and optionally store them in a new column.

This command fetches HTML/data from web pages or web services for every row in the URL column, 
and optionally stores them in a new column.

The URL column must contain fully qualified, valid URLs, which can be constructed via the 'lua' command.

To use a proxy, set the env vars HTTP_PROXY and HTTPS_PROXY (e.g. export HTTPS_PROXY=socks5://127.0.0.1:1086).

Usage:
    qsv fetch [options] [<column>] [<input>]

fetch options:
    --new-column=<name>        Put the fetched values in a new column instead.
    --threads=<value>          Number of threads for concurrent requests.                 
    --throttle-delay=<ms>      Set delay between requests in milliseconds. Recommend 1000 ms or greater (default: 5000 ms).
    --http-header-file=<file>  File containing additional HTTP Request Headers. Useful for setting Authorization or overriding User Agent.
    --cache-responses          Cache HTTP Responses to increase throughput.
    --on-error-store-error     On error, store HTTP error instead of blank value.
    --store-and-send-cookies   Automatically store and send cookies. Useful for authenticated sessions.

Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Collaborator Author

@mhuang74 I like your 2nd pass. Some thoughts:

  • The need for a >1000 ms (1 sec) throttle delay seems fundamentally at odds with a high number of concurrent requests. But in case some servers do allow higher burst rate limits, we probably want to support multi-threading off the bat.

I agree. We should bake in multi-threading. Performance and multi-threading are signature advantages of Rust, and we should leverage it while remaining cognizant of rate limits.

  • Storing Cookie seems important for auth
  • On error defaults to returning blank, but can store error
  • Support response caching
  • HTTP Header can be fully controlled via file

Agree on all the above. With all the files we need to keep around, we should consider creating a ~/.qsv directory where we can store all the files - cookies, headers, persistent cache, etc.

Probably would just go with reqwest, since:

  • supports cookie storage
  • supports system proxies
  • supports HTTPS/TLS
  • and sits on top of production tested hyper crate

And yes, reqwest seems to be the best choice.

As for the docopt usage help text:

  • I'd change threads to jobs to be consistent with the other multi-threaded qsv commands (frequency, split and stats).
  • I'd change http-header-file to just header (like curl's CLI), and consider having a default header file (named default.header?) stored in ~/.qsv.
  • Also consider using less verbose options (throttle, cache, store-error and cookies)

Finally, you may want to look into the RateLimit HTTP header fields to bake in support for auto-adjusting rate limiting against rate-limited services.

There are some discussions about it on reqwest that are worth investigating.

@mhuang74
Contributor

mhuang74 commented Dec 11, 2021

@jqnatividad As most APIs return JSON, supporting it seems important. But I am struggling with:

  • JSON requires double quotes and commas, which means I cannot save it to CSV and then run foreach with jq to parse out the desired value
  • I believe CSV encoding can escape commas as well, but I don't think jq can handle that
  • Also, foreach is not supported on the Windows platform

Example JSON Response

$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch 3 --new-column Geo test.csv
Country,ZipCode,URL,Geo
US,90210,http://api.zippopotam.us/us/90210,"{""post code"": ""90210"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""Beverly Hills"", ""longitude"": ""-118.4065"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""34.0901""}]}"
US,94105,http://api.zippopotam.us/us/94105,"{""post code"": ""94105"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""San Francisco"", ""longitude"": ""-122.3892"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""37.7864""}]}"
US,92802,http://api.zippopotam.us/us/92802,"{""post code"": ""92802"", ""country"": ""United States"", ""country abbreviation"": ""US"", ""places"": [{""place name"": ""Anaheim"", ""longitude"": ""-117.9228"", ""state"": ""California"", ""state abbreviation"": ""CA"", ""latitude"": ""33.8085""}]}"
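The doubled quotes in the output above are standard CSV quoting (RFC 4180): the field is wrapped in double quotes and any embedded quote is doubled. A minimal sketch of that escaping (the csv crate that qsv uses does this automatically):

```rust
/// Quote a field for CSV output per RFC 4180: wrap in double quotes
/// and double any embedded double quotes.
fn csv_quote(field: &str) -> String {
    let mut out = String::with_capacity(field.len() + 2);
    out.push('"');
    for c in field.chars() {
        if c == '"' {
            out.push('"'); // escape embedded quote by doubling it
        }
        out.push(c);
    }
    out.push('"');
    out
}
```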

Questions:

  • should there be a better way to parse out the desired value from a JSON response than relying on foreach, e.g. some kind of internal JSON parsing support?

@jqnatividad
Collaborator Author

@mhuang74, here are some ideas for consideration:

  • what about saving the JSON response as escaped ndjson/JSON Lines in the CSV column? That way, commas, quotes, and newlines in the response are escaped without breaking CSV parsing. Also, folks can leverage the existing jsonl qsv command downstream for further processing.
  • instead of foreach, use the lua command, which works on Windows. Also, there are existing lua libraries for JSON parsing that can be leveraged (check out how I used the date library in lua to compute the quarter for a given date: https://github.com/jqnatividad/qsv/wiki/Cookbook#date-enrichment).
  • and yes, I agree there needs to be some internal support for JSON parsing. In addition to saving the JSON response as ndjson/jsonl, perhaps you can include another option to execute a JSON query before saving the result to the --new-column. I just found jql - it looks promising and can be used as a library.

@mhuang74
Contributor

mhuang74 commented Dec 12, 2021

@jqnatividad

Thanks for the feedback and suggestions. Directly supporting JSON parsing seems the way to go.

I was able to integrate with jql as a library.

$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch  3 --new-column City --jql '."places"[0]."place name"' test.csv
Country,ZipCode,URL,City
US,90210,http://api.zippopotam.us/us/90210,Beverly Hills
US,94105,http://api.zippopotam.us/us/94105,San Francisco
US,92802,http://api.zippopotam.us/us/92802,Anaheim

$ QSV_LOG_LEVEL=DEBUG ./target/debug/qsv fetch  3 --jql '."places"[0]."place name"' test.csv
Beverly Hills
San Francisco
Anaheim

https://github.com/mhuang74/qsv/tree/fetch_poc

@jqnatividad
Collaborator Author

Hi @mhuang74 !
I hope you don't mind, but I went ahead and merged your changes even though fetch is not fully finished. I wanted to include it in today's release, mark it as beta, and perhaps even make it an optional feature, so other folks can test it and give us feedback.

Even in its current form, it's already very useful, and once we implement the other options, it'll be one of qsv's tentpole commands.

@mhuang74
Contributor

@jqnatividad Okay, glad to hear it. Thanks for doing the merge and refactoring.

@jqnatividad
Collaborator Author

@mhuang74 I tweaked it a bit more, as I'm planning to use it in a pipeline I'm working on :)

And some additional thoughts on the other options:

  • I think we can get rid of the cache option. We have the function cache by default, and if folks want more persistent, flexible caching, they can take advantage of the HTTP_PROXY env var you pointed out in the fetch helptext (and use NGINX, Squid, etc.); anything beyond a simple process-level cache is well beyond the scope of qsv.
  • instead of throttle, use rate-limit, using the approaches described in Async Rate Limiting seanmonstar/reqwest#491

@jqnatividad
Collaborator Author

Hi @mhuang74 ,
Did some more tweaks here - #132

@mhuang74
Contributor

mhuang74 commented Dec 17, 2021

Thanks @jqnatividad. Glad to see the feature quickly filling out.

Please let me know if you still need help with fetch rate-limiting.

I can also move to something less on the critical path for your projects, since I only get to work on this over weekends.

@jqnatividad
Collaborator Author

Thanks @mhuang74! If you can help with rate-limiting, that'd be awesome! I think getting headers & cookies working will also be great for authenticated sessions while you're at it.

And feel free to jump into anything in the backlog that piques your interest. #46 and #60 in particular would really bulk up qsv's schema validation capabilities.

@mhuang74
Contributor

Sure. Here's my plan for fetch rate-limiting:

  • use actix-governor to simulate a simple JSON API service in Unit Test, including rate limits for bursts and requests-per-second
  • play with governor to do rate-limiting for fetch as well
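For reference, the rate-limiting idea can be sketched as a token bucket, which allows short bursts up to a capacity while enforcing a steady refill rate (illustrative only; governor implements a more sophisticated variant of this idea, and its actual API differs):

```rust
use std::time::Instant;

/// Hypothetical token bucket: bursts up to `capacity`, refilled at
/// `rate` tokens per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Take one token if available; `false` means the caller should wait.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        // Refill tokens proportionally to elapsed time, capped at capacity.
        let refill = now.duration_since(self.last).as_secs_f64() * self.rate;
        self.tokens = (self.tokens + refill).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

This is what makes "burst rate limit" and "requests-per-second" two separate knobs in the test plan above.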

@jqnatividad jqnatividad pinned this issue Dec 23, 2021
jqnatividad added a commit that referenced this issue Dec 30, 2021
Fetch: support rate limiting as per #77
@jqnatividad
Collaborator Author

Hi @mhuang74 , fetch is really coming together!

I tweaked test_fetch.rs and added some convenience macros to make it easier to change the test port and ran rustfmt.

Once the --header parameter is done, I'll take off the WIP label. 🚀 💯 🎉

@mhuang74
Contributor

mhuang74 commented Jan 7, 2022

Thanks @jqnatividad.

Regarding --header, I am planning to use the Java Properties format via the java-properties crate. It should be easy to copy-paste headers from Chrome.
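A hypothetical example of what such a header file might look like in Java Properties format (the header names and values below are illustrative, not a final format):

```properties
# One HTTP request header per line, name=value
User-Agent=my-custom-agent/1.0
Authorization=Bearer YOUR_TOKEN_HERE
X-Api-Key=DEMO_KEY
```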

For the integration test, I am having difficulty finding an API that requires a key yet does not require my credit card (e.g. RapidAPI).

The only one I've found is api.data.gov, but it also recommends not sharing the key.

Thoughts?

X-Api-Key: DEMO_KEY 

@jqnatividad
Collaborator Author

Thinking about it, --header can be confused with CSV headers.

Perhaps we should rename it -httpheader?

As for integration testing, I googled this up - https://mixedanalytics.com/blog/list-actually-free-open-no-auth-needed-apis/.

After a quick scan of the list, JSONPlaceholder and HTTPBin look promising.
nationalize.io is interesting too, as it returns rate-limit response headers when your free API key exceeds the daily quota.

Alternatively, perhaps you can parameterize the integration test that requires an API key and populate it from an environment variable; if the env var is not set, the test is skipped. We'd just mention in the fetch documentation that users need to get their own API key.

@jqnatividad
Collaborator Author

@mhuang74 with the 0.29 release, fetch is no longer WIP! 💯 🚀 🥳 🎉

I'm closing this issue, but you're welcome to keep improving fetch or take on any other qsv command as you wish.

@jqnatividad jqnatividad unpinned this issue Jan 8, 2022