JSON parser doesn't parse all input streams if a file input contains multiple InputStreams. #7

Open
hiroyuki-sato opened this issue Aug 24, 2023 · 2 comments

Comments

@hiroyuki-sato
Member

Overview.

The following configuration fetches JSON data over HTTP.
It should output nine entries, but the current parser outputs only three.

in:
  type: http
  url: http://express.heartrails.com/api/json
  params:
    - {name: method, value: getStations}
    - {name: x, value: 135.0}
    - {name: y, value: "35"}
  cursor:
    request_parameter_cursor_name: name
    response_parameter_cursor_json_path: '$.response.station[0].next'
  parser:
    type: json
    root: '/response/station'
    flatten_json_array: true

out: {type: stdout}
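
For readers unfamiliar with the parser options, here is a minimal Python sketch of what `root: '/response/station'` combined with `flatten_json_array: true` is expected to do to a single response body. It is an illustration only, not the plugin's code: `resolve_pointer` and `records` are hypothetical helper names, and the body is a trimmed copy of the first response shown later in this issue (only `name` and `next` are kept).

```python
import json


def resolve_pointer(document, pointer):
    """Resolve a simple JSON Pointer such as '/response/station'.

    Escape sequences (~0, ~1) are not handled in this sketch.
    """
    if not pointer:
        return document
    node = document
    for token in pointer.strip("/").split("/"):
        # Lists are indexed by integer, objects by key.
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def records(body, root, flatten_json_array):
    """Yield one record per array element when flattening, else the value itself."""
    value = resolve_pointer(json.loads(body), root)
    if flatten_json_array and isinstance(value, list):
        yield from value
    else:
        yield value


# Trimmed copy of the first response body (assumption: other fields omitted).
body = (
    '{"response":{"station":['
    '{"name":"日本へそ公園","next":"黒田庄"},'
    '{"name":"比延","next":"日本へそ公園"},'
    '{"name":"黒田庄","next":"本黒田"}]}}'
)

names = [record["name"] for record in records(body, "/response/station", True)]
print(names)
```

With flattening enabled, the three array elements become three separate records, which matches the three entries the parser currently outputs for the first response.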

Environment

  • Embulk: v0.11.0
  • embulk-input-http: 16a4adf94e60caddc0ce590e16332a7111043a4b (forked version)
  • embulk-parser-json: 0.4.0

The reason.

The results are split across multiple HTTP responses. Each response is an independent JSON object, like a separate file.
However, the input plugin constructs a single FileInputInputStream that contains multiple InputStreams.

embulk-parser-json parses only the first InputStream. As a result, it outputs only three entries.

embulk-parser-jsonpath has the same issue.

Most plugins' TransactionalFileInput has only one file (input stream), but the Embulk specification also supports multiple files (input streams).
In the latter case, the current implementation reads only the first file (input stream).
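
The difference between reading only the first stream and iterating over all of them can be sketched in Python as follows. This is a simplified model, not the Embulk API or the parser's actual code: `MultiStreamInput`, `next_file`, `parse_first_only`, and `parse_all` are hypothetical names, and the two bodies are trimmed copies of the first two responses in this issue.

```python
import io
import json


class MultiStreamInput:
    """Hypothetical stand-in for a file input holding several independent streams."""

    def __init__(self, bodies):
        self._pending = [io.BytesIO(body.encode("utf-8")) for body in bodies]
        self._current = None

    def next_file(self):
        """Advance to the next stream; return False when none are left."""
        if not self._pending:
            self._current = None
            return False
        self._current = self._pending.pop(0)
        return True

    def read(self):
        """Read the remainder of the current stream only."""
        return self._current.read().decode("utf-8")


def parse_first_only(source):
    """Models the reported behavior: only the first stream is parsed."""
    if not source.next_file():
        return []
    return json.loads(source.read())["response"]["station"]


def parse_all(source):
    """Models the expected behavior: every stream is parsed as its own JSON document."""
    stations = []
    while source.next_file():
        stations.extend(json.loads(source.read())["response"]["station"])
    return stations


# Trimmed copies of the first two response bodies (assumption: other fields omitted).
bodies = [
    '{"response":{"station":[{"name":"日本へそ公園"},{"name":"比延"},{"name":"黒田庄"}]}}',
    '{"response":{"station":[{"name":"黒田庄"}]}}',
]

print(len(parse_first_only(MultiStreamInput(bodies))))  # 3
print(len(parse_all(MultiStreamInput(bodies))))  # 4
```

The key point of the sketch is the `while source.next_file()` loop: each stream is treated as a separate JSON document instead of stopping after the first one.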

Execution results.

embulk-input-http invoked the GET request six times.

2023-08-24 09:29:13.973 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"
2023-08-24 09:29:15.686 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84"
2023-08-24 09:29:15.754 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E6%9C%AC%E9%BB%92%E7%94%B0"
2023-08-24 09:29:15.799 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%88%B9%E7%94%BA%E5%8F%A3"
2023-08-24 09:29:15.840 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E4%B9%85%E4%B8%8B%E6%9D%91"
2023-08-24 09:29:15.986 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%B0%B7%E5%B7%9D"
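
Each follow-up request appears to carry the previous response's `$.response.station[0].next` value as the `name` parameter. A minimal Python sketch of that cursor step, assuming a trimmed copy of the first response body and a hypothetical helper `next_url` (this is not the input plugin's code):

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://express.heartrails.com/api/json"
BASE_PARAMS = {"method": "getStations", "x": "135.0", "y": "35"}


def next_url(body, cursor_name="name"):
    """Build the follow-up URL from $.response.station[0].next of a response body."""
    cursor = json.loads(body)["response"]["station"][0]["next"]
    # urlencode percent-encodes the UTF-8 bytes of the cursor value.
    return BASE_URL + "?" + urlencode({**BASE_PARAMS, cursor_name: cursor})


# Trimmed copy of the first response body (assumption: other fields omitted).
first_body = '{"response":{"station":[{"name":"日本へそ公園","next":"黒田庄"}]}}'

print(next_url(first_body))
```

The printed URL ends with `name=%E9%BB%92%E7%94%B0%E5%BA%84`, the same value as in the second GET line of the log above.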

Simulating the requests with the curl command.

% curl -Lv 'http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35'
*   Trying 35.75.165.181:80...
* Connected to express.heartrails.com (35.75.165.181) port 80 (#0)
> GET /api/json?method=getStations&x=135.0&y=35 HTTP/1.1
> Host: express.heartrails.com
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 24 Aug 2023 00:38:05 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 572
< Connection: keep-alive
< Server: nginx
< Expires: Thu, 01 Dec 1994 16:00:00 GMT
< Pragma: no-cache
< X-Runtime: 1
< ETag: "952bd603b2f475e0e56ae31927adb679"
< Cache-Control: private, max-age=0, must-revalidate
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Headers: *
<
{"response":{"station":[{"name":"日本へそ公園","prefecture":"兵庫県","line":"JR加古川線","x":134.997633,"y":35.002069,"postal":"6770039","distance":"320m","prev":"比延","next":"黒田庄"},{"name":"比延","prefecture":"兵庫県","line":"JR加古川線","x":134.995733,"y":34.988773,"postal":"6770033","distance":"1310m","prev":"新西脇","next":"日本へそ公園"},{"name":"黒田庄","prefecture":"兵庫県","line":"JR加古川線","x":134.992522,"y":35.022689,"postal":"6790313","distance":"2620m","prev":"日本へそ公園","next":"本黒田"}]}}
* Connection #0 to host express.heartrails.com left intact
{
  "response": {
    "station": [
      {
        "name": "日本へそ公園",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.997633,
        "y": 35.002069,
        "postal": "6770039",
        "distance": "320m",
        "prev": "比延",
        "next": "黒田庄"
      },
      {
        "name": "比延",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.995733,
        "y": 34.988773,
        "postal": "6770033",
        "distance": "1310m",
        "prev": "新西脇",
        "next": "日本へそ公園"
      },
      {
        "name": "黒田庄",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.992522,
        "y": 35.022689,
        "postal": "6790313",
        "distance": "2620m",
        "prev": "日本へそ公園",
        "next": "本黒田"
      }
    ]
  }
}
% curl -Lv 'http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84'
*   Trying 35.75.165.181:80...
* Connected to express.heartrails.com (35.75.165.181) port 80 (#0)
> GET /api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84 HTTP/1.1
> Host: express.heartrails.com
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 24 Aug 2023 00:39:18 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 192
< Connection: keep-alive
< Server: nginx
< Expires: Thu, 01 Dec 1994 16:00:00 GMT
< Pragma: no-cache
< X-Runtime: 1
< ETag: "3be1f77accfba140aa48670d77eb6e97"
< Cache-Control: private, max-age=0, must-revalidate
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Headers: *
<
{"response":{"station":[{"name":"黒田庄","prefecture":"兵庫県","line":"JR加古川線","x":134.992522,"y":35.022689,"postal":"6790313","prev":"日本へそ公園","next":"本黒田"}]}}
* Connection #0 to host express.heartrails.com left intact
{
  "response": {
    "station": [
      {
        "name": "黒田庄",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.992522,
        "y": 35.022689,
        "postal": "6790313",
        "prev": "日本へそ公園",
        "next": "本黒田"
      }
    ]
  }
}

....

Example output reproducing the issue

2023-08-24 18:00:12.915 +0900 [INFO] (main): m2_repo is set as a sub directory of embulk_home: /Users/user/.embulk/lib/m2/repository
2023-08-24 18:00:12.918 +0900 [INFO] (main): gem_home is set as a sub directory of embulk_home: /Users/user/.embulk/lib/gems
2023-08-24 18:00:12.918 +0900 [INFO] (main): gem_path is set empty.
2023-08-24 18:00:12.918 +0900 [DEBUG] (main): Embulk system property "default_guess_plugin" is set to: "gzip,bzip2,json,csv"
2023-08-24 18:00:13.049 +0900 [INFO] (main): Started Embulk v0.11.0
2023-08-24 18:00:14.752 +0900 [INFO] (0001:transaction): Gem's home and path are set by system configs "gem_home": "/Users/user/.embulk/lib/gems", "gem_path": ""
2023-08-24 18:00:15.364 +0900 [INFO] (0001:transaction): Loaded JRuby runtime 9.4.2.0
2023-08-24 18:00:15.395 +0900 [INFO] (0001:transaction): Loaded plugin embulk/input/http from a load path
2023-08-24 18:00:15.487 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-stdout
2023-08-24 18:00:15.538 +0900 [INFO] (0001:transaction): Loaded plugin embulk-parser-json
2023-08-24 18:00:15.687 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2023-08-24 18:00:15.724 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2023-08-24 18:00:15.860 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"
2023-08-24 18:00:15.985 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84"
2023-08-24 18:00:16.007 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E6%9C%AC%E9%BB%92%E7%94%B0"
2023-08-24 18:00:16.029 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%88%B9%E7%94%BA%E5%8F%A3"
2023-08-24 18:00:16.050 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E4%B9%85%E4%B8%8B%E6%9D%91"
2023-08-24 18:00:16.085 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%B0%B7%E5%B7%9D"
{"prefecture":"兵庫県","distance":"320m","line":"JR加古川線","next":"黒田庄","prev":"比延","x":134.997633,"y":35.002069,"postal":"6770039","name":"日本へそ公園"}
{"prefecture":"兵庫県","distance":"1310m","line":"JR加古川線","next":"日本へそ公園","prev":"新西脇","x":134.995733,"y":34.988773,"postal":"6770033","name":"比延"}
{"prefecture":"兵庫県","distance":"2620m","line":"JR加古川線","next":"本黒田","prev":"日本へそ公園","x":134.992522,"y":35.022689,"postal":"6790313","name":"黒田庄"}
2023-08-24 18:00:16.162 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2023-08-24 18:00:16.167 +0900 [INFO] (main): Committed.
2023-08-24 18:00:16.167 +0900 [INFO] (main): Next config diff: {"in":{},"out":{}}
@dmikurube
Member

Hmm, does the same happen in older versions?

@hiroyuki-sato
Member Author

hiroyuki-sato commented Aug 25, 2023

The forked version of embulk-input-http by trocco uses InputStreamFileInput.IteratorProvider.
(As far as I know, no other input plugin uses InputStreamFileInput.IteratorProvider.)

The original embulk-input-http doesn't use it, so the input plugin sent just one HTTP GET request.

  • embulk: 0.9.25
  • embulk-input-http: 0.25.0 (rubygems version)
  • embulk-parser-jsonpath: 0.4.0
2023-08-25 10:21:30.748 +0900: Embulk v0.9.25
2023-08-25 10:21:31.645 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2023-08-25 10:21:33.555 +0900 [INFO] (main): Gem's home and path are set by default: "/home/user/.embulk/lib/gems"
2023-08-25 10:21:34.286 +0900 [INFO] (main): Started Embulk v0.9.25
2023-08-25 10:21:34.432 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-http (0.25.0)
2023-08-25 10:21:34.556 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2023-08-25 10:21:34.568 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2023-08-25 10:21:34.615 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"
{"prefecture":"兵庫県","distance":"320m","line":"JR加古川線","next":"黒田庄","prev":"比延","x":134.997633,"y":35.002069,"postal":"6770039","name":"日本へそ公園"}
{"prefecture":"兵庫県","distance":"1310m","line":"JR加古川線","next":"日本へそ公園","prev":"新西脇","x":134.995733,"y":34.988773,"postal":"6770033","name":"比延"}
{"prefecture":"兵庫県","distance":"2620m","line":"JR加古川線","next":"本黒田","prev":"日本へそ公園","x":134.992522,"y":35.022689,"postal":"6790313","name":"黒田庄"}
2023-08-25 10:21:34.936 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2023-08-25 10:21:34.941 +0900 [INFO] (main): Committed.
2023-08-25 10:21:34.941 +0900 [INFO] (main): Next config diff: {"in":{},"out":{}}
