Support FileInputInputStream with multiple InputStreams #37
Hello, @chikamura. Thank you for creating this PR. I have some concerns.
This PR description.
[{"a":1, "b":2 }]
[{"a":2, "b":3 }]
Test result.
Sample json.
[{"a":1, "b":2 }]
[{"a":2, "b":3 }]
in:
  type: file
  path_prefix: test
  parser:
    type: jsonpath
    # root: "$.results"
    default_timezone: "Asia/Tokyo"
    columns:
    - { name: "a", type: long }
    - { name: "b", type: long }
out:
  type: stdout
Expect Outputs.
Actual Outputs
@@ -158,8 +160,14 @@ public void run(TaskSource taskSource, Schema schema,
FileInputInputStream is = new FileInputInputStream(input);
while (is.nextFile()) {
final JsonNode json;
try {
json = JsonPath.using(JSON_PATH_CONFIG).parse(is).read(jsonRoot, JsonNode.class);
try (ByteArrayOutputStream bout = new ByteArrayOutputStream()) {
BTW:
How about the following solution (if you do not want parse(is) to call close())?
Wrapping the original is in a temporary InputStream avoids allocating (and copying) a new buffer to parse the JSON.
final InputStream toParse = new InputStream() {
    @Override
    public int read() {
        return is.read();
    }
};
try {
    json = JsonPath.using(JSON_PATH_CONFIG).parse(toParse).read(jsonRoot, JsonNode.class);
}
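The wrapping idea above can be tried in isolation. The following is a minimal sketch that uses only the JDK: `TrackingStream` and `readAllAndClose` are hypothetical stand-ins (not Embulk or JsonPath classes) for the original stream and for a parser that closes its input. It relies on the fact that `InputStream.close()` does nothing unless overridden, so closing the wrapper leaves the original stream open.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class NonClosingWrapperSketch {
    /** Hypothetical stand-in for the original stream; records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical consumer that closes its argument, like a parser that takes ownership. */
    static String readAllAndClose(InputStream in) throws IOException {
        try (InputStream s = in) {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = s.read()) != -1) {
                sb.append((char) b); // ASCII-only input in this sketch
            }
            return sb.toString();
        }
    }

    /** Delegates read() only; the inherited close() is a no-op, so the original stays open. */
    static InputStream nonClosing(final InputStream original) {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                return original.read();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        TrackingStream is = new TrackingStream("{\"a\":1}".getBytes(StandardCharsets.UTF_8));
        String text = readAllAndClose(nonClosing(is));
        System.out.println(text + " closed=" + is.closed); // prints {"a":1} closed=false
    }
}
```

One trade-off of overriding only `read()`: the inherited `read(byte[], int, int)` falls back to reading one byte at a time, so a wrapper that also delegates the bulk read may be faster on large inputs.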
@takumakanari
Many thanks! I think that is the better solution.
Please take care to close is after the operations 👍
@hiroyuki-sato
No, it isn't; each file has one JSON object. The problem occurs with specific implementations of a file input plugin.
I will reconsider the implementation.
@chikamura OK, I understand the problem. Could you provide code that reproduces it? Which InputPlugin uses multiple input streams? That would help me understand the problem.
@takumakanari Thank you for your comment!
Could you please check this repository? https://github.com/trocco-io/embulk-input-http/tree/feature/cursor
config.yml
expected result
actual result in current master branch (only first http request)
@chikamura Thanks. I could reproduce the problem in my environment.
commented 🖊️
@@ -152,14 +153,21 @@ public void run(TaskSource taskSource, Schema schema,
final boolean stopOnInvalidRecord = task.getStopOnInvalidRecord();

// TODO: Use Exec.getPageBuilder after dropping v0.9
try (final PageBuilder pageBuilder = new PageBuilder(Exec.getBufferAllocator(), schema, output)) {
try (final PageBuilder pageBuilder = new PageBuilder(Exec.getBufferAllocator(), schema, output);
final FileInputInputStream is = new FileInputInputStream(input)) {
suggestion:
why don't you try the following?
try (final FileInputInputStream is = new FileInputInputStream(input)) {
Sorry, the indentation was wrong.
Declaration of FileInputInputStream was seen outside try ().
I fixed the indentation.
I meant it's better to make sure FileInputInputStream is closed, so I suggested that you declare is in the try block.
It looks like pageBuilder and is are both in the try block. (One try block can include multiple resources.)
Sorry if I misunderstood.
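To illustrate the point about multiple resources in one try block, here is a small sketch. The `Named` class is hypothetical and only logs open and close events; the variable names mirror the diff but nothing here is Embulk code. Resources declared in a single try-with-resources statement are closed in reverse order of declaration, so is would be closed before pageBuilder.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
    static final List<String> LOG = new ArrayList<>();

    /** Hypothetical resource that records when it is opened and closed. */
    static class Named implements AutoCloseable {
        private final String name;

        Named(String name) {
            this.name = name;
            LOG.add("open " + name);
        }

        @Override
        public void close() {
            LOG.add("close " + name);
        }
    }

    static List<String> run() {
        LOG.clear();
        // Two resources in one try block; both are closed when the block exits.
        try (Named pageBuilder = new Named("pageBuilder");
             Named is = new Named("is")) {
            LOG.add("body");
        }
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        // prints [open pageBuilder, open is, body, close is, close pageBuilder]
        System.out.println(run());
    }
}
```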
Maybe, FileInputInputStream try block should include only while (is.nextFile()) block?
I think so too 🙆‍♂️
while (is.nextFile()) {
// parse(InputStream json) cause is.close(), so wrapping the original is into a temporary InputStream.
👍
@chikamura @takumakanari Thank you for changing and reviewing the code.
The following is just my memo.
embulk-input-http fetches from the data source multiple times, so the results are split across multiple HTTP responses. Each result is an independent JSON object, like multiple files.
{
"response": {
"station": [
{
"name": "日本へそ公園",
"prefecture": "兵庫県",
"line": "JR加古川線",
"x": 134.997633,
"y": 35.002069,
"postal": "6770039",
"distance": "320m",
"prev": "比延",
"next": "黒田庄"
},
{
"name": "比延",
"prefecture": "兵庫県",
"line": "JR加古川線",
"x": 134.995733,
"y": 34.988773,
"postal": "6770033",
"distance": "1310m",
"prev": "新西脇",
"next": "日本へそ公園"
},
{
"name": "黒田庄",
"prefecture": "兵庫県",
"line": "JR加古川線",
"x": 134.992522,
"y": 35.022689,
"postal": "6790313",
"distance": "2620m",
"prev": "日本へそ公園",
"next": "本黒田"
}
]
}
}
{
"response": {
"station": [
{
"name": "黒田庄",
"prefecture": "兵庫県",
"line": "JR加古川線",
"x": 134.992522,
"y": 35.022689,
"postal": "6790313",
"prev": "日本へそ公園",
"next": "本黒田"
}
]
}
}
....
@hiroyuki-sato
LGTM 👍
@hiroyuki-sato Approved. Sorry for the late reply 🙏
Thanks!
Many thanks! 🙏
@chikamura I'll release the new version in a week if you don't plan to create a new PR.
@hiroyuki-sato
Currently, when FileInputInputStream has multiple InputStreams, only the first InputStream is read. For example, this test fails.
This is because ParseContext.parse(InputStream json) closes the FileInputInputStream, so the next call to FileInputInputStream.nextFile() returns false.
I added a fix so that the stream is not closed. Could you please check this PR?
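The failure mode described here can be reproduced without Embulk. The sketch below is only an illustration under assumptions: `MultiPartStream` is a hypothetical stand-in for a multi-file stream whose nextFile() returns false once it has been closed, and `parse` is a hypothetical parser that closes its input. Neither is the real FileInputInputStream or JsonPath implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class MultiPartCloseSketch {
    /** Hypothetical stand-in for a multi-file input stream; not the Embulk class. */
    static class MultiPartStream extends InputStream {
        private final String[] parts;
        private int index = -1;
        private int pos = 0;
        private boolean closed = false;

        MultiPartStream(String... parts) {
            this.parts = parts;
        }

        /** Advances to the next part; returns false once closed or exhausted. */
        boolean nextFile() {
            if (closed || index + 1 >= parts.length) {
                return false;
            }
            index++;
            pos = 0;
            return true;
        }

        @Override
        public int read() {
            if (closed || index < 0 || pos >= parts[index].length()) {
                return -1;
            }
            return parts[index].charAt(pos++);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical parser that closes whatever stream it is given. */
    static String parse(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        in.close();
        return sb.toString();
    }

    /** Passes the multi-part stream straight to the parser, as in the failing case. */
    static List<String> readAll(MultiPartStream is) throws IOException {
        List<String> out = new ArrayList<>();
        while (is.nextFile()) {
            out.add(parse(is)); // parse() closes is, so the loop ends after one part
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Two parts go in, but only the first comes out: prints [[1]]
        System.out.println(readAll(new MultiPartStream("[1]", "[2]")));
    }
}
```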