Which of these methods to limit how fast things get piped? #558
I'm not super familiar with the async functions spec, but I think the code you've posted will just execute as many insertIntoDB calls as necessary to get through your input data, all in parallel. You're right that backpressure is what you want, though. In Highland, you get it for free as long as you don't opt out. To do what you want, you need to use mergeWithLimit. For example,

download.stream(url)
.pipe(zlib.createGunzip())
.pipe(csvStream)
.pipe(highland())
.batch(10000)
// The result of the compose is equivalent to this arrow function:
// data => highland(insertIntoDB(data))
// For every object, which is 10000 rows, call insertIntoDB, which returns a promise,
// then wrap the promise in a stream using the highland constructor. You now have a
// stream of streams.
.map(highland.compose(highland, insertIntoDB))
// Merge the stream elements together by consuming them, making sure that only 3 are being
// consumed at a time. You may, of course, replace 3 with whatever parallelism factor you
// want.
.mergeWithLimit(3)
// Consume the results and execute the callback when done.
.done(() => {
console.log('I am done.');
});

The reason this works is the laziness and backpressure features of Highland. Nothing happens until you call done().
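If it helps to see the limiting behavior in isolation, here is a rough sketch, in plain promises, of the semantics that mergeWithLimit(3) provides: run task factories with at most `limit` tasks in flight at once. `runWithLimit` is a hypothetical helper for illustration, not a Highland API.

```javascript
// Hypothetical helper: run task factories with at most `limit` in flight.
// Each element of `tasks` is a function that returns a promise.
function runWithLimit(tasks, limit) {
  const results = [];
  let active = 0;
  let maxActive = 0;
  let next = 0;
  return new Promise((resolve, reject) => {
    const launch = () => {
      if (next === tasks.length && active === 0) {
        resolve({ results, maxActive });
        return;
      }
      while (active < limit && next < tasks.length) {
        const i = next++;
        active++;
        maxActive = Math.max(maxActive, active);
        tasks[i]().then(value => {
          results[i] = value;
          active--;
          launch(); // a slot freed up; start the next task (or finish)
        }, reject);
      }
    };
    launch();
  });
}

// Seven fake "inserts" that resolve asynchronously; no more than 3 overlap.
const tasks = Array.from({ length: 7 }, (_, i) =>
  () => new Promise(resolve => setImmediate(() => resolve(i)))
);

runWithLimit(tasks, 3).then(({ results, maxActive }) => {
  console.log(results);   // [0, 1, 2, 3, 4, 5, 6]
  console.log(maxActive); // 3
});
```

This is the same idea as the pipeline above: new work only starts when a previous task completes, which is what keeps memory bounded.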
|
Thank you for this explanation! Very clear and useful. Is there a way to stop piping when a certain condition is met? In the snippet below, things work fine. But when some condition is met (for example, the pages no longer exist, which would happen at some point since the iterator goes to Infinity), I'd want the stream to stop. I think I would need to put something between the stages that can end the stream early.
|
Yes, there's a way, but you'll have to write your own operator via consume. See the implementation of slice for an example of this "early return". You can perform many arbitrary transforms using consume. In your case, I'm going to assume you want to keep going until you see a 404, which could be done like this (assuming the use of the request library):

highland(range(0, Infinity))
.consume((err, page, push, next) => {
if (err) {
push(err); // Just pass on errors.
next(); // Ask for more data
} else if (page === highland.nil) {
// End marker. Pass that on too. There is guaranteed to be no more data,
// so we don't call next().
push(null, highland.nil);
} else {
request(pageToUrl(page))
.on('response', res => {
if (res.statusCode === 200) {
// We have data, so we push a new stream.
push(null, highland(res));
next(); // Ask for more
} else {
// No more data, so we push highland.nil, which will cause
// the stream to stop. As always, laziness will prevent the original
// infinite stream from being consumed anymore.
push(null, highland.nil);
}
});
}
})
.parallel(...)
...
|
That looks great! I can't seem to get it to work in parallel though. It takes the same amount of time regardless of what value I pass to parallel. In this code, I changed it slightly so that I can call the download logic as its own function:

function download(data, push, next) {
console.log(`start ${data}`)
return request(url + data)
.then(async (response) => {
if (response.body.includes('last page')) {
push(null, highland.nil)
} else {
await doSomething(response)
push(null, highland(response))
next()
}
console.log(`finish ${data}`)
})
}
highland(util.range(1, 10))
.consume((err, page, push, next) => {
if (err) {
push(err)
next()
} else if (page === highland.nil) {
push(null, highland.nil)
} else {
download(page, push, next)
}
})
.parallel(10)
.done(() => {
console.log('done')
})

Gives the output
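As an aside, the serializing effect here (awaiting the work before asking for the next item) can be reproduced with plain promises, no Highland involved; `work` below is a hypothetical stand-in for doSomething():

```javascript
// A producer that awaits each work item before starting the next one can
// never overlap two items, so start/finish logs always interleave strictly.
const order = [];

function work(id) {
  return new Promise(resolve => setImmediate(() => {
    order.push(`finish ${id}`);
    resolve();
  }));
}

async function producerThatAwaits(items) {
  for (const id of items) {
    order.push(`start ${id}`);
    await work(id); // like awaiting doSomething() before calling next()
  }
}

producerThatAwaits([1, 2, 3]).then(() => {
  // Every item finishes before the next one starts:
  console.log(order);
  // ['start 1', 'finish 1', 'start 2', 'finish 2', 'start 3', 'finish 3']
});
```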
|
This is expected behavior. The handler that you pass to consume logs both start and finish before it calls next(), so those logs will always appear serialized regardless of the parallelism factor. You should be able to see the parallelism if you do this instead:

function download(data, push, next) {
console.log(`start ${data}`)
return request(url + data)
.then(async (response) => {
if (response.body.includes('last page')) {
push(null, highland.nil)
} else {
await doSomething(response)
const stream = highland(response);
// This will wait until the stream completes before printing "finish"
stream.observe()
.done(() => console.log(`finish ${data}`))
push(null, stream)
next()
}
})
}

As for why your code takes the same amount of time regardless of the parallelism factor, I'm not sure. It could be because of the await doSomething(response) before next(), which keeps the consume handler busy. You could defer that work into the pushed stream so it only runs when the stream is consumed:

.then((response) => {
if (response.body.includes('last page')) {
push(null, highland.nil)
} else {
const stream = highland(async (push, next) => {
// This function will only be executed when the stream is consumed.
await doSomething(response)
next(highland(response));
});
// This will wait until the stream completes before printing "finish"
stream.observe()
.done(() => console.log(`finish ${data}`))
push(null, stream)
next()
}
})

It could also be that the majority of the time is spent establishing the connection and that the download itself is relatively fast. If this is the case, then maybe you don't benefit from increased parallelism. FYI, there's currently an issue with pushing
|
You could also do some sort of speculative downloading to eliminate this bottleneck as well. Not sure if it's worth it, but I'm happy to explain more if you'd like.
|
Fix released. You'll want to upgrade to 1.10.1.
|
I must be doing something wrong, because I still get the same output as in my previous post, where it acts exactly as though I had passed 1 to parallel. Here's a standalone example where I've included only what's necessary to demonstrate this.

highland(range(1, 50))
.consume((err, data, push, next) => {
console.log(`start ${data}`)
got(`http://www.news.com.au/`, { timeout: 5000 })
.then((response) => {
const stream = highland(response)
stream.observe()
.done(() => console.log(`finish ${data}`))
push(null, stream)
next()
})
})
.parallel(10)
.done(() => {
console.log('done')
})
|
This probably happens because the download time is not much larger than the connection time. It may be that the entire response arrives in a single socket buffer, or something like that. I was able to get the parallel behavior when I set up local servers to download larger files. Repro steps:
The result that I get is
|
It's quite strange: even if I use the example you just gave, or the one from my previous post, with a URL to a large file (in this case http://releases.ubuntu.com/16.04.1/ubuntu-16.04.1-desktop-amd64.iso), the output is
|
Yep, I get parallel behavior using the Ubuntu URL. I tested on Node 7 on an Ubuntu VM and Node 6.9 on a Windows machine.
|
Here's the exact file that I used: https://gist.github.com/vqvu/6174247b413db479acd43c32f9bd551c.
|
Thanks for providing that gist. I'm able to see now that the issue is caused by the way I make the request. This code in your example returns a Promise that
But if I get the Promise by doing this (tried with both
It's really confusing to me, because in all cases a Promise is being returned, and promises are resolved asynchronously. Any idea why my way of doing it would block in
|
I've never used got. Maybe try using the
|
Interesting. I can go with the way you did it with
You mentioned this earlier, and I think it might simplify the way I'm doing this. Would you mind explaining it? Does it mean downloading more pages than necessary, and when all the pages in a batch have returned 404s, the program terminates? I feel like the final code might be cleaner in such a setup (it's not a problem for this use case to download more pages than necessary).
|
Yeah, it's basically "download more than necessary", and you can even use an infinite range.

function download() {
let done = false;
return (err, page, push, next) => {
if (page === highland.nil || done) {
push(null, highland.nil)
} else if (err) {
push(err)
next()
} else {
console.log(`start ${page}`);
// Just try the download. If there's an error, we toggle the `done` flag and
// future attempts at downloads will end the stream.
const stream = highland(requestPromise(makeUrl(page)))
.errors((error, push) => {
if (is404(error)) {
// Assume that 404 means no more data.
done = true;
} else {
// Pass through all other errors.
push(error);
}
});
stream.observe()
.done(() => console.log(`end ${page}`));
push(null, stream);
next();
}
};
}
highland(util.range(1, 10))
.consume(download())
.parallel(10)
.done(() => {
console.log('done')
})

Worst case, you download a few extra pages, on the order of your parallelism factor.
|
Thanks again! That is indeed simpler. Is it possible to put sets of streams together sequentially?

async function doHighlandOperation(array, operationFunction) {
const range = await getRange(array)
const firstOperation = highland(range)
.consume(operationFunction())
.parallel(10)
.done()
}
async function run() {
await doHighlandOperation([1, ....... 100], doSomething())
await doHighlandOperation([2000, ....... 3000], differentThing())
}

If I run this code, both operations run at the same time instead of one after the other.
|
The reason this doesn't work is because done() doesn't return anything, so doHighlandOperation's Promise resolves as soon as the stream is set up, not when it finishes. Either convert the stream into a Promise and await it:

async function doHighlandOperation(array, operationFunction) {
const range = await getRange(array)
await new Promise((res, rej) => highland(range)
.consume(operationFunction())
.parallel(10)
.stopOnError(rej)
.done(res));
}

or use sequence() to run the streams one after the other:

async function doHighlandOperation(array, operationFunction) {
const range = await getRange(array)
// Note done isn't called, so we don't start the stream
return highland(range)
.consume(operationFunction())
.parallel(10);
}
async function run() {
const s1 = await doHighlandOperation([1, ....... 100], doSomething())
const s2 = await doHighlandOperation([2000, ....... 3000], differentThing())
await new Promise((res, rej) => _([s1, s2]).sequence()
.stopOnError(rej)
.done(res));
}
|
@vqvu Both solutions you suggested (converting the stream to a Promise, or using sequence()) still leave me with cases where the await doesn't actually wait. For example, here all the getAll() calls start without waiting for the previous one to finish:

async function loop() {
while (true) {
await getAll()
}
}
function getAll() {
highland(getPages())
.map(page => highland(download(page)))
.parallel(5)
.done(() => {
console.log('done')
})
}
|
Making getAll return a Promise that settles when the stream finishes (as in the doHighlandOperation examples above) should fix it.
|
In the below example from earlier, could you give some clarification on what the download function should look like?

highland(range(0, 1000))
.map(page => highland(download(page)))
.parallel(5)
.done(() => {
console.log('I am done.')
})

Are both of these options fine to use as the download function?

Option 1: The function isn't async, so therefore it must return a Promise.

function download(page) {
return new Promise((resolve, reject) => {
request(page, (e, response) => {
if (e) { reject(e) } else { resolve(response) }
})
})
}

Option 2: The function is async, so it doesn't return anything, but it must await the Promise.

async function download(page) {
await new Promise((resolve, reject) => {
request(page, (e, response) => {
if (e) { reject(e) } else { resolve(response) }
})
})
}
|
Looking at the documentation for http://highlandjs.org/#parallel, it says that it is "buffering the results until they can be returned to the consumer in their original order". So in the example, a slow early download holds up the results behind it. Is there a way to do it where, as soon as 1 of the 10 files finishes, it grabs the next one immediately? So at any given time, there are always 10 files being processed in parallel if there are still files available in the source array. This will of course prevent the results from being returned in their original order, but that's fine given the improvement in parallelism.
|
If you don't care about getting results in the original order, then you can use mergeWithLimit instead.
|
I found myself wanting to consume an elasticsearch scroll as a stream. It was pretty simple to write a generator function that would continue to fetch the next batch. However, the issue I had was that the next batch would not be fetched until the current batch was finished consuming. This caused the total time to be the fetch time plus the processing time, rather than overlapping the two. I struggled getting this batching, flattening, and back-pressure right, and wanted to share my solution since it is simple and working well.

'use strict';
const highland = require('highland');
// Resolve after `n` milliseconds for simulating an HTTP get
function delay(n) {
return new Promise(resolve => setTimeout(resolve, n));
}
// This isn't really important. It's just simulating what elasticsearch does
// internally for scrolling.
function FakeCursor({
batchSize = 5,
numBatches = 10,
delay : delayMS = 1000
} = {}) {
const docs = (new Array(batchSize)).fill(0).map((_, i) => i);
return function getNext(n) {
console.log(`requested: ${n}`);
return delay(delayMS).then(() => {
console.log(`sent: ${n}`);
if (n >= numBatches) {
return {
docs : []
};
}
return {
next : n + 1,
docs : docs.map(i => `${n} - ${i}`)
};
});
};
}
// The specifics of this aren't important. This method should maintain
// internal state and return a generator method for highland. It should
// emit batched collections as single events and not an event for each item
// in order to simplify parallelism.
function FakeCursorGenerator(opts) {
const getNext = FakeCursor(opts);
let n = 0;
return (push, done) => getNext(n).then(({ next, docs }) => {
n = next;
if (docs.length) {
push(null, docs);
done();
} else {
push(null, highland.nil);
}
});
}
// If your consumption is mostly synchronous, this is critical. If you do
// not push the current set of data to the end of the event loop, your
// next fetch in the generator will be blocked until after consumption.
function AsyncStream() {
return data => highland(push => setImmediate(() => {
push(null, data);
push(null, highland.nil);
}));
}
// Simulate some synchronous processing that takes a non-trivial amount of time
function BusyWait(n) {
return doc => {
const end = Date.now() + n;
while (Date.now() < end);
return doc;
};
}
// Set up the stream from a generator. This send batches of items resulting
// from each fetch.
highland(FakeCursorGenerator())
// Push processing to the end of the event loop
.map(AsyncStream())
// Maintain back-pressure so that we are fetching concurrently with
// processing. Unless you have very eratic times, it is likely
// unnecessary to use any value other than `2`
.parallel(2)
// Flatten the events out
.sequence()
// The rest of your processing continues as normal
.map(BusyWait(100))
.each(console.log);
|
In the example below, I'm downloading a gzipped file that's several hundred megabytes (millions of rows), then piping it through highland so I can use the batch feature. This way, instead of inserting 1 row into the db at a time (csvStream emits one 'data' event per row), I can insert 10000 rows at a time in the 'data' event handler you see below. But when running this, I get out-of-memory errors and the system starts to slow down significantly. I think it's because the data is coming into the 'data' event too fast. The csvStream's finish event happens within a couple of minutes, but the program runs for up to another hour, which indicates that the whole csv file has been read into memory, rather than being piped downstream piece by piece as the data event consumes the batches.

I'm new to highland, and looking through the documentation I can't tell which of the various methods would be most appropriate in this case. http://highlandjs.org/#backpressure seems most relevant to this situation, but I can't tell how to use it in this code. http://highlandjs.org/#parallel looks good too.

Can I configure highland so that at any time there are only, for example, 3 batches' worth of rows (where batch is 10000) that have been read, and it only reads another 10000 rows when one of those 3 batches completes?