Currently, the master pipeline receives and processes augment pipeline requests serially, so only one celery worker handles requests on both the augment and master pipelines. We should also use the load_only argument to avoid loading and sending the fulltext field.
Discussion from Slack (SMD+MT):
SMD: I think we could easily speed up this process. It looks like bibcodes are sent one at a time to augment, which incurs the queueing overhead a huge number of times. If app.request_aff_augment could handle a list of bibcodes, it could package the requests into a list protobuf object:
https://github.com/adsabs/ADSMasterPipeline/blob/41f874a33915b1f972b938316954849e3f2f1070/adsmp/app.py#L486
https://github.com/adsabs/ADSPipelineMsg/blob/master/specs/augmentrecord.proto#L15
The call to get_record in app.request_aff_augment should also pass the optional load_only argument, since it only needs bib data and fulltext is big. If that doesn't help enough, we can request multiple database records at once. We can also have run.py simply queue batches of bibcodes and use workers to read the data from postgres and send off the augment requests.
MT: That matches what I saw on the container: without the delay function in ADSAffil.tasks, the load was about 0.7, which sounds about right for single-threaded operation. With the delay function, the load went up to about 2.2, which again makes sense if the receive, augment, and update queues are all running simultaneously. It also makes sense that adjusting the number of workers within augment_pipeline makes no difference.
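A minimal sketch of the batching idea above, assuming a helper built on the adsmp app object; the list-message name `AugmentAffiliationRequestRecordList`, the `build_augment_request` and `forward_to_augment` helpers, and the `load_only` column list are all assumptions for illustration, not the actual adsmp/ADSPipelineMsg API:

```python
# Hypothetical sketch only -- the list-message name, the build/send helpers,
# and the load_only column list are assumptions, not the actual adsmp API.

def request_aff_augment_batch(app, bibcodes):
    """Queue one augment request for a whole batch of bibcodes.

    Instead of sending bibcodes one at a time (paying the queueing overhead
    per bibcode), build the per-record payloads in a loop and send a single
    list protobuf, as defined in specs/augmentrecord.proto.
    """
    requests = []
    for bibcode in bibcodes:
        # load_only restricts the query to the columns augment needs;
        # the large fulltext column is never read or serialized.
        rec = app.get_record(bibcode, load_only=['bibcode', 'bib_data'])
        if rec is None:
            continue
        requests.append(build_augment_request(rec))  # hypothetical per-record payload helper

    if requests:
        # Assumed name for the list message at augmentrecord.proto#L15.
        batch = AugmentAffiliationRequestRecordList(requests=requests)
        forward_to_augment(batch)  # hypothetical send-to-augment call
```

run.py could then slice its input bibcode list into chunks and hand each chunk to a worker task that calls something like this, so reading from postgres and sending to augment also happen in parallel.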
Discussion with SBC & RC in early Feb 2022 suggests that not loading fulltext may provide a larger speedup than bundling multiple records into list protobufs.
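For reference, the load_only change by itself is a one-line adjustment to the get_record call; the column names below are an assumption about which fields the augment request actually needs:

```python
# Sketch of the load_only change only; the exact column list is an assumption.

def get_record_for_augment(app, bibcode):
    # Without load_only, get_record pulls every column, including fulltext.
    # Restricting it keeps fulltext out of the postgres query and the payload.
    return app.get_record(bibcode, load_only=['bibcode', 'bib_data'])
```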