feat: process package in queue instead of batch #656

Merged
bodinsamuel merged 9 commits into master from feat/one-by-one
Jul 12, 2021


Contributor

@bodinsamuel bodinsamuel commented Jul 11, 2021

Most batches were slowed down by their slowest package, which meant that fast packages in the same batch were still not indexed.
Here, we take advantage of Node.js being good at async work to index each package sooner.
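
To make the batch problem concrete, here is a small sketch of a simplified model, not the project's code. It assumes a batch is only saved once its slowest package is done, while a queue saves each package as soon as its own processing finishes. The function names and the durations are hypothetical.

```typescript
// Simplified model with made-up durations (in seconds); not measured data.

/** Batch mode: every package becomes visible only when the slowest one is done. */
function batchIndexTimes(durations: number[]): number[] {
  const slowest = Math.max(...durations);
  return durations.map(() => slowest);
}

/** Queue mode: each free slot takes the next package; a package is indexed when it finishes. */
function queueIndexTimes(durations: number[], concurrency: number): number[] {
  const slotFreeAt = new Array<number>(concurrency).fill(0);
  return durations.map((duration) => {
    // Pick the slot that becomes free first.
    const slot = slotFreeAt.indexOf(Math.min(...slotFreeAt));
    slotFreeAt[slot] += duration;
    return slotFreeAt[slot];
  });
}

const exampleDurations = [1, 1, 30, 2];
console.log(batchIndexTimes(exampleDurations)); // [ 30, 30, 30, 30 ]
console.log(queueIndexTimes(exampleDurations, 4)); // [ 1, 1, 30, 2 ]
```

In this model the one slow package delays only itself in queue mode, whereas in batch mode it delays all four.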

The prefetcher only prefetches the names of the packages, so it does not fill the memory but still provides a way to fill the queue rapidly.

In the queue we fetch package by package, which is a bit slower than fetching a page of docs, but results in less memory being used and lets the GC reclaim it sooner.
The processing itself is the same as before, except that we no longer process in batches but one package at a time (see #652).
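
The prefetcher-plus-queue idea described above can be sketched roughly as follows. This is only an illustration under assumptions: `NamePrefetcher`, `runQueue`, `FetchPage` and `ProcessPkg` are hypothetical names and signatures, not the ones in `src/npm/Prefetcher.ts`, and error handling is left out.

```typescript
// Illustrative sketch only: every name below is hypothetical, not the PR's code.

type FetchPage = (offset: number, limit: number) => Promise<string[]>;
type ProcessPkg = (name: string) => Promise<void>;

/** Buffers package names only (never full documents), one page at a time. */
class NamePrefetcher {
  private fetchPage: FetchPage;
  private pageSize: number;
  private buffer: string[] = [];
  private offset = 0;
  private done = false;
  private inflight: Promise<void> | null = null;

  constructor(fetchPage: FetchPage, pageSize = 100) {
    this.fetchPage = fetchPage;
    this.pageSize = pageSize;
  }

  /** Returns the next name, or undefined once the source is exhausted. */
  async next(): Promise<string | undefined> {
    while (this.buffer.length === 0 && !this.done) {
      // Share a single refill between all waiting workers to avoid duplicate pages.
      if (this.inflight === null) {
        this.inflight = this.refill();
      }
      await this.inflight;
    }
    return this.buffer.shift();
  }

  private async refill(): Promise<void> {
    try {
      const page = await this.fetchPage(this.offset, this.pageSize);
      this.offset += page.length;
      this.done = page.length < this.pageSize; // a short page means no more names
      this.buffer.push(...page);
    } finally {
      this.inflight = null;
    }
  }
}

/** Runs `concurrency` workers; a slow package only occupies its own slot. */
async function runQueue(
  prefetcher: NamePrefetcher,
  processPkg: ProcessPkg,
  concurrency: number
): Promise<number> {
  let processed = 0;
  const worker = async (): Promise<void> => {
    let name = await prefetcher.next();
    while (name !== undefined) {
      await processPkg(name); // fetch + format + save a single package
      processed += 1;
      name = await prefetcher.next();
    }
  };
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return processed;
}
```

Here `runQueue(new NamePrefetcher(fetchPage), processPkg, 40)` would keep 40 packages in flight, while only one page of names is held in memory at any time.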


Apart from that, I have been extensively testing on Google Compute Engine, to replace the slow and expensive Heroku.
I lack a deploy script, monitoring and everything else, but it's working very well (see comments below).
GCE being very "manual" (apart from starting the Docker image and CPU tracking there is nothing 🤦🏻), I think I'll go for the overkill solution -> Kubernetes, but at least it has native monitoring and auto-restart, and we know the stack perfectly at Algolia.

@bodinsamuel bodinsamuel self-assigned this Jul 11, 2021
@bodinsamuel bodinsamuel changed the title from "Feat/one by one" to "feat: process package in queue instead of batch" Jul 11, 2021
Contributor Author

bodinsamuel commented Jul 11, 2021

Currently processing in GCP.
At a concurrency of 40 (same as before this change) we seem to have improved our rate by 2x.

Screenshot 2021-07-11 at 18 53 47


Resources are better used (unfortunately I don't have memory tracking for the moment).
Almost reaching 1mb/s in download.

Screenshot 2021-07-11 at 19 01 17

@bodinsamuel
Contributor Author

The latest numbers are good: we got a 6x increase, from ~1 pkg/s to ~6 pkg/s.
At this rate we have saved a full week and will be done by the end of the day.
Screenshot 2021-07-12 at 11 01 37

There are still bottlenecks, probably due to the nature of some packages; for example, here at 6am every request was a cache miss in jsDelivr.
Screenshot 2021-07-12 at 11 04 23


In terms of processing, I'm cheating a bit with a preemptible e2-medium, but for 2 days of intensive usage we are paying $1.35, which is far better than Heroku: about 40% less, not even considering that I have 4x the computational power.
The same configuration would probably cost around $300 per month vs $40-50 now; we'll see by the end of the month.
N.B.: the most expensive part right now is not the CPU but the network.

@bodinsamuel bodinsamuel requested a review from Haroenv July 12, 2021 09:19
Collaborator

@Haroenv Haroenv left a comment

looks like a solid refactor, a couple of detail questions though

src/__tests__/saveDocs.test.ts (resolved)
src/npm/Prefetcher.ts (outdated, resolved)
src/npm/Prefetcher.ts (outdated, resolved)
src/bootstrap.ts (resolved)
src/bootstrap.ts (resolved)
Collaborator

@Haroenv Haroenv left a comment

let's go with this implementation, exciting!

@bodinsamuel bodinsamuel merged commit c4f2aa2 into master Jul 12, 2021
@bodinsamuel bodinsamuel deleted the feat/one-by-one branch July 12, 2021 15:09
@bodinsamuel bodinsamuel mentioned this pull request Jul 12, 2021
algolia-crawler-bot pushed a commit that referenced this pull request Jul 19, 2021
# 1.0.0 (2021-07-19)

### Bug Fixes

* 1.0.1 ([#655](#655)) ([5c2cb7f](5c2cb7f))
* add expiresAt field ([#643](#643)) ([dba5d2a](dba5d2a))
* add new worker to bootstrap ([#636](#636)) ([ebbe3df](ebbe3df))
* cache dns ([#654](#654)) ([e80d437](e80d437))
* cache total downloads ([#653](#653)) ([99be307](99be307))
* deprecated facets should be boolean ([#638](#638)) ([19d30d0](19d30d0))
* docker build ([#651](#651)) ([947058d](947058d))
* expiresAt can be a numericFilter ([#664](#664)) ([e89fd14](e89fd14))
* improve logging + remove catchup ([#647](#647)) ([cbc545d](cbc545d))
* increase mem + round downloadRatio ([#644](#644)) ([8ef8425](8ef8425))
* mini fixes ([#659](#659)) ([d34bcc1](d34bcc1))
* setup circleci ([#593](#593)) ([4472405](4472405))
* stop using unpkg ([#658](#658)) ([aae2d86](aae2d86))
* throw outside try ([#661](#661)) ([d36a77a](d36a77a))
* typo ([#637](#637)) ([94851af](94851af))
* up semantic release ([#667](#667)) ([94d8d6c](94d8d6c))
* various ([#663](#663)) ([18fea1e](18fea1e))
* **algolia:** missing config param ([#387](#387)) ([d25ea19](d25ea19))
* **alternative names:** remove prismjs -> prismjs.js ([a1bad34](a1bad34))
* **deps:** update dependency @sentry/node to v5.10.2 ([9c445b0](9c445b0))
* **deps:** update dependency @sentry/node to v5.11.0 ([a858954](a858954))
* **deps:** update dependency @sentry/node to v5.12.4 ([efd6140](efd6140))
* **deps:** update dependency @sentry/node to v5.15.4 ([965fffb](965fffb))
* **deps:** update dependency @sentry/node to v5.15.5 ([89f234e](89f234e))
* **deps:** update dependency @sentry/node to v5.17.0 ([3563f6d](3563f6d))
* **deps:** update dependency @sentry/node to v5.19.1 ([394cb8c](394cb8c))
* **deps:** update dependency @sentry/node to v5.30.0 ([56421c5](56421c5))
* **deps:** update dependency @sentry/node to v5.6.2 ([667e12f](667e12f))
* **deps:** update dependency @sentry/node to v5.7.0 ([55b410d](55b410d))
* **deps:** update dependency @sentry/node to v5.7.1 ([bec31ba](bec31ba))
* **deps:** update dependency @sentry/node to v5.9.0 ([6599c79](6599c79))
* **deps:** update dependency algoliasearch to v3.34.0 ([11f49b6](11f49b6))
* **deps:** update dependency algoliasearch to v3.35.0 ([c4faa7a](c4faa7a))
* **deps:** update dependency algoliasearch to v3.35.1 ([837ba44](837ba44))
* **deps:** update dependency algoliasearch to v4.9.3 ([#628](#628)) ([78e3617](78e3617))
* **deps:** update dependency async to v2.6.3 ([4a9cf53](4a9cf53))
* **deps:** update dependency async to v3.2.0 ([3aa436e](3aa436e))
* **deps:** update dependency bunyan to v1.8.15 ([912e7bc](912e7bc))
* **deps:** update dependency dotenv to v8.1.0 ([b785e8f](b785e8f))
* **deps:** update dependency dotenv to v8.2.0 ([ad5f3fb](ad5f3fb))
* **deps:** update dependency dtrace-provider to v0.8.8 ([4879231](4879231))
* **deps:** update dependency gravatar-url to v3.1.0 ([f66b8ee](f66b8ee))
* **deps:** update dependency hot-shots to v6.4.1 ([f84aa5f](f84aa5f))
* **deps:** update dependency hot-shots to v6.5.1 ([2bdeb8e](2bdeb8e))
* **deps:** update dependency hot-shots to v6.8.1 ([1a58429](1a58429))
* **deps:** update dependency hot-shots to v6.8.2 ([a09e193](a09e193))
* **deps:** update dependency hot-shots to v6.8.5 ([871e2e5](871e2e5))
* **deps:** update dependency hot-shots to v6.8.7 ([fc61f4b](fc61f4b))
* **deps:** update dependency lodash to v4.17.13 [security] ([ad8a7ea](ad8a7ea))
* **deps:** update dependency lodash to v4.17.14 ([10e1777](10e1777))
* **deps:** update dependency lodash to v4.17.15 ([a0f2d0d](a0f2d0d))
* **deps:** update dependency lodash to v4.17.19 [security] ([38bd4e0](38bd4e0))
* **deps:** update dependency lodash to v4.17.21 ([baf7442](baf7442))
* **deps:** update dependency ms to v2.1.3 ([b4f0289](b4f0289))
* **deps:** update dependency nano to v8.2.2 ([a4befee](a4befee))
* **deps:** update dependency nano to v8.2.3 ([2c2272c](2c2272c))
* **deps:** update dependency nice-package to v3.1.2 ([55d8953](55d8953))
* **deps:** update dependency object-sizeof to v1.5.1 ([33296d3](33296d3))
* **deps:** update dependency object-sizeof to v1.5.2 ([eeb434a](eeb434a))
* **deps:** update dependency object-sizeof to v1.6.0 ([715f2f6](715f2f6))
* **deps:** update dependency object-sizeof to v1.6.1 ([24945f3](24945f3))
* **dev:** upgrade env ([#592](#592)) ([3c66c56](3c66c56))
* **dev:** upgrade env /2 ([#595](#595)) ([a86cd71](a86cd71))
* **formatPkg:** remove non-existing versions ([c37d6d6](c37d6d6)), closes [#534](#534)
* **package.json:** add repo url ([#649](#649)) ([6b248b5](6b248b5))
* empty change ([#405](#405)) ([475e366](475e366))
* id of null ([#406](#406)) ([8e5fb1d](8e5fb1d))
* kill process regurlarly, for cache and bootstrap ([#412](#412)) ([9c778b2](9c778b2))
* **esm:** avoid errors, slightly deal with arrays ([f5eefa9](f5eefa9))
* **formatPkg:** cleaned main can be an array ([#395](#395)) ([7ef7f2f](7ef7f2f))
* **getFilesList:** call using package object ([6b954d5](6b954d5))
* **jsdelivr:** fetch just npm hits ([#375](#375)) ([25d29dd](25d29dd)), closes [#371](#371)
* **lint:** correct setup to require extension ([#381](#381)) ([29afbd5](29afbd5))
* **saveDocs:** filter out wrong docs more robustly ([bc81351](bc81351))
* **size:** more exact truncating of readme ([#559](#559)) ([f6187c1](f6187c1))
* **ts:** main can be array ([b619daa](b619daa))
* **TS:** infer definitions correctly ([#357](#357)) ([143aa06](143aa06))
* **TS:** pass correct object ([cdf334b](cdf334b))
* **TS:** support scoped packages ([#364](#364)) ([655e86a](655e86a))
* **unpkg:** remove json flag + add unit test ([#392](#392)) ([d706694](d706694))
* import correctly got ([bb11884](bb11884))
* multiple small bugs after [#379](#379) ([#380](#380)) ([0580052](0580052))
* **config:** fully correct objectIDs ([b25fd81](b25fd81))
* **config:** use allowed chars for objectID ([34f41bb](34f41bb))
* **deps:** update dependency algoliasearch to v3.27.0 ([6c87eed](6c87eed))
* **deps:** update dependency algoliasearch to v3.27.1 ([0985d20](0985d20))
* **deps:** update dependency algoliasearch to v3.28.0 ([d48ad9c](d48ad9c))
* **deps:** update dependency algoliasearch to v3.29.0 ([d6057d5](d6057d5))
* **deps:** update dependency algoliasearch to v3.30.0 ([1a571ad](1a571ad))
* **deps:** update dependency algoliasearch to v3.31.0 ([5448c89](5448c89))
* **deps:** update dependency algoliasearch to v3.32.0 ([f52c1a8](f52c1a8))
* **deps:** update dependency algoliasearch to v3.32.1 ([c93f30f](c93f30f))
* **deps:** update dependency algoliasearch to v3.33.0 ([e26d4d9](e26d4d9))
* **deps:** update dependency async to v2.6.2 ([f9a9cb3](f9a9cb3))
* **deps:** update dependency babel-preset-env to v1.7.0 ([9081d2d](9081d2d))
* **deps:** update dependency bunyan-debug-stream to v1.1.0 ([f3c9d7e](f3c9d7e))
* **deps:** update dependency bunyan-debug-stream to v1.1.1 ([deccb8b](deccb8b))
* **deps:** update dependency dotenv to v6 ([#213](#213)) ([1b40279](1b40279))
* **deps:** update dependency dotenv to v6.1.0 ([0c8cc10](0c8cc10))
* **deps:** update dependency dotenv to v6.2.0 ([a54c1eb](a54c1eb))
* **deps:** update dependency got to v8.3.1 ([2376f53](2376f53))
* **deps:** update dependency got to v8.3.2 ([fcf2550](fcf2550))
* **deps:** update dependency hosted-git-info to v2.7.1 ([751b0af](751b0af))
* **deps:** update dependency lodash to v4.17.10 ([075a877](075a877))
* **deps:** update dependency lodash to v4.17.11 ([e49680a](e49680a))
* **deps:** update dependency ms to v2.1.2 ([cb207be](cb207be))
* **deps:** update dependency nice-package to v3.0.4 ([7a2b490](7a2b490))
* **deps:** update dependency nice-package to v3.1.0 ([361d409](361d409))
* **deps:** update dependency object-sizeof to v1.3.0 ([976f0fd](976f0fd))
* **deps:** update dependency object-sizeof to v1.3.1 ([fe25f6a](fe25f6a))
* **deps:** update dependency object-sizeof to v1.4.0 ([ad57ee8](ad57ee8))
* **formatPkg:** correct name ([b8175f3](b8175f3))
* **formatPkg:** don't discard packages without author, but with owners[] ([da66fb9](da66fb9))
* **npm:** allow undefined downloads ([a0d9c5a](a0d9c5a))
* **npm:** catch errors ([483c0c4](483c0c4))
* **stage:** push correct stage to statemanager ([00b0571](00b0571))
* **ts:** no double slashes ([dd84f88](dd84f88))
* **unpkg:** catch errors ([4efcd01](4efcd01))
* set settings on bootstrap when we start ([e35c0d1](e35c0d1))
* wait for deletion to happen beore continuing ([0734436](0734436))
* **bootstrap:** move to production only in bootstrap ([#126](#126)) ([b26dce6](b26dce6))
* **changelog:** add defaults to catch errors properly ([91e6ebd](91e6ebd))
* **changelog:** fall back to master if the gitHead is undefined ([52fe6ff](52fe6ff))
* **changelogs:** guard for null and undefined ([0a0a748](0a0a748))
* **computed:** use the cleaned package to match keys ([44a839c](44a839c))
* **deletes:** handle npm deletions ([1ad5025](1ad5025))
* **dependedUpon:** encode start and en keys ([24c5fe9](24c5fe9))
* **deps:** pin dependencies ([d1c1377](d1c1377))
* **deps:** update dependency algoliasearch to v3.24.11 ([e8a61bc](e8a61bc))
* **deps:** update dependency algoliasearch to v3.24.12 ([cea8a73](cea8a73))
* **deps:** update dependency algoliasearch to v3.25.1 ([7457f4e](7457f4e))
* **deps:** update dependency algoliasearch to v3.26.0 ([6fde846](6fde846))
* **deps:** update dependency dotenv to v5.0.0 ([#107](#107)) ([e972e19](e972e19))
* **deps:** update dependency dotenv to v5.0.1 ([acc314c](acc314c))
* **deps:** update dependency got to v8.0.3 ([2717b36](2717b36))
* **deps:** update dependency got to v8.2.0 ([64c2318](64c2318))
* **deps:** update dependency got to v8.3.0 ([19efaf8](19efaf8))
* **deps:** update dependency hosted-git-info to v2.6.0 ([0091297](0091297))
* **deps:** update dependency lodash to v4.17.5 ([d07ad04](d07ad04))
* **downloads:** be resilient for 404 or downloads endpoint for a chunk ([866fbcf](866fbcf))
* **downloads:** filter out scoped packages ([76f571a](76f571a)), closes [#36](#36)
* **downloads:** set default of 0 ([d157450](d157450))
* **formatPkg:** rewrite get info into separate functions ([92706ce](92706ce))
* **gitHead:** fix bad backward compat ([dc34d24](dc34d24))
* **gitHead:** fix bad backward compat ([195108f](195108f))
* **gitHead:** put back gitHead ([92d373d](92d373d)), closes [#53](#53) [#64](#64)
* **memleak:** in watch mode, do not use promise chain ([3f2e860](3f2e860))
* **memleak:** maybe fix it ([711b830](711b830))
* **memory:** don't keep a reference of the `chain` in watch ([f973913](f973913))
* **merge:** bad merge from me ([359f498](359f498))
* **schema:** backwards-compatible ([1b24b21](1b24b21))
* **settings:** put synonyms and rules in the configure file ([#128](#128)) ([af8e709](af8e709)), closes [#123](#123)
* **stateManager:** don't assume starting at "zzz" ([c79e6cd](c79e6cd))
* **timeouts:** increase pouch timeout ([#174](#174)) ([a9ccb77](a9ccb77))
* **url:** try to fix url for good ([7da9daf](7da9daf))
* **watch:** add missing return ([30f6e43](30f6e43))
* **watch:** avoid memleak by not piling up docs ([#130](#130)) ([4522ee5](4522ee5))

### Features

* add health API ([#650](#650)) ([95587a3](95587a3))
* add methods to process a single package ([#652](#652)) ([a3c41f3](a3c41f3))
* prepare docker ([#648](#648)) ([21b5d02](21b5d02))
* process package in queue instead of batch ([#656](#656)) ([c4f2aa2](c4f2aa2))
* **babel:** add a forced keyword to babel plugins ([440f344](440f344))
* **changelog:** add changes variations ([e5ce4dc](e5ce4dc))
* **changelog:** detect /changelog.markdown ([bcf21a1](bcf21a1))
* **changelog:** get from jsDelivr filelist if possible ([#640](#640)) ([dd386d2](dd386d2))
* **data:** add "bin" ([446d212](446d212))
* **data:** add "versions" attribute ([766a9c3](766a9c3))
* **data:** add concatenated name ([72ab12e](72ab12e)), closes [#33](#33)
* **data:** add flagging of type=module ([#386](#386)) ([7cd0765](7cd0765))
* **data:** add jsDelivr hits ([#263](#263)) ([adff89d](adff89d))
* **deprecated:** add the attribute for faceting ([#160](#160)) ([afe02c8](afe02c8)), closes [#159](#159)
* **devDeps:** add devDependencies ([01058ef](01058ef))
* **faceting:** allow searching in keywords and owner ([8dd2cda](8dd2cda))
* **formatPkg:** add .js to alternative names ([#383](#383)) ([8463308](8463308)), closes [#217](#217)
* **jsDelivr:** move code, add tests, preload data correctly ([#384](#384)) ([373d341](373d341))
* **keywords:** add webpack-scaffold ([#296](#296)) ([d4e57a7](d4e57a7))
* **npm:** Include directory details from repository objects ([#320](#320)) ([ccb1766](ccb1766))
* **process:** redo bootstrap after X amount of time ([a79d999](a79d999)), closes [#20](#20)
* **quality:** add a flag for very low quality packages ([314cafb](314cafb))
* **query rules:** add filtering on attr:value ([#221](#221)) ([ebcbf56](ebcbf56))
* **ranking:** do tie breaking based on the magnitude of downloads ([#178](#178)) ([85b631f](85b631f))
* **relevance:** add some synonyms ([#192](#192)) ([760f34a](760f34a))
* **relevance:** enable alternative names query rule ([#195](#195)) ([01217e8](01217e8)), closes [#194](#194)
* **relevance:** put name, description and eywords on same level ([#188](#188)) ([ee62193](ee62193))
* **relevance:** use jsDelivr hits for ranking ([#269](#269)) ([9039f76](9039f76))
* **relevancy:** add deprecated in account when sorting ([0b2add3](0b2add3))
* **requests:** add user-agent and httpsAgent ([#646](#646)) ([5a48ad3](5a48ad3))
* **schema:** move git head into githubRepo ([5cbf4e4](5cbf4e4))
* **tracking:** save which stage is currently activated ([dbb7b98](dbb7b98))
* **ts:** allow faceting ([e19e0b0](e19e0b0))
* **ts:** use jsdelivr to check for d.ts ([#645](#645)) ([fbe2e97](fbe2e97))
* **typescript:** pre-load definitely typed pkg ([#639](#639)) ([3968726](3968726))
* add Sentry ([#390](#390)) ([8c08fd5](8c08fd5))
* experimental modules compat ([4f31ab3](4f31ab3))
* full TS migration ([#626](#626)) ([fddc2a8](fddc2a8))
* refacto (part 2) ([#396](#396)) ([2df582b](2df582b))
* **sentry:** wait for the right amount of time. ([#391](#391)) ([d2f00e2](d2f00e2))
* move algolia ([#385](#385)) ([e5d7bec](e5d7bec))
* refacto (part 1) ([#371](#371)) ([c024451](c024451))
* upgrade packages ([#374](#374)) ([3c70053](3c70053))
* **relevance:** merge all the query rules ([#194](#194)) ([9a24fcc](9a24fcc))
* **settings:** allow to make a PR which changes both the settings and the data ([#179](#179)) ([e8f7c2a](e8f7c2a))
* **tags:** add `tags` to the schema ([57a476e](57a476e))
* **third-party:** add handling of Angular CLI schematics, and rework registry subset ([#169](#169)) ([bfab179](bfab179))
* **vue-cli:** add a forced keyword to vue-cli plugins ([3d6ed42](3d6ed42))
* **yeoman:** Identify yeoman generators through computedKeywords ([#181](#181)) ([08c81af](08c81af))
* Add repository info ([#101](#101)) ([29f6fa0](29f6fa0))

### Reverts

* Revert "Revert "chore(deps): update babel monorepo to v7.6.2"" ([4cf094e](4cf094e))
* Revert "Revert "chore(deps): update dependency lint-staged to v9.4.0"" ([11bd8d6](11bd8d6))

2 participants