Scaling Renovate Bot on self-hosted GitLab #13172
Replies: 5 comments 11 replies
-
We've literally just solved this issue ourselves. What we did was run a Python script to grab all the repos via the GitLab API and then spawn a child pipeline that ran an instance of Renovate against each repo individually. This has taken our runtime from >24 hrs for 12k repos down to ~4 hrs.
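A minimal sketch of that approach, assuming a `GITLAB_TOKEN` environment variable and a self-hosted instance URL; the job names, file names, and batching are illustrative, not from the original setup:

```python
# Sketch: enumerate every repo the bot can see via the GitLab API, then
# emit a child-pipeline YAML with one Renovate job per repository.
# GITLAB_URL, GITLAB_TOKEN, and all job/file names here are assumptions.
import os

import requests
import yaml

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

def list_projects():
    """Page through /api/v4/projects and collect every project path."""
    projects, page = [], 1
    while True:
        resp = requests.get(
            f"{GITLAB_URL}/api/v4/projects",
            headers=HEADERS,
            params={"membership": True, "per_page": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return projects
        projects.extend(p["path_with_namespace"] for p in batch)
        page += 1

# One job per repo; the parent pipeline then triggers this generated file
# as a child pipeline (trigger:include:artifact in GitLab CI).
pipeline = {}
for path in list_projects():
    pipeline["renovate: " + path] = {
        "image": "renovate/renovate",
        "script": [f"renovate {path}"],
    }

with open("child-pipeline.yml", "w") as f:
    yaml.safe_dump(pipeline, f)
```

At 12k repos you would likely batch several repos per job rather than one each, but the shape is the same: discovery in the parent pipeline, execution fanned out across the child pipeline's jobs.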
-
My approach is to have 2 jobs:

```yaml
.template:
  variables:
    RENOVATE_CONFIG_FILE: config.js
  image: renovate/renovate:32.0.3@sha256:6ee56f7ff58fd515e5c521c57f85284a96a32b2789335d6241307af7625a8b64

renovate-discover:
  extends: .template
  stage: discover
  script:
    - renovate --base-dir .renovate --write-discovered-repos=renovate-repos.json
  artifacts:
    paths:
      - renovate-repos.json
  only:
    - schedules

renovate-run:
  # parallel: 10 runs ten copies of this job; GitLab sets CI_NODE_INDEX
  # and CI_NODE_TOTAL in each copy, which config.js uses below.
  parallel: 10
  extends: .template
  stage: run
  services:
    - name: docker:20.10.12-dind@sha256:6f2ae4a5fd85ccf85cdd829057a34ace894d25d544e5e4d9f2e7109297fedf8d
      alias: docker
  variables:
    DOCKER_HOST: "tcp://docker:2375"
    DOCKER_TLS_CERTDIR: ""
  script:
    - renovate --base-dir .renovate
  dependencies:
    - renovate-discover
```

and at the bottom of `config.js`:

```js
const fs = require('fs');

if (fs.existsSync("renovate-repos.json")) {
  if (!("CI_NODE_INDEX" in process.env) || !("CI_NODE_TOTAL" in process.env)) {
    console.log("renovate-repos.json exists, but CI_NODE_INDEX and CI_NODE_TOTAL are not set. See https://docs.gitlab.com/ee/ci/yaml/#parallel");
    process.exit(1);
  }
  const segmentNumber = Number(process.env.CI_NODE_INDEX); // CI_NODE_INDEX is 1-indexed
  const segmentTotal = Number(process.env.CI_NODE_TOTAL);
  const allRepositories = JSON.parse(fs.readFileSync("renovate-repos.json"));
  // Round-robin assignment: repository i belongs to segment (i % segmentTotal) + 1.
  const repositories = allRepositories.filter((_, i) => (segmentNumber - 1) === (i % segmentTotal));
  module.exports.repositories = repositories;
  module.exports.autodiscover = false;
  console.log(`renovate-repos.json contains ${allRepositories.length} repositories. This is chunk number ${segmentNumber} of ${segmentTotal} total chunks. Processing ${repositories.length} repositories.`);
} else {
  module.exports.autodiscover = true;
}
```
-
This discussion was super valuable for us in working around the extremely low Bitbucket Cloud API rate limits. We're following a similar approach to the one above. Thank you 👏
-
Are you aware of an option to use the
-
I found this article quite useful: Optimizing Renovate for GitLab with 500+ Repositories
-
Hi,
we are running a self-hosted GitLab instance, and a scheduled GitLab CI job triggers the Renovate bot once per hour. We are using Renovate's "autodiscover" mode, so that other users on our GitLab server simply need to invite the bot user to their repo for the scanning to work.
We are approaching the limit of what the bot can do within one hour, because many repos have already invited the bot. While we could change the schedule to run every 2 hours instead of every hour, this would negatively impact the reaction time.
Is there any scaling mechanism we can use? We are thinking about implementing a self-made "sharding" mechanism, where we run several (hourly) scheduled jobs in parallel and subdivide the repositories. E.g. CI job no. 1 could run for the first 50% of the repos returned by `writeDiscoveredRepos`, and CI job no. 2 for the second 50%.
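A minimal sketch of that 50/50 idea in a Renovate `config.js`, assuming hypothetical `SHARD_INDEX`/`SHARD_TOTAL` environment variables set per scheduled job (they are not built-in Renovate or GitLab variables) and a repo list already written by `writeDiscoveredRepos`:

```js
// Hypothetical config.js fragment: split the discovered repo list into
// contiguous shards. SHARD_INDEX (1-based) and SHARD_TOTAL would be set
// in each scheduled pipeline's variables; the names are illustrative.
const fs = require('fs');

const shardIndex = Number(process.env.SHARD_INDEX || 1);
const shardTotal = Number(process.env.SHARD_TOTAL || 1);

const allRepos = JSON.parse(fs.readFileSync('renovate-repos.json'));
const shardSize = Math.ceil(allRepos.length / shardTotal);
const start = (shardIndex - 1) * shardSize;

module.exports = {
  autodiscover: false,
  // slice() clamps past the end, so the last shard just gets the remainder.
  repositories: allRepos.slice(start, start + shardSize),
};
```

The modulo-based filter shown earlier in this thread achieves the same split and keeps shard sizes balanced even as repos are added or removed.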