
[Bug][CircleCI Plugin] Only collecting first page of API responses #7750

Closed · Nickcw6 opened this issue Jul 16, 2024 · 6 comments · Fixed by #7770
Labels
component/plugins (this issue or PR relates to plugins) · severity/p1 (this bug affects functionality or significantly affects UX) · type/bug (this issue is a bug)

Comments

@Nickcw6 (Contributor) commented Jul 16, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When running a data collection for a CircleCI connection, data only appears to be collected from the past <24 hours, irrespective of what the Time Range is set to. The same behaviour is observed in 'full refresh mode' and in a normal data collection.

The behaviour differed slightly each time I tried: when I originally raised this on Slack, only the last ~3 hours of data was collected; when reproducing again to raise this issue, data from the past ~24 hours was collected.

E.g. with the Data Time Range set to the start of the year, checking the _tool_circleci_workflow table:

[Screenshots of the _tool_circleci_workflow table]

Only 18 workflows are identified, the earliest of which occurred at 2024-07-15 10:29:09.000. I would expect to see many more rows dating back to 2024-01-01.

CircleCI pipeline task logs:

time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] start plugin"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [api async client] creating scheduler for api \"https://circleci.com/api/\", number of workers: 13, 10000 reqs / 1h0m0s (interval: 360ms)"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] total step: 9"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertProjects"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertProjects] finished records: 1"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 1 / 9"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectPipelines"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] collect pipelines"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] start api collection"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] finished records: 1"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] end api collection without error"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 2 / 9"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractPipelines"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractPipelines] get data from _raw_circleci_api_pipelines where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 20"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractPipelines] finished records: 1"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 3 / 9"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectWorkflows"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] collect workflows"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] start api collection"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 1"
time="2024-07-16 09:34:28" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 10"
time="2024-07-16 09:34:31" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 19"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] end api collection without error"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 4 / 9"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractWorkflows"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractWorkflows] get data from _raw_circleci_api_workflows where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 18"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractWorkflows] finished records: 1"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 5 / 9"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectJobs"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] collect jobs"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] start api collection"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] finished records: 1"
time="2024-07-16 09:34:35" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] finished records: 10"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] end api collection without error"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 6 / 9"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractJobs"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractJobs] get data from _raw_circleci_api_jobs where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 162"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractJobs] finished records: 1"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 7 / 9"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertJobs"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertJobs] finished records: 1"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 8 / 9"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertWorkflows"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertWorkflows] finished records: 1"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 9 / 9"

I also have GitHub and Jira data connections running within the same pipeline, and data is pulled through as expected for both of those plugins.

EDIT: What is actually happening is that only 20 pipelines are being collected from the CircleCI API response (i.e. the first page). This then has a knock-on effect on the workflows and jobs tables.
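
For context, CircleCI's v2 list endpoints page their results: each response carries a next_page_token field, and the client has to send that value back as the page-token query parameter to fetch the following page. Below is a minimal Go sketch of that loop, calling the public API directly rather than going through DevLake's collector helpers; names like collectAllPipelines and apiToken are placeholders.

```go
package circlecipaging

import (
	"encoding/json"
	"net/http"
	"net/url"
)

// page mirrors the shape of a CircleCI v2 list response: a slice of items
// plus a token pointing at the next page (empty on the last page).
type page struct {
	Items         []json.RawMessage `json:"items"`
	NextPageToken string            `json:"next_page_token"`
}

// collectAllPipelines keeps requesting pages until next_page_token comes back empty.
func collectAllPipelines(client *http.Client, projectSlug, apiToken string) ([]json.RawMessage, error) {
	var all []json.RawMessage
	pageToken := ""
	for {
		q := url.Values{}
		if pageToken != "" {
			q.Set("page-token", pageToken) // note the hyphen, per the CircleCI API docs
		}
		req, err := http.NewRequest("GET",
			"https://circleci.com/api/v2/project/"+projectSlug+"/pipeline?"+q.Encode(), nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Circle-Token", apiToken)
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		var p page
		err = json.NewDecoder(resp.Body).Decode(&p)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		all = append(all, p.Items...)
		if p.NextPageToken == "" {
			return all, nil
		}
		pageToken = p.NextPageToken
	}
}
```

If the token is sent under the wrong parameter name, the API simply ignores it and every request returns the same first page of (by default) 20 items.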

What do you expect to happen

Data is collected from the full specified time range, e.g. starting from 2024-01-01 (or whenever specified).

How to reproduce

  1. Configure a CircleCI connection using the plugin
  2. Associate this to a project
  3. Set a time range (or leave as default for 6 months)
  4. Run a data collection (either normally or in full refresh mode)
  5. Check the _tool_circleci_workflows, _tool_circleci_pipelines or _tool_circleci_jobs tables for the expected row count and the earliest started_at or created_at timestamp (see below)

Anything else

As an aside (but potentially related), I notice there are discrepancies between the column names across the three CircleCI tool tables, e.g.

  • On _tool_circleci_workflows - created_at is the timestamp the workflow was triggered in CircleCI. There is no other column which could represent the start of the workflow in CircleCI.
  • On _tool_circleci_jobs - created_at is the timestamp the row was created in the DevLake DB, and started_at is the CircleCI timestamp.
  • On _tool_circleci_pipelines - created_at is again the timestamp of DevLake DB creation. There is created_date, but this always seems to be NULL. As with the workflows table, there doesn't appear to be any column which represents the starting timestamp in CircleCI.

Version

v1.0.0

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@Nickcw6 added the type/bug label Jul 16, 2024
@dosubot (bot) added the component/plugins and severity/p1 labels Jul 16, 2024

dosubot bot commented Jul 16, 2024

Hey there, @Nickcw6! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this issue together!

The issue you're experiencing with the CircleCI plugin collecting data only from the past 24 hours, regardless of the Time Range setting, is likely due to the timeAfter parameter not being correctly set. Here are the steps to ensure the timeAfter parameter is set correctly and to collect data from the full specified time range:

  1. Create a Project: Ensure you have created a project as described in the documentation.
  2. Add a CircleCI Connection: Add the CircleCI connection to the project.
  3. Set the Sync Policy:
    • Go to the Sync Policy settings.
    • Set the Data Time Range to the desired time range instead of the default past 24 hours.

Example:

### Step 2.3 - Set the Sync Policy
There are three settings for Sync Policy:
- **Data Time Range**: You can select the time range of the data you wish to collect. The default is set to the past six months. Ensure this is set to the full time range you need.
- **Sync Frequency**: Choose how often you would like to sync your data by selecting a sync frequency option or enter a cron code to specify your preferred schedule.
- **Skip Failed Tasks**: Choose to skip failed tasks to avoid spending more time running the pipeline all over again.
  4. Run the Plugin with Correct Parameters: Ensure that the timeAfter parameter is set correctly when running the plugin standalone. For example (wrapped in a main package here for completeness; the import path is approximate and PluginEntry is the plugin's exported entry point):

package main

import (
	"github.com/apache/incubator-devlake/core/runner" // path may differ between DevLake versions
	"github.com/spf13/cobra"
)

func main() {
	cmd := &cobra.Command{Use: "CircleCi"}
	connectionId := cmd.Flags().Uint64P("connection", "c", 0, "CircleCi connection id")
	projectSlug := cmd.Flags().StringP("project", "p", "", "CircleCi project slug")
	timeAfter := cmd.Flags().StringP("timeAfter", "a", "", "collect data created after the specified time, e.g. 2006-01-02T15:04:05Z")

	cmd.Run = func(cmd *cobra.Command, args []string) {
		runner.DirectRun(cmd, args, PluginEntry, map[string]interface{}{
			"connectionId": *connectionId,
			"projectSlug":  *projectSlug,
		}, *timeAfter)
	}

	runner.RunCmd(cmd)
}

By setting the timeAfter parameter correctly and starting the data collection, the CircleCI plugin should collect data from the full specified time range [1][2][3][4].

Additionally, ensure that there are no API rate limits or other issues affecting the data collection. The CircleCI plugin handles pagination and rate limits, but if you encounter specific error messages or logs, they can provide insights into why data collection might be limited [5][6][7][8].


@klesh (Contributor) commented Jul 17, 2024

The discrepancies are valid: All created_at should be "the timestamp of DevLake DB creation" and it should not be empty.

However, I can't reproduce the data missing bug. It seems to be working correctly on my end:

[Screenshot of the collected workflow data]

Can you check the _raw_circleci_api_workflows table and see if the records match your expectations?
Maybe you could try requesting the API and see what it returns?

@Nickcw6 changed the title from "[Bug][CircleCI Plugin] Data only collected from previous <24 hours" to "[Bug][CircleCI Plugin] Only collecting first page of workflow & pipeline API response" on Jul 17, 2024
@Nickcw6 (Contributor, Author) commented Jul 17, 2024

Hey @klesh, thanks for your response - I think I've figured out what's happening here. I've updated the original post.

Only the first page of the Get all pipelines CircleCI API response is being collected - i.e. just 20 pipelines total - which has a knock-on effect when subsequently attempting to collect the workflows and jobs. This is consistent with always seeing 20 rows in the _raw_circleci_api_pipelines table, and explains the inconsistent date-range behaviour I was originally seeing.

I think the issue is on this line in the pipeline collector - it sets the query param as page_token, but according to the API docs it should be page-token. The same applies to the workflow collector here, and to the job collector.
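
If it helps, the fix is essentially a parameter rename in each collector's query construction. A minimal sketch, assuming the collectors build their query as a url.Values map; the helper wiring is simplified here and buildPageQuery is an illustrative name, not the plugin's actual code:

```go
package circlecipaging

import "net/url"

// buildPageQuery is a simplified stand-in for the collectors' query callback;
// it only illustrates the parameter rename that fixes the pagination bug.
func buildPageQuery(nextPageToken string) url.Values {
	query := url.Values{}
	if nextPageToken != "" {
		// Was query.Set("page_token", ...): CircleCI's v2 API expects "page-token",
		// so the underscore variant is ignored and only the first page is ever returned.
		query.Set("page-token", nextPageToken)
	}
	return query
}
```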

@Nickcw6 changed the title from "[Bug][CircleCI Plugin] Only collecting first page of workflow & pipeline API response" to "[Bug][CircleCI Plugin] Only collecting first page of API responses" on Jul 17, 2024
@klesh (Contributor) commented Jul 18, 2024

@Nickcw6 Thanks for the information - it is very valuable. Would you like to put up a PR to fix the problem? Thanks in advance.

@Nickcw6 (Contributor, Author) commented Jul 18, 2024

@klesh Happy to give it a go over the weekend - I haven't worked in Go before which is the only reason I didn't offer originally 😅

Any advice on tackling this issue in particular, or is it as straightforward as it seems?

@klesh (Contributor) commented Jul 20, 2024

@Nickcw6 Nice, I think fixing the typo you found would be sufficient.
