
fix(ingestion/bigquery): user exceeded quota for concurrent project.lists requests #10578

Conversation

shubhamjagtap639
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 23, 2024
page_token: Optional[str] = None
while True:
projects_iterator = self.bq_client.list_projects(
max_results=5, page_token=page_token
Collaborator

why is max_results=5?

Contributor Author

Sorry, this should be 50, as the API returns at most 50 projects per page. Refer link

BigqueryProject(id=p.project_id, name=p.friendly_name)
for p in projects
]
projects: List[BigqueryProject] = []
Collaborator

add a comment explaining why this was required

Contributor Author

Yes, let me add the comment.
The thing is, we were not able to reproduce the error, but from the docs we learned that two projects.list requests are allowed per second, and we assumed that the BigQuery client method list_projects does not impose any limit on the underlying requests.
Hence we are adding a requests-per-second limit externally.

Collaborator

Yup, I understand that, but having it as a comment in the code will help future editors of this file.

@@ -165,6 +170,9 @@ def get_projects(self) -> List[BigqueryProject]:
page_token = projects_iterator.next_page_token
if page_token is None:
break
# Sleep of 30 sec to make limit of two requests per second
time.sleep(30)
Collaborator

we have a RateLimiter helper class - please use that instead
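For illustration, such a helper is used as a context manager around each API call. Since DataHub's actual implementation isn't quoted in this thread, the sliding-window limiter below is a minimal stand-in showing the pattern the reviewer is asking for:

```python
import time
from collections import deque

class SimpleRateLimiter:
    """Minimal sketch of a max_calls-per-period limiter, usable as a context
    manager. Illustrative only; DataHub ships its own RateLimiter helper."""

    def __init__(self, max_calls: int, period: float) -> None:
        self.max_calls = max_calls
        self.period = period
        self._calls: deque = deque()  # timestamps of recent calls

    def __enter__(self) -> "SimpleRateLimiter":
        while True:
            now = time.monotonic()
            # Drop timestamps that have fallen out of the sliding window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return self
            # Window is full: wait until the oldest call expires.
            time.sleep(self.period - (now - self._calls[0]))

    def __exit__(self, *exc) -> bool:
        return False

# Allow at most 2 calls per second, matching the projects.list quota:
limiter = SimpleRateLimiter(max_calls=2, period=1)
```

Each page fetch would then be wrapped in `with limiter:`, replacing the fixed `time.sleep(30)` with a wait only when the quota window is actually full.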

Contributor Author

Added

@@ -157,21 +157,19 @@ def get_projects(self) -> List[BigqueryProject]:
# Also project.list returns max 50 projects per page.
# Assuming list_projects method internally not adding any limit in requests call,
# externally we are adding limit in requests call per second.
rate_limiter = RateLimiter(max_calls=2, period=1)
Collaborator

Suggested change
rate_limiter = RateLimiter(max_calls=2, period=1)
rate_limiter = RateLimiter(max_calls=2, period=60)

Contributor Author

The quota for projects.list is 2 requests per second, not per minute.
I know I earlier added a sleep of 30 seconds, but that was incorrect.

)
with rate_limiter:
projects_iterator = self.bq_client.list_projects(
max_results=30, page_token=page_token
Collaborator

what is the max number of results in a single page? can you add a link to the docs here?

Contributor Author

@@ -149,6 +150,10 @@ def get_query_result(self, query: str) -> RowIterator:
return resp.result()

def get_projects(self) -> List[BigqueryProject]:
def _should_retry(exc: BaseException) -> bool:
logger.debug(f"Exception reason: {str(exc)}. Type: {type(exc)}")
Collaborator

add a comment here explaining what we previously tried and how we arrived at this solution
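For reference, the pattern in this hunk is a retry predicate handed to `google.api_core.retry.Retry`. The sketch below reproduces the idea with a plain loop so it runs without the Google client; the "Quota exceeded" match string is an assumption for illustration, not the predicate from the actual PR:

```python
import logging

logger = logging.getLogger(__name__)

def _should_retry(exc: BaseException) -> bool:
    """Retry only on transient quota errors; let everything else propagate."""
    logger.debug(f"Exception reason: {exc}. Type: {type(exc)}")
    return "Quota exceeded" in str(exc)

def call_with_retry(fn, max_attempts: int = 3):
    """Plain-loop stand-in for retry.Retry(predicate=_should_retry, timeout=20)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not _should_retry(exc):
                raise
```

With the real `Retry` object, the predicate is consulted for each raised exception and retries continue, with backoff, until the timeout elapses.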

projects: List[BigqueryProject] = []
page_token: Optional[str] = None
# Bigquery API has limit in calling project.list request i.e. 2 request per second.
# Also project.list returns max 50 projects per page.
Collaborator

add a link to the bigquery docs

# Also project.list returns max 50 projects per page.
# Assuming list_projects method internally not adding any limit in requests call,
# externally we are adding limit in requests call per second.
rate_limiter = RateLimiter(max_calls=1, period=2)
Collaborator

this is 1 request per 2 seconds, not 2 requests per second?
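The mismatch the reviewer points out is plain rate arithmetic: a calls-per-window limiter permits `max_calls / period` requests per second (assuming simple sliding-window semantics), so the two configurations differ by a factor of four:

```python
def allowed_rate(max_calls: int, period: float) -> float:
    """Requests per second permitted by a max_calls-per-period limit."""
    return max_calls / period

# The projects.list quota under discussion is 2 requests per second:
quota = 2.0
assert allowed_rate(max_calls=2, period=1) == quota      # matches the quota
assert allowed_rate(max_calls=1, period=2) == quota / 4  # 4x more conservative
```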

page_token=page_token,
retry=retry.Retry(predicate=_should_retry, timeout=20),
)
count = 0
Collaborator

make this more descriptive

BigqueryProject(id=p.project_id, name=p.friendly_name)
)
count += 1
logger.debug(f"Projects count per request {str(count)}")
Collaborator

no need for str

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Jun 14, 2024
@hsheth2 hsheth2 merged commit e6246c9 into datahub-project:master Jun 15, 2024
57 of 82 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
…ists requests (datahub-project#10578)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
yoonhyejin pushed a commit that referenced this pull request Jul 16, 2024
…ists requests (#10578)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>