Midsize PDF file yields Error 400: "The request's total referenced files bytes are too large to be read" #308

Open
rnckp opened this issue Feb 11, 2025 · 12 comments
Assignees
Labels
api: gemini-api; documentation (Improvements or additions to documentation); priority: p2 (Moderately-important priority. Fix may not be included in next release.); type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@rnckp

rnckp commented Feb 11, 2025

Environment details

  • Programming language: Python
  • OS: macOS 15.3
  • Language runtime version: 3.10
  • Package version: v.1.1.0

Steps to reproduce

I have a midsize PDF file (160 MB, 127 pages). I can successfully upload it with client.files.upload(file=pdf_path).

However, when I try to use the uploaded file in client.models.generate_content() I get this error:

ClientError: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': "The request's total referenced files bytes are too large to be read", 'status': 'INVALID_ARGUMENT'}}

My code works on the same PDF when I shorten it to, say, 10 pages.

How can I fix this error and use the full PDF?

Thanks in advance for any help in this matter.

@rnckp rnckp added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Feb 11, 2025
@jamg7
Collaborator

jamg7 commented Feb 13, 2025

Hi, @rnckp

I tested with a PDF of around 20 MB and it worked for me. Could this be related to the input token limit? Can you try to get the token count of your prompt?

I used the script below to get the token count:

response = client.models.count_tokens(
    model='gemini-2.0-flash-001',
    contents=[
        "summary the document",
        document,
    ],
)
print(response)

Also, which model are you using in your test? Each model has its own input_token_limit, which you can look up by calling client.models.list().
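
As a rough illustration of that lookup, the sketch below collects input_token_limit for each model returned by client.models.list(). The helper name input_token_limits and the stand-in client are hypothetical; they exist only so the snippet runs without an API key. With the real SDK one would presumably pass a genai.Client instead, assuming each listed model exposes name and input_token_limit.

```python
from types import SimpleNamespace


def input_token_limits(client):
    """Return {model name: input_token_limit} for every model the client lists."""
    limits = {}
    for model in client.models.list():
        # Some entries may not report a limit; getattr keeps the loop robust.
        limits[model.name] = getattr(model, "input_token_limit", None)
    return limits


# Hypothetical offline stand-in for a real client (names and numbers are made up).
# With the real SDK you would pass genai.Client(api_key=...) instead.
_fake_models = [
    SimpleNamespace(name="models/example-a", input_token_limit=1000),
    SimpleNamespace(name="models/example-b", input_token_limit=None),
]
fake_client = SimpleNamespace(models=SimpleNamespace(list=lambda: _fake_models))

print(input_token_limits(fake_client))
```

Note that a token limit and a file size limit are separate checks, so a document can stay under one and still exceed the other.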

@rnckp
Author

rnckp commented Feb 13, 2025

Hi @jamg7

Thanks for your help.

I was using gemini-2.0-flash.

To count tokens, I tried your exact code. It yields the error below. How can I fix this?

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
Cell In[12], line 1
----> 1 response = client.models.count_tokens(
      2     model="gemini-2.0-flash-001",
      3     contents=["summary the document", sample_file],
      4 )
      5 print(response)

File ~/miniconda3/envs/google/lib/python3.10/site-packages/google/genai/models.py:4507, in Models.count_tokens(self, model, contents, config)
   4504 request_dict = _common.convert_to_dict(request_dict)
   4505 request_dict = _common.encode_unserializable_types(request_dict)
-> 4507 response_dict = self._api_client.request(
   4508     'post', path, request_dict, http_options
   4509 )
   4511 if self._api_client.vertexai:
   4512   response_dict = _CountTokensResponse_from_vertex(
   4513       self._api_client, response_dict
   4514   )

File ~/miniconda3/envs/google/lib/python3.10/site-packages/google/genai/_api_client.py:449, in ApiClient.request(self, http_method, path, request_dict, http_options)
    439 def request(
    440     self,
    441     http_method: str,
   (...)
    444     http_options: HttpOptionsOrDict = None,
    445 ):
    446   http_request = self._build_request(
    447       http_method, path, request_dict, http_options
    448   )
--> 449   response = self._request(http_request, stream=False)
    450   json_response = response.json
    451   if not json_response:

File ~/miniconda3/envs/google/lib/python3.10/site-packages/google/genai/_api_client.py:384, in ApiClient._request(self, http_request, stream)
    380   return HttpResponse(
    381       response.headers, response if stream else [response.text]
    382   )
    383 else:
--> 384   return self._request_unauthorized(http_request, stream)

File ~/miniconda3/envs/google/lib/python3.10/site-packages/google/genai/_api_client.py:407, in ApiClient._request_unauthorized(self, http_request, stream)
    398 http_session = requests.Session()
    399 response = http_session.request(
    400     method=http_request.method,
    401     url=http_request.url,
   (...)
    405     stream=stream,
    406 )
--> 407 errors.APIError.raise_for_response(response)
    408 return HttpResponse(
    409     response.headers, response if stream else [response.text]
    410 )

File ~/miniconda3/envs/google/lib/python3.10/site-packages/google/genai/errors.py:100, in APIError.raise_for_response(cls, response)
     98 status_code = response.status_code
     99 if 400 <= status_code < 500:
--> 100   raise ClientError(status_code, response)
    101 elif 500 <= status_code < 600:
    102   raise ServerError(status_code, response)

ClientError: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': 'Request contains an invalid argument.', 'status': 'INVALID_ARGUMENT'}}

I tried this example code of yours to debug. This works fine:

response = client.models.count_tokens(
    model="gemini-2.0-flash-001",
    contents="why is the sky blue?",
)
print(response)

Output

total_tokens=7 cached_content_token_count=None

Btw: What I find confusing in your example code is that one time the sample file is given as an argument in the contents list, one time it is the file's name. Can either be used? However, neither is working in my case.

@jamg7
Collaborator

jamg7 commented Feb 13, 2025

Hi, @rnckp

I think the problem might be caused by the type of the sample_file object. I was using the return value of the files.upload() API, as shown in the code below:

sample_file = client.files.upload(file=doc_data, config={"mime_type": "application/pdf"})

@rnckp
Author

rnckp commented Feb 13, 2025

@jamg7 I did exactly the same, to no avail. It does not work and yields the error. What can I do to fix this?

Again, to avoid misunderstandings: everything works fine if I use the same PDF cut down to 10 pages. Not only can I process the smaller PDF, I can also count its tokens successfully. With the larger PDF it does not work, and I get the error mentioned above.

To give you more detail about the full PDF that does not work, here is the printout (slightly redacted) of the value returned after uploading the file:

File(name='files/iul834XXXXXX', display_name=None, mime_type='application/pdf', size_bytes=161704743, create_time=datetime.datetime(2025, 2, 13, 10, 11, 21, 517380, tzinfo=TzInfo(UTC)), expiration_time=datetime.datetime(2025, 2, 15, 10, 11, 21, 458824, tzinfo=TzInfo(UTC)), update_time=datetime.datetime(2025, 2, 13, 10, 11, 21, 517380, tzinfo=TzInfo(UTC)), sha256_hash='NGZjMmM5NzI2MGIxNjdmMjNkZmEzNWVjNWY3NzA5ODAzN2U1YTI5ZTM4OTViZTc0ZWU3MGJhOGI1XXXXXXXXXX==', uri='https://generativelanguage.googleapis.com/v1beta/files/iul834XXXXXX', download_uri=None, state=<FileState.ACTIVE: 'ACTIVE'>, source=<FileSource.UPLOADED: 'UPLOADED'>, video_metadata=None, error=None)

PS: I corrected my previous comment in regard to the example code and output above. Now it is the correct code snippet and the proper output.

@jamg7
Collaborator

jamg7 commented Feb 14, 2025

Thanks for the quick response, @rnckp. I just double-checked my test script. My previous test used a 20 MB PDF, not a 200 MB one. Sorry about that!

I just produced a 160MB file and got a similar error message:

google.genai.errors.ClientError: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': 'Request contains an invalid argument.', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.DebugInfo', 'detail': '[ORIGINAL ERROR] generic::invalid_argument: Document size exceeds supported limit: 166877608 v.s 52428800'}]}}
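
The two numbers in that DebugInfo detail can be pulled out programmatically. The sketch below is a minimal example that works on the payload exactly as printed above; the helper name extract_size_limit is hypothetical, and the regular expression is based only on this one message, so the wording should not be treated as a stable, documented format.

```python
import re

# Error payload copied from the message above.
error_payload = {
    "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT",
        "details": [
            {
                "@type": "type.googleapis.com/google.rpc.DebugInfo",
                "detail": "[ORIGINAL ERROR] generic::invalid_argument: "
                          "Document size exceeds supported limit: 166877608 v.s 52428800",
            }
        ],
    }
}

# Assumption: the detail text keeps the "<actual> v.s <limit>" shape seen here.
_SIZE_PATTERN = re.compile(r"exceeds supported limit: (\d+) v\.s (\d+)")


def extract_size_limit(payload):
    """Return (actual_bytes, limit_bytes) if a detail reports a size overrun, else None."""
    for entry in payload.get("error", {}).get("details", []):
        match = _SIZE_PATTERN.search(entry.get("detail", ""))
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


actual, limit = extract_size_limit(error_payload)
print(f"document: {actual} bytes, limit: {limit} bytes ({limit // 2**20} MiB)")
```

The reported limit of 52428800 bytes is exactly 50 × 1024 × 1024.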

@jamg7
Collaborator

jamg7 commented Feb 14, 2025

The error message indicates that the maximum file size is 52428800 bytes. Can you try a PDF file under that limit and see if it works?
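
A local size check before uploading avoids the round trip. The sketch below is a minimal example, assuming the 52428800-byte figure from the error message is the effective limit; the helper name check_pdf_size and the constant MAX_PDF_BYTES are hypothetical, and the demo uses a small temporary file rather than a real PDF.

```python
import os
import tempfile

# 52428800 bytes = 50 * 1024 * 1024, the limit reported in the error message.
MAX_PDF_BYTES = 50 * 1024 * 1024


def check_pdf_size(path, max_bytes=MAX_PDF_BYTES):
    """Return (ok, size_bytes), where ok is True if the file is at most max_bytes."""
    size = os.path.getsize(path)
    return size <= max_bytes, size


# Demo on a small temporary file, so the sketch runs without a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n" + b"0" * 1024)
    demo_path = tmp.name

ok, size = check_pdf_size(demo_path)
print(f"{size} bytes, within limit: {ok}")
os.remove(demo_path)
```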

@rnckp
Author

rnckp commented Feb 14, 2025

I can confirm that this works. I have created a PDF with 51231211 bytes. I now can count the tokens and OCR the file.

This finding seems to contradict your documentation here where you state that files can be up to 2GB.

You can use the File API to upload a document of any size. Always use the File API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.

Note: The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB.

Is ~52MB really the size limit that I can use? How can I process PDFs (or other documents) that are larger than this?

EDIT: I tried to submit a larger PDF in parts in one prompt, each part having a file size below the limit. This doesn't work either. The OCR stops somewhere in the second PDF part.

@jamg7
Collaborator

jamg7 commented Feb 14, 2025

Thanks for confirming that a PDF smaller than 52428800 bytes works, @rnckp.

The documentation you linked describes the storage limit, which is different from the file size that each model can handle.

I couldn't find public documentation on the PDF size limit for the Gemini API, so there may be a documentation gap here.

Meanwhile, closing this ticket as the direct issue has been addressed.

@jamg7 jamg7 closed this as completed Feb 14, 2025
@rnckp
Author

rnckp commented Feb 14, 2025

Hi @jamg7

Thanks. However, I respectfully disagree with your assessment. The documentation also covers file size and explicitly says, as cited above:

Note: The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB.

I'd like to ask again: Is ~52MB really the size limit that I can use? How can I process PDFs (or other documents) that are larger than this?

@pamorgan
Collaborator

Reopening - Thank you for reporting. The service team is investigating the root cause of the issue.

@pamorgan
Collaborator

The service currently supports only PDF files of 50 MB or less and 300 pages or less.
We will treat this issue as missing documentation and will open an internal feature request to increase the supported PDF file size.
Thank you for raising this issue.
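
Until the limit changes, one possible workaround is to split a large PDF into parts that each stay under both limits. The sketch below only plans the page ranges; it assumes every page is roughly the same size, which is an estimate, so each part's real size still needs checking after the split. The function name plan_page_ranges and both constants are hypothetical. Whether sending each part in its own request works was not confirmed in this thread; sending several parts in a single prompt was reported above not to work.

```python
import math

MAX_PDF_BYTES = 50 * 1024 * 1024  # 52428800, from the error message in this thread
MAX_PDF_PAGES = 300               # from the comment above


def plan_page_ranges(total_pages, total_bytes,
                     max_bytes=MAX_PDF_BYTES, max_pages=MAX_PDF_PAGES):
    """Split pages 1..total_pages into 1-based inclusive ranges meant to fit both limits.

    Assumes each page is about total_bytes / total_pages in size. This is only
    an estimate, so verify the actual size of every part after splitting.
    """
    avg_page_bytes = total_bytes / total_pages
    pages_by_size = max(1, math.floor(max_bytes / avg_page_bytes))
    pages_per_part = min(max_pages, pages_by_size)
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


# The PDF from this thread: 127 pages, 161704743 bytes.
print(plan_page_ranges(127, 161704743))
# → [(1, 41), (42, 82), (83, 123), (124, 127)]
```

With an average of about 1.27 MB per page, the size limit rather than the page limit determines the split here.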

@pamorgan pamorgan added the documentation Improvements or additions to documentation label Feb 20, 2025
@rnckp
Author

rnckp commented Feb 20, 2025

@pamorgan Thanks very much, Peter. I appreciate this clarification. 👍
