ai_extraction=True not working locally #11
Comments
Hi @sisyga, the ai_extraction parameter is only available from the API at the moment. When running locally on PDFs with lots of pages, I experience this problem too. That is a reasonable workaround, although I don't think it is sufficient for the reasons you mentioned. I am actually not sure what would be sufficient. I am toying with the idea of training a page-image classifier to filter out pages without visuals or tables, but this is quite demanding. If you have any additional ideas, I would love to hear them!
Hey, thanks for working to open-source the AI classifier. In the meantime, I use the following workaround:

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # extract text and images of each page from the PDF
        with open(file_path, 'rb') as file:
            doc = fitz.open(file_path)
            for page in doc:
                text = page.get_text()
                image_list = page.get_image_info()
                drawing_commands = page.get_drawings()
                drawing_count = len(drawing_commands)
                if text_only:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
                elif image_list or drawing_count > 5:  # only make a snapshot if there is an image or more than 5 lines drawn
                    pix = page.get_pixmap()
                    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                    chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
                else:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            doc.close()
    return chunks

Basically, I extract the number of drawing commands, and if it is higher than a threshold (here: 5, which could be implemented as an option), I make an image snapshot. This is working all right since complex formulas and table lines also count toward the drawing commands, which is what I want.
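To make the "threshold as an option" idea concrete, here is a minimal sketch of the per-page decision pulled out into its own function. The helper name needs_snapshot and the parameter min_drawings are my own assumptions for illustration; they are not part of thepipe. The function only takes plain counts, so it can be tried without a PDF at hand.

```python
def needs_snapshot(image_count: int, drawing_count: int, min_drawings: int = 5) -> bool:
    """Decide whether a page should also be rendered as an image.

    Hypothetical helper (not part of thepipe): a page qualifies if it embeds
    at least one image, or if it has more than `min_drawings` vector drawing
    commands (table rules, formula strokes, plots).
    """
    return image_count > 0 or drawing_count > min_drawings


# Made-up per-page counts: (embedded images, drawing commands)
pages = [(0, 0), (1, 0), (0, 3), (0, 12)]
flags = [needs_snapshot(images, drawings) for images, drawings in pages]
print(flags)  # [False, True, False, True]
```

In the workaround above, image_count would be len(page.get_image_info()) and drawing_count would be len(page.get_drawings()); passing a different min_drawings would tune how aggressively pages are snapshotted.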
Hi! Not sure if this is a bug or a feature, but I'd love to use the ai_extraction option to improve the handling of PDF documents. However, enabling this option overrides the local=True option.

MWE:

Throws the error:
Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.

It works without enabling ai_extraction, but I don't like that it adds every page as an image to the messages, because this massively increases the token count for longer PDFs.

As a workaround, I adapted the extract_pdf function to extract PDF pages as images only if the page contains an image. It would be great to have this as an option. (I know this approach is not optimal, as it misses tables and some images containing only SVG objects; maybe a better option is possible based only on the fitz library, but I am no expert in this package.)
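For what it's worth, the confusing part is that the option conflict only surfaces as a "No valid API key" error. Below is a rough sketch of an early check that would state the conflict directly. The function choose_backend and its arguments are hypothetical and do not reflect how thepipe resolves these options internally; the only fact it relies on is the maintainer's note that ai_extraction is currently API-only.

```python
from typing import Optional


def choose_backend(local: bool, ai_extraction: bool, api_key: Optional[str]) -> str:
    """Return "api" or "local", rejecting option combinations that cannot work.

    Hypothetical helper for illustration only; not thepipe's actual logic.
    """
    if ai_extraction:
        # ai_extraction is API-only, so it needs a key regardless of `local`.
        if not api_key:
            raise ValueError(
                "ai_extraction=True is only available through the API and needs "
                "an API key; it cannot be combined with local-only extraction."
            )
        return "api"
    return "local" if local else "api"


print(choose_backend(local=True, ai_extraction=False, api_key=None))  # local
```

With a check like this, local=True together with ai_extraction=True and no key would fail with a message that names both options, instead of only mentioning the missing key.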