[formrecognizer] Proposal: child element navigator function #21352

Closed
@@ -61,6 +61,7 @@
    AccountInfo,
    DocumentAnalysisError,
    DocumentAnalysisInnerError,
    get_document_content_elements,
)
from ._api_versions import FormRecognizerApiVersion, DocumentAnalysisApiVersion

@@ -123,6 +124,7 @@
    "AccountInfo",
    "DocumentAnalysisError",
    "DocumentAnalysisInnerError",
    "get_document_content_elements",
]

__VERSION__ = VERSION
@@ -4039,3 +4039,38 @@ def from_dict(cls, data):
innererror=DocumentAnalysisInnerError.from_dict(data.get("innererror")) # type: ignore
if data.get("innererror") else None
)


class ElementNavigator(object):
    """Provides element navigation methods."""
Member

My initial thought is that it might make sense to "pre-compute" everything in the constructor. If someone instantiates this class, they are interested in navigating the elements, so I think it's okay to make that assumption and take the hit up front.

By "pre-compute", I'm wondering if we can take, for example, all the words and lines and index them by their span offset and length. That way we can jump straight to the offset of the span that was passed in and hopefully do just a few quick calculations to collect everything it contains. Untested example of the kind of pre-computing I have in mind:

eles = {}
for page in document.pages:
    for word in page.words:
        if word.span.offset not in eles:
            eles[word.span.offset] = {}
        eles[word.span.offset][word.span.length] = word

for page in document.pages:
    for line in page.lines:
        for span in line.spans:
            if span.offset not in eles:
                eles[span.offset] = {}
            eles[span.offset][span.length] = line

Maybe we keep words and lines separate; I guess it depends on whether we ever want to return a heterogeneous collection. Either way, I think we should then be able to enable a scenario like this:

poller = client.begin_analyze_document("prebuilt-document", myfile)
result = poller.result()

nav = ElementNavigator(result)
lines = nav.get_lines(result.documents[0])

Please poke holes in this since I know you've spent much more time thinking on this. :)
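
One way the pre-computing idea might look, as a rough sketch rather than a proposal for the final API: sort the words by span offset once, then binary-search for the first candidate on each lookup. `WordIndex` and the namedtuple stand-ins below are hypothetical and only mimic the `span`/`spans` attributes used in this PR; the sketch also assumes word spans do not overlap.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical stand-ins for the SDK models (only the attributes used here).
Span = namedtuple("Span", "offset length")
Word = namedtuple("Word", "content span")
Line = namedtuple("Line", "content spans")
Page = namedtuple("Page", "words lines")


class WordIndex(object):
    """Hypothetical helper: indexes words by offset once, up front."""

    def __init__(self, pages):
        words = [word for page in pages for word in page.words]
        # Sort by offset so bisect can locate the first candidate word.
        self._words = sorted(words, key=lambda w: w.span.offset)
        self._offsets = [w.span.offset for w in self._words]

    def get_words(self, element):
        """Return the words fully contained in any of the element's spans."""
        found = []
        for span in element.spans:
            end = span.offset + span.length
            # First word whose offset is at or after the span start.
            i = bisect_left(self._offsets, span.offset)
            while i < len(self._words) and self._offsets[i] < end:
                word = self._words[i]
                if word.span.offset + word.span.length <= end:
                    found.append(word)
                i += 1
        return found


words = [
    Word("Hello", Span(0, 5)),
    Word("world", Span(6, 5)),
    Word("again", Span(12, 5)),
]
line = Line("Hello world", [Span(0, 11)])
index = WordIndex([Page(words, [line])])
contained = index.get_words(line)
```

Each lookup then only touches the words that start inside the span instead of every word on the page, at the cost of holding one sorted list in memory.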

Also, my (maybe poor) understanding is that you should be able to pass any type that has span or spans into the helpers, since the text/elements they are made up of are accessible through AnalyzeResult.content. I see that these types all have spans, but some are atomic types (like words), so maybe we would throw if somebody passed one of those:

AnalyzedDocument
DocumentEntity
DocumentField
DocumentKeyValueElement
DocumentLine
DocumentPage
DocumentSelectionMark
DocumentStyle
DocumentTable
DocumentTableCell
DocumentWord

(I'm still thinking about this but I'm just going to hit 'Enter' on the comment for now and come back to it) 😃
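
A minimal sketch of the "accept anything with spans, throw on atomic types" idea, assuming (as in the code in this PR) that container-like elements expose a `spans` list while atomic ones such as words expose a single `span`. The `get_spans` helper and the namedtuple stand-ins are hypothetical, not part of the SDK.

```python
from collections import namedtuple

# Hypothetical stand-ins for the SDK models (only the attributes used here).
Span = namedtuple("Span", "offset length")
Word = namedtuple("Word", "content span")
Table = namedtuple("Table", "spans")


def get_spans(element):
    """Return the element's spans, rejecting atomic elements."""
    spans = getattr(element, "spans", None)
    if spans is not None:
        return list(spans)
    if hasattr(element, "span"):
        # Atomic elements (e.g. words) have nothing to navigate into.
        raise ValueError(
            "{} is an atomic element and has no child elements".format(
                type(element).__name__
            )
        )
    raise TypeError("element has neither 'span' nor 'spans'")


table_spans = get_spans(Table([Span(0, 3), Span(10, 4)]))
```

Whether passing a word should raise or simply return an empty list is a design choice; raising makes the misuse visible early.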

Member Author

I like your idea of "pre-computing" the elements by their span offset! After my discussion with Johan, I don't think this is too much of a concern right now, but it's a good idea to keep in our back pocket as a future improvement, depending on how the implementation goes. The elems dict might take some memory, but it may never be large enough to be a problem.


def get_document_content_elements(base_element, page, search_elements):
    # type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
    result = []
    for elem in search_elements:
        if elem == "words":
            for word in page.words:
                # performance wise this is not great since it runs through ALL the words every time even if the line is very short
                for span in base_element.spans:
                    if word.span.offset >= span.offset and (
                        word.span.offset + word.span.length
                    ) <= (span.offset + span.length):
                        result.append(word)
        elif elem == "selection_marks":
            for mark in page.selection_marks:
                for span in base_element.spans:
                    if mark.span.offset >= span.offset and (
                        mark.span.offset + mark.span.length
                    ) <= (span.offset + span.length):
                        result.append(mark)
    return result

def get_document_structure_elements(base_element, analyze_result, search_elements):
    # type: (DocumentLine, AnalyzeResult, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
    # TODO implementation
    return

def get_styles(element, analyze_result):
    # type: (Union[DocumentContentElement, DocumentStructureElement, DocumentPageElement], AnalyzeResult) -> List[DocumentStyle]
    # TODO implementation
    return
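
The two branches of get_document_content_elements above repeat the same span-containment test. As an illustrative sketch only (not the PR's code), that test could be pulled into a small helper; `span_contains`, `contained_elements`, and the namedtuple stand-ins are hypothetical names. Using `any` also means an element is returned at most once even if it falls inside more than one of the base element's spans.

```python
from collections import namedtuple

# Hypothetical stand-ins for the SDK models (only the attributes used here).
Span = namedtuple("Span", "offset length")
Word = namedtuple("Word", "content span")
Mark = namedtuple("Mark", "state span")
Line = namedtuple("Line", "spans")


def span_contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return (
        inner.offset >= outer.offset
        and inner.offset + inner.length <= outer.offset + outer.length
    )


def contained_elements(base_element, candidates):
    """Return candidates whose span lies within any span of base_element."""
    return [
        candidate
        for candidate in candidates
        if any(span_contains(span, candidate.span) for span in base_element.spans)
    ]


line = Line([Span(10, 20)])
words = [Word("before", Span(0, 6)), Word("inside", Span(12, 6)), Word("edge", Span(26, 6))]
marks = [Mark("selected", Span(19, 1))]

inside_words = contained_elements(line, words)
inside_marks = contained_elements(line, marks)
```

A check like this can also be exercised in unit tests with plain stand-in objects, without calling the service.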
@@ -7,7 +7,7 @@
import functools
from azure.ai.formrecognizer._generated.models import AnalyzeResultOperation
from azure.ai.formrecognizer import DocumentAnalysisClient
-from azure.ai.formrecognizer import AnalyzeResult
+from azure.ai.formrecognizer import AnalyzeResult, get_document_content_elements
from preparers import FormRecognizerPreparer
from testcase import FormRecognizerTest
from preparers import GlobalClientPreparer as _GlobalClientPreparer
@@ -18,6 +18,18 @@

class TestDocumentFromStream(FormRecognizerTest):

    @FormRecognizerPreparer()
    @DocumentAnalysisClientPreparer()
    def test_document_line_get_words(self, client):
        with open(self.selection_form_pdf, "rb") as fd:
            document = fd.read()

        poller = client.begin_analyze_document("prebuilt-document", document)
        result = poller.result()

        res = get_document_content_elements(result.pages[0].lines[13], result.pages[0], ["words", "selection_marks"])
        assert len(res) == 1

    @FormRecognizerPreparer()
    @DocumentAnalysisClientPreparer()
    def test_document_stream_transform_pdf(self, client):