Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[formrecognizer] Proposal: child element navigator function #21352

Closed

Conversation

catalinaperalta
Copy link
Member

The idea here is to provide a package level function that allows users to pass in a certain type of document element and find it's children elements. For instance in the case of a DocumentStructureElement users can call the get_document_content_elements function to find the child document content elements, such as "words" and "selection_marks".

This removes the need to repeat the get children method on each model and also removes the need to store a reference to the parent in those models.

Currently this is the lazy way of performing the search which is at N^3, but this could get down to N log N if we can switch to binary search if all of the spans and document content elements are already sorted.

@catalinaperalta catalinaperalta changed the title [formrecognizer] Proposal: children navigator function [formrecognizer] Proposal: child element navigator function Oct 20, 2021
@azure-sdk
Copy link
Collaborator

API changes have been detected in azure-ai-formrecognizer. You can review API changes here

API changes

+ def azure.ai.formrecognizer.get_document_content_elements(
+         base_element, 
+         page, 
+         search_elements
+     ) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
+  
+  

@@ -4039,3 +4039,24 @@ def from_dict(cls, data):
innererror=DocumentAnalysisInnerError.from_dict(data.get("innererror")) # type: ignore
if data.get("innererror") else None
)

def get_document_content_elements(base_element, page, search_elements):
# type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should technically be something like (NOTE: we dont yet implement the DocumentStructureElement base class):

Suggested change
# type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
# type: (DocumentStructureElement, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]



class ElementNavigator(object):
"""Provides element navigation methods."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my initial thought is that maybe it makes sense to "pre-compute" everything in the constructor. If someone instantiates this class it means they are interested in navigating the elements so I think it could be okay to make that assumption / take the hit.

By "pre-compute" I'm wondering if we can take, for example, all the words and lines and kind of categorize them by their offset/length? This way we can jump straight to the offset of the span of the thing passed and hopefully just do a few quick calcs to include everything it contains. Untested example of the "pre-computing" I'm kind of thinking of:

eles = {}
for page in document.pages:
    for word in page.words:
        if word.span.offset not in eles:
            eles[word.span.offset] = {}
        eles[word.span.offset][word.span.length] = word

for page in document.pages:
    for line in page.lines:
        for span in line.spans:
            if span.offset not in eles:
                eles[span.offset] = {}
            eles[span.offset][span.length] = line

Maybe we keep words and lines separate, guess it depends on if we want to return a heterogeneous collection at any point. But then I think we should be able to enable a scenario like this?

poller = client.begin_analyze_document("prebuilt-document", myfile)
result = poller.result()

nav = ElementNavigator(result)
lines = nav.get_lines(result.documents[0])

Please poke holes in this since I know you've spent much more time thinking on this. :)

Also my (maybe poor) understanding is that you should be able to pass in any type that contains span or spans into the helpers (since the text/elements that they are comprised of are accessible through the AnalyzeResult.content). I see that these types all have spans, but some are kind of atomic types (like words) so maybe we would throw if somebody passed that).

AnalyzedDocument
DocumentEntity
DocumentField
DocumentKeyValueElement
DocumentLine
DocumentPage
DocumentSelectionMark
DocumentStyle
DocumentTable
DocumentTableCell
DocumentWord

(I'm still thinking about this but I'm just going to hit 'Enter' on the comment for now and come back to it) 😃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your idea about how to "pre-compute" the elements by their span offset! I think that after my discussion with Johan this isn't too much of a concern, but I think this is a good idea to keep in our back pocket for a future improvement depending on how the implementation goes because the elems dict might take some memory but at the same time maybe it wont ever be anything considerable that would be a problem.

@catalinaperalta
Copy link
Member Author

Closing this, since the work is being done in: #21224.

azure-sdk pushed a commit to azure-sdk/azure-sdk-for-python that referenced this pull request Feb 14, 2023
Release apimangement 2022-08 01 (Azure#22313)

* Release api management 2022 08 01 (Azure#21169)

* add base for 2022-08-01

* updates readme.md

* update api-version and examples

* resolve Go SDK breaking change

* update examples to fix model definition

Co-authored-by: Chenjie Shi <tadelesh.shi@live.cn>

* Adding Resolver Entities to documentation (Azure#21352)

* merging in resolver entities

* adding resolver updates

* fixed policy examples that were missing a policyId param

* fixed typo that added a nested properties prop

* fixed description for API Resolvers List, filter query

* fixed error in definitions for resolvers

* fixed example that had an incorrect response definition

* added missing json file references

* fix for linting errors

* ref fixes and undoing bad merge overwrites

* fix typo

* wiki for apis and products  (Azure#21595)

* wiki for apis and products

* prettier; fixed spellchecks

* Fixed spelling

* Updated path

Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com>

* code review changes; gihub checks fixes

* fixed conact name and paths

* added properties back

* added contract properties objects

* fixed lint

* changed example to match the definition

* prettier

* code review changes

* added paths and examples for wiki collections

* Added x-ms-pageable

* removed count

Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com>

* Add ConfirmConsentCode to APIM RP (Azure#22418)

* Update apimauthorizationproviders.json

Add ConfirmConsentCode endpoint

* Update definitions.json

* Create ApiManagementPostAuthorizationConfirmConsentCodeRequest.json

* Update definitions.json

Remove count property

* Update apimauthorizationproviders.json

* [2022-08-01] Fix Linter Errors (Azure#22537)

* linter errors

* prettier

* v2

* fix error response duplicate schema

* type object

* prettier

* sync data from current ga version

* remove duplicate parameter

* type object

---------

Co-authored-by: Chenjie Shi <tadelesh.shi@live.cn>
Co-authored-by: hoonality <92482069+hoonality@users.noreply.github.com>
Co-authored-by: malincrist <92857141+malincrist@users.noreply.github.com>
Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com>
Co-authored-by: Logan Zipkes <44794089+LFZ96@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants