-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[formrecognizer] Proposal: child element navigator function #21352
[formrecognizer] Proposal: child element navigator function #21352
Conversation
API changes have been detected in API changes + def azure.ai.formrecognizer.get_document_content_elements(
+ base_element,
+ page,
+ search_elements
+ ) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
+
+ |
@@ -4039,3 +4039,24 @@ def from_dict(cls, data): | |||
innererror=DocumentAnalysisInnerError.from_dict(data.get("innererror")) # type: ignore | |||
if data.get("innererror") else None | |||
) | |||
|
|||
def get_document_content_elements(base_element, page, search_elements): | |||
# type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should technically be something like (NOTE: we dont yet implement the DocumentStructureElement base class):
# type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]] | |
# type: (DocumentStructureElement, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]] |
|
||
|
||
class ElementNavigator(object): | ||
"""Provides element navigation methods.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my initial thought is that maybe it makes sense to "pre-compute" everything in the constructor. If someone instantiates this class it means they are interested in navigating the elements so I think it could be okay to make that assumption / take the hit.
By "pre-compute" I'm wondering if we can take, for example, all the words and lines and kind of categorize them by their offset/length? This way we can jump straight to the offset of the span of the thing passed and hopefully just do a few quick calcs to include everything it contains. Untested example of the "pre-computing" I'm kind of thinking of:
eles = {}
for page in document.pages:
for word in page.words:
if word.span.offset not in eles:
eles[word.span.offset] = {}
eles[word.span.offset][word.span.length] = word
for page in document.pages:
for line in page.lines:
for span in line.spans:
if span.offset not in eles:
eles[span.offset] = {}
eles[span.offset][span.length] = line
Maybe we keep words and lines separate, guess it depends on if we want to return a heterogeneous collection at any point. But then I think we should be able to enable a scenario like this?
poller = client.begin_analyze_document("prebuilt-document", myfile)
result = poller.result()
nav = ElementNavigator(result)
lines = nav.get_lines(result.documents[0])
Please poke holes in this since I know you've spent much more time thinking on this. :)
Also my (maybe poor) understanding is that you should be able to pass in any type that contains span
or spans
into the helpers (since the text/elements that they are comprised of are accessible through the AnalyzeResult.content
). I see that these types all have spans, but some are kind of atomic types (like words) so maybe we would throw if somebody passed that).
AnalyzedDocument
DocumentEntity
DocumentField
DocumentKeyValueElement
DocumentLine
DocumentPage
DocumentSelectionMark
DocumentStyle
DocumentTable
DocumentTableCell
DocumentWord
(I'm still thinking about this but I'm just going to hit 'Enter' on the comment for now and come back to it) 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your idea about how to "pre-compute" the elements by their span offset! I think that after my discussion with Johan this isn't too much of a concern, but I think this is a good idea to keep in our back pocket for a future improvement depending on how the implementation goes because the elems dict might take some memory but at the same time maybe it wont ever be anything considerable that would be a problem.
Closing this, since the work is being done in: #21224. |
Release apimangement 2022-08 01 (Azure#22313) * Release api management 2022 08 01 (Azure#21169) * add base for 2022-08-01 * updates readme.md * update api-version and examples * resolve Go SDK breaking change * update examples to fix model definition Co-authored-by: Chenjie Shi <tadelesh.shi@live.cn> * Adding Resolver Entities to documentation (Azure#21352) * merging in resolver entities * adding resolver updates * fixed policy examples that were missing a policyId param * fixed typo that added a nested properties prop * fixed description for API Resolvers List, filter query * fixed error in definitions for resolvers * fixed example that had an incorrect response definition * added missing json file references * fix for linting errors * ref fixes and undoing bad merge overwrites * fix typo * wiki for apis and products (Azure#21595) * wiki for apis and products * prettier; fixed spellchecks * Fixed spelling * Updated path Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com> * code review changes; gihub checks fixes * fixed conact name and paths * added properties back * added contract properties objects * fixed lint * changed example to match the definition * prettier * code review changes * added paths and examples for wiki collections * Added x-ms-pageable * removed count Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com> * Add ConfirmConsentCode to APIM RP (Azure#22418) * Update apimauthorizationproviders.json Add ConfirmConsentCode endpoint * Update definitions.json * Create ApiManagementPostAuthorizationConfirmConsentCodeRequest.json * Update definitions.json Remove count property * Update apimauthorizationproviders.json * [2022-08-01] Fix Linter Errors (Azure#22537) * linter errors * prettier * v2 * fix error response duplicate schema * type object * prettier * sync data from current ga version * remove duplicate parameter * type object --------- Co-authored-by: Chenjie Shi <tadelesh.shi@live.cn> Co-authored-by: hoonality <92482069+hoonality@users.noreply.github.com> Co-authored-by: malincrist <92857141+malincrist@users.noreply.github.com> Co-authored-by: changlong-liu <59815250+changlong-liu@users.noreply.github.com> Co-authored-by: Logan Zipkes <44794089+LFZ96@users.noreply.github.com>
The idea here is to provide a package level function that allows users to pass in a certain type of document element and find it's children elements. For instance in the case of a DocumentStructureElement users can call the
get_document_content_elements
function to find the child document content elements, such as "words" and "selection_marks".This removes the need to repeat the get children method on each model and also removes the need to store a reference to the parent in those models.
Currently this is the lazy way of performing the search which is at N^3, but this could get down to N log N if we can switch to binary search if all of the spans and document content elements are already sorted.