Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When iterating over a layoutList, the page iterator moves on to layoutText after iterating over each individual list item #177

Closed
rsanaie opened this issue May 7, 2024 · 5 comments
Labels
bug Something isn't working javascript Relates to the JavaScript/TypeScript version of TRP

Comments

@rsanaie
Copy link

rsanaie commented May 7, 2024

When I'm traversing a layout object, after visiting a layout list, the iterator then moves on to individual list items, causing duplication of nodes visited. Is this intended?

This happens in .html() function as well. I know that you can use the childBlockIds of LayoutList to mark down which nodes have been visited and skip them later, but wanted to confirm this behaviour.

Thank you, you guys are awesome!

const ignoredBlocks: ApiBlockType[] = [
    ApiBlockType.LayoutHeader, 
    ApiBlockType.LayoutFooter,
    ApiBlockType.LayoutPageNumber
]

for (const page of doc.iterPages()) {
    for (const layItem of page.layout.iterItems()) {
        if (!ignoredBlocks.includes(layItem.blockType)) {
            console.log(layItem.blockType)
            console.log(layItem.text + "\n")
        }
    }
}
@athewsey athewsey added the javascript Relates to the JavaScript/TypeScript version of TRP label May 13, 2024
@athewsey
Copy link
Contributor

Hi, thanks for raising this!

Need to double-check I understand correctly here, but at first pass this is sounding like something we should fix...

It seems like the current state is:

  1. page.layout.iterItems() (and the corresponding listItems()) includes both the LAYOUT_LIST and any LAYOUT_TEXT children it links to (because they all seem to appear as CHILD blocks of the parent PAGE, which is where this list is derived from)
  2. This is confusing for anybody iterating through layout items, because they'd probably assume their code should deal with the LAYOUT_LIST all at once when it's seen - like it's iterating the top level of a tree and not a flat collection of all content nodes
  3. It's also causing page.html() to duplicate content because the .html() representation of the LAYOUT_LIST already includes the LAYOUT_TEXT items that the iterator picks up next.

Is that matching what you're seeing @rsanaie?

IMO 3/ would definitely be a bug. Agreeing on how to better handle 2/ might involve a few more edge cases...

This single-level List->Text tree is the only structure I'm aware of in current layout results, but at the service API level it seems like there's nothing to prevent adding more in future or defining deep nesting or more general graphs. I'd be really surprised if the service ever started returning circular relationships, but probably wouldn't completely rule out the possibility of shared children one day.

As a general rule, I've also been trying to drive as much parsing and iteration behaviour as possible directly from block Relationship lists (minimizing internal state like LayoutGeneric._items) to unblock future mutability of loaded documents at the expense of some runtime performance for certain repeated read operations.

...So I'm initially thinking yes we should change the behaviour of the page.layout.iterItems() iterator, and probably even make the new behaviour the default, but long-term might need to be a bit more formal in defining e.g. single-level & recursive layout child iterators across the different kinds of LayoutItem? For e.g. maybe every layout item could have child iterator methods like iterLayoutItems({ deep = false, skipBlockIds = {}}) or something - which would today always return empty except for LAYOUT_LIST, and which the top-level page.layout.iterItems() could use (in deep mode) to prevent exposing any nested children

@athewsey athewsey added the bug Something isn't working label May 13, 2024
@rsanaie rsanaie changed the title When iterating over a layoutList, the page iterator then moves on to layoutText for each individual list item When iterating over a layoutList, the page iterator moves on to layoutText after iterating over each individual list item May 14, 2024
@rsanaie
Copy link
Author

rsanaie commented May 14, 2024

Hi @athewsey yes that's the correct behavior I'm seeing.

I agree with you that #3 is a bug that needs to be fixed as the .html() is supposed to output usable data, but currently it's duplicating unnecessarily.

Using a tree analogy makes sense, maybe we should have an iterator that iterates through all nodes in some specific order, and then another iterator that visits children only (and leverage recursion)?

athewsey added a commit to athewsey/amazon-textract-response-parser that referenced this issue May 25, 2024
@athewsey
Copy link
Contributor

Hi @rsanaie

We have a draft bugfix release available at v0.4.1-alpha.1 to fix the serialization bug and provide some basic ability to:

  1. toggle page.layout.{list/iter}Items() between recursive and top-level-only behaviour
  2. access {list/iter}LayoutChildren() from any LayoutItem (although in practice today only LAYOUT_LIST should have anything, I think - rest will always be empty lists.

Maybe you'd be able to try it out and share thoughts?

Originally I was thinking we should make deep: false the default behaviour of page.layout.{iter/list}Items(), but maybe it's a breaking change for some folks... Might be worth revisiting for a v0.5 down the road?

@rsanaie
Copy link
Author

rsanaie commented May 30, 2024

It's working great as expected, thank you! It'd be ideal to specify which blocks should be skipped from html such as page number, header/footer) I'll create a new issue for that!

@athewsey
Copy link
Contributor

athewsey commented Jun 4, 2024

🙇 Thanks for the feedback!

The fix is now released to mainline version v0.4.1 on NPM with the merge of #178, so closing this issue & will refer to #179 for the other ask 👍

@athewsey athewsey closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working javascript Relates to the JavaScript/TypeScript version of TRP
Projects
None yet
Development

No branches or pull requests

2 participants