In the page's detected var names, dig into object variable names until no more nested objects are found #574

Closed
plocket opened this issue Jul 3, 2022 · 4 comments
Labels
! priority A combination of urgency and impact

Comments

@plocket
Collaborator

plocket commented Jul 3, 2022

In getPossibleVarNames, we're only decoding dicts/objects down one level. Those objects may have objects of their own. Continue decoding objects in a while loop until no more objects are detected.

Can the parts of the objects get doubly encoded? If so, what does that look like? Maybe a doubly encoded obj would end up looking like this:

foo[kwf23LER[B'EDckdleEsd]]

That first part (kwf23LER) would have to be decoded again separately?

#575 might be harder, but might be useful to make this work more reliably.

@BryceStevenWilley
Collaborator

Adding all of the info I've learned about how docassemble does these encodings.

On the "decoding" side (i.e. on the server, when it receives the POST sent as the user proceeds in the interview), it uses these regexes to search for variables with brackets:

match_brackets = re.compile(r'\[[BR]?\'[^\]]*\'\]$')
match_inside_and_outside_brackets = re.compile(r'(.*)(\[[BR]?\'[^\]]*\'\])$')
match_inside_brackets = re.compile(r'\[([BR]?)\'([^\]]*)\'\]')

This answers the question about whether different parts are encoded separately: they are not. Only the element inside the last set of brackets is encoded.
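To make the "last set of brackets" point concrete, here is a minimal sketch that applies the three quoted patterns to sample names. The helper split_last_bracket and the sample names are hypothetical, and the sketch assumes the B flag marks a base64 key (as in the object_multiselect example in this thread); it is not docassemble's own decoding code.

```python
import base64
import re

# The three patterns quoted above, copied verbatim.
match_brackets = re.compile(r'\[[BR]?\'[^\]]*\'\]$')
match_inside_and_outside_brackets = re.compile(r'(.*)(\[[BR]?\'[^\]]*\'\])$')
match_inside_brackets = re.compile(r'\[([BR]?)\'([^\]]*)\'\]')


def split_last_bracket(name):
    """Return (prefix, flag, key) for a name ending in a quoted bracket, else None."""
    outer = match_inside_and_outside_brackets.search(name)
    if outer is None:
        return None
    prefix, bracket = outer.groups()
    flag, payload = match_inside_brackets.search(bracket).groups()
    if flag != "B":
        # Only the B form is decoded in this sketch; other keys are returned as-is.
        return prefix, flag, payload
    return prefix, flag, base64.b64decode(payload).decode("utf-8")


# "YmFy" is base64 for "bar" (sample value chosen for this example).
print(split_last_bracket("foo[B'YmFy']"))  # ('foo', 'B', 'bar')
print(split_last_bracket("foo[bar]"))      # None: the last bracket has no quoted key
```

Only the trailing bracket is ever inspected, so any earlier brackets stay in the prefix untouched.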

As for whether that one element can be encoded more than once: I don't see any recursive encoding on the docassemble side, so the nesting depth is likely bounded. The only example I can find is an extra encoding level on fields with one of these datatypes:

  • object
  • object_radio
  • object_multiselect
  • object_checkboxes

For example, the value for an object multiselect in the HTML <option> is b2JqZWN0X211bHRpc2VsZWN0W0InWWpKS2NWZ3lPWGRrUmpoNidd, which decodes to object_multiselect[B'YjJKcVgyOXdkRjh6']. YjJKcVgyOXdkRjh6 decodes to b2JqX29wdF8z, which in turn decodes to obj_opt_3, the actual name of the object. The docassemble code is hard to follow, but one smoking gun is the for loop at parse.py:5896 (commit d6fdcdacf), where the selection key is decoded an extra time.
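The decoding chain described above can be reproduced with a short, bounded loop. This is only a sketch: decode_steps, try_b64, and the max_rounds cap are hypothetical names, the pattern is adapted from the quoted match_inside_and_outside_brackets regex, and restoring stripped padding is an assumption rather than something confirmed in this thread.

```python
import base64
import re

# Adapted from the match_inside_and_outside_brackets pattern quoted in this
# thread, with the flag and the quoted key captured as separate groups.
LAST_BRACKET = re.compile(r"(.*)\[([BR]?)'([^\]]*)'\]$")


def try_b64(text):
    """Return text decoded from base64 as UTF-8, or None if it does not decode."""
    if not text:
        return None
    # Assumption: padding may have been stripped, so restore it before decoding.
    padded = text + "=" * (-len(text) % 4)
    try:
        return base64.b64decode(padded, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def decode_steps(value, max_rounds=5):
    """Return each successive decoding of value, stopping when decoding fails."""
    steps = []
    current = value
    for _ in range(max_rounds):
        match = LAST_BRACKET.match(current)
        # If the name ends in [B'...'], only the key inside that bracket is encoded.
        payload = match.group(3) if match and match.group(2) == "B" else current
        decoded = try_b64(payload)
        if decoded is None:
            break
        steps.append(decoded)
        current = decoded
    return steps


for step in decode_steps("b2JqZWN0X211bHRpc2VsZWN0W0InWWpKS2NWZ3lPWGRrUmpoNidd"):
    print(step)
# object_multiselect[B'YjJKcVgyOXdkRjh6']
# b2JqX29wdF8z
# obj_opt_3
```

The stop condition is a heuristic: a plain variable name that happens to be valid base64 and valid UTF-8 would be decoded one level too far, which is one reason to cap the rounds or to check the field's datatype first.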

@plocket
Collaborator Author

plocket commented Aug 4, 2022

This sounds promising, I think. A couple of questions:

  1. Encoding for multiple brackets: Can you describe what that might look like for foo[bar[zoo]] and foo[bar][zoo]? If not, I can try it out later.

  2. Multiple encodings: So you're saying that the double encoding will reliably only ever extend to those three levels? That is, down to b2JqX29wdF8z?

Trying to predict the changes to the code is probably also unhelpful, so depending on the above, we may choose to avoid implementing a loop.

When we're detecting fields, is it possible to know (before decoding) when we're dealing with an object_multiselect specifically? Is that somewhere in the HTML DOM?

@BryceStevenWilley
Collaborator

BryceStevenWilley commented Aug 4, 2022

> Can you describe what that might look like for foo[bar[zoo]] and foo[bar][zoo]? If not, I can try it out later.

It's everything in the final top-level bracket, so for foo[bar[zoo]], bar[zoo] is multiply encoded. Here's some test code to explore:

---
mandatory: True
code: |
  obj_multi.gathered
  object_multiselect_test
---
objects:
  - obj_multi: DAList
---
code: |
  o1 = obj_multi.appendObject()
  o1.name = 'obj_multi_1'
  o2 = obj_multi.appendObject()
  o2.name = 'obj_multi_2'
  o3 = obj_multi.appendObject()
  o3.name = 'obj_multi_3'
  obj_multi.gathered = True
---
id: object multiselect
question: |
  Object Multiselect
fields:
  - Object multiselect: object_multiselect_test
    datatype: object_multiselect
    choices:
      - obj_multi[0]
      - obj_multi[1]
      - obj_multi[2]

I'm fairly certain that for foo[bar][zoo], only zoo will be multiply encoded, i.e. the base64 string should decode to foo[bar][B'<more base64>'].
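As an illustration of the shape described here, the following sketch builds such a name by hand. The functions b64 and encode_nested_name are hypothetical helpers, not docassemble's encoder, and the result shows only what the claim above predicts, not verified docassemble output.

```python
import base64


def b64(text):
    """Base64-encode a string and return the result as text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def encode_nested_name(prefix, last_key):
    """Return (inner_name, outer_value) with only the last key encoded separately."""
    inner_name = f"{prefix}[B'{b64(last_key)}']"
    return inner_name, b64(inner_name)


inner_name, outer_value = encode_nested_name("foo[bar]", "zoo")
print(inner_name)  # foo[bar][B'em9v']
# Decoding the outer value once gives back the bracketed name.
print(base64.b64decode(outer_value).decode("utf-8") == inner_name)  # True
```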

> So you're saying that the double encoding will reliably only ever extend to those three levels?

As far as I can find in the code, yes. Like I mentioned, the code is complicated (sometimes the base64-encoded var name is saved directly in the internal data structures, sometimes not), but there isn't anything that decodes in a loop or recursively.

> When we're detecting fields, is it possible to know (before decoding) when we're dealing with an object_multiselect specifically? Is that somewhere in the HTML DOM?

There is: the HTML fixture I put into #581 uses a multiple="" attribute and a class of damultiselect (I don't know whether that class name is subject to change).
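A rough sketch of that check, using only the Python standard library: it looks for select elements that carry both the multiple attribute and the damultiselect class. The MultiselectFinder class, the sample markup, and the field names are hypothetical; the markup is not the fixture from #581, and whether these two markers are enough to single out object_multiselect fields is an assumption.

```python
from html.parser import HTMLParser


class MultiselectFinder(HTMLParser):
    """Collect the name of each <select> that has `multiple` and class damultiselect."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "select":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # multiple="" parses as an empty string and a bare `multiple` as None,
        # so test for the attribute's presence rather than its value.
        if "multiple" in attributes and "damultiselect" in classes:
            self.names.append(attributes.get("name"))


# Hypothetical markup for illustration only.
SAMPLE = """
<select name="field_a" class="form-select damultiselect" multiple="">
  <option value="x">X</option>
</select>
<select name="field_b" class="form-select"></select>
"""

finder = MultiselectFinder()
finder.feed(SAMPLE)
print(finder.names)  # ['field_a']
```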

@plocket
Collaborator Author

plocket commented Oct 1, 2024

Done given the above assumptions.

@plocket plocket closed this as completed Oct 1, 2024

2 participants