
bug in generating data #4

Closed
jxzhn opened this issue Feb 24, 2022 · 4 comments
jxzhn commented Feb 24, 2022

I think line 112 (in function separate_dps) of cc/main/src/utils/utils.py should be

        aug_asts.append([ast[i : i + max_len], i + half_len])

instead of

        aug_asts.append([ast[i : i + max_len], half_len])
serjtroshin (Collaborator) commented Feb 24, 2022

Hi! half_len here denotes that the first half of the snippet will be processed as memory, while the last part is used for loss evaluation and prediction, so it seems there is no bug.

jxzhn (Author) commented Feb 26, 2022

But this is inconsistent with the docstring comments on lines 87 to 104:

    """
    Handles training / evaluation on long ASTs by splitting
    them into smaller ASTs of length max_len, with a sliding
    window of max_len / 2.

    Example: for an AST ast with length 1700, and max_len = 1000,
    the output will be:
    [[ast[0:1000], 0], [ast[500:1500], 1000], [ast[700:1700], 1500]]

    Input:
        ast : List[Dictionary]
            List of nodes in pre-order traversal.
        max_len : int

    Output:
        aug_asts : List[List[List, int]]
            List of (ast, beginning idx of unseen nodes)
    """

Also inconsistent with lines 114 to 115

    idx = max_len - (len(ast) - (i + half_len))
    aug_asts.append([ast[-max_len:], idx])

serjtroshin (Collaborator) commented

Oh, I see, there does seem to be an issue with the comment. The function's intent is described here: https://github.com/facebookresearch/code-prediction-transformer#splitting-large-trees

The function itself seems to be correct; it is the comments on lines 87 to 104 that do not match its behavior:
separate_dps(range(2022), 500) should produce
[[range(0, 500), 0],
[range(250, 750), 250],
[range(500, 1000), 250],
[range(750, 1250), 250],
[range(1000, 1500), 250],
[range(1250, 1750), 250],
[range(1500, 2000), 250],
[range(1522, 2022), 478]]

The values 0, 250, and 478 denote the start of the token block for which we make a prediction; 250 means that the first 250 tokens of a block are used as memory.
The first block, range(0, 500), is used entirely for prediction (no memory); for intermediate blocks the first half is used as memory; and for the last block, range(1522, 2022), the model uses elements range(1522, 2000) as memory and makes predictions for the remaining elements, range(2000, 2022). I hope this is clearer, thank you @jxzhn!

jxzhn (Author) commented Mar 9, 2022

Got it, thanks!
