Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reprun: Correct handling of data files for _cf in Sort RNG checks #6

Closed
Tracked by #5
bbdaniels opened this issue Nov 7, 2023 · 5 comments
Closed
Tracked by #5
Assignees
Labels
bug Something isn't working reprun Relates to reprun

Comments

@bbdaniels
Copy link
Contributor

Is there any way to check if data has changed and prevent the need to save copies of the data so frequently:

https://github.com/dime-worldbank/repkit/blob/96ca085e0b972982f61a3acb2130a52dbca5ede4/src/ado/reprun_dataline.ado#L34-L48

bbdaniels added a commit that referenced this issue Nov 8, 2023
Attempt to solve #6
@bbdaniels bbdaniels self-assigned this Nov 30, 2023
@bbdaniels bbdaniels added the resolved but unpublished Solution submitted but unreleased label Nov 30, 2023
@kbjarkefur
Copy link
Contributor

Solved but then unvolved by me. Lets figure out all hard solved issues in a simple but perhaps ineffective workflow. Once we know the command handles all cases it needs to handle, then we can start thinking about making it more efficient.

@kbjarkefur kbjarkefur added wontfix This will not be worked on and removed resolved but unpublished Solution submitted but unreleased labels Jan 17, 2024
@kbjarkefur
Copy link
Contributor

wontfix label is temporary. We absolutely should fix this, but not yet.

@kbjarkefur
Copy link
Contributor

Whenever re-introducing this, implement it with a sub-command that can be switched off for debugging purposes.

@bbdaniels bbdaniels added the enhancement New feature or request label Mar 28, 2024
@luisesanmartin
Copy link
Member

Related to this, I applied reprun to an unstable code where an unstable sorting was then leading to the creation of an unstable variable based on the _n position. You can check the screenshot below, the first highlighted line has the unstable sorting and line 89 creates the unstable var. The second highlighted line (98) is where reprun detects the first instability in the data.

image

image

I have two comments on this:

  • Shouldn't the first difference in the data signature be detected in line 89 instead of 98? that's where I think the first unstable variable is generated. Line 98 also generates another unstable variable. Perhaps reprun is not checking line stability by line but after each X lines (for efficiency, I'd presume) and reports the first difference detected?
  • From a user perspective, I think it would be better if the command flagged line 86 as the start of the instability instead of line 98, since that's where the issue starts. I understand it doesn't do so because the flag is in changes in the data signature, which doesn't check the sorting of the data or a variable.

@bbdaniels
Copy link
Contributor Author

So, the issue here is that the code in Line 86 does not in fact cause a mismatching data signature. Data signature is unique up to the order of each column.

That is why I was working on a functionality such that the sort order RNG would flag this — although I think something got removed at some point. The sort order RNG “DIFF” flag should only appear when it causes data to mismatch (since every use of sort RNG should be a new sortseed, it should generally not match between runs).

In other words, the sort order flag should indicate that there is an error coming from a missing sort, generally of this type — my preferred solution here is that the sort order flag not appear till Line 86, where there is a true reproducibility issue caused by sorting.

See #6; see

* Handle data line
local output = substr(`"`datatmp'"',1,strrpos(`"`datatmp'"',".txt"))
local data = "`lnum'`looptracker'"
local data = subinstr("`data'"," ","_",.)
local data = subinstr("`data'",":","-",.)
if `run' == 1 {
cap mkdir "`output'"
save "`output'/`data'.dta" , replace emptyok
local srngcheck = _rc
}
if `run' == 2 {
local output = subinstr(`"`output'"',"run2","run1",.)
cap cf _all using "`output'/`data'.dta"
local srngcheck = _rc
}
and

repkit/src/ado/reprun.ado

Lines 670 to 681 in 96ca085

* Logic for minimal SRNG checker (still oversensitive)
local l1_srng = "`l1_srngstate'"
local pl1_srng = "`pl1_srngstate'"
if ("`l1_srngstate'" != "`pl1_srngstate'") & ("`l2_srngcheck'" != "0") {
local l2_srng = "`l2_srngstate'"
local pl2_srng = "`pl2_srngstate'"
}
else {
local l2_srng = "`l1_srngstate'"
local pl2_srng = "`pl1_srngstate'"
}
for a prior version of my solution -- @kbjarkefur why did you remove this functionality?

@kbjarkefur kbjarkefur added the reprun Relates to reprun label Apr 16, 2024
@bbdaniels bbdaniels changed the title reprun: Performance improvement possible by saving data less often? reprun: Correct handling of data files for _cf in Seed RNG checks Apr 16, 2024
@bbdaniels bbdaniels added bug Something isn't working and removed enhancement New feature or request wontfix This will not be worked on labels Apr 16, 2024
@bbdaniels bbdaniels changed the title reprun: Correct handling of data files for _cf in Seed RNG checks reprun: Correct handling of data files for _cf in Sort RNG checks Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working reprun Relates to reprun
Projects
None yet
Development

No branches or pull requests

3 participants