Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unordered_map for string references #584

Merged
merged 14 commits into from
Mar 22, 2022
Merged

Add unordered_map for string references #584

merged 14 commits into from
Mar 22, 2022

Conversation

hadley
Copy link
Member

@hadley hadley commented Apr 10, 2021

@jimhester could you please take a look at this? I can't remember if this is ok or it if will leak memory.

readstat_string_ref_t is:

typedef struct readstat_string_ref_s {
    int64_t     first_v;
    int64_t     first_o;
    size_t      len;
    char        data[1]; // Flexible array; using [1] for C++98 compatibility
} readstat_string_ref_t;

@jimhester
Copy link
Contributor

It wouldn't leak memory, but I guess you could have dangling pointers if you free whatever string data the pointers you put into the map are pointing to and don't remove them from the map.

@hadley
Copy link
Member Author

hadley commented Feb 27, 2022

@gorcha do you feel comfortable reviewing/finishing this off?

@gorcha
Copy link
Member

gorcha commented Feb 28, 2022

Yeah for sure, shall have a look!

@gorcha
Copy link
Member

gorcha commented Mar 8, 2022

Hey @hadley, the writing code on the haven side looks fine but it looks like there are a couple of bugs in ReadStat.

  • All of the file offsets in the map in the file header are written in readstat_begin_writing_data(), but at this point the string refs are empty. Once the string refs are written out the file offsets for the sections after the strLs are different but the map isn't updated.
  • The variable and observation numbers in the stata file should be indexed from 1, but ReadStat is indexed from 0. readstat_insert_string_ref() doesn't apply this, so we get an off by one in the variable and observation index values for the strLs.

I'll follow up with Evan over at ReadStat.

@gorcha
Copy link
Member

gorcha commented Mar 8, 2022

After poking around a bit more I've found a workaround for the map issue. The string refs need to be added before we start writing data so the offsets in the map are written out correctly.

I've opened a PR for the off by one error (WizardMac/ReadStat#270) - without this fixed we can still roundtrip strLs successfully, but there will likely be issues reading the output files in Stata since the indexing is off.

@hadley
Copy link
Member Author

hadley commented Mar 8, 2022

@gorcha thanks for looking into this!

@gorcha
Copy link
Member

gorcha commented Mar 9, 2022

No worries! I've just added the fix for the off by one issue as well, the PR has been merged into ReadStat and it's a small change that doesn't affect anything other than string ref writing so I figured there's no need to wait for it to come through in a ReadStat release.

@hadley
Copy link
Member Author

hadley commented Mar 9, 2022

Sounds good. Do you want to merge it?

@gorcha
Copy link
Member

gorcha commented Mar 11, 2022

One final thought - should we make the threshold for strLs user controllable rather than hard coding to 500?

@hadley
Copy link
Member Author

hadley commented Mar 11, 2022

It'd be great if you wanted to do that!

@gorcha
Copy link
Member

gorcha commented Mar 16, 2022

Yep no worries! Shall do in the next couple of days.

@gorcha
Copy link
Member

gorcha commented Mar 21, 2022

Done - I've set it to 2045 by default since that mimics the default Stata behaviour.

@gorcha gorcha merged commit d3900ac into main Mar 22, 2022
@gorcha gorcha deleted the string-ref branch March 22, 2022 13:25
@hadley
Copy link
Member Author

hadley commented Mar 22, 2022

Do you think it's worth doing a release now? Or are there other changes you want to wait on?

@gorcha
Copy link
Member

gorcha commented Mar 22, 2022

I think so - the other big/impactful change in progress is fixing the encoding issue in #615, but the PR over at ReadStat isn't quite ready so I feel like it'll still be a while (we still need to actually implement the new ReadStat functionality in haven).

There are a couple of minor things to do (adding a warning when there are conflicting labels for #667, adding a cleanup script for #668) but I can sort them out today.

@hadley
Copy link
Member Author

hadley commented Mar 23, 2022

No rush, just ping me when I should start the release process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants