-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --sample to pclean #198
Conversation
Example samples generated for flights data: act_arr_time,act_dep_time,flight,sched_arr_time,sched_dep_time,src |
cxx/hirm.cc
Outdated
@@ -16,6 +17,43 @@ HIRM::HIRM(const T_schema& schema, std::mt19937* prng) { | |||
} | |||
} | |||
} | |||
*/ | |||
|
|||
HIRM::HIRM(const T_schema& _schema, std::mt19937* prng) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the rationale for this change? If it's only to detect loops, I think that might be better done as a helper function that checks the schema instead of complicating the logic in the constructor. Also, add_relation
recursively adds the base relations of a noisy relation, how come that's handled here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is copied from #197. The rationale is to prevent crashes when noisy relations have other noisy relations as their base relation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where/why does it crash? I wouldn't have expected that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you patch in this PR and then restore the commented out version of the HIRM constructor, the test test_make_pclean_samples will crash when constructing the HIRM.
The crash itself occurs in HIRM::add_relation inside the std::visit when schema.at(trel.base_relation) is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix this by defining a new method that, when called on a noisy relation spec, recursively adds base relations if necessary? I think that would be more concise and easy to reason about, and as-is, it looks to me like the recursive call in add_relation
will never execute. Alternatively, it seems fine to me to just initialize the schema
member with the schema
constructor arg instead of building it up in add_relation
, which should also solve the crash. I just want to make sure we're not adding unnecessary complexity as we're about to hand it off to the MIT folks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Can we move this discussion to Add relations to HIRM in an order respecting their base_relations #197 which actually makes the change?
-
There are lots of ways you could fix this bug. I don't think the recursive solution you propose is any more concise or easy to reason about than this one is, especially if we augment this solution with removing the then unneeded recursive call from add_relation. But if you would like to code up any alternate solution, I would be happy to review it.
} | ||
for (size_t i = 0; i < domains.size(); ++i) { | ||
int id = -1; | ||
auto it = entity_assignments.find(annotated_domains[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like at each call to make_pclean_sample
the entity assignments are sampled independently, is that right? What I mean is if we've already sampled physician:Physician == 3, physician:school:School == 5
in some previous sample, in a future sample we can get physician:Physician == 3, physician:school:School == 2
, which shouldn't be possible because the first sample implies that Physician 3 went to School 5.
(If that's the case, instead of fixing it here we could punt on it until Model 7, since I'm having to deal with that in Model 7 too.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the entity assignments are recreated for every row. I added a TODO to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to check this in until it's fixed, since the method currently returns invalid samples (or at least to highlight it more, maybe by renaming the function something that indicates that samples are invalid and it is a WIP)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please suggest a name you would be happy with, and I'll change it to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you, maybe just append WIP_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prepend i mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
As part of that, adds to_csv method to csv.hh::DataFrame.
Depends on #197
Addresses #178