Add --sample to pclean #198

Merged
merged 14 commits into from
Sep 16, 2024

Conversation

ThomasColthurst

As part of that, adds a to_csv method to csv.hh::DataFrame.

Depends on #197
Addresses #178

@ThomasColthurst
Author

Example samples generated for flights data:

act_arr_time,act_dep_time,flight,sched_arr_time,sched_dep_time,src
7:25 a.m.,7:00 a.m.,TYLM-r@d8NK"1S{8xzbK,9:45 a.m.,7wA58 pr.m.,aa
6:30 p.m.,2:@5 p.m.,+p+ZF:R7<#iqZ/wRleW,12:12 p.m.,7:25 a._m.,aa
7EGZKDl3PCe|Y<A3KrH9{(CE5}|/PY3C6,,12:12 c.m.,[l7x0-.tx+>.mhcV./K),9:28 a.m.,7:45 p.m.,aa
9j:28 a.m.,6:30 p.m.,@r:P%NJ$:!,R#aNBdugT,7:58 p.mr,3:50 p.m.,aa
7:39 p.m.,7:10 a2m.,EPARQXt,3:5L p.m.,4:09 p.m.,aa

cxx/hirm.cc Outdated
@@ -16,6 +17,43 @@ HIRM::HIRM(const T_schema& schema, std::mt19937* prng) {
}
}
}
*/

HIRM::HIRM(const T_schema& _schema, std::mt19937* prng) {

Could you explain the rationale for this change? If it's only to detect loops, I think that might be better done as a helper function that checks the schema instead of complicating the logic in the constructor. Also, add_relation recursively adds the base relations of a noisy relation, so why is that handled here too?
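
A loop-checking helper of the kind suggested in this comment might look like the sketch below. It assumes a simplified, hypothetical schema (BaseRelationMap) in which each noisy relation name maps to the name of its base relation; the real T_schema holds richer types, so this is only an illustration of the idea.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical simplified schema: each noisy relation name maps to the name
// of its base relation. Clean relations have no entry. The real T_schema
// stores richer types than this.
using BaseRelationMap = std::map<std::string, std::string>;

// Returns true if following base_relation links from some noisy relation
// eventually revisits a relation, i.e. the schema contains a loop.
bool has_base_relation_cycle(const BaseRelationMap& base_of) {
  for (const auto& entry : base_of) {
    std::set<std::string> seen;
    std::string current = entry.first;
    while (true) {
      if (!seen.insert(current).second) return true;  // visited twice
      auto it = base_of.find(current);
      if (it == base_of.end()) break;  // reached a clean relation
      current = it->second;
    }
  }
  return false;
}
```

Such a check could run once before construction, leaving the constructor itself free of loop-detection logic.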

Author

This change is copied from #197. The rationale is to prevent crashes when noisy relations have other noisy relations as their base relation.

Where/why does it crash? I wouldn't have expected that.

Author

If you patch in this PR and then restore the commented out version of the HIRM constructor, the test test_make_pclean_samples will crash when constructing the HIRM.

The crash itself occurs in HIRM::add_relation inside the std::visit when schema.at(trel.base_relation) is called.

Can you fix this by defining a new method that, when called on a noisy relation spec, recursively adds base relations if necessary? I think that would be more concise and easier to reason about. As-is, it looks to me like the recursive call in add_relation will never execute. Alternatively, it seems fine to me to just initialize the schema member with the schema constructor arg instead of building it up in add_relation, which should also solve the crash. I just want to make sure we're not adding unnecessary complexity as we're about to hand it off to the MIT folks.
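
The recursive approach proposed in this comment can be sketched roughly as follows. The names BaseRelationMap, add_with_bases, and base_first_order are hypothetical, the schema is reduced to a map from each noisy relation to its base relation, and the sketch assumes the schema has no base_relation loops. It is not the code in this PR or in #197.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified schema: noisy relation name -> base relation name.
// Clean relations have no entry.
using BaseRelationMap = std::map<std::string, std::string>;

// Appends `name` to `order`, first appending its base relation (and that
// relation's base, and so on) if it has one and has not been added yet.
// Assumes the schema has no base_relation loops.
void add_with_bases(const std::string& name, const BaseRelationMap& base_of,
                    std::set<std::string>* added,
                    std::vector<std::string>* order) {
  if (added->count(name) > 0) return;
  auto it = base_of.find(name);
  if (it != base_of.end()) {
    add_with_bases(it->second, base_of, added, order);
  }
  added->insert(name);
  order->push_back(name);
}

// Returns all relation names in an order where every base relation comes
// before the noisy relations built on it.
std::vector<std::string> base_first_order(
    const std::vector<std::string>& names, const BaseRelationMap& base_of) {
  std::set<std::string> added;
  std::vector<std::string> order;
  for (const auto& name : names) {
    add_with_bases(name, base_of, &added, &order);
  }
  return order;
}
```

Either this recursion or an explicit ordering pass gives the same guarantee: by the time a noisy relation is added, a lookup of its base relation succeeds.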

Author

  1. Can we move this discussion to #197 ("Add relations to HIRM in an order respecting their base_relations"), which actually makes the change?

  2. There are lots of ways you could fix this bug. I don't think the recursive solution you propose is any more concise or easier to reason about than this one, especially if we also remove the then-unneeded recursive call from add_relation. But if you would like to code up an alternate solution, I would be happy to review it.

cxx/pclean/pclean_lib.cc Outdated
}
for (size_t i = 0; i < domains.size(); ++i) {
int id = -1;
auto it = entity_assignments.find(annotated_domains[i]);

It looks like the entity assignments are sampled independently at each call to make_pclean_sample. Is that right? What I mean is that if we've already sampled physician:Physician == 3, physician:school:School == 5 in some previous sample, a future sample can give physician:Physician == 3, physician:school:School == 2, which shouldn't be possible because the first sample implies that Physician 3 went to School 5.

(If that's the case, instead of fixing it here we could punt on it until Model 7, since I'm having to deal with that in Model 7 too.)
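
One way to keep assignments consistent across samples is to remember, for each parent entity, which child entity it was given. The sketch below is only an illustration under stated assumptions: the class EntityAssignmentCache, its method, and the uniform placeholder prior are all hypothetical and are not taken from pclean_lib.cc.

```cpp
#include <map>
#include <random>
#include <string>
#include <utility>

// Hypothetical cache of entity assignments shared across samples.
class EntityAssignmentCache {
 public:
  // Returns the child entity for `parent_id` along `field`, e.g. the School
  // entity that Physician 3 refers to. The first call samples it; later
  // calls return the remembered value so samples stay consistent.
  int get_or_sample(const std::string& field, int parent_id,
                    int num_candidates, std::mt19937* prng) {
    const auto key = std::make_pair(field, parent_id);
    auto it = assignments_.find(key);
    if (it != assignments_.end()) return it->second;
    // Placeholder prior: uniform over candidates. A real model would draw
    // from its own distribution over entities.
    std::uniform_int_distribution<int> dist(0, num_candidates - 1);
    const int child_id = dist(*prng);
    assignments_[key] = child_id;
    return child_id;
  }

 private:
  // (field path, parent entity id) -> child entity id.
  std::map<std::pair<std::string, int>, int> assignments_;
};
```

With a cache like this held outside the per-row sampling call, a second row that again picks Physician 3 would reuse the School chosen the first time instead of drawing a new one.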

Author

Yes, the entity assignments are recreated for every row. I added a TODO to fix that.

I would prefer not to check this in until it's fixed, since the method currently returns invalid samples. At the least, it should be highlighted more, maybe by renaming the function to something that indicates that the samples are invalid and that it is a WIP.

Author

Please suggest a name you would be happy with, and I'll change it to that.

Up to you, maybe just append WIP_

Prepend, I mean.

Author

Done.

@ThomasColthurst ThomasColthurst merged commit 67613d0 into master Sep 16, 2024
2 checks passed
@ThomasColthurst ThomasColthurst deleted the 091024-thomaswc-pclean_samples branch September 16, 2024 16:41
2 participants