Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge GenDB and SchemaHelper; use GenDB in pclean binary #212

Merged
merged 13 commits into from
Oct 2, 2024

Conversation

ThomasColthurst
Copy link
Collaborator

This pull request is not 100% ready; there are still a few failing tests in pclean_lib_test and gendb_test.

But I thought I would get your thoughts now to make sure we are on the same page about what this is supposed to do.

Will eventually fix #207 and #206

Copy link
Collaborator

@emilyfertig emilyfertig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. (If we're using pclean.cc to launch GenDB, we should probably move it up a directory and rename it, not necessarily in this PR).

cxx/gendb.hh Show resolved Hide resolved

// Run inference
std::cout << "Running inference ...\n";
inference_hirm(&prng, &hirm,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still run HIRM with a PClean-like schema, i.e. Model 6? I think that's something we should try to keep (maybe with a bool flag for HIRM vs full GenDB, and calling incorporate_observations, inference_hirm etc on gendb.hirm)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So not only can we do this, it is currently the only thing we can do -- all inference.cc::inference_gendb does right now is directly call inference_hirm.

But I agree that as we add the ability to inference_gendb to transition entity assignments, we should add flags to keep Model 5 & 6 behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Could you add those, as Model 7 inference comes together?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@ThomasColthurst
Copy link
Collaborator Author

Tests pass now.

@ThomasColthurst
Copy link
Collaborator Author

My preference, by the way, would be to move gendb.* down into the pclean directory rather than moving pclean.cc up a directory.

cxx/gendb.cc Outdated
@@ -60,6 +60,9 @@ void GenDB::incorporate(
// Incorporate the items/value into the query relation.
incorporate_query_relation(prng, query_rel, items, val);
}

// Add to the record_class's CRP.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have a CRP for the Record class. What is the problem, and is there another way to fix it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the specific problem was that I assumed that there would be a CRP for the record class, and I wrote a test under that assumption, and then I added these lines to make that test pass.

But I also use the record class's CRP in pclean_lib.cc::make_pclean_sample to create a class_item for each new row. I'm open to suggestions for better ways to implement that, but I believe that sampling from the record class's CRP is what the model spec says to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For make_pclean_sample, I think we should just use a counter for class_item, or have the function take a vector of unique row IDs. I don't think the spec says to sample from the record class's CRP -- there's a one-to-one correspondence between observations and record class entities, and sampling from a record class CRP would result in multiple observations of a single record entity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. When I was talking about the spec, I was talking about the generative model described on page 7 under "PClean entity model".

@emilyfertig
Copy link
Collaborator

Since GenDB is a combination of HIRM and PClean, maybe we should do something like

cxx/
pclean/
hirm/
relation.hh
domain.hh
...
distributions/
...
gendb.hh
gendb_main.cc <- renamed from pclean.cc

In any case I think we can do better than the current structure, I'll file an issue to figure it out.

@ThomasColthurst
Copy link
Collaborator Author

Two of the integration tests are failing with crashes during inference.

Copy link
Collaborator

@emilyfertig emilyfertig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bool only_final_emissions;
bool record_class_is_clean;
std::map<std::string, std::vector<std::string>> domains;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment for "domains"?

DataFrame df;
for (int i = 0; i < num_samples; i++) {
std::map<std::string, std::string> query_values;
WIP_make_pclean_sample(hirm, schema, annotated_domains_for_relations,
prng, &query_values);
make_pclean_sample(prng, gendb, start_row + i, &query_values);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an assertion that start_row isn't already in the record class, or add a comment that explains the assumption (that entity IDs greater than or equal to start_row aren't already in the record class).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

@ThomasColthurst
Copy link
Collaborator Author

Thanks!

Integration tests now pass, post merge with #215 and using new_rows_have_unique_entities=true in pclean_lib::incorporate_observations.

@ThomasColthurst ThomasColthurst merged commit fca1b5b into master Oct 2, 2024
1 of 2 checks passed
@ThomasColthurst ThomasColthurst deleted the 240924-thomaswc-merge_gendb branch October 2, 2024 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

I/O for Model 7
2 participants