Feasibility of integrating Tree-Cat? #216
Hi @fritzo, thanks for writing to us about Tree-Cat. I looked through the model and found it very interesting and creative.
There is a straightforward path toward integrating Tree-Cat into either:
- the CGpm interface in cgpm, or
- the IBayesDBMetamodel interface in bayeslite.
Both of these interfaces would allow Tree-Cat to be used as a modeling backend via the Metamodeling Language, and queried using the Bayesian Query Language (BQL) in BayesDB, which is typically how we run model evaluations. Which of the two interfaces makes more sense to implement depends on what features of Tree-Cat we wish to expose to the end user. For example, implementing the CGpm interface would allow us to compose Tree-Cat with other models in the repository, at the expense of some overhead in query runtime; implementing the IBayesDBMetamodel interface makes it easier to optimize the implementations of simulate/logpdf as invoked by BayesDB, at the expense of less flexibility for composing Tree-Cat with other models. We should also consider whether Tree-Cat has any built-in multiprocessing capabilities (which the CGpm integration provides automatically, but the IBayesDBMetamodel does not).
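As a rough illustration, a CGpm-style wrapper for Tree-Cat might look like the skeleton below. The method names follow the CGpm interface in the cgpm repository (incorporate, transition, simulate, logpdf), but exact signatures have varied across cgpm versions, so treat this as a sketch rather than a definitive implementation; the TreeCat-facing internals are left as placeholders.

```python
# Sketch of a CGpm-style wrapper for Tree-Cat. Method names follow the CGpm
# interface from the cgpm repository; exact signatures vary across versions.
# TreeCat-facing calls are placeholders, not a real TreeCat API.
class TreeCatCGpm(object):
    """Composable generative population model backed by Tree-Cat."""

    def __init__(self, outputs, inputs, rng=None):
        assert not inputs, 'Tree-Cat is fully generative; no input columns.'
        self.outputs = outputs    # column indices this CGpm models
        self.data = {}            # rowid -> {colno: value}
        self.rng = rng

    def incorporate(self, rowid, observation, inputs=None):
        # Record an observed row; Tree-Cat trains in batch via transition().
        self.data[rowid] = dict(observation)

    def transition(self, N=None):
        # Run N sweeps of Tree-Cat's structure/parameter inference here.
        raise NotImplementedError

    def simulate(self, rowid, targets, constraints=None, inputs=None, N=None):
        # Delegate to Tree-Cat's batched sampler for the target columns,
        # conditioned on the constraint columns.
        raise NotImplementedError

    def logpdf(self, rowid, targets, constraints=None, inputs=None):
        # Delegate to Tree-Cat's batched density evaluator.
        raise NotImplementedError
```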
We have run various benchmarks for cgpms, although none of our evaluation suites are particularly suited to the nominal/ordinal data that Tree-Cat targets. My sense is that we can benchmark Tree-Cat together by:
- selecting datasets with the nominal/ordinal variables Tree-Cat targets;
- defining an evaluation set of queries, written in BQL; and
- running those queries against Tree-Cat and against baseline models.
By writing the benchmark suite in BQL, which is model-independent, we can logically separate the task of defining the evaluation set from the task of implementing the baseline models the queries run against. Further extensions may include comparing the performance of Tree-Cat against the baselines while varying the amount of model analysis and/or any tunable query parameters. Let me know if you have thought about what datasets and queries would be appropriate for benchmarking Tree-Cat.
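To illustrate the model-independence point, here is a hedged sketch of how such a benchmark might drive two generators through bayeslite with identical query strings. The `treecat` metamodel name is an assumption (no such backend exists yet), and the exact BQL dialect has varied across bayeslite releases, so the schema and generator statements below are indicative only.

```python
# Sketch of a model-independent BQL benchmark: the same query strings run
# against two generators, one (hypothetically) backed by Tree-Cat and one by
# a baseline. BQL syntax has varied across bayeslite releases.
import bayeslite

bdb = bayeslite.bayesdb_open()  # in-memory database by default
bayeslite.bayesdb_read_csv_file(bdb, 'data', 'data.csv',
                                header=True, create=True)
bdb.execute('CREATE POPULATION pop FOR data '
            'WITH SCHEMA (GUESS STATTYPES OF (*))')

# One generator per backend; 'treecat' is a hypothetical metamodel name.
for name, backend in [('g_treecat', 'treecat'), ('g_baseline', 'cgpm')]:
    bdb.execute('CREATE GENERATOR {} FOR pop USING {}'.format(name, backend))
    bdb.execute('INITIALIZE 4 MODELS FOR {}'.format(name))
    bdb.execute('ANALYZE {} FOR 100 ITERATIONS'.format(name))

# The evaluation queries themselves never name a backend, e.g.
#   ESTIMATE MUTUAL INFORMATION OF c1 WITH c2 BY pop
# so defining the benchmark is decoupled from implementing the models.
```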
Thanks @fsaad for your detailed response! It looks like the CGpm interface will be the easiest for me to integrate with, so I will refactor towards that interface.

Re: multiprocessing, TreeCat achieves efficient querying by batching queries and vectorizing the math internally using numpy. This has made the most sense for my use cases, e.g. cross-validating on an entire dataset, or computing mutual information, which requires evaluating logpdf over large batches of samples.

Re: datasets, I am currently testing with two private social services datasets (20K rows, 200 features, categorical and ordinal). I could test other models and publish the results. I think a good public benchmark would be a text mining dataset like the Enron emails (500K rows, 1000s of sparse boolean features). Text mining seems to be the main application for Zhang and Poon's Latent Tree Analysis, a model very similar to TreeCat, and I am currently working on an Enron analysis blogpost.

Could you point me towards any existing model comparison code/notebooks using CGpm, as a starting point for analyzing these datasets in a CGpm-compatible way?
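To make the batching point above concrete, here is a self-contained numpy sketch of the Monte Carlo mutual information estimate that motivates vectorized logpdf. It uses a toy categorical joint rather than TreeCat's actual API; the point is that all N log-density evaluations happen in one vectorized batch.

```python
# Monte Carlo estimate of I(X;Y) ~ mean(log p(x,y) - log p(x) - log p(y))
# over samples (x,y) ~ p(x,y), fully vectorized with numpy. Toy joint
# distribution only; not TreeCat's actual API.
import numpy as np

rng = np.random.RandomState(0)
joint = np.array([[0.30, 0.10],   # p(x, y) for x in {0,1}, y in {0,1}
                  [0.05, 0.55]])
px = joint.sum(axis=1)            # marginal p(x)
py = joint.sum(axis=0)            # marginal p(y)

# Draw N joint samples by indexing into the flattened joint table.
N = 100000
flat = rng.choice(joint.size, size=N, p=joint.ravel())
x, y = np.unravel_index(flat, joint.shape)

# One batched evaluation of all three log densities, then a single mean.
mi_estimate = np.mean(np.log(joint[x, y]) - np.log(px[x]) - np.log(py[y]))
print(mi_estimate)                # ~0.25 nats for this joint
```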
Hi all, I'm looking for a way to test Tree-Cat, a generalization of Cross-Cat to latent tree models (currently only for categorical and ordinal-as-binomial features). I'm guessing that you've built a suite of datasets and evaluation metrics on top of cgpm, so I thought an easy way to test Tree-Cat would be to support a standard cgpm engine interface in a treecat.cgpm module, and then add a little cgpm.treecat integration in this repo. Thanks!
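For what that two-sided layout might look like, here is a hypothetical sketch: treecat.cgpm would own the adapter, and cgpm.treecat would stay a thin re-export, so neither repo duplicates logic. All names are assumptions about code that does not exist yet.

```python
# Hypothetical layout for the proposed integration.

# --- in the treecat repository: treecat/cgpm.py ---
# class TreeCatCGpm(CGpm): ...   (the adapter class itself)

# --- in this repository: cgpm/treecat.py ---
from treecat.cgpm import TreeCatCGpm  # noqa: F401  (re-export only)
```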