How to validate if my trained CLV Model is accurate enough to be used? #557

lojacobs · 2024-02-29T16:28:09Z

lojacobs
Feb 29, 2024

Hi everyone,

I used the CLV quickstart tutorial from the pymc-marketing website (and also skimmed the Lifetimes library documentation) to build my first CLV model.

Now I would like to check whether it's accurate enough to base decisions on, but I cannot find a way to validate it with the current library.

I'm also using MMMGPT to help me, and I took the introductory course at intuitivebayes.com.

So I would like to use the trace in the InferenceData object to create a PPC plot and compute WAIC, both to see whether my trained model is good enough and to improve it iteratively. My issue is that the trace isn't available with BetaGeoModel or its idata. (I suppose I will hit the same issue with the Gamma-Gamma model, because its idata object doesn't have the trace either.)

What should be my next steps? Is there another way to validate those models?

PS: I'm more a hacker than a coder and more a business guy than a data scientist... but I'm willing to learn!

Thanks

Answered by ColtAllen

Mar 1, 2024

wd60622 · 2024-02-29T17:23:38Z

wd60622
Feb 29, 2024
Maintainer

Hello @lojacobs,
Both of those models should have the InferenceData available in the idata attribute after a model fit. Are you saying it is not available?

I'm not sure whether az.compare is fully supported, but it'd be worth a try!
My understanding is that those models serve somewhat different purposes. Are you trying to compare the two?

1 reply

lojacobs Feb 29, 2024
Author

Hello @wd60622,

Here is the code that I wrote and the error that I got:

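import arviz as az
from pymc_marketing import clv

# Wide HalfNormal priors on the four BG/NBD parameters
model_config = {
    'a_prior': {'dist': 'HalfNormal',
                'kwargs': {'sigma': 100}},
    'b_prior': {'dist': 'HalfNormal',
                'kwargs': {'sigma': 100}},
    'alpha_prior': {'dist': 'HalfNormal',
                    'kwargs': {'sigma': 100}},
    'r_prior': {'dist': 'HalfNormal',
                'kwargs': {'sigma': 100}},
}

# data_summary_rfm_train is defined earlier (not shown here)
bgm_train = clv.BetaGeoModel(
    data=data_summary_rfm_train,
    model_config=model_config,
)
bgm_train.build_model()

# Sample the posterior; the result is stored in bgm_train.idata
bgm_train.fit()

# Trace plot and summary table of the fitted parameters
az.plot_trace(bgm_train.idata)

bgm_train.fit_summary()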

This part worked as expected.

And then I tried to add the posterior predictive and prior predictive samples to the idata object to check how close they are to the observed data (with az.plot_ppc(idata, group="posterior")):

I got the "prior" group that I didn't have before (as well as a warning that I don't know how to handle), but I don't have the posterior_predictive and prior_predictive groups that I get when using only the PyMC library:

I suppose I'm doing something wrong. Since pymc-marketing is built on top of the PyMC library, maybe the way the model is created, or its output in the idata object, isn't the same as with PyMC alone.

My end goal is just to validate my fitted CLV model and iterate on it if need be. If you have another way to get there, I'm open to any suggestion (with some practical way to implement it on my side, or a good reference).

Thanks for your quick response! I wasn't expecting one so fast.

ColtAllen · 2024-03-01T03:11:14Z

ColtAllen
Mar 1, 2024
Maintainer

Hey @lojacobs,

This is related to #352.

The only CLV models supporting PPCs are ParetoNBDModel and ShiftedBetaGeoModelIndividual because the likelihood functions of both models are encapsulated in distribution blocks, which also contain the sim_data methods required to simulate data for a PPC.

BetaGeoModel and GammaGammaModel both have their likelihood functions wrapped in a pymc.Potential, which does not assign the likelihood values required for a PPC to the idata output. This is also the reason for the UserWarnings you're seeing.

A distribution block needs to be built for BetaGeoModel to support PPCs. GammaGammaModel, however, is an oddball, and I don't think PPCs are applicable here. The output of this model is a weighted average of the underlying Gamma distributions by purchase frequency (i.e., if a customer has only made one purchase, their expected spend is the population average, but if a customer has made many purchases, their expected spend will be biased towards their own historical average).
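
The weighted-average behavior described above can be seen in a few lines of plain Python. This is a minimal sketch of the textbook Gamma-Gamma conditional expectation, not pymc-marketing's own code; the function name and the parameter values for `p`, `q` and `v` are made up for illustration.

```python
def expected_average_spend(x, mean_spend, p, q, v):
    """Conditional expected average spend under the Gamma-Gamma model.

    x          : number of purchases observed for the customer
    mean_spend : the customer's observed average spend over those purchases
    p, q, v    : Gamma-Gamma parameters (illustrative values below; needs q > 1)
    """
    population_mean = v * p / (q - 1)
    # The weight on the customer's own history grows with the purchase count x
    weight_own = p * x / (p * x + q - 1)
    return (1 - weight_own) * population_mean + weight_own * mean_spend


# Hypothetical parameters: population mean spend is 150 * 2 / (11 - 1) = 30
p, q, v = 2.0, 11.0, 150.0

# Two customers with the same observed average spend but different histories
few = expected_average_spend(x=1, mean_spend=50.0, p=p, q=q, v=v)
many = expected_average_spend(x=20, mean_spend=50.0, p=p, q=q, v=v)
print(f"1 purchase:   {few:.2f}")
print(f"20 purchases: {many:.2f}")
```

With few purchases the estimate stays close to the population mean; with many purchases it moves towards the customer's own average.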

PPCs for CLV models are also rather nuanced. I'm planning to open a PR soon for this, but in the meantime if you wish to hack out a PPC for ParetoNBDModel:

import arviz as az
import pandas as pd
from pymc_marketing import clv


# Build and fit the Pareto/NBD model
model = clv.ParetoNBDModel(data)
model.build_model()
model.fit()

# Simulated vs. observed purchase frequencies (index 1 of the last axis)
ppc_freq = model.distribution_customer_population(random_seed=45)[0][0][..., 1]
obs_freq = model.idata.observed_data['likelihood'][..., 1]

# Compare how many customers fall at each purchase count
pd.DataFrame(
    {
        "model estimations": ppc_freq.to_pandas().value_counts().sort_index(),
        "observed": obs_freq.to_pandas().value_counts().sort_index(),
    }
).head(15).plot(kind="bar", title="Histogram of Purchase Counts per Customer")

These arviz plots are also a cool way to show the cumulative equivalent, with confidence bands around the estimated and observed values:

az.plot_ecdf(ppc_freq, obs_freq, confidence_bands=True).set_title("Posterior Predictive ECDF Plot")
az.plot_ecdf(ppc_freq, obs_freq, confidence_bands=True, difference=True).set_title("Posterior Predictive Difference Plot")
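
For a single number to go alongside the plots, one option (an addition here, not something arviz or pymc-marketing reports in this form) is the largest vertical gap between the two empirical CDFs. The sketch below uses plain Python and made-up purchase counts; the helper name is hypothetical, and with real data you would pass the simulated and observed frequency values instead.

```python
from bisect import bisect_right


def max_ecdf_gap(simulated, observed):
    """Largest absolute difference between the ECDFs of two samples."""
    sim_sorted = sorted(simulated)
    obs_sorted = sorted(observed)
    gap = 0.0
    # Both ECDFs only change at values present in either sample,
    # so checking those values is enough.
    for value in set(sim_sorted) | set(obs_sorted):
        sim_cdf = bisect_right(sim_sorted, value) / len(sim_sorted)
        obs_cdf = bisect_right(obs_sorted, value) / len(obs_sorted)
        gap = max(gap, abs(sim_cdf - obs_cdf))
    return gap


# Hypothetical purchase counts per customer
simulated_counts = [0, 0, 1, 1, 2, 3, 5, 8]
observed_counts = [0, 0, 0, 1, 2, 2, 4, 9]
print(max_ecdf_gap(simulated_counts, observed_counts))
```

A gap near zero means the simulated and observed purchase counts are distributed almost identically; what counts as "small enough" is a judgment call for your use case.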

1 reply

lojacobs Mar 1, 2024
Author

Thanks @ColtAllen for such a complete response!

The only thing that bothers me about switching from the BG/NBD model to the Pareto/NBD model is that I don't have an explicit churn indicator. I'm learning to use this model by implementing it for a little e-commerce website, so I cannot determine whether the customer is simply dormant but still active, or if they are actually 'dead'.

Is there a way to validate the BG/NBD model (& Gamma-Gamma because they're both used by the CLV model if I understood it correctly)? If not, do you know a way to use the Pareto/NBD model for my context (& I suppose that the Gamma-Gamma will still be used by the CLV model, so we would need a way to validate it as well)?

I apologize for continuing to request help, but I assume I'm not the only one wondering how to validate your CLV model to use it confidently. So, I hope this discussion will benefit others as well!

ColtAllen · 2024-03-01T14:57:37Z

ColtAllen
Mar 1, 2024
Maintainer

> Thanks @ColtAllen for such a complete response!
>
> I don't have an explicit churn indicator.

What do you mean? Both the BG/NBD and Pareto/NBD models have probability_alive methods.

> Is there a way to validate the BG/NBD model (& Gamma-Gamma because they're both used by the CLV model if I understood it correctly)? If not, do you know a way to use the Pareto/NBD model for my context (& I suppose that the Gamma-Gamma will still be used by the CLV model, so we would need a way to validate it as well)?

I'm working on a PR for an rfm_train_test_split function that will allow these models to be evaluated in the conventional ML train/test fashion. I hope to get it posted next week.
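
In the meantime, the train/test idea can be sketched by hand. The helper below is hypothetical (it is not the rfm_train_test_split API, whose signature is not given in this thread): it splits raw transactions at a cutoff date, summarizes the calibration period as frequency, recency and T in days, and counts each customer's purchases in the holdout period. It is written in plain Python with made-up data so the logic is easy to follow.

```python
from datetime import date


def split_rfm(transactions, cutoff):
    """Split (customer_id, purchase_date) pairs at `cutoff`.

    Returns {customer_id: {"frequency", "recency", "T", "holdout_purchases"}}
    for customers whose first purchase falls in the calibration period.
    Time is measured in days; purchases are counted per distinct day.
    """
    calibration, holdout = {}, {}
    for customer_id, day in transactions:
        target = calibration if day <= cutoff else holdout
        target.setdefault(customer_id, set()).add(day)

    summary = {}
    for customer_id, days in calibration.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,       # repeat purchases only
            "recency": (last - first).days,   # first to last purchase
            "T": (cutoff - first).days,       # first purchase to cutoff
            "holdout_purchases": len(holdout.get(customer_id, ())),
        }
    return summary


# Made-up transactions for two customers
transactions = [
    ("a", date(2023, 1, 1)), ("a", date(2023, 2, 1)), ("a", date(2023, 7, 15)),
    ("b", date(2023, 3, 10)), ("b", date(2023, 8, 1)), ("b", date(2023, 9, 1)),
]
rfm = split_rfm(transactions, cutoff=date(2023, 6, 30))
for customer_id, row in sorted(rfm.items()):
    print(customer_id, row)
```

The model would then be fit on frequency, recency and T only, and its predicted purchases over the holdout window compared with holdout_purchases.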

1 reply

lojacobs Mar 1, 2024
Author

> What do you mean? Both the BG/NBD and Pareto/NBD models have probability_alive methods.

From what I understood, Pareto/NBD is for when you have a true indicator (e.g., the end of a subscription), whereas BG/NBD is for when you use the pattern of repeat purchases to predict whether the customer is still alive.

> I'm working on a PR for an rfm_train_test_split function that will allow these models to be evaluated in the conventional ML train/test fashion. I hope to get it posted next week.

Great! That would be very helpful!

ColtAllen · 2024-03-01T16:23:49Z

ColtAllen
Mar 1, 2024
Maintainer

> From what I understood, Pareto/NBD is for when you have a true indicator (e.g., the end of a subscription).

No; both models are for the same use case. In fact, the whole reason the BG/NBD model was originally developed back in 2005 was that the math under the hood for Pareto/NBD was too complex to implement until fairly recently.

Pareto/NBD takes more time to fit than BG/NBD, but has more functionality. Both models will perform similarly if you're only interested in expected_purchases, but BG/NBD assumes all one-time customers are 100% still active. If this is not a valid assumption for your use case, use Pareto/NBD, which also has a parameter to predict if customers will still be active x time periods into the future.
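
The one-time-customer assumption falls straight out of the BG/NBD probability-alive formula. Here is a minimal sketch of the published closed-form expression (not the library's probability_alive implementation); the function name is hypothetical and the parameter values are illustrative, not fitted to any data.

```python
def bgnbd_probability_alive(x, t_x, T, r, alpha, a, b):
    """P(alive) for one customer under the BG/NBD model.

    x   : number of repeat purchases
    t_x : time of the last purchase (recency)
    T   : time since the first purchase (customer age)
    r, alpha, a, b : BG/NBD parameters
    """
    if x == 0:
        # In BG/NBD, dropout can only occur immediately after a repeat
        # purchase, so a customer with none is alive by construction.
        return 1.0
    odds_dead = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + odds_dead)


# Illustrative (not fitted) parameter values
params = dict(r=0.25, alpha=4.5, a=0.8, b=2.5)

one_time = bgnbd_probability_alive(x=0, t_x=0, T=30, **params)
stale = bgnbd_probability_alive(x=3, t_x=5, T=30, **params)
recent = bgnbd_probability_alive(x=3, t_x=29, T=30, **params)
print(one_time, stale, recent)
```

However long ago the first purchase was, the customer with no repeat purchases is treated as certainly alive, while repeat buyers are penalized for a long gap since their last purchase.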

For true indicators, you would use ShiftedBetaGeoModelIndividual.

1 reply

lojacobs Mar 1, 2024
Author

Oh! Thanks for the correction! I will try the Pareto/NBD model then, and use your earlier code to validate it.

Have a great weekend, and thanks again for sharing all your knowledge (on top of developing this awesome library!)
