Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add aisample notebooks into community folder #1606

Merged
merged 15 commits into from
Aug 19, 2022
Merged

docs: add aisample notebooks into community folder #1606

merged 15 commits into from
Aug 19, 2022

Conversation

thinkall
Copy link
Contributor

@thinkall thinkall commented Aug 9, 2022

Related Issues/PRs

None

What changes are proposed in this pull request?

Add two notebooks for aisample into notebooks/community/aisample folder.

  • AIsample - Book Recommendation.ipynb
  • AIsample - Fraud Detection.ipynb

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following below steps.
  1. Find the corresponding markdown file for your new feature in website/docs/documentation folder.
    Make sure you choose the correct class estimators/transformers and namespace.
  2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to correct API link.
  4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
  6. Make sure the WebsiteSamplesTests job pass in the pipeline.

AB#1920095

@thinkall thinkall requested a review from mhamilton723 as a code owner August 9, 2022 06:33
@github-actions
Copy link

github-actions bot commented Aug 9, 2022

Hey @thinkall 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.
We appreciate your patience and contributions 💯!

@thinkall thinkall changed the title Add aisample notebooks into community folder doc: add aisample notebooks into community folder Aug 9, 2022
@thinkall thinkall changed the title doc: add aisample notebooks into community folder docs: add aisample notebooks into community folder Aug 9, 2022
@mhamilton723
Copy link
Collaborator

Hey @thinkall welcome to the project! I added you to the SynapseML team so you can now queue build with the magical /azp run comment. But before you queue a few questions and requests

  1. Do you intend for this notebook to run on platforms like Synapse and Databricks? I know there's a dependence on the lakehouse, but we can just as easily move this data to our public blob so that the demo works on all the platforms. If so we'll want to make some changes so that we can test this notebook as part of our build system and Serena might be able to help you make these changes

  2. Can you run black on these notebooks so that they get properly formatted into multiple lines to make the code review process easier?

  3. Can you clear the output of the cells prior to checkin? To show key results, include them as an image in the markdown so that folks can compare.

Appreciate these contributions and from the start these notebooks look like they have some great content!

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

@thinkall
Copy link
Contributor Author

Hi @mhamilton723 , thanks a lot for the suggestions. I've formatted the notebooks with black and cleared the output of cells. As for the cross-platform issue, we're focusing on just one platform at the moment, and we prefer not to introduce concepts of other platforms. I guess we can add some code to make it runnable in Synapse/Databricks after MVP. Does this make sense?

@thinkall thinkall requested a review from mhamilton723 August 11, 2022 02:07
Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, great content just needs some simplifications. I've added my comments for Book Recommendation, consider applying the learnings from this review to both notebooks. In general we'll want to tighten these notebooks up and make them as simple as possible so users dont have to think too hard. Once you apply these to both notebooks ill go through the other one and give another round of comments. Thanks for your patience :)

Title and Intro

  • "E2E solution of recommendation system" -> Creating, Evaluating, and Deploying a Recommendation System
  • In step 1: load the data you have some ASCII art showing the data, consider making this a markdown table

Loading Data

  • No need for params around ITEM_INFO COl and others
  • With the downloading data, you could remove this whole section by instead putting the CSV files on our pubvlic blob, then you can just read them directly

Read data from lakehouse

  • Consider caching the item, user, and user dataframe in the beginning to make this demo smoother

EDA

  • No need to coalesce before adding monotonic IDs
  • Dont need to sort before showing.
  • Dont need to drop to rdds or collect in df_ratings.select(RATING_COL).distinct().rdd.flatMap(lambda x: x).collect(). Just select distinct and show
  • Why do these user ids start with an "_" consider simplifying your computations to avoid these hidden vars

Merge Data

  • Dont need to re-import functions as F

For the following code

df_tmp = df_ratings.join(df_users, USER_ID_COL, "inner")
df_all = df_tmp.join(df_items, ITEM_ID_COL, "inner")
df_all_columns = df_all.columns
df_all_columns.remove("_user_id")
df_all_columns.remove("_item_id")
df_all_columns.remove(RATING_COL)
df_all = df_all.select(["_user_id", "_item_id", RATING_COL] + df_all_columns)
df_all = df_all.withColumn("id", F.monotonically_increasing_id())

This is a bit unwildy to look at. 1, consider using the "dot-chaining syntax within a parentheses like you did in the readers. 2. no need to get the columns explicitly, just select "*" if you want all the columns. 3, do you need to add another id?

  • Cache the df_all dataframe if you dont want to keep redoing this computation on every display

Plotting

  • Dont filter warnings, what was the warning you are trying to avoid?

Model development and deploy

  • "We have explored the dataset, got some insights from the exploratory data analysis and even achieved most popular recommendation." ->
  • " So far we have explored the dataset, added unique ids to our users and items, and plotted top items. Next, we'll train an Alternating Least Squares (ALS) recommender to give users personalized recommendations"
  • its not good fur users to have to iunteract with things that start with an underscore as this is generally used for private vars in classes. Consider removing that. Also if the IS_SAMPLE flag is false there will be no _df_all and it will fail
  • This code looks sketchy, consider simplifying with some for comprehensions:
fractions_train = {0: 0}
fractions_test = {0: 0}
for i in ratings:
    if i == 0:
        continue
    fractions_train[i] = 0.8
    fractions_test[i] = 1
  • Consider describing these cells that joiun and sample to make training and testing sets. Its a bit opaque right now
  • Cast and add columns before splitting so you dont need to do this twice
  • Dont use .rdd its deprecated

Hyperparameter tuning

  • make the question of which hyperparameter tuner you have could result in a hyper_optimzer variable you could use in lower code without needing additional if statements
  • This code looks like it could be cleaned up:
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)

Model Eval

  • Consider dot chaining here:
predictions = model.transform(train)
predictions = predictions.withColumn(
    "prediction", predictions.prediction.cast("double")
)
predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(10).show()
  • I think you can eval all the metrics at the same time which will be much faster

Model saving anf loading

  • Loading is commented out, not sure if this is intentional

@thinkall
Copy link
Contributor Author

Hey @mhamilton723 , first of all, a huge thank you to you. Your detailed comments are so helpful. I modified our notebooks following most of your comments. For the rest, please let me explain why we did it in that way.

The purpose of these notebooks is to showcase the functionalities of our new platform with some e2e examples in different scenarios. For users who are not familiar with the techniques (ML, Spark, etc.), they can just copy&paste the notebooks, run it directly without changing anything. They can also apply a notebook to their own datasets by simply uploading the dataset to the lakehouse and modifying the params in the first code cell accordingly, then click run! For experienced users, they can also use these notebooks as references.

Loading Data

To achieve these goals, we choose to load arbitrary dataset and reorder/rename some of its columns to make the code in other parts simple. So we hope to keep the load data part as it is. The downloading data part is also needed, using lakehouse is actually one of the feature we want to showcase.

EDA

  • "coalesce before adding monotonic IDs" is needed to keep the ids continously.
  • "_" is used to avoid naming conflicts as much as possible.

Merge Data

  • "get the columns explicitly" is for reordering the columns.

Model Eval

  • It seems that Evaluator from pyspark.ml can't eval all the metrics at the same time, Metrics from pyspark.mlib can do that but it only accept rdd inputs. Moreover, the current eval time is only a few seconds, so I guess it's acceptable.

Model saving and loading

  • Yes, it is intentional. We'll add code using mlflow plugin for model saving and loading.

Plus, I adjusted the first cell, with comments aligned. It's different from what black wants, but I think it looks better for this cell. Is it OK for you?

@thinkall thinkall requested a review from mhamilton723 August 12, 2022 10:16
@mhamilton723
Copy link
Collaborator

Will give this another full review tonight, one thing i noticed is that the files are still very large due to random synapse data held in the notebooks, can you remove those entries in the json so these notebooks are smaller files?

@thinkall
Copy link
Contributor Author

Will give this another full review tonight, one thing i noticed is that the files are still very large due to random synapse data held in the notebooks, can you remove those entries in the json so these notebooks are smaller files?

Thanks @mhamilton723, all the unnecessary random metedata has been removed.

@mhamilton723
Copy link
Collaborator

Li, thanks so much for your fixes and changes! Apologies in advance for any annoying comments i leave you andi appreciate your patience in working with me for these fixes. I think its become alot simpler since the first iteration and im hoping we can continue to distill this into its simplest and easiest form for users.

Book Recs

EDA

  • "coalesce before adding monotonic IDs" is needed to keep the ids continously." - I dont think they need to be continuous, its just a user GUID and coalesce forces you to go to a single machine which doesent scale

  • "_" is used to avoid naming conflicts as much as possible. - I think we should still remove these underscore names and instead save _names for private members as is the standard. Consider simplifying the logic of your queries to avoid the need to use these underscores. Also consider this for the variable _df_all as its used alot in your code and hence shouldnt be marked as a private variable. In general, _ should be reserved for private variables/methods in python classes as opposed to used objects and names in a public demo.

Model Training

  • I still think this can be simplified with a smaller if statement. Consider the following
if model_tuning_method == "CrossValidator":
    tuner = CrossValidator(
        estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5
    )
elif model_tuning_method == "TrainValidationSplit":

    tuner = TrainValidationSplit(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        # 80% of the data will be used for training, 20% for validation.
        trainRatio=0.8,
    )
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

Wioth this downstream code becomes:

models = tuner.fit(train)
model = models.bestModel

Likewise the next if can be simplified

if model_tuning_method == "CrossValidator":
    metric = models.avgMetrics
elif model_tuning_method == "TrainValidationSplit":
    metric = models.validationMetrics
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

And the other stuff can be factored out.

  • the code here:
# collect metrics
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)

Can be simplified to something like

# collect metrics
param_strings = []
for param_map in models.getEstimatorParamMaps():
    param_strings.append(" ".join(f"{k}={v}" for (k,v) in param_map.items())

Model Evaluation

  • predictions=... can be cleaned up with dot chaining
  • The evaluation section hasalot of repeated code both in terms of the experiment duplicated for train and test, and the same snippets "evaluator.evaluate..." consider making these functions or lambdas to tighten this up a bit

Style

  • "It's different from what black wants" - unfortunately our build error this if its not black compliant, consider shortening the comments so black doesent need to bump to new line

Fraud Detection

Cast columns

  • Consider dot chaining to neaten this section

Model evaluation

  • mmlspark is deprecated namespace change to synapse.ml

  • Same comment with duplication on eval code with smote vs non-smote consider making a function or a loop

Saving model

  • We should comment that saveNativeModel is useful if you want to use it in other lightGBM context, otherwise for usage later with spark use the sparkML save load on LGBM Classifier. For this demo it might be simplest to use SparkML model serialization instead of LGBM native

Thanks again for all your patience. I know its no fun getting comments and I really appreciate you!

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment

@thinkall
Copy link
Contributor Author

@mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.

"coalesce before adding monotonic IDs" is needed to keep the ids continously." - I dont think they need to be continuous, its just a user GUID and coalesce forces you to go to a single machine which doesent scale

There are two reasons for us to apply coalesce.

  1. The new generated _user_id is used for recommendation models, it's better to be continous integers to get smaller model size, faster inference speed for some models. So it's more than just GUID.

  2. We used ALS model in this demo, the spark implementation only supports values in Integer range, if we don't apply coalesce, for big dataset, the generated id may exceed the range. In fact, I added coalesce after I met this issue.

@thinkall thinkall requested a review from mhamilton723 August 17, 2022 10:59
@mhamilton723
Copy link
Collaborator

mhamilton723 commented Aug 17, 2022

@mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.

"coalesce before adding monotonic IDs" is needed to keep the ids continously." - I dont think they need to be continuous, its just a user GUID and coalesce forces you to go to a single machine which doesent scale

There are two reasons for us to apply coalesce.

  1. The new generated _user_id is used for recommendation models, it's better to be continous integers to get smaller model size, faster inference speed for some models. So it's more than just GUID.
  2. We used ALS model in this demo, the spark implementation only supports values in Integer range, if we don't apply coalesce, for big dataset, the generated id may exceed the range. In fact, I added coalesce after I met this issue.

Thanks @thinkall, i see what you are saying. I think in this case we should instead use the SparkML StringIndexer as this is designed to learn a mapping from strings representing IDs to integers/longs. This will automatically make sure the ids are contiguous, but it wont have the aforementioned scalability issues. Also i still see alot of _ column names. Sorry to be a pain but can we try to name those without _. Your use of _evaluate is OK though!

@thinkall
Copy link
Contributor Author

All done.

@thinkall
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter
Copy link

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #1606 (1c108d1) into master (ac40e5a) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1606      +/-   ##
==========================================
- Coverage   83.79%   83.75%   -0.05%     
==========================================
  Files         292      292              
  Lines       15473    15473              
  Branches      752      752              
==========================================
- Hits        12966    12959       -7     
- Misses       2507     2514       +7     
Impacted Files Coverage Δ
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala 67.64% <0.00%> (-7.36%) ⬇️
.../execution/streaming/continuous/HTTPSourceV2.scala 92.08% <0.00%> (-0.72%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@thinkall thinkall requested a review from serena-ruan as a code owner August 18, 2022 06:15
@thinkall
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 7e90d19 into microsoft:master Aug 19, 2022
@thinkall thinkall deleted the aisample branch August 19, 2022 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants