
Conversation

@gsemet
Contributor

@gsemet gsemet commented Aug 26, 2016

This is a set of files that have been formatted by the script defined in #14567.

Not all files are formatted; only the documentation examples, for information's sake.

This Pull Request can be merged alone, but it makes more sense to merge it once #14567 is accepted and merged (this branch comes on top of it).

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch from 7848c92 to 493ae5c Compare August 26, 2016 11:57
@gsemet gsemet changed the title [SPARK-16992][PYSPARK] [DO NOT MERGE] #14567 execution example [SPARK-16992][PYSPARK] autopep8 on documentation example Aug 26, 2016
@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 54c5fdf to 2e28dd6 Compare August 26, 2016 12:05
@gsemet gsemet changed the title [SPARK-16992][PYSPARK] autopep8 on documentation example [SPARK-16992][PYSPARK] autopep8 on documentation examples Aug 26, 2016
@SparkQA

SparkQA commented Aug 26, 2016

Test build #64473 has finished for PR 14830 at commit 2e28dd6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64471 has finished for PR 14830 at commit 493ae5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 26, 2016

It's a lot of change, but I tend to favor biting the bullet and standardizing, especially if we can enforce it going forward. Thoughts @davies @holdenk (time permitting) @MLnick

Contributor

So we might want to move the $example off$ tag/comment up above this so that we keep the example text the same.

Contributor Author

OK, what does this tag do?

Contributor

Some of the example files are used in generating the website documentation, and the "example on" and "example off" tags are used to determine which parts get pulled in to the website (this is done because we don't want to repeat the same boilerplate imports for each example; rather, we show only the ones specific to that example). You can take a look at ./docs/ml-features.md, which includes this file, to see how it's used in markdown, and at the generated website documentation at http://spark.apache.org/docs/latest/ml-features.html#binarizer .

The instructions for building the docs locally are located at ./docs/README.md. Let me know if you need any help with that; the documentation build is sometimes a bit overlooked, since many of the developers don't build it manually very often.
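To make the marker mechanism concrete, here is a hypothetical sketch (the data, threshold, and logic are illustrative stand-ins, not Spark's actual example file): the docs build pulls only the lines between the markers into the website, so boilerplate above `# $example on$` never appears in the rendered page.

```python
# Illustrative sketch only: assumes the docs build extracts the region
# between the $example on$ / $example off$ comment markers.

# $example on$
data = [(0, 0.1), (1, 0.8), (2, 0.2)]
threshold = 0.5
binarized = [(i, 1.0 if v > threshold else 0.0) for i, v in data]
# $example off$

# This print statement sits outside the markers, so it would not be
# rendered in the website documentation.
print(binarized)
```

Moving an import above `# $example on$` therefore removes it from the rendered example, which is why the marker placement matters in this review.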

Contributor Author

Yes, I see; that makes perfect sense!

Contributor

So we probably want to fix that here and in other places.

@holdenk
Contributor

holdenk commented Aug 27, 2016

Thanks for taking the time to do this @stibbons, I think it's great progress. Doing a quick skim, it seems there are a number of places where the import reordering may have inadvertently changed what users will see in the examples in our documentation, which is probably not what was intended.

I've left line comments in some of the places where I noticed them, but there are probably quite a few others, since this was just a quick first skim.

I'd suggest doing a quick audit yourself and then considering building the documentation to verify that your change hasn't altered it in any unintended ways.

Once again thanks for taking on this task! :)

@gsemet
Contributor Author

gsemet commented Aug 27, 2016

Yes, I will try to understand how it works and make it beautiful. The goal is to move toward automating this kind of code housekeeping, but it may take some time. I'll continue to submit parts of this code-style work next week, so we can see "small" changes like this.

I really like yapf, a formatting tool from Google that almost does the job, better than autopep8. It works a bit aggressively, which is why I do not recommend enforcing it, but it helps identify and rework most PEP 8 errors in Python.

Contributor Author

I actually prefer that this line be in the doc.

Contributor

In that case, move the # $example on$ comment up above the from pyspark.ml.linalg import Vectors line.

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 50fc56e to 2635dcb Compare August 29, 2016 13:21
@gsemet
Contributor Author

gsemet commented Aug 29, 2016

Here is a new proposal. I've taken your remarks into account (hope all the $on$/$off$ markers are OK) and added some minor rework of the multiline syntax (I find using \ weird and inelegant; using parentheses "()" makes it more readable, IMHO).

Tell me what you think.

@holdenk
Contributor

holdenk commented Aug 29, 2016

For what it's worth, PEP 8 says:

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

So this sounds in line with the general PEP8-ification of the code, but I am a little concerned about just how many files this touches now that it isn't just an autogenerated change. I'll try to set aside some time this week to review it (I'm currently ~13 hours off my regular timezone, so my review times may be a little erratic).
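The contrast in the PEP 8 quote above can be sketched minimally (toy expressions, not Spark code):

```python
# Discouraged: explicit backslash line continuation.
total_backslash = 1 + 2 + \
    3 + 4

# Preferred: implied line continuation inside parentheses,
# which is what this PR moves the examples toward.
total_parens = (1 + 2 +
                3 + 4)

# Both forms evaluate identically; only the style differs.
assert total_backslash == total_parens == 10
```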

@gsemet
Contributor Author

gsemet commented Aug 29, 2016

Cool, I wasn't sure about that.

No problem, I can even split it into several PRs.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] autopep8 on documentation examples [SPARK-16992][PYSPARK] PEP8 on documentation examples Aug 29, 2016
Contributor Author

I have not changed all these initialization lines, since most of the time they do not appear in the documentation.

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch from ff6aabf to 78b66d8 Compare January 9, 2017 11:44
@SparkQA

SparkQA commented Jan 9, 2017

Test build #71079 has finished for PR 14830 at commit 78b66d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment

It seems I let this slip off my radar (sorry). Some minor comments, but if you're OK with updating this to master (I can now merge Python PRs), it would be nice to have our examples cleaned up in this way. Sorry @stibbons for the delay.

Contributor

What's this for?

Contributor

Why did you remove the double newlines after the end of the imports?

Contributor Author

@gsemet gsemet left a comment

I've fixed your remarks. The extra line has been emptied (no need for the '#'). It is the PEP 8 recommendation to have two empty lines after the imports.

I have fixed the other remark as well.

Thanks!
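For context on the blank-line convention discussed here: PEP 8's actual rule is two blank lines around top-level definitions, which in practice leaves two blank lines between the imports and the first function or class (pycodestyle reports violations as E302/E303). A minimal illustrative sketch, not code from this PR:

```python
# Sketch of the layout pycodestyle expects; names are hypothetical.
import math


def circle_area(radius):
    """Toy top-level function, separated from the imports by two blank lines."""
    return math.pi * radius ** 2


print(circle_area(1.0))
```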

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72861 has finished for PR 14830 at commit 31cea6d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72862 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Feb 14, 2017

Great, thanks for updating this :) It would be good to see if @HyukjinKwon has anything to say; otherwise I'll do another pass through this tomorrow, and hopefully it's really close :)

@HyukjinKwon
Member

Thank you for cc'ing me @holdenk. Let me try to take a look by tomorrow as well, at my best.

Member

@HyukjinKwon HyukjinKwon left a comment

I left several comments. In general, I think we should minimise the changes as much as we can. Could we check whether all of these really are recommended changes (at least the ones I commented on)?

I know it sounds a bit demanding, but I somewhat suspect that some changes are not really explicitly required/recommended, and that some removed lines are not explicitly discouraged. I worry whether it is worth sweeping them all.

Member

It'd be great if we had some references or quotes.

[
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
],
Member

Could you double-check whether it really does not follow PEP 8? I have seen the removed syntax more often (e.g., in numpy).

Contributor Author

Indeed, this is a recommendation, not an obligation. I find it looks more like Scala multi-line code, and I prefer it. It is a personal opinion, and I don't think there is a pylint/pep8 check to prevent using it.


# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
Member

Hmm... does pep8 have a different argument-location rule for classes and functions? This one seems already fine, and it seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43

Contributor Author

@gsemet gsemet Feb 15, 2017

The pep8 tool does this automatically if the line is > 100 characters. There is indeed no preference between this format and:

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="accuracy")

I would say both are equivalent. I tend to prefer this one (the latter).

    .transform(lambda rdd: rdd.sortByKey(False))
happiest_words = (word_counts
                  .map(lambda word_tuples: (word_tuples[0],
                                            float(word_tuples[1][0]) * word_tuples[1][1]))
Member

@HyukjinKwon HyukjinKwon Feb 15, 2017

(Personally, I don't think it is more readable.)

Contributor Author

I agree; if you prefer, I can change them all at once. But like I said, I don't know of any autoformatter that does it automatically.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
from pyspark.mllib.regression import (LabeledPoint,
                                      StreamingLinearRegressionWithSGD)
Member

This does not exceed the 100-character line length, does it? To my knowledge, Spark limits lines to 100 characters (not the default 80).

Contributor Author

I actually prefer having a single import per line (it greatly simplifies file management, multi-branch merges, ...). I can revert this change.
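Both styles under discussion are PEP 8-compliant; the trade-off can be sketched with stdlib modules (illustrative only, not the Spark imports from this diff):

```python
# One import per line: more verbose, but each merge or cherry-pick
# touches a single line, so conflicts are rarer and easier to resolve.
from collections import OrderedDict
from collections import defaultdict

# Parenthesised multi-import: more compact, but edits to the name list
# are likelier to collide across branches.
from collections import (OrderedDict,
                         defaultdict)

counts = defaultdict(int)
counts["a"] += 1
print(dict(counts))
```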

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils


Member

@HyukjinKwon HyukjinKwon Feb 15, 2017

Could I ask you to check whether the example rendered in the docs still complies with PEP 8?

Member

If you happen to be unable to build the Python docs, I will check tomorrow to help.

Contributor Author

Yes, because the two empty lines come after

 # $example off$

@HyukjinKwon
Member

@stibbons are there maybe some options in autopep8 to minimise the changes? (Just in case: I believe we ignore some rules such as E402, E731, E241, W503 and E226 in Spark.)

@gsemet
Contributor Author

gsemet commented Feb 15, 2017

Hello. This is actually the result of running the pylint/autopep8 config proposed in #14963. I can indeed minimize this PR a little more by ignoring more rules.

@HyukjinKwon
Member

Thanks @stibbons. FWIW, I won't stand against it; I'm just neutral. Let me defer to @holdenk and @srowen.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Let's do a Jenkins re-run just to make sure everything is up to date, and I'll try to get a final pass done soon. I think it would be good to bring our examples closer to PEP 8 style for the sake of readability for people coming to PySpark from different Python code bases.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Jenkins retest this please.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73445 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Apr 9, 2017

Jenkins retest this please.

@SparkQA

SparkQA commented Apr 9, 2017

Test build #75632 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Apr 9, 2017

I guess a rebase would be welcome; I can do it by tomorrow if you want.

@holdenk
Contributor

holdenk commented Apr 11, 2017

Sure, if you have a chance to rebase and check whether any other changes are needed, that would be useful.

@ueshin
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

@holdenk
Contributor

holdenk commented Jul 2, 2017

Gentle follow-up ping. I've got some bandwidth next week.

@gsemet
Contributor Author

gsemet commented Jul 2, 2017

Hello. Sadly I cannot work on this; we are in the middle of a big reorganization at work.

@asfgit asfgit closed this in 3a45c7f Aug 5, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#18780 from HyukjinKwon/close-prs.