
Conversation

@gsemet
Contributor

@gsemet gsemet commented Aug 26, 2016

This is a set of files that have been formatted by the script defined in #14567.

Not all files are formatted; only the documentation examples, for information's sake.

This Pull Request can be merged alone, but it makes more sense to merge it once #14567 is accepted and merged (this branch comes on top of it).

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch from 7848c92 to 493ae5c Compare August 26, 2016 11:57
@gsemet gsemet changed the title [SPARK-16992][PYSPARK] [DO NOT MERGE] #14567 execution example [SPARK-16992][PYSPARK] autopep8 on documentation example Aug 26, 2016
@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 54c5fdf to 2e28dd6 Compare August 26, 2016 12:05
@gsemet gsemet changed the title [SPARK-16992][PYSPARK] autopep8 on documentation example [SPARK-16992][PYSPARK] autopep8 on documentation examples Aug 26, 2016
@SparkQA

SparkQA commented Aug 26, 2016

Test build #64473 has finished for PR 14830 at commit 2e28dd6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64471 has finished for PR 14830 at commit 493ae5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 26, 2016

It's a lot of change, but I tend to favor biting the bullet and standardizing, especially if we can enforce it going forward. Thoughts @davies @holdenk (time permitting) @MLnick

Contributor

So we might want to move the $example off$ tag/comment up above this so that we keep the example text the same.

Contributor Author

OK, what does this tag do?

Contributor

Some of the example files are used in generating the website documentation, and the "example on" and "example off" tags are used to determine which parts get pulled in to the website (this is done because we don't want to repeat the same boilerplate imports for each example; rather, we show only the ones specific to that example). You can take a look at ./docs/ml-features.md, which includes this file, to see how it's used in markdown, and at the generated website documentation at http://spark.apache.org/docs/latest/ml-features.html#binarizer .

The instructions for building the docs locally are located at ./docs/README.md. Let me know if you need any help with that; the documentation build is sometimes a bit overlooked, since many of the developers don't build it manually very often.
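To make the marker mechanism concrete, here is a hypothetical sketch (the data, threshold, and logic are illustrative stand-ins, not Spark's actual example file): the docs build pulls only the lines between the markers into the website, so boilerplate above `# $example on$` never appears in the rendered page.

```python
# Illustrative sketch only: assumes the docs build extracts the region
# between the $example on$ / $example off$ comment markers.

# $example on$
data = [(0, 0.1), (1, 0.8), (2, 0.2)]
threshold = 0.5
binarized = [(i, 1.0 if v > threshold else 0.0) for i, v in data]
# $example off$

# This print statement sits outside the markers, so it would not be
# rendered in the website documentation.
print(binarized)
```

Moving an import above `# $example on$` therefore removes it from the rendered example, which is why the marker placement matters in this review.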

Contributor Author

Yes, I see; that makes perfect sense!

Contributor

So we probably want to fix that here and in other places.

@holdenk
Contributor

holdenk commented Aug 27, 2016

Thanks for taking the time to do this @stibbons, I think it's great progress. Doing a quick skim, it seems there are a number of places where the import reordering may have inadvertently changed what users will see in the examples in our documentation, which is probably not what was intended.

I've left line comments in some of the places where I noticed them, but there are probably quite a few others, since this was just a quick first skim.

I'd suggest doing a quick audit yourself and then considering building the documentation to verify that your change hasn't altered it in any unintended ways.

Once again thanks for taking on this task! :)

@gsemet
Contributor Author

gsemet commented Aug 27, 2016

Yes, I will try to understand how it works and make it beautiful. The goal is to move toward automating this kind of code housekeeping, but it may take some time. I'll continue to submit parts of this code-style work next week, so we can see "small" changes like this.

I really like yapf, a formatting tool from Google that almost does the job, better than autopep8. It works a bit aggressively, which is why I do not recommend enforcing it, but it helps identify and rework most PEP 8 errors in Python.

Contributor Author

I actually prefer that this line be in the doc.

Contributor

In that case, move the # $example on$ comment up above the from pyspark.ml.linalg import Vectors line.

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 50fc56e to 2635dcb Compare August 29, 2016 13:21
@gsemet
Contributor Author

gsemet commented Aug 29, 2016

Here is a new proposal. I've taken your remarks into account (hope all the $on$/$off$ markers are OK) and added some minor rework of the multiline syntax (I find using \ weird and inelegant; using parentheses "()" makes it more readable, IMHO).

Tell me what you think.

@holdenk
Contributor

holdenk commented Aug 29, 2016

For what it's worth, PEP 8 says:

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

So this sounds in line with the general PEP8-ification of the code, but I am a little concerned about just how many files this touches now that it isn't just an autogenerated change. I'll try to set aside some time this week to review it (I'm currently ~13 hours off my regular timezone, so my review times may be a little erratic).
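The contrast in the PEP 8 quote above can be sketched minimally (toy expressions, not Spark code):

```python
# Discouraged: explicit backslash line continuation.
total_backslash = 1 + 2 + \
    3 + 4

# Preferred: implied line continuation inside parentheses,
# which is what this PR moves the examples toward.
total_parens = (1 + 2 +
                3 + 4)

# Both forms evaluate identically; only the style differs.
assert total_backslash == total_parens == 10
```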

@gsemet
Contributor Author

gsemet commented Aug 29, 2016

Cool, I wasn't sure about that.

No problem, I can even split it into several PRs.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] autopep8 on documentation examples [SPARK-16992][PYSPARK] PEP8 on documentation examples Aug 29, 2016
Contributor Author

I have not changed all these initialization lines, since most of the time they do not appear in the documentation.

@gsemet gsemet force-pushed the python_import_reorg_plus_exec branch from ff6aabf to 78b66d8 Compare January 9, 2017 11:44
@SparkQA

SparkQA commented Jan 9, 2017

Test build #71079 has finished for PR 14830 at commit 78b66d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment

It seems I let this slip off my radar (sorry). Some minor comments, but if you're OK with updating this to master (I can now merge Python PRs), it would be nice to have our examples cleaned up in this way. Sorry @stibbons for the delay.

Contributor

What's this for?

Contributor

Why did you remove the double newlines after the end of the imports?

Contributor Author

@gsemet gsemet left a comment

I've fixed your remarks. The extra line has been emptied (no need for the '#'). It is the PEP 8 recommendation to have two empty lines after the imports.

I have fixed the other remark as well.

Thanks!
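For context on the blank-line convention discussed here: PEP 8's actual rule is two blank lines around top-level definitions, which in practice leaves two blank lines between the imports and the first function or class (pycodestyle reports violations as E302/E303). A minimal illustrative sketch, not code from this PR:

```python
# Sketch of the layout pycodestyle expects; names are hypothetical.
import math


def circle_area(radius):
    """Toy top-level function, separated from the imports by two blank lines."""
    return math.pi * radius ** 2


print(circle_area(1.0))
```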

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72861 has finished for PR 14830 at commit 31cea6d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72862 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Feb 14, 2017

Great, thanks for updating this :) It would be good to see if @HyukjinKwon has anything to say; otherwise I'll do another pass through this tomorrow, and hopefully it's really close :)

@HyukjinKwon
Member

Thank you for cc'ing me @holdenk. Let me try to take a look by tomorrow as well, at my best.

Member

@HyukjinKwon HyukjinKwon left a comment

I left several comments. In general, I think we should minimise the changes as much as we can. Could we check whether all of these really are recommended changes (at least the ones I commented on)?

I know it sounds a bit demanding, but I somewhat suspect that some changes are not really explicitly required/recommended, and that some removed lines are not explicitly discouraged. I worry whether it is worth sweeping them all.

Member

It'd be great if we had some references or quotes.

[
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
],
Member

Could you double-check whether it really does not follow PEP 8? I have seen the removed syntax more often (e.g., in numpy).

Contributor Author

Indeed, this is a recommendation, not an obligation. I find it looks more like Scala multi-line code, and I prefer it. It is a personal opinion, and I don't think there is a pylint/pep8 check to prevent using it.


# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
Member

Hmm... does pep8 have a different argument-location rule for classes and functions? This one seems already fine, and it seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43

Contributor Author

@gsemet gsemet Feb 15, 2017

The pep8 tool does this automatically if the line is > 100 characters. There is indeed no preference between this format and:

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="accuracy")

I would say both are equivalent. I tend to prefer this one (the latter).

    .transform(lambda rdd: rdd.sortByKey(False))
happiest_words = (word_counts
                  .map(lambda word_tuples: (word_tuples[0],
                                            float(word_tuples[1][0]) * word_tuples[1][1]))
Member

@HyukjinKwon HyukjinKwon Feb 15, 2017

(Personally, I don't think it is more readable.)

Contributor Author

I agree; if you prefer, I can change them all at once. But like I said, I don't know of any autoformatter that does it automatically.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
from pyspark.mllib.regression import (LabeledPoint,
                                      StreamingLinearRegressionWithSGD)
Member

This does not exceed the 100-character line length, does it? To my knowledge, Spark limits lines to 100 characters (not the default 80).

Contributor Author

I actually prefer having a single import per line (it greatly simplifies file management, multi-branch merges, ...). I can revert this change.
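Both styles under discussion are PEP 8-compliant; the trade-off can be sketched with stdlib modules (illustrative only, not the Spark imports from this diff):

```python
# One import per line: more verbose, but each merge or cherry-pick
# touches a single line, so conflicts are rarer and easier to resolve.
from collections import OrderedDict
from collections import defaultdict

# Parenthesised multi-import: more compact, but edits to the name list
# are likelier to collide across branches.
from collections import (OrderedDict,
                         defaultdict)

counts = defaultdict(int)
counts["a"] += 1
print(dict(counts))
```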

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils


Member

@HyukjinKwon HyukjinKwon Feb 15, 2017

Could I ask you to check whether the example rendered in the docs still complies with PEP 8?

Member

If you happen to be unable to build the Python docs, I will check tomorrow to help.

Contributor Author

Yes, because the two empty lines come after

 # $example off$

@HyukjinKwon
Member

@stibbons are there maybe some options in autopep8 to minimise the changes? (Just in case: I believe we ignore some rules such as E402, E731, E241, W503 and E226 in Spark.)

@gsemet
Contributor Author

gsemet commented Feb 15, 2017

Hello. This is actually the result of running the pylint/autopep8 config proposed in #14963. I can indeed minimize this PR a little more by ignoring more rules.

@HyukjinKwon
Member

Thanks @stibbons. FWIW, I won't stand against it; I'm just neutral. Let me defer to @holdenk and @srowen.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Let's do a Jenkins re-run just to make sure everything is up to date, and I'll try to get a final pass done soon. I think it would be good to bring our examples closer to PEP 8 style for the sake of readability for people coming to PySpark from different Python code bases.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Jenkins retest this please.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73445 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Apr 9, 2017

Jenkins retest this please.

@SparkQA

SparkQA commented Apr 9, 2017

Test build #75632 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Apr 9, 2017

I guess a rebase would be welcome; I can do it by tomorrow if you want.

@holdenk
Contributor

holdenk commented Apr 11, 2017

Sure, if you have a chance to rebase and check whether any other changes are needed, that would be useful.

@ueshin
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

@holdenk
Contributor

holdenk commented Jul 2, 2017

Gentle follow-up ping. I've got some bandwidth next week.

@gsemet
Contributor Author

gsemet commented Jul 2, 2017

Hello. Sadly I cannot work on this; we are in the middle of a big reorganization at work.

@asfgit asfgit closed this in 3a45c7f Aug 5, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#18780 from HyukjinKwon/close-prs.