[SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns #4498
Test build #27188 has started for PR 4498 at commit
Isn't this change wrong? Shouldn't sampling be off if ratio > 0.99?
Oh, seems the code in master is wrong.
Test build #27188 has finished for PR 4498 at commit
Test FAILed.
@rxin Is there a better name for this? createDataFrame is still too long (longer than 'applySchema')
Talked offline, we did figure out a better name than it.
Test build #27229 has started for PR 4498 at commit
Test build #27229 has finished for PR 4498 at commit
Test FAILed.
Test build #596 has started for PR 4498 at commit
@rxin this should be ready, please give it another look.
Test build #27239 has started for PR 4498 at commit
LGTM. @yhuai please take a look at the type inference stuff.
Test build #596 has finished for PR 4498 at commit
Is the Map here a scala.collection.Map or Predef.Map (scala.collection.immutable.Map)? (We should use scala.collection.Map here.)
Test build #27239 has finished for PR 4498 at commit
Test PASSed.
After talking to @yhuai offline, he suggested that we hold off on the Scala API for createDataFrame(rdd, columns); it's not that useful right now. We can revisit it later.
Test build #27255 has started for PR 4498 at commit
@pwendell I think this PR is ready to go; we can wait for Jenkins or not. (The last commit just removes an API and the test for it.) Sorry for the delay.
Test build #27255 has finished for PR 4498 at commit
Test PASSed.
Deprecate inferSchema() and applySchema(); use createDataFrame() instead, which can take an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns

(cherry picked from commit ea60284)
Signed-off-by: Michael Armbrust <michael@databricks.com>
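
To illustrate the idea of an optional `schema` that is either a full schema or just a list of column names, here is a minimal pure-Python sketch. It is not the code from this PR and does not use Spark: `resolve_schema`, the `(name, type)` pair representation standing in for a StructType, and the positional `_1`, `_2` fallback names are all assumptions made for illustration.

```python
from typing import Any, List, Optional, Sequence, Tuple, Union

# Hypothetical stand-in for a full schema: a list of (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def resolve_schema(first_row: Sequence[Any],
                   schema: Optional[Union[Schema, List[str]]] = None) -> Schema:
    """Illustrative only: decide column names and types for a row.

    - schema is None: infer types, use positional names (_1, _2, ...).
    - schema is a list of strings: use them as names, infer the types.
    - schema is a list of (name, type) pairs: use it unchanged.
    """
    inferred_types = [type(value).__name__ for value in first_row]

    if schema is None:
        names = ["_%d" % (i + 1) for i in range(len(first_row))]
        return list(zip(names, inferred_types))

    if len(schema) != len(first_row):
        raise ValueError("schema has %d fields but the row has %d values"
                         % (len(schema), len(first_row)))

    if all(isinstance(field, str) for field in schema):
        # Only column names were given: types still come from the data.
        return list(zip(schema, inferred_types))

    # A full schema was given: trust it as-is.
    return [(name, type_name) for name, type_name in schema]


row = ("Alice", 1)
names_only = resolve_schema(row, ["name", "age"])
no_schema = resolve_schema(row)
full_schema = resolve_schema(row, [("name", "string"), ("age", "long")])
```

In PySpark itself, going by the description above, the names-only case would correspond to a call of the form `createDataFrame(rdd, ['name', 'age'])`, and the full-schema case to passing a StructType as the second argument.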