Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(spark): Refactoring datasources #514

Merged

Conversation

SemyonSinchenko
Copy link
Member

@SemyonSinchenko SemyonSinchenko commented Jun 5, 2024

Reason for this PR

By moving datasources under org.apache.spark.sql we are able to access private Spark API. Last time when I was trying to fully migrate datasources to V2 it was a blocker. Detailed motivation is in #493

What changes are included in this PR?

Mostly refactoring.

Are these changes tested?

Unit tests are passed

I manually checked the generated JARs:
image

Are there any user-facing changes?

Mostly not because GarDataSource was left under the same package.

Close #493

 On branch 493-datasources-spark-refactoring
 Changes to be committed:
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarDataSource.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarCommitProtocol.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarCommitProtocol.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarScan.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarScanBuilder.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarTable.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarWriterBuilder.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/csv/CSVWriterBuilder.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/csv/CSVWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/orc/OrcOutputWriter.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/orc/OrcOutputWriter.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/orc/OrcWriteBuilder.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/orc/OrcWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/parquet/ParquetWriterBuilder.scala -> maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/parquet/ParquetWriteBuilder.scala
@SemyonSinchenko
Copy link
Member Author

@acezen What do you think about this? For me it looks like there are no changes for public API

@acezen
Copy link
Contributor

acezen commented Jun 6, 2024

@acezen What do you think about this? For me it looks like there are no changes for public API

Hi, Sem, thanks for the draft, change to org.apache.spark.sql is good to me. But if we upload the binary package to maven, is that possible to upload all the binary packages to org.apache.graphar?

@SemyonSinchenko
Copy link
Member Author

@acezen What do you think about this? For me it looks like there are no changes for public API

Hi, Sem, thanks for the draft, change to org.apache.spark.sql is good to me. But if we upload the binary package to maven, is that possible to upload all the binary packages to org.apache.graphar?

Yes, it should be. For example, check this:

  1. Download delta-saprk: https://mvnrepository.com/artifact/io.delta/delta-spark_2.13/3.2.0
  2. Open it with archive manager
  3. You will see org-apache-spark classes inside:
    image

but coordinates are still io.delta:delta-spark_2.13:3.2.0

So, if you OK with it I will update 32 too and fix license headers.

@acezen
Copy link
Contributor

acezen commented Jun 6, 2024

@acezen What do you think about this? For me it looks like there are no changes for public API

Hi, Sem, thanks for the draft, change to org.apache.spark.sql is good to me. But if we upload the binary package to maven, is that possible to upload all the binary packages to org.apache.graphar?

Yes, it should be. For example, check this:

  1. Download delta-saprk: https://mvnrepository.com/artifact/io.delta/delta-spark_2.13/3.2.0
  2. Open it with archive manager
  3. You will see org-apache-spark classes inside:
    image

but coordinates are still io.delta:delta-spark_2.13:3.2.0

So, if you OK with it I will update 32 too and fix license headers.

Looks NICE! The change is OK for me. Thanks for the work.

 On branch 493-datasources-spark-refactoring
 Changes to be committed:
	modified:   licenserc.toml
	modified:   maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarDataSource.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarCommitProtocol.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/GarCommitProtocol.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarScan.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarScanBuilder.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarTable.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/GarWriterBuilder.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/GarWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/csv/CSVWriterBuilder.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/csv/CSVWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/orc/OrcOutputWriter.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/orc/OrcOutputWriter.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/orc/OrcWriteBuilder.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/orc/OrcWriteBuilder.scala
	renamed:    maven-projects/spark/datasources-32/src/main/scala/org/apache/graphar/datasources/parquet/ParquetWriterBuilder.scala -> maven-projects/spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar/parquet/ParquetWriteBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/graphar/datasources/GarDataSource.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarCommitProtocol.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarWriteBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/csv/CSVWriteBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/orc/OrcOutputWriter.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/orc/OrcWriteBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/parquet/ParquetWriteBuilder.scala
@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review June 6, 2024 10:40
@SemyonSinchenko
Copy link
Member Author

Ready for review.

@SemyonSinchenko SemyonSinchenko changed the title [DRAFT][Spark] Refactoring datasources [Spark] Refactoring datasources Jun 6, 2024
@SemyonSinchenko SemyonSinchenko changed the title [Spark] Refactoring datasources feat(spark): Refactoring datasources Jun 6, 2024
Copy link
Contributor

@acezen acezen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@acezen acezen merged commit d288b7e into apache:main Jun 7, 2024
4 checks passed
SemyonSinchenko added a commit that referenced this pull request Jun 13, 2024
* feat(spark): Refactoring datasources (#514)

### Reason for this PR
By moving datasources under `org.apache.spark.sql` we are able to access private Spark API. Last time when I was trying to fully migrate datasources to V2 it was a blocker. Detailed motivation is in #493 

### What changes are included in this PR?
Mostly refactoring.

### Are these changes tested?
Unit tests are passed

I manually checked the generated JARs:
![image](https://github.com/apache/incubator-graphar/assets/29755009/1b094516-88b1-490a-a2ea-8dcd092a3b1d)

### Are there any user-facing changes?
Mostly not because `GarDataSource` was left under the same package.


Close #493

* feat(dev): Add release and verify scripts (#507)

Reason for this PR
Add scripts for developer or release manager to easily release version or verify a version.

What changes are included in this PR?
Add release and verify scripts
related document is updated to website, see Update the release and verify document, and add development document incubator-graphar-website#18
Are these changes tested?
yes

Are there any user-facing changes?
no
---------

Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* chore: Bump to version v0.12.0 (Round 1) (#517)


Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* chore: Add CHANGELOG.md (#513)


Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* Introduce buf

- v2
- buf.gen
- buf

 On branch format-definition-dev
 Your branch is up to date with 'origin/format-definition-dev'.

 Changes to be committed:
	new file:   buf.gen.yaml
	new file:   buf.yaml
	modified:   format/adjacent_list.proto
	modified:   format/edge_info.proto
	modified:   format/graph_info.proto
	modified:   format/property_group.proto
	modified:   format/types.proto
	modified:   format/vertex_info.proto

---------

Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>
Co-authored-by: Weibin Zeng <qiaozi.zwb@alibaba-inc.com>
acezen added a commit that referenced this pull request Jun 13, 2024
* feat(spark): Refactoring datasources (#514)

### Reason for this PR
By moving datasources under `org.apache.spark.sql` we are able to access private Spark API. Last time when I was trying to fully migrate datasources to V2 it was a blocker. Detailed motivation is in #493 

### What changes are included in this PR?
Mostly refactoring.

### Are these changes tested?
Unit tests are passed

I manually checked the generated JARs:
![image](https://github.com/apache/incubator-graphar/assets/29755009/1b094516-88b1-490a-a2ea-8dcd092a3b1d)

### Are there any user-facing changes?
Mostly not because `GarDataSource` was left under the same package.


Close #493

* feat(dev): Add release and verify scripts (#507)

Reason for this PR
Add scripts for developer or release manager to easily release version or verify a version.

What changes are included in this PR?
Add release and verify scripts
related document is updated to website, see Update the release and verify document, and add development document incubator-graphar-website#18
Are these changes tested?
yes

Are there any user-facing changes?
no
---------

Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* chore: Bump to version v0.12.0 (Round 1) (#517)


Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* chore: Add CHANGELOG.md (#513)


Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>

* Introduce buf

- v2
- buf.gen
- buf

 On branch format-definition-dev
 Your branch is up to date with 'origin/format-definition-dev'.

 Changes to be committed:
	new file:   buf.gen.yaml
	new file:   buf.yaml
	modified:   format/adjacent_list.proto
	modified:   format/edge_info.proto
	modified:   format/graph_info.proto
	modified:   format/property_group.proto
	modified:   format/types.proto
	modified:   format/vertex_info.proto

---------

Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>
Co-authored-by: Weibin Zeng <qiaozi.zwb@alibaba-inc.com>
Elssky pushed a commit to Elssky/incubator-graphar that referenced this pull request Oct 8, 2024
### Reason for this PR
By moving datasources under `org.apache.spark.sql` we are able to access private Spark API. Last time when I was trying to fully migrate datasources to V2 it was a blocker. Detailed motivation is in apache#493 

### What changes are included in this PR?
Mostly refactoring.

### Are these changes tested?
Unit tests are passed

I manually checked the generated JARs:
![image](https://github.com/apache/incubator-graphar/assets/29755009/1b094516-88b1-490a-a2ea-8dcd092a3b1d)

### Are there any user-facing changes?
Mostly not because `GarDataSource` was left under the same package.


Close apache#493
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spark] Move datasources under org.apache.spark package
2 participants