
com.crealytics.spark.excel doesn't read directly from ADL #125

Open
Mathyaku opened this issue May 29, 2019 · 5 comments
Labels
bug cloud Usage of spark-excel on cloud storage & platform

Comments

@Mathyaku

I'm getting the following error:

[2019-05-29 18:25:21,894] {init.py:1580} ERROR - An error occurred while calling o77.load.
: java.io.IOException: Password fs.adl.oauth2.client.id not found
at org.apache.hadoop.fs.adl.AdlFileSystem.getPasswordString(AdlFileSystem.java:950)
at org.apache.hadoop.fs.adl.AdlFileSystem.getConfCredentialBasedTokenProvider(AdlFileSystem.java:289)

ex1 - DOESN'T WORK:

spark = sparkSession....
df = (spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste.xls"))

PS:

If I first read any other file from my ADL with the same sparkSession and then read the .xls, everything works.

ex2 - WORKS:

spark = sparkSession....
(spark.read.format("csv")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste2.csv"))

df = (spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste.xls"))
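For completeness, the ex2 workaround can be sketched end to end. This is a hypothetical reconstruction: the `fs.adl.oauth2.*` keys are the standard Hadoop ADLS Gen1 (ClientCredential OAuth) settings, and all credential values and paths are placeholders, not taken from the original report.

```python
# Sketch of the ex2 workaround: a first read through a built-in source (CSV)
# initializes the adl:// filesystem with the credentials, after which the
# spark-excel read of the same store succeeds.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adl-excel-workaround")
         # Placeholder ADLS Gen1 OAuth settings (ClientCredential flow)
         .config("spark.hadoop.fs.adl.oauth2.access.token.provider.type",
                 "ClientCredential")
         .config("spark.hadoop.fs.adl.oauth2.client.id", "<client-id>")
         .config("spark.hadoop.fs.adl.oauth2.credential", "<client-secret>")
         .config("spark.hadoop.fs.adl.oauth2.refresh.url",
                 "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
         .getOrCreate())

# Step 1: read any file with a built-in reader to warm up the ADL filesystem.
spark.read.format("csv").load("adl://test.azuredatalakestore.net/teste2.csv")

# Step 2: the Excel read now works.
df = (spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "false")
      .option("skipFirstRows", "15")
      .load("adl://test.azuredatalakestore.net/teste.xls"))
```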

@nightscape
Owner

Hmm, I'm quite clueless about what we'd have to do to support ADL properly. Would you be willing to contribute a PR or dig out the corresponding documentation?
We don't have this use case and can't spend much time on this...

@aravish

aravish commented Jul 10, 2019

We are trying the below from Databricks; per them, this is the update:

This happens because the Spark reader used to load the Excel file does not honor the configs given as Hadoop configuration, so it does not pick them up.

Repro code: Azure Data Lake Store (ADL) is an Azure storage platform. The problem occurs only when you reference a full path like below; when you mount the storage as a mount point on Databricks, the problem does not occur.

dayreportfullpath = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load("adl://aravishdatalake.azuredatalakestore.net/external/Test.xlsx")

IllegalArgumentException: 'No value for dfs.adls.oauth2.access.token.provider found in conf file.'

IllegalArgumentException Traceback (most recent call last)
in ()
----> 1 dayreportfullpath = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load("adl://aravishdatalake.azuredatalakestore.net/external/Test.xlsx")

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco

IllegalArgumentException: 'No value for dfs.adls.oauth2.access.token.provider found in conf file.'

@brickfrog

brickfrog commented Oct 18, 2019

For anyone else having this issue: you need to use the RDD context (the Hadoop configuration on the SparkContext). You can also mount the store, but in some cases you may be averse to mounting (as in my use case).

spark.sparkContext.hadoopConfiguration.set(...

This is what worked for me earlier today. Was able to read from ADLS without mounting.
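The comment above shows the Scala form; a minimal PySpark equivalent might look like the following. This is a sketch under assumptions: ADLS Gen1 with the ClientCredential OAuth flow, the standard `fs.adl.oauth2.*` Hadoop key names, and placeholder credential values.

```python
# Set the credentials on the SparkContext's Hadoop configuration directly
# (the "RDD context"), so spark-excel's direct filesystem access sees them
# without mounting the store.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("fs.adl.oauth2.client.id", "<client-id>")
hconf.set("fs.adl.oauth2.credential", "<client-secret>")
hconf.set("fs.adl.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# The Excel read now resolves the adl:// path with those credentials.
df = (spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "false")
      .load("adl://test.azuredatalakestore.net/teste.xls"))
```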

@axen22

axen22 commented Oct 23, 2019

@brickfrog mounting does not work for me; I still get the error mentioned above. Can you provide your code as an example?
Also, what do you mean by "use the RDD context"? Can you provide an example?

@quanghgx quanghgx added bug cloud Usage of spark-excel on cloud storage & platform labels Oct 3, 2021
@divyavanmahajan

For someone who comes here looking for a PySpark solution:
Spark 3.1.2
Cannot read an abfss:// URL with spark-excel.

Use
com.crealytics:spark-excel_2.12:0.13.7
and set the Azure OAuth parameters with
spark._jsc.hadoopConfiguration().set(key, value)
in addition to
spark.conf.set(key, value)

@brickfrog - thanks for pointing us in the right direction.
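Putting the above together for ADLS Gen2 (abfss://) on Spark 3.x, a hypothetical sketch: the `fs.azure.*` keys are the standard ABFS OAuth settings, and the account, tenant, container, and secret values are placeholders. Setting each key on both the session conf and the JVM Hadoop configuration covers readers that consult either one.

```python
# Standard ABFS OAuth (ClientCredsTokenProvider) settings for ADLS Gen2;
# all account/credential values are placeholders.
account = "<storage-account>.dfs.core.windows.net"
adls_conf = {
    f"fs.azure.account.auth.type.{account}": "OAuth",
    f"fs.azure.account.oauth.provider.type.{account}":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{account}": "<client-id>",
    f"fs.azure.account.oauth2.client.secret.{account}": "<client-secret>",
    f"fs.azure.account.oauth2.client.endpoint.{account}":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Set each key both ways: spark.conf for Spark's own readers, and the
# JVM Hadoop configuration for spark-excel's direct filesystem access.
for key, value in adls_conf.items():
    spark.conf.set(key, value)
    spark._jsc.hadoopConfiguration().set(key, value)

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")  # option name used by spark-excel 0.13.x
      .load(f"abfss://<container>@{account}/Test.xlsx"))
```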

7 participants