
SNOW-1337235 SHOW <TABLES, VIEWS> has inconsistent JSON responses that can't be converted to Arrow correctly #1110

Open
davidhcoe opened this issue Apr 19, 2024 · 17 comments
Assignees
Labels
enhancement The issue is a request for improvement or a new feature status-triage_done Initial triage done, will be further handled by the driver team

Comments

@davidhcoe

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

  1. What version of GO driver are you using?

1.9

  2. What operating system and processor architecture are you using?

Windows x64

  3. What version of Go are you using?
    Run go version in your console.

go version go1.21.3 windows/amd64

  4. Server version: e.g. 1.90.1
You may get the server version by running a query:

SELECT CURRENT_VERSION();

8.15.0

  5. What did you do?

    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.

The issue was first noticed and logged as an ADBC issue on apache/arrow-adbc#1454.

The problem seems to be that when a SHOW command that returns a large number of results is run, the error arrow/ipc: could not read message schema: arrow/ipc: could not read message metadata: unexpected EOF is received.

Upon digging into this further, the behavior of the call is inconsistent. If the call includes a LIMIT clause to restrict the number of records to, say, 100, then things work as expected:

[screenshot]

And the 100 values are returned in the JSON:

[screenshot]

No Arrow record batches are returned from the ArrowStreamLoader class because the batch list is nil. The downstream calls all succeed as expected.

However, a larger number of records (of indeterminate size) in the result causes that same request to fail. For example, if limited to 700, only 562 records are returned:

[screenshot]

[screenshot]

And those records end with a "," rather than EOF, because the JSON string isn't loaded correctly.

Now, ArrowStreamLoader reports there is 1 record batch. This leads to downstream callers failing:

[screenshot]

because the string can't be fully read in:

[screenshot]

  6. What did you expect to see?

    The correct number of tables from SHOW TABLES should be sent, no matter how many there are.

  7. Can you set logging to DEBUG and collect the logs?

    https://community.snowflake.com/s/article/How-to-generate-log-file-on-Snowflake-connectors

  8. What is your Snowflake account identifier, if any? (Optional)

I included screenshots that contain QueryIDs. Hopefully that helps.

@davidhcoe davidhcoe added the bug Erroneous or unexpected behaviour label Apr 19, 2024
@sfc-gh-dszmolka
Contributor

Hi, and thank you for submitting this issue. I'll take a look sometime early next week. If the issue is very urgent or impactful, as suggested during issue creation, please create a case with Snowflake Support instead.

As a quick preliminary analysis:

  1. I'm already quite certain we can rule this out, but to be absolutely sure the issue is not caused by a theoretical connectivity issue to the cloud storage from which the query results are coming, can you please run SELECT SYSTEM$ALLOWLIST() (or ALLOWLIST_PRIVATELINK() if that's your use case), and make sure that all endpoints returned in the STAGE category are in fact reachable from the Windows host on which you have the problem?
    Alternatively, you can test this by running something like SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER LIMIT 400000 with gosnowflake and seeing whether the result is successfully returned (as Arrow, that is, since this only tests connectivity, not Arrow vs. JSON result sets and their size)

  2. If you have runnable reproduction code which, when executed, exhibits the problem you're seeing, that would be truly great and appreciated and would likely speed up the debugging effort a lot

  3. I took a quick look at the linked ADBC issue, and user zeroshade might be right here:

[..]snowflake doesn't return the results of SHOW TABLES using Arrow, it returns it as JSON that we have to parse and manipulate into Arrow. This is likely an issue that needs to be addressed on the server side of Snowflake

If the issue is indeed proven to be on the server side, then it is independent of the gosnowflake (or any other) driver and it might take a longer period to fix.
But I don't want to go there yet; instead I want to understand the issue better when I have a chance early next week.

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Apr 20, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Apr 20, 2024
@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Apr 22, 2024

I did a test on Debian 11 to see how gosnowflake handles a large result set for SHOW TABLES, because it was very odd that no one else had reported such an issue (so far) for such basic functionality.

If there is any issue in the JSON result set parsing, it should be visible regardless of OS, but to be sure I'll try to test this on Windows too.
Edit: the same script (below) works the same on Windows 10 as well, using nothing but vanilla gosnowflake 1.9.0.

Reproduction:

  1. Created a new account on the same deployment as your account in the example, and 3000 tables with 1 int column.
  2. Used the following simple script with vanilla gosnowflake 1.9.0, without ADBC:
# cat gosnowflake1110.go 
package main

import (
	"database/sql"
	"flag"
	"fmt"
	"log"

	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	if !flag.Parsed() {
		flag.Parse()
	}

	cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
		{Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
		{Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
		{Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
		{Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
		{Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
		{Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
		{Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
	})
	if err != nil {
		log.Fatalf("failed to create Config, err: %v", err)
	}
	//cfg.Tracing = "DEBUG"
	dsn, err := sf.DSN(cfg)
	if err != nil {
		log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
	}

	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		log.Fatalf("failed to connect. %v, err: %v", dsn, err)
	}
	defer db.Close()
	query := "SHOW TABLES LIMIT 2999;"
	rows, err := db.Query(query) // no cancel is allowed
	if err != nil {
		log.Fatalf("failed to run a query. %v, err: %v", query, err)
	}
	defer rows.Close()
	var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23 sql.NullString
	for rows.Next() {
		err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22, &v23)
		if err != nil {
			log.Fatalf("failed to get result. err: %v", err)
		}
		fmt.Println(v2)
	}
	if rows.Err() != nil {
		fmt.Printf("ERROR: %v\n", rows.Err())
		return
	}
	fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}

Result: expected tables are returned:

# go run gosnowflake1110.go
{TESTTABLE1 true}
{TESTTABLE10 true}
{TESTTABLE100 true}
{TESTTABLE1000 true}
{TESTTABLE1001 true}
{TESTTABLE1002 true}
{TESTTABLE1003 true}
..
{TESTTABLE997 true}
{TESTTABLE998 true}
Congrats! You have successfully run SHOW TABLES LIMIT 2999; with Snowflake DB!

When adding cfg.Tracing = "DEBUG" to see verbose logs, it's visible that the result set now uses cloud storage (since it's a fairly big query result), indeed in JSON format:

time="2024-04-22T08:28:11Z" level=debug msg="chunks: 1, total bytes: 427543" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:11Z" level=debug msg="add chunk to channel ChunksChan: 1, URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>%2Fmain%2Fdata_1?s..&rsce=gzip, RowCount: 2381, UncompressedSize: 427543, ChunkResultFormat: json" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"

This chunk, as the console output suggests, is successfully downloaded and parsed:

time="2024-04-22T08:28:12Z" level=debug msg="response returned chunk: 1 for URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>d_0%2Fmain%2Fdata_1?..&rsce=gzip" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="decoded 2381 rows w/ 427543 bytes in 18.907967ms (chunk 1)" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="ready: chunk 1" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=info msg="no more download" func="gosnowflake.(*defaultLogger).Info" file="log.go:216"

As the next step, could you please:

  1. use gosnowflake without ADBC to see if the issue still reproduces for you, maybe even with the above simple repro script
  2. kindly provide a runnable reproduction (plus, if necessary, the environment setup) which, when run, leads to the issue you're seeing

I'm asking because for now, using only the Snowflake Go driver, SHOW TABLES does not seem to have any problems, even with a big result set / large number of tables.

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-information_needed Additional information is required from the reporter label Apr 22, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SHOW <TABLES, VIEWS> has inconsistent JSON responses that can't be converted to Arrow correctly SNOW-1337235 SHOW <TABLES, VIEWS> has inconsistent JSON responses that can't be converted to Arrow correctly Apr 22, 2024
@davidhcoe
Author

I can't run it as-is; I get the error:

2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23

But I can run it if I drop v23.

@davidhcoe
Author

But that call is also just a JSON call with no Arrow context to it. I will try to get a closer re-creation.

@sfc-gh-dszmolka
Contributor

I can't run it as-is; I get the error:

2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23

But I can run it if I drop v23.

Apparently you have one column less in your response for some reason; maybe we're on different versions or there are some settings that differ between our accounts.

Being able to run it without the last column means it is possible to retrieve the 'big' result set for SHOW TABLES in your account when using the gosnowflake library without ADBC.
Indeed it's a JSON result set, and that's what we wanted to see.

It would be very much appreciated if you could please provide a repro; as you see, the issue does not reproduce with a gosnowflake-only test - at least for now - but I would like to see how it breaks for you.

@davidhcoe
Author

Not quite the repro, but this is closer to the scenario:

package main

import (
	"context"
	"database/sql"
	"flag"
	"fmt"
	"log"

	"github.com/apache/arrow/go/v15/arrow/memory"
	"github.com/snowflakedb/gosnowflake"
	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	if !flag.Parsed() {
		flag.Parse()
	}

	cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
		{Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
		{Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
		{Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
		{Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
		{Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
		{Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
		{Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
	})
	if err != nil {
		log.Fatalf("failed to create Config, err: %v", err)
	}

	//cfg.Tracing = "DEBUG"
	dsn, err := sf.DSN(cfg)
	if err != nil {
		log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
	}

	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		log.Fatalf("failed to connect. %v, err: %v", dsn, err)
	}
	defer db.Close()

	ctx := context.Background()
	ctx = gosnowflake.WithArrowAllocator(
		gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)

	query := "SHOW TABLES LIMIT 2999;"
	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		log.Fatalf("failed to run a query. %v, err: %v", query, err)
		return
	}
	defer rows.Close()
	var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22 sql.NullString

	for rows.Next() {
		err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22)
		if err != nil {
			log.Fatalf("failed to get result. err: %v", err)
		}
		fmt.Println(v2)
	}
	if rows.Err() != nil {
		fmt.Printf("ERROR: %v\n", rows.Err())
		return
	}
	fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}

results in the error:

panic: close of nil channel

goroutine 1 [running]:
github.com/snowflakedb/gosnowflake.(*snowflakeChunkDownloader).next(0xc000002300)
	C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/gosnowflake@v1.9.0/chunk_downloader.go:221 +0x20b
github.com/snowflakedb/gosnowflake.(*snowflakeRows).Next(0xc000508e70, {0xc00047c160, 0x16, 0x16})
	C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/gosnowflake@v1.9.0/rows.go:178 +0xb9
database/sql.(*Rows).nextLocked(0xc00079cd80)
	C:/Program Files/Go/src/database/sql/sql.go:3019 +0x1f4
database/sql.(*Rows).Next.func1()
	C:/Program Files/Go/src/database/sql/sql.go:2994 +0x2e
database/sql.withLock({0x1a8a840, 0xc00079cdb8}, 0xc0000d7690)
	C:/Program Files/Go/src/database/sql/sql.go:3502 +0xc2
database/sql.(*Rows).Next(0xc00079cd80)
	C:/Program Files/Go/src/database/sql/sql.go:2993 +0xf7
main.main()
	c:/dev/test/AdbcTestDefinitions/Snowflake/SHOW_ISSUE/Repro-Show.go:58 +0x1409

@sfc-gh-dszmolka
Copy link
Contributor

sfc-gh-dszmolka commented Apr 23, 2024

thank you for the repro and details!

ctx = gosnowflake.WithArrowAllocator(
		gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)
..
rows, err := db.QueryContext(ctx, query)

will not work with SHOW queries (and some other queries which return a JSON result set, like ls @stagename).

This is documented in the Supported Data Types section of the driver documentation:

[..]Limitations:

  1. For some queries Snowflake may decide to return data in JSON format (examples: SHOW PARAMETERS or ls @stage). You cannot use JSON with Arrow batches context.

You seem to be hitting this expected behaviour.

Edited for a possible way forward:

  1. keep using WithArrowBatches, but then GetArrowBatches needs to be used to get the result, instead of getting it the 'regular' way. Example (again from the documentation; see also the sketch right after this list):
batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
  2. or don't force Arrow on the JSON result set and get the rows as you're trying to:
rows, err := db.QueryContext(ctx, query)
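
For reference, a minimal sketch of option 1 (not verbatim from the documentation): it assumes the db handle and query from the repro scripts above, additionally imports database/sql/driver, and assumes a query that actually returns an Arrow result set. The rows have to be obtained on the raw driver connection so the assertion to sf.SnowflakeRows works:

ctx := sf.WithArrowBatches(context.Background())

conn, err := db.Conn(ctx)
if err != nil {
	log.Fatalf("failed to get a connection, err: %v", err)
}
defer conn.Close()

// run the query at the driver level so the result can be asserted to sf.SnowflakeRows
var rows driver.Rows
err = conn.Raw(func(x interface{}) error {
	var err error
	rows, err = x.(driver.QueryerContext).QueryContext(ctx, query, nil)
	return err
})
if err != nil {
	log.Fatalf("failed to run a query. %v, err: %v", query, err)
}
defer rows.Close()

batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
if err != nil {
	log.Fatalf("failed to get Arrow batches, err: %v", err)
}
for _, batch := range batches {
	// Fetch downloads this batch and decodes it into Arrow records
	records, err := batch.Fetch()
	if err != nil {
		log.Fatalf("failed to fetch a batch, err: %v", err)
	}
	for _, rec := range *records {
		fmt.Println(rec.NumRows())
		rec.Release()
	}
}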

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team question Issue is a usage/other question rather than a bug and removed bug Erroneous or unexpected behaviour status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Apr 23, 2024
@davidhcoe
Copy link
Author

The problem is that the call isn't consistent. When the call has the lower number, it runs correctly and the JSON can be parsed.

The comment
// Some queries from Snowflake do not return Arrow data regardless
indicates that JSONData should be populated, and that is parsed in https://github.com/apache/arrow-adbc/blob/93dda626daf52a15673cc08cdd1ef69b85601ecc/go/adbc/driver/snowflake/record_reader.go#L363. Tagging @zeroshade if he has some more thoughts here.

@zeroshade
Contributor

I think I know what's happening.

If you have enough rows in your output, snowflake will chunk the data, returning the first chunk with the initial response and requiring you to fetch the remaining results as separate requests. It looks like snowflakeArrowStreamChunkDownloader isn't handling this scenario properly in order to report this to the consumer.

In the ADBC code that @davidhcoe linked to, we check the number of batches returned by GetBatches, because the downloader will just skip the first batch if it was solely JSON. But since there are subsequent expected batches - just as JSON data - it still reports the extra batches, and ArrowStreamBatch's assumption of receiving Arrow data is no longer valid for those batches.

For consistency, I think the ideal scenario here would be for gosnowflake's ArrowStreamBatch to be updated to simply check if the batch is JSON and not arrow, and just convert the JSON to arrow like we do in the ADBC code that was linked. This would remove the burden from consumers to have to check whether or not they actually got Arrow data back based on the query and just seamlessly provide the Arrow data if that's what the consumer wanted.
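
To make that concrete, here is a rough, hypothetical sketch (not existing gosnowflake or ADBC code) of the kind of JSON-to-Arrow conversion being described, turning JSON string rows of the [][]*string shape into a single all-string Arrow record with the arrow/go builders; a real implementation would also have to map Snowflake column types instead of treating everything as strings:

import (
	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

// jsonRowsToRecord is an illustrative helper: it builds a nullable all-string
// Arrow record from JSON rows, where a nil *string represents SQL NULL.
func jsonRowsToRecord(rows [][]*string, colNames []string) arrow.Record {
	fields := make([]arrow.Field, len(colNames))
	for i, name := range colNames {
		fields[i] = arrow.Field{Name: name, Type: arrow.BinaryTypes.String, Nullable: true}
	}
	schema := arrow.NewSchema(fields, nil)

	bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer bldr.Release()

	for _, row := range rows {
		for i, val := range row {
			sb := bldr.Field(i).(*array.StringBuilder)
			if val == nil {
				sb.AppendNull()
			} else {
				sb.Append(*val)
			}
		}
	}
	// the caller is responsible for releasing the returned record
	return bldr.NewRecord()
}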

@sfc-gh-pfus
Collaborator

Hi @zeroshade! My input from a driver developer's perspective:

  1. QueryArrowStream should be used only for Arrow requests. It shouldn't be used if JSON is returned (e.g. JSON is set in the session as the default response data format, we ask for metadata like SHOW queries, or we use multistatements, which don't support Arrow).
  2. We don't want to convert JSON to Arrow in drivers. Currently drivers don't create Arrow on their own (we only reuse existing Arrow arrays). Converting would have some important drawbacks:
    a) We would have to copy the backend logic for converting all types to Arrow. This is a huge effort, considering all the (sometimes complicated) types, like timestamps or the upcoming structured types.
    b) We would have to follow all the new types that are introduced every now and then and keep the drivers updated.
    c) We would have to think about the points above in all drivers, of which there are now 7.

@zeroshade
Contributor

@sfc-gh-pfus

  1. Since ADBC just forwards query strings to gosnowflake, there's no way for us to know whether or not a query is a metadata "SHOW" query, etc. That means that, at minimum, QueryArrowStream is going to need something added to it which allows indicating that the results are JSON and not Arrow. The current attempt I made, just checking the len of the batches slice returned from GetBatches, isn't sufficient, since JSON can return chunked batches too (which is the cause of the initial problem).
  2. Do all the drivers currently allow exporting Arrow data? It was my understanding that this driver and the Python connector were the only ones that make Arrow externally accessible.

@sfc-gh-pfus
Collaborator

@zeroshade
Ad 1. ArrowStreamLoader, which is the return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.
Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.
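
To illustrate the first point, a minimal sketch of deciding between the two paths based on JSONData(). It assumes an sf.ArrowStreamLoader already obtained via QueryArrowStream, and that GetStream returns an io.ReadCloser over the raw chunk bytes; note the caveat discussed below, that for chunked JSON results this first-chunk check alone is not sufficient:

import (
	"context"
	"fmt"
	"io"

	sf "github.com/snowflakedb/gosnowflake"
)

// consumeLoader is a hypothetical consumer-side helper, not part of gosnowflake.
func consumeLoader(ctx context.Context, loader sf.ArrowStreamLoader) error {
	if jsonRows := loader.JSONData(); len(jsonRows) > 0 {
		// JSON result set (e.g. SHOW TABLES): print the first column of each row
		for _, row := range jsonRows {
			if len(row) > 0 && row[0] != nil {
				fmt.Println(*row[0])
			}
		}
		return nil
	}
	// otherwise the batches are expected to carry Arrow IPC streams
	batches, err := loader.GetBatches()
	if err != nil {
		return err
	}
	for _, batch := range batches {
		rdr, err := batch.GetStream(ctx)
		if err != nil {
			return err
		}
		// ... hand rdr to ipc.NewReader here; drained and closed for brevity ...
		io.Copy(io.Discard, rdr)
		rdr.Close()
	}
	return nil
}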

@sfc-gh-dszmolka sfc-gh-dszmolka added enhancement The issue is a request for improvement or a new feature and removed question Issue is a usage/other question rather than a bug labels Apr 25, 2024
@zeroshade
Contributor

Ad 1. ArrowStreamLoader, which is the return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.

The problem there is that it doesn't correctly handle the case where the JSON data is chunked. It returns the JSON data from the initial response, and it's still difficult to tell when you need to call JSONData vs. using the Arrow interface, since that is entirely dependent on the query that was sent, which the caller doesn't know about. This inconsistency, where metadata queries can't return Arrow when requested, is problematic because there's no way for callers like ADBC to know whether the query supported Arrow or not.

Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.

You will likely run into this problem again when you do that, as customers are almost certainly going to dislike the inconsistency. Personally, my preferred solution here would be to add support for returning Arrow for these metadata queries.

But for now, at minimum, something should be added to ArrowStreamLoader to explicitly denote whether a consumer needs to use JSONData and whether they should expect JSON data from the batches.
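
Concretely, something along these lines (purely hypothetical, not part of the current gosnowflake API; only the relevant methods are shown) would be enough for callers like ADBC:

// Hypothetical extension of ArrowStreamLoader; the method name is illustrative only.
type ArrowStreamLoader interface {
	GetBatches() ([]ArrowStreamBatch, error)
	JSONData() [][]*string
	// ResultIsJSON would report whether the server returned a JSON result set
	// (e.g. for SHOW TABLES), i.e. whether any remaining batches are JSON chunks
	// rather than Arrow IPC streams.
	ResultIsJSON() bool
}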

@ianmcook

FWIW: In general, converting a result set from JSON format to Arrow format is a nontrivial operation for a client or driver library to have to perform. It requires using dependencies which can be heavy and can behave subtly differently depending on what language implementation you're using. Client-side conversion also creates greater risk of inconsistency because the logic for converting Snowflake types to Arrow types can be complicated. +1 to Matt's suggestion to always convert to Arrow on the server side whenever possible. This is better for keeping connectors lightweight and for ensuring consistency across connectors.

@sfc-gh-pfus
Collaborator

@ianmcook thanks for your input! However, the problem on the backend is (to the best of my knowledge) similar to the problem in the drivers - there is one component that prepares the actual table data (and this component can serialise Arrow) and another one which returns metadata (and this one probably can't - but I'm not sure).

@zeroshade I analysed the code a bit and I'm wondering about one thing. Currently the process looks like this:

  1. Call the query.
  2. Check if JSONData is empty -> if it is, it means that everything should be available in GetBatches.
  3. Call GetBatches and iterate over each batch, downloading the batch content using GetStream. This function returns just bytes.
  4. On the ADBC side - read the bytes into Arrow records.

What do you think about this approach:

  1. Call the query.
  2. Check if JSONData is empty -> if it is not, it means the batches should also be JSON.
  3. Same as above.
  4. On the ADBC side - get JSONData, and deserialise the remaining batches' bytes as JSON instead of Arrow.

What do you think?

@zeroshade
Contributor

On the ADBC side - get JSONData, and deserialise the remaining batches' bytes as JSON instead of Arrow.
What do you think?

I was toying with this a bit, and I can do this, but it is a terrible solution for the general case, mostly because the chunks aren't actually valid JSON. They are a series of JSON arrays separated by commas:

["a", "b", "c", "d"], ["e", "f", "g", "h"], ....

with no top-level structure as per the JSON spec. Looking at the gosnowflake code, there's an entire section devoted to custom-parsing this JSON, because you can't just deserialize it as normal JSON.

It also means that if Snowflake ever changes how it formats the JSON for the large chunks, it'll break us, as we'll now be relying on an undocumented, non-standard way of representing the JSON data. While this is likely a viable workaround for this situation, I don't think it is a viable long-term solution.
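
For reference, a rough sketch of that custom parsing (under the assumption that a downloaded chunk's bytes look exactly like the example above, i.e. comma-separated JSON arrays with no enclosing brackets): wrapping the payload in "[" and "]" turns it into one valid JSON array of rows that encoding/json can handle.

import "encoding/json"

// parseJSONChunk is an illustrative helper, not the actual gosnowflake or ADBC code.
// A nil entry in the result represents a SQL NULL.
func parseJSONChunk(chunk []byte) ([][]*string, error) {
	wrapped := make([]byte, 0, len(chunk)+2)
	wrapped = append(wrapped, '[')
	wrapped = append(wrapped, chunk...)
	wrapped = append(wrapped, ']')

	var rows [][]*string
	if err := json.Unmarshal(wrapped, &rows); err != nil {
		return nil, err
	}
	return rows, nil
}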

lidavidm pushed a commit to apache/arrow-adbc that referenced this issue Apr 30, 2024
…itations (#1790)

Workaround to fix #1454 until snowflake addresses
snowflakedb/gosnowflake#1110 with a better
solution (hopefully by having the server actually return Arrow...)
@sfc-gh-pfus
Collaborator

Sorry for the late reply, I was absent the previous week.
I'm glad that you implemented a workaround and that it works! I fully agree that this is not a good long-term solution, but with this workaround in place we can plan proper changes on the backend (I believe so). We will discuss it internally, and when I have some info, I'll let you know.
One last thought - as filthy and ugly as the current format is, it is still not going to be changed in any foreseeable future in my opinion, so your workaround is safe for now ;)

cocoa-xu pushed a commit to meowcraft-dev/arrow-adbc that referenced this issue May 8, 2024
…itations (apache#1790)

Workaround to fix apache#1454 until snowflake addresses
snowflakedb/gosnowflake#1110 with a better
solution (hopefully by having the server actually return Arrow...)