Skip to content

Latest commit

 

History

History
2223 lines (2178 loc) · 236 KB

CHANGELOG.md

File metadata and controls

2223 lines (2178 loc) · 236 KB

Change log

Generated on 2022-06-17

Release 22.06

Features

#5451 [FEA] Update Spark2 explain code for 22.06
#5261 [FEA] Create MIG with Cgroups on YARN Dataproc scripts
#5476 [FEA] extend concat on arrays to all nested types.
#5113 [FEA] ANSI mode: Support CAST between types
#5112 [FEA] ANSI mode: allow casting between numeric type and timestamp type
#5323 [FEA] Enable floating point by default
#4518 [FEA] Add support for escaped unicode hex in regular expressions
#5405 [FEA] Support map_concat function
#5547 [FEA] Regexp: Can we transpile \W and \D to Java's definition so we can support on GPU?
#5512 [FEA] Qualification tool, hook up final output and output execs table
#5507 [FEA] Support GpuRaiseError
#5325 [FEA] Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys
#3682 [FEA] Use conventional jar layout in dist jar if there is only one input shim
#1556 [FEA] Implement ANSI mode tests for string to timestamp functions
#4425 [FEA] Support line anchor $ and string anchors \z and \Z in regexp_replace
#5176 [FEA] Qualification tool UI
#5111 [FEA] ANSI mode: CAST between ANSI intervals and IntegralType
#4605 [FEA] Add regular expression support for new character classes introduced in Java 8
#5273 [FEA] Support map_filter
#1557 [FEA] Enable ANSI mode for CAST string to date
#5446 [FEA] Remove hasNans check for array_contains
#5445 [FEA] Support reading Int as Byte/Short/Date from parquet
#5449 [FEA] QualificationTool. Add speedup information to AppSummaryInfo
#5322 [FEA] remove hasNans for Pivot
#4800 [FEA] Enable support for more regular expressions with \A and \Z
#5404 [FEA] Add Shim for the Spark version shipped with Cloudera CDH 7.1.7
#5226 [FEA] Support array_repeat
#5229 [FEA] Support arrays_zip
#5119 [FEA] Support ANSI mode for SQL functions/operators
#4532 [FEA] Re-enable support for \Z in regular expressions
#3985 [FEA] UDF-Compiler: Translation of simple predicate UDF should allow predicate pushdown
#5034 [FEA] Implement ExistenceJoin for BroadcastNestedLoopJoin Exec
#4533 [FEA] Re-enable support for $ in regular expressions
#5263 [FEA] Write out operator mapping from plugin to CSV file for use in qualification tool
#5095 [FEA] Support collect_set on struct in reduction context
#4811 [FEA] Support ANSI intervals for Cast and Sample
#2062 [FEA] support collect aggregations
#5060 [FEA] Support Count on Struct of [ Struct of [String, Map(String,String)], Array(String), Map(String,String) ]
#4528 [FEA] Add support for regular expressions containing \s and \S
#4557 [FEA] Add support for regexp_replace with back-references

Performance

#5148 Add the MULTI-THREADED reading support for avro
#5304 [FEA] Optimize remote Avro reading for a PartitionFile
#5257 [FEA][Audit] - [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
#5149 Add the COALESCING reading support for avro

Bugs Fixed

#5769 [BUG] arithmetic ops tests failing on Spark 3.3.0
#5785 [BUG] Tests module build failed in OrcEncryptionSuite for 321cdh
#5765 [BUG] Container decimal overflow when casting float/double to decimal
#5246 Verify Parquet columnar encryption is handled safely
#5770 [BUG] test_buckets failed
#5733 [BUG] Integration test test_orc_write_encryption_fallback fail
#5719 [BUG] test_cast_float_to_timestamp_ansi_for_nan_inf failed in spark330
#5739 [BUG] Spark 3.3 build failure - QueryExecutionErrors package scope changed
#5670 [BUG] Job failed when parsing "java.lang.reflect.InvocationTargetException: org.apache.spark.sql.catalyst.parser.ParseException:"
#4860 [BUG] GPU writing ORC columns statistics
#5717 [BUG] div_by_zero test is failing on Spark 330 on 22.06
#5632 [BUG] udf_cudf tests failed: EOFException DataInputStream.readInt(DataInputStream.java:392)
#5672 [BUG] Read exception occurs when clipped schema is empty
#5694 [BUG] Inconsistent behavior with Spark when reading a non-existent column from Parquet
#5562 [BUG] read ORC file with various file schemas
#5654 [BUG] Transpiler produces regex pattern that cuDF cannot compile
#5655 [BUG] Regular expression pattern [&&1] produces incorrect results on GPU
#4862 [FEA] Add support for regular expressions containing octal digits inside character classes , eg[\0177]
#5615 [BUG] GpuBatchScanExec only reports output row metrics
#4505 [BUG] RegExp parse fails to parse character ranges containing escaped characters
#4865 [BUG] Add support for regular expressions containing hexadecimal digits inside character classes, eg [\x7f]
#5513 [BUG] NoClassDefFoundError with caller classloader off in GpuShuffleCoalesceIterator in local-cluster
#5530 [BUG] regexp: \d, \w inconsistencies with non-latin unicode input
#5594 [BUG] 3.3 test_div_overflow_exception_when_ansi test failures
#5596 [BUG] Shim service provider failure when using jar built with -DallowConventionalDistJar
#5582 [BUG] Nightly CI failed with : 'dist/target/rapids-4-spark_2.12-22.06.0-SNAPSHOT.jar' not exists
#5577 [BUG] test_cast_neg_to_decimal_err failing in databricks
#5557 [BUG] dist jar does not contain reduced pom, creates an unnecessary jar
#5474 [BUG] Spark 3.2.1 arithmetic_ops_test failures
#5497 [BUG] 3 tests in IntervalSuite are faling on 330
#5544 [BUG] GpuCreateMap needs to set hasSideEffects in some cases
#5469 [BUG] NPE during serialization for shuffle in array-aggregation-with-limit query
#5496 [BUG] avg literals bools is failing on 330
#5511 [BUG] orc_test failures on 321cdh
#5439 [BUG] Encrypted Parquet writes are being replaced with a GPU unencrypted write
#5108 [BUG] GpuArrayExists encounters a CudfException on an input partition consisting of just empty lists
#5492 [BUG] com.nvidia.spark.rapids.RegexCharacterClass cannot be cast to com.nvidia.spark.rapids.RegexCharacterClassComponent
#4818 [BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream
#5481 [BUG] test_parquet_check_schema_compatibility failed in databricks runtimes
#5482 [BUG] test_cast_string_date_invalid_ansi_before_320 failed in databricks runtime
#5457 [BUG] 330 AnsiCastOpSuite Unit tests failed 22 cases
#5098 [BUG] Harden calls to RapidsBuffer.free
#5464 [BUG] Query failure with java.lang.AssertionError when using partitioned Iceberg tables
#4746 [FEA] Add support for regular expressions containing octal digits in range \200 to 377
#5200 [BUG] More detailed logs to show which parquet file and which data type has mismatch.
#4866 [BUG] Add support for regular expressions containing hexadecimal digits greater than 0x7f
#5140 [BUG] NPE on array_max of transformed empty array
#5444 [BUG] build failed on Databricks
#5357 [BUG] Spark 3.3 cache_test test_passing_gpuExpr_as_Expr[failures
#5429 [BUG] test_cache_expand_exec fails on Spark 3.3
#5312 [BUG] The coalesced AVRO file may contain different sync markers if the sync marker varies in the avro files being coalesced.
#5415 [BUG] Regular Expressions: matching the dot . doesn't fully exclude all unicode line terminator characters
#5413 [BUG] Databricks 321 build fails - not found: type OrcShims320untilAllBase
#5286 [BUG] assert failed test_struct_self_join and test_computation_in_grpby_columns
#5351 [BUG] Build fails for Spark 3.3 due to extra arguments to mapKeyNotExistError
#5260 [BUG] map_test failures on Spark 3.3.0
#5189 [BUG] Reading from iceberg table will fail.
#5130 [BUG] string_split does not respect spark.rapids.sql.regexp.enabled config
#5267 [BUG] markdown link check failed issue
#5295 [BUG] Build fails for Spark 3.3 due to extra arguments to mapKeyNotExistError
#5264 [BUG] Delete unused generic type.
#5275 [BUG] rlike cannot run on GPU because invalid or unsupported escape character ']' near index 14
#5278 [BUG] build 311cdh failed: unable to find valid certification path to requested target
#5211 [BUG] csv_test:test_basic_csv_read FAILED
#5244 [BUG] Spark 3.3 integration test failures logic_test.py::test_logical_with_side_effect
#5041 [BUG] Implement hasSideEffects for all expressions that have side-effects
#4980 [BUG] window_function_test FAILED on PASCAL GPU
#5240 [BUG] EGX integration test_collect_list_reductions failures
#5242 [BUG] Executor falls back to cudaMalloc if the pool can't be initialized
#5215 [BUG] Coalescing reading is not working for v2 parquet/orc datasource
#5104 [BUG] Unconditional warning in UDF Plugin "The compiler is disabled by default"
#5099 [BUG] Profiling tool should not sum gettingResultTime
#5182 [BUG] Spark 3.3 integration tests arithmetic_ops_test.py::test_div_overflow_exception_when_ansi failures
#5147 [BUG] object LZ4Compressor is not a member of package ai.rapids.cudf.nvcomp
#4695 [BUG] Segfault with UCX and ASYNC allocator
#5138 [BUG] xgboost job failed if we enable PCBS
#5135 [BUG] GpuRegExExtract is not align with RegExExtract
#5084 [BUG] GpuWriteTaskStatsTracker complains for all writes in local mode
#5123 [BUG] Compile error for Spark330 because of VectorizedColumnReader constructor added a new parameter.
#5133 [BUG] Compile error for Spark330 because of Spark changed the method signature: QueryExecutionErrors.mapKeyNotExistError
#4959 [BUG] Test case in OpcodeSuite failed on Spark 3.3.0

PRs

#5861 [Doc]Add Spark3.3 support in doc for 22.06 branch[skip ci]
#5851 Update 22.06 changelog to include new commits [skip ci]
#5848 Update spark330shim to use released lib
#5840 [DOC] Updated RapidsConf to reflect the default value of spark.rapids.sql.improvedFloatOps.enabled [skip ci]
#5816 Update 22.06.0 changelog to latest [skip ci]
#5795 Update FAQ to include local jar deployment via extraClassPath [skip ci]
#5802 Update spark-rapids-jni.version to release 22.06.0
#5798 Fall back to CPU for RoundCeil and RoundFloor expressions
#5791 Remove ORC encryption test from 321cdh
#5766 Fix the overflow of container type when casting floats to decimal
#5786 Fix rounds over decimal in Spark 330+
#5761 Throw an exception when attempting to read columnar encrypted Parquet files on the GPU
#5784 Update the error string for test_cast_neg_to_decimal_err on 330
#5781 Correct the exception string for test_mod_pmod_by_zero on Spark 3.3.0
#5764 Add test for encrypted ORC write
#5760 Enable avrotest in nightly tests [skip ci]
#5746 Init 22.06 changelog [skip ci]
#5716 Disable Avro support when spark-avro classes not loadable by Shim classloader
#5737 Remove the ORC encryption tests
#5753 [DOC] Update regexp compatibility for 22.06 [skip ci]
#5738 Update Spark2 explain code for 22.06
#5731 Throw SparkDateTimeException for InvalidInput while casting in ANSI mode
#5742 Spark-3.3 build fix - Move QueryExecutionErrors to sql package
#5641 [Doc]Update 22.06 documentation[skip ci]
#5701 Update docs for qualification tool to reflect recommendations and UI [skip ci]
#5283 Add documentation for MIG on Dataproc [skip ci]
#5728 Qualification tool: Add test for stage failures
#5681 Branch 22.06 nvcomp notice binary [skip ci]
#5713 Fix GpuCast losing the timezoneId during canonicalization
#5715 Update GPU ORC statistics write support
#5718 Update the error message for div_by_zero test
#5604 ORC encrypted write should fallback to CPU
#5674 Fix reading ORC/PARQUET over empty clipped schema
#5676 Fix ORC reading over different schemas
#5693 Temporarily allow 3.3.1 for 3.3.0 shims.
#5591 Enable regular expressions by default
#5664 Fix edge case where one side of regexp choice ends in duplicate string anchors
#5542 Support arrays of arrays and structs for concat on arrays
#5677 Qualification tool Enable UI by default
#5575 Regexp: Transpile \D, \W to Java's definitions
#5668 Add user as CI owner [skip ci]
#5627 Install locales and generate en_US.UTF-8
#5514 ANSI mode: allow casting between numeric type and timestamp type
#5600 Qualification tool UI cosmetics and CSV output changes
#5658 Fallback to CPU when && found in character class
#5644 Qualification tool: Enable UDF reporting in potential problems
#5645 Add support for octal digits in character classes
#5643 Fix missing GpuBatchScanExec metrics in SQL UI
#5441 Enable optional float confs and update docs mentioning them
#5532 Support hex digits in character classes and escaped characters in character class ranges
#5625 [DOC]update links for 2206 release[skip ci]
#5623 Handle duplicates in negated character classes
#5533 Support GpuMapConcat
#5614 Move HostConcatResultUtil out of unshimmed classes
#5612 Qualification tool: update SQL Df value used and look at jobs in SQL
#5526 Fix whitespace \s and \S tests
#5541 Regexp: Transpile \d, \w to Java's definitions
#5598 Qualification tool: Update RunningQualificationApp tests
#5601 Update test_div_overflow_exception_when_ansi test for Spark-3.3
#5588 Update Databricks build scripts
#5599 Move ShimServiceProvider file re-init/truncate
#5531 Filter rows with null keys when coalescing due to reaching cuDF row limits
#5550 Qualification tool hook up final output based on per exec analysis
#5540 Support RaiseError
#5505 Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys
#5583 Disable spark snapshot shims build for pre-merge
#5584 Enable automerge from branch-22.06 to 22.08 [skip ci]
#5581 nightly CI to install and deploy cuda11 classifier dist jar [skip ci]
#5579 Update test_cast_neg_to_decimal_err to work with Databricks 10.4 where exception is different
#5578 Fix unfiltered partitions being used to create GpuBatchScanExec RDD
#5560 Minor: Clean up the tests of concat_list
#5528 Enable build and test with JDK11
#5571 Update array_min and array_max to use new cudf operations
#5558 Fix target file for update from extra-resources in dist module
#5556 Move FsInput creation into AvroFileReader
#5483 Don't distinguish between types of ArithmeticException for Spark 3.2.x
#5539 Fix IntervalSuite cases failure
#5421 Support multi-threaded reading for avro
#5538 Add tests for string to timestamp functions in ANSI mode
#5546 Set hasSideEffects correctly for GpuCreateMap
#5529 Fix failing bool agg test in Spark 3.3
#5500 Fallback parquet reading with merged schema and native footer reader
#5534 MVN_OPT to last, as it is empty in most cases
#5523 Enable forcePositionEvolution for 321cdh
#5501 Build against specified spark-rapids-jni snapshot jar [skip ci]
#5489 Fallback to the CPU if Parquet encryption keys are set
#5527 Fix bug with character class immediately following a string anchor
#5506 Fix ClassCastException in regular expression transpiler
#5519 Address feedback in "string anchors regexp replace" PR
#5520 [DOC] Remove Spark from our naming of Tools [skip ci]
#5491 Enables $, \z, and \Z in REGEXP_REPLACE on the GPU
#5470 Qualification tool support UI code generation
#5353 Supports casting between ANSI interval types and integral types
#5487 Add limited support for captured vars and athrow
#5499 [DOC]update doc for emr6.6[skip ci]
#5485 Add cudaStreamSynchronize when a new device buffer is added to the spill framework
#5477 Add support for \h, \H, \v, \V, and \R character classes
#5490 Qualification tool: Update speedup factor for few operators
#5494 Fix databrick Shim to support Ansi mode when casting from string to date
#5498 Enable 330 unit tests for nightly
#5504 Fix printing of split information when dumping debug data
#5486 Fix regression in AnsiCastOpSuite with Spark 3.3.0
#5436 Support map_filter operator
#5471 Add implicit safeFree for RapidsBuffer
#5465 Fix query planning issue when Iceberg is used with DPP and AQE
#5459 Add test cases for casting string to date in ANSI mode
#5443 Add support for regular expressions containing octal digits greater than \200
#5468 Qualification tool: Add support for join, pandas, aggregate execs
#5473 Remove hasNan check over array_contains
#5434 Check schema compatibility when building parquet readers
#5442 Add support for regular expressions containing hexadecimal digits greater than 0x7f
#5466 [Doc] Change the picture of the query plan to text format. [skip ci]
#5310 Use C++ to parse and filter parquet footers.
#5454 QualificationTool. Add speedup information to AppSummaryInfo
#5455 Moved ShimCurrentBatchIterator so it's visible to db312 and db321
#5354 Plugin should throw same arithmetic exceptions as Spark part1
#5440 Qualification tool support for read and write execs and more, add mapping stage times to sql execs
#5431 [DOC] Update the ubuntu repo key [skip ci]
#5425 Handle readBatch changes for Spark 3.3.0
#5438 Add tests for all-null data for array_max
#5428 Make the sync marker uniform for the Avro coalescing reader
#5432 Test case insensitive reading for Parquet and CSV
#5433 [DOC] Removed mention of 30x from shims.md [skip ci]
#5424 Exclude all unicode line terminator characters from matching dot
#5426 Qualification tool: Parsing Execs to get the ExecInfo #2
#5427 Workaround to fix cuda repo key rotation in ubuntu images [skip ci]
#5419 Append my id to blossom-ci whitelist [skip ci]
#5422 xfail tests for spark 3.3.0 due to changes in readBatch
#5420 Qualification tool: Parsing Execs to get the ExecInfo #1
#5418 Add GpuEqualToNoNans and update GpuPivotFirst to use to handle PivotFirst with NaN support enabled on GPU
#5306 Support coalescing reading for avro
#5410 Update docs for removal of 311cdh
#5414 Add 320+-noncdh to Databricks to fix 321db build
#5349 Enable some repetitions for \A and \Z
#5346 ADD 321cdh shim to rapids and remove 311cdh shim
#5408 [DOC] Add rebase mode notes for databricks doc [skip ci]
#5348 Qualification tool: Skip GPU event logs
#5400 Restore test_computation_in_grpby_columns and test_struct_self_join
#5399 Update New Issue template to recommend a Discussion or Question [skip ci]
#5293 Support array_repeat
#5359 Qualification tool base plan parsing infrastructure
#5360 Revert "skip failing tests for Spark 3.3.0 (#5313)"
#5326 Update GCP doc and scripts [skip ci]
#5352 Fix spark330 build due to mapKeyNotExistError changed
#5317 Support arrays_zip
#5316 Support ANSI mode for ToUnixTimestamp, UnixTimestamp, GetTimestamp, DateAddInterval
#5319 Re-enable support for \Z in regular expressions on the GPU
#5315 Simplify conditional catalyst expressions generated by udf-compiler
#5301 Support existence join type for broadcast nested loop join
#5313 skip failing tests for Spark 3.3.0
#5311 Add information about the discussion board to the README and FAQ [skip ci]
#5308 Remove unused ColumnViewUtil
#5289 Re-enable dollar ($) line anchor in regular expressions in find mode
#5274 Perform explicit UnsafeRow projection in ColumnarToRow transition
#5297 GpuStringSplit now honors thespark.rapids.sql.regexp.enabled configuration option
#5307 Remove compatibility guide reference to issue #4060
#5298 Qualification tool: Operator mapping from plugin to CSV file
#5266 Update Outdated GCP getting started guide[skip ci]
#5300 Fix DIST_JAR PATH in coverage-report [skip ci]
#5290 Add documentation about reporting security issues [skip ci]
#5277 Support multiple datatypes in TypeSig.withPsNote()
#5296 Fix spark330 build due to removal of isElementAt parameter from mapKeyNotExistError
#5291 fix dead links in shims.md [skip ci]
#5276 fix markdown check issue[skip ci]
#5270 Include dependency of common jar in tools jar
#5265 Remove unused generic types
#5288 Temporarily xfail tests to restore premerge builds
#5287 Fix nightly scripts to deploy w/ classifier correctly [skip ci]
#5134 Support division on ANSI interval types
#5279 Add test case for ANSI pmod and ANSI Remainder
#5284 Enable support for escaping the right square bracket
#5280 [BUG] Fix incorrect plugin nightly deployment and release [skip ci]
#5249 Use a bundled spark-rapids-jni dependency instead of external cudf dependency
#5268 [BUG] When ASYNC is enabled GDS needs to handle cudaMalloced bounce buffers
#5230 Update csv float tests to reflect changes in precision in cuDF
#5001 Add fuzzing test for JSON reader
#5155 Support casting between day-time interval and string
#5247 Fix test failure caused by change in Spark 3.3 exception
#5254 Fix the integration test of collect_list_reduction
#5243 Throw again after logging that RMM could not intialize
#5105 Support multiplication on ANSI interval types
#5171 Fix the bug COALESCING reading does not work for v2 parquet/orc datasource
#5157 Update the log warning of UDF compiler
#5213 Support sample on ANSI interval types
#5218 XFAIL tests that are failing due to issue 5211
#5202 Profiling tool: Remove gettingResultTime from stages & jobs aggregation
#5201 Fix merge conflict from branch-22.04
#5195 Refactor Spark33XShims to avoid code duplication
#5185 Fix test failure with Spark 3.3 by looking for less specific error message
#4992 Support Collect-like Reduction Aggregations
#5193 Fix auto merge conflict 5192 [skip ci]
#5020 Support arithmetic operators on ANSI interval types
#5174 Fix auto merge conflict 5173 [skip ci]
#5168 Fix auto merge conflict 5166
#5151 Remove NvcompLZ4CompressionCodec single-buffer APIs
#5132 Add count support for all types
#5141 Upgrade to UCX 1.12.1 for 22.06
#5143 Fix merge conflict with branch-22.04
#5144 Adapt to storage-partitioned join additions in SPARK-37377
#5139 Make mvn-verify check name more descriptive [skip ci]
#5136 Fix GpuRegExExtract about inconsistent to Spark
#5107 Fix GpuFileFormatDataWriter failing to stat file after commit
#5124 Fix ShimVectorizedColumnReader construction for recent Spark 3.3.0 changes
#5047 Change Cast.toString as "cast" instead of "ansi_cast" under ANSI mode
#5089 Enable regular expressions containing \s and \S
#5087 Add support for regexp_replace with back-references
#5110 Appending my id (mattahrens) to the blossom-ci whitelist [skip ci]
#5090 Add nvtx ranges around pre, agg, and post steps in hash aggregate
#5092 Remove single-buffer compression codec APIs
#5093 Fix leak when GDS buffer store closes
#5067 Premerge databricks CI autotrigger [skip ci]
#5083 Remove EMRShimVersion
#5076 Unshim cache serializer and other 311+-all code
#5074 Make ASYNC the default allocator for 22.06
#5073 Add in nvtx ranges for parquet filterBlocks
#5077 Change Scala style continuation indentation to be 2 spaces to match guide [skip ci]
#5070 Fix merge from 22.04 to 22.06
#5046 Init 22.06.0-SNAPSHOT
#5059 Fix merge from 22.04 to 22.06
#5036 Unshim many expressions
#4993 PCBS and Parquet support ANSI year month interval type
#5031 Unshim many SparkShim interfaces
#5027 Fix merge of branch-22.04 to branch-22.06
#5022 Unshim many Pandas execs
#5013 Unshim GpuRowBasedScalaUDF
#5012 Unshim GpuOrcScan and GpuParquetScan
#5010 Unshim GpuSumDefaults
#5007 Remove schema utils, case class copying, file partition, and legacy statistical aggregate shims
#4999 Enable automerge from branch-22.04 to branch-22.06 [skip ci]

Release 22.04

Features

#4734 [FEA] Support approx_percentile in reduction context
#1922 [FEA] Support ORC forced positional evolution
#123 [FEA] add in support for dayfirst formats in the CSV parser
#4863 [FEA] Improve timestamp support in JSON and CSV readers
#4935 [FEA] Support reading Avro: primitive types
#4915 [FEA] Drop support for Spark 3.0.1, 3.0.2, 3.0.3, Databricks 7.3 ML LTS
#4815 [FEA] Support org.apache.spark.sql.catalyst.expressions.ArrayExists
#3245 [FEA] GpuGetMapValue should support all valid value data types and non-complex key types
#4914 [FEA] Support for Databricks 10.4 ML LTS
#4945 [FEA] Support filter and comparisons on ANSI day time interval type
#4004 [FEA] Add support for percent_rank
#1111 [FEA] support spark.sql.legacy.timeParserPolicy when parsing CSV files
#4849 [FEA] Support parsing dates in JSON reader
#4789 [FEA] Add Spark 3.1.4 shim
#4646 [FEA] Make JSON parsing of NaN and Infinity values fully compatible with Spark
#4824 [FEA] Support reading decimals from JSON and CSV
#4814 [FEA] Support element_at with non-literal index
#4816 [FEA] Support org.apache.spark.sql.catalyst.expressions.GetArrayStructFields
#3542 [FEA] Support str_to_map function
#4721 [FEA] Support regular expression delimiters for str_to_map
#4791 Update Spark 3.1.3 to be released
#4712 [FEA] Allow to partition on Decimal 128 when running on the GPU
#4762 [FEA] Improve support for reading JSON integer types
#4696 [FEA] Support casting map to string
#1572 [FEA] Add in decimal support for pmod, remainder and divide
#4763 [FEA] Improve support for reading JSON boolean types
#4003 [FEA] Add regular expression support to GPU implementation of StringSplit
#4626 [FEA] cannot run on GPU because unsupported data types in 'partitionSpec'
#33 [FEA] hypot SQL function
#4515 [FEA] Set RMM async allocator as default

Performance

#3026 [FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data
#4895 Add support for structs in GpuScalarSubquery
#4393 [BUG] Columnar to Columnar transfers are very slow
#589 [FEA] Support ExistenceJoin
#4784 [FEA] Improve copying decimal data from CPU columnar data
#4685 [FEA] Avoid regexp cost in string_split for escaped characters
#4777 Remove input upcast in GpuExtractChunk32
#4722 Optimize DECIMAL128 average aggregations
#4645 [FEA] Investigate ASYNC allocator performance with additional queries
#4539 [FEA] semaphore optimization in shuffled hash join
#2441 [FEA] Use AST for filter in join APIs

Bugs Fixed

#5233 [BUG] rapids-tools v22.04.0 release jar reports maven dependency issue : rapids-4-spark-common_2.12:jar:22.04.0 NOT FOUND
#5183 [BUG] UCX EGX integration test array_test.py::test_array_exists failures
#5180 [BUG] create_map failed with java.lang.IllegalStateException: This is not supported yet
#5181 [BUG] Dataproc tests failing when trying to detect for accelerated row conversions
#5154 [BUG] build failed in databricks 10.4 runtime (updated recently)
#5159 [BUG] Approx percentile query fails with UnsupportedOperationException
#5164 [BUG] Databricks 9.1ML failed with "java.lang.NoSuchMethodError: org.apache.spark.sql.execution.metric.SQLMetrics$.createSizeMetric"
#5125 [BUG] GpuCast.hasSideEffects does not check if child expression has side effects
#5091 [BUG] Profiling tool fails process custom task accumulators of type CollectionAccumulator
#5050 [BUG] Release build of v22.04.0 FAILED on "Execution attach-javadoc failed: NullPointerException" with maven option '-P source-javadoc'
#5035 [BUG] Different CSV parsing behavior between 22.04 and 22.02
#5065 [BUG] spark330+ build error due to SPARK-37463
#5019 [BUG] udf compiler failed to translate UDF in spark-shell
#5048 [BUG] OOM for q18 of TPC-DS benchmark testing on Spark2a
#5038 [BUG] When spark.rapids.sql.regexp.enabled is on in 22.04 snapshot jars, Reading a Delta table in Databricks may cause driver error
#5023 [BUG] When+sequence could trigger "Illegal sequence boundaries" error
#5021 [BUG] test_cache_reverse_order failed
#5003 [BUG] Cloudera 3.1.1 tests fail due to ClouderaShimVersion
#4960 [BUG] Spark 3.3 IT cache_test:test_passing_gpuExpr_as_Expr failure
#4913 [BUG] Fall back to the CPU if we see a scale on Ceil or Floor
#4806 [BUG] When running xgboost training, if PCBS is enabled, it fails with java.lang.AssertionError
#4542 [BUG] test_write_round_trip failed Maximum pool size exceeded
#4911 [BUG][Audit] [SPARK-38314] - Fail to read parquet files after writing the hidden file metadata
#4936 [BUG] databricks nightly window_function_test failures
#4931 [BUG] Spark 3.3 IT test cache_test.py::test_passing_gpuExpr_as_Expr fails with IllegalArgumentException
#4710 [BUG] cudaErrorIllegalAddress for q95 (3TB) on GCP with ASYNC allocator
#4918 [BUG] databricks nightly build failed
#4826 [BUG] cache_test failures when testing with 128-bit decimal
#4855 [BUG] Shim tests in sql-plugin module are not running
#4487 [BUG] regexp_find hangs with some patterns
#4486 [BUG] Regular expressions with hex digits not working as expected
#4879 [BUG] [SPARK-38237][SQL] ClusteredDistribution clustering keys break build with wrong arguments
#4883 [BUG] row-based_udf_test.py::test_hive_empty_* fail nightly tests
#4876 [BUG] Nightly build failed on Databricks with "pip: No such file or directory"
#4739 [BUG] Plugin will crash with query > 100 columns on pascal GPU
#4840 [BUG] test_dpp_via_aggregate_subquery_aqe_off failed with table already exists
#4841 [BUG] test_compress_write_round_trip failed on Spark 3.3
#4668 [FEA][Audit] - [SPARK-37750][SQL] ANSI mode: optionally return null result if element not exists in array/map
#3971 [BUG] udf-examples dependencies are incorrect
#4022 [BUG] Ensure shims.v2.ParquetCachedBatchSerializer and similar classes are at most package-private
#4526 [BUG] Short circuit AND/OR in ANSI mode
#4787 [BUG] Dataproc notebook IT test failure - NoSuchMethodError: org.apache.spark.network.util.ByteUnit.toBytes
#4704 [BUG] Update the premerge and nightly tests after moving the UDF example to external repository
#4795 [BUG] Read ORC does not ignoreCorruptFiles
#4802 [BUG] GPU CSV read does not honor ignoreCorruptFiles or ignoreMissingFiles
#4803 [BUG] GPU JSON read does not honor ignoreCorruptFiles or ignoreMissingFiles
#1986 [BUG] CSV reading null inconsistent between spark.rapids.sql.format.csv.enabled=true&false
#126 [BUG] CSV parsing large number values overflow
#4759 [BUG] Profiling tool can miss datasources when they are GPU reads
#4798 [BUG] Integration test builds failing with worker_id not found
#4727 [BUG] Read Parquet does not ignoreCorruptFiles
#4744 [BUG] test_groupby_std_variance_partial_replace_fallback failed
#4761 [BUG] test_simple_partitioned_read failed on Spark 3.3
#2071 [BUG] parsing invalid boolean CSV values return true instead of null
#4749 [BUG] test_write_empty_parquet_round_trip failed
#4730 [BUG] python UDF tests are leaking
#4290 [BUG] Investigate q32 and q67 for decimals potential regression
#4409 [BUG] Possible race condition in regular expression support for octal digits
#4728 [BUG] test_mixed_compress_read orc_test.py failures
#4736 [BUG] buildall --profile=321 fails on missing spark301 rapids-4-spark-sql dependency
#4702 [BUG] cache_test.py failed w/ cache.serializer in spark 3.3.0
#4031 [BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue
#4664 [BUG] MortgageAdaptiveSparkSuite failed with duplicate buffer exception
#4564 [BUG] map_test ansi failed in spark330
#119 [BUG] LIKE does not work if null chars are in the string
#124 [BUG] CSV/JSON Parsing some float values results in overflow
#4045 [BUG] q93 failed in this week's NDS runs
#4488 [BUG] isCastingStringToNegDecimalScaleSupported seems set wrong for some Spark versions

PRs

#5251 Update 22.04 changelog to latest [skip ci]
#5232 Fix issue in GpuArrayExists where a parent view outlived the child
#5239 Fix tools depending on the common jar
#5205 Update 22.04 changelog to latest [skip ci]
#5190 Fix column->row conversion GPU check:
#5184 Fix CPU fallback for Map lookup
#5191 Update version-def to use released cudfjni 22.04.0 [skip ci]
#5167 Update cudfjni version to released 22.04.0
#5169 Terminate test earlier if pytest ENV issue [skip ci]
#5160 Fix approximate percentile reduction UnsupportedOperationException
#5165 Update Databricks 10.4 for changes to the QueryStageExec and ClusteredDistribution
#4997 Update docs for the 22.04 release[skip ci]
#5146 Support env var INTEGRATION_TEST_VERSION to override shim version
#5103 Init 22.04 changelog [skip ci]
#5122 Disable GPU accelerated row-column transpose for Pascal GPUs:
#5127 GpuCast.hasSideEffects now checks to see if the child expression has side-effects
#5118 On task failure catch some CUDA exceptions and kill executor
#5069 Update for the public release [skip ci]
#5097 Implement hasSideEffects for GpuGetArrayItem, GpuElementAt, GpuGetMapValue, GpuUnaryMinus, and GpuAbs
#5079 Disable spark snapshot shims pre-merge build in 22.04
#5094 Fix profiling tool reading collectionAccumulator
#5078 Disable JSON and CSV floating-point reads by default
#4961 Support approx_percentile in reduction context
#5062 Update Spark 2.x explain API with changes in 22.04
#5066 Add getOrcSchemaString for OrcShims
#5030 Fix regression from 21.12 where udfs defined in repl no longer worked
#5051 Revert "Replace ParquetFileReader.readFooter with open() and getFooter "
#5052 Work around incompatibility between Databricks Delta loads and GpuRegExpExtract
#4972 Add support for ORC forced positional evolution
#5042 Implement hasSideEffects for GpuSequence
#5040 Fix missing imports for 321db shim
#5033 Removed limit from the test
#4938 Improve compatibility when reading timestamps from JSON and CSV sources
#5026 Update RoCE doc URL [skip ci]
#4976 Replace ParquetFileReader.readFooter with open() and getFooter
#4989 Use conf.useCompression config to decide if we should be compressing the cache
#4956 Add avro reader support
#5009 Remove references of shims folder in docs [skip ci]
#5004 Add ClouderaShimVersion to unshimmed files
#4971 Fall back to the CPU for non-zero scale on Ceil or Floor functions
#4996 Fix collect_set on struct type
#4998 Added the id back for struct children to make them unique
#4995 Include 321db shim in distribution build [skip ci]
#4981 Update doc for CSV reading interval
#4973 Implement support for ArrayExists expression
#4988 Remove support for Spark 3.0.x
#4955 Add UDT support to ParquetCachedBatchSerializer (CPU)
#4994 Add databricks 10.4 build in pre-merge
#4990 Remove 30X permerge support for version 22.04 and above [skip ci]
#4958 Add independent mvn verify check [skip ci]
#4933 Set OrcConf.INCLUDE_COLUMNS for ORC reading
#4944 Support for non-string key-types for GetMapValue and element_at()
#4974 Add shim for Databricks 10.4
#4907 Add markdown check action
#4977 Add missing 314 to buildall script
#4927 Support reading ANSI day time interval type from CSV source
#4965 Documentation: add example python api call for ExplainPlan.explainPotentialGpuPlan [skip ci]
#4957 Document agg pushdown on ORC file limitation [skip ci]
#4946 Support predictors on ANSI day time interval type
#4952 Have a fixed GPU memory size for integration tests
#4954 Fix of failing to read parquet files after writing the hidden file metadata in
#4953 Add Decimal 128 as a supported type in partition by for databricks running window
#4941 Use new list reduction API to improve performance
#4926 Support DayTimeIntervalType in ParquetCachedBatchSerializer
#4947 Fallback to ARENA if ASYNC configured and driver < 11.5.0
#4934 Replace MetadataAttribute with FileSourceMetadataAttribute to follow the update in Spark for 3.3.0+
#4942 Fix window rank integration tests on
#4928 Disable regular expressions on GPU by default
#4923 Support GpuScalarSubquery on nested types
#4924 Implement percent_rank() on GPU
#4853 Improve date support in JSON and CSV readers
#4930 Add in support for sorting arrays with structs in sort_array
#4861 Add Apache Spark 3.1.4-SNAPSHOT Shims
#4925 Remove unused Spark322PlusShims
#4921 Add DatabricksShimVersion to unshimmed class list
#4917 Default some configs to protect against cluster settings in integration tests
#4922 Add support for decimal 128 for db and spark 320+
#4919 Case-insensitive PR title check [skip ci]
#4796 Implement ExistenceJoin Iterator using an auxiliary left semijoin
#4857 Transition to v2 shims [Databricks]
#4899 Fixed Decimal 128 bug in ParquetCachedBatchSerializer
#4810 Support ANSI intervals to/from Parquet
#4909 Make ARENA the default allocator for 22.04
#4856 Enable shim tests in sql-plugin module
#4880 Bump hadoop-client dependency to 3.1.4
#4825 Initial support for reading decimal types from JSON and CSV
#4859 Fallback to CPU when Spark pushes down Aggregates (Min/Max/Count) for ORC
#4872 Speed up copying decimal column from parquet buffer to GPU buffer
#4904 Relocate Hive UDF Classes
#4871 Minor changes to print revision differences when building shims
#4882 Disable write/read Parquet when Parquet field IDs are used
#4858 Support non-literal index for GpuElementAt and GpuGetArrayItem
#4875 Support running GetArrayStructFields on GPU
#4885 Enable fuzz testing for Regular Expression repetitions and move remaining edge cases to CPU
#4869 Support for hexadecimal digits in regular expressions on the GPU
#4854 Avoid regexp_cost with stringSplit on the GPU using transpilation
#4888 Clean up leak detection code
#4901 fix a broken link in CONTRIBUTING.md[skip ci]
#4891 update getting started doc because aws-emr 6.5.0 released[skip ci]
#4881 Fix compilation error caused by ClusteredDistribution parameters
#4890 Integration-test tests jar for hive UDF tests
#4878 Set conda/mamba default to Python version to 3.8 [skip ci]
#4874 Fix spark-tests syntax issue [skip ci]
#4850 Also check cuda runtime version when using the ASYNC allocator
#4851 Add worker ID to temporary table names in tests
#4847 Fix test_compress_write_round_trip failure on Spark 3.3
#4848 Profile tool: fix printing of task failed reason
#4636 Support str_to_map
#4835 Trim parquet_write_test to reduce integration test runtime
#4819 Throw exception if casting from double to datetime
#4838 Trim cache tests to improve integration test time
#4839 Optionally return null if element not exists map/array
#4822 Push decimal workarounds to cuDF
#4619 Move the udf-examples module to the external repository spark-rapids-examples
#4844 Update spark313 dep to released one
#4827 Make InternalExclusiveModeGpuDiscoveryPlugin and ExplainPlanImpl as protected class.
#4836 Support WindowExec partitioning by Decimal 128 on the GPU
#4760 Short circuit AND/OR in ANSI mode
#4829 Make bloopInstall version configurable in buildall
#4823 Reduce redundancy of decimal testing
#4715 Patterns such (3?)+ should now fall back to CPU
#4809 Add ignoreCorruptFiles for ORC readers
#4790 Improve JSON and CSV parsing of integer values
#4812 Default integration test configs to allow negative decimal scale
#4805 Avoid output cast by using unsigned type output for GpuExtractChunk32
#4804 Profiling tool can miss datasources when they are GPU reads
#4797 Do not check for metadata during schema comparison
#4785 Support casting Map to String
#4794 Decimal-128 support for mod and pmod
#4799 Fix failure to generate worker_id when xdist is not present
#4742 Add ignoreCorruptFiles feature for Parquet reader
#4792 Ensure GpuM2 merge aggregation does not produce a null mean or m2
#4770 Improve columnarCopy for HostColumnarToGpu
#4776 Improve aggregation performance of average on DECIMAL128 columns
#4786 Add shims to compare ORC TypeDescription
#4780 Improve JSON and CSV support for boolean values
#4778 Decrease chance of random collisions in test temporary paths
#4782 Check in host leak detection code
#4781 Add Spark properties table to profiling tool output
#4714 Add regular expression support to string_split
#4754 Close SpillableBatch to avoid leaks
#4758 Fix merge conflict with branch-22.02 [skip ci]
#4694 Add clarifications and details to integration-tests README [skip ci]
#4740 Enable regular expressions on GPU by default
#4735 Re-enables partial regex support for octal digits on the GPU
#4737 Check for a null compression codec when creating ORC OutStream
#4738 Change resume-from to aggregator in buildall [skip ci]
#4698 Add tests for few json options
#4731 Trim join tests to improve runtime of tests
#4732 Fix failing serializer tests on Spark 3.3.0
#4709 Update centos 8 dockerfile to handle EOL issue [skip ci]
#4724 Debug dump to Parquet support for DECIMAL128 columns
#4688 Optimize DECIMAL128 sum aggregations
#4692 Add FAQ entry to discuss executor task concurrency configuration [skip ci]
#4588 Optimize semaphore acquisition in GpuShuffledHashJoinExec
#4697 Add preliminary test and test framework changes for ExistanceJoin
#4716 GpuStringSplit should return an array on not-null elements
#4611 Support BitLength and OctetLength
#4408 Use the ORC version that corresponds to the Spark version
#4686 Fall back to CPU for queries referencing hidden metadata columns
#4669 Prevent deadlock between RapidsBufferStore and RapidsBufferBase on close
#4707 Fix auto merge conflict 4705 [skip ci]
#4690 Fix map_test ANSI failure in Spark 3.3.0
#4681 Reimplement check for non-regexp strings using RegexParser
#4683 Fix documentation link, clarify documentation [skip ci]
#4677 Make Collect, first and last as deterministic aggregate functions for Spark-3.3
#4682 Enable test for LIKE with embedded null character
#4673 Allow GpuWindowExec to partition on structs
#4637 Improve support for reading CSV and JSON floating-point values
#4629 Remove shims module
#4648 Append new authorized user to blossom-ci safelist
#4623 Fallback to CPU when aggregate push down used for parquet
#4606 Set default RMM pool to ASYNC for cuda 11.2+
#4531 Use libcudf mixed joins for conditional hash semi and anti joins
#4624 Enable integration test results report on Jenkins [skip ci]
#4597 Update plugin version to 22.04.0-SNAPSHOT
#4592 Adds SQL function HYPOT using the GPU
#4504 Implement AST-based regular expression fuzz tests
#4560 Make shims.v2.ParquetCachedBatchSerializer as protected

Release 22.02

Features

#4305 [FEA] write nvidia tool wrappers to allow old YARN versions to work with MIG
#4410 [FEA] ReplicateRows - Support ReplicateRows for decimal 128 type
#4360 [FEA] Add explain api for Spark 2.X
#3541 [FEA] Support max on single-level struct in aggregation context
#4238 [FEA] Add a Spark 3.X Explain only mode to the plugin
#3952 [Audit] [FEA][SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1
#4412 [FEA] Improve support for \A, \Z, and \z in regular expressions
#3979 [FEA] Improvements for CPU(Row) based UDF
#4467 [FEA] Add support for regular expression with repeated digits (\d+, \d*, \d?)
#4439 [FEA] Enable GPU broadcast exchange reuse for DPP when AQE enabled
#3512 [FEA] Support org.apache.spark.sql.catalyst.expressions.Sequence
#3475 [FEA] Spark 3.2.0 reads Parquet unsigned int64(UINT64) as Decimal(20,0) but CUDF does not support it
#4091 [FEA] regexp_replace: Improve support for ^ and $
#4104 [FEA] Support org.apache.spark.sql.catalyst.expressions.ReplicateRows
#4027 [FEA] Support SubqueryBroadcast on GPU to enable exchange reuse during DPP
#4284 [FEA] Support idx = 0 in GpuRegExpExtract
#4002 [FEA] Implement regexp_extract on GPU
#3221 [FEA] Support GpuFirst and GpuLast on nested types under reduction aggregations
#3944 [FEA] Full support for sum with overflow on Decimal 128
#4028 [FEA] support GpuCast from non-nested ArrayType to StringType
#3250 [FEA] Make CreateMap duplicate key handling compatible with Spark and enable CreateMap by default
#4170 [FEA] Make regular expression behavior with $ and \r consistent with CPU
#4001 [FEA] Add regexp support to regexp_replace
#3962 [FEA] Support null characters in regular expressions in RLIKE
#3797 [FEA] Make RLike support consistent with Apache Spark

Performance

#4392 [FEA] could the parquet scan code avoid acquiring the semaphore for an empty batch?
#679 [FEA] move some deserialization code out of the scope of the gpu-semaphore to increase cpu concurrent
#4350 [FEA] Optimize the all-true and all-false cases in GPU If and CaseWhen
#4309 [FEA] Leverage cudf conditional nested loop join to implement semi/anti hash join with condition
#4395 [FEA] acquire the semaphore after concatToHost in GpuShuffleCoalesceIterator
#4134 [FEA] Allow EliminateJoinToEmptyRelation in GpuBroadcastExchangeExec
#4189 [FEA] understand why between is so expensive

Bugs Fixed

#4725 [DOC] Broken links in guide doc
#4675 [BUG] Jenkins integration build timed out at 10 hours
#4665 [BUG] Spark321Shims.getParquetFilters failed with NoSuchMethodError
#4635 [BUG] nvidia-smi wrapper script ignores ENABLE_NON_MIG_GPUS=1 on a heterogeneous multi-GPU machine
#4500 [BUG] Build failures against Spark 3.2.1 rc1 and make 3.2.1 non snapshot
#4631 [BUG] Release build with mvn option -P source-javadoc FAILED
#4625 [BUG] NDS query 5 fails with AdaptiveSparkPlanExec assertion
#4632 [BUG] Build failing for Spark 3.3.0 due to deprecated method warnings
#4599 [BUG] test_group_apply_udf and test_group_apply_udf_more_types hangs on Databricks 9.1
#4600 [BUG] crash if we have a decimal128 in a struct in an array
#4581 [BUG] Build error "GpuOverrides.scala:924: wrong number of arguments" on DB9.1.x spark-3.1.2
#4593 [BUG] dup GpuHashJoin.diff case-folding issue
#4559 [BUG] regexp_replace with replacement string containing \ can produce incorrect results
#4503 [BUG] regexp_replace with back references produces incorrect results on GPU
#4567 [BUG] Profile tool hangs in compare mode
#4315 [BUG] test_hash_reduction_decimal_overflow_sum[30] failed OOM in integration tests
#4551 [BUG] protobuf-java version changed to 3.x
#4499 [BUG]GpuSequence blows up when nulls exist in any of the inputs (start, stop, step)
#4454 [BUG] Shade warnings when building the tools artifact
#4541 [BUG] Column vector leak in conditionals_test.py
#4514 [BUG] test_hash_reduction_pivot_without_nans failed
#4521 [BUG] Inconsistencies in handling of newline characters and string and line anchors
#4548 [BUG] ai.rapids.cudf.CudaException: an illegal instruction was encountered in databricks 9.1
#4475 [BUG] \D and \W match newline in Spark but not in cuDF
#1866 [BUG] GpuFileFormatWriter does not close the data writer
#4524 [BUG] RegExp transpiler fails to detect some choice expressions that cuDF cannot compile
#3226 [BUG]OOM happened when do cube operations
#2504 [BUG] OOM when running NDS queries with UCX and GDS
#4273 [BUG] Rounding past the size that can be stored in a type produces incorrect results
#4060 [BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently
#4039 [BUG] Spark 3.3.0 IT Array test failures
#3849 [BUG] In ANSI mode we can fail in cases Spark would not due to conditionals
#4445 [BUG] mvn clean prints an error message on a clean dir
#4421 [BUG] the driver is trying to load CUDA with latest 22.02
#4455 [BUG] join_test.py::test_struct_self_join[IGNORE_ORDER({'local': True})] failed in spark330
#4442 [BUG] mvn build FAILED with option -P noSnapshotsWithDatabricks
#4281 [BUG] q9 regression between 21.10 and 21.12
#4280 [BUG] q88 regression between 21.10 and 21.12
#4422 [BUG] Host column vectors are being leaked during tests
#4446 [BUG] GpuCast crashes when casting from Array with unsupportable child type
#4432 [BUG] nightly build 3.3.0 failed: HashClusteredDistribution is not a member of org.apache.spark.sql.catalyst.plans.physical
#4443 [BUG] SPARK-37705 breaks parquet filters from Spark 3.3.0 and Spark 3.2.2 onwards
#4316 [BUG] Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly intermittently
#4378 [BUG] udf_test udf_cudf_test failed require_minimum_pandas_version check in spark 320+
#4423 [BUG] Build is failing due to FileScanRDD changes in Spark 3.3.0-SNAPSHOT
#4401 [BUG]array_test.py::test_array_contains failures
#4403 [BUG] NDS query 72 logs codegen fallback exception and produces incorrect results
#4386 [BUG] conditionals_test.py FAILED with side_effects_cast[Integer/Long] on Databricks 9.1 Runtime
#3934 [BUG] Dependencies of published integration tests jar are missing
#4341 [BUG] GpuCast.scala:nnn warning: discarding unmoored doc comment
#4356 [BUG] nightly spark303 deploy pulling spark301 aggregator
#4347 [BUG] Dist jar pom lists aggregator jar as dependency
#4176 [BUG] ParseDateTimeSuite UT failed
#4292 [BUG] no meaningful message is surfaced to maven when binary-dedupe fails
#4351 [BUG] Tests FAILED On SPARK-3.2.0, com.nvidia.spark.rapids.SerializedTableColumn cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
#4346 [BUG] q73 decimal was twice as slow in weekly results
#4334 [BUG] GpuColumnarToRowExec will always be tagged False for exportColumnarRdd after Spark311
#4339 The parameter dataType is not necessary in resolveColumnVector method.
#4275 [BUG] Row-based Hive UDF will fail if arguments contain a foldable expression.
#4229 [BUG] regexp_replace [^a] has different behavior between CPU and GPU for multiline strings
#4294 [BUG] parquet_write_test.py::test_ts_write_fails_datetime_exception failed in spark 3.1.1 and 3.1.2
#4205 [BUG] Get different results when casting from timestamp to string
#4277 [BUG] cudf_udf nightly cudf import rmm failed
#4246 [BUG] Regression in CastOpSuite due to cuDF change in parsing NaN
#4243 [BUG] test_regexp_replace_null_pattern_fallback[ALLOW_NON_GPU(ProjectExec,RegExpReplace)] failed in databricks
#4244 [BUG] Cast from string to float using hand-picked values failed
#4227 [BUG] RAPIDS Shuffle Manager doesn't fallback given encryption settings
#3374 [BUG] minor deprecation warnings in a 3.2 shim build
#3613 [BUG] release312db profile pulls in 311until320-apache
#4213 [BUG] unused method with a misleading outdated comment in ShimLoader
#3609 [BUG] GpuShuffleExchangeExec in v2 shims has inconsistent packaging
#4127 [BUG] CUDF 22.02 nightly test failure

PRs

#4773 Update 22.02 changelog to latest [skip ci]
#4771 revert cudf api links from legacy to stable[skip ci]
#4767 Update 22.02 changelog to latest [skip ci]
#4750 Updated doc for decimal support
#4757 Update qualification tool to remove DECIMAL 128 as potential problem
#4755 Fix databricks doc for limitations.[skip ci]
#4751 Fix broken hyperlinks in documentation [skip ci]
#4706 Update 22.02 changelog to latest [skip ci]
#4700 Update cudfjni version to released 22.02.0
#4701 Decrease nighlty tests upper limitation to 7 [skip ci]
#4639 Update changelog for 22.02 and archive info of some older releases [skip ci]
#4572 Add download page for 22.02 [skip ci]
#4672 Revert "Disable 311cdh build due to missing dependency (#4659)"
#4662 Update the deploy script [skip ci]
#4657 Upmerge spark2 directory to the latest 22.02 changes
#4659 Disable 311cdh build by default because of a missing dependency
#4508 Fix Spark 3.2.1 build failures and make it non-snapshot
#4652 Remove non-deterministic test order in nightly [skip ci]
#4643 Add profile release301 when mvn help:evaluate
#4630 Fix the incomplete capture of SubqueryBroadcast
#4633 Suppress newTaskTempFile method warnings for Spark 3.3.0 build
#4618 [DB31x] Pick the correct Python runner for flatmap-group Pandas UDF
#4622 Fallback to CPU when encoding is not supported for JSON reader
#4470 Add in HashPartitioning support for decimal 128
#4535 Revert "Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075 (#4471)"
#4583 Avoid unapply on PromotePrecision
#4573 Correct version from 21.12 to 22.02[skip ci]
#4575 Correct and update links in UDF doc[skip ci]
#4501 Switch and/or to use new cudf binops to improve performance
#4594 Resolve case-folding issue [skip ci]
#4585 Spark2 module upmerge, deploy script, and updates for Jenkins
#4589 Increase premerge databricks IDLE_TIMEOUT to 4 hours [skip ci]
#4485 Add json reader support
#4556 regexp_replace with back-references should fall back to CPU
#4569 Fix infinite loop with Profiling tool compare mode and app with no sql ids
#4529 Add support for Spark 2.x Explain Api
#4577 Revert "Fix CVE-2021-22569 (#4545)"
#4520 GpuSequence refactor
#4570 A few quick fixes to try to reduce max memory usage in the tests
#4477 Use libcudf mixed joins for conditional hash joins
#4566 remove scala-library from combined tools jar
#4552 Fix resource leak in GpuCaseWhen
#4553 Reenable test_hash_reduction_pivot_without_nans
#4530 Fix correctness issues in regexp and add \r and \n to fuzz tests
#4549 Fix typos in integration tests README [skip ci]
#4545 Fix CVE-2021-22569
#4543 Enable auto-merge from branch-22.02 to branch-22.04 [skip ci]
#4540 Remove user kuhushukla
#4434 Support max on single-level struct in aggregation context
#4534 Temporarily disable integration test - test_hash_reduction_pivot_without_nans
#4322 Add an explain only mode to the plugin
#4497 Make better use of pinned memory pool
#4512 remove hadoop version requirement[skip ci]
#4527 Fall back to CPU for regular expressions containing \D or \W
#4525 Properly close data writer in GpuFileFormatWriter
#4502 Removed the redundant test for element_at and fixed the failing one
#4523 Add more integration tests for decimal 128
#3762 Call the right method to convert table from row major <=> col major
#4482 Simplified the construction of zero scalar in GpuUnaryMinus
#4510 Update copyright in NOTICE [skip ci]
#4484 Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU.
#4492 Fall back to CPU for regular expressions containing hex digits
#4495 Enable approx_percentile by default
#4420 Fix up incorrect results of rounding past the max digits of data type
#4483 Update test case of reading nested unsigned parquet file
#4490 Remove warning about RMM default allocator
#4461 [Audit] Add bucketed scan info in query plan of data source v1
#4489 Add arrays of decimal128 to join tests
#4476 Don't acquire the semaphore for empty input while scanning
#4424 Improve support for regular expression string anchors \A, \Z, and \z
#4491 Skip the test for spark versions 3.1.1, 3.1.2 and 3.2.0 only
#4459 Use merge sort for struct types in non-key columns
#4494 Append new authorized user to blossom-ci whitelist [skip ci]
#4400 Enable approx percentile tests
#4471 Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075
#4462 Rename DECIMAL_128_FULL and rework usage of TypeSig.gpuNumeric
#4479 Change signoff check image to slim-buster [skip ci]
#4464 Throw SparkArrayIndexOutOfBoundsException for Spark 3.3.0+
#4469 Support repetition of \d and \D in regexp functions
#4472 Modify docs for 22.02 to address issue-4319[skip ci]
#4440 Enable GPU broadcast exchange reuse for DPP when AQE enabled
#4376 Add sequence support
#4460 Abstract the text based PartitionReader
#4383 Fix correctness issue with CASE WHEN with expressions that have side-effects
#4465 Refactor for shims 320+
#4463 Avoid replacing a hash join if build side is unsupported by the join type
#4456 Fix build issues: 1 clean non-exists target dirs; 2 remove duplicated plugin
#4416 Unshim join execs
#4172 Support String to Decimal 128
#4458 Exclude some metadata operators when checking GPU replacement
#4451 Some metrics improvements and timeline reporting
#4435 Disable add profile src execution by default to make the build log clean
#4436 Print error log to stderr output
#4155 Add partial support for line begin and end anchors in regexp_replace
#4428 Exhaustively iterate ColumnarToRow iterator to avoid leaks
#4430 update pca example link in ml-integration.md[skip ci]
#4452 Limit parallelism of nightly tests [skip ci]
#4449 Add recursive type checking and fallback tests for casting array with unsupported element types to string
#4437 Change logInfo to logWarning
#4447 Fix 330 build error and add 322 shims layer
#4417 Fix an Intellij debug issue
#4431 Add DateType support for AST expressions
#4433 Import the right pandas from conda [skip ci]
#4419 Import the right pandas from conda
#4427 Update getFileScanRDD shim for recent changes in Spark 3.3.0
#4397 Ignore cufile.log
#4388 Add support for ReplicateRows
#4399 Update docs for Profiling and Qualification tool to change wording
#4407 Fix GpuSubqueryBroadcast on multi-fields relation
#4396 GpuShuffleCoalesceIterator acquire semaphore after host concat
#4361 Accommodate altered semantics of cudf::lists::contains()
#4394 Use correct column name in GpuIf test
#4385 Add missing GpuSubqueryBroadcast replacement rule for spark31x
#4387 Fix auto merge conflict 4384[skip ci]
#4374 Fix the IT module depends on the tests module
#4365 Not publishing integration_tests jar to Maven Central [skip ci]
#4358 Update GpuIf to support expressions with side effects
#4382 Remove unused scallop dependency from integration_tests
#4364 Replace Scala document with Scala comment for inner functions
#4373 Add pytest tags for nightly test parallel run [skip ci]
#4150 Support GpuSubqueryBroadcast for DPP
#4372 Move casting to string tests from array_test.py and struct_test.py to cast_test.py
#4371 Fix typo in skipTestsFor330 calculation [skip ci]
#4355 Dedicated deploy-file with reduced pom in nightly build [skip ci]
#4352 Revert "Ignore failing string to timestamp tests temporarily (#4197)"
#4359 Audit - SPARK-37268 - Remove unused variable in GpuFileScanRDD [Databricks]
#4327 Print meaningful message when calling scripts in maven
#4354 Fix regression in AQE optimizations
#4343 Fix issue with binding to hash agg columns with computation
#4285 Add support for regexp_extract on the GPU
#4349 Fix PYTHONPATH in pre-merge
#4269 The option for the nightly script not deploying jars [skip ci]
#4335 Fix the issue of exporting Column RDD
#4336 Split expensive pytest files in cases level [skip ci]
#4328 Change the explanation of why the operator will not work on GPU
#4338 Use scala Int.box instead of Integer constructors
#4340 Remove the unnecessary parameter dataType in resolveColumnVector method
#4256 Allow returning an EmptyHashedRelation when a broadcast result is empty
#4333 Add tests about writing empty table to ORC/PAQUET
#4337 Support GpuFirst and GpuLast on nested types under reduction aggregations
#4331 Fix parquet options builder calls
#4310 Fix typo in shim class name
#4326 Fix 4315 decrease concurrentGpuTasks to avoid sum test OOM
#4266 Check revisions for all shim jars while build all
#4282 Use data type to create an inspector for a foldable GPU expression.
#3144 Optimize AQE with Spark 3.2+ to avoid redundant transitions
#4317 [BUG] Update nightly test script to dynamically set mem_fraction [skip ci]
#4206 Porting GpuRowToColumnar converters to InternalColumnarRDDConverter
#4272 Full support for SUM overflow detection on decimal
#4255 Make regexp pattern [^a] consistent with Spark for multiline strings
#4306 Revert commonizing the int96ParquetRebase* functions
#4299 Fix auto merge conflict 4298 [skip ci]
#4159 Optimize sample perf
#4235 Commonize v2 shim
#4274 Add tests for timestamps that overflowed before.
#4271 Skip test_regexp_replace_null_pattern_fallback on Spark 3.1.1 and later
#4278 Use mamba for cudf conda install [skip ci]
#4270 Document exponent differences when casting floating point to string [skip ci]
#4268 Fix merge conflict with branch-21.12
#4093 Add tests for regexp() and regexp_like()
#4259 fix regression in cast from string to float that caused signed NaN to be considered valid
#4241 fix bug in parsing regex character classes that start with ^ and contain an unescaped ]
#4224 Support row-based Hive UDFs
#4221 GpuCast from ArrayType to StringType
#4007 Implement duplicate key handling for GpuCreateMap
#4251 Skip test_regexp_replace_null_pattern_fallback on Databricks
#4247 Disable failing CastOpSuite test
#4239 Make EOL anchor behavior match CPU for strings ending with newline
#4153 Regexp: Only transpile once per expression rather than once per batch
#4230 Change to build tools module with all the versions by default
#4223 Fixes a minor deprecation warning
#4215 Rebalance testing load
#4214 Fix pre_merge ci_2 [skip ci]
#4212 Remove an unused method with its outdated comment
#4211 Update test_floor_ceil_overflow to be more lenient on exception type
#4203 Move all the GpuShuffleExchangeExec shim v2 classes to org.apache.spark
#4193 Rename 311until320-apache to 311until320-noncdh
#4197 Ignore failing string to timestamp tests temporarily
#4160 Fix merge issues for branch 22.02
#4081 Convert String to DecimalType without casting to FloatType
#4132 Fix auto merge conflict 4131 [skip ci]
#4099 [REVIEW] Init version 22.02.0
#4113 Fix pre-merge CI 2 conditions [skip ci]
#4064 Regex: transpile . to [^\r\n] in cuDF
#4044 RLike: Fall back to CPU for regex that would produce incorrect results

Release 21.12

Features

#1571 [FEA] Better precision range for decimal multiply, and possibly others
#3953 [FEA] Audit: Add array support to union by name
#4085 [FEA] Decimal 128 Support: Concat
#4073 [FEA] Decimal 128 Support: MapKeys, MapValues, MapEntries
#3432 [FEA] Qualification tool checks if there is any "Scan JDBCRelation" and count it as "problematic"
#3824 [FEA] Support MapType in ParquetCachedBatchSerializer
#4048 [FEA] WindowExpression support for Decimal 128 in Spark 320
#4047 [FEA] Literal support for Decimal 128 in Spark 320
#3863 [FEA] Add Spark 3.3.0-SNAPSHOT Shim
#3814 [FEA] stddev stddev_samp and std should be supported over a window
#3370 [FEA] Add support for Databricks 9.1 runtime
#3876 [FEA] Support REGEXP_REPLACE to replace null values
#3784 [FEA] Support ORC write Map column(single level)
#3470 [FEA] Add shims for 3.2.1-SNAPSHOT
#3855 [FEA] CPU based UDF to run efficiently and transfer data back to GPU for supported operations
#3739 [FEA] Provide an explicit config for fallback on CPU if plan rewrite fails
#3888 [FEA] Decimal 128 Support: Add a "Trust me I know it will not overflow config"
#3088 [FEA] Profile tool print problematic operations
#3886 [FEA] Decimal 128 Support: Extend the range for Decimal Multiply and Divide
#79 [FEA] Support Size operation
#3880 [FEA] Decimal 128 Support: Average aggregation
#3659 [FEA] External tool integration with Qualification tool
#2 [FEA] RLIKE support
#3192 [FEA] Support decimal type in ORC writer
#3419 [FEA] Add support for org.apache.spark.sql.execution.SampleExec
#3535 [FEA] Qualification tool can detect RDD APIs in SQL plan
#3494 [FEA] Support structs in ORC writer
#3514 [FEA] Support collect_set on struct in aggregation context
#3515 [FEA] Support CreateArray to produce array(struct)
#3116 [FEA] Support Maps, Lists, and Structs as non-key columns on joins
#2054 [FEA] Add support for Arrays to ParquetCachedBatchSerializer
#3573 [FEA] Support Cache(PCBS) Array-of-Struct

Performance

#3768 [DOC] document databricks init script required for UCX
#2867 [FEA] Make LZ4_CHUNK_SIZE configurable
#3832 [FEA] AST enabled GpuBroadcastNestedLoopJoin left side can't be small
#3798 [FEA] bounds checking in joins can be expensive
#3603 [FEA] Allocate UCX bounce buffers outside of RMM if ASYNC allocator is enabled

Bugs Fixed

#4253 [BUG] Dependencies missing of spark-rapids v21.12.0 release jars
#4216 [BUG] AQE Crashing Spark RAPIDS when using filter() and union()
#4188 [BUG] data corruption in GpuBroadcastNestedLoopJoin with empty relations edge case
#4191 [BUG] failed to read DECIMAL128 within MapType from ORC
#4175 [BUG] arithmetic_ops_test failed in spark 3.2.0
#4162 [BUG] isCastDecimalToStringEnabled is never called
#3894 [BUG] test_pandas_scalar_udf and test_pandas_map_udf failed in UCX standalone CI run
#3970 [BUG] mismatching timezone settings on executor and driver can cause ORC read data corruption
#4141 [BUG] Unable to start the RapidsShuffleManager in databricks 9.1
#4102 [BUG] udf-example build failed: Unknown CMake command "cpm_check_if_package_already_added".
#4084 [BUG] window on unbounded preceeding and unbounded following can produce incorrect results.
#3990 [BUG] Scaladoc link warnings in ParquetCachedBatchSerializer and ExplainPlan
#4108 [BUG] premerge fails due to Spark 3.3.0 HadoopFsRelation after SPARK-37289
#4042 [BUG] cudf_udf tests fail on nightly Integration test run
#3743 [BUG] Implicitly catching all exceptions warning in GpuOverrides
#4069 [BUG] parquet_test.py pytests FAILED on Databricks-9.1-ML-spark-3.1.2
#3461 [BUG] Cannot build project from a sub-directory
#4053 [BUG] buildall uses a stale aggregator dependency during test compilation
#3703 [BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed with TypeError
#3706 [BUG] approx_percentile returns array of zero percentiles instead of null in some cases
#4017 [BUG] Why is the hash aggregate not handling empty result expressions
#3994 [BUG] can't open notebook 'docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb'
#3996 [BUG] Exception happened when getting a null row
#3999 [BUG] Integration cache_test failures - ArrayIndexOutOfBoundsException
#3532 [BUG] DatabricksShimVersion must carry runtime version info
#3834 [BUG] Approx_percentile deserialize error when calling "show" rather than "collect"
#3992 [BUG] failed create-parallel-world in databricks build
#3987 [BUG] "mvn clean package -DskipTests" is no longer working
#3866 [BUG] RLike integration tests failing on Azure Databricks 7.3
#3980 [BUG] udf-example build failed due to maven-antrun-plugin upgrade
#3966 [BUG] udf-examples module fails on mvn compile and mvn test
#3977 [BUG] databricks aggregator jar deployed failed
#3915 [BUG] typo in verify_same_sha_for_unshimmed prevents the offending class file name from being logged.
#1304 [BUG] Query fails with HostColumnarToGpu doesn't support Structs
#3924 [BUG] ExpressionEncoder does not work for input in GpuScalaUDF
#3911 [BUG] CI fails on an inconsistent set of partial builds
#2896 [BUG] Extra GpuColumnarToRow when using ParquetCachedBatchSerializer on databricks
#3864 [BUG] test_sample_produce_empty_batch failed in dataproc
#3823 [BUG]binary-dedup.sh script fails on mac
#3658 [BUG] DataFrame actions failing with error: Error : java.lang.NoClassDefFoundError: Could not initialize class com.nvidia.spark.rapids.GpuOverrides withlatest 21.10 jars
#3857 [BUG] nightly build push dist packge w/ single version of spark
#3854 [BUG] not found: type PoissonDistribution in databricks build
#3852 spark-nightly-build deploys all modules due to typo in -pl
#3844 [BUG] nightly spark311cdh build failed
#3843 [BUG] databricks nightly deploy failed
#3705 [BUG] Change nullOnDivideByZero from runtime parameter to aggregate expression for stddev and variance aggregation families
#3614 [BUG] ParquetMaterializer.scala appears in both v1 and v2 shims
#3430 [BUG] Profiling tool silently stops without producing any output on a Synapse Spark event log
#3311 [BUG] cache_test.py failed w/ cache.serializer in spark 3.1.2
#3710 [BUG] Usage of Class.forName without specifying a classloader
#3462 [BUG] IDE complains about duplicate ShimBasePythonRunner instances
#3476 [BUG] test_non_empty_ctas fails on yarn

PRs

#4362 Decimal128 support for Parquet
#4391 update gcp custom dataproc image version to avoid log4j issue[skip ci]
#4379 update hot fix cudf link v21.12.2
#4367 update 21.12 branch for doc [skip ci]
#4245 Update changelog 21.12 to latest [skip ci]
#4258 Sanitize column names in ParquetCachedBatchSerializer before writing to Parquet
#4308 Bump up GPU reserve memory to 640MB
#4307 Update Download page for 21.12 [skip ci]
#4261 Update cudfjni version to released 21.12.0
#4265 Remove aggregator dependency before deploying dist artifact
#4030 Support code coverage report with single version jar [skip ci]
#4287 Update 21.12 compatibility guide for known regexp issue [skip ci]
#4242 Fix indentation issue in getting-started-k8s guide [skip ci]
#4263 Add missing ORC write tests on Map of Decimal
#4257 Implement getShuffleRDD and fixup mismatched output types on shuffle reuse
#4250 Update the release script [skip ci]
#4222 Add arguments support to 'databricks/run-tests.py'
#4233 Add databricks init script for UCX
#4231 RAPIDS Shuffle Manager fallback if security is enabled
#4228 Fix unconditional nested loop joins on empty tables
#4217 Enable event log for qualification & profiling tools testing from IT
#4202 Parameter for the Databricks zone-id [skip ci]
#4199 modify some words for synapse getting started guide[skip ci]
#4200 Disable approx percentile tests that intermittently fail
#4187 Added a getting started guide for Synapse[skip ci]
#4192 Fix ORC read DECIMAL128 inside MapType
#4173 Update approx percentile docs to link to issue 4060 [skip ci]
#4174 Document Bloop, Metals and VS code as an IDE option [skip ci]
#4181 Fix element_at for 3.2.0 and array/struct cast
#4110 Add a getting started guide on workload qualification [skip ci]
#4106 Add docs for MIG on YARN [skip ci]
#4100 Add PCA example to ml-integration page [skip ci]
#4177 Decimal128: added missing decimal128 signature on Spark 32X
#4161 More integration tests with decimal128
#4165 Fix type checks for get array item in 3.2.0
#4163 Enable config to check for casting decimals to strings
#4154 Use num_slices to guarantee partition shape in the pandas udf tests
#4129 Check executor timezone is same as driver timezone when running on GPU
#4139 Decimal128 Support
#4128 Fix build errors in udf-examples native build
#4063 Regexp_replace support regexp
#4125 Remove unused imports
#4052 Support null safe host column vector
#4116 Add in tests to check for overflow in unbounded window
#4111 Added external doc links for JRE and Spark
#4105 Enforce checks for unused imports and missed interpolation
#4107 Set the task context in background reader threads
#4114 Refactoring cudf_udf test setup
#4109 Stop using redundant partitionSchemaOption dropped in 3.3.0
#4097 Enable auto-merge from branch-21.12 to branch-22.02 [skip ci]
#4094 Remove spark311db shim layer
#4082 Add abfs and abfss to the cloud scheme
#4071 Treat scalac warnings as errors
#4043 Promote cudf as dist direct dependency, mark aggregator provided
#4076 Sets the GPU device id in the UCX early start thread
#4087 Regex parser improvements and bug fixes
#4079 verify "Add array support to union by name " by adding an integration test
#4090 Update pre-merge expression for 2022+ CI [skip ci]
#4049 Change Databricks image from 8.2 to 9.1 [skip ci]
#4051 Upgrade ORC version from 1.5.8 to 1.5.10
#4080 Add case insensitive when clipping parquet blocks
#4083 Fix compiler warning in regex transpiler
#4070 Support building from sub directory
#4072 Fix overflow checking on optimized decimal sum
#4067 Append new authorized user to blossom-ci whitelist [skip ci]
#4066 Temply disable cudf_udf test
#4057 Restore original ASL 2.0 license text
#3937 Qualification tool: Detect JDBCRelation in eventlog
#3925 verify AQE and DPP both on
#3982 Fix the issue of parquet reading with case insensitive schema
#4054 Use install for the base version build thread [skip ci]
#4008 [Doc] Update the getting started guide for databricks: Change from 8.2 to 9.1 runtime [skip ci]
#4010 Enable MapType for ParquetCachedBatchSerializer
#4046 lower GPU memory reserve to 256MB
#3770 Enable approx percentile tests
#4038 Change the catalystConverter to be a Scala val.
#4035 Hash aggregate fix empty resultExpressions
#3998 Check for CPU cores and free memory in IT script
#3984 Check for data write command before inserting hash sort optimization
#4019 initialize RMM with a single pool size
#3993 Qualification tool: Remove "unsupported" word for nested complex types
#4033 skip spark 330 tests temporarily in nightly [skip ci]
#4029 Update buildall script and the build doc [skip ci]
#4014 fix can't open notebook 'docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb'[skip ci]
#4024 Allow using a custom Spark Resource Name for a GPU
#4012 Add Apache Spark 3.3.0-SNAPSHOT Shims
#4021 Explicitly use the public version of ParquetCachedBatchSerializer
#3869 Add Std dev samp for windowing
#3960 Use a fixed RMM pool size
#3767 Add shim for Databricks 9.1
#3862 Prevent approx_percentile aggregate from being split between CPU and GPU
#3871 Add integration test for RLike with embedded null in input
#3968 Allow null character in regexp_replace pattern
#3821 Support ORC write Map column
#3991 Fix aggregator jar copy logic
#3973 Add shims for Apache Spark 3.2.1-SNAPSHOT builds
#3967 Bring back AST support for BNLJ inner joins
#3947 Enable rlike tests on databricks
#3981 Replace tasks w/ target of maven-antrun-plugin in udf-example
#3976 Replace long artifact lists with an ant loop
#3972 Revert udf-examples dependency change to restore test build phase
#3978 Update aggregator jar name in databricks deploy script
#3965 Add how-to resolve auto-merge conflict [skip ci]
#3963 Add a dedicated RapidsConf option to tolerate GpuOverrides apply failures
#3923 Prepare for 3.2.1 shim, various shim build fixes and improvements
#3969 add doc on using compute-sanitizer
#3964 Qualification tool: Catch exception for invalid regex patterns
#3961 Avoid using HostColumnarToGpu for nested types
#3910 Refactor the aggregate API
#3897 Support running CPU based UDF efficiently
#3950 Fix failed auto-merge #3939
#3946 Document compatability of operations with side effects.
#3945 Update udf-examples dependencies to use dist jar
#3938 remove GDS alignment code
#3943 Add artifact revisions check for nightly tests [skip ci]
#3933 Profiling tool: Print potential problems
#3926 Add zip unzip to integration tests dockerfiles [skip ci]
#3757 Update to nvcomp-2.x JNI APIs
#3922 Stop using -U in build merges aggregator jars of nightly [skip ci]
#3907 Add version properties to integration tests modules
#3912 Stop using -U in the build that merges all aggregator jars
#3909 Fix warning when catching all throwables in GpuOverrides
#3766 Use JCudfSerialization to deserialize a table to host columns
#3820 Advertise CPU orderingSatisfies
#3858 update emr 6.4 getting started doc and pic[skip ci]
#3899 Fix sample test cases
#3896 Xfail the sample tests temporarily
#3848 Fix binary-dedupe failures and improve its performance on macOS
#3867 Disable rlike integration tests on Databricks
#3850 Add explain Plugin API for CPU plan
#3868 Fix incorrect schema of nested types of union - audit SPARK-36673
#3860 Add unit test for GpuKryoRegistrator
#3847 Add Running Qualification App API
#3861 Revert "Fix typo in nightly deploy project list (#3853)" [skip ci]
#3796 Add Rlike support
#3856 Fix not found: type PoissonDistribution in databricks build
#3853 Fix typo in nightly deploy project list
#3831 Support decimal type in ORC writer
#3789 GPU sample exec
#3846 Include pluginRepository for cdh build
#3819 Qualification tool: Detect RDD Api's in SQL plan
#3835 Minor cleanup: do not set cuda stream to null
#3845 Include 'DB_SHIM_NAME' from Databricks jar path to fix nightly deploy [skip ci]
#3523 Interpolate spark.version.classifier in build.dir
#3813 Change nullOnDivideByZero from runtime parameter to aggregate expression for stddev and variance aggregations
#3791 Add audit script to get list of commits from Apache Spark master branch
#3744 Add developer documentation for setting up Microk8s [skip ci]
#3817 Fix auto-merge conflict 3816 [skip ci]
#3804 Missing statistics in GpuBroadcastNestedLoopJoin
#3799 Optimize out bounds checking for joins when the gather map has only valid entries
#3801 Update premerge to use the combined snapshots jar
#3696 Support nested types in ORC writer
#3790 Fix overflow when casting integral to neg scale decimal
#3779 Enable some union of structs tests that were marked xfail
#3787 Fix auto-merge conflict 3786 from branch-21.10 [skip ci]
#3782 Fix auto-merge conflict 3781 [skip ci]
#3778 Remove extra ParquetMaterializer.scala file
#3773 Restore disabled ORC and Parquet tests
#3714 Qualification tool: Error handling while processing large event logs
#3758 Temporarily disable timestamp read tests for Parquet and ORC
#3748 Fix merge conflict with branch-21.10
#3700 CollectSet supports structs
#3740 Throw Exception if failure to load ParquetCachedBatchSerializer class
#3726 Replace Class.forName with ShimLoader.loadClass
#3690 Added support for Array[Struct] to GpuCreateArray
#3728 Qualification tool: Fix bug to process correct listeners
#3734 Fix squashed merge from #3725
#3725 Fix merge conflict with branch-21.10
#3680 cudaMalloc UCX bounce buffers when async allocator is used
#3681 Clean up and document metrics
#3674 Move file TestingV2Source.Scala
#3617 Update Version to 21.12.0-SNAPSHOT
#3612 Add support for nested types as non-key columns on joins
#3619 Added support for Array of Structs

Release 21.10

Features

#1601 [FEA] Support AggregationFunction StddevSamp
#3223 [FEA] Rework the shim layer to robustly handle ABI and API incompatibilities across Spark releases
#13 [FEA] Percentile support
#3606 [FEA] Support approx_percentile on GPU with decimal type
#3552 [FEA] extend allowed datatypes for add and multiply in ANSI mode
#3450 [FEA] test the UCX shuffle with the new build changes
#3043 [FEA] Qualification tool: Add support to filter specific configuration values
#3413 [FEA] Add in support for transform_keys
#3297 [FEA] ORC reader supports reading Map columns.
#3367 [FEA] Support GpuRowToColumnConverter on BinaryType
#3380 [FEA] Support CollectList/CollectSet on nested input types in GroupBy aggregation
#1923 [FEA] Fall back to the CPU when LEAD/LAG wants to IGNORE NULLS
#3044 [FEA] Qualification tool: Report the nested data types
#3045 [FEA] Qualification tool: Report the write data formats.
#3224 [FEA] Add maven compile/package plugin executions, one for each supported Spark dependency version
#3047 [FEA] Profiling tool: Structured output format
#2877 [FEA] Support HashAggregate on struct and nested struct
#2916 [FEA] Support GpuCollectList and GpuCollectSet as TypedImperativeAggregate
#463 [FEA] Support NESTED_SCHEMA_PRUNING_ENABLED for ORC
#1481 [FEA] ORC Predicate pushdown for Nested fields
#2879 [FEA] ORC reader supports reading Struct columns.
#27 [FEA] test current_date and current_timestamp
#3229 [FEA] Improve CreateMap to support multiple key and value expressions
#3111 [FEA] Support conditional nested loop joins
#3177 [FEA] Support decimal type in ORC reader
#3014 [FEA] Add initial support for CreateMap
#3110 [FEA] Support Map as input to explode and pos_explode
#3046 [FEA] Profiling tool: Scale to run large number of event logs.
#3156 [FEA] Support casting struct to struct
#2876 [FEA] Support joins(SHJ and BHJ) on struct as join key with nested struct in the selected column list
#68 [FEA] support StringRepeat
#3042 [FEA] Qualification tool: Add conjunction and disjunction filters.
#2615 [FEA] support collect_list and collect_set as groupby aggregation
#2943 [FEA] Support PreciseTimestampConversion when using windowing function
#2878 [FEA] Support Sort on nested struct
#2133 [FEA] Join support for passing MapType columns along when not join keys
#3041 [FEA] Qualification tool: Add filters based on Regex and user name.
#576 [FEA] Spark 3.1 orc nested predicate pushdown support

Performance

#3651 [DOC] Point users to UCX 1.11.2
#2370 [FEA] RAPIDS Shuffle Manager enable/disable config
#2923 [FEA] Move to dispatched binops instead of JIT binops

Bugs Fixed

#3929 [BUG] published rapids-4-spark dist artifact references aggregator
#3837 [BUG] Spark-rapids v21.10.0 release candidate jars failed on the OSS validation check.
#3769 [BUG] dedupe fails with find: './parallel-world/spark301/ ...' No such file or directory
#3783 [BUG] spark-rapids v21.10.0 release build failed on script "dist/scripts/binary-dedupe.sh"
#3775 [BUG] Hash aggregate with structs crashes with IllegalArgumentException
#3704 [BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment
#3760 [BUG] Databricks class cast exception failure
#3736 [BUG] Crossjoin performance degraded a lot on RAPIDS 21.10 snapshot
#3369 [BUG] UDF compiler can cause crashes with unexpected class input
#3713 [BUG] AQE shuffle coalesce optimization is broken with Spark 3.2
#3720 [BUG] Qualification tool warnings
#3718 [BUG] plugin failing to build for CDH due to missing dependency
#3653 [BUG] Issue seen with AQE on in Q5 (possibly others) using Spark 3.2 rc3
#3686 [BUG] binary-dedupe doesn't fail the build on errors
#3520 [BUG] Scaladoc warnings emitted during build
#3516 [BUG] MultiFileParquetPartitionReader can fail while trying to write the footer
#3648 [BUG] test_cast_decimal_to failing in databricks 7.3
#3670 [BUG] mvn test failed compiling rapids-4-spark-tests-next-spark_2.12
#3640 [BUG] q82 regression after #3288
#3642 [BUG] Shims improperly overridden
#3611 [BUG] test_no_fallback_when_ansi_enabled failed in databricks
#3601 [BUG] Latest 21.10 snapshot jars failing with java.lang.ClassNotFoundException: com.nvidia.spark.rapids.ColumnarRdd with XGBoost
#3589 [BUG] Latest 21.10 snapshot jars failing with java.lang.ClassNotFoundException: com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
#3424 [BUG] Aggregations in ANSI mode do not detect overflows
#3592 [BUG] Failed to find data source: com.nvidia.spark.rapids.tests.datasourcev2.parquet.ArrowColumnarDataSourceV2
#3580 [BUG] Class deduplication pulls wrong class for ProxyRapidsShuffleInternalManagerBase
#3331 [BUG] Failed to read file into buffer in CuFile.readFromFile in gds standalone test
#3376 [BUG] Unit test failures in Spark 3.2 shim build
#3382 [BUG] Support years with up to 7 digits when casting from String to Date in Spark 3.2
#3266 CDP - Flakiness in JoinSuite in Integration tests
#3415 [BUG] Fix regressions in WindowFunctionSuite with Spark 3.2.0
#3548 [BUG] GpuSum overflow on 3.2.0+
#3472 [BUG] GpuAdd and GpuMultiply do not include failOnError
#3502 [BUG] Spark 3.2.0 TimeAdd/TimeSub fail due to new DayTimeIntervalType
#3511 [BUG] "Sequence" function fails with "java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData"
#3518 [BUG] Nightly tests failed with RMM outstanding allocations on shutdown
#3383 [BUG] ParseDateTime should not support special dates with Spark 3.2
#3384 [BUG] AQE does not work with Spark 3.2 due to unrecognized GPU partitioning
#3478 [BUG] CastOpSuite and ParseDateTimeSuite failures spark 302 and others
#3495 Fix shim override config
#3482 [BUG] ClassNotFound error when running a job
#1867 [BUG] In Spark 3.2.0 and above dynamic partition pruning and AQE are not mutually exclusive
#3468 [BUG] GpuKryoRegistrator ClassNotFoundException
#3488 [BUG] databricks 8.2 runtime build failed
#3429 [BUG] test_sortmerge_join_struct_mixed_key_with_null_filter LeftSemi/LeftAnti fails
#3400 [BUG] Canonicalized GPU plans sometimes not consistent when using Spark 3.2
#3440 [BUG] Followup comments from PR3411
#3372 [BUG] 3.2.0 shim: ShuffledBatchRDD.scala:141: match may not be exhaustive.
#3434 [BUG] Fix the unit test failure of KnownNotNull in Scala UDF for Spark 3.2
#3084 [AUDIT] [SPARK-32484][SQL] Fix log info BroadcastExchangeExec.scala
#3463 [BUG] 301+-nondb is named incorrectly
#3435 [BUG] tools - test dsv1 complex and decimal test fails
#3388 [BUG] maven scalastyle checks don't appear to work for alterneate source directories
#3416 [BUG] Resource cleanup issues with Spark 3.2
#3339 [BUG] Databricks test fails test_hash_groupby_collect_partial_replace_fallback
#3375 [BUG] SPARK-35742 Replace semanticEquals with canonicalize
#3334 [BUG] UCX join_test FAILED on spark standalone
#3058 [BUG] GPU ORC reader complains errors when specifying columns that do not exist in file schema.
#3385 [BUG] misc_expr_test FAILED on Dataproc
#2052 [BUG] Spark 3.2.0 test fails due to SPARK-34906 Refactor TreeNode's children handling methods into specialized traits
#3401 [BUG] Qualification tool failed with java.lang.ArrayIndexOutOfBoundsException
#3333 [BUG]Mortgage ETL input_file_name is not correct when using CPU's CsvScan
#3391 [BUG] UDF example build fail
#3379 [BUG] q93 failed w/ UCX
#3364 [BUG] analysis tool cannot handle a job with no tasks.
#3235 Classes directly in Apache Spark packages
#3237 BasicColumnWriteJobStatsTracker might be affected by spark change SPARK-34399
#3134 [BUG] Add more checkings before coalescing ORC files
#3324 [BUG] Databricks builds failing with missing dependency issue
#3244 [BUG] join_test LeftAnti failing on Databricks
#3268 [BUG] CDH ParquetCachedBatchSerializer fails to build due to api change in VectorizedColumnReader
#3305 [BUG] test_case_when failed on Databricks 7.3 nightly build
#3139 [BUG] case when on some nested types can produce a crash
#3253 [BUG] ClassCastException for unsupported TypedImperativeAggregate functions
#3256 [BUG] udf-examples native build broken
#3271 [BUG] Databricks 301 shim compilation error
#3255 [BUG] GpuRunningWindowExecMeta is missing ExecChecks for partitionSpec in databricks runtime
#3222 [BUG] test_running_window_function_exec_for_all_aggs failed in the UCX EGX run
#3195 [BUG] failures parquet_test test:read_round_trip
#3176 [BUG] test_window_aggs_for_rows_collect_list[IGNORE_ORDER({'local': True})] FAILED on EGX Yarn cluster
#3187 [BUG] NullPointerException in SLF4J on startup
#3166 [BUG] Unable to build rapids-4-spark jar from source due to missing 3.0.3-SNAPSHOT for spark-sql
#3131 [BUG] hash_aggregate_test TypedImperativeAggregate tests failed
#3147 [BUG] window_function_test.py::test_window_ride_along failed in databricks runtime
#3094 [BUG] join_test.py::test_sortmerge_join_with_conditionals failed in databricks 8.2 runtime
#3078 [BUG] test_hash_join_map, test_sortmerge_join_map failed in databricks runtime
#3059 [BUG] orc_test:test_pred_push_round_trip failed

PRs

#3940 Update changelog [skip ci]
#3930 Dist artifact with provided aggregator dependency
#3918 Update changelog [skip ci]
#3906 Doc updated for v2110[skip ci]
#3840 Update changelog [skip ci]
#3838 Update deploy script [skip ci]
#3827 Update changelog 21.10 to latest [skip ci]
#3808 Rewording qualification and profiling tools doc files[skip ci]
#3815 Correct 21.10 docs such as PCBS related FAQ [skip ci]
#3807 Update 21.10.0 release doc [skip ci]
#3800 Update approximate percentile documentation
#3810 Update to include Spark 3.2.0 in nosnapshots target so it gets released officially.
#3806 Update spark320.version to 3.2.0
#3795 Reduce usage of escaping in xargs
#3785 [BUG] Update cudf version in version-dev script [skip ci]
#3771 Update cudfjni version to 21.10.0
#3777 Ignore nullability when checking for need to cast aggregation input
#3763 Force parallel world in Shim caller's classloader
#3756 Simplify shim classloader logic
#3746 Avoid using AST on inner joins and avoid coalesce after nested loop join filter
#3719 Advertise CPU sort order and partitioning expressions to Catalyst
#3737 Add note referencing known issues in approx_percentile implementation
#3729 Update to ucx 1.11.2 for 21.10
#3711 Surface problems with overrides and fallback
#3722 CDH build stopped working due to missing jars in maven repo
#3691 Fix issues with AQE and DPP enabled on Spark 3.2
#3373 Support stddev and variance aggregations families
#3708 disable percentile approx tests
#3695 Remove duplicated data types for collect_list tests
#3687 Improve dedupe script
#3646 Debug utility method to dump a table or columnar batch to Parquet
#3683 Change deploy scripts for new build system
#3301 Approx Percentile
#3673 Add the Scala jar as an external lib for a linkage warning
#3668 Improve the diagnostics in udf compiler for try-and-catch.
#3666 Recompute Parquet block metadata when estimating footer from multiple file input
#3671 Fix tests-spark310+ dependency
#3663 Add back the tests-spark310+
#3657 Revert "Use cudf to compute exact hash join output row sizes (#3288)"
#3643 Properly override Shims for int96Rebase
#3645 Verify unshimmed classes are bitwise-identical
#3650 Fix dist copy dependencies
#3649 Add ignore_order to other fallback tests for the aggregate
#3631 Change premerge to build all Spark versions
#3630 Fix CDH Build
#3636 Change nightly build to not deploy dist for each classifier version [skip ci]
#3632 Revert disabling of ctas test
#3628 Fix 313 ShuffleManager build
#3618 Update changelog script to strip ambiguous annotation [skip ci]
#3626 Add in support for casting decimal to other number types
#3615 Ignore order for the test_no_fallback_when_ansi_enabled
#3602 Dedupe proxy rapids shuffle manager byte code
#3330 Support int96RebaseModeInWrite and int96RebaseModeInRead
#3438 Parquet read unsigned int: uint8, uin16, uint32
#3607 com.nvidia.spark.rapids.ColumnarRdd not exposed to user for XGBoost
#3566 Enable String Array Max and Min
#3590 Unshim ExclusiveModeGpuDiscoveryPlugin
#3597 ANSI check for aggregates
#3595 Update the overflow check algorithm for Subtract
#3588 Disable test_non_empty_ctas test
#3577 Commonize more shim module files
#3594 Fix nightly integration test script for specfic artifacts
#3544 Add test for nested grouping sets, rollup, cube
#3587 Revert shared class list modifications in PR#3545
#3570 ANSI Support for Abs, UnaryMinus, and Subtract
#3574 Add in ANSI date time fallback
#3578 Deploy all of the classifier versions of the jars [skip ci]
#3569 Add commons-lang3 dependency to tests
#3568 Enable 3.2.0 unit test in premerge and nightly
#3559 Commonize shim module join and shuffle files
#3565 Auto-dedupe ASM-relocated shim dependencies
#3531 Fall back to the CPU for date/time parsing we cannot support yet
#3561 Follow on to ANSI Add
#3557 Add IDEA profile switch workarounds
#3504 Fix reserialization of broadcasted tables
#3556 Fix databricks test.sh script for passing spark shim version
#3545 Dynamic class file deduplication across shims in dist jar build
#3551 Fix window sum overflow for 3.2.0+
#3537 GpuAdd supports ANSI mode.
#3533 Define a SPARK_SHIM_VER to pick up specific rapids-4-spark-integration-tests jars
#3547 Range window supports DayTime on 3.2+
#3534 Fix package name and sql string issue for GpuTimeAdd
#3536 Enable auto-merge from branch 21.10 to 21.12 [skip ci]
#3521 Qualification tool: Report nested complex types in Potential Problems and improve write csv identification.
#3507 TimeAdd supports DayTimeIntervalType
#3529 Support UnsafeArrayData in scalars
#3528 Update NOTICE copyrights to 2021
#3527 Ignore CBO tests that fail against Spark 3.2.0
#3439 Stop parsing special dates for Spark 3.2+
#3524 Update hashing to normalize -0.0 on 3.2+
#3508 Auto abort dup pre-merge builds [skip ci]
#3501 Add limitations for Databricks doc
#3517 Update empty CTAS testing to avoid Hive if possible
#3513 Allow spark320 tests to run with 320 or 321
#3493 Initialze RAPIDS Shuffle Manager at driver/executor startup
#3496 Update parse date to leverage cuDF support for single digit components
#3454 Catch UDF compiler exceptions and fallback to CPU
#3505 Remove doc references to cudf JIT
#3503 Have average support nulls for 3.2.0
#3500 Fix GpuSum type to match resultType
#3485 Fix regressions in cast from string to date and timestamp
#3487 Add databricks build tests to pre-merge CI [skip ci]
#3497 Re-enable spark.rapids.shims-provider-override
#3499 Fix Spark 3.2.0 test_div_by_zero_ansi failures
#3418 Qualification tool: Add filtering based on configuration parameters
#3498 Update the scala repl loader to avoid issues with broadcast.
#3479 Test with Spark 3.2.1-SNAPSHOT
#3474 Build fixes and IDE instructions
#3460 Add DayTimeIntervalType/YearMonthIntervalType support
#3491 Shim GpuKryoRegistrator
#3489 Fix 311 databricks shim for AnsiCastOpSuite failures
#3456 Fallback to CPU when datasource v2 enables RuntimeFiltering
#3417 Adds pre/post steps for merge and update aggregate
#3431 Reinstate test_sortmerge_join_struct_mixed_key_with_null_filter
#3477 Update supported docs to clarify casting floating point to string
#3447 Add CUDA async memory resource as an option
#3473 Create non-shim specific version of ParquetCachedBatchSerializer
#3471 Fix canonicalization of GpuScalarSubquery
#3480 Temporarily disable failing cast string to date tests
#3377 Fix AnsiCastOpSuite failures with Spark 3.2
#3467 Update docs to better describe support for floating point aggregation and NaNs
#3459 Use Shims v2 for ShuffledBatchRDD
#3457 Update the children unpacking pattern for GpuIf.
#3464 Add test for empty relation propagation
#3458 Fix log info GPU BroadcastExchangeExec
#3466 Databricks build fixes for missing shouldFailDivOverflow and removal of needed imports
#3465 Fix name of 301+-nondb directory to stop at Spark 3.2.0
#3452 Enable AQE/DPP test for Spark 3.2
#3436 Qualification tool: Update expected result for test
#3455 Decrease pre_merge_ci parallelism to 4 and reordering time-consuming tests
#3420 IntegralDivide throws an exception on overflow in ANSI mode
#3433 Batch scalastyle checks across all modules upfront
#3453 Fix spark-tests script for classifier
#3445 Update nightly build to pull Databricks jars
#3446 Format aggregator pom and commonize some configuration
#3444 Add in tests for unaligned parquet pages
#3451 Fix typo in spark-tests.sh
#3443 Remove 301emr shim
#3441 update deploy script for Databricks
#3414 Add in support for transform_keys
#3320 Add AST support for logical AND and logical OR
#3425 Throw an error by default if CREATE TABLE AS SELECT overwrites data
#3422 Stop double closing SerializeBatchDeserializeHostBuffer host buffers when running with Spark 3.2
#3411 Make new build default and combine into dist package
#3368 Extend TagForReplaceMode to adapt Databricks runtime
#3428 Remove commented-out semanticEquals overrides
#3421 Revert to CUDA runtime image for build
#3381 Implement per-shim parallel world jar classloader
#3303 Update to cudf conditional join change that removes null equality argument
#3408 Add leafNodeDefaultParallelism support
#3426 Correct grammar in qualification tool doc
#3423 Fix hash_aggregate tests that leaked configs
#3412 Restore AST conditional join tests
#3403 Fix canonicalization regression with Spark 3.2
#3394 Orc read map
#3392 Support transforming BinaryType between Row and Columnar
#3393 Fill with null columns for the names exist only in read schema in ORC reader
#3399 Fix collect_list test so it covers nested types properly
#3410 Specify number of RDD slices for ID tests
#3363 Add AST support for null literals
#3396 Throw exception on parse error in ANSI mode when casting String to Date
#3315 Add in reporting of time taken to transition plan to GPU
#3409 Use devel cuda image for premerge CI
#3405 Qualification tool: Filter empty strings from Read Schema
#3387 Fallback to the CPU for IGNORE NULLS on lead and lag
#3398 Fix NPE on string repeat when there is no data buffer
#3366 Fix input_file_xxx issue when FileScan is running on CPU
#3397 Add tests for GpuInSet
#3395 Fix UDF native example build
#3389 Bring back setRapidsShuffleManager in the driver side
#3263 Qualification tool: Report write data format and nested types
#3378 Make Dockerfile.cuda consistent with getting-started-kubernetes.md
#3359 UnionExec array and nested array support
#3342 Profiling tool add CSV output option and add new combined mode
#3365 fix databricks builds
#3323 Enable optional Spark 3.2.0 shim build
#3361 Fix databricks 3.1.1 arrow dependency version
#3354 Support HashAggregate on struct and nested struct
#3341 ArrayMax and ArrayMin support plus map_entries, map_keys, map_values
#3356 Support Databricks 3.0.1 with new build profiles
#3344 Move classes out of Apache Spark packages
#3345 Add job commit time to task tracker stats
#3357 Avoid RAT checks on any CSV file
#3355 Add new authorized user to blossom-ci whitelist [skip ci]
#3340 xfail AST nested loop join tests until cudf empty left table bug is fixed
#3276 Use child type in some places to make it more clear
#3346 Mark more tests as premerge_ci_1
#3353 Fix automerge conflict 3349 [skip ci]
#3335 Support Databricks 3.1.1 in new build profiles
#3317 Adds in support for the transform_values SQL function
#3299 Insert buffer converters for TypedImperativeAggregate
#3325 Fix spark version classifier being applied properly
#3288 Use cudf to compute exact hash join output row sizes
#3318 Fix LeftAnti nested loop join missing condition case
#3316 Fix GpuProjectAstExec when projecting only literals
#3262 Re-enable the struct support for the ORC reader.
#3312 Fix inconsistent function name and add backward compatibility support for premerge job [skip ci]
#3319 Temporarily disable cache test except for spark 3.1.1
#3308 Branch 21.10 FAQ update forward compatibility, update Spark and CUDA versions
#3309 Prepare Spark 3.2.0 related changes
#3289 Support for ArrayTransform
#3307 Fix generation of null scalars in tests
#3306 Update guava to be 30.0-jre
#3304 Fix nested cast type checks
#3302 Fix shim aggregator dependencies when snapshot-shims profile provided
#3291 Bump guava from 28.0-jre to 29.0-jre in /tests
#3292 Bump guava from 28.0-jre to 29.0-jre in /integration_tests
#3293 Bump guava from 28.0-jre to 29.0-jre in /udf-compiler
#3294 Update Qualification and Profiling tool documentation for gh-pages
#3282 Test for current_date, current_timestamp and now
#3298 Minor parent pom fixes
#3296 Support map type in case when expression
#3295 Rename pytest 'slow_test' tag as 'premerge_ci_1' to avoid confusion
#3274 Add m2 cache to fast premerge build
#3283 Fix ClassCastException for unsupported TypedImperativeAggregate functions
#3251 CreateMap support for multiple key-value pairs
#3234 Parquet support for MapType
#3277 Build changes for Spark 3.0.3, 3.0.4, 3.1.1, 3.1.2, 3.1.3, 3.1.1cdh and 3.0.1emr
#3275 Improve over-estimating for ORC coalescing reading
#3280 Update project URL to the public doc website
#3285 Qualification tool: Check for metadata being null
#3281 Decrease parallelism for pre-merge pod to avoid potential OOM kill
#3264 Add parallel support to nightly spark standalone tests
#3257 Add maven compile/package plugin executions for Spark302 and Spark301
#3272 Fix Databricks shim build
#3270 Remove reference to old maven-scala-plugin
#3259 Generate docs for AST from checks
#3164 Support Union on Map types
#3261 Fix some typos[skip ci]
#3242 Support for LeftOuter/BuildRight and RightOuter/BuildLeft nested loop joins
#3239 Support decimal type in orc reader
#3258 Add ExecChecks to Databricks shims for RunningWindowFunctionExec
#3230 Initial support for CreateMap on GPU
#3252 Update to new cudf AST API
#3249 Fix typo in Spark311dbShims
#3183 Add TypeSig checks for join keys and other special cases
#3246 Disable test_broadcast_nested_loop_join_condition_missing_count on Databricks
#3241 Split pytest by 'slow_test' tag and run from different k8s pods to reduce premerge job duration
#3184 Support broadcast nested loop join for LeftSemi and LeftAnti
#3236 Fix Scaladoc warnings in GpuScalaUDF and BufferSendState
#2846 default rmm alloc fraction to the max to avoid unnecessary fragmentation
#3231 Fix some resource leaks in GpuCast and RapidsShuffleServerSuite
#3179 Support GpuFirst/GpuLast on more data types
#3228 Fix unreachable code warnings in GpuCast
#3200 Enable a smoke test for UCX in pre-merge
#3203 Fix Parquet test_round_trip to avoid CPU write exception
#3220 Use LongRangeGen instead of IntegerGen
#3218 Add UCX 1.11.0 to the pre-merge Docker image
#3204 Decrease parallelism for pre-merge integration tests
#3212 Fix merge conflict 3211 [skip ci]
#3188 Exclude slf4j classes from the spark-rapids jar
#3189 Disable snapshot shims by default
#3178 Fix hash_aggregate test failures due to TypedImperativeAggregate
#3190 Update GpuInSet for SPARK-35422 changes
#3193 Append res-life to blossom-ci whitelist [skip ci]
#3175 Add in support for explode on maps
#3171 Refine upload log stage naming in workflow file [skip ci]
#3173 Profile tool: Fix reporting app contains Dataset
#3165 Add optional projection via AST expression evaluation
#3113 Fix order of operations when using mkString in typeConversionInfo
#3161 Rework Profile tool to not require Spark to run and process files faster
#3169 Fix auto-merge conflict 3167 [skip ci]
#3162 Add in more generalized support for casting nested types
#3158 Enable joins on nested structs
#3099 Decimal_128 type checks
#3155 Simple nested additions v2
#2728 Support string repeat SQL
#3148 Updated RunningWindow to support extended types too
#3112 Qualification tool: Add conjunction and disjunction filters
#3117 First pass at enabling structs, arrays, and maps for more parts of the plan
#3109 Cudf agg type changes
#2971 Support GpuCollectList and GpuCollectSet as TypedImperativeAggregate
#3107 Add setting to enable/disable RAPIDS Shuffle Manager dynamically
#3105 Add filter in query plan for conditional nested loop and cartesian joins
#3096 add spark311db GpuSortMergeJoinExec conditional joins filter
#3086 Fix Support of MapType in joins on Databricks
#3089 Add filter node in the query plan for conditional joins
#3074 Partial support for time windows
#3061 Support Union on Struct of Map
#3034 Support Sort on nested struct
#3011 Support MapType in joins
#3031 add doc for PR status checks [skip ci]
#3028 Enable parallel build for pre-merge job to reduce overall duration [skip ci]
#3025 Qualification tool: Add regex and username filters.
#2980 Init version 21.10.0
#3000 Merge branch-21.08 to branch-21.10

Release 21.08.1

Bugs Fixed

#3350 [BUG] Qualification tool: check for metadata being null

PRs

#3351 Update changelog for tools v21.08.1 release [skip CI]
#3348 Change tool version to 21.08.1 [skip ci]
#3343 Qualification tool backport: Check for metadata being null (#3285)

Release 21.08

Features

#1584 [FEA] Support rank as window function
#1859 [FEA] Optimize row_number/rank for memory usage
#2976 [FEA] support for arrays in BroadcastNestedLoopJoinExec and CartesianProductExec
#2398 [FEA] GpuIf and GpuCoalesce supports ArrayType
#2445 [FEA] Support literal arrays in case/when statements
#2757 [FEA] Profiling tool display input data types
#2860 [FEA] Minimal support for LEGACY timeParserPolicy
#2693 [FEA] Profiling Tool: Print GDS + UCX related parameters
#2334 [FEA] Record GPU time and Fetch time separately, instead of recording Total Time
#2685 [FEA] Profiling compare mode for table SQL Duration and Executor CPU Time Percent
#2742 [FEA] include App Name from profiling tool output
#2712 [FEA] Display job and stage info in the dot graph for profiling tool
#2562 [FEA] Implement KnownNotNull on the GPU
#2557 [FEA] support sort_array on GPU
#2307 [FEA] Enable Parquet writing for arrays
#1856 [FEA] Create a batch chunking iterator and integrate it with GpuWindowExec

Performance

#866 [FEA] combine window operations into single call
#2800 [FEA] Support ORC small files coalescing reading
#737 [FEA] handle peer timeouts in shuffle
#1590 Rapids Shuffle - UcpListener
#2275 [FEA] UCP error callback deal with cleanup
#2799 [FEA] Support ORC multi-file cloud reading

Bugs Fixed

#3135 [BUG] Regression seen in concatenate in NDS with RAPIDS Shuffle Manager enabled
#3017 [BUG] orc_write_test failed in databricks runtime
#3060 [BUG] ORC read can corrupt data when specified schema does not match file schema ordering
#3065 [BUG] window exec tries to do too much on the GPU
#3066 [BUG] Profiling tool generate dot file fails to convert
#3038 [BUG] leak in getDeviceMemoryBuffer for the unspill case
#3007 [BUG] data mess up reading from ORC
#3029 [BUG] udf_test failed in ucx standalone env
#2723 [BUG] test failures in CI build (observed in UCX job) after starting to use 21.08
#3016 [BUG] databricks script failed to return correct exit code
#3002 [BUG] writing parquet with partitionBy() loses sort order
#2959 [BUG] Resolve common code source incompatibility with supported Spark versions
#2589 [BUG] RapidsShuffleHeartbeatManager needs to remove executors that are stale
#2964 [BUG] IGNORE ORDER, WITH DECIMALS: [Window] [MIXED WINDOW SPECS] FAILED in spark 3.0.3+
#2942 [BUG] Cache of Array using ParquetCachedBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR"
#2965 [BUG] test_round_robin_sort_fallback failed with ValueError: 'a_1' is not in list
#2891 [BUG] Discrepancy in getting count before and after caching
#2972 [BUG] When using timeout option(-t) of qualification tool, it does not print anything in output after timeout.
#2958 [BUG] When AQE=on, SMJ with a Map in SELECTed list fails with "key not found: numPartitions"
#2929 [BUG] No validation of format strings when formatting dates in legacy timeParserPolicy mode
#2900 [BUG] CAST string to float/double produces incorrect results in some cases
#2957 [BUG] Builds failing due to breaking changes in SPARK-36034
#2901 [BUG] GpuCompressedColumnVector cannot be cast to GpuColumnVector with AQE
#2899 [BUG] CAST string to integer produces incorrect results in some cases
#2937 [BUG] Fix more edge cases when parsing dates in legacy timeParserPolicy
#2939 [BUG] Window integration tests failing with Lead expected at least 3 but found 0
#2912 [BUG] Profiling compare mode fails when comparing spark 2 eventlog to spark 3 event log
#2892 [BUG] UCX error Message truncated observed with UCX 1.11 RC in Q77 NDS
#2807 [BUG] Use UCP_AM_FLAG_WHOLE_MSG and UCP_AM_FLAG_PERSISTENT_DATA for receive handlers
#2930 [BUG] Profiling tool does not show "Potential Problems" for dataset API in section "SQL Duration and Executor CPU Time Percent"
#2902 [BUG] CAST string to bool produces incorrect results in some cases
#2850 [BUG] "java.io.InterruptedIOException: getFileStatus on s3a://xxx" for ORC reading in Databricks 8.2 runtime
#2856 [BUG] cache of struct does not work on databricks 8.2ML
#2790 [BUG] In Comparison mode health check does not show the application id
#2713 [BUG] profiling tool does not error or warn if incompatible options are given
#2477 [BUG] test_single_sort_in_part is failed in nightly UCX and AQE (no UCX) integration
#2868 [BUG] to_date produces wrong value on GPU for some corner cases
#2907 [BUG] incorrect expression to detect previously set --master
#2893 [BUG] TransferRequest request transactions are getting leaked
#120 [BUG] GPU InitCap supports too much white space.
#2786 [BUG][initCap function]There is an issue converting the uppercase character to lowercase on GPU.
#2754 [BUG] cudf_udf tests failed w/ 21.08
#2820 [BUG] Metrics are inconsistent for GpuRowToColumnarToExec
#2710 [BUG] dot file generation can go over the limits of dot
#2772 [BUG] new integration test failures w/ maxFailures=1
#2739 [BUG] CBO causes less efficient plan for NDS q84
#2717 [BUG] CBO forces joins back onto CPU in some cases
#2718 [BUG] CBO falls back to CPU to write to Parquet in some cases
#2692 [BUG] Profiling tool: Add error handling for comparison functions
#2711 [BUG] reused stages should not appear multiple times in dot
#2746 [BUG] test_single_nested_sort_in_part integration test failure 21.08
#2690 [BUG] Profiling tool doesn't properly read rolled log files
#2546 [BUG] Build Failure when building from source
#2750 [BUG] nightly test failed with lists: testStringReplaceWithBackrefs
#2644 [BUG] test event logs should be compressed
#2725 [BUG] Heartbeat from unknown executor when running with UCX shuffle in local mode
#2715 [BUG] Part of the plan is not columnar class com.databricks.sql.execution.window.RunningWindowFunc
#2521 [BUG] cudf_udf failed in all spark release intermittently
#1712 [BUG] Scala UDF compiler can decompile UDFs with RAPIDS implementation

PRs

#3216 Update changelog to include download doc update [skip ci]
#3214 Update download and databricks doc for 21.06.2 [skip ci]
#3210 Update 21.08.0 changelog to latest [skip ci]
#3197 Databricks parquetFilters api change in db 8.2 runtime
#3168 Update 21.08 changelog to latest [skip ci]
#3146 update cudf Java binding version to 21.08.2
#3080 Update docs for 21.08 release
#3136 Update tool docs to explain default filesystem [skip ci]
#3128 Fix merge conflict 3126 from branch-21.06 [skip ci]
#3124 Fix merge conflict 3122 from branch-21.06 [skip ci]
#3100 Update databricks 3.0.1 shim to new ParquetFilter api
#3083 Initial CHANGELOG.md update for 21.08
#3079 Remove the struct support in ORC reader
#3062 Fix ORC read corruption when specified schema does not match file order
#3064 Tweak scaladoc to callout the GDS+unspill case in copyBuffer
#3049 Handle mmap exception more gracefully in RapidsShuffleServer
#3067 Update to UCX 1.11.0
#3024 Check validity of any() or all() results that could be null
#3069 Fall back to the CPU on window partition by struct or array
#3068 Profiling tool generate dot file fails on unescaped html characters
#3048 Apply unique committer job ID fix from SPARK-33230
#3050 Updates for google analytics [skip ci]
#3015 Fix ORC read error when read schema reorders file schema columns
#3053 cherry-pick #3028 [skip ci]
#2887 ORC reader supports struct
#3032 Add disorder read schema test case for Parquet
#3022 Add in docs to describe window performance
#3018 [BUG] fix db script hides error issue
#2953 Add in support for rank and dense_rank
#3009 Propagate child output ordering in GpuCoalesceBatches
#2989 Re-enable Array support in Cartesian Joins, Broadcast Nested Loop Joins
#2999 Remove unused configuration setting spark.rapids.sql.castStringToInteger.enabled
#2967 Resolve hidden source incompatibility between Spark30x and Spark31x Shims
#2982 Add FAQ entry for timezone error
#2839 GpuIf and GpuCoalesce support array and struct types
#2987 Update documentation for unsupported edge cases when casting from string to timestamp
#2977 Expire executors from the RAPIDS shuffle heartbeat manager on timeout
#2985 Move tools README to docs/additional-functionality/qualification-profiling-tools.md with some modification
#2992 Remove commented/redundant window-function tests.
#2994 Tweak RAPIDS Shuffle Manager configs for 21.08
#2984 Avoid comparing window range canonicalized plans on Spark 3.0.x
#2970 Put the GPU data back on host before processing cache on CPU
#2986 Avoid struct aliasing in test_round_robin_sort_fallback
#2935 Read the complete batch before returning when selectedAttributes is empty
#2826 CaseWhen supports scalar of list and struct
#2978 enable auto-merge from branch 21.08 to 21.10 [skip ci]
#2946 ORC reader supports list
#2947 Qualification tool: Filter based on timestamp in event logs
#2973 Assert that CPU and GPU row fields match when present
#2974 Qualification tool: fix performance regression
#2948 Remove unnecessary copies of ParquetCachedBatchSerializer
#2968 Fix AQE CustomShuffleReaderExec not seeing ShuffleQueryStageExec
#2969 Make the dir for spark301 shuffle shim match package name
#2933 Improve CAST string to float implementation to handle more edge cases
#2963 Add override getParquetFilters for shim 304
#2956 Profile Tool: make order consistent between runs
#2924 Fix bug when collecting directly from a GPU shuffle query stage with AQE on
#2950 Fix shutdown bugs in the RAPIDS Shuffle Manager
#2922 Improve UCX assertion to show the failed assertion
#2961 Fix ParquetFilters issue
#2951 Qualification tool: Allow app start and app name filtering and test with filesystem filters
#2941 Make test event log compression codec configurable
#2919 Fix bugs in CAST string to integer
#2944 Fix childExprs list for GpuWindowExpression, for Spark 3.1.x.
#2917 Refine GpuHashAggregateExec.setupReference
#2909 Support orc coalescing reading
#2938 Qualification tool: Add negation filter
#2940 qualification tool: add filtering by app start time
#2928 Qualification tool support recognizing decimal operations
#2934 Qualification tool: Add filter based on appName
#2904 Qualification and Profiling tool handle Read formats and datatypes
#2927 Restore aggregation sorted data hint
#2932 Profiling tool: Fix comparing spark2 and spark3 event logs
#2926 GPU Active Messages for all buffer types
#2888 Type check with the information from RapidsMeta
#2903 Fix cast string to bool
#2895 Add in running window optimization using scan
#2859 Add spillable batch caching and sort fallback to hash aggregation
#2898 Add fuzz tests for cast from string to other types
#2881 fix orc readers leak issue for ORC PERFILE type
#2842 Support STRUCT/STRING for LEAD()/LAG()
#2880 Added ParquetCachedBatchSerializer support for Databricks
#2911 Add in ID as sort for Job + Stage level aggregated task metrics
#2914 Profiling tool: add app index to tables that don't have it
#2906 Fix compiler warning
#2890 Fix cast to date bug
#2908 Fixes bad string contains in run_pyspark_from_build
#2886 Use UCP Listener for UCX connections and enable peer error handling
#2875 Add support for timeParserPolicy=LEGACY
#2894 Fixes a JVM leak for UCX TransactionRequests
#2854 Qualification Tool to output only the 'k' highest-ranked or 'k' lowest-ranked applications
#2873 Fix infinite loop in MultiFileCloudPartitionReaderBase
#2838 Replace toTitle with capitalize for GpuInitCap
#2870 Avoid readers acquiring GPU on next batch query if not first batch
#2882 Refactor window operations to do them in the exec
#2874 Update audit script to clone branch-3.2 instead of master
#2843 Qualification/Profiling tool add tests for Spark2 event logs
#2828 add cloud reading for orc
#2721 Check-list for corner cases in testing.
#2675 Support for Decimals with negative scale for Parquet Cached Batch Serializer
#2849 Update release notes to include qualification and profiling tool
#2852 Fix hash aggregate tests leaking configs into other tests
#2845 Split window exec into multiple stages if needed
#2853 Tag last batch when coalescing
#2851 Fix build failure - update ucx profiling test to fix parameter type to getEventLogInfo
#2785 Profiling tool: Print UCX and GDS parameters
#2840 Fix Gpu -> GPU
#2844 Document Qualification tool Spark requirements
#2787 Add metrics definition link to tool README.md[skip ci]
#2841 Add a threadpool to Qualification tool to process logs in parallel
#2833 Stop running so many versions of Spark unit tests for premerge
#2837 Append new authorized user to blossom-ci whitelist [skip ci]
#2822 Rewrite Qualification tool for better performance
#2823 Add semaphoreWaitTime and gpuOpTime for GpuRowToColumnarExec
#2829 Fix filtering directories on compression extension match
#2720 Add metrics documentation to the tuning guide
#2816 Improve some existing collectTime handling
#2821 Truncate long plan labels and refer to "print-plans"
#2827 Update cmake to build udf native [skip ci]
#2793 Report equivilant stages/sql ids as a part of compare
#2810 Use SecureRandom for UCPListener TCP port choice
#2798 Mirror apache repos to urm
#2788 Update the type signatures for some expressions
#2792 Automatically set spark.task.maxFailures and local[*, maxFailures]
#2805 Revert "Use UCX Active Messages for all shuffle transfers (#2735)"
#2796 show disk bytes spilled when GDS spill is enabled
#2801 Update pre-merge to use reserved_pool [skip ci]
#2795 Improve CBO debug logging
#2794 Prevent integer overflow when estimating data sizes in cost-based optimizer
#2784 Make spark303 shim version w/o snapshot and add shim layer for spark304
#2744 Cost-based optimizer: Implement simple cost model that demonstrates benefits with NDS queries
#2762 Profiling tool: Update comparison mode output format and add error handling
#2761 Update dot graph to include stages and remove some duplication
#2760 Add in application timeline to profiling tool
#2735 Use UCX Active Messages for all shuffle transfers
#2732 qualification and profiling tool support rolled and compressed event logs for CSPs and Apache Spark
#2768 Make window function test results deterministic.
#2769 Add developer documentation for Adaptive Query Execution
#2532 date_format should not suggest enabling incompatibleDateFormats for formats we cannot support
#2743 Disable dynamicAllocation and set maxFailures to 1 in integration tests
#2749 Revert "Add in support for lists in some joins (#2702)"
#2181 abstract the parquet coalescing reading
#2753 Merge branch-21.06 to branch-21.08 [skip ci]
#2751 remove invalid blossom-ci users [skip ci]
#2707 Support KnownNotNull running on GPU
#2747 Fix num_slices for test_single_nested_sort_in_part
#2729 fix 301db-shim typecheck typo
#2726 Fix local mode starting RAPIDS shuffle heartbeats
#2722 Support aggregation on NullType in RunningWindowExec
#2719 Avoid executing child plan twice in CoalesceExec
#2586 Update metrics use in GpuUnionExec and GpuCoalesceExec
#2716 Add file size check to pre-merge CI
#2554 Upload build failure log to Github for external contributors access
#2596 Initial running window memory optimization
#2702 Add in support for arrays in BroadcastNestedLoopJoinExec and CartesianProductExec
#2699 Add a pre-commit hook to reject large files
#2700 Set numSlices and use parallelize to build dataframe for partition-se…
#2548 support collect_set in rolling window
#2661 Make tools inherit common dependency versions from parent pom
#2668 Remove CUDA 10.x from getting started guide [skip ci]
#2676 Profiling tool: Print Job Information in compare mode
#2679 Merge branch-21.06 to branch-21.08 [skip ci]
#2677 Add pre-merge independent stage timeout [skip ci]
#2616 support GpuSortArray
#2582 support parquet write arrays
#2609 Fix automerge failure from branch-21.06 to branch-21.08
#2570 Added nested structs to UnionExec
#2581 Fix merge conflict 2580 [skip ci]
#2458 Split batch by key for window operations
#2565 Merge branch-21.06 into branch-21.08
#2563 Document: git commit twice when copyright year updated by hook
#2561 Fixing the merge of 21.06 to 21.08 for comment changes in Profiling tool
#2558 Fix cdh shim version in 21.08 [skip ci]
#2543 Init branch-21.08

Release 21.06.2

Bugs Fixed

#3191 [BUG] Databricks parquetFilters build failure in db 8.2 runtime

PRs

#3209 Update 21.06.2 changelog [skip ci]
#3208 Update rapids plugin version to 21.06.2 [skip ci]
#3207 Disable auto-merge from 21.06 to 21.08 [skip ci]
#3205 Branch 21.06 databricks update [skip ci]
#3198 Databricks parquetFilters api change in db 8.2 runtime

Release 21.06.1

Bugs Fixed

#3098 [BUG] Databricks parquetFilters build failure

PRs

#3127 Update CHANGELOG for the release v21.06.1 [skip ci]
#3123 Update rapids plugin version to 21.06.1 [skip ci]
#3118 Fix databricks 3.0.1 for ParquetFilters api change
#3119 Branch 21.06 databricks update [skip ci]

Release 21.06

Features

#2483 [FEA] Profiling and qualification tool
#951 [FEA] Create Cloudera shim layer
#2481 [FEA] Support Spark 3.1.2
#2530 [FEA] Add support for Struct columns in CoalesceExec
#2512 [FEA] Report gpuOpTime not totalTime for expand, generate, and range execs
#63 [FEA] support ConcatWs sql function
#2501 [FEA] Add support for scalar structs to named_struct
#2286 [FEA] update UCX documentation for branch 21.06
#2436 [FEA] Support nested types in CreateNamedStruct
#2461 [FEA] Report gpuOpTime instead of totalTime for project, filter, window, limit
#2465 [FEA] GpuFilterExec should report gpuOpTime not totalTime
#2013 [FEA] Support concatenating ArrayType columns
#2425 [FEA] Support for casting array of floats to array of doubles
#2012 [FEA] Support Window functions(lead & lag) for ArrayType
#2011 [FEA] Support creation of 2D array type
#1582 [FEA] Allow StructType as input and output type to InMemoryTableScan and InMemoryRelation
#216 [FEA] Range window-functions must support non-timestamp order-by expressions
#2390 [FEA] CI/CD for databricks 8.2 runtime
#2273 [FEA] Enable struct type columns for GpuHashAggregateExec
#20 [FEA] Support out of core joins
#2160 [FEA] Support Databricks 8.2 ML Runtime
#2330 [FEA] Enable hash partitioning with arrays
#1103 [FEA] Support date_format on GPU
#1125 [FEA] explode() can take expressions that generate arrays
#1605 [FEA] Support sorting on struct type keys

Performance

#1445 [FEA] GDS Integration
#1588 Rapids shuffle - UCX active messages
#2367 [FEA] CBO: Implement costs for memory access and launching kernels
#2431 [FEA] CBO should show benefits with q24b with decimals enabled

Bugs Fixed

#2652 [BUG] No Job Found. Exiting.
#2659 [FEA] Group profiling tool "Potential Problems"
#2680 [BUG] cast can throw NPE
#2628 [BUG] failed to build plugin in databricks runtime 8.2
#2605 [BUG] test_pandas_map_udf_nested_type failed in Yarn integration
#2622 [BUG] compressed event logs are not processed
#2478 [BUG] When tasks complete, cancel pending UCX requests
#1953 [BUG] Could not allocate native memory when running DLRM ETL with --output_ordering input on A100
#2495 [BUG] scaladoc warning GpuParquetScan.scala:727 "discarding unmoored doc comment"
#2368 [BUG] Mismatched number of columns while performing GpuSort
#2407 [BUG] test_round_robin_sort_fallback failed
#2497 [BUG] GpuExec failed to find metric totalTime in databricks env
#2473 [BUG] enable test_window_aggs_for_rows_lead_lag_on_arrays and make the order unambiguous
#2489 [BUG] Queries with window expressions fail when cost-based optimizer is enabled
#2457 [BUG] test_window_aggs_for_rows_lead_lag_on_arrays failed
#2371 [BUG] Performance regression for crossjoin on 0.6 comparing to 0.5
#2372 [BUG] FAILED ../../src/main/python/udf_cudf_test.py::test_window
#2404 [BUG] test_hash_pivot_groupby_nan_fallback failed on Dataproc
#2474 [BUG] when ucp listener enabled we bind 16 times always
#2427 [BUG] test_union_struct_missing_children[(Struct(not_null) failed in databricks310 and spark 311
#2455 [BUG] CaseWhen crashes on literal arrays
#2421 [BUG] NPE when running mapInPandas Pandas UDF in 0.5GA
#2428 [BUG] Intermittent ValueError in test_struct_groupby_count
#1628 [BUG] TPC-DS-like query 24a and 24b at scale=3TB fails with OOM
#2276 [BUG] SPARK-33386 - ansi-mode changed ElementAt/Elt/GetArray behavior in Spark 3.1.1 - fallback to cpu
#2309 [BUG] legacy cast of a struct column to string with a single nested null column yields null instead of '[]'
#2315 [BUG] legacy struct cast to string crashes on a two field struct
#2406 [BUG] test_struct_groupby_count failed
#2378 [BUG] java.lang.ClassCastException: GpuCompressedColumnVector cannot be cast to GpuColumnVector
#2355 [BUG] convertDecimal64ToDecimal32Wrapper leaks ColumnView instances
#2346 [BUG] segfault when using UcpListener in TCP-only setup
#2364 [BUG] qa_nightly_select_test.py::test_select integration test fails
#2302 [BUG] Int96 are not being written as expected
#2359 [BUG] Alias is different in spark 3.1.0 but our canonicalization code doesn't handle
#2277 [BUG] spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED or LEGACY still fails to read LEGACY date from parquet
#2320 [BUG] TypeChecks diagnostics outputs column ids instead of unsupported types
#2238 [BUG] Unnecessary to cache the batches that will be sent to Python in FlatMapGroupInPandas.
#1811 [BUG] window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy failed

PRs

#2817 Update changelog for v21.06.0 release [skip ci]
#2806 Noted testing for A10, noted that min driver ver is HW specific
#2797 Update documentation for InitCap incompatibility
#2774 Update changelog for 21.06 release [skip ci]
#2770 [Doc] add more for Alluxio page [skip ci]
#2745 Add link to Mellanox RoCE documentation and mention --without-ucx installation option
#2740 Update cudf Java bindings to 21.06.1
#2664 Update changelog for 21.06 release [skip ci]
#2697 fix GDS spill bug when copying from the batch write buffer
#2691 Update properties to check if table there
#2687 Remove CUDA 10.x from getting started guide (#2668)
#2686 Profiling tool: Print Job Information in compare mode
#2657 Print CPU and GPU output when _assert_equal fails to help debug given…
#2681 Avoid NPE when casting empty strings to ints
#2669 Fix multiple problems reported and improve error handling
#2666 [DOC]Update custom image guide in GCP dataproc to reduce cluster startup time
#2665 Update docs to move RAPIDS Shuffle out of beta [skip ci]
#2671 Clean profiling&qualification tool README
#2673 Profiling tool: Enable tests and update compressed event log
#2672 Update cudfjni dependency version to 21.06.0
#2663 Qualification tool - add in estimating the App end time when the event log missing application end event
#2600 Accelerate RunningWindow queries on GPU
#2651 Profiling tool - fix reporting contains dataset when sql time 0
#2623 Fixed minor mistakes in documentation
#2631 Update docs for Databricks 8.2 ML
#2638 Add an init script for databricks 7.3ML with CUDA11.0 installed
#2643 Profiling tool: Health check follow on
#2640 Add physical plan to the dot file as the graph label
#2637 Fix databricks for 3.1.1
#2577 Update download.md and FAQ.md for 21.06.0
#2636 Profiling tool - Fix file writer for generating dot graphs, supporting writing sql plans to a file, change output to subdirectory
#2625 Exclude failed jobs/queries from Qualification tool output
#2626 Enable processing of compressed Spark event logs
#2632 Profiling tool: Add support for health check.
#2627 Ignore order for map udf test
#2620 Change aggregation of executor CPU and run time for Qualification tool to speed up query
#2618 Correct an issue for README for tools and also correct s3 solution in Args.scala
#2612 Profiling tool, add in job to stage, duration, executor cpu time, fix writing to HDFS
#2614 change rapids-4-spark-tools directory to tools in deploy script [skip ci]
#2611 Revert "disable cudf_udf tests for #2521"
#2604 Profile/qualification tool error handling improvements and support spark < 3.1.1
#2598 Rename rapids-4-spark-tools directory to tools
#2576 Add filter support for qualification and profiling tool.
#2603 Add the doc for -g option of the profiling tool.
#2594 Change the README of the qualification and profiling tool to match the current version.
#2591 Implement test for qualification tool sql metric aggregates
#2590 Profiling tool support for collection and analysis
#2587 Handle UCX connection timeouts from heartbeats more gracefully
#2588 Fix package name
#2574 Add Qualification tool support
#2571 Change test_single_sort_in_part to print source data frame on failure
#2569 Remove -SNAPSHOT in documentation in preparation for release
#2429 Change RMM_ALLOC_FRACTION to represent percentage of available memory, rather than total memory, for initial allocation
#2553 Cancel requests that are queued for a client/handler on error
#2566 expose unspill config option
#2460 align GDS reads/writes to 4 KiB
#2515 Remove fetchTime and standardize on collectTime
#2523 Not compile RapidsUDF when udf compiler is enabled
#2538 Fixed code indentation in ParquetCachedBatchSerializer
#2559 Release profiling tool jar to maven central
#2423 Add cloudera shim layer
#2520 Add event logs for integration tests
#2525 support interval.microseconds for range window TimeStampType
#2536 Don't do an extra shuffle in some TopN cases
#2508 Refactor the code for conditional expressions
#2542 enable auto-merge from 21.06 to 21.08 [skip ci]
#2540 Update spark 312 shim, and Add spark 313-SNAPSHOT shim
#2539 disable cudf_udf tests for #2521
#2514 Add Struct support for ParquetWriter
#2534 Remove scaladoc on an internal method to avoid warning during build
#2537 Add CentOS documentation and improve dockerfiles for UCX
#2531 Add nested types and decimals to CoalesceExec
#2513 Report opTime not totalTime for expand, range, and generate execs
#2533 Fix concat_ws test specifying only a separator for databricks
#2528 Make GenerateDot test more robust
#2529 Change Databricks 310 shim to be 311 to match reported spark.version
#2479 Support concat with separator on GPU
#2507 Improve test coverage for sorting structs
#2526 Improve debug print to include addresses and null counts
#2463 Add EMR 6.3 documentation
#2516 Avoid listener race collecting wrong plan in assert_gpu_fallback_collect
#2505 Qualification tool updates for datasets, udf, and misc fixes
#2509 Added in basic support for scalar structs to named_struct
#2449 Add code for generating dot file visualizations
#2475 Update shuffle documentation for branch-21.06 and UCX 1.10.1
#2500 Update Dockerfile for native UDF
#2506 Support creating Scalars/ColumnVectors from utf8 strings directly.
#2502 Remove work around for nulls in semi-anti joins
#2503 Remove temporary logging and adjust test column names
#2499 Fix regression in TOTAL_TIME metrics for Databricks
#2498 Add in basic support for scalar maps and allow nesting in named_struct
#2496 Add comments for lazy binding in WindowInPandas
#2493 improve window agg test for range numeric types
#2491 Fix regression in cost-based optimizer when calculating cost for Window operations
#2482 Window tests with smaller batches
#2490 Add temporary logging for Dataproc round robin fallback issue
#2486 Remove the null replacement in computePredicate
#2469 Adding additional functionalities to profiling tool
#2462 Report gpuOpTime instead of totalTime for project, filter, limit, and window
#2484 Fix the failing test test_window on Databricks
#2472 Fix hash_aggregate_test
#2476 Fix for UCP Listener created spark.port.maxRetries times
#2471 skip test_window_aggs_for_rows_lead_lag_on_arrays
#2446 Update plugin version to 21.06.0
#2409 Change shuffle metadata messages to use UCX Active Messages
#2397 Include memory access costs in cost models (cost-based optimizer)
#2442 fix GpuCreateNamedStruct not serializable issue
#2379 support GpuConcat on ArrayType
#2456 Fall back to the CPU for literal array values on case/when
#2447 Filter out the nulls after slicing the batches.
#2426 Implement cast of nested arrays
#2299 support creating array of array
#2451 Update tuning docs to add batch size recommendations.
#2435 support lead/lag on arrays
#2448 support creating list ColumnVector for Literal(ArrayType(NullType))
#2402 Add profiling tool
#2313 Supports GpuLiteral of array type

Older Releases

Changelog of older releases can be found at docs/archives