Serializable transaction failures are not retried on the server #1202

@george-zubrienko


Describe the bug

After this change was introduced, the web host often throws the following error when it receives multiple concurrent requests to create or drop tables, or even just to read catalog info:

Caused by: java.sql.SQLException: Query failed (#20250319_060102_06461_43kzu): Failed to drop table 'staging_custinvoicejour__2025_03_19_06_01_01_3f698ce2_58ab_4f81_8892_66e014a7a927'
	at io.trino.jdbc.ResultUtils.resultsException(ResultUtils.java:33)
	at io.trino.jdbc.AsyncResultIterator.lambda$new$1(AsyncResultIterator.java:93)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
	Suppressed: zio.Cause$FiberTrace: Exception in thread "zio-fiber-1997042971" java.sql.SQLException: Query failed (#20250319_060102_06461_43kzu): Failed to drop table 'staging_custinvoicejour__2025_03_19_06_01_01_3f698ce2_58ab_4f81_8892_66e014a7a927'
	at com.sneaksanddata.arcane.framework.services.merging.JdbcMergeServiceClient.executeBatchQuery(JdbcMergeServiceClient.scala:233)
	at com.sneaksanddata.arcane.framework.services.merging.JdbcMergeServiceClient.executeBatchQuery(JdbcMergeServiceClient.scala:234)
	at com.sneaksanddata.arcane.framework.services.merging.JdbcMergeServiceClient.executeBatchQuery(JdbcMergeServiceClient.scala:235)
	at com.sneaksanddata.arcane.framework.services.streaming.processors.batch_processors.DisposeBatchProcessor.process(DisposeBatchProcessor.scala:25)
	at com.sneaksanddata.arcane.framework.services.streaming.processors.batch_processors.DisposeBatchProcessor.process(DisposeBatchProcessor.scala:26)
	at com.sneaksanddata.arcane.framework.services.streaming.processors.batch_processors.DisposeBatchProcessor.process(DisposeBatchProcessor.scala:27)
	at com.sneaksanddata.arcane.microsoft_synapse_link.services.app.StreamRunnerServiceCdm.run(StreamRunnerServiceCdm.scala:45)
	Suppressed: io.trino.jdbc.$internal.client.FailureException: Failed to drop table 'staging_custinvoicejour__2025_03_19_06_01_01_3f698ce2_58ab_4f81_8892_66e014a7a927'
		Suppressed: io.trino.jdbc.$internal.client.FailureException: Server error: PersistenceException: Exception [EclipseLink-4002] (Eclipse Persistence Services - 4.0.5.v202412231137-a96b873527f305f932543045c8679bb1de8d3a43): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: org.postgresql.util.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
  Detail: Reason code: Canceled on identification as a pivot, during conflict out checking.
  Hint: The transaction might succeed if retried.
Error Code: 0
Call: SELECT CATALOGID, ID, ENTITYVERSION, GRANTRECORDSVERSION, VERSION FROM ENTITIES_CHANGE_TRACKING WHERE ((CATALOGID = ?) AND (ID = ?))
	bind => [2 parameters bound]
Query: ReadObjectQuery(referenceClass=ModelEntityChangeTracking sql="SELECT CATALOGID, ID, ENTITYVERSION, GRANTRECORDSVERSION, VERSION FROM ENTITIES_CHANGE_TRACKING WHERE ((CATALOGID = ?) AND (ID = ?))")
Caused by: io.trino.jdbc.$internal.client.FailureException: Failed to drop table 'staging_custinvoicejour__2025_03_19_06_01_01_3f698ce2_58ab_4f81_8892_66e014a7a927'
	at io.trino.plugin.iceberg.catalog.rest.TrinoRestCatalog.dropTable(TrinoRestCatalog.java:467)
	at io.trino.plugin.iceberg.IcebergMetadata.dropTable(IcebergMetadata.java:2391)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.dropTable(ClassLoaderSafeConnectorMetadata.java:452)
	at io.trino.tracing.TracingConnectorMetadata.dropTable(TracingConnectorMetadata.java:388)
	at io.trino.metadata.MetadataManager.dropTable(MetadataManager.java:1062)

To Reproduce

  1. Deploy Polaris 0.9 with a Postgres 15.10 metastore backend, using the persistence.xml shown below:
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
  <persistence-unit name="polaris" transaction-type="RESOURCE_LOCAL">
    <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
    <class>org.apache.polaris.jpa.models.ModelEntity</class>
    <class>org.apache.polaris.jpa.models.ModelEntityActive</class>
    <class>org.apache.polaris.jpa.models.ModelEntityChangeTracking</class>
    <class>org.apache.polaris.jpa.models.ModelEntityDropped</class>
    <class>org.apache.polaris.jpa.models.ModelGrantRecord</class>
    <class>org.apache.polaris.jpa.models.ModelPrincipalSecrets</class>
    <class>org.apache.polaris.jpa.models.ModelSequenceId</class>
    <shared-cache-mode>NONE</shared-cache-mode>
    <properties>
      <property name="jakarta.persistence.jdbc.url"
                value="jdbc:postgresql://..:5432/{realm}"/>
      <property name="jakarta.persistence.jdbc.user" value="..."/>
      <property name="jakarta.persistence.jdbc.password" value="..."/>
      <property name="jakarta.persistence.schema-generation.database.action" value="create"/>
      <property name="eclipselink.logging.level.sql" value="OFF"/>
      <property name="eclipselink.logging.parameters" value="false"/>
      <property name="eclipselink.persistence-context.flush-mode" value="auto"/>
      <property name="eclipselink.connection-pool.default.initial" value="1" />
      <property name="eclipselink.connection-pool.default.min" value="1" />
      <property name="eclipselink.connection-pool.default.max" value="32" />
      <property name="eclipselink.session.customizer" value="org.apache.polaris.extension.persistence.impl.eclipselink.PolarisEclipseLinkSessionCustomizer" />
      <property name="eclipselink.transaction.join-existing" value="true" />
    </properties>
  </persistence-unit>
</persistence>
  2. Deploy the Polaris server via the Helm chart with autoscaling enabled and a minimum of 2 replicas, then bootstrap it with the admin tool of the same version, as described in the docs.
  3. Run 30 parallel statements that create, select from, and drop unique tables for 5 minutes, then observe the failures.
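
The load in the last step could be generated with something like the sketch below. It is only an illustration: `ConcurrentDdlLoad`, `StatementRunner`, the `staging_` table naming, and the single `id BIGINT` column are all hypothetical, and the runner is a placeholder for whatever executes SQL against Trino (for example, a JDBC connection).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical load generator: each worker repeatedly creates, reads, and
// drops its own uniquely named table until the time budget runs out.
final class ConcurrentDdlLoad {
    // Placeholder for the real execution path (e.g. a JDBC Statement).
    interface StatementRunner {
        void run(String sql) throws Exception;
    }

    static final class Result {
        final int succeeded;
        final int failed;

        Result(int succeeded, int failed) {
            this.succeeded = succeeded;
            this.failed = failed;
        }
    }

    static Result run(StatementRunner runner, int workers, long durationMillis)
            throws Exception {
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                futures.add(pool.submit(() -> {
                    do {
                        // Assumed naming scheme and schema, for illustration only.
                        String table = "staging_"
                                + UUID.randomUUID().toString().replace('-', '_');
                        String[] statements = {
                            "CREATE TABLE " + table + " (id BIGINT)",
                            "SELECT * FROM " + table,
                            "DROP TABLE " + table
                        };
                        for (String sql : statements) {
                            try {
                                runner.run(sql);
                                ok.incrementAndGet();
                            } catch (Exception e) {
                                failed.incrementAndGet();
                            }
                        }
                    } while (System.nanoTime() - deadline < 0);
                }));
            }
            // Wait for every worker to finish its last iteration.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return new Result(ok.get(), failed.get());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in runner that only prints; replace with real SQL execution.
        Result r = run(sql -> System.out.println(sql), 2, 0);
        System.out.println(r.succeeded + " succeeded, " + r.failed + " failed");
    }
}
```

For the scenario described here, the call would be roughly `run(runner, 30, 5 * 60 * 1000L)`, with the failure count expected to come mostly from `DROP TABLE` statements.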

Actual Behavior

Around 90% of the statements succeed; DROP statements sometimes fail with the error shown above.

Expected Behavior

No errors are thrown: serialization failures are retried on the server instead of being surfaced to the client.
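
As a rough illustration of the expected behavior, a retry wrapper around the metastore transaction might look like the sketch below. It assumes that the failure can be recognized by PostgreSQL's SQLSTATE `40001` (serialization_failure) somewhere in the cause chain, and that the whole transaction is re-run from the start on each attempt. `SerializationRetry`, `withRetry`, and the backoff parameters are hypothetical names and values, not part of Polaris or EclipseLink.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: retries a unit of work when the database reports a
// serialization failure (SQLSTATE 40001).
final class SerializationRetry {
    static final String SERIALIZATION_FAILURE = "40001";

    // Walks the cause chain, because persistence layers typically wrap the
    // driver's SQLException in their own exception types.
    static boolean isSerializationFailure(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLException
                    && SERIALIZATION_FAILURE.equals(((SQLException) t).getSQLState())) {
                return true;
            }
        }
        return false;
    }

    // Runs the transaction up to maxAttempts times. Only serialization
    // failures are retried; any other exception is rethrown immediately.
    static <T> T withRetry(Callable<T> transaction, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return transaction.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isSerializationFailure(e)) {
                    throw e;
                }
                // Exponential backoff with full jitter, so that conflicting
                // transactions are less likely to collide again.
                long cap = baseDelayMillis << Math.min(attempt - 1, 10);
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            // Simulate two serialization failures before succeeding.
            if (++calls[0] < 3) {
                throw new RuntimeException(
                        new SQLException("could not serialize access", "40001"));
            }
            return "dropped";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The callable has to begin a fresh transaction on every attempt, since the failed one has already been rolled back by the database.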

Additional context

No response

System information

Polaris v0.9, commit from Mar 17, Postgres 15.10 (Aurora), container build deployed on EKS 1.29, from a helm chart built from the same commit.


Labels: bug
