feat(api): time travel query #8517

Conversation

mfatihaktas (Contributor)

Description of changes

Aims to address #8203.

This PR

  • Adds time_travel() for Table (see the usage sketch below).
  • Adds ibis.expr.types.temporal.TimeTravelTable and ops.TimeTravelDatabaseTable.
  • Adds visit_TimeTravelDatabaseTable() in the compiler for BigQuery and Flink, and time_travel_table_sql() in the SQL generator for Flink.
  • Adds tests for time travel in BigQuery and Flink backends.
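
A hypothetical usage sketch of the proposed API, for readers skimming the thread. The connection details and the exact time_travel() signature are illustrative assumptions, not taken from the PR diff:

import datetime

import ibis

con = ibis.bigquery.connect(project_id="my-project")  # assumed connection details
t = con.table("events")

# Query the table as it existed one hour ago.
one_hour_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
snapshot = t.time_travel(one_hour_ago)

# On BigQuery, this would presumably compile to something like:
#   SELECT * FROM `events` FOR SYSTEM_TIME AS OF TIMESTAMP '...'
print(snapshot.limit(5).execute())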

Notes:

  • Could not add a functional test for BigQuery, as I do not have access to a GCP environment to run functional tests.
  • Will add more unit and functional tests, especially for timestamp expressions and timestamps with time zones.

@mfatihaktas mfatihaktas self-assigned this Mar 1, 2024
@mfatihaktas mfatihaktas added the backends Issues related to all backends label Mar 1, 2024
@@ -139,6 +139,8 @@ services:
       AZURE_ABFS_OAUTH_SECRET: ""
       AZURE_ABFS_OAUTH_ENDPOINT: ""
       AZURE_WASB_ACCESS_KEY: ""
+    ports:
+      - 9083:9083
Member

Why was this necessary? If you're looking to expose the hive metastore to the flink container, do that by adding this container to the flink network, not by exposing the port to the host.

mfatihaktas (Contributor, Author)

Thanks. I actually first tried adding networks: - flink to hive-metastore, but that led to

E   Caused by: java.net.ConnectException: Connection refused (Connection refused)

Digging deeper into this, I found that TestConfForStreaming (the TestConf used by the tests in ibis/backends/flink/tests/) creates a local Flink environment, which means the tests do not run in the dockerized environment but locally. This is why we have to expose the metastore port to the host.

I have put the env creation code in a new function get_table_env(remote_env: bool, streaming_mode: bool) to make it more transparent and configurable (sketched after this comment). I then switched to the remote env in TestConfForStreaming to see whether we could run the tests without exposing the metastore port, but that gave me

Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8065e44d3abe8da591f9c169cde351cd)
...
Caused by: java.io.IOException: Failed to create the parent directory: /private/var/folders/9g/7ncp4wjj22q25spxdy7t44080000gn/T/pytest-of-mehmet/pytest-520/test_table0

The error seems to be caused by the Flink client running on a file system that is not shared with the Flink TaskManager. I guess this is why the env was left local in TestConfForStreaming in the first place.
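
A minimal sketch of what the get_table_env() helper mentioned above might look like, built only on the public PyFlink API. How the remote connection is wired in (here via the rest.address/rest.port options) is an assumption; the PR's actual implementation may differ:

from pyflink.table import EnvironmentSettings, TableEnvironment


def get_table_env(*, remote_env: bool, streaming_mode: bool) -> TableEnvironment:
    builder = EnvironmentSettings.new_instance()
    builder = builder.in_streaming_mode() if streaming_mode else builder.in_batch_mode()
    table_env = TableEnvironment.create(builder.build())
    if remote_env:
        # Point the client at a remote JobManager instead of a local
        # mini-cluster. "rest.address"/"rest.port" are standard Flink options,
        # but configuring remote execution this way is an assumption.
        table_env.get_config().set("rest.address", "localhost")
        table_env.get_config().set("rest.port", "8081")
    return table_env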

@pytest.fixture
def use_hive_catalog(con):
    # Flink related
    download_jar_for_package(
Member

Can you do this in the flink dockerfile?

mfatihaktas (Contributor, Author)

Adding JAR files to the container won't help in this case, because the Flink-specific tests do not run in the dockerized env but in the local env, as described above.

mfatihaktas (Contributor, Author)

Downloading JARs is failing on CI:

package_name = 'apache-flink'
jar_name = 'paimon-flink-1.18-0.8-20240301.002155-30'
jar_url = 'https://repository.apache.org/content/groups/snapshots/org/apache/paimon/paimon-flink-1.18/0.8-SNAPSHOT/paimon-flink-1.18-0.8-20240301.002155-30.jar'

    def download_jar_for_package(
        package_name: str,
        jar_name: str,
        jar_url: str,
    ):
        import os
        from importlib import metadata
    
        import requests
    
        # Find the path to package lib
        try:
            distribution = metadata.distribution(package_name)
            lib_path = distribution.locate_file("")
        except metadata.PackageNotFoundError:
            lib_path = None
    
        # Check if the JAR already exists
        jar_path = os.path.join(lib_path, "pyflink/lib", f"{jar_name}.jar")
        if os.path.exists(jar_path):
            return jar_path
    
        # Download the JAR
        response = requests.get(jar_url, stream=True)
        if response.status_code != 200:
>           raise SystemError(
                f"Failed to download the JAR file \n"
                f"\t jar_url= {jar_url} \n"
                f"\t response.status_code= {response.status_code}"
            )
E           SystemError: Failed to download the JAR file 
E           	 jar_url= https://repository.apache.org/content/groups/snapshots/org/apache/paimon/paimon-flink-1.18/0.8-SNAPSHOT/paimon-flink-1.18-0.8-20240301.002155-30.jar 
E           	 response.status_code= 404
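
Separately from the 404, note that the helper's lib_path = None fallback would make the subsequent os.path.join raise a TypeError whenever the package is not installed. A defensive variant, including the write step the traceback truncates (a sketch, not the PR's code):

import os
from importlib import metadata

import requests


def download_jar_for_package(package_name: str, jar_name: str, jar_url: str) -> str:
    # Fail with a clear error instead of falling through with lib_path = None.
    try:
        distribution = metadata.distribution(package_name)
    except metadata.PackageNotFoundError as err:
        raise SystemError(f"Package not found: {package_name}") from err
    lib_path = distribution.locate_file("")

    # Reuse the JAR if a previous run already downloaded it.
    jar_path = os.path.join(lib_path, "pyflink/lib", f"{jar_name}.jar")
    if os.path.exists(jar_path):
        return jar_path

    response = requests.get(jar_url, stream=True)
    if response.status_code != 200:
        raise SystemError(
            f"Failed to download the JAR file\n"
            f"\t jar_url= {jar_url}\n"
            f"\t response.status_code= {response.status_code}"
        )
    with open(jar_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return jar_path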

The "easy" way to deal with this whole thing of downloading JAR's in Python would be skipping these tests by default and leaving them only for manual local testing.

@mfatihaktas mfatihaktas force-pushed the feat/support-time-travel-query branch 3 times, most recently from 38c7e9b to 87b050f on March 21, 2024 15:51
@mfatihaktas mfatihaktas marked this pull request as ready for review March 21, 2024 16:36
@mfatihaktas mfatihaktas force-pushed the feat/support-time-travel-query branch from 87b050f to 9304ad9 on April 4, 2024 14:03
@mfatihaktas mfatihaktas force-pushed the feat/support-time-travel-query branch from 9304ad9 to 31ec7f2 on April 4, 2024 14:40
@cpcloud cpcloud added this to the 9.2 milestone Jun 13, 2024
cpcloud (Member) left a comment

Can we drop all the flink-related changes and just implement this for bigquery and snowflake (and perhaps one other backend)?

@cpcloud cpcloud removed this from the 9.2 milestone Jul 16, 2024
cpcloud (Member) commented Sep 30, 2024

Similar to #8412, I don't have time to maintain this code right now.

@cpcloud cpcloud closed this Sep 30, 2024