
feat: Bump psycopg2 to psycopg3 for all Postgres components #1

Closed
wants to merge 20 commits

Conversation

@job-almekinders (Owner) commented Jun 21, 2024

What this PR does / why we need it:

This PR upgrades the psycopg2 dependency to the newer psycopg3. See here for more information on the differences between the two versions.

This is the first of two PRs required to enable async feature retrieval for the Postgres Online Store.

While here:

Additional remarks

The changes in this commit are related to the linter. In psycopg3, stricter type hints on the Cursor object require handling cases where cursor.description might be None. Although psycopg2 could also return None for this, it wasn't previously accounted for.
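As a hedged illustration of the kind of guard this forces (the DSN, query, and the ValueError stand-in are illustrative; the PR itself adds a dedicated ZeroColumnQueryResult exception, per the commit list below):

```python
import psycopg

# Minimal sketch, assuming a psycopg3 connection; DSN and query are made up.
with psycopg.connect("postgresql://user:pass@localhost:5432/feast") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT entity_key, value FROM feature_table")
        # In psycopg3, Cursor.description is typed as Optional: it is None for
        # statements that return no result set, so the linter forces a check.
        if cur.description is None:
            raise ValueError("query returned zero columns")  # stand-in exception
        columns = [desc[0] for desc in cur.description]
```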

Which issue(s) this PR fixes:

First of two PRs required to fix feast-dev#4260

Signed-off-by: Job Almekinders <job.almekinders@teampicnic.com>
Set connection read only

Addition

Use new ConnectionPool

Pass kwargs as named argument

Use executemany over execute_values

Remove not-required open argument in psycopg.connect

Improve

Use SpooledTemporaryFile

Use max_size and add docstring

Properly write with StringIO

Utils: Use SpooledTemporaryFile over StringIO object

Add replace

Fix df_to_postgres_table

Remove import

Utils

@job-almekinders (Owner, Author) left a comment


Some clarifications from my side!

@@ -64,57 +75,56 @@ def online_write_batch(
Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
],
progress: Optional[Callable[[int], Any]],
batch_size: int = 5000,

Make the batch size configurable, addressing feast-dev#4036
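For context, a sketch of how a configurable batch_size is typically consumed in this method (it reuses insert_values, sql_query, cur, and progress from the surrounding function; not guaranteed to be the PR's exact loop):

```python
# Chunk the formatted rows so the progress callback fires once per batch.
for i in range(0, len(insert_values), batch_size):
    cur_batch = insert_values[i : i + batch_size]
    cur.executemany(sql_query, cur_batch)
    if progress:
        progress(len(cur_batch))
```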

Comment on lines +80 to +119
# Format insert values
insert_values = []
for entity_key, values, timestamp, created_ts in data:
    entity_key_bin = serialize_entity_key(
        entity_key,
        entity_key_serialization_version=config.entity_key_serialization_version,
    )
    timestamp = _to_naive_utc(timestamp)
    if created_ts is not None:
        created_ts = _to_naive_utc(created_ts)

    for feature_name, val in values.items():
        vector_val = None
        if config.online_store.pgvector_enabled:
            vector_val = get_list_val_str(val)
        insert_values.append(
            (
                entity_key_bin,
                feature_name,
                val.SerializeToString(),
                vector_val,
                timestamp,
                created_ts,
            )
        )

# Create insert query
sql_query = sql.SQL(
    """
    INSERT INTO {}
    (entity_key, feature_name, value, vector_value, event_ts, created_ts)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON CONFLICT (entity_key, feature_name) DO
    UPDATE SET
        value = EXCLUDED.value,
        vector_value = EXCLUDED.vector_value,
        event_ts = EXCLUDED.event_ts,
        created_ts = EXCLUDED.created_ts;
    """
).format(sql.Identifier(_table_id(config.project, table)))

No changes here, only moving code further up in the function to make it more readable.

"""
INSERT INTO {}
(entity_key, feature_name, value, vector_value, event_ts, created_ts)
VALUES (%s, %s, %s, %s, %s, %s)

1 out of 2 actual changes to the function:

We need to explicitly write one placeholder per column, because executemany does not expand a single VALUES %s template the way psycopg2's execute_values did.

cur.executemany(sql_query, cur_batch)

2 out of 2 actual changes to the function:

The psycopg2.extras.execute_values functionality is removed in psycopg3. The maintainer of psycopg3 advises using executemany instead; see psycopg/psycopg#576 and psycopg/psycopg#114.
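For reference, a minimal before/after sketch of this migration (the table t, DSN, and rows are illustrative):

```python
import psycopg

rows = [(b"key1", "feat_a", b"\x01"), (b"key2", "feat_b", b"\x02")]

with psycopg.connect("postgresql://user:pass@localhost:5432/feast") as conn:
    with conn.cursor() as cur:
        # psycopg2 style (removed in psycopg3):
        #   psycopg2.extras.execute_values(
        #       cur, "INSERT INTO t VALUES %s", rows, page_size=5000)
        # psycopg3 style: one explicit placeholder per column; executemany
        # batches the parameter sets itself.
        cur.executemany("INSERT INTO t VALUES (%s, %s, %s)", rows)
```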

@@ -172,7 +182,9 @@ def online_read(
# when we iterate through the keys since they are in the correct order
values_dict = defaultdict(list)
for row in rows if rows is not None else []:
values_dict[
    row[0] if isinstance(row[0], bytes) else row[0].tobytes()
].append(row[1:])

Only call tobytes() when row[0] is not already of type bytes. psycopg3 returns bytea columns as bytes, which has no tobytes() method, so calling it unconditionally raises an AttributeError.
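A tiny self-contained illustration of the difference (values made up):

```python
# psycopg2 returned bytea columns as memoryview; psycopg3 returns bytes.
raw_pg2 = memoryview(b"entity-key")  # old driver's representation
raw_pg3 = b"entity-key"              # new driver's representation

def as_bytes(raw):
    # bytes has no tobytes() method, so only convert when it's a memoryview.
    return raw if isinstance(raw, bytes) else raw.tobytes()

assert as_bytes(raw_pg2) == as_bytes(raw_pg3) == b"entity-key"
```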

Comment on lines +35 to +56
def _get_conninfo(config: PostgreSQLConfig) -> str:
"""Get the `conninfo` argument required for connection objects."""
return (
f"postgresql://{config.user}"
f":{config.password}"
f"@{config.host}"
f":{int(config.port)}"
f"/{config.database}"
)


def _get_conn_kwargs(config: PostgreSQLConfig) -> Dict[str, Any]:
"""Get the additional `kwargs` required for connection objects."""
return {
"sslmode": config.sslmode,
"sslkey": config.sslkey_path,
"sslcert": config.sslcert_path,
"sslrootcert": config.sslrootcert_path,
"options": "-c search_path={}".format(config.db_schema or config.user),
}



Helper functions to prevent code duplication in the above methods.
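Roughly how these helpers plug into psycopg3's connection APIs; a sketch under the assumption that the callers look like this (the function names and the max_conn config field are hypothetical):

```python
import psycopg
from psycopg_pool import ConnectionPool

def _get_conn(config):
    # Singleton connection: conninfo string plus the extra keyword arguments.
    return psycopg.connect(_get_conninfo(config), **_get_conn_kwargs(config))

def _get_conn_pool(config):
    # psycopg_pool accepts the extra connection kwargs as a named `kwargs`
    # dict, which is what the "Pass kwargs as named argument" commit refers to.
    return ConnectionPool(
        conninfo=_get_conninfo(config),
        max_size=config.max_conn,  # hypothetical config field
        kwargs=_get_conn_kwargs(config),
    )
```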

Comment on lines +75 to +82
nr_columns = df.shape[1]
placeholders = ", ".join(["%s"] * nr_columns)
query = f"INSERT INTO {table_name} VALUES ({placeholders})"
values = df.replace({np.NaN: None}).to_numpy().tolist()

with _get_conn(config) as conn, conn.cursor() as cur:
    cur.execute(_df_to_create_table_sql(df, table_name))
    cur.executemany(query, values)

Moved the parsing of variables further to the top of the function.

  1. Again, we need to replace execute_values with executemany.
  2. Again, we need to explicitly set the number of placeholders. Since this function has to handle a dynamic number of columns, we build the placeholders variable from the DataFrame's column count; see the sketch below.
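Concretely, for a hypothetical three-column DataFrame the construction above yields:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, np.nan], "c": ["x", "y"]})
placeholders = ", ".join(["%s"] * df.shape[1])
query = f"INSERT INTO my_table VALUES ({placeholders})"
values = df.replace({np.nan: None}).to_numpy().tolist()

print(query)   # INSERT INTO my_table VALUES (%s, %s, %s)
print(values)  # [[1, 3.0, 'x'], [2, None, 'y']]
```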

Comment on lines +42 to +46
@pytest.mark.parametrize(
"conn_type",
[ConnectionType.singleton, ConnectionType.pool],
ids=lambda v: f"conn_type:{v}",
)

Test both ConnectionTypes

@@ -149,7 +149,7 @@ def pg_registry():

registry_config = RegistryConfig(
registry_type="sql",
path=f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{container_host}:{container_port}/{POSTGRES_DB}",
path=f"postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{container_host}:{container_port}/{POSTGRES_DB}",

Required to tell SQLAlchemy that we are using psycopg3. If we don't add the +psycopg suffix, it will still try to use psycopg2.
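A minimal sketch of why the suffix matters (the URL is illustrative):

```python
from sqlalchemy import create_engine

# "postgresql://" resolves to SQLAlchemy's default PostgreSQL dialect, which
# imports psycopg2; "postgresql+psycopg://" explicitly selects the psycopg3
# ("psycopg") dialect.
engine = create_engine("postgresql+psycopg://user:pass@localhost:5432/feast")
```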

@TomSteenbergen TomSteenbergen Jun 21, 2024

Can you elaborate why? In picnic-feature-store we do not have this issue and we don't need the psycopg part. We shouldn't still have psycopg2 in the venv anymore, right?

@TomSteenbergen TomSteenbergen Jun 21, 2024

Or is it some weird behaviour by SQLAlchemy's create_engine, which gets called in the init of SQLRegistry?

@job-almekinders (Owner, Author)

> Or is it some weird behaviour by SQLAlchemy's create_engine, which gets called in the init of SQLRegistry?

Yes, it is because of the create_engine function. I'll add a comment in the code 👍

@TomSteenbergen TomSteenbergen left a comment

Nice, LGTM! Just one question

@TomSteenbergen

I notice one mention of psycopg2 btw in docs/tutorials/using-scalable-registry.md. Shall we get rid of the 2 there as well?

Add log statement

Lint: Fix _to_arrow_internal

Lint: Fix _get_entity_df_event_timestamp_range

Update exception

Use ZeroColumnQueryResult

@job-almekinders (Owner, Author)

> I notice one mention of psycopg2 btw in docs/tutorials/using-scalable-registry.md. Shall we get rid of the 2 there as well?

Good catch! I've removed it :)

@job-almekinders (Owner, Author)

PR on feast repo: feast-dev#4303
