
[BUG] Feathr REST API NGINX needs to support larger client header/cookie #878

Open · 2 of 4 tasks
Ritaja opened this issue Nov 24, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@Ritaja commented Nov 24, 2022

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the Feathr community.

Feathr version

0.9.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.0): Windows 11
  • Python version: 3.10.8
  • Spark version, if reporting runtime issue: N/A

(The problem is not specific to the system info above; see details below.)

Describe the problem

Problem 1:
The Feathr client communicates directly with Purview (Atlas) when the environment variable/config is set to the Purview name and no other feature registry details are provided:

os.environ['feature_registry__purview__purview_name'] = f'{purview_name}'

versus setting:

os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1'

Problem 2:
When the client uses the Feathr REST API with Purview as the backend, registering features fails with an NGINX error:
feathr_client.register_features()

<center>Request Header Or Cookie Too Large</center>
<hr><center>nginx/1.18.0</center>
</body>
</html>

Possible fix: adapt https://github.com/feathr-ai/feathr/blob/main/deploy/nginx.conf to allow larger request headers.
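
A minimal sketch of what that adaptation could look like, assuming the rejection comes from NGINX's default header buffers (the directive names are standard NGINX; the sizes below are assumptions and may need tuning for large Azure AD bearer tokens):

http {
    # Defaults are 1k / 4x8k; raise them so large auth tokens and cookies pass.
    client_header_buffer_size   16k;
    large_client_header_buffers 4 32k;

    # ... rest of the existing Feathr REST API proxy configuration ...
}

large_client_header_buffers is the directive behind the "Request Header Or Cookie Too Large" response; client_header_buffer_size only covers the common small-header case.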

Tracking information

No response

Code to reproduce bug

Feathr config YAML (common to both problems):

api_version: 1
project_config:
  project_name: "nyctaxi"
  required_environment_variables:
    - "REDIS_PASSWORD"
offline_store:
  adls:
    adls_enabled: true
  wasb:
    wasb_enabled: true
  s3:
    s3_enabled: false
    s3_endpoint: ""
  jdbc:
    jdbc_enabled: false
    jdbc_database: ""
    jdbc_table: ""
  snowflake:
    snowflake_enabled: false
    url: ""
    user: ""
    role: ""
spark_config:
  spark_cluster: "azure_synapse"
  spark_result_output_parts: "1"
  azure_synapse:
    dev_url: ""
    pool_name: ""
    workspace_dir: "abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/nyc_taxi"
    executor_size: "Small"
    executor_num: 1
  databricks:
    workspace_instance_url: ""
    config_template:
      {
        "run_name": "",
        "new_cluster":
          {
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "Standard_D3_v2",
            "num_workers": 2,
            "spark_conf": {},
          },
        "libraries": [{ "jar": "" }],
        "spark_jar_task": { "main_class_name": "", "parameters": [""] },
      }
    work_dir: ""
online_store:
  redis:
    host: "<replace_with_your_redis>.redis.cache.windows.net"
    port: 6380
    ssl_enabled: True
feature_registry:
  api_endpoint: ""

Problem 1:

Environment variable settings for the Python client:

resource_prefix = "<replace your prefix>"
storage_accountname = "<replace>"
storage_containername = "<replace>"

key_vault_name=resource_prefix+"kv"
synapse_workspace_url=resource_prefix+"syws"
adls_account=resource_prefix+"dls"
adls_fs_name=resource_prefix+"fs"
purview_name=resource_prefix+"purview"
key_vault_uri = f"https://{key_vault_name}.vault.azure.net"
credential = DefaultAzureCredential(exclude_interactive_browser_credential=False, additionally_allowed_tenants=['*'])
client = SecretClient(vault_url=key_vault_uri, credential=credential)
secretName = "FEATHR-ONLINE-STORE-CONN"
retrieved_secret = client.get_secret(secretName).value


# The retrieved secret has the form "<host>:<port>,password=<password>,ssl=<True|False>"
redis_port = retrieved_secret.split(',')[0].split(":")[1]
redis_host = retrieved_secret.split(',')[0].split(":")[0]
redis_password = retrieved_secret.split(',')[1].split("password=", 1)[1]
redis_ssl = retrieved_secret.split(',')[2].split("ssl=", 1)[1]


os.environ['spark_config__azure_synapse__dev_url'] = f'https://{synapse_workspace_url}.dev.azuresynapse.net'
os.environ['spark_config__azure_synapse__pool_name'] = 'spark31'
os.environ['spark_config__azure_synapse__workspace_dir'] = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_project'
os.environ['online_store__redis__host'] = redis_host
os.environ['online_store__redis__port'] = redis_port
os.environ['online_store__redis__ssl_enabled'] = redis_ssl
os.environ['REDIS_PASSWORD']=redis_password
feathr_output_path = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_output'

Problematic setting:

Add this to the environment above: os.environ['feature_registry__purview__purview_name'] = f'{purview_name}' <-- this seems to make the client communicate with the Purview registry directly, not via the REST API.

[screenshot: client output showing the Purview client communicating directly with the registry]

If we instead define the REST API endpoint, i.e. replace os.environ['feature_registry__purview__purview_name'] = f'{purview_name}' with
os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1', then the client uses the REST API.

[screenshot: client output showing registration going through the REST API endpoint]

Problem 2:

Use the same Feathr config YAML and environment variables as above, but now force the client to use the REST API via os.environ['FEATURE_REGISTRY__API_ENDPOINT'] = f'https://{resource_prefix}webapp.azurewebsites.net/api/v1':

from pyspark.sql import SparkSession, DataFrame
from feathr import (BOOLEAN, FLOAT, INT32, INPUT_CONTEXT, DerivedFeature,
                    Feature, FeatureAnchor, HdfsSource, TypedKey, ValueType,
                    WindowAggTransformation)
def feathr_udf_day_calc(df: DataFrame) -> DataFrame:
    from pyspark.sql.functions import dayofweek, dayofyear, col
    df = df.withColumn("fare_amount_cents", col("fare_amount")*100)
    return df

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/green_tripdata_2020-04_with_index.csv",
                          event_timestamp_column="lpep_dropoff_datetime",
                          preprocessing=feathr_udf_day_calc,
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

f_trip_distance = Feature(name="f_trip_distance",
                          feature_type=FLOAT, transform="trip_distance")
f_trip_time_duration = Feature(name="f_trip_time_duration",
                               feature_type=INT32,
                               transform="(to_unix_timestamp(lpep_dropoff_datetime) - to_unix_timestamp(lpep_pickup_datetime))/60")

features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(name="f_is_long_trip_distance",
            feature_type=BOOLEAN,
            transform="cast_float(trip_distance)>30"),
    Feature(name="f_day_of_week",
            feature_type=INT32,
            transform="dayofweek(lpep_dropoff_datetime)"),
]

request_anchor = FeatureAnchor(name="request_features",
                               source=INPUT_CONTEXT,
                               features=features)

location_id = TypedKey(key_column="DOLocationID",
                       key_column_type=ValueType.INT32,
                       description="location id in NYC",
                       full_name="nyc_taxi.location_id")
agg_features = [Feature(name="f_location_avg_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="AVG",
                                                          window="90d")),
                Feature(name="f_location_max_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="MAX",
                                                          window="90d")),
                Feature(name="f_location_total_fare_cents",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="fare_amount_cents",
                                                          agg_func="SUM",
                                                          window="90d")),
                ]

agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)

f_trip_time_distance = DerivedFeature(name="f_trip_time_distance",
                                      feature_type=FLOAT,
                                      input_features=[
                                          f_trip_distance, f_trip_time_duration],
                                      transform="f_trip_distance * f_trip_time_duration")

f_trip_time_rounded = DerivedFeature(name="f_trip_time_rounded",
                                     feature_type=INT32,
                                     input_features=[f_trip_time_duration],
                                     transform="f_trip_time_duration % 10")

# feathr_client is assumed to have been created earlier from the config above,
# e.g. feathr_client = FeathrClient(config_path='feathr_config.yaml')
feathr_client.build_features(anchor_list=[agg_anchor, request_anchor],
                             derived_feature_list=[f_trip_time_distance, f_trip_time_rounded])

feathr_client.register_features()

What component(s) does this bug affect?

  • Python Client: This is the client users use to interact with most of our API. Mostly written in Python.
  • Computation Engine: The computation engine that executes the actual feature join and generation work. Mostly in Scala and Spark.
  • Feature Registry API: The frontend API layer supporting SQL and Purview (Atlas) as storage. The API layer is in Python (FastAPI).
  • Feature Registry Web UI: The web UI for the feature registry. Written in React.
Ritaja added the bug label on Nov 24, 2022
@mufajjul (Collaborator) commented:

@Yuqing-cat to support @Ritaja on this issue.

@Yuqing-cat (Collaborator) commented:

For the first problem, a log was added to inform the user to use the API approach: #892.
For the second NGINX problem, I cannot reproduce it in my environment. @Ritaja, could you help dump the headers so we can understand why NGINX rejects the request from the Python client? The default NGINX header buffer size is 8k, which means the client is sending a request with headers over 8k, which our sample should not reach.
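
(A minimal sketch of one way to dump those headers, assuming the Feathr client's registry calls go through the requests/urllib3 stack; standard-library http.client debug logging prints every outgoing request line and header:)

import logging
import http.client

# Echo the raw request line and all outgoing headers for every HTTP call.
http.client.HTTPConnection.debuglevel = 1

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

# Re-run the failing call and inspect the Authorization/Cookie header sizes:
feathr_client.register_features()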

@Ritaja (Author) commented Dec 3, 2022

@Yuqing-cat I could reproduce it through an Azure deployment. Could you deploy the Feathr components on Azure using the guide here and test with the code snippets above?
