Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only one python worker among 8 workers is triggered after the cancellation of a worker due to a Function timeout #9669

Closed
cgiraudeau opened this issue Nov 8, 2023 · 2 comments
Assignees

Comments

@cgiraudeau
Copy link

cgiraudeau commented Nov 8, 2023

Check for a solution in the Azure portal

For issues in production, please check for a solution to common issues in the Azure portal before opening a bug. In the Azure portal, navigate to your function app, select Diagnose and solve problems from the left, and view relevant dashboards before opening your issue.

Investigative information

Please provide the following:

  • Timestamp: 2023-11-08T08:28:49.6365789Z
  • Function App version: 4.27.5.5
  • Function App name: dev-retriever-engine
  • Function name(s) (as appropriate): function_scraper
  • Trigger: Service Bus
  • Host Instance Id: 299608c0-0e68-40fd-9a82-abb1168801eb
  • Invocation ID: 3c010a34-42df-4393-b0ff-bfa0d1403843
  • Region: France Central
  • Docker image: mcr.microsoft.com/azure-functions/python:4-python3.10-appservice
  • FUNCTIONS_WORKER_PROCESS_COUNT: 8
  • FUNCTIONS_WORKER_RUNTIME: python
  • PYTHON_THREADPOOL_THREAD_COUNT: 2

Repro steps

  1. Have a Function with 8 available worker processes
  2. Invoke a function till the cancellation of the worker due to a timeout. The worker process will be killed and a new one will be created to have always 8 workers.
  3. Make some new invocations, they will be processed only by the new worker. Other workers will never be used.

Expected behavior

The other workers should also be triggered after the kill of a worker due to a cancellation of a function invocation.

Actual behavior

I've ssh-ed to a new fresh instance and did a ps:

root@da4c6cf5bdd3:~/site/wwwroot# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   6872  2648 ?        SNs  08:10   0:00 bash /azure-functions-host/start.sh
root          24  0.0  0.1  13352  3324 ?        SNs  08:10   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root          25  1.4  6.2 3055060 200580 ?      SNl  08:10   0:08 /azure-functions-host/Microsoft.Azure.WebJobs.Script.WebHost
root          48  0.2  1.3 664784 42660 ?        SNl  08:10   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 5fba1a63-2298-414c-867a-e096d2185627 --reques
root          68  0.2  1.2 664784 41628 ?        SNl  08:10   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 039fd8f5-b5a5-4951-8148-7dc2f1748ca8 --reques
root          88  0.2  1.2 664140 41104 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId ca43468e-6313-4a91-b690-8221a6a01575 --reques
root         102  0.2  1.3 664784 42096 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 8e96c217-79da-46f6-8fbf-ec635a007738 --reques
root         117  0.2  1.2 664784 41744 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 684b33a2-c2c1-48eb-a598-fac296d1e728 --reques
root         130  0.2  1.3 664784 43972 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 9671b4b6-246b-4b88-8f70-7947ef455289 --reques
root         143  0.2  1.3 664784 43732 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId d1589d09-1bb5-40dc-90d2-6cdc31f0cf0e --reques
root         156  0.2  1.2 664784 40268 ?        SNl  08:12   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 4a5a4781-2c64-45b3-856e-9e06c5665a9d --reques
root         180  0.3  0.2  13464  7560 ?        SNs  08:20   0:00 sshd: root@pts/0
root         182  0.0  0.1   7136  3456 pts/0    SNs  08:20   0:00 -bash
root         184  0.0  0.0   9732  2952 pts/0    RN+  08:20   0:00 ps -aux

We have 8 azure_functions_worker processes.

I've done 16 invocations of the function.
We can see in the following image that the invocations are evenly distributed among the workers.

traces
| where timestamp > ago(1d)
| where message startswith "Transaction triggered by Service Bus"
| where customDimensions["HostInstanceId"]=="299608c0-0e68-40fd-9a82-abb1168801eb"
| order by timestamp desc
| project timestamp, instanceId=customDimensions["HostInstanceId"], workerId=substring(substring(message, indexof(message, "worker_id")+13), 0, 36), invocationId=customDimensions["InvocationId"], message

299608c0-0e68-40fd-9a82-abb1168801eb_1

Then, I've done an invocation (3c010a34-42df-4393-b0ff-bfa0d1403843) of the function that has lasted till the cancellation by Azure.
It was executed by the worker 039fd8f5-b5a5-4951-8148-7dc2f1748ca8.
When the function invocation is cancelled, the process of the worker is killed, and a new worker process (39f9623f-8cc1-4980-a802-ae1e43a20f3d) is created:

root@da4c6cf5bdd3:~/site/wwwroot# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   6872  2648 ?        SNs  08:10   0:00 bash /azure-functions-host/start.sh
root          24  0.0  0.1  13352  3324 ?        SNs  08:10   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root          25  1.0  6.4 3063208 207900 ?      SNl  08:10   0:12 /azure-functions-host/Microsoft.Azure.WebJobs.Script.WebHost
root          48  0.1  1.3 664784 42660 ?        SNl  08:10   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 5fba1a63-2298-414c-867a-e096d2185627 --reques
root          88  0.1  1.2 664140 41104 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId ca43468e-6313-4a91-b690-8221a6a01575 --reques
root         102  0.1  1.3 664784 42096 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 8e96c217-79da-46f6-8fbf-ec635a007738 --reques
root         117  0.1  1.2 664784 41744 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 684b33a2-c2c1-48eb-a598-fac296d1e728 --reques
root         130  0.1  1.3 664784 43972 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 9671b4b6-246b-4b88-8f70-7947ef455289 --reques
root         143  0.1  1.3 664784 43732 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId d1589d09-1bb5-40dc-90d2-6cdc31f0cf0e --reques
root         156  0.1  1.2 664784 40268 ?        SNl  08:12   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 4a5a4781-2c64-45b3-856e-9e06c5665a9d --reques
root         180  0.0  0.2  13464  7560 ?        SNs  08:20   0:00 sshd: root@pts/0
root         182  0.0  0.1   7136  3532 pts/0    SNs  08:20   0:00 -bash
root         194  2.0  1.2 664784 40076 ?        SNl  08:28   0:00 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 39f9623f-8cc1-4980-a802-ae1e43a20f3d --reques
root         209  0.0  0.0   9732  2924 pts/0    RN+  08:29   0:00 ps -aux

The Service Bus event is triggered again (Invocation Id: 3721f1aa-926c-4a46-9536-9850ed8687d9).
The function invocation is processed by the new worker 39f9623f-8cc1-4980-a802-ae1e43a20f3d.
Then the function invocation is cancelled again by Azure due to a timeout. The process worker is killed and a new worker process (69c25cae-78b6-4ac5-9147-a888428d01a4) is created.

root@da4c6cf5bdd3:~/site/wwwroot# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   6872  2648 ?        SNs  08:10   0:00 bash /azure-functions-host/start.sh
root          24  0.0  0.1  13352  3324 ?        SNs  08:10   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root          25  1.0  6.4 3063208 208120 ?      SNl  08:10   0:13 /azure-functions-host/Microsoft.Azure.WebJobs.Script.WebHost
root          48  0.1  1.3 664784 42660 ?        SNl  08:10   0:02 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 5fba1a63-2298-414c-867a-e096d2185627 --reques
root          88  0.1  1.2 664140 41104 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId ca43468e-6313-4a91-b690-8221a6a01575 --reques
root         102  0.1  1.3 664784 42096 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 8e96c217-79da-46f6-8fbf-ec635a007738 --reques
root         117  0.1  1.2 664784 41744 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 684b33a2-c2c1-48eb-a598-fac296d1e728 --reques
root         130  0.1  1.3 664784 43972 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 9671b4b6-246b-4b88-8f70-7947ef455289 --reques
root         143  0.1  1.3 664784 43732 ?        SNl  08:11   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId d1589d09-1bb5-40dc-90d2-6cdc31f0cf0e --reques
root         156  0.1  1.2 664784 40268 ?        SNl  08:12   0:01 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 4a5a4781-2c64-45b3-856e-9e06c5665a9d --reques
root         180  0.0  0.2  13464  7560 ?        SNs  08:20   0:00 sshd: root@pts/0
root         182  0.0  0.1   7136  3532 pts/0    SNs  08:20   0:00 -bash
root         212  1.8  1.2 664784 39624 ?        SNl  08:31   0:00 /opt/retriever_venv/bin/python -m azure_functions_worker --host 127.0.0.1 --port 33221 --workerId 69c25cae-78b6-4ac5-9147-a888428d01a4 --reques
root         227  0.0  0.0   9732  2884 pts/0    RN+  08:31   0:00 ps -aux

299608c0-0e68-40fd-9a82-abb1168801eb_4

299608c0-0e68-40fd-9a82-abb1168801eb_2

Then, I've done 8 new invocations, and we can see that these invocations have been processed by only one worker (69c25cae-78b6-4ac5-9147-a888428d01a4).
The invocations are not evenly distributed among the worker anymore.

299608c0-0e68-40fd-9a82-abb1168801eb_3

timestamp [UTC] instanceId workerId invocationId
11/8/2023, 8:38:57.548 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 9189cbaa-bbc2-4821-9c7c-f6bb865bb75f
11/8/2023, 8:38:53.831 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 32624d22-5bec-40ae-8ebd-fc4cfe3dfcf2
11/8/2023, 8:38:47.499 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 57861189-420c-406d-b123-7fa8eff92843
11/8/2023, 8:38:43.783 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 6c886dff-61a1-4716-acd4-2ca436d4fafe
11/8/2023, 8:38:37.451 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 0d371c2c-ba97-4ec2-af4d-255a52c75db1
11/8/2023, 8:38:33.736 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 04279503-2926-4408-9bbe-e11e638f1c36
11/8/2023, 8:38:27.399 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 9056474c-9de4-4c47-bd2b-eb4d6f7fc87d
11/8/2023, 8:38:23.684 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 aca81598-bd59-4c4a-a3b6-03b838d2ee27
11/8/2023, 8:31:20.074 AM 299608c0-0e68-40fd-9a82-abb1168801eb 69c25cae-78b6-4ac5-9147-a888428d01a4 5edb6c2b-ecc4-42b6-8a69-f1ce5cfdfad2
11/8/2023, 8:29:07.341 AM 299608c0-0e68-40fd-9a82-abb1168801eb 39f9623f-8cc1-4980-a802-ae1e43a20f3d 3721f1aa-926c-4a46-9536-9850ed8687d9
11/8/2023, 8:26:49.645 AM 299608c0-0e68-40fd-9a82-abb1168801eb 039fd8f5-b5a5-4951-8148-7dc2f1748ca8 3c010a34-42df-4393-b0ff-bfa0d1403843
11/8/2023, 8:22:06.357 AM 299608c0-0e68-40fd-9a82-abb1168801eb 5fba1a63-2298-414c-867a-e096d2185627 143eb478-4b63-4955-9f18-8810af51a62a
11/8/2023, 8:22:05.167 AM 299608c0-0e68-40fd-9a82-abb1168801eb 4a5a4781-2c64-45b3-856e-9e06c5665a9d 2d384606-1d6e-4a4c-9f38-ce05ecf9e12f
11/8/2023, 8:22:03.975 AM 299608c0-0e68-40fd-9a82-abb1168801eb d1589d09-1bb5-40dc-90d2-6cdc31f0cf0e 55a2888e-cbc2-4c4f-9720-826cbd4c7c12
11/8/2023, 8:22:02.835 AM 299608c0-0e68-40fd-9a82-abb1168801eb 9671b4b6-246b-4b88-8f70-7947ef455289 d30c61ef-f5f2-4afd-b562-9c20dee84c13
11/8/2023, 8:22:01.688 AM 299608c0-0e68-40fd-9a82-abb1168801eb 684b33a2-c2c1-48eb-a598-fac296d1e728 7d926914-a460-429f-91b5-cd0083494ad6
11/8/2023, 8:22:00.563 AM 299608c0-0e68-40fd-9a82-abb1168801eb 8e96c217-79da-46f6-8fbf-ec635a007738 3e578493-e843-4eb1-8d54-7129545cdcc5
11/8/2023, 8:21:59.271 AM 299608c0-0e68-40fd-9a82-abb1168801eb ca43468e-6313-4a91-b690-8221a6a01575 dbe48553-cafb-4368-996f-be388a3e1f74
11/8/2023, 8:21:58.083 AM 299608c0-0e68-40fd-9a82-abb1168801eb 039fd8f5-b5a5-4951-8148-7dc2f1748ca8 2fb913e4-ca33-4784-8f49-5cbd6de8acdb
11/8/2023, 8:16:17.929 AM 299608c0-0e68-40fd-9a82-abb1168801eb 5fba1a63-2298-414c-867a-e096d2185627 24bcbb24-413c-42c0-833a-f9c8a4ca4b22
11/8/2023, 8:16:16.730 AM 299608c0-0e68-40fd-9a82-abb1168801eb 4a5a4781-2c64-45b3-856e-9e06c5665a9d c901345b-5a47-405e-8ede-4c273a744246
11/8/2023, 8:16:15.554 AM 299608c0-0e68-40fd-9a82-abb1168801eb d1589d09-1bb5-40dc-90d2-6cdc31f0cf0e f3f4061d-b6b8-4408-9ced-2c4b97e7b77a
11/8/2023, 8:16:14.482 AM 299608c0-0e68-40fd-9a82-abb1168801eb 9671b4b6-246b-4b88-8f70-7947ef455289 7679dd4b-98cd-4bfc-ae4f-24101d198958
11/8/2023, 8:16:13.270 AM 299608c0-0e68-40fd-9a82-abb1168801eb 684b33a2-c2c1-48eb-a598-fac296d1e728 0b187601-d029-457d-8faa-695a585ec19b
11/8/2023, 8:16:11.909 AM 299608c0-0e68-40fd-9a82-abb1168801eb 8e96c217-79da-46f6-8fbf-ec635a007738 9a116e2d-2a86-4901-9071-2731e9e7cb96
11/8/2023, 8:16:10.743 AM 299608c0-0e68-40fd-9a82-abb1168801eb ca43468e-6313-4a91-b690-8221a6a01575 2628d3c3-e91a-4fc6-a5cf-a87e277da96b
11/8/2023, 8:16:09.677 AM 299608c0-0e68-40fd-9a82-abb1168801eb 039fd8f5-b5a5-4951-8148-7dc2f1748ca8 1fc085e2-3149-49a4-a68c-89a0e67a12d3

Known workarounds

None

Related information

Provide any related information

  • Programming language used: Python
  • Bindings used: Service Bus

host.json

{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.0.0, 5.0.0)"
  },
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "Fixed",
        "tryTimeout": "00:15:00",
        "delay": "00:00:00.80",
        "maxDelay": "00:00:05",
        "maxRetries": 3
      },
      "prefetchCount": 0,
      "maxAutoLockRenewalDuration": "00:15:00",
      "maxConcurrentCalls": 8,
      "maxMessageBatchSize": 1
    }
  },
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 3,
    "delayInterval": "00:00:05"
  },
  "logging": {
    "fileLoggingMode": "debugOnly",
    "logLevel": {
      "default": "Debug"
    },
    "console": {
      "isEnabled": true,
      "DisableColors": true
    }
  },
  "functionTimeout": "00:02:00"
}
Source
import json
import logging
import os
import sys
import time
import threading

import azure.functions as func

logger = logging.getLogger(__name__)


# https://docs.microsoft.com/nb-no/python/api/azure-functions/azure.functions.servicebusmessage?view=azure-python
def main(msg: func.ServiceBusMessage, context: func.Context):
    """Entry point of the Azure Function.

    See: https://docs.microsoft.com/en-us/azure/azure-functions/functions-reference-python?tabs=azurecli-linux%2Capplication-level#programming-model

    The context parameter gives us the invocation_id.
    See: https://docs.microsoft.com/en-us/azure/azure-functions/functions-reference-python?tabs=azurecli-linux%2Capplication-level#context

    :param msg:
    :param context:
    :return:
    """

    # Get the message and deserialize it to a Python object.
    event = get_event(msg)

    # Log the invocation of the function.
    log_content = get_log_content(msg, event)
    log_content['invocation_id'] = context.invocation_id
    log_content['request_id'] = get_argument_value('--requestId')
    log_content['worker_id'] = get_argument_value('--workerId')
    log = f"Transaction triggered by Service Bus: {log_content}"
    logger.info(log)

    if event['provider_id'] == 'very_slow' and msg.delivery_count == 1:
        time.sleep(180)
    else:
        time.sleep(10)


def get_argument_value(arg):
    """

    :param arg:
    :return:
    """
    value = None
    if arg in sys.argv:
        pos = sys.argv.index(arg)
        value = sys.argv[pos + 1]
    return value


def get_event(msg):
    """
    
    :param msg: 
    :return: 
    """
    body = msg.get_body()
    json_str = body.decode('utf8').replace("'", '"')
    return json.loads(json_str)

    
def get_log_content(msg, event):
    """
    
    :param msg: 
    :param event: 
    :return: 
    """
    azure_web_instance_id = os.environ.get('WEBSITE_INSTANCE_ID')
    thread_id = f"name={threading.current_thread().name}-id={threading.current_thread().ident}"
    instance_pid = os.getpid()
    log_content = dict()
    log_content['instance_pid'] = instance_pid
    log_content['azure_web_instance_id_pid'] = azure_web_instance_id
    log_content['thread_id'] = thread_id

    # Log the event.
    enqueued_time = msg.enqueued_time_utc.isoformat() if msg.enqueued_time_utc is not None else None
    locked_until = msg.expires_at_utc.isoformat() if msg.expires_at_utc is not None else None
    log_content['tid'] = event['tid']
    log_content['provider_id'] = event['provider_id']
    log_content['delivery_count'] = msg.delivery_count
    log_content['enqueued_time'] = enqueued_time
    log_content['locked_until'] = locked_until

    return log_content
@bhagyshricompany bhagyshricompany self-assigned this Nov 16, 2023
@bhagyshricompany
Copy link

Thanks for reporting .will check and update.

@brettsam
Copy link
Member

The issue here has now been fixed and will be part of our next release. The PR is #9574 and the relevant line that is causing issues is right here: https://github.com/Azure/azure-functions-host/pull/9574/files#diff-b482fc634ed2b2979781211daa2f08a9196fe4a51a9c76972a2dea16717df99fL231

Previously, we'd remove all channels for a language during a restart (note the TryRemove()). But we wouldn't kill them all -- they'd just be unable to be invoked as they wouldn't be in our channel list anymore. Now we'll do a TryGet() in order to restart only a single worker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants