Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Eventhubs] Cannot send encoded bytes #13739

Closed
jannekeskitalo opened this issue Sep 11, 2020 · 15 comments
Closed

[Eventhubs] Cannot send encoded bytes #13739

jannekeskitalo opened this issue Sep 11, 2020 · 15 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that uAMQP
Milestone

Comments

@jannekeskitalo
Copy link

jannekeskitalo commented Sep 11, 2020

  • Package Name: azure-eventhub
  • Package Version: 5.1.0
  • Operating System: python:3.8 (dockerhub image)
  • Python Version: 3.8

Describe the bug
I fetch rows from a SQL server database and send those in batches to Eventhubs using the async client. I have had multiple instances of the service running against different databases without problems until now with some new tables I get random send failures.

To Reproduce

I can't reproduce the error as the failure seems random. It seems to be related to some specific tables, but even they don't immediately fail.

Log from the container:


2020-09-11 14.19.59.732	2020-09-11 14:19:59.731 INFO uamqp.c_uamqp: b'Invalid argument (list=(nil), item=0x55e6b2bd9d10)' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/src/singlylinkedlist.c':b'singlylinkedlist_remove':108) 	
	
    2020-09-11 14.19.59.732	2020-09-11 14:19:59.731 INFO uamqp.c_uamqp: b'Error removing pending delivery from the list' (b'/data/src/vendor/azure-uamqp-c/src/link.c':b'link_transfer_async':1596) 	
	
    2020-09-11 14.19.59.732	malloc(): corrupted unsorted chunks 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.730 INFO uamqp.connection: Connection b'EHProducer-50b50369-4815-41fd-8639-1d6390535ff6' state changed from <ConnectionState.OPENED: 9> to <ConnectionState.END: 13> 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.731 INFO uamqp.connection: Connection with ID b'EHProducer-50b50369-4815-41fd-8639-1d6390535ff6' unexpectedly in an error state. Closing: False, Error: None 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.731 INFO uamqp.c_uamqp: CBS error occurred on connection b'EHProducer-50b50369-4815-41fd-8639-1d6390535ff6'. 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.731 INFO uamqp.sender: Message sender b'sender-link-9d531a90-358d-446a-952b-241c5d93f492' state changed from <MessageSenderState.Open: 2> to <MessageSenderState.Idle: 0> on connection: b'EHProducer-50b50369-4815-41fd-8639-1d6390535ff6' 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.731 INFO uamqp.c_uamqp: b'Connection not open' (b'/data/src/vendor/azure-uamqp-c/src/connection.c':b'connection_encode_frame':2048) 	
	
    2020-09-11 14.19.59.731	2020-09-11 14:19:59.731 INFO uamqp.c_uamqp: b'Failed session send transfer' (b'/data/src/vendor/azure-uamqp-c/src/link.c':b'link_transfer_async':1593) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.728 INFO uamqp.c_uamqp: b'Failure: sending socket failed. errno=104 (Connection reset by peer).' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/socketio_berkeley.c':b'socketio_send':884) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'Error in xio_send.' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'write_outgoing_bytes':641) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'Error in write_outgoing_bytes.' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'tlsio_openssl_send':1374) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'xio_send failed' (b'/data/src/vendor/azure-uamqp-c/src/saslclientio.c':b'saslclientio_send_async':1190) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'Cannot send encoded bytes' (b'/data/src/vendor/azure-uamqp-c/src/connection.c':b'on_bytes_encoded':268) 

Sending code:

    async def send_batch(self, batch: List[Dict]) -> Dict[str, str]:
        idx = 0
        last_lsn = None
        while True:
            # Batch is split to max send size Eventhub accepts
            event_data_batch = await self.client.create_batch()
            for record in batch[idx:]:
                try:
                    event_data_batch.add(EventData(record["event"].json_string))
                    idx += 1
                    last_lsn = record["lsn"]
                except ValueError:
                    # Max size reached
                    break

            if len(event_data_batch) > 0:
                await self.client.send_batch(event_data_batch)

            if idx == len(batch):
                break
        return last_lsn
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Sep 11, 2020
@kaerm kaerm added Client This issue points to a problem in the data-plane of the library. Event Hubs labels Sep 11, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 11, 2020
@KieranBrantnerMagee
Copy link
Member

Hey @jannekeskitalo , thanks for reaching out, and sorry for any trouble you're having with this.

Given the nature of that error, I'm VERY curious to learn more about the data you're sending that triggers this; Is there any way you could log/provide that information? We run stress tests of the SDK and I can't honestly say I've ever seen this symptom before, so any clues that would help facilitate reproduction would be great.

(There may also be useful context in the log you provided above, higher up than you included; I'm interested to see what UAMQP/SB think they were doing prior.)

@KieranBrantnerMagee KieranBrantnerMagee added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Sep 14, 2020
@jannekeskitalo
Copy link
Author

I will copy some sample data from the problematic tables into my local test environment and try reproducing the issue. Much easier to debug that way.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Sep 15, 2020
@KieranBrantnerMagee KieranBrantnerMagee added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Sep 21, 2020
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Sep 28, 2020
@ghost
Copy link

ghost commented Sep 28, 2020

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@jannekeskitalo
Copy link
Author

jannekeskitalo commented Sep 29, 2020

I can now reproduce the issue locally. It doesn't seem to be related to the shape of the data. It's a combination of the eventhub connection expiring and the amount of data going into the batch afterwards. Here's a test case:

import pytest
from typing import List, Dict
from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
import os
import random
import string
import json
import asyncio


@pytest.fixture
async def client():
    eventhub_uri = os.environ["EVENTHUB_URI"]
    client = EventHubProducerClient.from_connection_string(eventhub_uri)
    yield client
    await client.close()


async def send_batch(client: EventHubProducerClient, batch: List[Dict]) -> None:
    idx = 0
    while True:
        # Batch is split to max send size Eventhub accepts
        event_data_batch = await client.create_batch()
        for record in batch[idx:]:
            try:
                event_data_batch.add(EventData(json.dumps(record)))
                idx += 1
            except ValueError:
                # Max size reached
                break

        if len(event_data_batch) > 0:
            await client.send_batch(event_data_batch)

        if idx == len(batch):
            break


def random_string(max_length: int) -> str:
    length = random.randint(1, max_length)
    letters = string.printable
    return "".join(random.choice(letters) for i in range(length))


def create_batch(rows: int, fields: int) -> List[Dict]:
    fields = [f"field_{i}" for i in range(fields)]
    batch = []
    for i in range(rows):
        batch.append({field: random_string(10) for field in fields})
    return batch


def test_create_batch():
    batch = create_batch(10, 5)
    for record in batch:
        message_string = json.dumps(record)
        json.loads(message_string)


@pytest.mark.asyncio
async def test_send_large_batch_after_3min_delay(client):
    await send_batch(client, create_batch(10, 40))
    await asyncio.sleep(180)
    await send_batch(client, create_batch(200, 40))


@pytest.mark.asyncio
async def test_send_small_batch_after_5min_delay(client):
    await send_batch(client, create_batch(10, 40))
    await asyncio.sleep(180)
    await send_batch(client, create_batch(10, 40))


@pytest.mark.asyncio
async def test_send_large_batch_after_5min_delay(client):
    await send_batch(client, create_batch(10, 40))
    await asyncio.sleep(300)
    await send_batch(client, create_batch(200, 40))

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue. labels Sep 29, 2020
@jannekeskitalo
Copy link
Author

I'm planning to workaround the issue by keeping track of idle time and recreating the client if more than 4min has passed since the last send. I don't like this approach much, so is there a better way to keep the client alive?

@KieranBrantnerMagee
Copy link
Member

Aha, it was not immediately clear when I read your tests, comprehensive as they were, but your mention of 4 minutes rang a key bell. The service has a link idle timeout of 4 minutes. We ostensibly have a reconnect logic in during the send you're performing, but something seems to be going wrong in the C; I would not expect that error. I'm going to need to dig in once we're out the side of this Friday's release, the code you provided is amazingly comprehensive, thank you for that.

That said, while I agree with your distaste for the workaround, I don't have a great solution in the interim outside what you did or using an empty message as a heartbeat, but I almost feel worse about that fix due to the resource utilization. Will poll our crew and chime in here if anyone can think of something clever to tide you over as we investigate.

@jannekeskitalo
Copy link
Author

I added the connection recycling -logic and it seems to keep the service alive. Closing the old connection after >4min of idle time still generates some nasty looking log entries, but nothing fatal happens.

Sending dummy messages would have an impact on the consumers, hence I rather just recycle the client for now.

Thanks for your help.

@yunhaoling
Copy link
Contributor

hey @jannekeskitalo , I'm testing sending-after-long-sleeping locally but my result showed that functionality doesn't get impacted -- second send after 5mins would succeed as it would re-established the connection, so wondering what error does your code get.

Our sdk has implemented retry: when the producer wants to send an event after 4 mins (service timeout), the sdk would get an error then retry would kick in to re-establish the underlying connection which should brings back to the normal track.

There're two thing I want to confirm:

  1. in your test case
@pytest.mark.asyncio
async def test_send_large_batch_after_5min_delay(client):
    await send_batch(client, create_batch(10, 40))
    await asyncio.sleep(300)
    await send_batch(client, create_batch(200, 40))

does the second send_batch operation get errored out, what error do you get?

  1. is there any connection reestablishment logging after the failure logging?
...
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'xio_send failed' (b'/data/src/vendor/azure-uamqp-c/src/saslclientio.c':b'saslclientio_send_async':1190) 	
	
    2020-09-11 14.19.59.729	2020-09-11 14:19:59.729 INFO uamqp.c_uamqp: b'Cannot send encoded bytes' (b'/data/src/vendor/azure-uamqp-c/src/connection.c':b'on_bytes_encoded':268) 

@jannekeskitalo
Copy link
Author

Hi

I updated the eventhubs library to 5.2.0, but still get the same error. I revised the test and added a custom logger. You can find the log output after the test code.

Note: I'm testing against an eventhub which has 4 partitions.

import pytest
from typing import List, Dict, Iterator
from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
import os
import sys
import random
import string
import json
import asyncio
import logging
from logging.handlers import RotatingFileHandler


def get_logger(filename, level=logging.INFO):
    azure_logger = logging.getLogger("azure.eventhub")
    azure_logger.setLevel(level)
    uamqp_logger = logging.getLogger("uamqp")
    uamqp_logger.setLevel(logging.INFO)

    formatter = logging.Formatter("%(asctime)s %(name)-12s %(levelname)-8s %(message)s")
    console_handler = logging.StreamHandler(stream=sys.stdout)
    console_handler.setFormatter(formatter)
    if not azure_logger.handlers:
        azure_logger.addHandler(console_handler)
    if not uamqp_logger.handlers:
        uamqp_logger.addHandler(console_handler)

    if filename:
        file_handler = RotatingFileHandler(
            filename, maxBytes=5 * 1024 * 1024, backupCount=2
        )
        file_handler.setFormatter(formatter)
        azure_logger.addHandler(file_handler)

    return azure_logger


logger = get_logger("/tmp/ehub_debug.log", level=logging.DEBUG)


@pytest.fixture
async def client() -> Iterator[EventHubProducerClient]:
    client = new_client()
    yield client
    await client.close()


async def recycle_client(client: EventHubProducerClient):
    try:
        await client.close()
    except Exception:
        print("Could not close old client cleanly. Setting to None")
    return new_client()


def new_client():
    eventhub_uri = os.environ["EVENTHUB_URI"]
    if "test-01" not in eventhub_uri:
        raise Exception("Send test events to hub test-01")
    return EventHubProducerClient.from_connection_string(eventhub_uri)


async def send_batch(client: EventHubProducerClient, batch: List[Dict]) -> None:
    idx = 0
    while True:
        # Batch is split to max send size Eventhub accepts
        event_data_batch = await client.create_batch()
        for record in batch[idx:]:
            try:
                event_data_batch.add(EventData(json.dumps(record)))
                idx += 1
            except ValueError:
                # Max size reached
                break

        if len(event_data_batch) > 0:
            await client.send_batch(event_data_batch)

        if idx == len(batch):
            break


def random_string(max_length: int) -> str:
    length = random.randint(1, max_length)
    letters = string.printable
    return "".join(random.choice(letters) for i in range(length))


def create_batch(rows: int, field_count: int) -> List[Dict]:
    fields = [f"field_{i}" for i in range(field_count)]
    batch = []
    for i in range(rows):
        batch.append({field: random_string(10) for field in fields})
    return batch

@pytest.mark.asyncio
async def test_send_large_batch_after_5min_delay_norecycle(client):
    logger.info("Sending first batch...")
    await send_batch(client, create_batch(10, 40))
    logger.info("Sleeping 5 minutes...")
    await asyncio.sleep(300)
    logger.info("Sending second batch...")
    await send_batch(client, create_batch(200, 40))

Log output:

$ pytest -sv tests/integration/eventhubs/test_eventhubs_timeout_issue.py 
========================================================================================================================== test session starts ==========================================================================================================================
platform linux -- Python 3.8.2, pytest-5.4.2, py-1.8.1, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: -
plugins: asyncio-0.12.0, logbook-1.2.0
collected 1 item                                                                                                                                                                                                                                                        

tests/integration/eventhubs/test_eventhubs_timeout_issue.py::test_send_large_batch_after_5min_delay_norecycle 2020-10-12 09:59:58,381 azure.eventhub INFO     Sending first batch...
2020-10-12 09:59:58,392 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.UNKNOWN: 999> to <ConnectionState.START: 0>
2020-10-12 09:59:58,559 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.START: 0> to <ConnectionState.START: 0>
2020-10-12 09:59:58,769 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.START: 0> to <ConnectionState.HDR_SENT: 2>
2020-10-12 09:59:58,821 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.HDR_SENT: 2> to <ConnectionState.HDR_EXCH: 3>
2020-10-12 09:59:58,822 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.HDR_EXCH: 3> to <ConnectionState.OPEN_SENT: 7>
2020-10-12 09:59:58,873 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.OPEN_SENT: 7> to <ConnectionState.OPENED: 9>
2020-10-12 09:59:59,080 uamqp.c_uamqp INFO     CBS for connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' completed opening with status: 0
2020-10-12 09:59:59,183 uamqp.c_uamqp INFO     Token put complete with result: 0, status: 202, description: b'Accepted', connection: b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55'
2020-10-12 09:59:59,236 uamqp.sender INFO     Message sender b'sender-link-d21a54f5-6bd1-478f-b6a3-0572d323debb' state changed from <MessageSenderState.Idle: 0> to <MessageSenderState.Opening: 1> on connection: b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55'
2020-10-12 09:59:59,341 uamqp.sender INFO     Message sender b'sender-link-d21a54f5-6bd1-478f-b6a3-0572d323debb' state changed from <MessageSenderState.Opening: 1> to <MessageSenderState.Open: 2> on connection: b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55'
2020-10-12 09:59:59,456 azure.eventhub INFO     Sleeping 5 minutes...
2020-10-12 10:04:59,513 azure.eventhub INFO     Sending second batch...
2020-10-12 10:04:59,566 uamqp.c_uamqp INFO     b'Failure: sending socket failed. errno=104 (Connection reset by peer).' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/socketio_berkeley.c':b'socketio_send':884)
2020-10-12 10:04:59,566 uamqp.c_uamqp INFO     b'Error in xio_send.' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'write_outgoing_bytes':641)
2020-10-12 10:04:59,566 uamqp.c_uamqp INFO     b'Error in write_outgoing_bytes.' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'tlsio_openssl_send':1374)
2020-10-12 10:04:59,566 uamqp.c_uamqp INFO     b'xio_send failed' (b'/data/src/vendor/azure-uamqp-c/src/saslclientio.c':b'saslclientio_send_async':1190)
2020-10-12 10:04:59,566 uamqp.c_uamqp INFO     b'Cannot send encoded bytes' (b'/data/src/vendor/azure-uamqp-c/src/connection.c':b'on_bytes_encoded':268)
2020-10-12 10:04:59,566 uamqp.connection INFO     Connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' state changed from <ConnectionState.OPENED: 9> to <ConnectionState.END: 13>
2020-10-12 10:04:59,567 uamqp.connection INFO     Connection with ID b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55' unexpectedly in an error state. Closing: False, Error: None
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     CBS error occurred on connection b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55'.
2020-10-12 10:04:59,567 uamqp.sender INFO     Message sender b'sender-link-d21a54f5-6bd1-478f-b6a3-0572d323debb' state changed from <MessageSenderState.Open: 2> to <MessageSenderState.Idle: 0> on connection: b'EHProducer-943e27bd-796f-4e19-9d75-c14ac4c62c55'
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Connection not open' (b'/data/src/vendor/azure-uamqp-c/src/connection.c':b'connection_encode_frame':2048)
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Failed session send transfer' (b'/data/src/vendor/azure-uamqp-c/src/link.c':b'link_transfer_async':1525)
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Invalid argument (list=(nil), item=0x242fe00)' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/src/singlylinkedlist.c':b'singlylinkedlist_remove':108)
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Error removing pending delivery from the list' (b'/data/src/vendor/azure-uamqp-c/src/link.c':b'link_transfer_async':1528)
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Error in link transfer' (b'/data/src/vendor/azure-uamqp-c/src/message_sender.c':b'send_one_message':544)
2020-10-12 10:04:59,567 uamqp.c_uamqp INFO     b'Error sending message' (b'/data/src/vendor/azure-uamqp-c/src/message_sender.c':b'messagesender_send_async':916)
Fatal Python error: Segmentation fault

Current thread 0x00007fea19c27740 (most recent call first):
  File "/home/janne/.local/lib/python3.8/site-packages/uamqp/async_ops/sender_async.py", line 157 in send_async
  File "/usr/lib/python3.8/asyncio/events.py", line 81 in _run
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1859 in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570 in run_forever
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603 in run_until_complete
  File "/home/janne/.local/lib/python3.8/site-packages/pytest_asyncio/plugin.py", line 155 in inner
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/python.py", line 1477 in runtest
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 244 in from_call
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_runtest_hook
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/main.py", line 247 in _main
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/janne/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/janne/.local/lib/python3.8/site-packages/_pytest/config/__init__.py", line 124 in main
  File "/home/janne/.local/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

@yunhaoling
Copy link
Contributor

@jannekeskitalo , thanks for sharing the logging and your platform information, it's super helpful!
This is actually a segmentation fault trigger by the uamqp library, can you also kindly share the uamqp version installed in your python environment? It would help us locate the issue.

@jannekeskitalo
Copy link
Author

can you also kindly share the uamqp version installed in your python environment? It would help us locate the issue.

$ pip freeze | grep uamqp
uamqp==1.2.11

@yunhaoling
Copy link
Contributor

hey @jannekeskitalo , could you try update the uamqp to v1.2.14 via pip install uamqp --upgrade.
we recently has updated the whole sub C module of uamqp including several fixes which might resolve the issue.

let me know if it still crashes on your machine.

@yunhaoling yunhaoling added this to the [2021] March milestone Feb 11, 2021
@yunhaoling
Copy link
Contributor

yunhaoling commented Mar 11, 2021

apologize @jannekeskitalo for not getting to you sooner, it was until recently that I have time to look into the problem.

with your help I'm now able to recognize the pattern of the seg fault and locate the part of the code which might be causing the problem.

The pattern is that:

  1. The producer keeps inactive for > 300s, which lead to the underlying socket io getting completely lost
  2. send a large message will be split into several amqp frames.

I have written some code with pure uamqp to better illustrate the issue:

import uamqp
from uamqp import authentication
from datetime import datetime
import time
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s')

live_eventhub_config = {...}

uri = "sb://{}/{}".format(live_eventhub_config['hostname'], live_eventhub_config['event_hub'])
sas_auth = authentication.SASTokenAuth.from_shared_access_key(
    uri, live_eventhub_config['key_name'], live_eventhub_config['access_key'])

target = "amqps://{}/{}".format(live_eventhub_config['hostname'], live_eventhub_config['event_hub'])

send_client = uamqp.SendClient(target, auth=sas_auth, debug=True)
send_client.open()
while not send_client.client_ready():
    send_client.do_work()
print(datetime.now(), "send client is opened")

print(datetime.now(), 'start sleep')
time.sleep(350)
# sleep until the underlying socket io is completely lost
# On windows, the socket io reports "Failure: sending socket failed 10054."
# On linux, the socket io reports "sending socket failed. errno=104 (Connection reset by peer)."
print(datetime.now(), 'end sleep')

# big message will be split into multiple amqp frames which goes into an execution path
# different than a small message (which is composed of just one frame)
# see code: https://github.com/Azure/azure-uamqp-c/blob/master/src/session.c#L1532-L1676
message = uamqp.Message(
    b't'*1024*700
)

# seg fault happens
send_client.send_message(message)

send_client.close()
print(datetime.now(), "send client is closed")

I further investigate into the C code to see how to resolve the issue, will keep posting updates here.

@yunhaoling yunhaoling modified the milestones: [2021] March, [2021] April Mar 11, 2021
openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-python that referenced this issue Apr 1, 2021
Merge Dev-containerservice-microsoft.containerservice-2021-03-01 API to master  (Azure#13737)

* Adds base for updating Microsoft.ContainerService from version stable/2021-02-01 to version 2021-03-01

* Updates readme

* Updates API version in new specs and examples

* fix the top swagger offenses caused by privateLinkResources, enableCSIProxy and OSDiskSize range (Azure#13349)

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Add enableFIPS property to 2021-03-01 API in dev 2021-03-01 branch (Azure#13359)

* fix the top swagger offenses caused by privateLinkResources, enableCSIProxy and OSDiskSize range

* Add enableFIPS property to the 2021-03-01 API

* Add enableFIPS property to the 2021-03-01 API

Co-authored-by: Charlie Li <charlili@microsoft.com>

* adding node-image upgrade channel enum (Azure#13375)

Co-authored-by: Charlie McBride <Charlie.McBride@microsoft.com>

* Add httpProxyConfig specs to 2021-03-01 API - dev (Azure#13410)

* Add httpProxyConfig specs to 2021-03-01 API - dev

* Fix prettier

* aks: add `bindingSelector` to managed cluster pod identity profile (Azure#13399)

* AKS runCommand new feature  (Azure#13420)

* runCommand target to 03-01

* fix api-version in examples

* Add property disableLocalAccounts to 2021-03-01 API version (Azure#13385)

* Add property disableLocalAccounts

* Fix indentation

* Add OSSKU and GPUInstanceProfile to containerservice/microsoft.container API 2021-03-01 (Azure#13532)

* Add OSSKU and GPUInstanceProfile

* Add sample for OSSKU and GPUInstanceProfile

* Fix custom words

* Reference sample files

* update custom-words

* Update readme for sdks  (Azure#13515)

* update readme for sdks

* update readme for sdks

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Add example to create agent pool with FIPS enabled (Azure#13557)

* add example to create agent pool with FIPS enabled

* change a property name in the example file

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Added extended location parameter to managed cluster which will target all agent pools to that extended location. (Azure#13560)

* Add data model for get OS options. (Azure#13619)

* Add data model for get OS options.

* Add default resource name in API path.

* Fix container service tags. (Azure#13739)

Co-authored-by: Tongyao Si <tosi@microsoft.com>
Co-authored-by: Charlie Li <charlili@microsoft.com>
Co-authored-by: Charlie McBride <33269602+charliedmcb@users.noreply.github.com>
Co-authored-by: Charlie McBride <Charlie.McBride@microsoft.com>
Co-authored-by: Bo Wang <61758179+bowang-666@users.noreply.github.com>
Co-authored-by: hbc <bcxxxxxx@gmail.com>
Co-authored-by: Haitao Chen <haitch@users.noreply.github.com>
Co-authored-by: Tony Xu <tonyxu@microsoft.com>
Co-authored-by: Mirza Sikander <mirsik@microsoft.com>
Co-authored-by: Jun Sun <33297523+JunSun17@users.noreply.github.com>
openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-python that referenced this issue Apr 1, 2021
Merge Dev-containerservice-microsoft.containerservice-2021-03-01 API to master  (Azure#13737)

* Adds base for updating Microsoft.ContainerService from version stable/2021-02-01 to version 2021-03-01

* Updates readme

* Updates API version in new specs and examples

* fix the top swagger offenses caused by privateLinkResources, enableCSIProxy and OSDiskSize range (Azure#13349)

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Add enableFIPS property to 2021-03-01 API in dev 2021-03-01 branch (Azure#13359)

* fix the top swagger offenses caused by privateLinkResources, enableCSIProxy and OSDiskSize range

* Add enableFIPS property to the 2021-03-01 API

* Add enableFIPS property to the 2021-03-01 API

Co-authored-by: Charlie Li <charlili@microsoft.com>

* adding node-image upgrade channel enum (Azure#13375)

Co-authored-by: Charlie McBride <Charlie.McBride@microsoft.com>

* Add httpProxyConfig specs to 2021-03-01 API - dev (Azure#13410)

* Add httpProxyConfig specs to 2021-03-01 API - dev

* Fix prettier

* aks: add `bindingSelector` to managed cluster pod identity profile (Azure#13399)

* AKS runCommand new feature  (Azure#13420)

* runCommand target to 03-01

* fix api-version in examples

* Add property disableLocalAccounts to 2021-03-01 API version (Azure#13385)

* Add property disableLocalAccounts

* Fix indentation

* Add OSSKU and GPUInstanceProfile to containerservice/microsoft.container API 2021-03-01 (Azure#13532)

* Add OSSKU and GPUInstanceProfile

* Add sample for OSSKU and GPUInstanceProfile

* Fix custom words

* Reference sample files

* update custom-words

* Update readme for sdks  (Azure#13515)

* update readme for sdks

* update readme for sdks

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Add example to create agent pool with FIPS enabled (Azure#13557)

* add example to create agent pool with FIPS enabled

* change a property name in the example file

Co-authored-by: Charlie Li <charlili@microsoft.com>

* Added extended location parameter to managed cluster which will target all agent pools to that extended location. (Azure#13560)

* Add data model for get OS options. (Azure#13619)

* Add data model for get OS options.

* Add default resource name in API path.

* Fix container service tags. (Azure#13739)

Co-authored-by: Tongyao Si <tosi@microsoft.com>
Co-authored-by: Charlie Li <charlili@microsoft.com>
Co-authored-by: Charlie McBride <33269602+charliedmcb@users.noreply.github.com>
Co-authored-by: Charlie McBride <Charlie.McBride@microsoft.com>
Co-authored-by: Bo Wang <61758179+bowang-666@users.noreply.github.com>
Co-authored-by: hbc <bcxxxxxx@gmail.com>
Co-authored-by: Haitao Chen <haitch@users.noreply.github.com>
Co-authored-by: Tony Xu <tonyxu@microsoft.com>
Co-authored-by: Mirza Sikander <mirsik@microsoft.com>
Co-authored-by: Jun Sun <33297523+JunSun17@users.noreply.github.com>
@yunhaoling
Copy link
Contributor

yunhaoling commented Apr 7, 2021

hey @jannekeskitalo, thanks for your patience! we have fixed the issue in azure-eventhub 5.4.0.
please update to the latest version via pip install azure-eventhub --upgrade.
(If you're interested, the root cause lies in uamqp, and analysis could be found here: Azure/azure-uamqp-python#217 (comment))

I'm closing this now, feel free to reopen if you're still encountering the issue, thanks!

@yunhaoling
Copy link
Contributor

hey @jannekeskitalo , hope you are keeping well!

I'm excited to announce that we redesigned our Python EventHubs SDK to not rely on any C code anymore, it’s now entirely implemented in Python (5.8.0a3 - pypi). We’d love to get your feedback and perspective on this new version, as we believe you are using this SDK in an environment that would benefit from the changes.

Here are some highlights of the SDK :
• Stability and robustness, eliminating any risks of C related memory leak or segmentation fault errors.
• Full multi-platform support (ARM processors like M1, etc.)
• Improved sending and receiving performance.

To get started, just install the wheel by command pip install azure-eventhub==5.8.0a3 in your Python environment and you're all set to go with your applications. This package is fully backward compatible, and no changes are needed in your application.

We’re eager to hear what you think. Please reply to this thread at any time with feedback or any questions you have for us! We’re all ears.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that uAMQP
Projects
None yet
Development

No branches or pull requests

4 participants