
Continuous Serialization and Image Data Streaming #676

Open
Jake-Carter opened this issue Nov 14, 2023 · 12 comments

@Jake-Carter

Issue template

  • Hardware description: MAX78000 / Custom board
  • RTOS: FreeRTOS
  • Installation type: micro_ros_setup / custom static library
  • Version or commit hash: Humble

Hello - I'm an engineer with Analog Devices and I've been working on micro-ROS support for our microcontrollers, starting with our embedded AI micros. I've completed the port and custom transports following the excellent tutorials, so first off, thank you for the great documentation and project. I would eventually like to open a PR with official part support for our MSDK microcontrollers, and I'm building up a cool demo using an OpenManipulator-X running some custom object detection on our MAX78000.

Let me know if there's a better channel/repo to go through for questions. I couldn't get the Slack channel invite to work.

My current challenge is related to the topic of continuous serialization mentioned at the bottom of this tutorial page.

I'm currently publishing a sensor_msgs/Image message successfully, but the transmissions are very slow since the message is broken up into many packets. I would like some way to continuously stream the image data instead, while still complying with the expected message framing protocol. So...

  1. Is "continuous serialization" what I'm looking for?
  2. The tutorial says the ping_uros_agent example shows an example for continuous serialization of image data, but I don't see it. Are there any examples for this?
  3. I can sort of guess at what the APIs do based on the tutorial, but the API documentation for the microcdr and continuous_serialization modules is somewhat limited. I'm confused about what the ucdr_alignment functions do, and also whether it's possible to stream image data row by row from the serialization callback. In the example below, does writing into the ucdr buffer push the data out to the transport layer?
// Implementation example:
void serialization_cb(ucdrBuffer * ucdr){
    size_t len = 0;
    micro_ros_fragment_t fragment;

    // Serialize array size
    ucdr_serialize_uint32_t(ucdr, IMAGE_BYTES);

    while(len < IMAGE_BYTES){
        // Wait for new image "fragment"
        ...

        // Serialize data fragment
        ucdr_serialize_array_uint8_t(ucdr, fragment.data, fragment.len); // <-- (JC): Does this go out to the transport layer?...
        len += fragment.len;
    }
} // ... or is the data finally sent here, when the callback returns?

I also have some more general suggestions/questions related to some challenges I had in developing the custom transports, and would love to contribute back to the project in any way I can.

Thank you,
Jake

@pablogs9
Member

Hello @Jake-Carter,

> Hello - I'm an engineer with Analog Devices and I've been working on micro-ROS support for our microcontrollers, starting with our embedded AI micros. I've completed the port and custom transports following the excellent tutorials, so first off, thank you for the great documentation and project. I would eventually like to open a PR with official part support for our MSDK microcontrollers, and I'm building up a cool demo using an OpenManipulator-X running some custom object detection on our MAX78000.

Nice to hear that. I'm going to move this internally so we can be in touch.

Is "continuous serialization" what I'm looking for?

Continuous serialization is an advanced feature of the middleware that allows the user to control a multi-stage serialization. It imposes some restrictions on the user's type: the main one is that you need to "remove" the buffer part of your type, which implies modifications to your sensor_msgs/Image type.

I'm not sure this is the most straightforward way; you are probably looking to increase your transport MTU and/or the middleware stream history. How big is your payload?

> I also have some more general suggestions/questions related to some challenges I had in developing the custom transports, and would love to contribute back to the project in any way I can.

I've just accepted you in the micro-ROS Slack, do not hesitate to contact me, open new issues or contribute via pull requests.

@Jake-Carter
Author

Thanks @pablogs9,

The ability to increase the transport MTU is very helpful. My current payload is a 160x120 RGB888 image (57600 bytes). My microcontroller only has 128KB of SRAM, so it's fairly constrained.

FreeRTOS, the micro-ROS library, and my setup code seem to take up about 62KB of SRAM, so I was able to increase the MTU size to 2048 before I started to run out of space. It does improve the speed by the expected factor of 4x over the default, though.
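
For reference, this is roughly the colcon.meta entry I'm using to set it (a sketch; UCLIENT_CUSTOM_TRANSPORT_MTU applies to custom transports, and the serial/UDP transports have their own equivalent flags):

"microxrcedds_client": {
    "cmake-args": [
        "-DUCLIENT_CUSTOM_TRANSPORT_MTU=2048"
    ]
},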

There are some tricks I can do with the CNN accelerator, so this will let me proceed in the short term. In general, are there any disadvantages to extremely large MTU sizes?

I have a couple other questions as well, so I will reach out via Slack.

Thanks for your support.

@pablogs9
Member

Well, sending a payload that is almost 45% of your available memory is always kind of hard. In my experience sending single images over micro-ROS/XRCE is possible, but sending video will require some more resources.

In any case, even using continuous serialization will force you to use best-effort streams, which implies that losing a single fragment will cause the whole frame to be lost.

Before going into continuous serialization, do you have any possibility of compressing to JPEG and sending it via CompressedImage?
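
On the publishing side it would be something along these lines (just a sketch of the message filling; jpeg_buf and jpeg_len are placeholders for your encoder output, and the publisher is assumed to be already initialized):

#include <rcl/rcl.h>
#include <sensor_msgs/msg/compressed_image.h>
#include <rosidl_runtime_c/string_functions.h>

// Sketch: publish one JPEG frame as a sensor_msgs/CompressedImage.
void publish_jpeg(rcl_publisher_t * publisher, uint8_t * jpeg_buf, size_t jpeg_len)
{
    sensor_msgs__msg__CompressedImage msg;
    sensor_msgs__msg__CompressedImage__init(&msg);

    // Declare the encoding and point the data sequence at the encoded bytes.
    rosidl_runtime_c__String__assign(&msg.format, "jpeg");
    msg.data.data = jpeg_buf;
    msg.data.size = jpeg_len;
    msg.data.capacity = jpeg_len;

    rcl_publish(publisher, &msg, NULL);
}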

@Jake-Carter
Author

Jake-Carter commented Nov 23, 2023

Thanks @pablogs9, there are a couple of challenges I found this week:

  • For camera sensors without a built-in JPEG encoder we are reliant on our floating-point engine and the CMSIS-DSP library to implement the compression. This is pretty slow, and it's difficult to run it fast enough to keep up with the camera's data rate.
  • In the past, we have run DCTs inside our CNN accelerator at 60x the speed of CMSIS-DSP, but only for 1D audio. We think it may be possible to implement this for 2D images, but it will take some more research. So at the moment, we don't have a good hardware solution for image compression.
  • All of our CNN models have been trained with uncompressed image data. In general, we haven't tried using compressed inputs but there is some interesting research that shows it could be promising. In the meantime, though, obtaining the uncompressed images is important for us to be able to test the exact data input and collect datasets.

If continuous serialization could offer additional speed improvements, I would be interested in exploring it. From what I've seen so far, there are two sources of the slow speed:

Delays between each MTU

Your guidance on increasing the MTU size helped a lot with this, and I've achieved good results reducing this delay as much as my memory allows. I'm not sure where the associated overhead is coming from, but maybe as I get more familiar with the library I can test it further. It could also be related to my FreeRTOS port and implementation, or just unavoidable small delays from the complexity of the library.

Delay gaps inside each MTU

I'm seeing almost a 1ms delay inside each MTU, and this was more unexpected.

It's happening because the library splits the frame bytes and data bytes into two separate transport calls, but the time between them is longer than I expected. This was one of the main challenges I had while developing the custom transports, since my UART FIFOs are very shallow (only 8 bytes). I ended up implementing a queue to extend my FIFO so I wouldn't miss bytes inside each MTU.

For example, I captured the logic trace below on the RX side while I was developing my transport functions.

  • When the "indicator" signal is high, I am inside my serial transports.
  • When the "indicator" signal is low, my transport functions have exited, and the micro-ROS library has control.
  • Ignore the small blip towards the end of the gap - I used that to measure my DMA setup time.

You can see it's actively waiting for the frame data first. It gets enough bytes and returns (B). The micro-ROS library takes about 800 µs to jump back into the transport read for the rest of the data (A).

[logic trace: indicator signal high inside the transport reads, showing the ~800 µs gap between the frame read and the follow-up data read]

size_t vMXC_Serial_Read (
        struct uxrCustomTransport* transport,
        uint8_t* buffer,
        size_t length,
        int timeout,
        uint8_t* error_code)
{
    TickType_t elapsed = 0;
    const TickType_t xMaxBlockTime = pdMS_TO_TICKS(timeout);

    MXC_GPIO_OutSet(indicator.port, indicator.mask); // <-- A (transition low to high, we have entered the transport)

    unsigned int num_received = 0;
    while(num_received < length && elapsed < xMaxBlockTime) {
        if (uxQueueMessagesWaiting(rx_queue) > 0) {
            if(xQueueReceive(rx_queue, &buffer[num_received], 1)) {
                num_received++;
            }
        }
        elapsed++;
    }

    MXC_GPIO_OutClr(indicator.port, indicator.mask); // <-- B (transition high to low, we are exiting the transport)

    return num_received;
}

So, since there is ~1ms delay inside each MTU and I need thousands of MTUs to transmit the large image data, then I was hoping that the continuous serialization would give me the hooks I need to manually transmit my frame data. That way I could simultaneously eliminate the 1ms delay inside each MTU and the delay between MTUs.

Sorry for the novel :) - just wanted to provide some more context into the challenges I've seen so far with the transmission speed and extremely large messages.

@pablogs9
Member

How are you ensuring that this is not an active wait when there are no messages in the queue:

while(num_received < length && elapsed < xMaxBlockTime) {
    if (uxQueueMessagesWaiting(rx_queue) > 0) {
        if(xQueueReceive(rx_queue, &buffer[num_received], 1)) {
            num_received++;
        }
    }
    elapsed++;
}

I mean, if uxQueueMessagesWaiting(...) == 0, elapsed increments on every iteration rather than on every tick, so this will spin and give up in much less than xMaxBlockTime ticks of real time, right?
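
A truly blocking variant could look like this (a sketch reusing your rx_queue; xQueueReceive itself suspends the task for up to the given number of ticks):

#include "FreeRTOS.h"
#include "queue.h"
#include "task.h"
#include <uxr/client/transport.h>

extern QueueHandle_t rx_queue;  // filled by the RX ISR, as in your code

size_t vMXC_Serial_Read(
        struct uxrCustomTransport * transport,
        uint8_t * buffer,
        size_t length,
        int timeout,
        uint8_t * error_code)
{
    size_t num_received = 0;
    const TickType_t deadline = xTaskGetTickCount() + pdMS_TO_TICKS(timeout);

    while (num_received < length) {
        TickType_t now = xTaskGetTickCount();
        if (now >= deadline) {
            break;  // timeout measured in ticks of real time, not loop iterations
        }
        // Block on the queue until a byte arrives or the remaining time
        // expires, instead of spinning.
        if (xQueueReceive(rx_queue, &buffer[num_received], deadline - now) == pdTRUE) {
            num_received++;
        }
    }

    return num_received;
}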

I also wonder why you are struggling with the reception of packets and serial read operations if your objective is to send an image.

Could you clarify these two points?

@Jake-Carter
Author

> How are you ensuring that this is not an active wait when there are no messages in the queue:

I have my DMA controller constantly unloading the RX FIFO behind the scenes. For every byte, it triggers an ISR that places the received byte in rx_queue.

My full transport implementation can be found here.
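
In simplified form, the ISR side looks something like this (a sketch; the handler name and FIFO helpers are placeholders for the MSDK DMA/UART register accesses):

#include "FreeRTOS.h"
#include "queue.h"

extern QueueHandle_t rx_queue;

// UART RX interrupt: drain the shallow 8-byte hardware FIFO into a FreeRTOS
// queue so bytes survive the gaps where micro-ROS is outside the transport read.
void UART_RX_IRQHandler(void)  // placeholder name for the MSDK vector
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;

    while (uart_rx_available()) {       // placeholder: FIFO status check
        uint8_t byte = uart_rx_read();  // placeholder: FIFO/DMA read
        xQueueSendFromISR(rx_queue, &byte, &xHigherPriorityTaskWoken);
    }

    portYIELD_FROM_ISR(xHigherPriorityTaskWoken);
}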

> I also wonder why you are struggling with the reception of packets and serial read operations if your objective is to send an image.

I have everything working now, but this was something I struggled with a few weeks ago.

I wanted to show the read side because its timing issues caused more critical failures when connecting to the micro-ROS agent. The gap above can cause incoming bytes to be missed, whereas any delays on the TX side will just slow down communication. Also, I only saved a logic capture for the read side.

Today we started a short Thanksgiving break so I will capture a trace during an image transmission as soon as I can next week.

@Jake-Carter
Author

Hi @pablogs9, I have some updated captures that show the two types of delay more clearly. The trace can be opened with Saleae Logic. The zip also includes a -v6 agent log file. The baud rate used is 115200.

adi_micro-ros_tx_image_captures.zip

Delays between each Transport Unit

Here is an image that highlights the delays between each image data packet (I hope "Transport Unit" is the right term here?).

On average it's between 200-300ms per TU.

[logic trace: overview of the image data packets, with 200-300 ms gaps between transport units]

When the red "Indicator" line is high, the code is inside my custom serial write function. Here is a closer look between two TUs.

[logic trace: close-up of the gap between two consecutive TUs, with the indicator line high inside the custom serial write]

Delays inside each Transport Unit

This image shows the delay between the frame and data portions of the transport unit. It's actually worse than the 1ms delay I captured on the RX side, since it looks like the publisher is waiting on a response from the agent for the frame.

The delay originally varied between 1-16ms.

[logic trace: 1-16 ms delay between the frame and data portions of a transport unit]

After I decreased my USB latency timer with echo 1 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer, the variability improved to about 2-3ms.

[logic trace: same view after setting the USB latency timer to 1; variability reduced to ~2-3 ms]

Continuous Serialization?

So - basically I would like to know if continuous serialization would let me bypass most of the framing/packetization requirements for the image data. Ideally I'd like to just send one frame, and then manually serialize the data as I receive it.

Thanks for your support,
Jake

@pablogs9
Member

Hello @Jake-Carter,

> So - basically I would like to know if continuous serialization would let me bypass most of the framing/packetization requirements for the image data. Ideally I'd like to just send one frame, and then manually serialize the data as I receive it.

Continuous serialization will behave the same way: in this mode you provide the serialized data on the fly, but the transport and framing layers are unchanged.

> After I decreased my USB latency timer with echo 1 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer, the variability improved to about 2-3ms.

This detail led me to think that those delays are related to your underlying hardware. Did you perform any tests without micro-ROS?

@Jake-Carter
Author

I see, thanks @pablogs9. Could you provide any guidance on the colcon options for building the micro-ROS library with stream framing disabled?

Is

"microxrcedds_client": {
    "cmake-args": [
        // ...,
        "-DUCLIENT_PROFILE_STREAM_FRAMING=OFF",
        // ...
    ]
},

and

rmw_uros_set_custom_transport(
        // MICROROS_TRANSPORTS_FRAMING_MODE,
        MICROROS_TRANSPORTS_PACKET_MODE, // <-- Use "packet" mode instead of framing when setting custom transports
        (void *)&transport_config,
        vMXC_Serial_Open,
        vMXC_Serial_Close,
        vMXC_Serial_Write,
        vMXC_Serial_Read
    );

sufficient?

> This detail led me to think that those delays are related to your underlying hardware. Did you perform any tests without micro-ROS?

I see the same general ~1ms USB latency even without micro-ROS. We're going through an FTDI USB-serial converter, so I think that's unavoidable. However, the framing protocol itself introduces an additional ~1-2ms per packet just waiting on the header response, and the 200-300ms delay between each packet is definitely from the micro-ROS library.

@pablogs9
Member

> I see, thanks @pablogs9. Could you provide any guidance on the colcon options for building the micro-ROS library with stream framing disabled?

You cannot go on top of a serial port without framing, because the agent needs to "isolate" each XRCE packet. Non-framing mode is meant for transports that already ensure packetization; UDP is an example.

I'm not sure about the implications of this, but it might help to increase the buffer sizes of the framing module; check rb and wb here:

https://github.com/eProsima/Micro-XRCE-DDS-Client/blob/0c6743ffa358f26ca9433e951c534ec2f96be37a/include/uxr/client/profile/transport/stream_framing/stream_framing_protocol.h#L41

I'm not sure if this will have implications on the behavior of the transport.

> I see the same general ~1ms USB latency even without micro-ROS. We're going through an FTDI USB-serial converter, so I think that's unavoidable. However, the framing protocol itself introduces an additional ~1-2ms per packet just waiting on the header response, and the 200-300ms delay between each packet is definitely from the micro-ROS library.

Is your application code available so I can take a look or try to replicate it in another board to check those delay values?

@Jake-Carter
Author

Hi @pablogs9, hope you've been well and had a good start to the new year.

I've been working on an internal beta release of the micro-ROS integration for the MSDK, and have staged things on the dev/micro-ros branch of our repo. I've written an install.py script that installs ROS + micro-ROS and builds the micro-ROS Agent using the micro_ros_setup scripts (documentation here). Maybe it will be useful as a contribution back to the micro-ROS repos in the future.

On this ticket - most of my troubles were coming from a lack of knowledge of the concepts, especially the QoS models. "Best effort" streams match my applications much better. All the delay and jitter seems to come from the Linux side, so eliminating as many message frames as possible works great. For your reference, my app code is available here and the library support files here.
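
Switching was a one-line change at publisher creation (a sketch, assuming an already-initialized node; the topic name is just an example):

#include <rclc/rclc.h>
#include <sensor_msgs/msg/image.h>

// Best-effort publisher: maps onto an XRCE best-effort stream, skipping the
// reliability acknowledgements that were adding per-packet round trips.
rcl_publisher_t publisher;
rcl_ret_t rc = rclc_publisher_init_best_effort(
    &publisher,
    &node,  // an already-initialized rcl_node_t
    ROSIDL_GET_MSG_TYPE_SUPPORT(sensor_msgs, msg, Image),
    "image_raw");  // placeholder topic name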

...

However, I saw failures when I tried to publish an image with best effort and traced them to the stream implementation here. I notice stream->size is set to UCLIENT_CUSTOM_TRANSPORT_MTU, and that best effort does not implement any message fragmentation, so for larger messages it returns an error.

In your experience, would it be possible to implement the same message fragmentation as the reliable streams here, but without the extra XRCE frame headers/confirmations? In my case I would be willing to accept some data loss in favor of the reduced transmission latency.

@pablogs9
Member

Hello @Jake-Carter, nice to hear about your progress. For sure we are interested in having this integrated into the micro-ROS repos.

WRT your question: in XRCE, best-effort streams do not allow fragmentation, so if your payload is an image you need to use reliable streams or configure a buffer big enough that an image fits.
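
If you want to explore the buffer route, a sketch of the sizing: as you observed, the best-effort stream size comes from UCLIENT_CUSTOM_TRANSPORT_MTU, so the value must cover your 57600-byte payload plus serialization and header overhead (whether that fits next to FreeRTOS in 128 KB is another question):

"microxrcedds_client": {
    "cmake-args": [
        "-DUCLIENT_CUSTOM_TRANSPORT_MTU=60000"
    ]
},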
