MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Stressing USB3 affects H264 encoding performance (Update).

Thu Apr 01, 2021 7:14 am

Hi.

Wondering if somebody may be able to explain the Pi bus architecture to me. In particular, how LAN, USB3<--->memory (DMA), and GPU (H264 encoder)<--->memory accesses are interleaved on the various bus(es) within a Pi 4B. The pipeline is a USB3 camera providing images, compression on the GPU, then streaming over wired LAN.

Are there any publicly available documents (or information that anybody could kindly provide) that specify/document what the timings/limits are?

Are there any bus interleaving/timing policies implemented on the Pi that can be set?

Thanks in advance.
Last edited by MarkDarcy on Tue Apr 20, 2021 9:36 am, edited 1 time in total.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance.

Tue Apr 20, 2021 9:34 am

As nobody was able to offer any information given the question as originally asked, I have provided more detail regarding the usage scenario. I hope the extra information proves useful. I may not be able to provide any further details as the project is commercially related; it will depend upon what I'm asked. Again, if anybody can offer any explanation I would be grateful.

Hardware is a Pi 4B, Buster Lite (i.e., headless), HDMI ports disabled, no keyboard/mouse, 1 USB-3 camera, 1 USB-2 serial cable (tty), not overclocked (but force_turbo is 1), wireless/bluetooth disabled, wired LAN connected.

Code: Select all

$ uname -a
Linux raspberrypi 5.4.51-v7l+ #1333 SMP Mon Aug 10 16:51:40 BST 2020 armv7l GNU/Linux
I have developed a technique in C for invoking multiple H.264 encoders on the GPU and performing parallel encoding. A test program confirmed that N encoders, each fed frames at 30fps, can encode at 30*N frames/sec. This was done by creating N encoder instances, then taking a physical 30fps stream from a camera (V4L2 mmap'd buffers, 1080p packed YUV422 ⇛ 995 Mbit/sec) and submitting each physical frame to each of the N encoder instances.

I successfully ran 1080p at N == 8 (i.e., 8 encodings per frame ⇛ 240fps). The resulting encoded output was successfully streamed using a bespoke protocol (3.5 Mbps/encoder).

However, when I then attempted to stream from the camera at 30*N fps, feeding each physical frame to the encoders in a "round-robin" style so as not to stress any encoder over 30fps, the overall throughput dramatically fell after USB load exceeded around 850 Mbit. This 850 Mbit number was derived from the following tests:

Code: Select all

                 |   |  USB load (Mbit)  |     Encoding     |   Encoder    |
      Input      | N | per frame | total | Throughput (fps) | Input (Mbit) | Notes
-----------------+---+-----------+-------+------------------+--------------+-------
1080p   @  30fps | 1 |   33.18   |  995  |        30        |      995     | Ok
1080p   @  60fps | 2 |   33.18   | 1991  |        30        |      995     | Slow
 720p   @  60fps | 2 |   14.75   |  885  |       ~57.5      |      848     | Slow
640x480 @ 120fps | 4 |    4.92   |  590  |       120        |      590     | Ok
640x480 @ 240fps | 8 |    4.92   | 1180  |      ~175.5      |      863     | Slow
Note: image format packed YUV422 in all cases. Observed CPU usage during all runs was ~15%.

USB line speed tests confirmed I could stream the camera's limit of 1080p @ 90fps (packed YUV422 ⇛ 2.99 Gbit). Thus neither running the USB line at over 1 Gbit nor the Pi's ability to keep up with it appears to be the problem.

The network output is only 3.5*N Mbps which, being numerically small compared to the 1 Gbit the PHY can run at, doesn't appear to be interfering with memory throughput.

No undervoltage warnings appear in syslog during encoding so it appears the GPU isn't momentarily halting/slowing down.

Having read up briefly on the Pi 4's DRAM bandwidth (e.g., this previous discussion) I understand that the "worst-case maximum" total memory bus activity is about 4GB/sec (32 Gbit). Using a naive per-frame memory lifetime model of:

Code: Select all

USB Read -> DMA Store -> RAM read -> GPU write -> encode -> GPU read -> NET write
(1 frame)   (1 frame)    (1 frame)   (1 frame)      ???     (3.5 Mbps)  (3.5 Mbps)
i.e., approximately 4 frame moves per frame processed. At 1080p (33.18 Mbit/frame) this gives a theoretical maximum of ~240fps if memory access were at saturation (and "???" were zero).

Furthermore, if I take my original test program's confirmed throughput derived from 1080p @ 30fps and N == 8, the per-frame memory lifetime model was:

Code: Select all

USB Read -> DMA Store -> 8 x [[ RAM read -> GPU write -> encode -> GPU read -> NET write ]]
(1 frame)   (1 frame)           (1 frame)   (1 frame)      ???     (3.5 Mbps)  (3.5 Mbps)
which gave an actual memory bandwidth of (597 Mbit x 30 times/sec) + NET ⇛ ~18 Gbit. In addition to being well under the 32 Gbit practical limit, it also appears to show that the GPU is comfortably able to cope with handling I/O with respect to all encoders simultaneously while they're running, so neither GPU I/O nor memory bandwidth appear to be the problem.

And yet, it can't do 640x480 @ 240fps in the above table, which has only around 4.71 Gbit of memory-related bus utilisation; just ~15% of the 32 Gbit limit. Regardless of camera frame rate or encoder instances, around 850 Mbit USB loading is the maximum.

So, does anybody know what the cause of the cliff at 850 Mbit USB load might be? Is there any per-peripheral "bus bandwidth reservation" policy? Is there any bus activity during GPU encoding (the "???" above)?

Thanks in advance.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 11:38 am

A couple of issues that you may not be aware of.

- I'm amazed you can get 8 1080p30 streams encoding simultaneously, as the H264 block is specified for 1 1080p30 (level 4.0) stream.

- The front end of the encoder uses the ISP block to convert whatever format you care to throw at it into the internal YUV420 based format that the H264 blocks use, so another read of YUV422 and write of YUV420.

- UVC isn't as simple as you might hope. It is presented with USB Request Buffers (URBs) which contain fragments of the overall video frame, and the kernel has to memcpy (using the ARM cores) the video data from the URB to the V4L2 buffer. On a previous project I actually hit the limit where this memcpy was slow enough that it resulted in the USB subsystem running out of URBs and dropping USB3 packets, and this was on an x86 processor so not really underpowered. You may need to mess with cache flushing because the CPU is doing this memcpy.
Memory says it's the copy at https://elixir.bootlin.com/linux/latest ... eo.c#L1115 that's of note.

You don't say if you're using DMABUF or not for your V4L2 nodes, or even which API you're using for the encoder.
Dmabufs are supported on the V4L2 M2M encoder and avoid a memcpy from the V4L2 buffer from UVC to the encoder buffer. You will need to allocate from the encoder and import within UVC, as the encoder requires physically contiguous buffers whilst UVC is more flexible (mainly due to that memcpy). In theory you could use MMAP on the encoder and USERPTR on UVC, but that causes even more headaches pinning pages and remapping page tables.
If using MMAL, then there are ways of using vcsm to import a contiguous dmabuf for use by the VPU, otherwise it will be doing a copy of the data to get it from ARM memory to GPU memory.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 2:01 pm

Thanks for the suggestions.
- I'm amazed you can get 8 1080p30 streams encoding simultaneously, as the H264 block is specified for 1 1080p30 (level 4.0) stream.
As a result of our previous discussions on threading and deadlock, I was able to craft a very lightweight and highly-parallelised OpenMAX-based implementation. Luckily, it also turned out that the way the H.264 block is implemented via OpenMAX is sufficiently clean and modular that multiple encoders can be created and driven in parallel once deadlocking and other API inconsistencies have been taken care of in application code. This behaviour might be accidental but it's reliable and predictable.

"Slippery" multi-threaded algorithms are my thing. Always have been ;-)
You don't say if you're using DMABUF or not for your V4L2 nodes, or even which API you're using for the encoder.
V4L2 to the camera is opened, then mmap() is called with MAP_SHARED to allocate the buffers. On the OpenMAX side, I allocate buffers using vcos_malloc_aligned() with the OMX-suggested alignment set on the port. On each V4L2 frame de-queued I then memcpy() from the received MMAP buffer to the VCOS buffer and then call OMX_EmptyThisBuffer(). All standard OMX stuff. The copy+empty takes ~15ms per frame per encoder for 1080p packed YUV422. I'm not specifically implementing any DMA transfer logic at application level; I trust that V4L2 or OpenMAX uses it at its discretion.
- The front end of the encoder uses the ISP block to convert whatever format you care to throw at it into the internal YUV420 based format that the H264 blocks use, so another read of YUV422 and write of YUV420.
Understood. So, naive frame memory lifetime model would now look like this?

Code: Select all

USB Read -> DMA Store -> RAM read -> GPU write -> YUV422 Read -> YUV420 Write -> encode -> GPU read -> NET write
(1 frame)   (1 frame)    (1 frame)   (1 frame)     (1 frame)      (1 frame)        ???     (3.5 Mbps)  (3.5 Mbps)
Even at six frame moves per frame, 640x480 @ 240fps would only come to about 7 Gbit of bus traffic; comfortably below the 32 Gbit theoretical maximum. Additionally, the test program which read USB at 30fps but then cycled each frame through N encoders did not have any trouble driving the encoders with that amount of data.

In terms of USB loading, I have managed to receive USB at 3 Gbit and simultaneously transmit that raw data back out over the LAN at 1 Gbit while buffering in RAM (i.e., LAN Tx time == 3 x USB receive time). In this case, V4L2 worked fine, no USB stalls, all OK. It's just when I try to go via the GPU that this USB "throttle" appears to kick in.

Clutching at straws here, but are there any throttles on USB while the GPU is running? For example, the AXI bus arbitration policy between GPU and USB when both are simultaneously moving data around? In my particular use case, data will be arriving on USB while the GPU is encoding. It is not a "read -> encode -> read -> encode -> ... etc." serialised processing model; USB and GPU data activity will be interleaved on the bus. Might this be the cause of what is being observed?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 2:52 pm

You can run multiple encodes simultaneously, but I'm surprised that your overall throughput was significantly above about 120MPix/s in total (1080p60/Level 4.2 can be achieved if everything aligns).

My tot up of memory transactions is:
- USB write (URB)
- CPU read (URB) CPU write (V4L2 buffer) - uvcvideo memcpy
- CPU read (V4L2 buffer) CPU write (OMX buffer) - app memcpy
- DMA read (OMX buffer) DMA write (gpu_mem buffer) - ILCS/VCHI
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
- H264 write (encoded data) - video_encode
- DMA read (encoded data gpu_mem) DMA write (encoded data ARM mem) - ILCS/VCHI
- CPU read (encoded data) and does something with it.

OK it's spread across multiple hardware blocks, but I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
DMAbufs (not available with IL) would allow you to get rid of the app memcpy and the DMA copy in ILCS (IL Component Service)/VCHI (VideoCore Host Interface). They are the Linux kernel thing for sharing raw memory allocations between multiple kernel subsystems in a zero copy manner.

All peripherals hang on largely the same AXI bus. There are arbiter priorities that can be tweaked if really needed, but it's not something that you can do on a generic system. It also becomes a real balancing act over setting the AXI priorities at which different peripherals panic, what triggers those panics, and a load of other stuff. I won't claim it is totally optimally tuned, but it covers most use cases.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 21, 2021 10:33 am

You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
It's good to hear that it is possible in principle, as this was one of the key features I have been aiming for. The 12 memory transactions you detailed also help a great deal in understanding if and where optimisations can be made. Actually, in your explanation you said
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
These are normal memcpy()'s and not DMA transfers, right?

I've been working under the assumption that all 12 frame copy operations will be happening at "n-bits per clock" on the AXI bus. However, this will probably only happen for DMA transfers. If not all memory transactions within the ISP block are DMA, that would obviously cause a CPU-side slow-down.

Would you happen to know off the top of your head what the speed difference is between a DMA transfer and the C-runtime memcpy() implementation?

However, what might this have to do with USB being active? When USB transfers are not happening during encoding, the slowdown doesn't occur...
... I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
Incidentally, where in your memory transaction list is the YUV422 -> YUV420 conversion performed? If my application can supply YUV420 directly to the ISP block (e.g., by sourcing a suitable camera), will it save a couple of copy operations?
There are arbiter priorities that can be tweaked if really needed, but it's not something that you can do on a generic system.
Understood.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 21, 2021 11:11 am

MarkDarcy wrote:
Wed Apr 21, 2021 10:33 am
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
These are normal memcpy()'s and not DMA transfers, right?
The ISP (Image Sensor Pipeline) and H264 encoder are hardware blocks, so AXI masters. They are pretty optimised to be making efficient AXI burst requests.
MarkDarcy wrote:I've been working under the assumption that all 12 frame copy operations will be happening at "n-bits per clock" on the AXI bus. However, this will probably only happen for DMA transfers. If not all memory transactions within the ISP block are DMA, that would obviously cause a CPU-side slow-down.

Would you happen to know off the top of your head what the speed difference is between a DMA transfer and the C-runtime memcpy() implementation?
Sorry, no idea.
MarkDarcy wrote:However, what might this have to do with USB being active as when USB transfers are not happening during encoding the slowdown doesn't occur...?
... I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
Incidentally, where in your memory transaction list is the YUV422 -> YUV420 conversion performed? If my application can supply YUV420 directly to the ISP block (e.g., by sourcing a suitable camera), will it save a couple of copy operations?
It's done in the ISP as part of the video_encode component.
The H264 blocks need the frames in a weird column format(*), and also a second 2x2 subsampled version of the image to do a coarse motion search on. The ISP can produce both these images efficiently, and there isn't an easy way to configure the outside world to produce and pass in this pair of images simultaneously.

(*) If you divide your image into 128 column wide strips with both the luma and respective U/V (NV12) interleaved chroma, and then glue these strips together end on end, that's about right. The subsampled image is either planar or a similar column format but 32 pixels wide. Cleverer people than me designed it for optimised SDRAM access patterns.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 2:08 pm

Thanks for your reply. There are some other things I wanted to ask but they rely on me doing some more tests and unfortunately I wasn't able to grab the time today. I'll ask again next week if that's OK.

There was one quick thing...
You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
Is this theoretical maximum confirmed via a USB path, or is it only confirmed for the CSI path? It may be possible with the Pi camera, but doesn't that feed data via CSI directly into the ISP, probably saving four of the 12 copy operations in our frame memory lifetime model...?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 2:22 pm

MarkDarcy wrote:
Fri Apr 23, 2021 2:08 pm
There was one quick thing...
You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
Is this theoretical maximum confirmed via a USB path, or is it only confirmed for the CSI path? It may be possible with the Pi camera, but doesn't that feed data via CSI directly into the ISP, probably saving four of the 12 copy operations in our frame memory lifetime model...?
Only with the legacy camera stack.

When using the legacy camera stack with MMAL (MMAL_ENCODING_OPAQUE) or IL tunnels, the ISP processing step of taking the Bayer image (that has been received over CSI2 and stored in SDRAM) also produces the two versions of the image that the H264 block requires. You therefore only have:
- CSI2 rx: write Bayer image
- ISP: read Bayer image
- ISP: write pair of YUV420 images
- H264 read pair of YUV420 images and reference frame
- H264 write reference frame.
- H264 write encoded bitstream

Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.

1080p50 YUYV (422) has been tested from a TC358743 HDMI to CSI2 bridge chip, and I believe that did keep up on Pi4. I don't remember trying 1080p60 as that needs the 4 lane version of the bridge board (which I have, but have never tried in that mode).
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

cleverca22
Posts: 3789
Joined: Sat Aug 18, 2012 2:33 pm

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 3:25 pm

MarkDarcy wrote:
Thu Apr 01, 2021 7:14 am
Wondering if somebody may be able to explain the Pi bus architecture to me. In particular, how LAN, USB3<--->memory (DMA), and GPU (H264 encoder)<--->memory access are interleaved on the various bus(es) within a Pi 4B. It's a USB3 camera providing images, compress on GPU, then stream over wired LAN.

Code: Select all

root@pi400:~# grep axi /boot/config.txt 
dtparam=axiperf
root@pi400:~# cd /sys/kernel/debug/raspberrypi_axi_monitor
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat VPU/data 
     Bus   |    Atrans    Atwait      AMax    Wtrans    Wtwait      WMax    Rtrans    Rtwait      RMax
======================================================================================================
 VPU1_D_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_D_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_I_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_I_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
  L2_FLUSH |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    DMA_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_D_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_D_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_I_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_I_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    L2_OUT |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    DMA_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     SDRAM |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     L2_IN |        0K        0K        0K        0K        0K        0K        0K        0K        0K
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/data 
     Bus   |    Atrans    Atwait      AMax    Wtrans    Wtwait      WMax    Rtrans    Rtwait      RMax
======================================================================================================
    DMA_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     TRANS |        0K        0K        0K        0K        0K        0K        0K        0K        0K
      JPEG |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_UC |        1K        0K        0K        0K        0K        0K        1K        0K        0K
    DMA_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CCP2TX |      128K        0K        0K        0K        0K        0K     2063K        0K        0K
   MPHI_RX |        0K        0K        0K        0K        0K        0K        0K        0K        0K
   MPHI_TX |        0K        0K        0K        0K        0K        0K        0K        0K        0K
       HVS |        5K        0K        0K        1K        0K        0K        4K        0K        0K
      H264 |        1K        0K        0K        2K        0K        0K        1K        0K        0K
       ISP |        0K        0K        0K        0K        0K        0K        0K        0K        0K
       V3D |        0K        0K        0K        0K        0K        0K        0K        0K        0K
PERIPHERAL |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CPU_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CPU_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# ls System/
data  enable  filter  sample_time
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/enable 
65535
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/filter 
0
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/sample_time 
100
this driver might also be of some use

in its default config, it's reporting transaction counters per destination, measuring each one for 100ms

Code: Select all

root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# echo 11 > System/filter 
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/data 

Monitoring transactions from ISP only
the filter file lets you limit what source increments the counters, so you can then see only reads/writes caused by the ISP for example

the enable file is a bit-mask to not count certain destinations, allowing the samples to update faster (it reads each destination for 100ms, so 16 destinations means 1600ms to update all)
I suspect the driver isn't working 100% correctly on BCM2711 though; the VPU counters aren't working right

the part where it may become useful, is that it can report how many reads/writes are having to wait because the bus was too busy

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 28, 2021 9:55 am

Apologies for the delay in getting back.
cleverca22 wrote: this driver might also be of some use
...
the part where it may become useful, is that it can report how many reads/writes are having to wait because the bus was too busy
Hi, cleverca22, and thanks very much for the tip. Much appreciated. I'm sure it will show something, good or bad...!
6by9 wrote: Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.
Understood. I was aware of the heaviness of YUV422 from the start but the choice of camera dictates it. The only formats lighter than this that are supported as input to video_encode are planar formats and unfortunately the cameras available to me don't output planar YUV, only packed.

As you mention it, I have a 12-bit Bayer camera. However, even though a 12-bit Bayer format appears in the OMX header as a vendor extension format (0x7F000004), it is not available when enumerating the port formats acceptable to the input port of video_encode:

Code: Select all

[0x00000014] OMX_COLOR_FormatYUV420PackedPlanar
[0x7F000007] OMX_COLOR_FormatYVU420PackedPlanar
[0x00000027] OMX_COLOR_FormatYUV420PackedSemiPlanar
[0x7F000008] OMX_COLOR_FormatYVU420PackedSemiPlanar
[0x00000006] OMX_COLOR_Format16bitRGB565
[0x0000000C] OMX_COLOR_Format24bitBGR888
[0x0000000B] OMX_COLOR_Format24bitRGB888
[0x7F000001] OMX_COLOR_Format32bitABGR8888
[0x00000010] OMX_COLOR_Format32bitARGB8888
[0x00000019] OMX_COLOR_FormatYCbYCr
[0x0000001A] OMX_COLOR_FormatYCrYCb
[0x0000001B] OMX_COLOR_FormatCbYCrY
[0x0000001C] OMX_COLOR_FormatCrYCbY
[0x7F000003] OMX_COLOR_FormatYUVUV128
[0x00000017] OMX_COLOR_FormatYUV422PackedPlanar
[0x7F000005] OMX_COLOR_FormatBRCMEGL
By the way, there is a custom colour format the video_encode block supports called OMX_COLOR_FormatYUVUV128 (0x7F000003 in the above dump). Can you explain the layout of this format? Is it the special YUV format you mentioned with the full/reduced images packed together?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 28, 2021 5:14 pm

MarkDarcy wrote:
Wed Apr 28, 2021 9:55 am
6by9 wrote: Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.
Understood. I was aware of the heaviness of YUV422 from the start but the choice of camera dictates it. The only formats lighter than this that are supported as input to video_encode are planar formats and unfortunately the cameras available to me don't output planar YUV, only packed.

As you mention it, I have a 12-bit Bayer camera. However, even though a 12-bit Bayer format appears in the OMX header as a vendor extension format (0x7F000004), it is not available when enumerating the port formats acceptable to the input port of video_encode:
Bayer data normally involves quite significant additional image processing, e.g. white balance, lens shading, denoise, etc. That is what the full ISP component is there for. video_encode just happens to make use of the hardware block for a simple conversion.
MarkDarcy wrote:

Code: Select all

[0x00000014] OMX_COLOR_FormatYUV420PackedPlanar
[0x7F000007] OMX_COLOR_FormatYVU420PackedPlanar
[0x00000027] OMX_COLOR_FormatYUV420PackedSemiPlanar
[0x7F000008] OMX_COLOR_FormatYVU420PackedSemiPlanar
[0x00000006] OMX_COLOR_Format16bitRGB565
[0x0000000C] OMX_COLOR_Format24bitBGR888
[0x0000000B] OMX_COLOR_Format24bitRGB888
[0x7F000001] OMX_COLOR_Format32bitABGR8888
[0x00000010] OMX_COLOR_Format32bitARGB8888
[0x00000019] OMX_COLOR_FormatYCbYCr
[0x0000001A] OMX_COLOR_FormatYCrYCb
[0x0000001B] OMX_COLOR_FormatCbYCrY
[0x0000001C] OMX_COLOR_FormatCrYCbY
[0x7F000003] OMX_COLOR_FormatYUVUV128
[0x00000017] OMX_COLOR_FormatYUV422PackedPlanar
[0x7F000005] OMX_COLOR_FormatBRCMEGL
By the way, there is a custom colour format the video_encode block supports called OMX_COLOR_FormatYUVUV128 (0x7F000003 in the above dump). Can you explain the layout of this format? Is it the special YUV format you mentioned with the full/reduced images packed together?
OMX_COLOR_FormatYUVUV128 is just the full resolution version image in the column-based stripes.

It's only possible to pass the pair of images when using a suitable source component (I believe exclusively the camera), and tunneling with OMX, or using MMAL_ENCODING_OPAQUE.
It's not possible to create suitable buffers from the ARM.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed May 12, 2021 3:29 pm

Hi 6by9,

First, I must apologise for not dropping a line sooner. Something came up for about a week, and then I have been trying to develop a more reliable test environment.

While I have been "away", I managed to revise my test environment so that the encoding is now done on YUV420 planar as opposed to YUYV. This was done by combining the 8-bit greyscale image received from the camera with two statically-prepared U/V planes set to "zero", then submitting the result to OMX as YUV420 planar. This had two effects:

  • It reduced the USB traffic by half compared to YUYV.
  • It reduced the traffic the encoder deals with by 25% compared to YUYV.
Here are the results of the new set of tests. Each test was designed to complete in 20 seconds (so 2400 frames at 120fps and 4800 frames at 240fps). I have included a summary of the old YUYV tests for comparison.

First the previous results for 640x480 YUYV:

Code: Select all

YUYV in (614400 bytes/frame) / YUYV enc (614400 bytes/frame): (640×480×16×11) = 54,067,200 bits/frame

     USB TX RATE (BIT/SEC)   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |           NOTES
  ---------------------------+-------------------------+-------------------+---------------------------
  120fps (  589,824,000 bps) | 120fps (36,864,000 pix) | 6,496,064,000 bps | (stream network)
  240fps (1,179,648,000 bps) | 175fps (53,760,000 pix) | 9,469,760,000 bps | (stream network)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

Next, there are the results for 640x480 8-bit greyscale/YUV420 hybrid:

Code: Select all

GREY in (307200 bytes/frame) / YUV420 enc (460800 bytes/frame): (640×480×8×5)+(640×480×12×6) = 34,406,400 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+-----------------------------
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 4,136,768,000 bps | (no network, no SD card)
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 4,136,768,000 bps | (stream network, no SD card)
  240fps (589,824,000 bps) | 240fps (73,728,000 pix) | 8,265,536,000 bps | (no network, no SD card)
  240fps (589,824,000 bps) | 240fps (73,728,000 pix) | 8,265,536,000 bps | (stream network, no SD card)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

And here are the results for 720x540 8-bit greyscale/YUV420 hybrid:

Code: Select all

GREY in (388800 bytes/frame) / YUV420 enc (583200 bytes/frame): (720×540×8×5)+(720×540×12×6) =  43,545,600 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+-----------------------------
  120fps (373,248,000 bps) | 120fps (46,656,000 pix) | 5,233,472,000 bps | (no network, no SD card)
  120fps (373,248,000 bps) | 120fps (46,656,000 pix) | 5,233,472,000 bps | (stream network, no SD card)
  240fps (746,496,000 bps) | 150fps (58,320,000 pix) | 6,539,840,000 bps | (no network, no SD card)
  240fps (746,496,000 bps) | 145fps (56,376,000 pix) | 6,322,112,000 bps | (stream network, no SD card)
  240fps (746,496,000 bps) | 140fps (54,432,000 pix) | 6,104,384,000 bps | (stream network + SD card)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

The bits/frame calculations above were based on your earlier algorithm of:
6by9 wrote: My tot up of memory transactions is:
- USB write (URB)
- CPU read (URB) CPU write (V4L2 buffer) - uvcvideo memcpy
- CPU read (V4L2 buffer) CPU write (OMX buffer) - app memcpy
- DMA read (OMX buffer) DMA write (gpu_mem buffer) - ILCS/VCHI
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame, H264 write (reference frame) - video_encode
- H264 write (encoded data) - video_encode
- DMA read (encoded data gpu_mem) DMA write (encoded data ARM mem) - ILCS/VCHI
- CPU read (encoded data) and does something with it.
Thus for the original YUYV it is 11 stages at full YUYV, whilst for grey/YUV420-planar it is 5 stages as greyscale and 6 stages as YUV420 planar. Finally, four stages for handling the encoded output per frame simply adds (4 x 2Mbps = 8 Mbps) to the totals in each case.
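As a cross-check, all of the per-frame figures above fall out of a one-line model (a sketch with invented names, not my actual code): every stage moves the whole frame once, so bits/frame = width × height × bits-per-pixel × stages, and a format change mid-pipeline just splits the sum in two.

```c
#include <stdint.h>

/* Per-frame bus-load model from 6by9's stage tally: each stage moves
 * the whole frame once, so bits/frame = W * H * bpp * stages. */
static uint64_t stage_bits(int w, int h, int bpp, int stages)
{
    return (uint64_t)w * h * bpp * stages;
}

/* Examples:
 *   YUYV end-to-end:      stage_bits(640, 480, 16, 11) = 54,067,200
 *   grey in + YUV420 enc: stage_bits(640, 480,  8,  5)
 *                       + stage_bits(640, 480, 12,  6) = 34,406,400
 */
```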

I would be grateful if you let me know if any of my numbers look incorrect.

The SD card writing was so that I could log the AXI bus metrics while the encoding was running. The command used was:

Code: Select all

bash$ while true; do sudo cat /sys/kernel/debug/raspberrypi_axi_monitor/System/data >> /tmp/__log; sleep 0.2; done

You can see from the last table in the 720x540 test above that streaming 2 Mbps to network appeared to reduce the frame rate by about 5 fps, and writing to SD card also reduced the frame rate by about 5fps. When streaming and logging to SD card, a total of 10 fps slowdown was observed.

All of these new tests were done while monitoring the AXI performance counters. Two of these captures I have attached as text files. The first log at 480p was a full speed 240fps capture and there was no slowdown. The second log is a 540p capture and it equates to the last entry in the last table (140fps throughput).

In summary, these new set of tests suggest that:

  1. The theoretical "worst case" maximum memory bandwidth of 32 Gbit/sec is not being approached. The most throughput I have achieved is ~10 Gbps.
  2. The theoretical maximum throughput of the encoder you suggested earlier was 120 Mpixels/sec. I am only managing around half of that, with a peak of 73 Mpixels/sec.
I apologise in advance as this represents a lot of information to take in. However, given this more comprehensive information, if possible, could you please explain:

  1. What might be causing me not to hit peak theoretical performance?
  2. Does the information in the attached AXI performance logs indicate unnecessary waits/delays?

Thanks in advance,
Attachments
AXI-Performance-Logs.zip
(14.34 KiB) Downloaded 9 times

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Thu May 13, 2021 8:43 am

Sorry, but there was a recurring error in that last post. Each of the results tables had this comment:

Code: Select all

Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

This is wrong: each encoder was streaming 2 Mbps, so the total memory bandwidth for all encoders returning the encoded stream to the application layer wasn't 8 Mbps but a multiple of it. At 120fps the Bus Load figures are short by 24 Mbps (three additional encoders' worth), while at 240fps they are short by 56 Mbps (seven additional encoders' worth). In the grand scheme of things it doesn't really change much, but I thought it best you know in case the numbers didn't seem to add up.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Thu May 20, 2021 1:42 pm

Hi,

Did the information I posted last time (in particular the AXI performance monitor logs) yield any new insights?

Thanks in advance,

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 28654
Joined: Sat Jul 30, 2011 7:41 pm

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Jun 11, 2021 9:55 am

Can I ask specifically what you are trying to achieve, and what the current issue is? From the first post you appear to be attempting to encode multiple 1080p30 streams, which is too much for the HW encoder: it has a limit of just over 1080p30. Irrespective of the number of streams, there is a maximum number of pixels per second that the encoder can manage. You can multiplex encodes, but if you exceed the encoder's maximum pixels per second the frame rate will drop.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Working in the Applications Team.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Mon Jun 14, 2021 2:58 am

Thanks for your enquiry.

I have managed to get multiple encoding sessions working. What I am not seeing is anywhere near the throughput that has been cited as being possible for the Pi 4, either in terms of pixels/second processed or memory bandwidth. What I am trying to establish is whether the shortfall in performance is due to insufficient program performance or insufficient hardware performance.

To summarise the thread for you thus far,

1) Theoretical Maximum Encoding Throughput

This I understand to be around 120 megapixels/sec.
6by9 wrote:
Tue Apr 20, 2021 2:52 pm
You can run multiple encodes simultaneously, but I'm surprised that your overall was significantly above about 120MPix/s in total (1080p60/Level 4.2 can be achieved if everything aligns).
2) Theoretical AXI bus maximum memory-bound throughput.

This I understand, in practical terms, to be 4 GB/second (32 Gbit/sec) (source: this discussion, which is based on this article from MagPi).

3) The Objective

120 megapixels/second is a significant amount of throughput, and many resolution/frame-rate combinations should be theoretically possible if my assumptions about the hardware are correct. 1080p@60fps is one such combination, but not one I am interested in. Other combinations that should theoretically fit into 120 Mpixels/sec are 640x480@240fps (73 Mpixels/sec), 720x540@240fps (93 Mpixels/sec), 720p@120fps (110 Mpixels/sec), etc.
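These budgets are simple arithmetic; a throwaway helper like the following (invented name, purely illustrative) reproduces the bracketed figures and confirms each mode sits under the quoted 120 Mpixels/sec:

```c
#include <stdint.h>

/* Pixel-rate budget for a capture mode, to compare against the
 * ~120 Mpix/s figure quoted for the Pi 4 H.264 encoder. */
static uint64_t pix_per_sec(int w, int h, int fps)
{
    return (uint64_t)w * h * fps;
}
```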

I have conducted several tests. Here are the results, repeated from an earlier post, for one of the tests I performed. It's a 720x540@240fps 8-bit greyscale video stream (93 Mpixels) received over USB 3 via V4L2 submitted to the GPU via OMX as planar YUV420.

Code: Select all

GREY in (388800 bytes/frame) / YUV420 enc (583200 bytes/frame): (720×540×8×5)+(720×540×12×6) =  43,545,600 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+------------------------------------
  120fps (373,248,000 bps) | 120fps (46,656,000 pix) | 5,233,472,000 bps | (no network, no SD card access)
  120fps (373,248,000 bps) | 120fps (46,656,000 pix) | 5,233,472,000 bps | (stream network, no SD card access)
  240fps (746,496,000 bps) | 150fps (58,320,000 pix) | 6,539,840,000 bps | (no network, no SD card access)
  240fps (746,496,000 bps) | 145fps (56,376,000 pix) | 6,322,112,000 bps | (stream network, no SD card access)
  240fps (746,496,000 bps) | 140fps (54,432,000 pix) | 6,104,384,000 bps | (stream network + SD card access)

The calculation for the "bits/frame" metric is based on this earlier post. While the encoding is being performed, there is no other significant activity that my program is responsible for. As you can see, 120fps works fine but 240fps is not being achieved. The maximum encoded pixel rate achieved falls far short of the 120 megapixels/sec limit that has been cited as possible, and this does not look to be caused by memory-bandwidth saturation, as the calculated bus load is also far short of the 32 Gbit/sec limit previously cited by the above source(s).

There seems to be a very large discrepancy between the throughput I understand to be achievable and what is actually being achieved. I am trying to establish where the errors are in my assumptions about the hardware or hardware-related drivers (e.g., USB).

Thanks in advance.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11247
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Mon Jun 14, 2021 8:57 am

Theoretical limits can very rarely be reached.

1) My comment about reaching anything up to 1080p60 already refers to exceeding the quoted design spec.
The design spec is 1080p30 (as per the product brief page 3), but there is a significant overhead potentially available, and I seem to recall getting 1080p50 through the TC358743 HDMI to CSI2 bridge, but haven't tested that recently. I thought I'd had 1080p60 too.
720p120 from imx219 was tested on Pi2 or 3, and was achievable with an overclock. There was a thread on it at the time.

2) The key thing with that SDRAM bandwidth analysis is that it is using 1MB blocks, therefore the access patterns to RAM are almost ideal.

When scanning an image for motion estimation, access patterns are far from ideal. Prediction can be from any of the surrounding macroblocks from the previous frame, so for each 16x16 block you're pulling in 48x48 pixels. After each line you're skipping a chunk of memory to get to the same start place on the next line, so the actual contiguous read from memory is 48 bytes, and you now need 48 of them. That's not an ideal access pattern, and you will get frequent page swaps which reduce the bandwidth available from SDRAM.
There is likely to be some minimal caching as the search progresses horizontally across the image, but that then results in only 16 bytes being read off each line for the next macroblock search.
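As a rough illustration of why this matters, here is a toy tally (invented helper, ignores all caching, so it is an upper bound sketch rather than a measurement) of the reference reads this access pattern implies per frame:

```c
#include <stdint.h>

/* Toy model of motion-search reference traffic: each 16x16 macroblock
 * pulls in a 48x48 window, fetched as 48 separate short 48-byte reads
 * because the image stride skips to the next line each time. */
static uint64_t motion_ref_bytes(int w, int h)
{
    uint64_t mbx = (uint64_t)(w + 15) / 16;
    uint64_t mby = (uint64_t)(h + 15) / 16;
    return mbx * mby * 48 * 48;     /* 2304 bytes per macroblock */
}
```

For 720x540 that is ~3.5 MB of short, scattered reads per frame; at 240fps the burst pattern, not the raw byte count, is what eats into the SDRAM bandwidth.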

What needs to be profiled is ensuring that nothing is ever waiting for a buffer to fill except the source (USB in your case). With pipelining you generally want at least 3 buffers on each link, and with higher frame rates generally more to compensate for any latency in thread switching.

You say there is no other significant activity that your program is responsible for. Do you have a display connected, and if so what resolution is it running at? That's a moderate memory bandwidth hit, and because it has to be real-time it has higher AXI arbiter priority.

Hitting 720x540@240 through the encoder I would expect to be achievable at 367200 Macroblocks/sec (level 4.0 being max 245760, and level 4.2 being max 522240). Benchmarking the encoder in isolation would be a recommended first step.
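That 367200 figure is just macroblocks-per-frame times frame rate; as a quick sketch (invented name):

```c
#include <stdint.h>

/* Macroblock throughput for a mode, to compare against the H.264
 * level limits quoted above (level 4.0: 245760 MB/s, 4.2: 522240). */
static uint64_t mb_per_sec(int w, int h, int fps)
{
    uint64_t mbs = (uint64_t)((w + 15) / 16)    /* 16x16 macroblocks */
                 * (uint64_t)((h + 15) / 16);
    return mbs * (uint64_t)fps;
}
```

720x540@240 gives 45 x 34 = 1530 macroblocks/frame, i.e. 367200 MB/s: above level 4.0 but comfortably inside level 4.2.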
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 44
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Jun 15, 2021 10:31 am

You raised a lot of valid points. I would like to get some sort of resolution on each of them before tackling any new points, if that's OK?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
Theoretical limits can very rarely be reached.
2) The key thing with that SDRAM bandwidth analysis is that it is using 1MB blocks, therefore the access patterns to RAM are almost ideal.

When scanning an image for motion estimation, access patterns are far from ideal. Prediction can be from any of the surrounding macroblocks from the previous frame, so for each 16x16 block you're pulling in 48x48 pixels. After each line you're skipping a chunk of memory to get to the same start place on the next line, so the actual contiguous read from memory is 48 bytes, and you now need 48 of them. That's not an ideal access pattern, and you will get frequent page swaps which reduce the bandwidth available from SDRAM.
There is likely to be some minimal caching the search progresses horizontally across the image, but that then results in only 16 bytes being read off each line for the next macroblock search.
This I understood from early on, and I explicitly tested for it. In previous tests there was a small drop (~3%) in encoding performance when shooting a completely still, well-lit scene at 640x480, compared with shooting, at the same resolution/frame rate and under the same lighting, an object oscillating continuously, rapidly and randomly. However, even with the 3% drop, the resulting rate was still comfortably >= 240 fps and therefore not of any concern.

I had therefore already assumed that memory access patterns due to motion vector fluctuations, while causing measurable degradation in performance, were not the cause of the near 50% shortfall in theoretical performance in this instance. Is my assumption still valid?

Incidentally, how are memory access patterns affected when two or more encoders are run in parallel? Are my assumptions about maximum throughput only valid when considering a single encoder instance?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
You say there is no other significant activity that your program is responsible for. Do you have a display connected, and if so what resolution is it running at? That's a moderate memory bandwidth hit, and because it has to be real-time it has higher AXI arbiter priority.
The system I am running is headless: no keyboard, no mouse, no monitor. It is the "lite" version of Raspbian, so no UI-related components are installed. Network access is 20 Mbps (in a separate thread from the capture thread), plus whatever SD card activity the OS is doing (I am doing none myself).

I noted in an earlier post that random SD card access causes delays to occur and the system log (journalctl) seems to be logging messages about FIPS random seed generators, V4L2 status messages, etc. that I can't disable. Could the SD card access be causing delays in the encoder's memory access patterns? Could interrupts from the SD card controller be causing the system to wait?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
Hitting 720x540@240 through the encoder I would expect to be achievable at 367200 Macroblocks/sec (level 4.0 being max 245760, and level 4.2 being max 522240). Benchmarking the encoder in isolation would be a recommended first step.
I agree. I did post some performance counter logs of the AXI bus activity during the above 720x540 run in this previous post. Is there any information in these logs that could point to the source of any potential delay?

Thanks in advance.
