Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

video: bcm2708_fb: Try allocating on the ARM and passing to VPU #2875

Merged
merged 1 commit into from
Mar 7, 2019

Conversation

6by9
Copy link
Contributor

@6by9 6by9 commented Feb 27, 2019

Firmware update required to actually allow the allocation, but this is backwards compatible with the old firmware (reverts to using the VPU allocation).

}

fb->cpuaddr = dma_alloc_coherent(info->device, image_size,
&fb->dma_addr, GFP_ATOMIC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an atomic (in time) allocation necessary? Would GFP_KERNEL not do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy/paste from the same call in vc_mem_copy.
GFP_KERNEL would probably do (from the limited amount I know of these flags).

I'm just noticing that the call in vc_mem_copy page aligns the size, but I see no particular reason for doing so in the generic case.

@pelwell
Copy link
Contributor

pelwell commented Feb 28, 2019

I'm happy to merge now if you are,

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Give me a few minutes to confirm again that standard image with this updated kernel does still work. I had checked yesterday, but better safe than sorry.

I had also tested new kernel but with the default cma=8M (too small for a 1080P frame buffer), and that correctly reverts back.
I guess we do now have the potential for the new kernel, cma=8MB, and eg a 720P frame buffer (3.6MB) that would boot and allocate the majority of the CMA heap at boot. Other functions could then be compromised. I don't know the best route out of that one. AFAIK there are no calls to check the size of the dma_coherent heap (there is cma_get_size right under the hood), so no easy way to check.
If we're increasing the default cma heap size in the very near future to ditch the reloc heap, then I guess we don't worry too much.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Tested and all seems good.

That only leaves the caveat I mentioned above. I can't see a good solution to that before we extend the CMA heap.

@popcornmix
Copy link
Collaborator

Extend it a little now?

@pelwell
Copy link
Contributor

pelwell commented Feb 28, 2019

Fine with me - is 16MB sufficient?

@popcornmix
Copy link
Collaborator

That's the number I was imagining.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Extend it a little now?

Can do.
I've lost track as to where the default heap size is specified. It's not apparently CONFIG_CMA_SIZE_MBYTES as that is 5 (default is 8), but I also can't see it quickly in device tree.

16MB should be sufficient for most use cases, although still possibly too low if someone is running a 4k display (~32MB for the FB).

@JamesH65
Copy link
Contributor

Multiple FB support or transposed displays or combinations of the two would also require quite a chunk of memory - there is a circumstances where three transpose buffers are needed for the new LCD's (720p * 3) and DPI displays. Is that going to be affected by this change?

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Depends when exactly the transpose buffers are allocated.

This PR is shifting the main frame buffer that the ARM knows about from being a VPU to ARM allocation.
I suspect the transpose buffers are later mem_alloc calls (maybe wrapped as vc_image_relocatable_alloc), in which case those are independent.
Iff the vcsm-cma service has connected, then it will deal with the allocation. Otherwise it will fall back to the reloc heap (what remains of it). We may need to refactor when it does the alloc to ensure that vcsm-cma has connected first - that may be tricky as currently that is at ~7sec after boot, probably because it is probed as a platform driver via the VCHIQ driver coming up.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

So it seems CONFIG_CMA_SIZE_MBYTES=5 gets rounded up to 8 by some magic. Setting it to 16 does result in a 16MiB allocation reported by the kernel.

Are we happy at 16MiB and not worry about those using 4k screens? They should revert to using the reloc heap instead, so a soft fail anyway at present.
Likewise James' transpose buffers aren't an issue for now, but no need to be considered when we are trying to shrink the reloc heap for real.

@pelwell
Copy link
Contributor

pelwell commented Feb 28, 2019

They always have the option of adding cma=32M to their cmdline.txt.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

They always have the option of adding cma=32M to their cmdline.txt.

Very true.
OK, I'll squash the two current patches, and add one patching the defconfigs.

@6by9 6by9 force-pushed the rpi-4.19.y-clean4 branch from d816acb to bd23d63 Compare February 28, 2019 15:37
@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Squashed, and defconfigs amended.
Any further comments, or are we good to go?

@popcornmix
Copy link
Collaborator

They should revert to using the reloc heap instead, so a soft fail anyway at present.

I think the more serious failure is if these succeed but leave CMA with an awkwardly small amount remaining causing later failures.

I wonder if the firmware should indicate whether cma allocation is preferred? (i.e. new firmware are large cma size)?

@popcornmix
Copy link
Collaborator

The case I'm nervous about is something like kodi with firmware 3d driver that may experience a performance hit from the remoted allocations. As such we probably want a switch between old style gpu_mem and new style cma.

With the switch in the gpu_mem position cma allocation of framebuffers is probably not the best option.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

@popcornmix This patch is only touching the frame buffer driver, not remoting VPU allocations - that's waiting on https://github.com/6by9/linux/tree/rpi-4.19.y-clean2 and needs firmware changes which are up as an MR. Remote allocations is behind a config.txt flag at present, and off by default.

There's near zero performance overhead with this PR as it is all controlled from the fb driver. During FB setup it does the CMA allocation (from the newly enlarged heap) and tries passing it to the VPU. If the VPU accepts it then job's done. If not it frees the CMA allocation and lets the VPU allocate.

I think the more serious failure is if these succeed but leave CMA with an awkwardly small amount remaining causing later failures.

Hence my comment of do we worry about 4k screen users, or more 2560x1440 as that allocation will just succeed in a 16MiB CMA heap.
CMA is semi-reserved memory, so it can be allocated to other users if normal memory has been exhausted. Those users may get remapped should a bigger CMA allocation require it.
Setting the heap size higher therefore has fewer downsides to setting gpu_mem higher.

How we handle switching from gpu_mem sizing to cma sizing is one to discuss. There are several combinations of things that could result in a system that fails to boot, particularly if we're considering the bare-metal world or RiscOS (they won't have the vcsm service).
John's going to be happy as he's just gained 8MB of reloc heap. Likewise Kodi with firmware v3d has gained a small amount of extra reloc heap.

@popcornmix
Copy link
Collaborator

My initial reaction was to merge this as the awkward edge cases (framebuffer alloc that succeeds but prevents booting because there is insufficient cma remaining) would be temporary until we move to the big cma world.

But my kodi example suggests that the edge cases may not be temporary so we might want to think through possible solutions when we aren't in the big cma world.

It seems like with this PR, I can run fbset with carefully constructed (or just unlucky) dimensions that exhaust cma and cause the kernel to grind to a halt. The same can happen on boot with a larger than 1080p monitor configured.

Some ideas:
Measure free cma and fall back to firmware allocation is it seems too low.
If measuring free cma not possible, use initial value of cma.
If initial value of cma not available a larger allocation can be attempted (and freed) to check
The firmware could pass a flag to say if it's using cma allocations to the kernel (e.g. module parameter)
The kernel could query this information through a mailbox call.

@6by9
Copy link
Contributor Author

6by9 commented Feb 28, 2019

Measure free cma and fall back to firmware allocation is it seems too low.

No apparent call to do that, and certainly not through the dma API.
CMA only keeps a bitmask of allocated/free pages, not an overall count.

If measuring free cma not possible, use initial value of cma.

That could be extracted, although again I suspect not through official APIs.

If initial value of cma not available a larger allocation can be attempted (and freed) to check

Can do that. Make sure we have 6MB excess, and hope no other big users have jumped in before us? If we've done it once, then stick with ARM allocations (unless it actually fails).

The firmware could pass a flag to say if it's using cma allocations to the kernel (e.g. module parameter)

The firmware doesn't know for definite. It knows that it has been asked to do them should a suitable provider connect over VCHI, but there's no guarantee that will happen.

The kernel could query this information through a mailbox call.

As above, the firmware doesn't know at this early stage whether it will actually have a suitable provider for remote allocations or not.

I'm not in tomorrow morning, but it sounds like it's better to have a 15minute bash this through rather than banging things back and forth here.

@6by9 6by9 force-pushed the rpi-4.19.y-clean4 branch from bd23d63 to 7df0535 Compare March 6, 2019 13:29
@6by9
Copy link
Contributor Author

6by9 commented Mar 6, 2019

Conversation had with @popcornmix.
Merge the support for this as is without increasing the CMA heap.

On the firmware side we will be gaining the config.txt setting remote_gpu_mem if we're allocating from CMA via vcsm-cma.
Iff that is set, then the firmware will accept the CMA allocation made by this patch and use it as the frame buffer.
If it is not set then we assume we have lots of gpu_mem, reject the passed in buffer, and allocate our own from gpu_mem.

It does make the full implementation of this dependent on the patch set for remote memory allocations, but that's not a big deal.

All happy with that plan?

@pelwell
Copy link
Contributor

pelwell commented Mar 6, 2019

What is the intended use of remote_gpu_mem? Will it cause a larger CMA area to be reserved on the ARM?

@pelwell
Copy link
Contributor

pelwell commented Mar 6, 2019

Or is it just a boolean to put the choice into the hands of the user (and the default value compiled into the firmware)?

@pelwell
Copy link
Contributor

pelwell commented Mar 6, 2019

After reading the VPU-side counterpart, I'm happy with this scheme.

Currently the VPU allocates the contiguous buffer for the
framebuffer.
Try an alternate path first where we use dma_alloc_coherent
and pass the buffer to the VPU. Should the VPU firmware not
support that path, then free the buffer and revert to the
old behaviour of using the VPU allocation.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
@6by9
Copy link
Contributor Author

6by9 commented Mar 6, 2019

As you've surmised, remote_gpu_mem means that the VPU is expecting to be able to use CMA allocations instead of the reloc heap, therefore other config needs to be updated to allow this.

The intention is not to enable this by default until we have a sensible scheme in place to play off gpu_mem vs CMA. For the standard case we'd expect a largish CMA heap (256MB) with minimal reloc heap (<20MB). When using the firmware GLES driver there is a minor performance hit, therefore users may want to revert back to using a genuine reloc heap.

@6by9 6by9 force-pushed the rpi-4.19.y-clean4 branch from 7df0535 to 6b7ad24 Compare March 6, 2019 14:24
@6by9
Copy link
Contributor Author

6by9 commented Mar 6, 2019

Rebased on top of 4.19.27.

@pelwell pelwell merged commit c0e09b3 into raspberrypi:rpi-4.19.y Mar 7, 2019
popcornmix added a commit to raspberrypi/firmware that referenced this pull request Mar 20, 2019
kernel: Enable MT76 USB wifi modules
See: raspberrypi/linux#2890

kernel: staging: bcm2835-codec: NULL component handle on queue_setup failure
See: raspberrypi/linux#2898

kernel: f2fs: fix to skip verifying block address for non-regular inode
See: raspberrypi/linux#2896

kernel: vcsm: rpi-4.19.y - gcc-v8 fixes
See: raspberrypi/linux#2897

kernel: vc-sm-cma tidy ups, and one for bcm2835_codec
See: raspberrypi/linux#2889

kernel: staging: vc_sm_cma: Remove erroneous misc_deregister
See: raspberrypi/linux#2888

kernel: video: bcm2708_fb: Try allocating on the ARM and passing to VPU
See: raspberrypi/linux#2875

firmware: smservice: Avoid a double free

firmware: arm_loader/display: All the framebuffer allocation to be made on the ARM

firmware: isp_tuners: Fix memory leak in error path

firmware: testc: Fix dps parsing

firmware: camera_subsystem: Fix hardware sync pulses off disable_camera_led

firmware: Replace the remaining direct users of C(mask_gpu_interrupt[0|1])

firmware: arm_display: Kick transposer for the rainbow screen if required
firmware: dispmanx: Do not allow transpose if using Full KMS
See: raspberrypi/linux#2891
popcornmix added a commit to Hexxeh/rpi-firmware that referenced this pull request Mar 20, 2019
kernel: Enable MT76 USB wifi modules
See: raspberrypi/linux#2890

kernel: staging: bcm2835-codec: NULL component handle on queue_setup failure
See: raspberrypi/linux#2898

kernel: f2fs: fix to skip verifying block address for non-regular inode
See: raspberrypi/linux#2896

kernel: vcsm: rpi-4.19.y - gcc-v8 fixes
See: raspberrypi/linux#2897

kernel: vc-sm-cma tidy ups, and one for bcm2835_codec
See: raspberrypi/linux#2889

kernel: staging: vc_sm_cma: Remove erroneous misc_deregister
See: raspberrypi/linux#2888

kernel: video: bcm2708_fb: Try allocating on the ARM and passing to VPU
See: raspberrypi/linux#2875

firmware: smservice: Avoid a double free

firmware: arm_loader/display: All the framebuffer allocation to be made on the ARM

firmware: isp_tuners: Fix memory leak in error path

firmware: testc: Fix dps parsing

firmware: camera_subsystem: Fix hardware sync pulses off disable_camera_led

firmware: Replace the remaining direct users of C(mask_gpu_interrupt[0|1])

firmware: arm_display: Kick transposer for the rainbow screen if required
firmware: dispmanx: Do not allow transpose if using Full KMS
See: raspberrypi/linux#2891
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants