Skip to content
This repository has been archived by the owner on Jul 29, 2024. It is now read-only.

Resource de/allocation refactor #467

Merged
merged 21 commits into from
Mar 2, 2020

Conversation

tianjunwork
Copy link
Contributor

@tianjunwork tianjunwork commented Feb 7, 2020

Signed-off-by: Jun Tian jun.tian@intel.com
Signed-off-by: Xu Guangxin guangxin.xu@intel.com

Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
Caused by MotionCompensationPredictionContextDctor
    EB_DELETE(obj->localReferenceBlockL0);

Pointer is changed in EbCodingLoop.c EncodePassPreFetchRef and EbInterPrediction.c EncodePassInterPrediction16bit
    contextPtr->mcpContext->localReferenceBlockL0->bufferY = refPicList0->bufferY + lumaOffSet;

Signed-off-by: Jun Tian <jun.tian@intel.com>
It seems to be a heap exception.
Use Application verifier to get backtrack of writing to invalid address.

Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork tianjunwork added the Clean up A cleaner implementation or improved functionality label Feb 7, 2020
Compile warning
mem leak of unpack
Crash of OpenVisualCloud#457

Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork tianjunwork marked this pull request as ready for review February 7, 2020 23:23
Signed-off-by: Jun Tian <jun.tian@intel.com>
}
else {
pictureBufferDescPtr->bufferY = 0;
EB_CALLOC_ALIGNED_ARRAY(pictureBufferDescPtr->bufferY, pictureBufferDescPtr->lumaSize * bytesPerPixel);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hassount, the old size is
pictureBufferDescPtr->lumaSize * bytesPerPixel + (pictureBufferDescPtr->width + 1) * 2 * bytesPerPixel, and it is not aligned mem.
New one keeps sync with AV1, which I thinks makes more sense.
Which should I use for recon buffer?

Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork

This comment has been minimized.

@tianjunwork

This comment has been minimized.

@tianjunwork
Copy link
Contributor Author

Hi @xuguangxin , hope you are doing well. I wonder if you are interested to review these results :)

@tianjunwork
Copy link
Contributor Author

tianjunwork commented Feb 13, 2020

CI fast test result with 2 failed buffered_test test cases. Log and bitstream seems fine. Not sure why they are marked as fail.

   1 ---------------------------------------------------------
   2 Test Begin...
   3 Total Number of Tests: 2685
   4 Total Passed: 2683
   5 Percentage Passed: 99.9255121043%
   6 Time Elapsed: 7 hour(s), 13 minute(s), 13 second(s)
   7 ---------------------------------------------------------

@xuguangxin
Copy link

Hi @xuguangxin , hope you are doing well. I wonder if you are interested to review these results :)

sure I can help. But it may be a little slower since it's a large commit. Hope I can finish it before mid of next week.
thanks

@tianjunwork
Copy link
Contributor Author

Hi @xuguangxin , yeah, this PR is to much to review. I was asking if you could review the test result above, the committed memory and valgrind. Did you see similar trend with your PR to AV1?

@tianjunwork
Copy link
Contributor Author

tianjunwork commented Feb 13, 2020

Briefly tested below parameters with different combinations, bitstream has no R2R issue compared with master.
CBR, -nb, -hierarchical-levels, -pred-struct, -sao
VBR R2R issue on master is submitted.

@tianjunwork
Copy link
Contributor Author

tianjunwork commented Feb 14, 2020

Speed testing with default parameters:
First two columns are with this PR, last two are from master tip.
Regression for both, but mainly of over all executing time. Will look into this.

input time fps time fps
1920x1080 11.831s 184.28 11.184s 189.41
7680x3840 1m33.284s 27.46 1m25.442s 27.22
7680x4320 10bit 2m47.448s 18.22 2m9.507s 17.08

Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
Source/Lib/Codec/EbThreads.c Outdated Show resolved Hide resolved
Source/Lib/Codec/EbObject.h Outdated Show resolved Hide resolved
Source/Lib/Codec/EbObject.h Show resolved Hide resolved
Source/Lib/Codec/EbMalloc.h Show resolved Hide resolved
Source/Lib/Codec/EbMalloc.h Show resolved Hide resolved
EB_MALLOC_ARRAY(p2d, width); \
EB_MALLOC_ARRAY(p2d[0], width * height); \
for (uint32_t w = 1; w < width; w++) { \
p2d[w] = p2d[0] + w * height; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little tricky here, to allocate the big (width * height * sizeof(*p2d[0])) memory, and then assign each element ofp2d pointer array with the address of equal size memory sequentially...
It requires that the virtual memory allocator must have continuous linear space with that size, although allocation failure almost wouldn't happen. Not a good implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a tread off, if we do not do this, we will have huge malloc times. suppose we have a 1920x1080 it will allocate 2M times.
If we do think this a problem, at least we need to allocate a line for each time.
The malloc times will drop to 2M to 1080.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. According to malloc man (section NOTES) and the MM related Q&A, "The Linux kernel uses lazy (on-demand) allocation of physical pages, deferring the allocation until necessary and avoiding allocating physical pages which will never actually be used." Windows should have the similar allocation strategy. malloc wouldn't take much time, except for setting up the corresponding page table. And page fault handler will allocate physical memory and map the pages with virtuall address, once you really need to read, write or map it.
  2. So there wouldn't be big time consuming difference between: 1) malloc memories for each entry of the pointer array; AND 2) as implemented here, malloc all the memory with 1 time, then assign each entry of the pointer array with the equally spaced virtural addresses.
  3. With current EB_MALLOC_2D implementation, free any entry (except for p2d[0]) of the pointer array will fail, when those memories are not needed any more. All of them needs to wait for freeing p2d[0] to be freed.
  4. As I mentioned before, such implementation requires that the allocator must have continuous (no hole) linear space with that "big" size, although the allocation failure almost wouldn't happen. :-)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. bind phisical memory is not cost one, How to find the right size for malloc and how to split large memory to meet our requirement will use most of cpu time. Usally, if you do not do any optimization, for a decoder, 2030 cpu time will used by malloc. Encoder may less than this, but it still a noticalbe one.
  2. we never remove any entries, svt alwasy allocate system at start, and free it at end.
  3. it's just virtual address not physical, For 64 bits system, it usally has 48bits address space, whichi is 281TB, so you always can find virtual addreess for our allocation.

Maybe we can do some profile to see the time of multipile malloc vs. one malloc.

Copy link
Contributor

@Austin-Hu Austin-Hu Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 & 2. I noticed that in user space, multiple mallocs (for the same total size) take more time due to syscalls (brk or mmap) and allocation algorithms done in glibc or kernel (Linux slab) allocator. So in theory, the EB_MALLOC_2D could consume less time, especially when width or height is very big.
3. It can't address the separate free issue, unless there is additional resource management such as refcount. Otherwise EB_MALLOC_2D can only be used in the case that the memory resouces would be freed together.
4. Yes, I was talking about virtual address (space).

Source/Lib/Codec/EbMalloc.h Show resolved Hide resolved
Source/Lib/Codec/EbMalloc.h Show resolved Hide resolved
Source/Lib/Codec/EbMalloc.c Show resolved Hide resolved
@tianjunwork
Copy link
Contributor Author

tianjunwork commented Feb 25, 2020

From below vtune data, EbInitEncoder(EbPictureBufferDescCtor->memset) causes the perf regression. CPU time of other functions keep about the same.
With this PR:
vtune_new_bottomup

HEAD:
vtune_old_bottomup

Signed-off-by: Jun Tian <jun.tian@intel.com>
Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork
Copy link
Contributor Author

With the fix to the regression, init time is back to and better than original.
Thx @Austin-Hu and @lijing0010.

vtune_fix_topdown

HEAD:
vtune_old_topdown

@xuguangxin
Copy link

With the fix to the regression, init time is back to and better than original.
Thx @Austin-Hu and @lijing0010.

vtune_fix_topdown

HEAD:
vtune_old_topdown

@tianjunwork
we use EB_CALLOC_ARRAY in SVT AV1 because it will memset the memory after malloc https://github.com/OpenVisualCloud/SVT-AV1/pull/314/files#diff-40daf4c2999b357c7617eeb0d545b007L99
Could we apply your patch to SVT AV1 too?

thanks

@xuguangxin

This comment has been minimized.

@tianjunwork
Copy link
Contributor Author

@xuguangxin I see, thx for the explanation. SVT-HEVC has no memset after EB_ALLIGN_MALLOC on HEAD. I will make a patch to AV1. But need someone who added memset to review in case it is to fix some issue that I don't know?

@tianjunwork
Copy link
Contributor Author

Speed testing with default parameters. -n 1000 x 5 times.
First two columns are with this PR, last two are from master tip.
No perf regression. Total execution time is 4.4% less then master.

input time fps time fps
7680x4320 10bit 1m10.605s 16.2 1m13.865s 16.26

mem leak is not handled in the calling function itself.
But some source code error is fixed, so that any throw will cause EbInitEncoder return EB_ErrorInsufficientResources
to the application.
Application should terminate encoder in this case.
This makes it easier to handle many possible mem leaks because of throw.

Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork
Copy link
Contributor Author

Hi @xuguangxin , on your comments, mem leak is not handled in the calling function itself. But some source code error is fixed, so that any throw will cause EbInitEncoder return EB_ErrorInsufficientResources to the application.
Application should terminate encoder in this case.
This makes it easier to handle many possible mem leaks because of throw.
Tested with manually adding return EB_ErrorInsufficientResources to simulate throw. EbInitEncoder can return EB_ErrorInsufficientResources now with my fix.

@tianjunwork
Copy link
Contributor Author

Hi, @Austin-Hu , I didn't change malloc with EB_MALLOC, using malloc gives the opportunity to free previously allocated buffer in this function. EB_MALLOC will return inside the macro, making it hard to handle mem leak.

@@ -427,12 +427,12 @@ static EB_ERRORTYPE EBObjectWrapperCtor(EbObjectWrapper_t* wrapper,
EB_ERRORTYPE ret;

wrapper->dctor = EBObjectWrapperDctor;
ret = objectCreator(&wrapper->objectPtr, objectInitDataPtr);
if (ret != EB_ErrorNone)
return ret;
wrapper->releaseEnable = EB_TRUE;
Copy link
Contributor Author

@tianjunwork tianjunwork Feb 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If objectCreator return error, function returns. objectDestroyer has no chance to get assigned. So crash happens when EBObjectWrapperDctor is called, because it goes to:

 else {
       //hack....

Signed-off-by: Jun Tian <jun.tian@intel.com>
@tianjunwork
Copy link
Contributor Author

tianjunwork commented Mar 1, 2020

Hi All, all the comments are address, please help to review the last round. Thank you again for the help:)

@xuguangxin
Copy link

thans Jun, for addressed all my comments.

@tianjunwork
Copy link
Contributor Author

thans Jun, for addressed all my comments.

Thank you again for helping on this PR.

@tianjunwork tianjunwork merged commit 0daec1a into OpenVisualCloud:master Mar 2, 2020
@tianjunwork tianjunwork deleted the res_refactor branch March 2, 2020 19:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Clean up A cleaner implementation or improved functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants