Bug 161135 - OOM Crash Skia Vulkan rendering dragging full LO appframe wider
Summary: OOM Crash Skia Vulkan rendering dragging full LO appframe wider
Status: RESOLVED WONTFIX
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: graphics stack
Version: 24.2.3.2 release (earliest affected)
Hardware: All Windows (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: haveBacktrace
Depends on:
Blocks: Skia
Reported: 2024-05-16 13:33 UTC by V Stuart Foote
Modified: 2024-05-18 11:27 UTC

See Also:
Crash report or crash signature:


Attachments
WinDbg stack and analyze (41.82 KB, text/plain)
2024-05-16 13:37 UTC, V Stuart Foote

Description V Stuart Foote 2024-05-16 13:33:22 UTC
Crash with Skia/Vulkan rendering when dragging the LO appframe wider on Win10 with a modest NVIDIA dGPU (2 GB graphics memory) on a 4K (2048x3840px) display. Stack trace(s) attached.

STR
1. open LO
2. use the OS/DE to drag and attach the appframe to the right half or left half of the desktop
3. grab midpoint edge and drag the LO appframe wider
4. note the appframe will expand for some distance, but then will crash
5. repeat, but set LO to Skia/raster based rendering -- no crash with the raster framing

Attached soffice.bin to WinDbg with symbols. Look to be hitting the OOM assert at vcl/skia/gdiimpl.cxx 485 that Mike K. put in with https://gerrit.libreoffice.org/c/core/+/161516

Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 433d9c2ded56988e8a90e6b2e771ee4e6a5ab2ba
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL threaded

vulkaninfo.exe
Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 1.3.277 (4206869)
        driverVersion     = 552.12.0.0 (2315452416)
        vendorID          = 0x10de
        deviceID          = 0x1380
        deviceType        = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName        = NVIDIA GeForce GTX 750 Ti
        pipelineCacheUUID = e1b3e8ed-3cf0-cbc3-9cd7-33f3274361af
Comment 1 V Stuart Foote 2024-05-16 13:37:00 UTC
Created attachment 194151
WinDbg stack and analyze
Comment 2 V Stuart Foote 2024-05-16 13:40:30 UTC
@Mike, I can routinely reproduce on this 2 GB GPU / 4K display combo; do you have cycles to revisit?
Comment 3 Mike Kaganski 2024-05-16 14:37:31 UTC
(In reply to V Stuart Foote from comment #0)
> Attached soffice.bin to WinDbg with symbols. Look to be hitting the OOM
> assert at vcl/skia/gdiimpl.cxx 485 that Mike K. put in with
> https://gerrit.libreoffice.org/c/core/+/161516

Please note that I didn't put any asserts there. I added code to *not* assert in a specific case, hoping that these cases could get fixed - but that code isn't run here.

Given the description that the developers gave for 'oomed()', I do not see what I could do here. An expert is needed, not me.
Comment 4 V Stuart Foote 2024-05-16 18:53:29 UTC
(In reply to Mike Kaganski from comment #3)
> (In reply to V Stuart Foote from comment #0)
> > Attached soffice.bin to WinDbg with symbols. Look to be hitting the OOM
> > assert at vcl/skia/gdiimpl.cxx 485 that Mike K. put in with
> > https://gerrit.libreoffice.org/c/core/+/161516
> 
> Please note that I didn't put any asserts there. I added code to *not*
> assert in a specific case, hoping that these cases could get fixed - but
> that code isn't run here.
> 
> Given the description that developers made for the 'oomed()', I do not see
> what I could do here. An expert is needed, not me.

OK thanks, guess I misread the commit, and I'm certainly no expert either. 

But I thought you were on the right track. As I read it: after an initial Skia GrDirectContext oomed() return [1], you test whether the number of Skia operations to flush (default 1000 ops) is > 10, and if so divide it by 2 and check the context again just once? And only then fail if oomed() still reports OOM?

If the flush is to work, maybe use a factor of 10 on oomed(), i.e. if > 10 then /= 10, and so reduce the count of resize steps carried before the flush?

Or is that wishful thinking, and I am way off track about being able to flush GPU memory? Maybe, rather than oomed(), Skia's releaseResourcesAndAbandonContext() is another way to recover when OOM.
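
To make the arithmetic behind the two reduction factors concrete, here is a minimal C++ sketch. It is not the code from vcl/skia/gdiimpl.cxx: the helper name thresholdsTried and its parameters are hypothetical, and it only assumes the rule discussed in this thread (start at 1000 operations, keep dividing while the limit is above 10).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, NOT LibreOffice code: lists every flush threshold that
// would be tried if the limit is divided by `divisor` after each OOM report,
// for as long as the current limit is still above `floor`.
std::vector<int> thresholdsTried(int start, int divisor, int floor)
{
    assert(divisor > 1 && "divisor must actually shrink the threshold");
    std::vector<int> tried{ start };
    int limit = start;
    while (limit > floor)
    {
        limit /= divisor; // integer division, e.g. 125 / 2 == 62
        tried.push_back(limit);
    }
    return tried;
}
```

Under those assumptions, thresholdsTried(1000, 2, 10) walks 1000, 500, 250, 125, 62, 31, 15, 7 (eight thresholds), whereas thresholdsTried(1000, 10, 10) walks only 1000, 100, 10. A larger factor reaches small batches sooner but leaves fewer intermediate sizes to try.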

=-ref-=
[1] https://api.skia.org/classGrDirectContext.html
Comment 5 Julien Nabet 2024-05-17 22:37:45 UTC
Can't help here => uncc myself.
Comment 6 Mike Kaganski 2024-05-18 10:18:28 UTC
(In reply to V Stuart Foote from comment #4)

That change just ignored the OOM state, and all the operations that led to it, and simply halved the number of operations before flush, so the next batch of operations would flush after 500, then 250, ... - no more than 8 attempts before the number is lower than 10.

Given that the oomed() call is documented to reset the state, I don't see what I can do here. Replacing the canvas's context is not something that looks safe.
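
The behaviour described above can be sketched with a stand-in context. This is a hedged illustration only: MockContext, FlushPolicy and handleFlushResult are hypothetical names, not Skia or LibreOffice APIs. The one property borrowed from the Skia documentation is that oomed() clears the flag once it has returned true.

```cpp
#include <cassert>

// Stand-in for a GPU context; NOT Skia's GrDirectContext.
class MockContext
{
public:
    // Simulates the 3D API reporting an out-of-memory error.
    void reportOom() { m_oom = true; }

    // Returns the OOM flag and resets it, so a second call returns false
    // until another error is reported.
    bool oomed()
    {
        const bool wasOom = m_oom;
        m_oom = false;
        return wasOom;
    }

private:
    bool m_oom = false;
};

// Hypothetical policy object holding the number of operations per flush.
struct FlushPolicy
{
    int maxOps = 1000; // default taken from the discussion in this thread
};

// Called after a flush. Returns false only when OOM was reported and the
// batch size can no longer be reduced (assumed floor of 10 operations).
bool handleFlushResult(MockContext& context, FlushPolicy& policy)
{
    assert(policy.maxOps > 0);
    if (!context.oomed())
        return true; // nothing reported, keep the current batch size
    if (policy.maxOps > 10)
    {
        policy.maxOps /= 2; // next batch flushes after half as many ops
        return true;
    }
    return false; // still OOM with a tiny batch: give up
}
```

Because the flag is consumed by the check itself, there is no lingering "OOM state" left to repair afterwards; the only lever in this sketch is the batch size for subsequent flushes.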
Comment 7 V Stuart Foote 2024-05-18 11:27:39 UTC
(In reply to Mike Kaganski from comment #6)
> (In reply to V Stuart Foote from comment #4)
> 
> That change just ignored the OOM state, and all the operations that led to
> it, and simply halved the number of operations before flush. so next batch
> of operations would flush after 500, then 250, ... - no more than 8 attempts
> before the number is lower than 10.
> 
> Given that oomed() call is documented to reset the state, I don't see what I
> can do here. Replacing canvas's context is not something that looks safe.

OK can agree to that, and thanks for walking me through it.

Setting => WONTFIX: it is annoying/concerning, but it is a corner case in Vulkan usage and no one else has confirmed with the STR.