Bug 158033 - PDF: Writer Docx to PDF export fails to render asian fonts in V7
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.0 all versions
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-02 01:16 UTC by Prashanna
Modified: 2023-11-14 21:41 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
Example docx used to replicate the bug (25.53 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2023-11-02 01:20 UTC, Prashanna
Another example docx used to replicate the bug (41.65 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2023-11-02 01:22 UTC, Prashanna
Dockerfile to reproduce the bug (736 bytes, text/plain)
2023-11-09 16:38 UTC, Prashanna

Description Prashanna 2023-11-02 01:16:32 UTC
Description:
Asian character rendering in the Writer DOCX to PDF export consistently breaks after a specific point in the PDF document: from that point on, the characters are replaced by tofu boxes, even though the same document still renders correctly in the Writer application.

Note that the character set and font are the same across the pages where the breakage occurs.

I have tried multiple Linux distros and a range of LibreOffice versions and packages (both distro-maintained and libreoffice.org-maintained).

This regression is only present in LibreOffice v7; reverting to 6.4.7.2 solves the issue.

Steps to Reproduce:
1. Ensure the required Asian character fonts are installed and supported, e.g. (apt-get install fonts-noto-cjk)
2. Export docx to PDF either via the GUI or command line
3. Open PDF to observe the issue near the end of the document.
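
The command-line variant of step 2 can be sketched as follows. This is a minimal sketch, not part of the original report: the flags mirror the command line quoted in comment 5, while the helper names build_convert_command and convert_to_pdf and the default binary name "libreoffice" are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="libreoffice"):
    """Return the argument list for a headless DOCX -> PDF conversion.

    The flags are the ones shown in comment 5; the binary name is an
    assumption and may differ per install (e.g. libreofficedev24.2).
    """
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir, binary="libreoffice"):
    """Run the conversion and return the path where the PDF is expected."""
    cmd = build_convert_command(docx_path, out_dir, binary)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Keeping the command construction separate from the subprocess call makes it easy to run the same conversion against several installed versions by changing only the binary argument.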

Actual Results:
After a certain number of pages in the exported PDF document, the expected Asian characters are replaced with tofu box characters.

Expected Results:
We should see the correctly encoded asian characters.


Reproducible: Always


User Profile Reset: Yes

Additional Info:
$ libreoffice --help
LibreOffice 7.4.7.2 40(Build:2)
Comment 1 Prashanna 2023-11-02 01:20:11 UTC
Created attachment 190596 [details]
Example docx used to replicate the bug
Comment 2 Prashanna 2023-11-02 01:22:09 UTC
Created attachment 190597 [details]
Another example docx used to replicate the bug
Comment 3 kdub 2023-11-09 15:05:32 UTC
Hello Prashanna,

Thank you for reporting the bug. Unfortunately I can't reproduce it. After exporting the PDF, I am using Evince 42.3 to view it. All characters look fine to me. I am using: 

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: f3811e06b27afcbac7f63c2d184db4b1f8b01a1f
CPU threads: 4; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Could you please try to reproduce it with a master build from https://dev-builds.libreoffice.org/daily/master/current.html ?
You can install it alongside the standard version.
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the master build.
Comment 4 Prashanna 2023-11-09 16:38:31 UTC
Created attachment 190771 [details]
Dockerfile to reproduce the bug
Comment 5 Prashanna 2023-11-09 16:52:11 UTC
Thanks for looking into this bug, much appreciated.

I've installed the master build as requested and I was able to replicate the error (The last page fails to interpret the font in Japanese_Korean.docx even though it is identical to the page prior) with the following version:

LibreOfficeDev 24.2.0.0.alpha0 aea53c0ed1527ed1f8233972a27128e14d645e8f

I've attached a Dockerfile to help replicate the error.

The problem is visible in the output PDF regardless of the PDF viewer; for example, I tested with Evince 43.1-2 on Debian.

Interestingly the master build emits debug warnings which may help narrow down the cause:

root@8a75daaaeb6d:/# libreofficedev24.2 --headless --convert-to pdf --outdir /tmp/mount /tmp/mount/Japanese_Korean.docx 
javaldx: Could not find a Java Runtime Environment!
Warning: failed to read path from javaldx
warn:xmloff:849:849:sax/source/fastparser/fastparser.cxx:1233: unknown attribute vid={B3B32D58-CE17-43BB-8D3C-451204B3B300}
warn:legacy.osl:849:849:oox/source/helper/storagebase.cxx:67: StorageBase::StorageBase - missing base input stream
convert /tmp/output/Japanese_Korean.docx as a Writer document -> /tmp/output/output/Japanese_Korean.pdf using filter : writer_pdf_Export
warn:vcl.fonts:849:849:vcl/source/fontsubset/sft.cxx:1262: Endless loop found in a compound glyph.
warn:vcl.fonts:849:849:vcl/source/fontsubset/sft.cxx:1262: Endless loop found in a compound glyph.
warn:vcl.fonts:849:849:vcl/source/fontsubset/sft.cxx:1262: Endless loop found in a compound glyph.
warn:vcl.fonts:849:849:vcl/source/fontsubset/sft.cxx:1262: Endless loop found in a compound glyph.
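
To compare runs across builds, the warnings in a log like the one above can be tallied with a small script. This is a rough sketch: the regular expression is inferred only from the lines quoted here, and reading the two numeric fields as process and thread IDs is an assumption. The name summarize_warnings is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the quoted log lines:
#   warn:<area>:<number>:<number>:<source file>:<line>: <message>
# Treating the two numbers as pid and tid is an assumption.
WARN_RE = re.compile(
    r"^warn:(?P<area>[^:]+):(?P<pid>\d+):(?P<tid>\d+):"
    r"(?P<file>[^:]+):(?P<line>\d+): (?P<message>.*)$"
)


def summarize_warnings(log_text):
    """Count identical warnings, keyed by (area, file, line, message)."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = WARN_RE.match(raw.strip())
        if match:
            key = (
                match["area"],
                match["file"],
                int(match["line"]),
                match["message"],
            )
            counts[key] += 1
    return counts
```

Lines that do not start with "warn:" (such as the javaldx notice or the "convert ..." status line) are ignored, so the full console output can be passed in unchanged.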


The commit that added this warning check appears to be recent (about 1 year old).
It may be a clue as to why I'm observing the bug only from V7 onward.
https://github.com/LibreOffice/core/commit/3a371df3ecce456c9329a493f48600431d2ade69
Comment 6 Prashanna 2023-11-09 16:58:58 UTC
Just to be clear, when you attempted to replicate the error, did you check the last page in the output PDF for the attached Japanese_Korean.docx file?
Comment 8 Buovjaga 2023-11-13 17:14:57 UTC
The documents use Microsoft fonts such as MS Gothic. Everything exports just fine on Windows where all fonts are present.

Prashanna: what is the idea behind this report? Is it to test font fallback, when you are missing Microsoft fonts?

Set to NEEDINFO.
Change back to UNCONFIRMED after you have provided the information.
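
One way to check whether font fallback is involved on a Linux system is to ask fontconfig which font it substitutes for a requested family such as MS Gothic. The sketch below assumes the fc-match tool is installed and that its default output has the form `file: "Family" "Style"`; the helper names are hypothetical, and the matching rule in is_fallback is a deliberately crude illustration.

```python
import subprocess


def parse_fc_match_line(line):
    """Split fc-match's default output into (file, family, style).

    Assumes the form:  SomeFont.ttf: "Family Name" "Style"
    """
    file_part, _, rest = line.strip().partition(": ")
    names = [part for part in rest.split('"') if part.strip()]
    family = names[0] if names else ""
    style = names[1] if len(names) > 1 else ""
    return file_part, family, style


def resolve_font(requested):
    """Ask fontconfig which font it would use for the requested family."""
    result = subprocess.run(
        ["fc-match", requested],
        capture_output=True, text=True, check=True,
    )
    return parse_fc_match_line(result.stdout)


def is_fallback(requested, resolved_family):
    """Crude check: the resolved family does not contain the requested name."""
    return requested.casefold() not in resolved_family.casefold()
```

For example, calling resolve_font("MS Gothic") and passing the returned family to is_fallback would indicate whether the documents' fonts are being substituted on the test machine.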
Comment 9 Prashanna 2023-11-13 19:09:44 UTC
Hi Buovjaga,
As stated in the report, this export bug isn't present on Windows and Mac, only on Linux, and more importantly only on v7 and above of LibreOffice; v6 works fine. In both cases the same system fonts are installed, so that variable is controlled (along with every other variable). This is why I suspect a regression in LibreOffice on Linux.

The document was originally constructed on Windows in MS Word with embedded MS fonts, then exported on Linux.

The fact that the same text with the same font works and then suddenly breaks after a seemingly arbitrary page break is baffling.
Again, this happens only with LibreOffice v7 on Linux; v6 works fine.

This bug was caught by our internal regression test suite after the LibreOffice version was upgraded.
Comment 10 Buovjaga 2023-11-14 08:36:18 UTC
Then I suggest bibisecting to discover the exact code change that caused it: https://wiki.documentfoundation.org/QA/Bibisect/Linux

If you need help, let me know.
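
The idea behind bisecting can be illustrated with a plain binary search over an ordered list of builds. This is only a sketch of the principle, not the actual bibisect workflow described on the wiki page; the function name first_bad_build and the is_bad callback are hypothetical, and the code assumes the first build is good, the last is bad, and the bug never disappears once introduced.

```python
def first_bad_build(builds, is_bad):
    """Return the index of the first build for which is_bad(build) is true.

    Assumptions: builds is ordered oldest to newest and has at least two
    entries, builds[0] is good, builds[-1] is bad, and every build after
    the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return hi
```

In this setting, is_bad would export the attached document with the given build and report whether the last page shows tofu boxes. Each test halves the remaining range, so even thousands of builds need only a handful of exports.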
Comment 11 Prashanna 2023-11-14 21:41:01 UTC
Ah, thanks for the suggestion.
I'll try to give that a go the next chance I get.