Summary: | Arabic Text Scrambled and Unreadable in PDF Files Opened by LibreOffice Draw | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Khaldoun <knimer> |
Component: | Draw | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | RESOLVED DUPLICATE | ||
Severity: | normal | CC: | eyalroz1, vsfoote |
Priority: | medium | ||
Version: | 7.3.3.2 release | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
See Also: | https://bugs.documentfoundation.org/show_bug.cgi?id=104597 | ||
Whiteboard: | |||
Crash report or crash signature: | Regression By: | ||
Bug Depends on: | |||
Bug Blocks: | 99746, 112810 | ||
Attachments: |
PDF Sample File with Arabic text.
Lam-Alef and Lam-Hamza issue and Splitting singles words |
Description
Khaldoun
2022-06-04 22:04:56 UTC
Created attachment 180566
PDF Sample File with Arabic text.
Thanks for filing, but this is a known and long-running PDF import filter issue for RTL text runs.

*** This bug has been marked as a duplicate of bug 104597 ***

While this bug is about PDF import of RTL language text runs, it is not the same problem described in 104597. There, the problem is the reversal of order in text runs. Here we have additional problems, like character repetitions, shifting, and excessive or insufficient (horizontal) spacing. So, this is not clearly a dupe. Perhaps the fix for 104597 will resolve this one as well, but perhaps not. I think the more careful relation between the bugs is dependence.

Hello, how can I check the 104597 fix and decide whether this one is solved as well? How will the new commit be delivered in a new LO version?

@Eyal Rozenberg The 2022-10-14 nightly [1] imports the sample PDF to Draw pretty well. There are some font glitches and obvious spots where combining glyphs get separated from their root glyph. Overall greatly improved, but please consider that LibreOffice is *NOT* a PDF editor; the filter import to Draw produces an ODF holding sdraw text objects arranged on a document canvas.

Version: 7.5.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 8991cbb7986d3967bc6c3719d95254ff04428d1a
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

=-ref-=
[1] https://dev-builds.libreoffice.org/daily/master/

Hello @V Stuart Foote, thanks a lot for the link to the build. I am not asking for LO to be a PDF editor, but for it to properly display RTL text (Arabic in my case). I can assure you that many Arabic users are not using Arabic because of such issues, which they do not face with other apps.
Anyways, I downloaded the 2022-10-14 build:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded

The text rendering is much better, but the reverse-order handling still does not cover all the letters properly. Please note the added attachment, which describes an issue in handling specific two-letter combinations. Also, there is an issue of the same word being split over multiple blocks rather than coming out as one block.

Hello, I agree, Draw is not a PDF editor. But Draw should still handle the RTL/Arabic letters properly, which is not yet 100% fixed by this fix.

In Arabic, when a "Lam" letter is followed by an "Alef" letter or a "Hamza" letter, both letters are combined into a new form/shape. This does not appear to be handled properly yet in this fix. Also, another issue appears: Draw sometimes splits the "same word" into multiple blocks. NB: I call it a "block", but it could also be named a frame, box, etc. I am attaching a new file that describes both issues.

IMPORTANT: This commit fixes a big portion of the issue. It deserves to go live.

Created attachment 183055
Lam-Alef and Lam-Hamza issue and Splitting singles words
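
The Lam-Alef combination described above can be encoded in two ways: as the two logical letters (U+0644 followed by an Alef such as U+0627 or U+0623), or as a single presentation-form ligature code point (for example U+FEFB). The following minimal Python sketch assumes that text extracted from a PDF contains such presentation-form code points, and shows how Unicode compatibility normalization (NFKC) maps each ligature back to its two letters. It only illustrates the encoding side of the issue; it is not how the LibreOffice import filter works, and the helper name `expand_lam_alef` is hypothetical.

```python
import unicodedata

# Presentation-form Lam-Alef ligatures (isolated forms) and the
# logical letter pairs they stand for.
LIGATURES = {
    "\uFEFB": "\u0644\u0627",  # Lam + Alef
    "\uFEF7": "\u0644\u0623",  # Lam + Alef with Hamza above
    "\uFEF9": "\u0644\u0625",  # Lam + Alef with Hamza below
    "\uFEF5": "\u0644\u0622",  # Lam + Alef with Madda above
}


def expand_lam_alef(text: str) -> str:
    """Hypothetical helper: replace presentation-form ligatures with
    their logical letters via Unicode compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


for ligature, letters in LIGATURES.items():
    # Each single ligature code point expands to two logical letters.
    assert expand_lam_alef(ligature) == letters
    expanded = " ".join(f"U+{ord(c):04X}" for c in letters)
    print(f"U+{ord(ligature):04X} -> {expanded}")
```

Once the text is stored as logical letters, forming the combined Lam-Alef shape is left to the text shaping engine and the font at display time.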
Forgot to mention:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded

@Khaldoun, thanks for the analysis. I did notice the 1st issue. I don't know if that is a font fallback, or just a manifestation of the way the glyphs are being extracted from the PDF, where the logic for handling the glyph transformations is probably not present.

For the second, it is best to think of them as partial text runs or snippets. Glyphs are encoded into the PDF with no sense of source script. We filter import them (using poppler libs) into LibreOffice as just a run of text; all lexical context is missing. Normal break iterators are not parsed even if present. They end up recorded into the draw canvas as text box objects, disjointed by which glyphs get strung together.

So, given the coarseness of the filter import, just getting them into the correct RTL sequence (for bug 104597) is a great improvement. Assembling them into lexically useful strings, sentences and paragraphs is work still to be done. The work done for bug 118370 is not doing well with assembling the RTL text boxes; I suspect it needs additional logic to do so.

I'm interested in Khaled's take on things at this juncture.

The first attachment ("PDF sample file with Arabic text") is already kind of scrambled to begin with. Specifically, observe how, on line 2, the % sign overlaps the two alef characters. Also, the text is not in the Arabic language, and I doubt it is properly in any language. So, let's please start with a proper PDF document (with Arabic, or Farsi or whatever), then analyze any problems.

The primary issue of the reversed text runs is corrected for the 7.4.3 release, with additional work in master against a 7.5 release. Any residual formatting or conversion of extracted RTL text runs should be opened as new issues against 7.4.3.

*** This bug has been marked as a duplicate of bug 104597 ***
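
The splitting of one word into several text boxes, discussed in the comments above, can be illustrated with a small Python sketch. It assumes that each imported fragment is known by its text (already in logical order) and by its left and right x-coordinates on a single right-to-left line, and that fragments separated by no more than a chosen gap belong to the same word. The `Fragment` type, the `merge_rtl_fragments` function and the `max_gap` threshold are all hypothetical; this is not the logic used by the poppler-based import filter or by the work on bug 118370.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str     # characters in logical (reading) order
    left: float   # left edge on the page
    right: float  # right edge on the page


def merge_rtl_fragments(fragments, max_gap=1.0):
    """Join fragments of one RTL line whose horizontal gap is at most max_gap."""
    # RTL reading order starts at the rightmost fragment.
    ordered = sorted(fragments, key=lambda f: f.right, reverse=True)
    merged = []
    for frag in ordered:
        if merged and merged[-1].left - frag.right <= max_gap:
            prev = merged[-1]
            # The fragment further left continues the word in reading order.
            merged[-1] = Fragment(prev.text + frag.text,
                                  min(prev.left, frag.left), prev.right)
        else:
            merged.append(Fragment(frag.text, frag.left, frag.right))
    return merged


# Assumed input: one word split into two boxes, plus a separate second word.
line = [
    Fragment("\u0627\u0645", 50.0, 60.0),                   # left half of word 1
    Fragment("\u0633\u0644", 60.5, 70.0),                   # right half of word 1
    Fragment("\u0639\u0644\u064a\u0643\u0645", 20.0, 45.0), # word 2, further left
]
words = [f.text for f in merge_rtl_fragments(line)]
print(words)
```

A fixed gap threshold is only a rough heuristic; a real implementation would likely need to scale it with font size and account for fragments on different lines.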