Bug 156478 - PDF export: Drop bloat of "default" /MediaBox in root node of the page tree
Summary: PDF export: Drop bloat of "default" /MediaBox in root node of the page tree
Status: NEEDINFO
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:24.2.0
Keywords: difficultyBeginner, easyHack, filter:pdf, skillCpp
Depends on:
Blocks: PDF-Export
  Show dependency treegraph
 
Reported: 2023-07-26 11:23 UTC by Mike Kaganski
Modified: 2023-11-25 02:09 UTC (History)
1 user (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Kaganski 2023-07-26 11:23:27 UTC
Export any document from LibreOffice, and inspect the PDF in a plain text editor.

There are separate page elements for each exported page there, starting with '<</Type/Page/Parent', each having own /MediaBox. Also, there is a root node of the page tree, starting with '<</Type/Pages', listing all the document pages under its Kids, and having an own /MediaBox.

The latter /MediaBox of the root is redundant and should not be emitted.
The code pointer: PDFWriterImpl::emitCatalog in vcl/source/gdi/pdfwriter_impl.cxx.

Rationale:

The PDF 1.7 standard [PDF 32000-1:2008] has this under 7.7.3 "Page Tree", 7.7.3.4 "Inheritance of Page Attributes":

> Some page attributes ... are designed as *inheritable*. ...
> EXAMPLE    A document may specify the same media box for all of its pages by
>            including a MediaBox entry in the root node of the page tree. If
>            necessary, an individual page object may override this inherited
>            value with a MediaBox entry of its own.

Note that the MediaBox entry in the root is optional. Note also, that LibreOffice always outputs the individual pages' MediaBox entries, so the "default" value is never used.

Initially, the output of this entry was introduced in commit df0f52d3aadea5c4d5f600d1533901af1087b464 (#100608# preparations for PDF export, 2002-07-08); there, it used hardcoded values of A4 page (i.e., it didn't consider actual document page sizes at all).

Later, in commit 98468607f7a8d0b1f5f7e3ecd09f756aea904d00 (INTEGRATION: CWS vcl87 (1.122.16); FILE MERGED, 2008-04-03, related to i#75941), document pages started to be taken into account - to workaround a third-party bug in ImageMagick. The procedure to calculate the MediaBox values was strange, allowing to generate maximal width independently of maximal height, and thus, possibly to get the "default" size not matching any single actual page size in the document.

In commit 4830592b780833cf5eee2aef30bc9c5d444dfb24 (PDF export: fix handling of page sizes larger than 508 cm, 2020-04-16), the values started to take UserUnit into account - but again, in a wrong way, because maximal vertical size set its corresponding UserUnit independently of maximal horizontal size; and it was possible to have a super-wide page (e.g., 5500 mm wide, 210 mm high), followed by a relatively high but narrow page (e.g., 210 x 297 mm), and the final UserUnit applied to both of the dimensions was taken from the last maximal size, which happened to be the height of the last page, and which defined UserUnit of 1 (instead of 2 required for 5500 mm, which is larger than 14400 pt).

All in all, this is just a bloat, creating quite some unneeded maintenance effort (in addition to the mentioned commits, see e.g. commit 48aaca5ba9c27a247ed502c3db827c6ac9f34df9 - tdf#148033: Loss of precision in /MediaBox (PDFWriterImpl::emitCatalog()), 2022-03-18); and all the same, not done properly. It is not (or must not be) used by any software in the presence of the individual pages' MediaBox entries. It must be dropped - except in case of "sanity check failure", when there's no pages in the document (it can't happen, but the check is there, so be safe and put a hardcoded A4 MediaBox in that case).
Comment 1 Stéphane Guillou (stragu) 2023-07-26 12:23:39 UTC
Confirmed in:

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 24d0a62bd75b9a895c419aa165da648ab18f134d
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Exported PDF uncompressed with:

qpdf --qdf --object-streams=disable

...contains identical /Mediabox [] elements on each page.
Comment 2 Commit Notification 2023-11-03 05:08:36 UTC
Tobias Kokolakis committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/fa7ae4638431bf4f4d40d16ab0017ec59e233039

tdf#156478 Remove unused default values for MediaBox and UserUnit

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 3 Matt K 2023-11-25 02:09:23 UTC
(In reply to Commit Notification from comment #2)

Is this fixed now?  Setting to NEEDINFO so it won't appear in beginner easy hacks list.