160623 – Issue while converting DOCX or PPTX files to PDF files

Summary: Issue while converting DOCX or PPTX files to PDF files

Status:	UNCONFIRMED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	6.4.7.2 release
Hardware:	All Linux (All)

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2024-04-11 06:45 UTC by Pasupalati Sampath
Modified:	2024-04-11 09:50 UTC
CC List:	1 user

See Also:
Crash report or crash signature:

Description Pasupalati Sampath 2024-04-11 06:45:01 UTC

Description:
I am using the command "libreoffice --headless --convert-to pdf --outdir ${outputFolderPath} input.docx" to convert DOCX files to PDF. Conversion slows down significantly for files larger than 1 MB. Previously, on our production server, converting a 20 MB file to PDF took around 3 to 4 minutes. Recently, the same kind of conversion has been taking much longer.

Steps to Reproduce:
1. Open a terminal, connect to the Linux server, and paste this command: "libreoffice --headless --convert-to pdf --outdir resultFolderPath input.docx"
2. For input.docx provide any docx file path.
3. Run the final command and check for results.
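
To make timings for these steps easier to collect and compare, the command can be wrapped in a small script. The following is only a minimal sketch, assuming `libreoffice` is on the PATH; the function names `build_command` and `time_conversion` and the `out` directory are hypothetical and not part of the original report. It measures wall-clock time and file size only, not page count.

```python
import subprocess
import time
from pathlib import Path


def build_command(input_path, out_dir):
    """Return the argv list for the headless conversion used in this report."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(input_path),
    ]


def time_conversion(input_path, out_dir, runner=subprocess.run):
    """Run one conversion and return (elapsed_seconds, size_in_mb)."""
    size_mb = Path(input_path).stat().st_size / (1024 * 1024)
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    runner(build_command(input_path, out_dir), check=True)
    return time.perf_counter() - start, size_mb


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python time_convert.py a.docx b.docx
    for doc in sys.argv[1:]:
        elapsed, size_mb = time_conversion(doc, "out")
        print(f"{doc}\t{size_mb:.1f} MB\t{elapsed:.1f} s")
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.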

Actual Results:
Size	Time	Pages
1.2 MB	1 ms	9
4.5 MB	230 ms	11
4 MB	227 s	1195
10 MB	452 s	392
<1 MB	1 ms	1


Expected Results:
The conversion should take less time.


Reproducible: Always


User Profile Reset: No

Additional Info:
Please resolve this ASAP as I have production deployment planned.

Comment 1 david 2024-04-11 09:17:36 UTC

Hello,

I think you should attach example DOCX files to this issue.
You could also make the title more precise, for example by adding the keyword "Performance issue" or something similar :)

I am also interested in solving this problem. I will share my research in the next message.

Thank you.

Comment 2 david 2024-04-11 09:42:32 UTC

In my opinion, LibreOffice could be 10 times faster :)

I spent a few days last year researching this subject. Here are my conclusions:

Most of the time, especially with long documents containing very large tables, the main bottleneck is here:

- File: dev/core/svl/source/items/stylepool.cxx
- Function: Node* Node::findChildNode

From my understanding, this function is called each time a new "style" is parsed from a part of the document (a paragraph, a table cell, etc.). 
The goal of this function is to find an existing style among already parsed styles. If not found, it creates a new style node. 
The problem is that this search operation is extremely costly because:

- Comparing styles is slow, often requiring the comparison of complex heterogeneous structures.
- It performs an O(n) search using a simple for-loop. This loop is executed for each parsed style in the document, making it effectively O(n^2). The longer the document, the greater the search cost.

I don't know the LibreOffice code well enough to optimize this myself.
Ideally, I would like to create a short hash of each style node, with a hash index to find an existing style.
I am also wondering about the impact of creating a new Node style instead of reusing an existing one to avoid the search.
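
The difference between the linear search described above and a hash index can be illustrated with a small model. This is a sketch only and not LibreOffice code: `LinearPool`, `HashedPool`, `make_style`, and the representation of a style as a tuple of attribute pairs are all assumptions made for illustration. The real `Node::findChildNode` works on different data structures and may behave differently.

```python
class LinearPool:
    """Finds an existing style by scanning all stored styles (O(n) per lookup)."""

    def __init__(self):
        self.styles = []
        self.comparisons = 0  # counts style-to-style comparisons

    def find_or_insert(self, style):
        for index, existing in enumerate(self.styles):
            self.comparisons += 1
            if existing == style:
                return index
        self.styles.append(style)
        return len(self.styles) - 1


class HashedPool:
    """Finds an existing style through a dict keyed on the style (average O(1))."""

    def __init__(self):
        self.styles = []
        self.index = {}  # style -> position in self.styles

    def find_or_insert(self, style):
        found = self.index.get(style)
        if found is not None:
            return found
        self.styles.append(style)
        self.index[style] = len(self.styles) - 1
        return self.index[style]


def make_style(i):
    # Hypothetical stand-in for a style: a hashable tuple of attribute pairs.
    return (("font-size", 10 + i % 500), ("bold", i % 2 == 0))


if __name__ == "__main__":
    linear, hashed = LinearPool(), HashedPool()
    for i in range(2000):
        style = make_style(i)
        # Both pools must agree on which stored style is reused.
        assert linear.find_or_insert(style) == hashed.find_or_insert(style)
    print("distinct styles:", len(hashed.styles))
    print("linear comparisons:", linear.comparisons)
```

In this model the linear pool's comparison count grows roughly with the number of lookups times the number of distinct styles, while the hashed pool does one dictionary lookup per style. Whether a cheap and stable hash can be computed for real style nodes is the open question raised above.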

We are willing to pay a few thousand euros to a company that can solve this problem. 
Please contact me privately for this.

I would also like to help solve this problem 🙏

Thank you.