Bug 76021 - FORMATTING: Libre Office Writer: save As HTML results in interlaced <strike> and <span> tags
Status: RESOLVED DUPLICATE of bug 160017
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 4.2.1.1 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Reported: 2014-03-11 09:51 UTC by Patrick Goetz
Modified: 2024-05-20 11:11 UTC

Attachments
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags. (23.57 KB, application/vnd.oasis.opendocument.text)
2014-03-11 09:51 UTC, Patrick Goetz
.docx file used for "Export to xhtml" example discussed in the comment. (13.60 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-03-15 10:21 UTC, Patrick Goetz
.odt document with abnormal line breaks and span tags (12.07 KB, application/vnd.oasis.opendocument.text)
2020-07-31 11:10 UTC, Tyco72
Screenshot of HTML source in Firefox 88 (61.71 KB, image/png)
2021-05-18 12:38 UTC, Stéphane Guillou (stragu)

Description Patrick Goetz 2014-03-11 09:51:25 UTC
Created attachment 95585
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags.

Problem description: 

I am saving *.docx files as HTML using Libre Office 4.2.1.1.  Much to my surprise, I noticed that I'm getting horrifically invalid HTML, with interlaced tags.  As an experiment, I copied and pasted some of the offending text into a Libre Office document, saved it as ODT, and then saved it again as HTML.  The behavior appears to be the same.  An example of what I'm talking about is shown under "Current behavior" below.

Notice that the <strike> and <span> tags are interlaced, something which should never happen and which makes the file impossible to parse, say, using XSLT.


Steps to reproduce:
1. See attached Libre Office document
2. Save as HTML
3. Check the resulting HTML document using a text editor

I will test this using the linux version of Libre Office Writer

Current behavior:

<p class="western" style="margin-bottom: 0in; line-height: 110%"><b>Advisor</b>&nbsp;shall
mean a person designated to support, assist, consult with<span style="display: inline-block; border: none; padding: 0in"><strike>&nbsp;</strike><strike>and</span></strike>,

Expected behavior:

There are various ways this might be formatted with HTML; it doesn't matter as long as the tags aren't interlaced.        
Operating System: Windows XP
Version: 4.2.1.1 release
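
The interlacing shown under "Current behavior" can be checked mechanically. The following is a minimal sketch, assuming Python's standard `html.parser` module; the names `NestingChecker` and `find_misnested` are hypothetical helpers introduced only for illustration. It reports every end tag that does not match the innermost open tag.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Record every end tag that does not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.errors = []   # (closing tag seen, tag that was expected)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        expected = self.stack[-1] if self.stack else None
        self.errors.append((tag, expected))
        # Recover by dropping the most recent matching open tag, if any.
        if tag in self.stack:
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]


def find_misnested(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.errors


# Fragment taken from the "Current behavior" output above.
snippet = ('<span style="display: inline-block; border: none; padding: 0in">'
           '<strike>&nbsp;</strike><strike>and</span></strike>,')
print(find_misnested(snippet))   # [('span', 'strike')]
```

The single reported pair says that `</span>` arrived while `<strike>` was still the innermost open element, which is exactly the interlacing described in this report.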
Comment 1 Urmas 2014-03-11 13:02:20 UTC
HTML is not XML and therefore doesn't require nested tags or XML document structure.
Comment 2 Patrick Goetz 2014-03-11 15:30:07 UTC
> HTML is not XML and therefore doesn't require nested tags or XML document structure.

While this might very well have been true in 1998, all modern versions of HTML are also valid XML with DTDs and Doctypes.  In any case, users expect to get valid output, and often the reason someone is doing Save as HTML in the first place is that the document is going to be parsed.  It makes no sense to start out with a document that must be valid XML and end up with invalid HTML.

This is quite embarrassing.  I've been recommending that people upgrade to Libre Office from MS Office, but in this case at least Microsoft is putting out valid HTML.  I don't understand what happened; I don't recall seeing this with previous versions of Open Office.
Comment 3 Patrick Goetz 2014-03-11 15:56:14 UTC
I checked Google Docs as well, converting the same document to HTML and checking to see if the tag structure is xml-valid.  While the HTML output from Google Docs can best be described as bizarre (every possible text formatting is set up as a class and applied using <span class=>), the file is nevertheless valid xml.
Comment 4 Julien Nabet 2014-03-11 22:07:21 UTC
On pc Debian x86-64 with master sources updated today, I can reproduce this.
Comment 5 Tomaz Vajngerl 2014-03-12 09:18:00 UTC
Heh - it's even a bigger mess when you add bold, italics and underline into the mix.
Comment 6 Patrick Goetz 2014-03-12 09:26:18 UTC
I've been doing this -- in particular, coding, and working with XML/HTML -- for a long time.  This smells of horrifically bad coding that probably needs to be rewritten from scratch.  No sensible XML parser would start with valid XML and end up with invalid HTML -- that doesn't make sense.
Comment 7 Julien Nabet 2014-03-12 11:13:17 UTC
I wonder whether export->xhtml and save as->html call the same code.
I think I read in a bug that they could be 2 different parts (one uses an xslt file).

Miklos: any idea?
Comment 8 Tomaz Vajngerl 2014-03-13 15:00:38 UTC
I agree that HTML export in LO is really bad; it hasn't been worked on since Netscape was king and it probably needs rewriting to make better use of CSS and SVG, avoid deprecated HTML features, and use new HTML5 tags where appropriate (with an easy choice between HTML4 and HTML5). This will probably take some time.

However, if you are trying to parse HTML with an XML parser then it is your own fault. HTML is not XML - there are subtle differences: tags are case sensitive in XML but not in HTML, there is no need for "/" if an element has no body (for example, <br> is valid HTML but not XML), and HTML parsers tolerate improperly nested tags. In other words: writing HTML as XML is recommended today but not mandated, so you cannot rely on it.

If you want a valid XML document export it as XHTML, which is actually using XML as a base.
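
The differences listed above can be seen by feeding small fragments to a strict XML parser. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the sample fragments and the helper name `is_well_formed_xml` are made up for illustration and are not taken from any LibreOffice output.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Illustrative fragments, one per difference mentioned in this comment.
samples = {
    "unclosed <br>": "<p>line one<br>line two</p>",
    "self-closed <br/>": "<p>line one<br/>line two</p>",
    "mismatched case": "<P>text</p>",
    "interlaced tags": "<p><span><strike>and</span></strike></p>",
}

for name, markup in samples.items():
    print(f"{name}: {is_well_formed_xml(markup)}")
```

Only the self-closed `<br/>` fragment is accepted; the other three raise `ParseError`, which is why an XSLT or other XML toolchain cannot consume the "Save as HTML" output directly.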
Comment 9 Tomaz Vajngerl 2014-03-13 15:03:12 UTC
(In reply to comment #7)
> I wonder whether export->xhtml and save as->html call the same code.
> I think I read in a bug that they could be 2 different parts (one uses
> an xslt file).
> 
> Miklos: any idea?

Yes, export->xhtml is using XSLT and they aren't using the same code paths.
Comment 10 Patrick Goetz 2014-03-15 10:21:04 UTC
Created attachment 95845
.docx file used for "Export to xhtml" example discussed in the comment.
Comment 11 Patrick Goetz 2014-03-15 10:26:39 UTC
> If you want a valid XML document export it as XHTML, which is actually using XML as a base.

The problem with this is that the xhtml I get when I use "Export to xhtml" is, in my opinion, quite bizarre (however, similar to what you get with "Publish to the Web" using Google Docs).  Using the attached .docx file as a starting point, this is what I get when I export to xhtml (snippet of file):

<p class="P1"><span class="T1">Complainant</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">shall mean (a)</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T3">the</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">any</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">person or persons from whom the Intake Officer receives information concerning an Offense</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">and who, upon consent of that person(s), is designated a Complainant by the Intake Officer</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">or (b) any Injured Person designated by the Bishop Diocesan who in the Bishop Diocesan’s discretion, should be afforded the status of a Complainant, provided, however, that any Injured Person so designated may decline such designation.</span></p>

Ignoring that vim on the Windows XP machine I'm using is not reading the UTF-8 characters correctly, notice that common formatting such as <b> and <i> is being applied as classes using the <span> tag.  In this case, .T1 maps to a single CSS property:
	.T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the same original document) you get more complex classes:
	.T1 { font-size:10pt; font-weight:bold; }
	.T13 { font-style:italic; }
	.T14 { font-style:italic; }
	.T15 { font-style:italic; }
	.T16 { font-style:italic; text-decoration:underline; }
	.T17 { font-style:italic; text-decoration:underline; }
	.T18 { font-style:italic; }
	.T19 { font-style:italic; font-weight:bold; }
	.T20 { font-style:italic; font-weight:bold; }
	.T21 { font-style:italic; font-weight:bold; }
	.T22 { font-style:italic; font-weight:bold; }
	.T26 { padding:0in; border-style:none; }
	.T27 { text-decoration:underline; }
	.T28 { text-decoration:underline; padding:0in; border-style:none; }
	.T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse.  Moreover, if I take exactly the same document and add some text, then all these classes change!  Also note the strange duplication of classes that do exactly the same thing (.T13, .T14, .T15, .T18).
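
The duplicated classes can be found automatically. Below is a minimal sketch, assuming the generated stylesheet only contains simple `.name { prop:value; }` rules like the ones listed above; the function name `group_equivalent_classes` is hypothetical, and a regular expression is not a general CSS parser.

```python
import re

# Matches simple single-class rules such as ".T13 { font-style:italic; }".
RULE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")


def group_equivalent_classes(css):
    """Return lists of class names that share identical declarations."""
    groups = {}
    for name, body in RULE.findall(css):
        # Normalise each declaration so order and spacing do not matter.
        decls = frozenset(
            re.sub(r"\s*:\s*", ":", d.strip())
            for d in body.split(";") if d.strip()
        )
        groups.setdefault(decls, []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# A subset of the rules quoted in this comment.
css = """
.T13 { font-style:italic; }
.T14 { font-style:italic; }
.T15 { font-style:italic; }
.T16 { font-style:italic; text-decoration:underline; }
.T17 { font-style:italic; text-decoration:underline; }
.T18 { font-style:italic; }
.T27 { text-decoration:underline; }
"""
print(group_equivalent_classes(css))
# [['T13', 'T14', 'T15', 'T18'], ['T16', 'T17']]
```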

In my application, what I need to do is extract the text, preserving simple formatting such as <p>, <b>, <i>, and (deprecated) <strike>, in order to paste this content into another XML document.  This is doable using the exported xhtml, but extremely onerous, since it will require at least 2 passes through a parser: first to add the simple xhtml tags I want (<b>, <i>) that weren't included in the first place, then another pass to strip out all the remaining classes and other xhtml coding that I don't want.
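
For a sense of what such a conversion involves, here is a rough sketch that folds both passes into one recursive walk, assuming Python's standard `xml.etree.ElementTree`. The `SIMPLE_TAGS` table, the `simplify` function, and the `class_css` dictionary (class name to declaration text, which would have to be extracted from the exported stylesheet first) are all hypothetical. It also assumes a fragment without an XML namespace and declarations written without spaces, as in the rules quoted above; a full exported file would need namespace handling.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping from CSS declarations to the simple tags wanted.
SIMPLE_TAGS = {
    "font-weight:bold": "b",
    "font-style:italic": "i",
    "text-decoration:line-through": "strike",
}


def simplify(elem, class_css):
    """Serialise elem with class-based spans replaced by simple tags.

    Spans are dissolved; other elements keep their tag but lose attributes.
    """
    inner = escape(elem.text or "")
    for child in elem:
        inner += simplify(child, class_css) + escape(child.tail or "")
    if elem.tag == "span":
        css = class_css.get(elem.get("class", ""), "")
        for decl, tag in SIMPLE_TAGS.items():
            if decl in css:
                inner = f"<{tag}>{inner}</{tag}>"
        return inner
    return f"<{elem.tag}>{inner}</{elem.tag}>"


# Shortened version of the exported paragraph quoted in this comment.
xhtml = ('<p class="P1"><span class="T1">Complainant</span>'
         '<span class="apple-converted-space"><span class="T2"> </span></span>'
         '<span class="T2">shall mean (a)</span></p>')
class_css = {"T1": "font-weight:bold;"}

print(simplify(ET.fromstring(xhtml), class_css))
# <p><b>Complainant</b> shall mean (a)</p>
```

Because the class names change whenever the document changes, the `class_css` table has to be rebuilt for every export rather than hard-coded.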

I can't fathom why KISS isn't being applied here:  use basic xhtml tags whenever possible in order to keep the output readable and sane. I've written a fair amount of XML parsing code myself, so do know something about it.  I can't help but think this is an example of incredibly lazy programming (unless I'm missing something).
Comment 12 Patrick Goetz 2014-03-17 17:12:51 UTC
Intellectual curiosity leads me to add that I'd love for the person who wrote the "Export to xhtml" code to explain why they went with a purely CSS class-based approach, especially since the Google Docs people (who I know have plenty of resources) did the same thing.
Comment 13 Julien Nabet 2014-03-17 22:16:45 UTC
(In reply to comment #12)
> Intellectual curiosity leads me to add that I'd love for the person who
> wrote the "Export to xhtml" code to explain why they went with a purely CSS
> class-based approach; especially since the Google Docs people (who I know
> have plenty of resources) did the same thing.

Patrick: if it's ooo2wordml_text.xsl which does the job, it might be explained like this:
when we look at the history of this file (see http://opengrok.libreoffice.org/history/core/filter/source/xslt/export/wordml/ooo2wordml_text.xsl), we can see it was created in 2004 and, if you leave aside the license changes, the last change was in March 2005 (9 years ago!).
Comment 14 Patrick Goetz 2014-03-17 22:26:49 UTC
ooo2wordml_text.xsl sounds like an XSL script which converts ODF to OOXML -- surely this wouldn't be the same XSL used to export to xhtml?
Comment 15 Julien Nabet 2014-03-18 06:39:01 UTC
Patrick: Oops, you're right of course! :-)
Comment 16 Rev. Bob 2015-04-20 02:22:20 UTC
(In reply to Tomaz Vajngerl from comment #5)
> Heh - it's even a bigger mess when you add bold, italics and underline into
> the mix.

Something tells me this is related to the behavior I describe in bug 89069, especially where bold and italic are treated differently than the other inline formatting options. I was specifically looking at start-of-line behavior, but there may well be more to it...
Comment 17 QA Administrators 2016-09-20 09:32:47 UTC Comment hidden (obsolete)
Comment 18 Tyco72 2020-07-31 11:10:15 UTC Comment hidden (off-topic)
Comment 19 Tyco72 2020-08-01 14:54:19 UTC Comment hidden (off-topic)
Comment 20 Stéphane Guillou (stragu) 2021-05-18 12:38:51 UTC
Created attachment 172129
Screenshot of HTML source in Firefox 88

Reproducible with LO 7.2 Alpha0+. Firefox 88 even highlights the offending closing tags in red (see attachment).

Version: 7.2.0.0.alpha0+ / LibreOffice Community
Build ID: 6b09276d157abada74e1a4989700139167207778
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-05-14_04:32:30
Calc: threaded
Comment 21 Tyco72 2021-05-18 20:46:47 UTC Comment hidden (off-topic)
Comment 22 Miklos Vajna 2021-05-19 08:20:45 UTC Comment hidden (obsolete)
Comment 23 Tyco72 2021-05-19 11:30:19 UTC Comment hidden (obsolete)
Comment 24 Aron Budea 2021-05-23 05:40:27 UTC
(In reply to Tyco72 from comment #18)
> I have the same issue, tested with LO 6.3.6 and 6.4.5. It is a critical bug
> when you have to paste the content for example in the Wordpress and have to
> work with  HTML!
> The code looks messed up as shown in the comment #11, and the pasted text
> looks broken in a lot of rows with few characters each one (in the
> preformatted block of Wordpress). It makes the LO documents useless.
Surely this isn't the same as the originally reported bug, please open a new bug report for your issue.
Comment 25 Tyco72 2021-05-23 09:37:36 UTC
> Surely this isn't the same as the originally reported bug, please open a new
> bug report for your issue.

Hi, thank you. I have created the bug:
Bug 142443
https://bugs.documentfoundation.org/show_bug.cgi?id=142443
Comment 26 QA Administrators 2023-05-24 03:14:45 UTC Comment hidden (obsolete)
Comment 27 Tyco72 2023-07-12 16:12:20 UTC
Hello,
I have tested it with LO 7.5.3.2 (Win 64bit) and the bug is still present, as I described in comment #18
https://bugs.documentfoundation.org/show_bug.cgi?id=76021#c18

It was also present in LO 7.4. I don't know about older versions, but comment #20 reports that it also happened with LO 7.2, so I would say that the bug is inherited.
Comment 28 Stéphane Guillou (stragu) 2024-05-20 11:11:22 UTC
Reproduced with attachment 95585 in:

Version: 7.6.7.2 (X86_64) / LibreOffice Community
Build ID: dd47e4b30cb7dab30588d6c79c651f218165e3c5
CPU threads: 8; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Resolved in:

Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 433d9c2ded56988e8a90e6b2e771ee4e6a5ab2ba
CPU threads: 8; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: CL threaded

Resolved by 6ebe0eceb1ae4a3e544c733be37e5f02c5f46e80 in 24.2.2, cherrypick of:

commit 6d797c83d9fb891b783de39646b42d34a895c81e
author	Mike Kaganski 	Mon Mar 04 12:20:13 2024 +0600
committer	Mike Kaganski 	Mon Mar 04 13:26:06 2024 +0100
tdf160017: make sure to emit the closing tags in correct order
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/164325

Thanks Mike!
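
For readers wondering what "emit the closing tags in correct order" means in general, here is a generic sketch of the idea. It is not the LibreOffice code and makes no claim about how the commit above is implemented; the `TagWriter` class is hypothetical. It keeps a stack of open tags so that closing an outer tag first closes, and afterwards reopens, anything opened inside it.

```python
class TagWriter:
    """Tiny markup writer that always closes tags innermost-first."""

    def __init__(self):
        self.out = []
        self.open_tags = []   # stack of (tag, attrs), outermost first

    def open(self, tag, attrs=""):
        self.out.append(f"<{tag} {attrs}>" if attrs else f"<{tag}>")
        self.open_tags.append((tag, attrs))

    def text(self, s):
        self.out.append(s)

    def close(self, tag):
        """Close tag, temporarily closing anything opened after it."""
        if tag not in (t for t, _ in self.open_tags):
            raise ValueError(f"<{tag}> is not open")
        reopen = []
        while True:
            top, attrs = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break
            reopen.append((top, attrs))
        # Restore the formatting that is still in effect.
        for top, attrs in reversed(reopen):
            self.open(top, attrs)

    def result(self):
        return "".join(self.out)


# Bold ends while italic is still active: overlapping formatting runs.
w = TagWriter()
w.open("b")
w.text("one ")
w.open("i")
w.text("two")
w.close("b")
w.text(" three")
w.close("i")
print(w.result())   # <b>one <i>two</i></b><i> three</i>
```

Overlapping character formatting in the source document is thereby turned into properly nested runs in the output instead of interlaced tags.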

*** This bug has been marked as a duplicate of bug 160017 ***