Bug 140708

Summary:	The REGEX function accepts all (ismx) but one (w) flags and only directly in the regular expression and does not allow all matches to be found at once
Product:	LibreOffice	Reporter:	Igor <eeigor>
Component:	Calc	Assignee:	Not Assigned <libreoffice-bugs>
Status:	UNCONFIRMED ---
Severity:	enhancement	CC:	79045_79045, erack, himajin100000
Priority:	medium
Version:	7.0.4.2 release
Hardware:	All
OS:	All
Whiteboard:
Crash report or crash signature:		Regression By:
Bug Depends on:
Bug Blocks:	108827

Description Igor 2021-02-28 08:11:30 UTC

Description:
We have:
 REGEX(Text;Expression[;[Replacement][;Flags|Occurrence]])
 Flag settings: "g" only (means "Global")
Desirable:
 REGEX(Text;Expression[;[Replacement][;Flags][;Occurrence]])
 Flag settings: "g" + "ismxw"

Flag Settings - Description
i - Ignore case (case insensitive)
s - Make . match newline too (single-line, dot all)
m - Make begin/end {^, $} consider each line
x - Allow comment in regex
w - Make {\w, \W, \b, \B} follow Unicode rules

Steps to Reproduce:
See "Actual Results".

Actual Results:
1. Either the first occurrence or the given one is extracted. Now if the replacement parameter is not specified, the flag "g" is ignored.
2. All flags (ismx) work if you insert them directly into a regular expression: "(?ismx)…" or "(?ismx:…)" when the corresponding option is enabled. Except for one (w).
3. Flag "w". E.g.:
=REGEX("The quick (""brown"") fox can’t jump 32.3 feet, right?";"(?w)\b\w+\b";;5)
returns "jump", not "can't". Why?


Expected Results:
1. When the "g" flag is set, all occurrences should also be returned. Parameters "Flags|Occurrence" should be isolated.
2. Flag settings: "g" + "ismxw"
3. Word boundaries are recognized as in the example above according to the specification (https://www.unicode.org/reports/tr29/tr29-33.html#Word_Boundaries).


Reproducible: Always


User Profile Reset: No



Additional Info:
The use of the "w" flag remains unclear. For example, words with an accent in a word are recognized with the "w" flag disabled (?-w), and the examples of the words above are not recognized at all.

Comment 1 Igor 2021-02-28 08:32:33 UTC

But words with an accent in a word are recognized with the "w" flag disabled (?-w)
=REGEX("А́ Е́ И́ О́ У́ Ы́ Э́ Ю́ Я́ а́ е́ и́ о́ у́ ы́ э́ ю́ я́";"(?-w)\b\w+\b";;2)
returns "Е́" (Cyrillic). Why?

Comment 2 Igor 2021-02-28 08:35:04 UTC

Unfortunately, the accents have shifted when pasting text. Here is correct:
https://forum.openoffice.org/en/forum/viewtopic.php?f=9&t=104622&p=507209#p507209

Comment 3 Eike Rathke 2021-04-12 13:31:47 UTC

You are confusing the parameter Flags with pattern option flags. The ixsmw option flags can always be given in the pattern (like you did with (?-w) in your example) and there can be multiple options at different places, it does not make sense to have those repeated as function-wide flags. The Flags parameter currently implements only the "g" Global argument as known from sed for replacements. Maybe Flags should be renamed to not be confused. It was never meant to have pattern option flags be passed in the Flags parameter or have this "g" act on extraction.

I'd find it doubtable to have REGEX("string";".";;"g") extract every single character of "string", or the result of REGEX("barbaz";"a";;"g") be "aa".

For your question about word boundaries I can only refer to ICU and its documentation, http://userguide.icu-project.org/strings/regexp or new https://unicode-org.github.io/icu/userguide/strings/regexp.html
If unclear please ask them.
The "accents have shifted when pasting text" indicates you used combining accents instead of single character Unicode letters (and indeed that's what one gets when copying the sample string from the comment), that may be related and might explain why in your example the second occurrence of a word is the one letter. Again, to be sure I'd suggest you ask in an ICU or Unicode forum or mailing list.

Comment 4 Igor 2021-04-12 21:26:14 UTC

Eike Rathke,
1) At least one flag ("i") duplicates the corresponding pattern option.
E.g.
With com.sun.star.i18n.TransliterationModules
    oOptions.transliterateFlags = .IGNORE_CASE
End With

And how to use the following constants, I still do not understand:
 Const Long REG_NOT_BEGINOFLINE = 0x00000800
 Const Long REG_NOT_ENDOFLINE = 0x00001000
But this appears to be our flag "m"...

And if the analogy with Python is appropriate here, then the approach that I described is used there, only the sequence of flags ("ismxw") as a string is used.
https://docs.python.org/3/library/re.html

2) REGEX("barbaz";"a";;"g") returns not "aa", but an array of 2 matches {a; a}, and if one cell is selected, it will get the first value from the array according to the array processing rule. But the user can join an array of matches and output it as a string.
Why is this needed? In order not to iterate over the return values one by one during multiple calls to the REGEX function when the total number of matches is unknown. To do this, you need to organize a loop and write a macro.
And if REGEX with the "g" (Global) flag replaces all occurrences, wouldn't it be logical to extract all of them also?

Comment 5 Eike Rathke 2021-04-13 17:33:26 UTC

(In reply to Igor from comment #4)
> 1) At least one flag ("i") duplicates the corresponding pattern option.
> E.g.
> With com.sun.star.i18n.TransliterationModules
>     oOptions.transliterateFlags = .IGNORE_CASE
> End With
How is that related to the REGEX() spreadsheet function?

> And how to use the following constants, I still do not understand:
>  Const Long REG_NOT_BEGINOFLINE = 0x00000800
>  Const Long REG_NOT_ENDOFLINE = 0x00001000
You don't. The REGEX() function does not use the UNO API's css::util::SearchFlags.

> And if the analogy with Python is appropriate here, then the approach that I
> described is used there, only the sequence of flags ("ismxw") as a string is
> used.
> https://docs.python.org/3/library/re.html
That's not much different to how ICU handles it, is it? Prefixing the pattern with (?ismxw) does exactly that, and is flexible as it can be switched on/off at arbitrary positions. I see no benefit of flags parameter arguments that do the same but only over all.


> 2) REGEX("barbaz";"a";;"g") returns not "aa", but an array of 2 matches {a;
> a},
Makes some sense.

> And if REGEX with the "g" (Global) flag replaces all occurrences, wouldn't
> it be logical to extract all of them also?
At least a possibility ;-)