Summary: | The REGEX function accepts all (ismx) but one (w) flags and only directly in the regular expression and does not allow all matches to be found at once | | |
---|---|---|---|
Product: | LibreOffice | Reporter: | Igor <eeigor> |
Component: | Calc | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | UNCONFIRMED --- | | |
Severity: | enhancement | CC: | 79045_79045, erack, himajin100000 |
Priority: | medium | | |
Version: | 7.0.4.2 release | | |
Hardware: | All | | |
OS: | All | | |
Whiteboard: | | | |
Crash report or crash signature: | | Regression By: | |
Bug Depends on: | | | |
Bug Blocks: | 108827 | | |
Description
Igor
2021-02-28 08:11:30 UTC
But words with an accent in a word are recognized with the "w" flag disabled (?-w):

=REGEX("А́ Е́ И́ О́ У́ Ы́ Э́ Ю́ Я́ а́ е́ и́ о́ у́ ы́ э́ ю́ я́";"(?-w)\b\w+\b";;2)

returns "Е́" (Cyrillic). Why?

Unfortunately, the accents shifted when I pasted the text. The correct version is here:
https://forum.openoffice.org/en/forum/viewtopic.php?f=9&t=104622&p=507209#p507209

You are confusing the Flags parameter with pattern option flags. The ixsmw option flags can always be given in the pattern (as you did with (?-w) in your example), and there can be multiple options at different places, so it does not make sense to repeat them as function-wide flags. The Flags parameter currently implements only the "g" (global) argument, as known from sed, for replacements. Maybe Flags should be renamed to avoid confusion. It was never meant to accept pattern option flags, or to have "g" act on extraction. I'd find it doubtful to have REGEX("string";".";;"g") extract every single character of "string", or to have the result of REGEX("barbaz";"a";;"g") be "aa".

For your question about word boundaries I can only refer to ICU and its documentation, http://userguide.icu-project.org/strings/regexp or the new https://unicode-org.github.io/icu/userguide/strings/regexp.html
If unclear, please ask them. That the "accents have shifted when pasting text" indicates you used combining accents instead of single-character Unicode letters (and indeed that's what one gets when copying the sample string from the comment); that may be related and might explain why in your example the second occurrence of a word is the one letter. Again, to be sure I'd suggest you ask in an ICU or Unicode forum or mailing list.

Eike Rathke,
1) At least one flag ("i") duplicates the corresponding pattern option. E.g.
    With com.sun.star.i18n.TransliterationModules
        oOptions.transliterateFlags = .IGNORE_CASE
    End With

And how to use the following constants, I still do not understand:

    Const Long REG_NOT_BEGINOFLINE = 0x00000800
    Const Long REG_NOT_ENDOFLINE = 0x00001000

But this appears to be our flag "m"...
And if the analogy with Python is appropriate here, then the approach that I described is used there, only the sequence of flags ("ismxw") as a string is used.
https://docs.python.org/3/library/re.html

2) REGEX("barbaz";"a";;"g") returns not "aa", but an array of 2 matches {a; a}, and if one cell is selected, it will get the first value from the array according to the array processing rule. But the user can join an array of matches and output it as a string.
Why is this needed? In order not to iterate over the return values one by one during multiple calls to the REGEX function when the total number of matches is unknown. To do this, you need to organize a loop and write a macro.
And if REGEX with the "g" (Global) flag replaces all occurrences, wouldn't it be logical to extract all of them also?

(In reply to Igor from comment #4)
> 1) At least one flag ("i") duplicates the corresponding pattern option.
> E.g.
> With com.sun.star.i18n.TransliterationModules
> oOptions.transliterateFlags = .IGNORE_CASE
> End With

How is that related to the REGEX() spreadsheet function?

> And how to use the following constants, I still do not understand:
> Const Long REG_NOT_BEGINOFLINE = 0x00000800
> Const Long REG_NOT_ENDOFLINE = 0x00001000

You don't. The REGEX() function does not use the UNO API's css::util::SearchFlags.

> And if the analogy with Python is appropriate here, then the approach that I
> described is used there, only the sequence of flags ("ismxw") as a string is
> used.
> https://docs.python.org/3/library/re.html

That's not much different to how ICU handles it, is it?
Prefixing the pattern with (?ismxw) does exactly that, and is flexible as it can be switched on/off at arbitrary positions. I see no benefit in Flags parameter arguments that do the same but only for the pattern as a whole.

> 2) REGEX("barbaz";"a";;"g") returns not "aa", but an array of 2 matches {a;
> a},

Makes some sense.

> And if REGEX with the "g" (Global) flag replaces all occurrences, wouldn't
> it be logical to extract all of them also?

At least a possibility ;-)
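
For readers following the Python analogy raised above, here is a minimal sketch of the two points under discussion, using Python's `re` module only. It is not ICU and not the Calc REGEX() function, so flag letters and word-boundary rules differ (Python has no "w" option, for instance). It shows an option given inline in the pattern versus passed as a function-level flag, and "replace all" versus "extract all" for the "barbaz" sample. The variable names are purely illustrative.

```python
import re

text = "barbaz"

# Option given inline in the pattern, similar in spirit to an ICU (?i) prefix.
inline = re.findall(r"(?i)A", text)

# The same option passed as a function-level flag instead.
by_flag = re.findall(r"A", text, flags=re.IGNORECASE)

# An inline option can also be scoped to part of the pattern,
# which a function-wide flag cannot express.
scoped = re.findall(r"(?i:B)a", "Barbaz")

# "Replace all": every occurrence is substituted.
replaced = re.sub(r"a", "X", text)

# "Extract all": every occurrence is returned as a list,
# which the caller may join into one string if desired.
all_matches = re.findall(r"a", text)
joined = "".join(all_matches)

print(inline, by_flag, scoped, replaced, all_matches, joined)
```

In this sketch the list form corresponds to the array {a; a} proposed in the thread, and the joined form to the "aa" result that was questioned. Which of the two a spreadsheet function should return is exactly the open design question here.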