Bug 155087 - Autocorrection in Romanian applies to existing words
Summary: Autocorrection in Romanian applies to existing words
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.5.2.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0
Keywords:
Depends on:
Blocks: AutoCorrect-Complete
Reported: 2023-04-30 09:57 UTC by cipricus
Modified: 2023-06-02 19:49 UTC (History)
CC List: 7 users

See Also:
Crash report or crash signature:


Description cipricus 2023-04-30 09:57:56 UTC
Description:
I don’t see this in English and French, but in Romanian it happens often. For example, the latest occurrence of this, *oua* (the correct form to say *a oua*= “to lay an egg”, or: *ar oua*=would lay an egg etc) is automatically corrected to *ouă* (eggs).

That is related to a trend where auto-correction is used for Romanian to get a word with diacritics by typing the form without them, but it leads to errors like the above. (That trend might be related instead to a past period when diacritics/keyboard layouts for Romanian were less easily accessible.)

This happens a lot. I cannot give many examples from the past, but they all follow the same model: a word without diacritics is automatically replaced with one with diacritics. That makes sense sometimes, but in many cases it replaces the correct word with the wrong one.



Steps to Reproduce:
1. type in Romanian "ar oua" ("would lay eggs", verb)
2. press "Space"

Actual Results:
"oua" is corrected to "ouă" ("eggs", noun)

Expected Results:
no correction should occur for existing words


Reproducible: Always


User Profile Reset: Yes

Additional Info:
Existing words should not be auto-corrected.

I have specified a LO version, but this is not version-specific.
Comment 1 cipricus 2023-04-30 10:13:29 UTC
That specific error oua>ouă is listed here: https://opengrok.libreoffice.org/xref/core/extras/source/autocorr/lang/ro/DocumentList.xml?r=44355a90#6314

As said here (https://ask.libreoffice.org/t/where-and-how-to-report-errors-in-defaults-of-autocorrection/91034/7?u=cipricus), default autocorrect rules are part of LO source code, therefore this is a bug.
Comment 2 cipricus 2023-04-30 10:14:52 UTC
I said that "I have specified a LO version, but this is not version-specific."
That may be wrong I guess.
Comment 3 Julien Nabet 2023-04-30 11:44:08 UTC
Nagy: I noticed ee002215ce6379ffcba990035eeb71854441f265 from 2013.
Any thoughts here?
Comment 4 cipricus 2023-04-30 20:46:00 UTC
(In reply to Julien Nabet from comment #3)
> Nagy: I noticed ee002215ce6379ffcba990035eeb71854441f265 from 2013.
> Any thoughts here?

As I already said, the idea behind the majority of these auto-corrections is to be able to type without diacritics and get them by auto-correction. That is fine as long as the form without diacritics is not a word and can be safely replaced, but it triggers errors when the non-diacritic form is an existing word which gets automatically changed. -- I have noticed already another such case: "condamnam" (I/we were condemning) is corrected to "condamnăm" (I/we are condemning). 

It would be great of course if there was an automated method to remove from that list of forms to be corrected all entries that are present in a dictionary. 
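
The automated check suggested above could look roughly like the following. This is only a minimal sketch: the namespace URI is assumed from LibreOffice's DocumentList.xml files, the inline sample stands in for the real file, and `VALID_WORDS` is a hypothetical stand-in for a real Romanian word list.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed from LibreOffice's autocorrect DocumentList.xml files.
NS = "http://openoffice.org/2001/block-list"

# Tiny inline sample; a real run would read the Romanian DocumentList.xml instead.
SAMPLE = """<block-list:block-list xmlns:block-list="http://openoffice.org/2001/block-list">
  <block-list:block block-list:abbreviated-name="oua" block-list:name="ouă"/>
  <block-list:block block-list:abbreviated-name="condamnam" block-list:name="condamnăm"/>
  <block-list:block block-list:abbreviated-name="intotdeauna" block-list:name="întotdeauna"/>
</block-list:block-list>
"""

# Hypothetical word list standing in for a real Romanian dictionary.
VALID_WORDS = {"oua", "ouă", "condamnam", "condamnăm", "întotdeauna"}


def load_pairs(xml_text):
    """Return (typed form, replacement) pairs from a DocumentList-style XML string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for block in root.iter(f"{{{NS}}}block"):
        typed = block.get(f"{{{NS}}}abbreviated-name")
        replacement = block.get(f"{{{NS}}}name")
        pairs.append((typed, replacement))
    return pairs


def conflicting_entries(pairs, valid_words):
    """Keep only the rules whose typed form is itself a dictionary word."""
    return [(typed, repl) for typed, repl in pairs if typed.lower() in valid_words]


conflicts = conflicting_entries(load_pairs(SAMPLE), VALID_WORDS)
for typed, repl in conflicts:
    print(f"{typed} -> {repl}: typed form is an existing word")
```

With a full word list, the flagged rules would be candidates for review rather than automatic deletion, since a dictionary may itself contain rare or disputed forms.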

On the other hand maybe I was wrong naming the bug as I did - identifying the problem as "correction of existing words" - when it's just about the fact that the list contains errors, no matter what kind. For example, there seem to be other errors, more basic and severe ones (e.g. correction of "ansortie" to "absorție", which is not a correct form, the correct form being "absorbție" (absorption).
Comment 5 cipricus 2023-04-30 20:48:18 UTC
(In reply to cipricus from comment #4)
> (e.g. correction of
> "ansortie" to "absorție", which is not a correct form, the correct form
> being "absorbție" (absorption).

What I meant was:

"absortie" to "absorție", which is not a correct form, the correct form
> being "absorbție" (absorption)
Comment 6 cipricus 2023-04-30 21:12:11 UTC
(In reply to Julien Nabet from comment #3)
> Nagy: I noticed ee002215ce6379ffcba990035eeb71854441f265 from 2013.
> Any thoughts here?

The link you posted is to a very short list compared to the total list. What is that about?

What can I do? List as many errors here as possible? Where to report them?

I don't have time to read all the list, this could be maybe automated by looking for existing words in the left column.

I have already found others in my initial list, where the first word is ok and should not be corrected (a verb tense "corrected" to another tense, an articulated noun or adjective corrected into a non-articulated one): 

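<block-list:block block-list:abbreviated-name="aclamam" block-list:name="aclamăm"/>
<block-list:block block-list:abbreviated-name="Activași" block-list:name="Activați"/>
<block-list:block block-list:abbreviated-name="acumulata" block-list:name="acumulată"/>
<block-list:block block-list:abbreviated-name="acuta" block-list:name="acută"/>
<block-list:block block-list:abbreviated-name="dezgustata" block-list:name="dezgustată"/>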

The cases similar to the last three could be argued as not being errors, because the "corrected" form is very, very rare (an articulated adjective, e.g. "the accumulated one", fem., "the acute one", "the disgusted one" etc.) and could safely be corrected into a much more frequent one, but they are nonetheless correct forms.
Even this error gets more severe as the form to be corrected may be in fact rather frequent, like in the case of "nedumerita" (the amazed one: girl/woman) corrected to "nedumerită" (amazed, fem.).
Comment 7 Julien Nabet 2023-04-30 21:16:21 UTC
(In reply to cipricus from comment #6)
> (In reply to Julien Nabet from comment #3)
> > Nagy: I noticed ee002215ce6379ffcba990035eeb71854441f265 from 2013.
> > Any thoughts here?
> 
> The link you posted is to a very short list compared to the total list. What
> is that about?
>...

@cipricus: my comment wasn't for you but for Nagy which I put in cc.
Comment 8 BogdanB 2023-05-01 17:56:28 UTC
About the example with "oua", there are 2 cases:
- the term needs to be changed by autocorrection: "obiceiul de a oua". If LibreOffice suggests "ouă", you just need to press Ctrl+Z and you get the right version, without autocorrection
- the term needs to be changed by autocorrection: "a cumparat 10 oua". If LibreOffice corrects here, it's ok.

So, it's better to have the duo for "oua"->"ouă" and you decide when to apply and when not, than to not have the autocorrection.

About this, I agree: "absortie" to "absorție", which is not a correct form, the correct form
> being "absorbție" (absorption)
DEX: "Cuvântul absortie nu este în dicționar." (The word "absortie" is not in the dictionary.)

"aclamam" block-list:name="aclamăm"/> This is correct. "Noi aclamăm..." is a correct form. Why not?
Activași" block-list:name="Activați"/> Many people press "ș" instead of "ț". Why to reject this case? It is not the same word, but is better than the first.

"acumulata" block-list:name="acumulată"/> "Pierderea financiara acumulata de-a lungul ultimilor ani este de ..." Why not?

"acuta" block-list:name="acută"/> "O problema acută ce necesită rezolvare". Why not?

"dezgustata" block-list:name="dezgustată"/> "Privea dezgustată spre ...". Why not? The variant without diacritics, it is NOT used.
Comment 9 cipricus 2023-05-01 20:33:22 UTC
(In reply to BogdanB from comment #8)

We cannot argue in this way. I was just giving a few examples for a more generic problem that has to be tackled on a matter of principle: it makes no sense correcting existing (correct) words! 

But your statements can be refuted one by one too. Just an example:

> If LibreOffice suggests "ouă", you just need to press Ctrl+Z and you get the
> right version, without autocorrection

Why should we have to press Ctrl-Z in order to write a common form of a very common word, namely the infinitive of a verb? (And "oua" is not just the infinitive, it is also the past tense! "Găina oua un ou pe zi" = "The chicken was laying one egg every day". Why should one use a shortcut to write a simple phrase like that?) Correction is made in most cases in order to replace something wrong with something correct (that's correcting!) not something less frequent with something more frequent - and only considered so by subjective impression or opinion, without the slightest chance of an objective criterion. 

The auto-correction tool should not be involved in the specification of the verb tense! - One should be able to write a specific tense without triggering the auto-correct tool that would change to a different tense or turn the verb into a noun etc..

There is also the matter of consistency. If "oua" (to lay eggs; also: was laying eggs) is to be the object of correction, why not all other verbs with the same structure, like "ploua" (was raining, to rain) into "plouă" (is raining)? Or even "lua" (to take; was taking) into ... "luă"! Those are not corrected, and for good reason: it makes no sense. But it makes no less sense than for "oua"! - Anyway, we have no objective criterion to separate these cases. The only criterion is the aforementioned: no correction for existing words! 

There is also the matter of consistency with other languages! Is there the case in English that any correct word is automatically "corrected"?

> - the term needs to be changed by autocorrection: "a cumparat 10 oua". If
> LibreOffice corrects here, it's ok.

No, it's not. Not ALL errors of writing should be corrected. 

Maybe you are trying to NOT use the Romanian keyboard layout, but use only English keyboard to write diacritics?
 
> 
> So, it's better to have the duo for "oua"->"ouă" and you decide when to
> apply and when not, than to not have the autocorrection.

That makes no sense. I want to write what I want, one or the other, both are correct, autocorrection has nothing to do here.


> "aclamam" block-list:name="aclamăm"/> This is correct. "Noi aclamăm..." is a
> correct form. Why not?

Again the same error. The fact that "aclamăm" is correct doesn't mean that ANY form, including a correct one like "aclamam" (I was acclaiming) should be automatically corrected into it!

> Activași" block-list:name="Activați"/> Many people press "ș" instead of "ț".
> Why to reject this case? It is not the same word, but is better than the
> first.

What do you mean by "better"??? It's just different: "you/singular (recently) activated it" vs. "you/plural are activating" etc!

And so forth. You seem to miss the main point, sorry.
Comment 10 cipricus 2023-05-01 20:44:16 UTC
> no correction for existing words

What I mean is: "no correction of existing words!". 

Existing words = correct words. Correcting correct words makes no sense.
Comment 11 cipricus 2023-05-01 21:00:22 UTC
By the way, I am not totally against the trend that seems to dominate the greatest part of the list on which the Romanian auto-correction tool is based, namely that it seems to serve mainly the purpose of letting one write with a keyboard layout that lacks Romanian diacritics (like the English keyboard layout, instead of writing with a proper Romanian layout), and get the diacritics by auto-correction - although I don't think that should be the main goal of such a tool.

But that trend should not go as far as to impose correction of existing words, no matter the argument for that (e.g. frequency, which is debatable anyway).

All this is rather ridiculous. It's like replacing in English "drank" with "drunk"! (There are more drinkers than drunkards after all!)
Comment 12 cipricus 2023-05-01 21:14:19 UTC
(In reply to BogdanB from comment #8)

> - the term needs to be changed by autocorrection: "a cumparat 10 oua". If
> LibreOffice corrects here, it's ok.

All your argument is summarized by the above. 

The problem is this: it is not enough for a correction to be ok once or even many times, it must ALWAYS be ok. It should ALWAYS replace a frequent erroneous form and NEVER replace a correct one (an existing word!) no matter its frequency.
Comment 13 cipricus 2023-05-02 06:04:55 UTC
The English equivalent of automatically replacing Romanian "oua" with "ouă" would be the replacement of "laid" (or of "lay") with... "layer"!
Comment 14 cipricus 2023-05-02 10:36:11 UTC
I HAVE READ THE WHOLE LIST!

I am mentioning below most if not all cases that I think should be taken out. I have already tried to articulate the reason for that. 

Although bringing case-by-case examples and arguments should not be the way to go about this, and the decision whether to correct or not existing words should only be made on a general principle (that is: NO VALID FORMS SHOULD EVER BE CORRECTED), the problematic entries can also be treated one by one because they are not that many after all.
 
The replacement of articulated nouns just because they are not frequent enough (based on subjective and inconsistent criteria) is always wrong, but in certain cases it is more strikingly so, when the articulated form is obviously equally frequent. 

That happens based on specific rules of Romanian. For example, an adjective can be “substantived” – that is, made to act like a noun and become the subject, like in English (dead – the dead):

moarta>moartă (the dead woman > dead, adj., fem.)
(Moarta era întinsă pe pat.= The dead woman was lying on the bed.)

prevenita-prevenită
the arrested woman>warned, arrested person, adj., fem.
Prevenita nu era de față=The arrested woman was not present.

negativa>negativă
the negative form/one>negative, adj., fem.
Negativa nu este valabilă= The negative form is not valid.	

ridicata > ridicată
a ridica=to raise up, ridicată=raised up, adj., fem.
”cu ridicata”=wholesale

That happens often in the case of colors, where the form of the adjective is articulated and acts as a short-hand or generic noun ("the black"):

alba>albă (the white [one]>white, adj. fem.)
neagra>neagră (the black [one].>black adj., fem.,)
"Neagra/alba e mai scumpă"= the black/white [one] (e.g. the black or white car) is more expensive.
- the same with other colors: albastra - ”the blue [one]”, which the corrector changes to albastră=blue! (But then it ignores other colors.) 

It is a common rule in Romanian for adjectives to change word order and be articulated when a possessive pronoun (my, mine, his, hers) is used. One can indifferently say “trista mea situație” or ”situația mea tristă” (”my sad situation”), ”blonda mea soție” or ”soția mea blondă” (my blonde wife).  – I wonder why the corrector is not correcting ”trista” to ”tristă”, and ”blonda” to ”blondă”, given that it is doing it for alba>albă and neagra>neagră, as well as for other forms – see below!

As I said in another comment: not only these corrections are wrong, but they are inconsistent – they are unexpected, but, IF ACCEPTED, they are also unexpectedly absent in other cases. – ”Alba” and ”neagra” are no different from something like ”blonda” (blonde girl/woman), which is (rightly so) NOT corrected to ”blondă” (blonde, adjective).

absoluta>absolută
the absolute [one], fem.>absolute, adj., fem.
Absoluta lui încredere=his absolute confidence

singuratica > singuratică
the lonely [one], fem. > lonely, adj., fem.
Singuratica lui viață=His lonely life.

temeinica > temeinică (steadfast, well-founded, adj., fem.)
Temeinica lui decizie=his steadfast decision

vaga > vagă (vague, adj., fem.)
Vaga ta propunere=your vague proposition

valabila > valabilă (valid, adj., fem.)
Valabila ta depoziție = your valid statement

multa>multă (numerous, big/adj., fem)
Often rather archaic but very frequent in the Bible, and in religious and other literary  speech: "Multa mea durere" (my big sorrow)

amoroasa>amoroasă
Amoroasa sa soție=his loving/amorous wife

regala>regală
the royal [one]>royal, adj., fem.
Regala sa prezență=His/her royal stature
This ”correction” is doubly wrong because ”regala” is also a verb: to feast, treat royally, cf. French: ”(se) régaler”

ciudata>ciudată (odd, bizarre, adj., fem.)
Ciudata sa atitudine=his bizarre attitude

This word order/articulation change also happens with the “demonstrative pronouns” (this, that):

Ciudata asta nu vorbește cu mine.=This bizarre girl/woman won’t speak to me.

toleranta (tolerant, adj. fem., definite article) >"toleranța" (tolerance, n., fem.)
Toleranta sa poziție=his tolerant position

While the corrector erroneously replaces “ciudata”, because of inconsistency (within this erroneous trend) it doesn’t replace “frumoasa” (the beautiful [one, fem.]), “proasta” (the stupid one), ”drogata” (the drugged one) etc, – but arbitrarily DOES (and IT SHOULDN’T) replace ”contagioasa" (the contagious [one, fem.]), ”religioasa” (the religious one), ”rezolvata” (the resolved/solved one), ”ridicata” (the raised/upper one), ”salvata” (the saved one), ”zoologica” (the zoological one) with their non-articulated forms!

That such correct words are replaced just because they have the feminine definite article is beyond comprehension. Some of these articulated forms set to be replaced are not very frequent  (e.g. ”greceasca”=”the Greek [thing, fem.]”, or ”ruseasca”=the Russian [thing, fem.]), but THAT IS NOT A REASON to ”correct” them. 

ONLY INCORRECT FORMS SHOULD BE CORRECTED! (One cannot say that ”greceasca”  is erroneous, even if one never uses it: it is just the articulated form of the adjective ”grecească”, and it makes no sense to change the articulated form into the non-articulated one.)
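
To see how widespread this one pattern is, the rules that do nothing except turn a final "a" into "ă" could be listed mechanically. A rough sketch follows; the pairs are taken from this comment thread, and the function name is made up for illustration.

```python
def only_changes_final_a(typed, replacement):
    """True when the rule does nothing except turn a final 'a' into 'ă'."""
    return typed.endswith("a") and replacement == typed[:-1] + "ă"


# Pairs quoted in this bug report (typed form, replacement).
RULES = [
    ("alba", "albă"),
    ("neagra", "neagră"),
    ("greceasca", "grecească"),
    ("valva", "vâlvă"),
    ("oua", "ouă"),
]

suspects = [rule for rule in RULES if only_changes_final_a(*rule)]
for typed, replacement in suspects:
    print(f"{typed} -> {replacement}")
```

The check is purely formal: it cannot tell an articulated adjective ("alba") from a verb form ("oua"), so its output is only a shortlist for a native speaker to review.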

As already said, another error is the auto-correction of verb tenses:

completa>completă
to complete, was completing>complete, adj., fem.

The already mentioned:

aclamam (I/we were acclaiming)>aclamăm (we are acclaiming)

activași (you have just activated)>activați (you are activating; also: activated, masculine,plural)

condamnam (I/we were condemning)-condamnăm (we are condemning)

*****************************************************************

The above errors are based on an erroneous line of argument. The following are blunt errors that need no arguing:

maestra (master/teacher, n., fem. definite article)>maestră (the same, non-articulated)
(”Aşa am cunoscut-o pe maestra mea de la Milano, Mildela D'Amico“, explică soprana,="That's how I met my teacher from Milan, Mildela D'Amico", explains the soprano.)

struna > strună
string, noun, fem., definite article > string, noun, fem.

muschetar >mușchetar (musketeer)
both terms are correct

  "ași" (aces, plural of “as”=ace)>"își" (to oneself)

 "atacat" (attacked) >"atăcat" (???)

"pastorul" (reverend, protestant priest) – "păstorul"(shepherd)
For no reason only the articulated forms are affected.

 "regala"="regală"/>
    already mentioned

"rida" (to wrinkle) – "râdă" (to laugh) 

“valva" (valve, n. fem., definite article) > "vâlvă" (uproar)

"tai"  (I/you cut) – ”tăi" (yours, plural)

tara (fault, imperfection, with definite article, fem., cf. French ”tare”=”défectuosité”) > țara (country, definite article, feminine)
oddly, this error includes also an inconsistency with the global trend of correcting to non-articulated form (which here would be ”țară”)

“vad” (river ford) > "văd" (I/they see)

”taică-meu" ="taica-meu"  (my father)

this is a simple error by inversion of the model of previous entries in the list:
 <block-list:block block-list:abbreviated-name="taica-miu" block-list:name="taică-miu"/>
  <block-list:block block-list:abbreviated-name="Taica-miu" block-list:name="Taică-miu"/>
  <block-list:block block-list:abbreviated-name="taica-tau" block-list:name="taică-tău"/>
  <block-list:block block-list:abbreviated-name="Taica-tau" block-list:name="Taică-tău"/>
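
Inverted entries of this kind, where the typed form carries the diacritics and the replacement drops them, could probably be found automatically. The sketch below uses only Python's standard `unicodedata` module; the helper names are made up for illustration.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, so 'ă' becomes 'a' and 'ț' becomes 't'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def removes_diacritics(typed, replacement):
    """True when a rule replaces a form with diacritics by the bare form."""
    return typed != replacement and strip_diacritics(typed) == replacement


# The inverted entry described above, next to a regular one from the list.
print(removes_diacritics("taică-meu", "taica-meu"))   # inverted direction
print(removes_diacritics("taica-miu", "taică-miu"))   # regular direction
```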
 

"absortia" – "absorția"
both are wrong
All possible errors ("absortie" – "absorție", apsortie, apsorție) should be corrected to ”absorbție”
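
As an illustration only, the replacement entries proposed here could be generated in the same format as the existing list. The element and attribute names are copied from the entries quoted earlier in this comment; the helper function is hypothetical.

```python
from xml.sax.saxutils import quoteattr


def block_entry(typed, replacement):
    """Format one autocorrect rule like the block-list entries quoted above."""
    return (
        "<block-list:block block-list:abbreviated-name=%s block-list:name=%s/>"
        % (quoteattr(typed), quoteattr(replacement))
    )


# Misspellings listed in this comment, all mapped to the correct form.
MISSPELLINGS = ["absortie", "absorție", "apsortie", "apsorție"]
entries = [block_entry(wrong, "absorbție") for wrong in MISSPELLINGS]
for line in entries:
    print(line)
```

The articulated variants ("absortia" and so on) would need their own entries pointing to "absorbția".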
Comment 15 Julien Nabet 2023-05-02 17:30:47 UTC
@cipricus: please don't confirm your own bug.

@Lucian: sorry to add you in cc but we need native Romanian to provide some feedback here

@Sophie: if you have Romanian contacts, they may be very useful here.
Comment 16 cipricus 2023-05-02 20:11:54 UTC
(In reply to Julien Nabet from comment #15)
> @cipricus: please don't confirm your own bug.

Sorry. I imagined that what I cannot is already disabled.

Romanian speakers are a must (I have done what I could in my previous comment myself), but that is not enough. Some general principles in the elaboration of those auto-correction lists have to be stated first, otherwise we end up arguing against subjective opinions.
Comment 17 Julien Nabet 2023-05-02 20:20:29 UTC
(In reply to cipricus from comment #16)
> (In reply to Julien Nabet from comment #15)
> > @cipricus: please don't confirm your own bug.
> 
> Sorry. I imagined that what I cannot is already disabled.
No pb.
> 
> Romanian speakers are a must (I have done what I could in my previous
> comment myself), but that is not enough. Some general principles in the
> elaboration of those auto-correction lists have to be stated first,
> otherwise we end up arguing against subjective opinions.
You're right, that's why I also "cced" Sophie which is an expert in localization domain and either will have the background necessary to respond here or at least will certainly know someone who may help here.
Comment 18 cipricus 2023-05-02 20:22:37 UTC
BogdanB has tried to argue defining some reasons or rules behind the present entries in that list. I want to argue against that: the inconsistency with which such reasons are in fact applied proves that they are not at all at play there and that the entries that I consider erroneous have ended up there by error and by chance.
 
Without that inconsistency hundreds of other correct Romanian words (especially infinitive and past tense verbs) would have ended up there and automatically "corrected" into another tense or an adjective. - A rule or reason for the inconsistent application of which we must be glad is not a real rule at all.
Comment 19 cipricus 2023-05-02 20:26:04 UTC
(In reply to Julien Nabet from comment #17)

> know someone who may help here.

I can say that after a rapid overview of that entire list the erroneous entries are less numerous than I feared at the moment of my initial post. In my long comment where I try to be very comprehensive on the matter most of these are mentioned.
Comment 20 Tex2002ans 2023-05-02 22:17:48 UTC
Hey, I became aware of this bug because of:

- https://www.reddit.com/r/libreoffice/comments/135kn9j/is_the_autocorrection_tool_of_many_languages/

I gave some very detailed AutoCorrect/Spellcheck/Grammarcheck responses in cipricus's Reddit thread, but I'll try to summarize a little of it here.

(Please see the link above for *much* more specifics.)

>> If LibreOffice suggests "ouă", you just need
>> to press Ctrl+Z and you get the right version,
>> without autocorrection [...]
>
> Why should we have to press Ctrl-Z in order to
> write a common form of a very common word, [...]

Yes, I agree.

Typo Correction is split into 3 layers:

- AutoCorrect
- Spellchecking/Dictionaries
- Grammarchecking

LibreOffice's AutoCorrect should focus on:

- invalid words + common typos
--- alot -> a lot
--- becasue -> because
--- cheif -> chief
--- commitee -> committee

while Spellchecking/Dictionaries should focus on:

- valid words
--- (Red squigglies + Right-Click suggestions!)
--- I made a *misteak*. (misteak -> mistake) 

while Grammarchecking should focus on:

- valid words, but used in the wrong context
--- (Green squigglies!)
--- I stood in line for an *our*. (our -> hour)
--- I *runs* away from the dog. (runs -> run/ran)

One valid word to another valid word—like "ouă" vs. "oua"—shouldn't be in the AutoCorrect category!

An error like that should be taken care of at the Grammarchecking-level with green squigglies.

(Luckily, LanguageTool supports Romanian! :) )

- - -

cipricus: Now, we just sit and patiently wait.

Like Julien + BogdanB said, we wait for input from more knowledgeable localizers.

We heard the recommendations, and I have no doubt Romanian AutoCorrect will be made better based on your suggestions. :)

(Take a deep breath, and a deep breath out. Nobody here wants to "argue with" or "antagonize" you. We just want to all help make LO become better! :) )

And it's only been 3 days. People have real lives outside of LibreOffice you know!!! lol.

- - -
Comment 21 Mike Kaganski 2023-05-03 08:52:44 UTC
(In reply to BogdanB from comment #8)
> About the example with "oua", there are 2 cases:
> - the term needs to be changed by autocorrection: "obiceiul de a oua". If
> LibreOffice suggests "ouă", you just need to press Ctrl+Z and you get the
> right version, without autocorrection
> - the term needs to be changed by autocorrection: "a cumparat 10 oua". If
> LibreOffice corrects here, it's ok.
> 
> So, it's better to have the duo for "oua"->"ouă" and you decide when to
> apply and when not, than to not have the autocorrection.

I believe that this is abusing autocorrection, for what grammar checking is for. Additionally, this requires additional effort from the writer, that they need to track what was "corrected", to catch and act on that. The autocorrection tool for any language must be prepared to require the least possible effort from user: the replacements that the tool makes must be correct on 100% cases (well, 99.998% would probably be OK).
Comment 22 Gabriel Masei 2023-05-03 12:30:29 UTC
As there was a request for comments from Romanian-speaking community members, I give my opinion below:

I didn't check all of them, but the examples given by cipricus seem to be correct. Overall I agree with cipricus that this is an issue: a correct sentence could be transformed into an incorrect one from a semantic/grammatical point of view, and the user has to correct it manually. In this case it doesn't look like a helpful tool. On the contrary.

On the other hand AutoCorrect provides to the user a possibility of adding/removing rules for replacements. If a user wants to include in AutoCorrect a transition from "oua" to "ouă" then he can do that and we can't do anything unless we remove that option. So, in my opinion the problem is related to the DEFAULT values that LibreOffice provides. Only in this case I agree with cipricus. We should not provide replacements that could result in wrong corrections.

IMHO there are a few principles that have to be followed when defining DEFAULT replacements:

1. There must be consistency in behavior across all supported languages, for example between English and Romanian. If AutoCorrect corrects only invalid words and typos for English, then it should behave in the same way for Romanian and all the other languages.

2. The essence of AutoCorrect is to "correct" user's mistakes and not make wrong corrections. If there is a probability, however small, that the suggestion could be wrong or the existing form could be a valid one then no auto-correction should be performed.

3. The suggestion that the user could use Ctrl-Z to correct something that AutoCorrect made wrong is not a good idea for three reasons:
  a. the user has to correct a mistake made by application and not by him.
  b. the user has to do an extra step to correct that mistake.
  c. the user has to pay attention all the time on what the AutoCorrect tool does in order to correct mistakes.

I underline again that the three principles are for DEFAULT replacements, that are part of the installation package.

Cheers,
Gabriel
Comment 23 cipricus 2023-05-03 13:00:07 UTC
(In reply to Gabriel Masei from comment #22)

Thank you for your helpful intervention, which fully satisfies my expectations.
 
> If there is a probability, however small, that the
> suggestion could be wrong or the existing form could be a valid one then no
> auto-correction should be performed.

For the purpose of this bug report, this principle is largely enough to support my specific propositions. Except the few erroneous entries which are obviously caused by human error, the rest are misguided by the idea that valid forms may be corrected. (My initial example is one of the most obvious: "a oua" is a very commonly used verb meaning "to lay eggs", while "ou"=egg, is a neuter noun, that is, with plural form identical to that of the feminine: "ouă"=eggs, "ouăle"=the eggs. ”Găina oua”= the chicken was laying eggs. Thus, ”oua” is the correct form of two tenses of that verb, much more than a small probability of correctness.)

> the three principles are for DEFAULT replacements,
> that are part of the installation package.

Why is that specification necessary? What other than the default list (part of the code) can be the object of these principles? (You mean the users must still feel free to keep changing that list as they please? Or is it something else that you mean?)
Comment 24 Mike Kaganski 2023-05-03 13:12:48 UTC
(In reply to Gabriel Masei from comment #22)
> If there is a probability, however small, that the suggestion could be wrong or
> the existing form could be a valid one then no auto-correction should be performed.

Please note that the following is just nitpicking on the "however small".

Consider English replacement i->I. There *is* a non-zero probability, that the author actually wanted to have the "i" in their text. One case is using it as a Roman numeral; another is just showing an English alphabet letter in the text, and so on. But the replacement rule is useful, because the frequency when i was used incorrectly (I was intended) is *much* higher than the expected use of i.

So there is *some* margin of allowable errors here :)
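
The margin can be put into numbers. The sketch below uses invented counts, not measurements from any corpus; only the 99.998% threshold comes from comment 21. A rule's precision is the share of cases where the typed form was really meant as the replacement.

```python
def rule_precision(meant_replacement, meant_typed_form):
    """Share of occurrences where applying the rule gives the intended word."""
    total = meant_replacement + meant_typed_form
    if total == 0:
        raise ValueError("no occurrences to judge the rule by")
    return meant_replacement / total


def is_acceptable(precision, threshold=0.99998):
    """Threshold taken from the '99.998% would probably be OK' remark."""
    return precision >= threshold


# Hypothetical counts: English "i" typed 100,000 times, meant as "i" once.
english_i = rule_precision(meant_replacement=99_999, meant_typed_form=1)

# Hypothetical counts: Romanian "oua" typed 1,000 times, meant as the verb 400 times.
romanian_oua = rule_precision(meant_replacement=600, meant_typed_form=400)

print(english_i, is_acceptable(english_i))
print(romanian_oua, is_acceptable(romanian_oua))
```

Under these made-up numbers the i->I rule passes and the oua->ouă rule fails by a wide margin; real frequencies would have to come from a Romanian corpus.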
Comment 25 Ákos 2023-05-03 13:25:43 UTC
As author of this bug, I agree with cipricus that this is a bug. At the beginning I merged some external code and tools into LibreOffice, and I did not notice these problematic words.
Comment 26 cipricus 2023-05-03 13:43:19 UTC
(In reply to Mike Kaganski from comment #24)
> (In reply to Gabriel Masei from comment #22)
> > If there is a probability, however small, that the suggestion could be wrong or
> > the existing form could be a valid one then no auto-correction should be performed.
> 
> Please note that the following is just nitpicking on the "however small".
> 
> Consider English replacement i->I. There *is* a non-zero probability, that
> the author actually wanted to have the "i" in their text. One case is using
> it as a Roman numeral; another is just showing an English alphabet letter in
> the text, and so on. But the replacement rule is useful, because the
> frequency when i was used incorrectly (I was intended) is *much* higher than
> the expected use of i.
> 
> So there is *some* margin of allowable errors here :)

To adjust your observation to the principle, in 

> the existing form could be a valid one

"valid" should be read as "word existing in the language". Given that "i" is not a word, it can be corrected in spite of the principle.
Comment 27 cipricus 2023-05-03 13:48:07 UTC
(In reply to cipricus from comment #26)
> (In reply to Mike Kaganski from comment #24)
> > (In reply to Gabriel Masei from comment #22)

>  Given that "i" is
> not a word, it can be corrected in spite of the principle.

That is true in English though, not in Romanian, where "i" is a word, a short form of the pronoun "lui" = to him: "i se pare"=it appears to him/he has the impression ...
Comment 28 sophie 2023-05-03 13:51:04 UTC
(In reply to Julien Nabet from comment #17)
> (In reply to cipricus from comment #16)
> > (In reply to Julien Nabet from comment #15)
> > > @cipricus: please don't confirm your own bug.
> > 
> > Sorry. I imagined that what I cannot is already disabled.
> No pb.
> > 
> > Romanian speakers are a must (I have done what I could in my previous
> > comment myself), but that is not enough. Some general principles in the
> > elaboration of those auto-correction lists have to be stated first,
> > otherwise we end up arguing against subjective opinions.
> You're right, that's why I also "cced" Sophie which is an expert in
> localization domain and either will have the background necessary to respond
> here or at least will certainly know someone who may help here.

Thanks Julien :) My take is this list is only there to correct typos and not to spell check a document or to correct its grammar. 
If the correction of the typo is ambiguous and could be a valid word, then it should not be the Autocorrect list which is called but either the spell checker or the grammar tool.
My advice would be to revisit the list and only keep words with common letter inversion or invariable words, etc. in it for a better user experience and let users customize what they want. Maybe revisiting the list could be a workshop for the next LibOCon :)
Comment 29 cipricus 2023-05-03 13:52:05 UTC
(In reply to cipricus from comment #27)

> short form of the pronoun "lui" = to him: "i se pare"=it appears to him/he
> has the impression ...

In fact "i" covers the feminine too:  

short form of the pronoun "lui/ei"=to him/her: "i se pare"=it appears to him/her -
he/she has the impression ...
Comment 30 cipricus 2023-05-03 13:55:15 UTC
(In reply to sophie from comment #28)

If considered useful, I could do a full review of the entire Romanian Autocorrect list again as soon as I can, and post the erroneous entries to be removed in a clearer format.
Comment 31 Mike Kaganski 2023-05-03 14:00:55 UTC
(In reply to cipricus from comment #30)

Maybe it would be better if you just prepare a patch to fix this. For cases like this, you don't even need to download and build LibreOffice: this can be done directly in the web UI.

See https://libreoffice-dev.blogspot.com/2020/05/create-patch-for-libreoffice-directly.html if you decide to try this :)
Comment 32 Gabriel Masei 2023-05-03 14:06:43 UTC
(In reply to cipricus from comment #23)
> (In reply to Gabriel Masei from comment #22)
...
> > If there is a probability, however small, that the
> > suggestion could be wrong or the existing form could be a valid one then no
> > auto-correction should be performed.
...
> Why is that specification necessary? What other than the default list (part
> of the code) can be the object of these principles? (You mean the users must
> still feel free to keep changing that list as they please? Or is it
> something else that you mean?)

Yep. To exclude changes made by users to the replacements list. This is their responsibility.
Comment 33 Gabriel Masei 2023-05-03 14:14:22 UTC
(In reply to Mike Kaganski from comment #24)
> (In reply to Gabriel Masei from comment #22)
> > If there is a probability, however small, that the suggestion could be wrong or
> > the existing form could be a valid one then no auto-correction should be performed.
> 
> Please note that the following is just nitpicking on the "however small".
> 
> Consider English replacement i->I. There *is* a non-zero probability, that
> the author actually wanted to have the "i" in their text. One case is using
> it as a Roman numeral; another is just showing an English alphabet letter in
> the text, and so on. But the replacement rule is useful, because the
> frequency when i was used incorrectly (I was intended) is *much* higher than
> the expected use of i.
> 
> So there is *some* margin of allowable errors here :)

Shouldn't that be performed at the grammar checking level? At that level this distinction can be made.

Anyway, let's suppose that it should be made at the auto-correction level. In this case I understand that there could be exceptions. But they should be treated as such: exceptions. This means that an exception of this kind should be discussed and agreed on individually. Otherwise, if we define a principle that leaves room for interpretation, then we'll face the same issues.

So the principles should remain but maybe adding a fourth principle would be helpful: there could be exceptions but they should be treated as such and extra feedback and acceptance steps should be performed before accepting them. And those extra steps should be explicitly stated.
Comment 34 Tex2002ans 2023-05-03 14:23:20 UTC
@Gabriel I agree completely with your:

- AutoCorrect analysis
- + 3 principles for "DEFAULT replacements".

:)

@Ákos Thanks for the input again after all these years. :)

@cipricus Definitely follow Mike's advice in Comment #31.

> The autocorrection tool for any language must be
> prepared to require the least possible effort from
> user: the replacements that the tool makes must
> be correct on 100% cases (well, 99.998% would probably be OK).

No! Nothing below 99.999% should be allowed!!!

Okay, okay, we can compromise—I'll take 99.998%. :P

> Consider English replacement i->I. There *is* a non-zero
> probability, that the author actually wanted to have the
> "i" in their text.

Yes, but then it will hopefully be caught at the other layers too! (Like grammarchecking!)

- "i went to the park."

vs.

- "The variable i says..."

Grammarcheck will see the word "variable" before 'i', and know that lowercase 'i' was probably intended! No green squiggly!

You can't 100% rely on any 1 of the layers! You need to use all 3 together!

- - -

Side Note: For more info on that, see Daniel Naber's fantastic talk:

FOSDEM 2014: "How we found a million style and grammar errors in the English Wikipedia"
- https://www.youtube.com/watch?v=2xmPwefktXI

(He's the original creator of LanguageTool!)

- - -

Funny Side Note: With LanguageTool, I was pulling my hair out over:

- AI

Of course, everyone will be speaking about Artificial Intelligence... but there's actually a *very rare* English word:

- ai

which is a type of "three-toed sloth" in South America.

While I was saying 99.99% of people want:

- AI + AIs + AI's

the LanguageTool developer wanted to also add:

- ai + ais + ai's

because "it's valid English"... and "people MIGHT be talking about the sloths"!!!

- - -

The 3 different layers can have different tolerances for what constitutes "an error".

In my mind, it's like an inverse pyramid:

- AutoCorrect should be very narrow/strict.
- Spellchecking can be medium.
- Grammarchecking could be wide/lax, allowing all sorts of valid words + parts of speech.

If grammarchecking gives you a bad green squiggly or a not 100% correct suggestion, that's tolerable.

But if AutoCorrect is constantly "correcting" you with wrong—and automatic—"fixes"... that gets frustrating as a user REAL fast.

(So, like Gabriel/MikeKaganski said, AutoCorrect should lean heavily towards the 100% correct side by default.)

- - -

Anyway, I'll be watching this bug from the sidelines now.

Looks like the Romanian AutoCorrect will be fixed after all! :)

(Hopefully this can inspire others to look at updating AutoCorrect in other lesser-used languages too! Just like sophie said in Comment #28.)

Thanks for the great comments, everyone. :)
Comment 35 cipricus 2023-05-15 05:03:07 UTC
(In reply to Mike Kaganski from comment #31)
> (In reply to cipricus from comment #30)
> 
> Maybe it would be better if you just prepare a patch to fix this. For cases
> like this, you don't even need to download and build LibreOffice: this can
> be done directly in the web UI.
> 
> See
> https://libreoffice-dev.blogspot.com/2020/05/create-patch-for-libreoffice-
> directly.html if you decide to try this :)

I am giving it a try. Following the blog instructions ("Possibly, you are a great C++ developer or conversely you write your first strings in C++ and you want make LibreOffice better" - although only the last part applies to my case: "you want etc" !!!), I have arrived at the stage https://gerrit.libreoffice.org/c/core/+/151770,edit, and there, after selecting "OPEN", I have to select a file path or upload a file.

What should I do at this point? What is the link of the file? 
Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an empty file. 

Should I create an XML file containing the raw text of https://opengrok.libreoffice.org/raw/core/extras/source/autocorr/lang/ro/DocumentList.xml?r=44355a90, edit and upload?
Comment 36 Mike Kaganski 2023-05-15 05:26:20 UTC
(In reply to cipricus from comment #35)
> Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an
> empty file. 

Use extras/source/autocorr/lang/ro/DocumentList.xml - everything else you did is correct. Thank you!
Comment 37 cipricus 2023-05-15 06:49:06 UTC
(In reply to Mike Kaganski from comment #36)
> (In reply to cipricus from comment #35)
> > Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an
> > empty file. 
> 
> Use extras/source/autocorr/lang/ro/DocumentList.xml - everything else you
> did is correct. Thank you!

Thank you. I've got it now. But, after starting the systematic editing, I realize that I find it easier to edit the file locally in Kate text editor and upload the final form when I'm done, rather than publish one change after another. 

I am using the LO Writer spell checking to identify correct forms that are wrongly listed there, and there are many more than expected, so the work will take some time, but I am committed to it.

Therefore, I'll only publish the final result.
Comment 38 cipricus 2023-05-15 06:49:43 UTC
(In reply to cipricus from comment #37)
> (In reply to Mike Kaganski from comment #36)
> > (In reply to cipricus from comment #35)
> > > Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an
> > > empty file. 
> > 
> > Use extras/source/autocorr/lang/ro/DocumentList.xml - everything else you
> > did is correct. Thank you!
> 
> Thank you. I've got it now. But, after starting the systematic editing, I
> realize that I find it easier to edit the file locally in Kate text editor
> and upload the final form when I'm done, rather than publish one change
> after another. 
> 
> I am using the LO Writer spell checking to identify correct forms that are
> wrongly listed there, and there are many more than expected, so the work
> will take some time, but I am committed to it.
> 
> Therefore, I'll only publish the final result.

By the way, will my changes be automatically accepted? Is there some Romanian linguist or otherwise accepted expert that needs to confirm my changes? Do I have to list my arguments for various changes? How does this work?
Comment 39 cipricus 2023-05-15 07:53:46 UTC
(In reply to Gabriel Masei from comment #33)
> (In reply to Mike Kaganski from comment #24)

I think another rule can be identified: a wrong form should never be auto-corrected when multiple correct forms are possible.
Comment 40 cipricus 2023-05-15 09:07:49 UTC
(In reply to cipricus from comment #39)
> (In reply to Gabriel Masei from comment #33)
> > (In reply to Mike Kaganski from comment #24)
> 
> I think another rule can be identified: a wrong form should never be
> auto-corrected when multiple correct forms are possible.

But that aspect is not the object of the present problem (focussing on not changing existing/correct forms).

It could be the topic of a separate bug report (removing all corrections that are not the only possible ones).
Comment 41 cipricus 2023-05-15 10:15:48 UTC
> But that aspect is not the object of the present problem (focussing on not
> changing existing/correct forms).
> 
> It could be topic of a separate bug report (removing all corrections that
> are not the only possible ones).

https://bugs.documentfoundation.org/show_bug.cgi?id=155315
Comment 42 cipricus 2023-05-15 11:18:42 UTC
(In reply to Mike Kaganski from comment #36)
> (In reply to cipricus from comment #35)
> > Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an
> > empty file. 
> 
> Use extras/source/autocorr/lang/ro/DocumentList.xml - everything else you
> did is correct. Thank you!

I have finished the editing of the DocumentList.xml file.

I have accessed that address and in the end have replaced the old form with the edited one, saved and published. But I cannot tell if all went fine. (It's the first time I have done something like this.)
Comment 43 cipricus 2023-05-15 11:21:25 UTC
(In reply to cipricus from comment #42)
> (In reply to Mike Kaganski from comment #36)
> > (In reply to cipricus from comment #35)
> > > Trying /core/extras/source/autocorr/lang/ro/DocumentList.xml it opens an
> > > empty file. 
> > 
> > Use extras/source/autocorr/lang/ro/DocumentList.xml - everything else you
> > did is correct. Thank you!
> 
> I have finished the editing of the DocumentList.xml file.
> 
> I have accessed that address and in the end have replaced the old form with
> the edited one, saved and published. But cannot tell if all went fine. (It's
> the first time I do something like this.)

Here is the new file /core/extras/source/autocorr/lang/ro/DocumentList.xml:

https://www.dropbox.com/s/u0owvh3iqo654xs/DocumentList.xml?dl=0
Comment 44 cipricus 2023-05-15 11:46:06 UTC
By comparison to the English and especially French auto-correction lists, the Romanian one is huge. 
The very probable reason for this is that the Romanian one is not only intended to correct frequent writing errors of the type we see in French and English, but is intended as a tool to write Romanian diacritics, without actually typing them, by letting the corrector make the changes. 

The number of entries in the Romanian corrector is not dictated by the number of expected errors but by that of the correct words with diacritics which are expected to be "written" with the help of the auto-corrector. - Many "errors" listed there are intended errors (that is words intentionally written without diacritics) meant to be corrected by the auto-correction tool.

(Of course, that has only been partially implemented, or otherwise that list would have included two thirds of all Romanian words!)

My changes were represented in the great majority by the removal of what I considered bad entries, namely forms that don't need correction:

    • Words that can be confirmed to be correct - existing in Romanian dictionaries:

- articulated feminine singular adjectives (ending in -a, that should not be changed automatically to non-articulated form ending in -ă)
- nouns, especially articulated forms of feminine singular (ending in -a, that should not be changed automatically to non-articulated form ending in -ă)
- various verb forms (in many cases singular first person, past tense, ending in -am, that should not be changed automatically to plural present first person, ending in -ăm)
- a few two-word expressions that are correct  
- some rare words (mostly nouns, but also some verbs)
- a few rather common words (of all kinds, present there for less apparent reason)
- some proper names.

   
    • A few blatant errors. Some of these didn't require removal of entries, just some changes.
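
For anyone who wants to repeat this kind of review, here is a minimal sketch of how such entries could be flagged automatically. It assumes the list uses the block-list XML layout (a `block` element with `abbreviated-name` for the typed form and `name` for the replacement), and it takes the set of valid words as a plain Python set. The helper name `find_valid_word_entries`, the sample XML and the "intrucat" entry are only illustrations, not taken from the shipped file.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for LibreOffice autocorrect block lists.
NS = "http://openoffice.org/2001/block-list"


def find_valid_word_entries(xml_text, valid_words):
    """Return (typed, replacement) pairs whose typed form is itself a valid word."""
    root = ET.fromstring(xml_text)
    flagged = []
    for block in root.iter(f"{{{NS}}}block"):
        typed = block.get(f"{{{NS}}}abbreviated-name")
        replacement = block.get(f"{{{NS}}}name")
        # An entry is suspicious when the form being "corrected" exists in the language.
        if typed is not None and typed.lower() in valid_words:
            flagged.append((typed, replacement))
    return flagged


# Hypothetical three-entry sample in the assumed format.
SAMPLE = (
    '<block-list:block-list xmlns:block-list="http://openoffice.org/2001/block-list">'
    '<block-list:block block-list:abbreviated-name="oua" block-list:name="ouă"/>'
    '<block-list:block block-list:abbreviated-name="vacante" block-list:name="vacanțe"/>'
    '<block-list:block block-list:abbreviated-name="intrucat" block-list:name="întrucât"/>'
    '</block-list:block-list>'
)

# Tiny stand-in for a real Romanian dictionary.
for typed, replacement in find_valid_word_entries(SAMPLE, {"oua", "vacante"}):
    print(f"{typed} -> {replacement}: typed form is a valid word, candidate for removal")
```

In practice the valid-word set would have to come from a real Romanian dictionary, and the flagged entries would still need a human review, as I did by hand with the spell checker.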
Comment 45 Mike Kaganski 2023-05-15 12:05:14 UTC
(In reply to cipricus from comment #42)
> I have finished the editing of the DocumentList.xml file.

Once you have made the necessary edits and published the changes, do not forget to mark the change active - before that, your change (https://gerrit.libreoffice.org/c/core/+/151770) is in WIP state, and potential reviewers do not look at it, thinking it's not yet ready for review.

Thanks!
Comment 46 cipricus 2023-05-15 12:11:46 UTC
(In reply to Mike Kaganski from comment #45)
> (In reply to cipricus from comment #42)
> > I have finished the editing of the DocumentList.xml file.
> 
> When you made the necessary edits and published the changes, do not forget
> to mark the change active - before that, your change
> (https://gerrit.libreoffice.org/c/core/+/151770) is in WIP state, and
> potential reviewers do not look at it, thinking it's still in a state not
> ready for review.
> 
> Thanks!

I did it!
Comment 47 cipricus 2023-05-15 12:16:01 UTC
I would also like to add that the aforementioned instrumentation of the auto-correction tool for the purpose of writing with diacritics is what led to most of the errors that I have tried to remove: the goal of getting the diacritics gained more importance in the mind of the initial author of that list than the fact that forms without diacritics, which were meant to be replaced by forms with diacritics, were in fact correct words: for example, in order to write "vacanțe"=holidays, the list contains the form "vacante", meant to be written just like that, possibly on an English keyboard that lacked Romanian diacritics, only to be replaced by the corrector. But "vacante" means "vacant", plural, feminine, a correct word, and as frequent as the other! - Thus, the list of forms to be corrected contains very few words with diacritics, but contains (contained) many words without diacritics that are in fact correct.

I don't want to imply that this is necessarily a misuse of the Romanian auto-correction tool, but in a sense I do think that. 

Without the errors entailed by this use of the tool, the final result may be in fact useful to people writing in Romanian without a Romanian keyboard layout. (Most Romanians use in fact English keyboards, and I imagine that not all are able to use a Romanian layout on them, where the printed keys do not match.) - On the other hand, that is a wrong, partial and desperate solution, given that not all words with diacritics can be written in that way, and in the end people will either write Romanian without diacritics at least partially (that is incorrectly), or use a proper kb layout, which makes this whole approach meaningless.
Comment 48 Gabriel Masei 2023-05-15 18:03:27 UTC
(In reply to cipricus from comment #39)
> (In reply to Gabriel Masei from comment #33)
> > (In reply to Mike Kaganski from comment #24)
> 
> I think another rule can be identified: a wrong form should never be
> auto-corrected when multiple correct forms are possible.

I think that this case falls under the rule no. 2: if you have multiple replacements for the same wrong form and you have to choose one of them then there is a chance that the choice could be a wrong one. So I think that there is no need for a separate rule.

However, if there is a good chance that this case could be unintentionally skipped then an extra sentence to the rule no. 2 could be added. Something like: This rule covers also the case when there could be multiple replacements for the same wrong form.
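
As a rough illustration of how both cases could be checked mechanically, here is a sketch that derives "safe" replacements from a word list: a diacritic-less form is kept only if it is not itself a word and maps to exactly one form with diacritics. The function names and the tiny word list are hypothetical; a real check would need a full Romanian dictionary.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'ouă' -> 'oua', 'vacanțe' -> 'vacante'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def safe_replacements(dictionary_words):
    """Map each diacritic-less form to its replacement, keeping only unambiguous ones."""
    words = set(dictionary_words)
    candidates = defaultdict(set)
    for word in words:
        plain = strip_diacritics(word)
        if plain != word:
            candidates[plain].add(word)

    safe = {}
    for plain, targets in candidates.items():
        if plain in words:
            continue  # the typed form is itself a valid word
        if len(targets) > 1:
            continue  # several possible replacements for the same wrong form
        safe[plain] = next(iter(targets))
    return safe


# Hypothetical mini-dictionary: "oua" is a word, "rau" could be "rău" or "râu".
print(safe_replacements(["oua", "ouă", "rău", "râu", "întrucât"]))
```

With this word list, only "intrucat" survives: "oua" is rejected because it exists as a word, and "rau" because it has two possible replacements.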
Comment 49 cipricus 2023-05-15 18:24:48 UTC
(In reply to Gabriel Masei from comment #48)

> I think that this case falls under the rule no. 2

I agree. In fact that is not a new rule, but to be enforced (even just for the Romanian auto-corrector) it requires a new bug report (given that this one is only  about useless correction of correct words). I have posted this: https://bugs.documentfoundation.org/show_bug.cgi?id=155315

It would require that I take out a few more entries from the list that I just published: https://gerrit.libreoffice.org/c/core/+/151770/2/extras/source/autocorr/lang/ro/DocumentList.xml
Comment 50 cipricus 2023-06-02 12:16:55 UTC
(In reply to Julien Nabet)

Is there a way to manually modify my installed LO so that the changes I made are applied? In what version will they be default?
Comment 51 Commit Notification 2023-06-02 16:26:04 UTC
Cip Cipricus committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e7c9f78e4ce5ab5ecd3ccfd06fef71f10f5df8db

tdf#155087 Autocorrection in Romanian applies to existing words

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 52 Julien Nabet 2023-06-02 17:01:20 UTC
(In reply to cipricus from comment #50)
> (In reply to Julien Nabet)
> 
> Is there a way to manually modify my installed LO so that the changes I made
> are applied? In what version will they be default?

Sorry, I've got no idea how to convert xml into .dat file as expected by LO.
Comment 53 cipricus 2023-06-02 17:56:32 UTC
(In reply to Julien Nabet from comment #52)
> (In reply to cipricus from comment #50)
> how to convert xml into .dat file as expected by LO.

I have found how: the xml is inside the dat, which is a zip archive. Once an autocorrection list is edited, a corresponding dat per user is created in ~/.config/libreoffice/4/user/autocorr (Linux), or C:\Program Files\LibreOffice\share\autocorr (Windows).
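
A minimal sketch of reading the list out of such an archive, assuming the entry inside the dat is named DocumentList.xml like the source file; the function name `read_document_list` and the file name in the usage note are only examples:

```python
import io
import zipfile


def read_document_list(dat_source):
    """Return the text of DocumentList.xml stored inside an autocorrect .dat archive.

    dat_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(dat_source) as archive:
        return archive.read("DocumentList.xml").decode("utf-8")


# Self-contained demo: build an in-memory archive standing in for a real .dat file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("DocumentList.xml", "<demo>ouă</demo>")
demo.seek(0)
print(read_document_list(demo))
```

On a real profile this would be called with the path to the dat file, for example something like `read_document_list("acor_ro-RO.dat")` (the exact file name is an assumption). Writing the edited xml back would mean rebuilding the archive, so keep a backup copy of the original first.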
Comment 54 Julien Nabet 2023-06-02 19:49:31 UTC
(In reply to cipricus from comment #53)
> (In reply to Julien Nabet from comment #52)
> > (In reply to cipricus from comment #50)
> > how to convert xml into .dat file as expected by LO.
> 
> I have found how: the xml is inside the dat, which is a zip archive. Once an
> autocorrection list is edited, a corresponding dat per user is created in
> ~/.config/libreoffice/4/user/autocorr (Linux), or C:\Program
> Files\LibreOffice\share\autocorr (Windows).
Good to know!