What the screenshots show are typographically relevant variants of whitespace. 0x00A0 is the non-breaking space, and 0x202F is the narrow no-break space. Both have their own Wikipedia pages where you can look up what they are used for. And while browsers may display them the same as a regular space, because they just suck at typography, the same is not true for word processing or layout software.
Granted, they can be detected and used as watermarks, but I regularly use them in my text formatting software as well. The resulting text flows better and will not be wrapped in illogical ways. So I don’t think they are intended as watermarks, and I also don’t think they will be going away again. It’s just such a tremendous improvement of the output.
My guess is that one of the major motivations for this is to identify their own texts when training future AI models. Training LLMs on LLM-generated data is harmful to their performance and leads to regression, but more and more of the data they scrape from the internet is LLM-generated. With measures like this they may be able to filter out a significant chunk of the data they generated themselves from future training data.
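A filter like that could look roughly like the sketch below. It only checks for the two code points mentioned in this thread (0x00A0 and 0x202F); the function names and the threshold are made up for illustration, and nothing here is known to be what any vendor actually does. Note that it would also flag human-typeset text that uses these spaces on purpose.

```python
import unicodedata

# The two code points discussed in this thread.
SPECIAL_SPACES = {"\u00a0", "\u202f"}


def special_space_counts(text):
    """Count each special space in text, keyed by its Unicode name."""
    counts = {}
    for ch in text:
        if ch in SPECIAL_SPACES:
            name = unicodedata.name(ch)
            counts[name] = counts.get(name, 0) + 1
    return counts


def looks_marked(text, threshold=1):
    """Hypothetical filter rule: flag text with at least `threshold` special spaces."""
    return sum(special_space_counts(text).values()) >= threshold
```

Something like `looks_marked(document)` could then be used to drop scraped documents before training, at the cost of false positives on carefully typeset human text.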
I wouldn’t really call these watermarks. If these are watermarks, then someone might call the longer than usual dash a watermark, too:
That long dash is called an em dash — like this one.
Using identically displayed but differently encoded characters is a way to watermark texts. It was used in a lawsuit a few years ago (SZ report). The suing company eventually lost because they didn’t actually own the rights to the texts they had watermarked.
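To make the idea concrete, here is a toy sketch of such a watermark. It assumes a very simple scheme that I made up for illustration: each regular space carries one bit, and a 1 bit is written by swapping in a non-breaking space (0x00A0). The names `embed` and `extract` are hypothetical, and this is not the scheme from the lawsuit.

```python
NBSP = "\u00a0"  # non-breaking space, renders like a regular space in most browsers


def embed(text, bits):
    """Hide bits in text: the n-th regular space becomes NBSP if bit n is 1."""
    remaining = iter(bits)
    out = []
    for ch in text:
        if ch == " ":
            # Once the bits run out, leave the remaining spaces unchanged.
            out.append(NBSP if next(remaining, 0) else " ")
        else:
            out.append(ch)
    return "".join(out)


def extract(text):
    """Read one bit per space: 1 for NBSP, 0 for a regular space."""
    return [1 if ch == NBSP else 0 for ch in text if ch in (" ", NBSP)]
```

The marked text looks the same on screen, but `extract(embed("a b c d", [1, 0, 1]))` gives back the bits. Anything that normalizes whitespace destroys the mark, which is one reason this kind of watermark is fragile.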
As @luckystarr@feddit.org points out, these whitespaces can make quite a difference to the typography, so they are unlikely to be a watermark. Methods for watermarking LLM-generated text are more subtle anyway and involve altering word frequencies.
En dash?
No, the em dash