• luckystarr@feddit.org
    3 days ago

    What the screenshots show are typographically meaningful variants of the space character. U+00A0 is the no-break space and U+202F is the narrow no-break space. Both have their own Wikipedia pages where you can look up what they are used for. Browsers may render them the same as a regular space, because they just suck at typography, but the same is not true for word processing or layout software.

    Granted, they can be detected and used as watermarks, but I have regularly used them in my own text formatting software as well. The resulting text flows better and does not get wrapped in illogical places. So I don’t think they are intended as watermarks, and I also don’t think they will go away again. They are simply a tremendous improvement to the output.
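For anyone who wants to check a piece of text for the two code points mentioned above, here is a minimal sketch in Python. The function names `find_special_spaces` and `normalize_spaces` are made up for this example, and the set of characters is limited to the two discussed here; other space variants exist and are not covered.

```python
import unicodedata

# The two space variants discussed above (assumption: only these two matter here).
SPECIAL_SPACES = {"\u00A0", "\u202F"}


def find_special_spaces(text):
    """Return (index, code point, Unicode name) for each special space in text."""
    hits = []
    for index, char in enumerate(text):
        if char in SPECIAL_SPACES:
            hits.append((index, f"U+{ord(char):04X}", unicodedata.name(char)))
    return hits


def normalize_spaces(text):
    """Replace the special spaces with a regular space (U+0020)."""
    for char in SPECIAL_SPACES:
        text = text.replace(char, " ")
    return text


sample = "10\u202Fkm per\u00A0day"
for hit in find_special_spaces(sample):
    print(hit)
```

Note that `normalize_spaces` throws away exactly the typographic benefit described above, so it probably only makes sense where the text is processed further rather than displayed.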

  • excral@feddit.org
    3 days ago

    My guess is that one of the major motivations for this is to identify their own texts when training future AI models. Training LLMs on LLM-generated data harms their performance and leads to regression, but more and more of the data they scrape from the internet is LLM-generated. With measures like this, they may be able to filter a significant chunk of their own generated data out of future training sets.

  • cron@feddit.org
    3 days ago

    I wouldn’t really call these watermarks. If these are watermarks, then someone might call the longer-than-usual dash a watermark, too:

    That long dash is called an em dash — like this one.

    • General_Effort@lemmy.world
      2 days ago

      Using identically displayed but differently encoded characters is a way to watermark texts. It was used in a lawsuit a few years ago (SZ report). The suing company eventually lost because they didn’t actually own the rights to the texts they had watermarked.

      As @luckystarr@feddit.org points out, these whitespace characters can make quite a difference to the output, so they are not likely to be a watermark. Methods for watermarking LLM-generated text are more subtle anyway and involve altering word frequencies.
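To illustrate the general idea of watermarking with identically displayed but differently encoded characters, here is a rough sketch in Python. It is not the scheme from the lawsuit, and not something any LLM vendor is known to use. The functions `embed_bits` and `extract_bits` and the choice of U+00A0 as the marker are assumptions made up for this example.

```python
PLAIN = "\u0020"  # regular space, stands for bit 0
MARK = "\u00A0"   # no-break space, stands for bit 1 (arbitrary choice for this sketch)


def embed_bits(text, bits):
    """Hide bits in text by swapping regular spaces for no-break spaces.

    Each regular space carries one bit; spaces beyond the end of bits stay plain.
    """
    remaining = iter(bits)
    out = []
    for char in text:
        if char == PLAIN:
            out.append(MARK if next(remaining, 0) else PLAIN)
        else:
            out.append(char)
    return "".join(out)


def extract_bits(text):
    """Read one bit per space: 1 for a no-break space, 0 for a regular one."""
    return [1 if char == MARK else 0 for char in text if char in (PLAIN, MARK)]


original = "a b c d e"
marked = embed_bits(original, [1, 0, 1, 1])
print(extract_bits(marked))
```

The sketch assumes the input contains no no-break spaces to begin with. Any that are already there would be read back as 1 bits, and replacing every U+00A0 with a regular space erases the mark completely, which is part of why this kind of watermark is easy to strip.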