• Probius@sopuli.xyz · 223 points · 4 days ago

    This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.

    • FauxLiving@lemmy.world · 76 points · 4 days ago

      If it's disrupting their site, it's already a crime. The problem is finding the people behind it. This won't be some guy on his dorm PC, and they'll likely be in places Interpol can't reach.

    • eah@programming.dev · 19 points · 4 days ago

      Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they’re also people, aren’t they?

    • isolatedscotch@discuss.tchncs.de · 28 points · 4 days ago

      Good luck with that! Not only is a company doing it, which means no individual person will go to prison, but it's a Chinese company with no regard for any laws that might get passed.

      • humanspiral@lemmy.ca · 15 up, 3 down · 4 days ago

        The people determining US legislation have said, "How can we achieve Skynet if our tech-trillionaire company sponsors can't evade copyright or content licensing?" But they also say, "If we don't spend every penny you have on achieving a US-controlled Skynet, then China wins."

        Speculating that "the Huawei network can solve this" doesn't mean all the bots are Chinese, but it does confirm that China has a lot of AI research going on, that Huawei GPUs/NPUs are getting used, and that they're successfully solving this particular "I am not a robot" challenge.

        It's really hard to call an amateur-coding-challenge competition website a national security threat, but if you hype Huawei enough, then surely the US will give up on AI like it gave up on solar, and maybe EVs. "If we don't adopt Luddite politics and all become Amish, then China wins" is a "promising" new loser perspective on media manipulation.

  • sp3ctr4l@lemmy.dbzer0.com · 26 points · 3 days ago

    Do we all want the fucking Blackwall from Cyberpunk 2077?

    Fucking NetWatch?

    Because this is how we end up with them.

    …excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.

  • 0_o7@lemmy.dbzer0.com · 37 up, 1 down · 4 days ago

    I blocked almost all the big hosting players, plus China, Russia and Vietnam, and now they're bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.

    Soon everything on the internet will be behind a wall.

      • aev_software@programming.dev · 20 points · 4 days ago

        In the meantime, sites are getting DDoSed by scrapers. One way to stop your site from getting scraped is for it to be inaccessible... which is exactly what the scrapers are causing.

        Normally I would assume DDoSing is done to take a site offline. But AI scrapers need the opposite: they need their targets online and responsive. You'd think they'd be a bit more careful about the damage they cause.

        But they aren't, because capitalism.

        • Natanael@infosec.pub · 5 points · 4 days ago

          If they had the slightest bit of survival instinct they'd share an archive.org / Google-style scraper and web-cache infrastructure and pull from those caches, so everything would be scraped just once and only re-scraped occasionally.

          Instead they're building maximally dumb (as in literally counterproductive and self-harming) scrapers that don't know what they're interacting with.

          At what point will people start to track down and sabotage AI datacenters IRL?

    • ILikeTraaaains@lemmy.world · 2 points · 3 days ago

      Not necessarily compromised. I saw a VPN provider (don't remember the name) that offered a free tier where the client agrees to be used for this.

      And I suspect that in the future some VPN companies will be exposed doing the same with their paid customers.

  • Blackmist@feddit.uk · 25 points · 4 days ago

    Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

    • floofloof@lemmy.ca · 84 up, 1 down · 4 days ago

      But they bring profits to tech billionaires. No action will be taken.

      • BodilessGaze@sh.itjust.works · 13 up, 1 down · 4 days ago

        No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that’s dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There’s nothing we can do legally about Chinese scrapers.

          • BodilessGaze@sh.itjust.works · 12 points · 3 days ago · edited

            We do, somewhat. We haven’t gone as far as a blanket ban of Chinese CIDR ranges because there’s a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small company like Codeberg, since they have higher risk tolerance and can move faster.

    • Programmer Belch@lemmy.dbzer0.com · 40 points · 4 days ago

      I use a tool that downloads a website every day to check for new chapters of the series I follow, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

      The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the origin server cannot handle, resulting in slow traffic.
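
      For anyone curious, roughly what such a daily check can look like (the URLs and the CSS selector are made up; this sketch assumes requests + BeautifulSoup):

      ```python
      # Rough sketch: check a handful of series pages once a day and write an RSS file.
      import time
      import requests
      from bs4 import BeautifulSoup
      from xml.sax.saxutils import escape

      SERIES = {
          "Some Series": "https://example.com/series/some-series",    # hypothetical URL
          "Other Series": "https://example.com/series/other-series",  # hypothetical URL
      }

      def fetch_chapters(url):
          resp = requests.get(url, headers={"User-Agent": "chapter-rss/0.1 (personal use)"}, timeout=30)
          resp.raise_for_status()
          soup = BeautifulSoup(resp.text, "html.parser")
          # Assumed markup: each chapter is an <a class="chapter-link"> -- adjust to the real site.
          return [(a.get_text(strip=True), a["href"]) for a in soup.select("a.chapter-link")]

      items = []
      for name, url in SERIES.items():
          for title, link in fetch_chapters(url):
              items.append("<item><title>%s</title><link>%s</link></item>"
                           % (escape(f"{name}: {title}"), escape(link)))
          time.sleep(5)  # space the 10-20 requests out instead of firing them all at once

      rss = ("<?xml version='1.0'?><rss version='2.0'><channel>"
             "<title>New chapters</title><link>https://example.com</link>"
             "<description>Daily chapter check</description>"
             + "".join(items) + "</channel></rss>")

      with open("chapters.xml", "w", encoding="utf-8") as f:
          f.write(rss)
      ```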

        • who@feddit.org · 18 points · 4 days ago · edited

          Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
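
          For a small scraper like GP's, honoring 429 on the client side is only a few lines; a rough sketch assuming the requests library:

          ```python
          # Sketch: fetch a page, back off when the server answers 429 Too Many Requests.
          import time
          import requests

          def polite_get(url, max_retries=3):
              for _ in range(max_retries):
                  resp = requests.get(url, timeout=30)
                  if resp.status_code != 429:
                      resp.raise_for_status()
                      return resp
                  # Retry-After may be missing or non-numeric; fall back to a fixed delay.
                  retry_after = resp.headers.get("Retry-After", "60")
                  time.sleep(int(retry_after) if retry_after.isdigit() else 60)
              raise RuntimeError(f"still rate-limited after {max_retries} attempts: {url}")
          ```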

          • redjard@lemmy.dbzer0.com · 9 points · 4 days ago

            Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers …
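
            Python's standard-library robots.txt parser even exposes it, for what that's worth; a quick sketch with a made-up site and crawl loop:

            ```python
            # Sketch: read Crawl-delay from robots.txt and pace requests accordingly.
            import time
            import urllib.robotparser

            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("https://example.com/robots.txt")  # hypothetical site
            rp.read()

            agent = "my-little-crawler"
            delay = rp.crawl_delay(agent) or 10  # fall back to 10s if no Crawl-delay is set
            for url in ("https://example.com/a", "https://example.com/b"):
                if rp.can_fetch(agent, url):
                    pass  # fetch the page here
                time.sleep(delay)
            ```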

            • who@feddit.org · 2 points · 3 days ago · edited

              Crawl-delay

              It’s a nonstandard extension without consistent semantics or wide support, but I suppose it’s good to know about anyway. Thanks for mentioning it.

          • S7rauss@discuss.tchncs.de · 4 points · 4 days ago

            I was responding to their question about whether scraping the site is considered harmful. I'd say that as long as they're not ignoring robots.txt, they shouldn't be contributing a significant amount of traffic if they're really only pulling data once a day.

      • ulterno@programming.dev · 8 points · 4 days ago

        If the site is getting slowed down at times (regardless of whether it's when you scrape), you might want to not scrape at all.

        Downloading the whole site is probably not a good idea either, but that depends on the site.

        • If it's a static site, just setting up your scraper to skip CSS/JS, images and videos should make a difference.
        • For a dynamically generated site, there's not much I can say.
        • Then again, if you reduce your downloads to only what you actually use, as far as possible, that might be good enough.
        • Since sites are made for human consumption in the first place, consider keeping your link-traversal rate similar to a human's.
        • Best of all would be to ask the website dev whether they have an API available.
          • Even better, ask them to provide an RSS feed.
        • Programmer Belch@lemmy.dbzer0.com · 3 points · 4 days ago

          As far as I know the website doesn't have an API, so I just download the HTML and format the result with a simple Python script. It makes around 10 to 20 requests each time, one for each series I'm following.

          • limerod@reddthat.com · 2 points · 4 days ago

            You can use conditional requests / timestamping in curl or wget so you don't download the same CSS and HTML twice. You can also skip JavaScript and image files to save on unnecessary requests.

            I would reduce the frequency to once every two days to further reduce the impact.
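
            The same idea in Python is a conditional GET: remember the ETag / Last-Modified the server sent and only re-download when it changed. A rough sketch (cache file name made up):

            ```python
            # Sketch: conditional GET -- ask the server "only send this if it changed".
            import json
            import os
            import requests

            CACHE_FILE = "validators.json"  # stores ETag / Last-Modified per URL
            cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}

            def fetch_if_changed(url):
                headers = {}
                if url in cache and cache[url].get("etag"):
                    headers["If-None-Match"] = cache[url]["etag"]
                if url in cache and cache[url].get("modified"):
                    headers["If-Modified-Since"] = cache[url]["modified"]
                resp = requests.get(url, headers=headers, timeout=30)
                if resp.status_code == 304:  # unchanged, server sent no body
                    return None
                resp.raise_for_status()
                cache[url] = {"etag": resp.headers.get("ETag"),
                              "modified": resp.headers.get("Last-Modified")}
                with open(CACHE_FILE, "w") as f:
                    json.dump(cache, f)
                return resp.text
            ```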

          • ulterno@programming.dev · 1 point · 4 days ago

            That might or might not be much.
            It depends on the site, I'd say.

            E.g. if it's something like Netflix, I wouldn't worry much, because they have the means to serve the requests.
            But for some PeerTube instance, even a single request seems to be too heavy. So if that server doesn't respond to my request, I usually wait an hour or so before refreshing the page.

      • Flax@feddit.uk · 5 up, 1 down · 4 days ago

        The problem is that these are constant hordes of bots and datacentres. You have one tool; sending a few requests from your device wouldn't even dent a Raspberry Pi, never mind a beefier server.

        I think the intent behind the traffic also matters. Your tool lets you consume the content the website freely provides. Their tool lets them profit off the work on the website.

  • gressen@lemmy.zip · 83 up, 2 down · 4 days ago

    Write TOS that state that crawlers automatically accept a service fee and then send invoices to every crawler owner.

    • BodilessGaze@sh.itjust.works · 42 points · 4 days ago

      Huawei is Chinese. There’s literally zero chance a European company like Codeberg is going to successfully collect from a company in China over a TOS violation.

      • wischi@programming.dev · 15 points · 4 days ago

        It's not even a company. It's a non-profit, a German "eingetragener Verein" (registered association). They have very limited resources, especially money, because they live purely on membership fees and donations.

        • BodilessGaze@sh.itjust.works · 8 points · 4 days ago · edited

          I really doubt it. Lawsuits are expensive, and proving responsibility is difficult, since plausible deniability is easy. All scrapers need to do is use shared IPs (e.g. cloud providers), preferably owned by a company in a different legal jurisdiction. That could be the case here: a European company could be using Huawei Cloud to mask the source of their traffic.

          • Venia Silente@lemmy.dbzer0.com · 5 points · 4 days ago

            All scrapers need to do is use shared IPs (e.g. cloud providers),

            Simple: just charge the cloud provider.

            Once the pressure gets strong enough, they'll start putting anti-scraping terms in their TOS.

            • wischi@programming.dev · 5 points · 4 days ago · edited

              And then they just throw it in the bin, because there was never a contract between you and them. What do you do then? Sue Microsoft, Amazon and Google?

              I'm sure Codeberg, a German non-profit Verein, has the time and money to do that 🤣.

              • Venia Silente@lemmy.dbzer0.com · 1 point · 2 days ago

                Sure, but that's a whole different part of the system. Society as a whole has to change (some guillotines would help), and no matter how cool Codeberg is, they can't do all that on their own.

                In the meantime, the thing the elites visibly respond to, and the one that's most readily accessible, is monetary cost. Make it costly (operationally or legally) to scrape sites, and they'll stop, or at least whine about it.

    • wischi@programming.dev · 38 up, 1 down · 4 days ago

      They typically don’t include a billing address in the User Agent when crawling 🤣

      • gressen@lemmy.zip · 11 up, 2 down · 4 days ago

        That’s a technicality. The billing address can be discovered for a nominal fee as well.

        • wischi@programming.dev · 7 points · 4 days ago · edited

          I'm sure it can't, especially for foreign IP addresses, VPNs, and a ton of other situations. Even if someone connects to the internet directly via their ISP, many countries in Europe (don't know about the US) have laws that would require very good reasons and a court order to get the info you need from the ISP - for a single(!) case.

          If it were possible to simply get the address of every digital visitor, we wouldn't have to develop all this anti-scraping tech; we'd just sue them.

  • tal@lemmy.today · 20 points · 4 days ago

    If someone just wants to download code from Codeberg for training, it seems like it’d be way more efficient to just clone the git repositories or even just download tarballs of the most-recent releases for software hosted on Codeberg than to even touch the Web UI at all.

    I mean, maybe you need the Web UI to get a list of git repos, but I’d think that that’d be about it.
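
    Codeberg runs Forgejo, which as far as I know exposes a Gitea-style REST API for listing repositories, so something like this sketch would avoid the web UI almost entirely (endpoint and field names from memory, so double-check them against the API docs):

    ```python
    # Sketch: enumerate repositories via the Gitea/Forgejo-style search API and
    # shallow-clone them with git instead of crawling the web UI.
    import subprocess
    import requests

    API = "https://codeberg.org/api/v1/repos/search"

    page = 1
    while True:
        resp = requests.get(API, params={"page": page, "limit": 50}, timeout=30)
        resp.raise_for_status()
        repos = resp.json().get("data", [])
        if not repos:
            break
        for repo in repos:
            # --depth 1 fetches only the latest snapshot, which is all training needs
            subprocess.run(["git", "clone", "--depth", "1", repo["clone_url"]], check=True)
        page += 1
    ```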

    • witten@lemmy.world · 27 points · 4 days ago

      Then they'd have to bother understanding the content and downloading it appropriately. You'd think that if anyone could understand and parse websites in real time to make download decisions, it would be the giant AI companies. But ironically they're only interested in hoovering everything up as plain web pages to feed into their raw training data.

      • Natanael@infosec.pub · 17 points · 4 days ago

        The same morons scrape Wikipedia instead of downloading the archive files, which can trivially be rendered as web pages locally.

  • MonkderVierte@lemmy.zip · 18 points · 4 days ago · edited

    I had thought that a client-side proof-of-work (or even just a delay) bound to the IP might deter the AI companies into behaving instead, because single-visit-per-IP crawlers get too expensive and slow, and you can just block normal abusive crawlers. But they already have mind-blowing computing and money resources, and they only want your data.

    But what if there were a simple-to-use, integrated solution and every single webpage used this approach?
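
    The idea is basically hashcash: the server hands out a random challenge and the client has to burn CPU finding a hash with enough leading zero bits before it gets content. A toy sketch of that kind of challenge (difficulty and encoding are made up):

    ```python
    # Toy proof-of-work, hashcash-style: find a counter so that
    # sha256(challenge + counter) starts with `difficulty` zero bits.
    import hashlib
    import os

    def solve(challenge: bytes, difficulty: int) -> int:
        counter = 0
        while True:
            digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
            # a small integer value means many leading zero bits
            if int.from_bytes(digest, "big") >> (256 - difficulty) == 0:
                return counter
            counter += 1

    def verify(challenge: bytes, counter: int, difficulty: int) -> bool:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

    challenge = os.urandom(16)     # the server would issue this per visitor/IP
    answer = solve(challenge, 20)  # ~a million hashes on average: cheap once, costly at crawler scale
    assert verify(challenge, answer, 20)
    ```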

    • witten@lemmy.world · 12 points · 4 days ago

      Believe me, these AI corporations have way too many IPs to make this feasible. I’ve tried per-IP rate limiting. It doesn’t work on these crawlers.

    • Taldan@lemmy.world · 2 points · 4 days ago

      Are you planning to just outright ban IPv6 (and thus half the world)?

      Any IP based restriction is useless with IPv6

    • explodicle@sh.itjust.works · 2 points · 4 days ago

      What if we had some protocol by which the proof-of-work is transferable? Then not only would there be a cost to using the website, but also the operator would receive that cost as payment.

      • Taldan@lemmy.world · 4 points · 4 days ago · edited

        It's theoretically viable, but every time it has been tried it has failed.

        There are a lot of practical issues, mainly that it's functionally identical to crypto-miner malware.

    • daniskarma@lemmy.dbzer0.com · 2 up, 8 down · 4 days ago

      The solution was invented long ago: it's called a captcha.

      It's a little bother for legitimate users, but a good captcha is still hard to bypass, even using AI.

      And from the end user's standpoint, I'd rather lose 5 seconds on a captcha than have my browser run an unsolicited, heavy crypto challenge on my end.

        • daniskarma@lemmy.dbzer0.com · 2 up, 1 down · 3 days ago · edited

          I tried, and not really.

          I had to scrape a site that has a captcha, and no AI was able to consistently solve it.

          In order to "crack" it, I had to replicate the captcha-generation algorithm as best I could and train a custom model to solve it. Only then could I crack it open. And I was lucky the captcha-generation algorithm wasn't too complex and was easy to replicate.

          That amount of work is a far greater load than Anubis's crypto challenges.

          Take into account that AI-driven OCR feeds on existing examples; if your captcha is novel enough, they're going to have a hard time solving it.

          It would also drain power, which is the whole point of Anubis.

          • mholiv@lemmy.world · 1 point · 3 days ago

            There is a difference between you (or me) sitting at home working on this and a team of highly motivated people with unlimited money.

            • daniskarma@lemmy.dbzer0.com · 1 up, 1 down · 3 days ago · edited

              The point isn't that it can't be done; it's that the cost is most likely higher than with Anubis.

  • chicken@lemmy.dbzer0.com · 36 up, 1 down · 4 days ago

    Seems like such a massive waste of bandwidth since it’s the same work being repeated by many different actors to piece together the same dataset bit by bit.

      • daniskarma@lemmy.dbzer0.com · 28 points · 4 days ago

        On the contrary. Open, community-based blocklists can be very effective. Everyone can contribute to them and choke off people with malicious intent.

        If you're thinking something like "if the blocklist is public, malicious agents simply won't use those IPs", I don't think that makes much sense: the malicious agent will know its IPs are blocked as soon as it uses them anyway.
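
        As for how a site would actually consume such a list: checking client IPs against a set of shared CIDR ranges is only a few lines. A sketch with a made-up list URL:

        ```python
        # Sketch: load a community-maintained list of abusive CIDR ranges and
        # check client IPs against it. The list URL is hypothetical.
        import ipaddress
        import requests

        resp = requests.get("https://example.org/ai-scraper-blocklist.txt", timeout=30)
        resp.raise_for_status()
        blocked = [ipaddress.ip_network(line.strip())
                   for line in resp.text.splitlines()
                   if line.strip() and not line.startswith("#")]

        def is_blocked(client_ip: str) -> bool:
            addr = ipaddress.ip_address(client_ip)
            return any(addr in net for net in blocked)

        print(is_blocked("198.51.100.23"))  # documentation-range example address
        ```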

        • pedz@lemmy.ca · 9 points · 3 days ago

          Just to give an example of public lists that work: I run an IRC server and it gets bombarded with spam bots. It's horrible around the Super Bowl for some reason, but it continues year-round.

          So I added a few public anti-spam lists like DroneBL to the config, and the vast majority of the bots are automatically G-lined/banned.
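
          For anyone curious, a DNSBL check is just a reversed-octet DNS lookup; a sketch (zone name quoted from memory, check DroneBL's docs):

          ```python
          # Sketch: check an IPv4 address against a DNS blocklist (DNSBL).
          # Listed addresses resolve (typically to 127.0.0.x); unlisted ones give NXDOMAIN.
          import socket

          def dnsbl_listed(ipv4: str, zone: str = "dnsbl.dronebl.org") -> bool:
              reversed_ip = ".".join(reversed(ipv4.split(".")))
              try:
                  socket.gethostbyname(f"{reversed_ip}.{zone}")
                  return True
              except socket.gaierror:
                  return False

          print(dnsbl_listed("203.0.113.5"))  # documentation-range example address
          ```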

        • adminofoz@lemmy.cafe · 3 points · 3 days ago

          I'm sure there are many, but I just learned of CrowdSec's WAF this year, which has a shared ban list. It's pretty cool, and I'm using it in prod right now. I'm not saying it's the be-all and end-all, but as part of a multilayered approach it works pretty well.

  • rozodru@lemmy.world · 15 points · 4 days ago

    I run my own Gitea instance on my own server, and within the past week or so I've noticed it getting absolutely nailed. One repo in particular, a Wayland WM I built, just keeps getting hammered over and over by IPs in China.