
Because the world wouldn’t be literally on fire, that’s hyperbole.
You’re still setting a high standard here. What counts as a “well-trained” human, and how many Stack Overflow commenters actually qualify? “Easier to teach” is also complicated: it takes decades for a human to become well trained, while an LLM can be trained in weeks. And an individual computer running the LLM is “trained” in minutes; it just needs to load the model into memory. Once you have an LLM, you can run as many instances of it as you’re willing to pay for.
There’s no guarantee LLM will get reliably better at everything
Never said they would. I said they’re as bad as they’re ever going to be, which allows for the possibility that they don’t get any better.
Even if they don’t, though, they’re still good enough to have killed Stack Overflow.
It still makes some mistakes today that it did when introduced and nobody knows how to fix that yet
And humans also make mistakes. Do we know how to fix that yet?
If they aren’t comfortable with their Discord messages being public, perhaps they shouldn’t have posted those messages in a public forum that the public can access.
Good thing human teachers never have hidden biases.
How does this play out when you hold a human contributor to the same standards? They also often fail to summarize information accurately or bring up the wrong thing. Lots of answers on Stack Overflow are just plain wrong, or focus on the wrong thing, or don’t reference the correct sources (when they reference anything at all). The most common criticism of Stack Overflow I’m seeing is how its human contributors direct people to other threads and declare that the question is “already answered” there when it isn’t really.
LLMs can do a decent job. And right now they are as bad as they’re ever going to be.
That’s the neat thing, you don’t.
LLM training is primarily about getting the LLM to understand concepts. When you need it to be factual, or are working with it to solve novel problems, you can put a bunch of relevant information into the LLM’s context and it can use that even if it wasn’t explicitly trained on it. It’s called RAG, retrieval-augmented generation. Most of the general-purpose LLMs on the net these days do that, when you ask Copilot or Gemini about stuff it’ll often have footnotes in the response that point to the stuff that it searched up in the background and used as context.
So for a future Stack Overflow LLM replacement, I’d expect the LLM to be backed up by being able to search through relevant documentation and source code.
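To make the RAG idea above concrete, here’s a toy sketch of the retrieval half. The document snippets and names (libfoo, frobnicate) are made up for illustration, and a real system would use embedding similarity and a vector database rather than keyword overlap, but the shape is the same: find the most relevant documents, then stuff them into the prompt so the model can answer from material it was never trained on.

```python
# Toy RAG sketch: score a small document store against a question,
# then prepend the best matches to the prompt sent to the LLM.

def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question.

    Real systems score by embedding similarity; keyword overlap is a
    stand-in that keeps this example self-contained.
    """
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, documents):
    """Prepend retrieved context so the model can cite facts as footnotes."""
    context = retrieve(question, documents)
    sources = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(context))
    return (
        "Use these sources to answer, citing them as footnotes:\n"
        f"{sources}\n\nQuestion: {question}"
    )

# Hypothetical documentation snippets standing in for scraped docs/source.
docs = [
    "The frobnicate() call in libfoo 2.1 deprecated the retries argument.",
    "libfoo is licensed under the MIT license.",
    "To install libfoo, run pip install libfoo.",
]

prompt = build_prompt(
    "What happened to the retries argument in frobnicate()?", docs
)
print(prompt)
```

The footnote markers (`[1]`, `[2]`) are what lets the model point back at its sources, which is exactly the behaviour you see in Copilot or Gemini responses.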
As I said above:
mobs of angry people ignorant of both the technical details and legal issues involved in it.
Emphasis added.
They do not “steal” anything when they train an AI off of something. They don’t even violate copyright when they train an AI off of something, which is what I assume you actually meant when you sloppily and emotively used the word “steal.”
In order to violate copyright you need to distribute a copy of something. Training isn’t doing that. Models don’t “contain” the training material, and neither do the outputs they produce (unless you try really hard to get it to match something specific, in which case you might as well accuse a photocopier manufacturer of being a thief).
Training an AI model involves analyzing information. People are free to analyze information using whatever tools they want to. There is no legal restriction that an author can apply to prevent their work from being analyzed. Similarly, “style” cannot be copyrighted.
A world in which a copyright holder could prohibit you from analyzing their work, or from learning and mimicking their style, would be nothing short of a hellish corporate dystopia. I would say it baffles me how many people are clamoring for this supposedly in the name of “the little guy”, but sadly, it doesn’t. I know how selfish and short-sighted people can be, imagining that they’re owed something for their hard work of shitposting on social media (which they did at the time for free and for fun) now that someone else is making money off of it. There are a bunch of lawsuits currently churning through courts in various jurisdictions claiming that training does violate copyright; let us hope they all get thrown out like the garbage they are, because the implications of their succeeding are terrible.
The world is not all about money. Art is not all about money. It’s disappointing how quickly and easily masses of people started calling for their rights to be taken away in exchange for the sliver of a fraction of a penny that they think they can now somehow extract. The offense they claim to feel over someone else making something valuable out of something that is free. How dare they.
And don’t even get me started about the performative environmental ignorance around the “they’re disintegrating all the water!” and “each image generation could power billions of homes!” nonsense.
It’s a great new technology that unfortunately has become the subject of baying mobs of angry people ignorant of both the technical details and legal issues involved in it.
It has drawn some unwarranted hype, sure. It’s also drawn unwarranted hate. The common refrain of “it’s stealing from artists!” is particularly annoying; it’s just another verse in the never-ending march to further monetize and control every possible scrap of peoples’ thoughts and ideas.
I’m eager to see all the new applications for it unfold, and I hope that the people demanding that it be restricted with draconian new varieties of intellectual property law, or placed solely under the control of gigantic megacorporations, won’t prevail (these are largely the same people, though they often don’t realize it).
This is an area where synthetic data can be useful. For example, you could scrape the documentation and source code for a Python library and then use an existing LLM to generate questions and answers about the content to train future coding assistants on. As long as the training data gets well curated for quality it’s perfectly useful for this kind of thing, no need for an actual forum.
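A minimal sketch of that pipeline, with template questions standing in for the LLM call (a real pipeline would have an existing model write varied questions and answers from each snippet) and a crude length filter standing in for real curation. The doc entries are made-up examples:

```python
# Toy synthetic-data pipeline: turn scraped documentation entries into
# question/answer training pairs, then curate out low-quality pairs.

def make_qa_pairs(doc_entries):
    """doc_entries: list of (identifier, description) scraped from docs.

    A real pipeline would prompt an LLM to generate diverse questions;
    the fixed template here keeps the example self-contained.
    """
    return [
        {"question": f"What does {name} do?", "answer": description}
        for name, description in doc_entries
    ]

def curate(pairs, min_answer_words=4):
    """Drop pairs with too-thin answers.

    Real curation would also deduplicate, check answers against the
    source docs, and spot-check samples by hand.
    """
    return [p for p in pairs if len(p["answer"].split()) >= min_answer_words]

# Hypothetical entries scraped from a library's documentation.
docs = [
    ("json.dumps", "Serializes a Python object to a JSON-formatted string."),
    ("json.loads", "Parses a JSON string into a Python object."),
    ("json.tool", "CLI."),  # too thin; curation should drop it
]

dataset = curate(make_qa_pairs(docs))
print(len(dataset))  # the thin json.tool entry gets filtered out
```

The curation step is the part that matters: synthetic data is only as good as the filtering applied to it, which is why I emphasized that it has to be well curated.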
AI companies have a lot of clever people working for them, they’re aware of these problems.
I don’t see how this comment is related to the content of this article. This is a bunch of information about how LLMs work under the hood, it has nothing to do with how they’re supposedly “sucking up and ingesting whatever’s out there unquestioningly.” I don’t see anything about LLM training mentioned here, it’s about how they function once they have been trained.
I’m a fan of the Machete Order.
There may be some spoilers in that blog post, it’s been a while since I read it, so here it is in summary:
Phantom Menace is omitted because it’s the weakest of the prequel trilogy and everything that happens in it is summarized at the beginning of Attack of the Clones anyway. If you want to be a completionist then watch it between Empire Strikes Back and Attack of the Clones.
There are good reasons for following this order, but it’s hard to describe them without spoiling anything. Basically, Lucas assumed you’d already watched the original trilogy when he made the prequels, so they contain a bunch of spoilers whose surprises the Machete Order preserves quite nicely.
That’s why I blocked her.
There will eventually be enough public domain content that AI will be at the quality it is today with public materials alone.
So, AI will always be ~95 years behind the times?
Except the AIs produced by Disney et al, of course. And those produced by Chinese companies with the CCP stamp of approval. They’ll be up to date.
Many people with positive sentiments towards AI also want that.
If you think death is the answer, the polite thing is not to force everyone to go along with you.
I imagine there’s also an element of “what can we start building right now,” as opposed to waiting a couple of years for R&D before setting up the production lines. A weapon system can be the most wonderful and powerful thing on paper but if you’re under attack you can’t deploy a piece of paper.
It’s also nice that it turns out old American tech is perfectly capable of dominating Russia’s current tech.
They’re probably still waiting to see if they can pin this on Democrats or immigrants in some manner.
Thanks for asking. My comment was off the top of my head based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s a lot of black magic still involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial training of LLMs, to get them to the point where they’re capable of responding coherently to prompts, and synthetic data is more often used for the fine-tuning phase, where LLMs are trained to respond to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.
This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.
IMO this is fine, it’s not really a pension plan’s role to be trying to manipulate what industries are doing well. A pension plan should be primarily focused on getting good long-term returns.
If you want that to not happen then you should focus on policies that make carbon-producing industries not produce good long-term returns in the first place. Then the pension plans and everyone else will stop investing in them as a natural consequence.
If they remain profitable and your pension plan stops investing in them, that just means you’re handing free money to the people who remain willing to invest in them.