This is how copyleft code gets laundered into closed source programs.
All part of the plan.
How would they launder it? Just declare it their own property because a few lines of code look similar? When there’s no established connection between the developers and anyone who has access to the closed-source code?
That makes no sense. Please tell me that wouldn’t hold up in court.
I believe they’re referring to models being trained on open-source code and then used to generate closed-source code.
The break in connection you mention is exactly why it may not legally count as infringement, yet the result is that code derived from open source ends up closed source.
Because the situation is legally untested, it’s unclear how it would unfold; the outcome would likely hinge on how the prompt was phrased.
We have some precedent from reverse engineering, but the fact that a non-sentient tool is doing the work complicates matters.
That makes sense. I see the problem with that, and I don’t have a good solution for it. It is a divergence of topic though, as we were discussing open-source programmers using LLMs which are potentially trained on closed-source code.
Training LLMs on open-source code is worth its own discussion, but I don’t see how it fits in this thread. The post isn’t about closed-source programmers using LLMs.
Besides, closed-source code developers could’ve been stealing open-source code all along. They don’t really need AI to do that.
Still, training LLMs on open-source code is a questionable practice for exactly that reason, particularly when commercial models are trained on GPL code. But it’s probably hard to prove which code ended up in their datasets, since the datasets themselves are closed.
First, tell us how much money you have. Then we’ll be able to predict whether the courts will find in your favor or not.
Sad but true…
First of all, who is going to discover the closed-source use of GPL code and file a lawsuit anyway?
Second, the LLM ingests the code and then spits it back out, with maybe a few changes. That is how it benefits from copyleft code while stripping the license.
A human could do the same thing, but it would take much longer.
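To make the "few changes" point concrete, here’s a toy sketch (the snippets are hypothetical, not real training data): if you reduce both functions to their token structure, ignoring identifier names, a trivially renamed regurgitation is indistinguishable from the copyleft original, even though a plain text diff would show them as different.

```python
import io
import tokenize

def shape(src: str) -> list:
    """Reduce source to a structural fingerprint: keep operators and
    literals, but replace every identifier/keyword with NAME so that
    simple renames don't change the fingerprint."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NAME:
            out.append("NAME")
        elif tok.type in (tokenize.OP, tokenize.NUMBER, tokenize.STRING):
            out.append(tok.string)
    return out

# Hypothetical copyleft original:
gpl_original = "def clamp(value, low, high):\n    return max(low, min(value, high))\n"
# Hypothetical model output: same logic, identifiers renamed, license notice gone.
model_output = "def clip(v, lo, hi):\n    return max(lo, min(v, hi))\n"

print(gpl_original == model_output)              # False: a text diff sees two files
print(shape(gpl_original) == shape(model_output))  # True: identical structure
```

Of course, a short utility like this could be independently reinvented; the licensing question only bites for longer, distinctive code, which is also exactly where provenance is hardest to prove from the outside.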
Wait, did you just move the goalposts? I thought the issue we were talking about was open-source developers who use LLM-generated code and unwittingly commit changes that contain allegedly closed-source snippets from the LLM’s training data.
Now you want to talk about LLM training data that uses open-source code, and then closed-source developers commit changes that contain snippets of GPL code? That’s fine. It’s a change of topic, but we can talk about that too.
Just don’t expect what I said before about the previous topic of discussion to apply to the new topic. If we’re talking about something different now, I get to say different things. That’s how it works.
I was responding specifically to this part
showing what would happen when the LLM regurgitates open-source code into closed-source projects.
Sorry if you didn’t like that.
But you flipped the situation, making it an entirely different discussion, and then you went on as if you thought my previous point was still supposed to apply to the new topic that you introduced.
It’s not that I don’t like it; we can talk about the issues with training commercial LLMs on GPL code. It was just an unannounced change of topic, as if you were trying to score points: you brought up something irrelevant and pretended I was arguing against it, which I wasn’t.
Corporations have been able to steal open-source code without the help of AI, and the same issues arise due to lack of transparency. It’s a problem, sure, but it wasn’t the problem we were discussing. Acting like I’m somehow arguing against it being a problem is a strawman, because that’s not what I was responding to.