The title of the article is extraordinarily wrong, which makes it clickbait.
There is no “yes to Copilot.”
It is only a formalization of what Linus said before: all AI is fine, but a human is ultimately responsible.
" AI agents cannot use the legally binding “Signed-off-by” tag, requiring instead a new “Assisted-by” tag for transparency"
The only mention of Copilot was this:
“developers using Copilot or ChatGPT can’t genuinely guarantee the provenance of what they are submitting”
This remains a problem the new guidelines don’t resolve: even if AI is used only as a tool and a human reviews the output, the code the LLM produced could still have come from non-GPL sources.
That’s probably why they say “a human is responsible,” not “a human must validate it.” I agree that validation is not always possible, and this problem will only get worse over time.
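For concreteness, a kernel-style commit using the new trailer might look roughly like this (a sketch only; the tag names come from the article, but the subject line, model string, and author are made up):

```
net: fix refcount leak in example_driver_open()

[commit description]

Assisted-by: [LLM name and version]
Signed-off-by: [Human Author] <author@example.com>
```

The point is that only the human’s Signed-off-by carries the Developer Certificate of Origin attestation; Assisted-by would be purely informational.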
It’s a pain in the ass with some of those fucking tech/video/showbiz news outlets, combined with rules in some forums that forbid “editorialized” post titles, even though it’s so tempting to correct the awful titling.
Yeah, that’s also my question. Partially because I am a former-lawyer-turned-software-developer… but, yeah. How are the kernel maintainers supposed to evaluate whether a particular PR contains non-GPL code?
Granted, this was potentially an issue before LLMs too, but nowhere near the scale it will be now.
(In the interests of full disclosure, my legal career had nothing to do with IP law or software licensing - I did public interest law).
They don’t, just like they don’t with human-submitted code. The point of Signed-off-by is that the author attests they have the rights to submit the code.
Which I’m guessing they cannot attest to, if LLMs truly have the 2-10% plagiarism rate that multiple studies seem to claim. It’s an absurd rule, if you ask me. (Not that I would know, I’m not a lawyer.)
Where are you seeing the 2-10% figure?
In my experience, code generation is most affected by the local context (i.e., the codebase you are working on). On top of that, a lot of code is purely mechanical, and code generally has to have a degree of novelty to be protected by copyright.
Imagine how broken it would be otherwise. The first person to write a while loop in any given language would be the owner of it. Anyone else using the same concept would have to write an increasingly convoluted while loop with extra steps.
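To make the “purely mechanical” point concrete (a generic illustration, not taken from the article): any two developers asked to sum a list with a while loop will produce essentially identical code, because there is only one obvious way to write it:

```python
def total(values):
    # Purely mechanical scaffolding: initialize, test, step.
    # There is no creative expression here for copyright to attach to.
    i = 0
    acc = 0
    while i < len(values):
        acc += values[i]
        i += 1
    return acc

print(total([1, 2, 3, 4]))  # → 10
```

Matching output from an LLM on code like this says nothing about provenance; it is simply the only sensible way to express the idea.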
Sounds like an origin story for recursion.
If it’s flagged as “assisted by <LLM>” then it’s easy to identify where that code came from. If a commercial LLM is trained on proprietary code, that’s on the AI company, not on the developer who used the LLM to write code. Unless they can somehow prove that the developer had access to said proprietary code and was able to personally exploit it.
If AI companies are claiming “fair use,” and it holds up in court, then there’s no way in hell open-source developers should be held accountable when closed-source snippets magically appear in AI-assisted code.
Granted, I am not a lawyer, and this is not legal advice. I think it’s better to avoid using AI-written code in general. At most use it to generate boilerplate, and maybe add a layer to security audits (not as a replacement for what’s already being done).
But if an LLM regurgitates closed-source code from its training data, I just can’t see any way how that would be the developer’s fault…
Pretty convenient.
This is how copyleft code gets laundered into closed source programs.
All part of the plan.
How would they launder it? Just declare it their own property because a few lines of code look similar? When there’s no established connection between the developers and anyone who has access to the closed-source code?
That makes no sense. Please tell me that wouldn’t hold up in court.
I believe what they’re referring to is the training of models on open source code, which is then used to generate closed source code.
The break in the connection you mention means it isn’t legally infringement, but now code derived from open source is closed source.
Because of the untested nature of the situation, it’s unclear how it would unfold, likely hinging on how the request was formed.
We have similar precedent with reverse engineering, but a non-sentient tool doing it makes things complicated.
First tell us how much money you have. Then we’ll be able to predict whether the courts will find in your favor or not
Sad but true…
First of all, who is going to discover the closed-source use of GPL code and file a lawsuit anyway?
Second, the LLM ingests the code and then spits it back out, with maybe a few changes. That is how it benefits from copyleft code while stripping the license.
Maybe a human could do the same thing, but it would take much longer.
Wait, did you just move the goalposts? I thought the issue we were talking about was open-source developers who use LLM-generated code and unwittingly commit changes that contain allegedly closed-source snippets from the LLM’s training data.
Now you want to talk about LLM training data that uses open-source code, and then closed-source developers commit changes that contain snippets of GPL code? That’s fine. It’s a change of topic, but we can talk about that too.
Just don’t expect what I said before about the previous topic of discussion to apply to the new topic. If we’re talking about something different now, I get to say different things. That’s how it works.
I was responding specifically to this part
showing what would happen when the LLM regurgitates open-source code into closed-source projects.
Sorry if you didn’t like that.
I get why they’re letting this slide, though, since you don’t know the provenance of that Stack Overflow snippet, either.
Yup.
I would also just point out that this doesn’t change the Linux kernel’s legal exposure to infringing submissions; that existed before the advent of LLMs, too.
That’s fundamentally not how LLMs work; an LLM is not a database of code snippets.
“Derivative works”