



So what does the signed-off-by magically solve here that doesn’t require either you or the contributor to legally review every LLM-generated line? If you’re not a lawyer, is your contributor going to be one?


So do you want to legally review every LLM-generated line to see if it meets the fair-use criteria, since you have to assume it was probably stolen? And would you do this for a known plagiarizing human contributor too…?


I agree. However, I think the natural conclusion is an LLM ban. See also here.


If you had a contributor that plagiarized at a 2-10% rate, would you really go “eh, it has to have a degree of novelty to be a problem” rather than just ban them? The different standards baffle me sometimes.
You can find various rates mentioned here: https://dl.acm.org/doi/10.1145/3543507.3583199 and here: https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/


Which I’m guessing they cannot attest to, if LLMs truly have the 2-10% plagiarism rate that multiple studies seem to claim. It’s an absurd rule, if you ask me. (Not that I would know, I’m not a lawyer.)


Would you also say that to this lawyer reviewing Copilot in 2026? https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567
Disclaimer: this isn’t legal advice.


If the accountability cannot be practically fulfilled, the reasonable policy becomes a ban.
What good is it to say “oh yeah you can submit LLM code, if you agree to be sued for it later instead of us”? I’m not a lawyer and this isn’t legal advice, but sometimes I feel like that’s what the Linux Foundation policy says.


If you would have written it yourself the same way, why not write it yourself? (And there was autocomplete before the age of LLMs, anyway.)
The big problems start with situations where it doesn’t match what you would have written, but rather what somebody else has written, character by character.


It’s less extremist if you look at how easily these LLMs will just plagiarize 1:1, apparently:
https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567
Some see “AI slop” as only the output whose problems they can identify right away.
Many others see “AI slop” as bringing many more problems beyond the immediately visible ones. From that view, it becomes difficult to see LLM output as anything but slop.


Whatever it is, it doesn’t mean LLMs are a sane or “inevitable” answer.


Ultimately, the policy legally anchors every single line of AI-generated code
How would that even be possible, given the state of things?
https://dl.acm.org/doi/10.1145/3543507.3583199
Our results suggest that […] three types of plagiarism widely exist in LMs beyond memorization, […] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, […] Plagiarized content can also contain individuals’ personal and sensitive information.
https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/
Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books. […] This phenomenon has been called “memorization,” and AI companies have long denied that it happens on a large scale. […] The Stanford study proves that there are such copies in AI models, and it is just the latest of several studies to do so.
The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, […] the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining
https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7
In this work we explored the relationship between discourse quality and memorization for LLMs. We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.
https://arxiv.org/abs/2601.02671
recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures […]. We investigate this question […] our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
How does merely tagging the apparently stolen content make it any less problematic? I’m guessing it still won’t carry any attribution of the actual source (which, for all we know, might often even be GPL-incompatible).
I’m not a lawyer, so I guess what do I know. But even from a non-legal angle, what is this road the Linux Foundation seems to be embracing, of just ignoring the licenses of the projects the code came from? Why even have the kernel be GPL then, rather than CC0?
I don’t get it, and the article calling this “pragmatism” seems absurd to me.


I don’t think we’ve been reading the same link. In any case, I don’t think this conversation is going in a useful direction, so I’ll part ways here.


Point taken, and I appreciate the correction, but it still seems to include e.g. all URLs, which could leak all your search queries and allow other rather invasive conclusions. If anything, this makes me feel like it confirms Mozilla does sell data it shouldn’t. I’m not trying to impose my personal conclusions on others, however.


The terms of use link goes nowhere, so I honestly don’t know.
But I feel like it doesn’t matter whether it is, for the sake of discussing where gen AI seems to be leading FOSS…
Lemmy currently seems to consider embracing it too, sadly. Feels potentially short-sighted to me, idk.


Yeah, I find it scary how many projects embrace gen AI despite all the training data controversies. I’ve tried to convince some not to, but it’s hard. Even the Linux kernel appears to be using it now. It’s sad.


I suppose, but then the “training data plagiarism” moral question remains. The rate seems to be around 2-5% for what they can pin down: https://dl.acm.org/doi/10.1145/3543507.3583199 I’m guessing that means the actual hidden rate might be higher… and then there are the high-profile incidents.


I assumed it was genuine because people seem to be doing this for real: https://www.theregister.com/2026/03/06/ai_kills_software_licensing/
And outside of that supposed “clean room” AI trend, all the gen-AI coders seem to be ignoring that AI apparently plagiarizes its training data too. And it seems to happen randomly and unpredictably, even for users you would think are experts, like Microslop themselves: https://www.pcgamer.com/software/ai/microsoft-uses-plagiarized-ai-slop-flowchart-to-explain-how-github-works-removes-it-after-original-creator-calls-it-out-careless-blatantly-amateuristic-and-lacking-any-ambition-to-put-it-gently/


Seems like it might be satire after all (https://malus.sh/blog.html); will update my post in a second. But the trend seems to be real, with others discussing the effects on the GPL too: https://writings.hongminhee.org/2026/03/legal-vs-legitimate/


Yeah, or the leaked Windows XP source code. Every day, any use of gen-AI code feels more to me like license laundering, if anything then of the training data that was involved. And I mean this purely as a gut feeling; I have no idea what a court would say. But it feels wrong.


I don’t have much more to say, other than that I doubt the data backs up what you’re saying at all.