Hi @ell1e, thanks for pointing this out. We worked on a set of tools that should allow users to properly attribute sources if the model generated verbatim copies from the training dataset:
I assume this would need to be integrated directly into the usual query mechanism for people to actually use it regularly. Is that the case with the current autocomplete plugins, or wherever this is commonly used?
Also, to my knowledge (and again, I'm not a lawyer), copyright law doesn't apply only to verbatim copies but to any notable derivative, as long as it is still somewhat "clearly" related in the eyes of a normal human, and/or is more loosely derived but could still be considered a substitute, or something like that. (Don't ask me how the exact rules work, but I don't think it's just verbatim copies.) How do you deal with that?
As long as there is no good answer to this that works out of the box, in a somewhat reliable way, for most actual users of StarCoder, I suggest you keep honoring opt-out requests, and do so more aggressively.
I'd just like you to know that code under permissive licenses with attribution requirements is possibly unsuitable for training set inclusion. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own counsel. However, attribution requirements usually mean that derivatives must retain attribution to the original author. LLMs are apparently well known to occasionally spit out exact derivatives, but without satisfying attribution requirements, which suggests this practice could be illegal.
I therefore request that you, at the very least, process opt-out requests retroactively for pre-existing datasets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.
For example, it appears you included repositories of mine that have attribution requirements:
I don't understand how StarCoder would possibly satisfy them.