The GitHub Copilot system generates sophisticated enough code that we would consider it copyrightable if a human wrote it.
But the case law around this issue is relatively fresh, afaict. Does Microsoft or GitHub believe they own any copyright on the outputted work?
@cwebber there's no way my current employer, nor any of my previous employers, would allow editor contexts and retraining information about our internal codebases to be sent to some corporate cloud service for analysis.
@cwebber seems like they really ought to prominently specify the licensing when you set up copilot tbh
especially considering it's github
@cwebber oh actually they do! it's in the "protecting originality" section of the faq
Who owns the code GitHub Copilot helps me write?
GitHub Copilot is a tool, like a compiler or a pen. The suggestions GitHub Copilot generates, and the code you write with its help, belong to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.
Do I need to credit GitHub Copilot for helping me write code?
No, the code you create with GitHub Copilot’s help belongs to you. While every friendly robot likes the occasional word of thanks, you are in no way obligated to credit GitHub Copilot. Just like with a compiler, the output of your use of GitHub Copilot belongs to you.
i guess that answers that?
@00dani So Microsoft/GitHub believe that then. I wonder if everyone else running ML systems will agree.
That's a disclaimer of sorts I suppose, but a legal waiver would be even better.
That's where fair use kicks in. It's not a coincidence that GitHub spends a significant amount of time showing that the chance their algorithm will copy your material wholesale is low.
If under the hood, it really just searched for open-source code and copied it that would be clear copyright infringement. I still think that case is not rare enough to brush aside, and they ought to bundle a tool to check.
@cwebber free software licenses depend on copyright
how should that work in practice? every so called intellectual property is public domain?
can we still have copyleft somehow?
microsoft would still not give us their code but use ours without contributing back. so without strong copyleft licenses we seem to be worse off
@davidak Copyright didn't exist on software until 1980. Before then, everything was public domain, yes. That's also when the plague of proprietary software became serious.
If you could go back in time and choose between a timeline where intellectual restrictions laws either did or didn't apply to software altogether, which reality would you choose? Which would be better for user freedom?
@davidak In such a timeline, it would also not be illegal to reverse engineer blobs.
The only thing you'd "lose" is the ability to use copyright to do the copyleft hack of requiring source distribution.
But that almost never happens these days anyway; the cases where copyleft is enforced are rare. And what of things like genetic programming, where there is no distinction between the "blob" and source?
Copyleft is a useful tool to *counteract* copyright in the world we *do* live in.
@cwebber i don't want to defend copyright, but what would we gain in practice?
we would lose copyleft, so google and amazon etc. will use more public code. closed software profits
we could use reverse engineering, but is that such a big advantage? it is legal in some cases
i'm really not sure about this topic, that's why i ask.
It relates to the copyright infringement case against Google Books (Authors Guild v. Google), where US courts ruled that "fair use" applied; the argument is that the same reasoning covers ML datasets.
The part where Nat Friedman, CEO of GitHub, weighs in is interesting: https://news.ycombinator.com/item?id=27676939
Friedman: "In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler. [..] We expect that IP and AI will be an interesting policy discussion [..]"
@humanetech I would not take this statement seriously because (i) it is not binding and GitHub/Microsoft can decide on a very different policy 24h later (ii) it is vague.
@davidak Copyright is a system for intellectual restrictions/monopolies based on enforcement through state violence.
So one thing gained by abolishing it (which I think is unlikely sadly, but this is a mental exercise to better consider what it is we value) is being free from that. What is also gained is increased agency/autonomy and the ability to collaborate. Copyright, with very few exceptions, is the crushing opposite of that.
@cwebber @davidak Coming from an end-user empowering perspective, I'd like to have chosen a third way of ensuring transparency for users on "how the sausage's been made". Such a value or right does not need to be protected by a violent state per se, but could also be engraved into the principles of the communities we're (inter)acting in/with.
As long as we're not in some kind of rightwing-libertarian ancap dystopia, people are still acting as collectives or groups with interdependencies.
In a world of server-side software though, it would mean no AGPL and no source code, so I can see that side of the argument as well.
(Not trying to take sides here, just random thoughts)
@tzafrir @davidak @cwebber Trade secrets (only software on servers/private machines to avoid easy RE) and trademarks (marketing) aren't particularly important to software. Software patents were a costly mistake, though. Small to medium firms/projects fight off robber baron trolls while huge multinationals skirt the issue with M.A.D. style patent proliferation.
@davidak @cwebber I like to think about mathematics when trying to imagine a future where software cannot be owned by anyone. Software is more like maths and less like a novel. As soon as a piece of software is published there would be no law preventing reverse engineering, distribution or modification. 😍
Wait that was 1980 too?
Because that's when a bunch of things started going rotten here in America....
@alienghic @cwebber @davidak Nope. It's because software was having a life of its own. Originally software was just something you got with the hardware. Part of the package. But then some folks at Bell Labs wrote a different OS for the PDP-11. And this Harvard student was selling extra software for the Altair. Five years later the US Congress decided to declare software copyrightable because there was business there.
Krugman is a moderate economist who supports the idea that workers need to be paid in order to have a functioning economy, which puts him on the economic left for the NYT.
And the observation that about 1980 is where things started going bad for the common person is supported by other metrics too.
@cwebber @davidak I have a hard time with this because although I don't like copyright very much, I imagine that in the present day in the no-copyright universe we would all be sitting in front of dumb terminals paying an expensive subscription to use software hosted remotely by IBM.
Probably Free Software would still exist for the same reasons it does now but I fear something less powerful than the Raspberry Pi would be the height of technology for general purpose computers in the home.
@cwebber @davidak That is to say, IMHO abolishing copyright would *have* to come with other legal and social changes to yield any benefit. Assuming no other legal and social changes, I think I prefer the world I live in now, where copyright exists but is usually lightly enforced and can be mostly safely ignored by individuals.
Of course political groups who favour abolishing copyright usually have some pretty good ideas about what other changes would need to come along with that.
@danielcassidy @cwebber @davidak IBM entered the PC market with the mindset that the hardware was the thing of value -- software was an afterthought that they outsourced. Microsoft really "pioneered" the proprietary software model, and that was well into the PC era. So I doubt it would be all dumb terminals, except perhaps to the extent that a browser is the modern equivalent of a dumb terminal.
@cwebber @davidak Point being, without copyright Microsoft wouldn't have gotten a foot in the door, because selling software off-the-shelf would be a non-starter. But nevertheless at some point someone would have figured out that the real value is in software, and without copyright, they'd probably have moved to software-as-a-subscription-service much earlier than happened in our timeline. I was being facetious when I suggested that that someone would be IBM but also it wouldn't surprise me.
@cwebber @davidak There would have been less software for home computers in the 80s because nobody would have any idea how to make money from making software for the home, consequently only really devoted enthusiasts would ever buy home computers, consequently home computers would improve much more slowly, there'd be fewer games to drive development of faster and more capable hardware, and GPUs might never have shown up in the home at all.
@cwebber an issue i see is that the "AI" draws from existing codebases, almost all of which are going to have their own licensing, and i think a major concern around copyright is that this thing is just a license violator as a service. it'll give you code that _could_ be nearly identical to code in some existing GPL repo from its training set but you have no idea what the repo is, and definitely won't be following the terms of its license
and there's another question around using people's code to train this model without their consent....
Does GitHub Copilot recite code from the training set?
GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions.
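For anyone wondering what an "origin tracker" could even look like: the FAQ does not say how GitHub plans to build theirs, so the following is only a minimal sketch of one common approach, exact matching on runs of tokens (shingles). Every name here (`tokenize`, `build_index`, `find_origins`, the `gpl_repo/util.py` entry) is hypothetical, and the window size of 8 tokens is an arbitrary assumption, not a figure from GitHub.

```python
import re


def tokenize(code):
    """Split source text into identifiers, numbers, and single punctuation
    marks, so that whitespace differences do not hide a match."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus, n=8):
    """Map every n-token run in the corpus to the files containing it.

    `corpus` is a dict of {file name: source text}; a stand-in (assumption)
    for whatever the real training set looks like.
    """
    index = {}
    for name, text in corpus.items():
        for run in shingles(tokenize(text), n):
            index.setdefault(run, set()).add(name)
    return index


def find_origins(suggestion, index, n=8):
    """Return the corpus files sharing at least one n-token run with the
    suggestion. An empty set means no verbatim overlap was found."""
    hits = set()
    for run in shingles(tokenize(suggestion), n):
        hits |= index.get(run, set())
    return hits


# Hypothetical one-file "training set" and a suggestion that differs
# from it only in whitespace.
corpus = {"gpl_repo/util.py": "def add(a, b):\n    return a + b\n"}
index = build_index(corpus)
print(find_origins("def add(a,b): return a+b", index))
```

A check like this only catches verbatim token runs. Renamed variables or reordered statements slip straight through, so it says nothing about whether a suggestion is a derivative work in the legal sense.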
@cwebber they trained on a ton of GPL code, by their own admission. I bet a lawyer could argue that it's a derivative work
@nfd @cwebber "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License." https://docs.github.com/en/github/copilot/research-recitation
@cwebber I predict they'll be able to get the violation rate down to well below that of the average developer adapting examples from stack overflow.
Which will make this a successful machine for laundering the GPL right out of code. With rent collection built right into it because it's going to be a paid service.