
The GitHub Copilot system generates code sophisticated enough that we would consider it copyrightable if a human wrote it.

But the case law around this issue is relatively fresh, afaict. Does Microsoft or GitHub believe they own any copyright on the outputted work?

This is a serious question. I would want to know the answer before I had any code submitted to any projects I ran. I only work on FOSS projects, but if I ran a proprietary software development company, I'd want to know then, too.

@cwebber there's no way my current employer, nor any of my previous employers, would allow editor contexts and retraining information about our internal codebases to be sent to some corporate cloud service for analysis.

@cwebber seems like they really ought to prominently specify the licensing when you set up copilot tbh

especially considering it's github

@cwebber oh actually they do! it's in the "protecting originality" section of the faq

Who owns the code GitHub Copilot helps me write?

GitHub Copilot is a tool, like a compiler or a pen. The suggestions GitHub Copilot generates, and the code you write with its help, belong to you, and you are responsible for it. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.

Do I need to credit GitHub Copilot for helping me write code?

No, the code you create with GitHub Copilot’s help belongs to you. While every friendly robot likes the occasional word of thanks, you are in no way obligated to credit GitHub Copilot. Just like with a compiler, the output of your use of GitHub Copilot belongs to you.

i guess that answers that?

@cwebber there's no section ids or anything on the page, so i can't link directly to the relevant part, but if you go to copilot.github.com/ and search "protecting originality" you'll land in the right spot

@00dani So Microsoft/GitHub believe that then. I wonder if everyone else running ML systems will agree.

That's a disclaimer of sorts I suppose, but a legal waiver would be even better.

@cwebber @00dani

It's rooted in fair use. It's also code where a human writes the prompt. It would make as much sense to claim ownership as a musical instrument maker claiming ownership to the songs you make with it.

@zzz @cwebber @00dani That does hit upon the surprising and confusing case law around songs and sampling, though. What happens when someone finds sufficiently large chunks of what look like their own open source code in another project with incompatible attribution or licensing?

@jaycie @cwebber @00dani

That's where fair use kicks in. It's not a coincidence that Github spends a significant amount of time showing that the chance their algorithm will copy your material wholesale is low.

If, under the hood, it really just searched for open-source code and copied it, that would be clear copyright infringement. I still think that case is not rare enough to brush aside, and they ought to bundle a tool to check.

@shlee @00dani we could skip debating whether or not robots deserve to hold copyrights for their works by abolishing copyright

@cwebber free software licenses depend on copyright

how should that work in practice? every so called intellectual property is public domain?

can we still have copyleft somehow?

microsoft would still not give us their code but use ours without contributing back. so without strong copyleft licenses we seem to be worse off

@davidak Copyright didn't exist on software until 1980. Before then, everything was public domain, yes. That's also when the plague of proprietary software became serious.

If you could go back in time and choose between a timeline where intellectual restrictions laws either did or didn't apply to software altogether, which reality would you choose? Which would be better for user freedom?

@davidak In such a timeline, it would also not be illegal to reverse engineer blobs.

The only thing you'd "lose" is the ability to use copyright to do the copyleft hack of requiring source distribution.

But that almost never happens these days anyway; the cases where copyleft is enforced are rare. And what of things like genetic programming, where there is no distinction between the "blob" and source?

Copyleft is a useful tool to *counteract* copyright in the world we *do* live in.

@cwebber i don't want to defend copyright, but what would we gain in practice?

we would lose copyleft, so google and amazon etc. will use more public code. closed software profits

we could use reverse engineering, but is that such a big advantage? it is legal in some cases

i'm really not sure about this topic, that's why i ask.

@davidak @cwebber

FYI lotsa license discussion on HN too: news.ycombinator.com/item?id=2

(not too much mention about copyleft, occurs twice on 2nd page)

@davidak @cwebber

This HN mention is notable towardsdatascience.com/the-mos

It relates to copyright infringement vs. Google Books where US courts ruled "fair use" was applicable to ML datasets.

The part where Nat Friedman, CEO of GitHub, announces it is interesting: news.ycombinator.com/item?id=2

Friedman: "In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler. [..] We expect that IP and AI will be an interesting policy discussion [..]"

@davidak @cwebber

Reading between the lines..

"An army of our lawyers has thoroughly prepared for the many fights that CoPilot will trigger, and we are ready for them."

@davidak @cwebber

cc @dachary might be some nasty new network effects and FOMO to overcome by #FedeProxy to get people to leave GH if this takes off (and it looks like it will).

@davidak @cwebber @dachary

OTOH maybe mostly in the corporate and OSS side of things, not FOSS depending on pricing and the licensing issues discussed in this thread.

@humanetech I would not take this statement seriously because (i) it is not binding and GitHub/Microsoft can decide on a very different policy 24h later (ii) it is vague.

@davidak Copyright is a system for intellectual restrictions/monopolies based on enforcement through state violence.

So one thing gained by abolishing it (which I think is unlikely sadly, but this is a mental exercise to better consider what it is we value) is being free from that. What is also gained is increased agency/autonomy and the ability to collaborate. Copyright, with very few exceptions, is the crushing opposite of that.

@cwebber @davidak Coming from an end-user empowering perspective, I'd like to have chosen a third way of ensuring transparency for users on "how the sausage's been made". Such a value or right does not need to be protected by a violent state per se, but could also be engraved into the principles of the communities we're (inter)acting in/with.

As long as we're not in some kind of rightwing-libertarian ancap dystopia, people are still acting as collectives or groups with interdependencies.

@cwebber @davidak I mean, it's as much of state violence as any other thing that is related to law enforcement though, right?

@cwebber @davidak abolishing copyright in a world of desktop software would make not publishing source code pointless, since the end product could be freely copied.

In a world of server-side software though, it would mean no AGPL and no source code, so I can see that side of the argument as well.

(Not trying to take sides here, just random thoughts)

@davidak @cwebber There would be no copyrights, but there would still be patents, trade secrets and trademarks.

@tzafrir @davidak @cwebber Trade secrets (only software on servers/private machines to avoid easy RE) and trademarks (marketing) aren't particularly important to software. Software patents were a costly mistake, though. Small to medium firms/projects fight off robber baron trolls while huge multinationals skirt the issue with M.A.D. style patent proliferation.

@davidak @cwebber I like to think about mathematics when trying to imagine a future where software cannot be owned by anyone. Software is more like maths and less like a novel. As soon as a piece of software is published there would be no law preventing reverse engineering, distribution or modification. 😍

@alienghic @cwebber @davidak Nope. It's because software was having a life of its own. Originally software was just something you got with the hardware. Part of the package. But then some folks at Bell Labs wrote a different OS for the PDP-11. And this Harvard student was selling extra software for the Altair. Five years later the US Congress decided to declare software copyrightable because there was business there.

@dpwiz @cwebber @davidak

Krugman is a moderate economist who supports the idea that workers need to be paid in order to have a functioning economy, which puts him on the economic left for the NYT.

And the observation that about 1980 is where things started going bad for the common person is supported by other metrics too.

@cwebber @davidak the period coincides with the rise of disaster capitalism

@cwebber @davidak I have a hard time with this because although I don't like copyright very much, I imagine that in the present day in the no-copyright universe we would all be sitting in front of dumb terminals paying an expensive subscription to use software hosted remotely by IBM.
Probably Free Software would still exist for the same reasons it does now but I fear something less powerful than the Raspberry Pi would be the height of technology for general purpose computers in the home.

@cwebber @davidak That is to say, IMHO abolishing copyright would *have* to come with other legal and social changes to yield any benefit. Assuming no other legal and social changes, I think I prefer the world I live in now, where copyright exists but is usually lightly enforced and can be mostly safely ignored by individuals.
Of course political groups who favour abolishing copyright usually have some pretty good ideas about what other changes would need to come along with that.

@danielcassidy @cwebber @davidak IBM entered the PC market with the mindset that the hardware was the thing of value -- software was an afterthought that they outsourced. Microsoft really "pioneered" the proprietary software model, and that was well into the PC era. So I doubt it would be all dumb terminals, except perhaps to the extent that a browser is the modern equivalent of a dumb terminal.

@cwebber @davidak Point being, without copyright Microsoft wouldn't have gotten a foot in the door, because selling software off-the-shelf would be a non-starter. But nevertheless at some point someone would have figured out that the real value is in software, and without copyright, they'd probably have moved to software-as-a-subscription-service much earlier than happened in our timeline. I was being facetious when I suggested that that someone would be IBM but also it wouldn't surprise me.

@cwebber @davidak There would have been less software for home computers in the 80s because nobody would have any idea how to make money from making software for the home, consequently only really devoted enthusiasts would ever buy home computers, consequently home computers would improve much more slowly, there'd be fewer games to drive development of faster and more capable hardware, and GPUs might never have shown up in the home at all.

@cwebber @davidak To be clear I don't think the above is the *only* possible alternative, I just think if you went back to 1980 and abolished copyright and made no other changes, the above would be the likely result.

@cwebber @davidak (Also, not important, but I would like to point out that Microsoft and others were selling proprietary software in the late 70s, at the very birth of home computers.)

@cwebber @davidak I would want the second, the law that software must always be accompanied by corresponding source.

@cwebber an issue i see is that the "AI" draws from existing codebases, almost all of which are going to have their own licensing, and i think a major concern around copyright is that this thing is just a license violator as a service. it'll give you code that _could_ be nearly identical to code in some existing GPL repo from its training set but you have no idea what the repo is, and definitely won't be following the terms of its license
and there's another question around using people's code to train this model without their consent....

@haskal @cwebber hmm, they claim license violation as a service isn't a major problem? but i don't trust them obviously

Does GitHub Copilot recite code from the training set?

GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions.
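
For anyone wondering what "checking for verbatim snippets" could look like in practice, here is a rough sketch. It is not GitHub's origin tracker (nothing public describes how that works); it just assumes you have a suggestion and some known source files as strings, and compares overlapping windows of tokens. The names `tokens`, `shingles`, `verbatim_overlap`, the window size of 8, and the file name in the usage note are all made up for illustration.

```python
import re


def tokens(code):
    # Crude tokenizer (an assumption, not a real lexer): identifiers and
    # numbers stay whole, every other non-space character is its own token.
    # Whitespace and line breaks are ignored entirely.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(code, n=8):
    # All windows of n consecutive tokens, as a set of tuples.
    # Code shorter than n tokens yields an empty set.
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def verbatim_overlap(suggestion, corpus, n=8):
    # For each corpus file that shares at least one n-token window with the
    # suggestion, report the fraction of the suggestion's windows found there.
    suggested = shingles(suggestion, n)
    if not suggested:
        return {}
    hits = {}
    for name, source in corpus.items():
        shared = suggested & shingles(source, n)
        if shared:
            hits[name] = len(shared) / len(suggested)
    return hits
```

Calling `verbatim_overlap(suggestion, {"some_gpl_repo/util.py": source_text})` returns a dict mapping file names to overlap fractions. A high fraction only says the token sequences match; whether that amounts to infringement or fair use is exactly the legal question this thread is arguing about, and a short or universally common snippet can match without meaning anything.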

@cwebber they trained on a ton of GPL code, by their own admission. I bet a lawyer could argue that it's a derivative work

@noiob @cwebber strange. Microsoft is typically allergic to Random GPL Code in my experience

@nfd @cwebber "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License." docs.github.com/en/github/copi

@noiob @cwebber how spectacularly predictable of a result, given that they already did that, lol

yeah i'm very interested in how crawling code without a fat MIT/Apache/CC0/whatever tag got past low/middle management

@cwebber I predict they'll be able to get the violation rate down to well below that of the average developer adapting examples from stack overflow.

Which will make this a successful machine for laundering the GPL right out of code. With rent collection built right into it because it's going to be a paid service.
