When AI Writes Your Code, Who Really Wrote It?

Reading Time: 6 minutes.

GitHub (owned by Microsoft) and OpenAI launched a new tool last year that can generate code based on existing committed code. Software developers use services like GitHub, GitLab, or Bitbucket to manage programming projects. If you’re a software engineer, you’re likely familiar with one, if not all three. GitHub is perhaps the most popular of the three. Or, it was. After Microsoft bought the company, sentiment among a number of programmers shifted. People feared that Microsoft, with its heavy investment in AI, could use the code you write to sell something.

With “Copilot,” an autocomplete tool from GitHub and OpenAI, that’s exactly what happened.

Copilot works from within editors like Microsoft’s Visual Studio Code to help software developers complete their code as they write it, much like the autocomplete that helps you type words more quickly on a smartphone. However, it’s not without its issues. Even at release, Copilot’s website admitted that the software could “sometimes produce undesired outputs, including biased, discriminatory, abusive, or offensive outputs.” Copilot could also suggest what looks like private information submitted to GitHub, which, by the way, is just another reason why you should never hard-code or commit email addresses, keys, or other secret or private strings.
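As a quick illustration of that last point, here’s the difference in Python. The key name and value are hypothetical, a sketch of the practice rather than any real service:

    import os

    # Bad: a hard-coded secret committed to the repository. Once pushed,
    # it lives in the project's history and in anything trained on it.
    API_KEY = "sk-live-0000-example"  # hypothetical value; never do this

    # Better: read the secret from the environment at runtime, so it
    # never appears in committed source code.
    API_KEY = os.environ.get("MY_SERVICE_API_KEY")  # hypothetical variable name

Secrets scanners can catch some of these mistakes, but keeping secrets out of source files in the first place means there’s nothing for a tool like Copilot to memorize.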

Critics had already taken issue with the potential invasions of privacy and the bias that AI introduces everywhere it goes. However, Software Freedom Conservancy and other open source advocates have spoken out about another obvious problem: Copilot’s creators “borrowed” code, without permission, to train their models. That potentially breaks some licenses.

Stealing Code?

Software Freedom Conservancy (SFC) is a non-profit group of open source software advocates. Open source software is software that shares its source code, so others can review it or, under certain licenses, re-use it in their own projects. Open source software is at the heart of many of the technologies you use and love. SFC announced they’ve withdrawn from GitHub and encouraged other software engineers who care about open source technology to do the same. According to them, Microsoft trained Copilot on open source code without the users’ permission, potentially ignoring licensing. Because Copilot produces snippets derived from open source software without following the terms of the licenses that cover it, some consider this unlicensed use of someone else’s code. In other words: stealing code.

At its core, this looks like stealing code, but it’s also a murky area. If you were writing a research paper, you would certainly look through multiple sources to compile information, but you would name those sources, and any direct quotes would be attributed. If you sample an artist’s song for a song of your own, you give them credit, and maybe even pay for it. Code has similar protections. The license is often included at the top of every open source code file, and it usually says people are allowed to view, use, and edit the code, as long as it’s attributed to the author and no changes are made to the license. Microsoft’s Copilot for GitHub doesn’t do any of that. It chops up code, figures out generally how it’s used by looking for the same patterns across other repositories, and generates code as output. But that output may come, possibly quite closely or even verbatim, from licensed code. Those licenses aren’t being shared, and no attribution is being given.
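For illustration, a typical attribution-style header might look something like this. The author and project names here are made up:

    # Copyright (c) 2022 Jane Developer
    #
    # This file is part of ExampleLib, distributed under the MIT License.
    # You may view, use, and modify this code, provided the above copyright
    # notice and this permission notice are kept intact. See the LICENSE
    # file for the full license text.

A snippet generated from that file carries none of this with it, which is exactly the complaint.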

Microsoft compares Copilot to a compiler, but that comparison seems dishonest. A compiler translates the code we write into machine code: a deterministic, 1:1 transformation, and one we apply to our own code. Copilot is more like someone copying code after reading it in multiple places and deciding to drop any attribution because they saw something similar elsewhere. Even if code is used elsewhere, it’s still licensed code. Furthermore, and ironically, because Copilot isn’t open source, we can’t be sure exactly how it’s using these code samples. This is a murky legal area, but some say it’s clear: this is theft.

Bad Code?

If you ask two people to describe the same view of a park, they may use similar words or metaphors to describe the scene before them, but they will both have unique descriptions. Perhaps one person will have better distance vision, and can describe the ducks sitting across the lake. The other person may not have seasonal allergies and can describe the fragrant flowers nearby. You get two very different perspectives that paint very similar pictures.

Coding isn’t an exact science. Two people can write code that performs the same task, perhaps even with the same efficiency, but in very different ways. That’s not just a matter of which language they’re using, either. It could be their style, their choice of libraries, or whether they use recursion or other techniques. The truth is, coding has been accurately described as a bit of an art. Relying on an AI isn’t always going to produce the best solution. It’ll just produce the most popular, or perhaps the most average, take, not the best one.
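To make that concrete, here’s a trivial sketch of the same task solved two ways. Both functions are hypothetical examples, not anyone’s production code:

    def sum_to_iterative(n: int) -> int:
        """Sum the integers 1 through n with a simple loop."""
        total = 0
        for i in range(1, n + 1):
            total += i
        return total

    def sum_to_recursive(n: int) -> int:
        """Sum the integers 1 through n by recursing on n - 1."""
        return 0 if n == 0 else n + sum_to_recursive(n - 1)

    # Same result, very different styles.
    assert sum_to_iterative(5) == sum_to_recursive(5) == 15

An AI trained on public code will tend toward whichever version appears most often, not whichever fits your project best. In fact, even GitHub admits it’s not perfect.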

“Can GitHub Copilot introduce insecure code in its suggestions?”

“Public code may contain insecure coding patterns, bugs, or references to outdated APIs or idioms.”

– from GitHub’s FAQ on Copilot

That last little note should worry engineers. Junior engineers may not fully understand the code they’re implementing, and even more experienced engineers may trust the code a little too much because, “it’s from GitHub!” But the truth is, the code requires as much scrutiny as code written by a human, if not more. In the long run, is it worth it? If Copilot is mostly replacing boilerplate code, and you have to scrutinize its output carefully, does it actually save an engineer time?
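Here’s the kind of insecure pattern that is all over public code, and that a tool trained on public code could plausibly reproduce. This is a sketch of the general problem, not an actual Copilot suggestion:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")

    def find_user_unsafe(name: str):
        # Insecure: interpolating user input into SQL invites injection,
        # e.g. name = "x' OR '1'='1" returns every row in the table.
        return conn.execute(
            f"SELECT * FROM users WHERE name = '{name}'"
        ).fetchall()

    def find_user_safe(name: str):
        # Safer: a parameterized query keeps the data separate from the SQL.
        return conn.execute(
            "SELECT * FROM users WHERE name = ?", (name,)
        ).fetchall()

If an autocomplete tool offers you the first version because it’s common, you still have to know enough to reject it.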

GiveUpGitHub.org

SFC had questions for GitHub when the feature was first introduced last year. Paraphrased (full versions here), they asked where Microsoft got the idea that all open source software is fair use, regardless of license. They followed up by asking: if licensing doesn’t matter for AI-based code generation, why not train on Microsoft’s own code, like that of Windows or Microsoft Office? Finally, they asked for a list of the repositories used, including their licenses. These are all reasonable asks, and they could precede legal action if license holders truly believe Microsoft violated their copyright.

When SFC asked GitHub these questions last year, the company told them answers would be coming. However, after a year of pressure, the only response SFC received was that no answers would be given. That refusal is what spurred SFC to suggest all open source projects leave Microsoft’s GitHub. Since GitHub cannot guarantee that Copilot will respect licenses or privacy, and refuses to elaborate, it would be foolish to trust them with your code if you care about how it’s used. Copilot relies on the very open source software that some feel GitHub is exploiting; without open source code to pull suggestions from, the feature would wither. That’s what GiveUpGitHub.org is about. It helps engineers move their code off of Microsoft’s platform and onto one where they can better protect the integrity of their code.

Copilot also requires a liberal reading of open source software licenses. It may violate a number of them, though that hasn’t been tested in court. Still, if engineers wanted to, they could adopt a new license that specifically targets code generation based on datasets that include their code. However, because Microsoft hasn’t shared the details of its process, it could be difficult to prove GitHub used their code without Microsoft’s cooperation.

AI Introducing Old Problems in New Ways

GitHub’s Copilot was in a closed beta last year. In June of 2022, GitHub made it generally available. Questions remain. Did Microsoft and OpenAI sort out the issues with what appears to be (but likely isn’t) personal data in code snippets? Did they remove abusive, offensive, or discriminatory language? Are there any protections for licensing? Will they give proper credit to the open source projects they’re pulling code from? These are questions that companies, especially for-profit companies, will want answered. If a court were to decide that Microsoft’s practices here amount to plagiarism, the companies using code from GitHub’s Copilot could be in trouble as well.

Copilot has been described as “an AI pair programmer” by GitHub. Pair programming is “two engineers, one keyboard.” It’s literally having someone look over your shoulder as you work. Sometimes it’s helpful, like when brainstorming through tricky bugs. But, honestly? Since I was introduced to the concept, I’ve always hated it. Doing it with AI doesn’t make it any better. In fact, it may make it worse: I can trust that a coworker wouldn’t plagiarize something right in front of me, but I can’t be sure that what Copilot suggests won’t be considered plagiarism one day, and if it is, I won’t know where that code ended up or how to replace it.

GitHub has been embroiled in controversy ever since Microsoft took over. The acquisition itself was controversial. There have been harassment allegations, employee demands after GitHub signed a contract with U.S. Immigration and Customs Enforcement (ICE), and the firing of an employee for calling the traitors who stormed the Capitol in 2021 “Nazis,” some of whom were, in fact, wearing Nazi slogans and iconography. Now they’ve upset the open source community.

Most of my code in the past few years has been on GitLab. I wonder why.


Sources/Further Reading: