Informing the Innovation Policy Debate: Key Concepts in Copyright Laws for Generative AI
By Julia Yoon and Chris Borges
Since OpenAI released ChatGPT to the public in November 2022, use of and investment in generative artificial intelligence (AI) have exploded. Companies have released dozens of new models that can create text, images, sound, and even video, while investments in generative AI quintupled from $4.3 billion in 2022 to $21.8 billion in 2023. Generative AI companies are generating billions in revenue, while individuals and organizations are employing these novel tools to create millions of new works daily. Accordingly, policymakers, technology and media experts, and industry leaders are exploring how this powerful new technology will impact business and society.
Many observers expect generative AI to disrupt creative industries, including film and publishing, putting the rules framing copyright (one of the primary legal protections for creative works) under intense scrutiny. Indeed, several high-profile lawsuits related to generative AI and copyright are already underway in the United States. Given the nascent state of generative AI technology and the substantial size of the creative economy, estimated at $2 trillion in annual revenue and employing over 50 million people worldwide, the outcome of these lawsuits will have major consequences for innovation, economics, and the future of the AI industry.
The scope and impact of these developments call for greater clarity in burgeoning policy discussions with regard to basic definitions, terms of art, and key issues. What follows is a glossary of terms and concepts to inform the debate.
Definitions
“Generative AI” refers to an AI model capable of producing new content such as text, images, and videos. Generative AI models utilize massive sets of training data to learn basic speech, sound, and visual patterns, which they then reproduce based on user prompts.
Intellectual property (IP) refers to creations of the mind, ranging from literary and artistic works to designs, symbols, and images. IP protections can be divided into four broad types (patents, copyrights, trademarks, and trade secrets) depending on the creation. The purpose of IP protection is to assign lawful property rights to inventors and artists, enabling them to earn recognition and financial benefit for their creations, thereby incentivizing further innovation and creation. The United States operates two separate offices for the registration of IP rights: the U.S. Patent and Trademark Office (USPTO), which oversees the registration of patents and trademarks, and the U.S. Copyright Office, which registers copyrights.
Copyright is a type of IP that specifically protects original works fixed in a tangible form of expression. Unlike patents, which protect inventions and new processes, copyrights are directly tied to a tangible form of artistic, literary, or intellectually created work. For example, original paintings, photographs, musical compositions, books, movies, and computer programs can be protected by copyright once they are fixed in a tangible form. In contrast to patents, which typically last for 20 years, copyright protections can last for many decades. Works created on or after January 1, 1978, are generally protected for the life of the author plus 70 years, while anonymous and pseudonymous works and works made for hire are protected for 95 years from publication or 120 years from creation, whichever is shorter.
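The term rules can be made concrete with a small calculation. The following is a minimal sketch, not legal advice, and it assumes the general U.S. rules for works created on or after January 1, 1978: life of the author plus 70 years for a work by a known individual author, and the shorter of 95 years from publication or 120 years from creation for anonymous works, pseudonymous works, and works made for hire. It also assumes that terms run through the end of the calendar year. The function name and parameters are hypothetical and introduced only for illustration.

```python
from typing import Optional


def last_protected_year(
    *,
    death_year: Optional[int] = None,
    creation_year: Optional[int] = None,
    publication_year: Optional[int] = None,
) -> int:
    """Return the final calendar year of U.S. copyright protection.

    Hypothetical helper for illustration only. It assumes the work was
    created on or after January 1, 1978, and ignores special cases such
    as joint authorship.
    """
    # Known individual author: life of the author plus 70 years.
    if death_year is not None:
        return death_year + 70

    # Anonymous, pseudonymous, or made-for-hire work: the shorter of
    # 95 years from publication or 120 years from creation.
    if creation_year is None:
        raise ValueError("Provide death_year or creation_year.")
    from_creation = creation_year + 120
    if publication_year is None:
        return from_creation
    return min(publication_year + 95, from_creation)


# An author who died in 2000: protected through 2070.
print(last_protected_year(death_year=2000))
# A work made for hire, created in 1990 and published in 1995: through 2090.
print(last_protected_year(creation_year=1990, publication_year=1995))
```

Under these assumptions, the work enters the public domain on January 1 of the year after the one returned.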
Output vs. Input
Issues related to generative AI and copyright can be divided into two distinct categories: a) copyright protection of the data used during the training process and b) copyright protection of works created with generative AI. In other words, the issues center on who owns the rights to the inputs and who owns the rights to the outputs of the generative AI model.
Inputs
Creating a sophisticated generative AI model that responds to user prompts requires massive amounts of data on which to train the model. For instance, the popular model ChatGPT-3.5, a large language model (LLM) that produces text outputs, was trained on over 570GB of filtered text data. This is equivalent to roughly 300 billion words or 1.3 million books, more than three times the amount of text contained in the Library of Congress.
To acquire this data, LLM developers typically scrape text from a variety of online sources including books and articles, some of which may be protected by copyright. Indeed, per OpenAI, ChatGPT’s training process involves downloading and copying publicly accessible data, which the company acknowledges “include copyrighted works.” Accordingly, some argue that using this data for training without the permission of the copyright holder is copyright infringement.
In 2023 alone, at least 13 copyright-related lawsuits were filed against generative AI companies. Most of these cases are class-action suits claiming AI developers unlawfully utilized copyrighted material without permission or appropriate compensation. Plaintiffs include popular authors like John Grisham and A Song of Ice and Fire author George R.R. Martin, famous comedians like Sarah Silverman, and prominent media outlets like The New York Times. In its lawsuit, The New York Times alleges that ChatGPT reproduced portions of its news articles verbatim. As generative AI becomes more popular in the years ahead, the number of AI-related copyright lawsuits is expected to increase.
The defendant AI companies, such as OpenAI, Meta, and Microsoft, have largely responded with assertions that generative AI models create new opportunities for the journalism industry and are therefore a net benefit to society. Further, they affirm that the use of data for training is lawful and falls under fair use, which is supported by a wide range of accepted academic and international precedents.
Fair Use
Fair use is a legal doctrine codified in Section 107 of the Copyright Act that allows the reproduction of copyrighted materials without permission under certain circumstances. As creative works often contain information that serves the public interest, fair use is designed to support the free flow of information and public discourse. For instance, use of a work for purposes like criticism, comment, news reporting, teaching, or research may qualify as fair use and therefore not infringe on copyright. Courts weigh four criteria to determine whether a use is fair.
The first criterion relates to “the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.” That is, how the copyrighted material is being used is an essential component to determining if it is fair use. For instance, a teacher showing a documentary to their students is likely fair use, whereas a theater owner selling tickets to view that same documentary without permission likely would not be considered fair use.
The first criterion also covers “transformative” uses of copyrighted material, which are uses that “add something new, with a further purpose or different character, and do not substitute for the original use of the work.” OpenAI asserts that its use of copyrighted works is transformative because it creates a useful AI tool, and that the use is therefore covered by fair use.
One relevant case highlighting how transformative use may be applied to AI is Authors Guild, Inc. v. Google, Inc. In this case, Google scanned physical books to create a digital database search program, enabling users to type in keywords and locate the books containing them. Google argued that using scanned books in such a manner is fair use because users of the searchable library are shown only a small portion of each book, though over 8,000 authors disagreed and filed a lawsuit against Google for copyright infringement. Ultimately, the court decided in favor of Google. It determined that the search program is a “transformative use” of copyrighted material, and that the search engine enhances public knowledge by offering information about the books without revealing a significant portion of copyrighted texts.
The second criterion relates to the “nature of the copyrighted work,” that is, whether it is more creative or more factual. Facts are not copyrightable, so the use of a non-fiction work such as meeting minutes is more likely to support a fair use claim than the use of a creative work such as a novel.
The third criterion relates to “the amount and substantiality of the portion used in relation to the copyrighted work as a whole.” For instance, re-using a sentence from a book is more likely to be fair use than re-using an entire chapter. Although AI training uses entire works, OpenAI asserts that because the copyrighted works are used only for training and are not made accessible to the public, this use is consistent with the third criterion. Further, OpenAI asserts that the “regurgitation” alleged by The New York Times, in which the AI model outputs a work used for training verbatim, is a rare bug.
The fourth criterion relates to the economic consequences of the use, specifically “the effect of the use upon the potential market for or value of the copyrighted work.” For instance, making copyrighted material publicly available may undermine the value of the original work by reducing incentives to purchase it, and would likely not be fair use. Further, using copyrighted material to directly compete against the original work is also unlikely to be considered fair use. In the case of generative AI, creative workers argue that the outputs of the tools could compete in the same market as their own works, so that their work would be used to eliminate their economic opportunities.
Outputs
Generative AI is employed to create new works of text, images, videos, and other media. When created by a human, these works are typically afforded copyright protection. This raises the question of whether the output of generative AI is copyrightable and, if so, who owns the copyright.
The U.S. Copyright Office explicitly states that it will only register original works “created by a human being,” meaning that human authorship is required for copyright protection. Computer scientist Stephen Thaler tested this requirement for AI in 2018, when he attempted to register a copyright for a work he claimed was “autonomously created” by an AI model he developed. The Copyright Office denied his copyright claim, and he lost his subsequent lawsuit.
One way to interpret this outcome is that because AI models cannot own a copyright, the output of a generative AI model is not copyrightable. However, some observers instead argue that generative AI is simply an advanced and sophisticated tool, like a camera. As photographers receive credit for works produced with their cameras, these observers argue that creating work with AI is a collaborative process where AI serves as a tool to express the author’s creativity. On this view, the output would be protected by copyright, which could be owned, for example, by the user who inputted the prompt.
However, there are decisions suggesting that humans do not exert “sufficient creative control” when using generative AI to create new works. In 2022, for instance, author Kristina Kashtanova attempted to register a copyright for her new graphic novel Zarya of the Dawn, which included AI-generated images. She received a copyright for the portions of the book that she wrote and arranged, but the AI-generated images were denied copyright protection.
While Kashtanova maintains that generative AI is simply a new tool like a camera, the U.S. Copyright Office asserts that human users generally do not have the ultimate creative control over the outcome and therefore cannot receive copyright protection for it. Other commentators distinguish between “works” and “ideas,” arguing that the users of generative AI only contribute ideas. On that view, the users are not the authors of the work, meaning there is no human author, and the work is not copyrightable. However, it is important to also note that the Copyright Office says AI-generated work may be copyrightable under specific conditions if sufficient human authorship can be proven.
Conclusion
The rapid advancement and widespread adoption of generative AI have increased the urgency for exploring its legal and ethical implications for copyright. Amid a surge in generative AI investment and ongoing copyright litigation, the stakes are already high for both creators and innovators navigating this policy environment. Policymakers, industry leaders, and legal experts must work together to strike a balance that fosters creativity, incentivizes innovation, and upholds the principles of fairness and accountability.
Julia Yoon is a research intern with the Renewing American Innovation Project at the Center for Strategic and International Studies (CSIS) in Washington, D.C. Chris Borges is a Program Manager and Associate Fellow with the Geoeconomics Center at CSIS.
This piece was originally published on April 12, 2024, with the Renewing American Innovation (RAI) Project at the Center for Strategic and International Studies (CSIS).