Thoughts on AI #1 – Is AI Training theft?

WARNING: THIS IS LONG AND BORING AND FULL OF TECH STUFF (Well, not that much actually.)

There’s been a lot of discussion lately on the topic of what’s called generative AI, which is the term du jour for artificial intelligence software creating “art”, e.g. fiction, visual arts, animation, video and whatnot. (Technically, it would include non-art such as business diagrams, legal reports, etc. But what is art, anyway, man?)

If you’re in any way aware of the dialogue within the arts world, you’ll notice there’s a lot of calm, reasoned debate going on wherein people sagely present their suppositions and give careful consideration to the arguments of their critics.

Umm, yeah, we call that “sarcasm”, folks (something I’m pretty sure AI is still incapable of). In reality, people seem to absolutely lose their shit about this topic, often filling discussion groups with spittle-flecked arguments and angry denunciations of their opponents. (I’ll concede I’ve done some of this myself.)

As a result of this uproar, I’ve decided to do a series of posts on my thoughts on AI, if only to coalesce my ideas on the topic.

Before I do, I will issue a clear and concise blanket statement: I think generative AI is a good thing… mostly, kinda, sorta, with several caveats.

What are those caveats? Well, I’ll address them in these individual posts. I haven’t planned everything out, but I’ll likely cover claims such as:

AI training is theft

AI is stealing jobs

AI is not writing

AI can write (readable, entertaining) novels

And so much more!

A couple of points of note:

I’m mainly going to focus on generative AI as it relates to writing fiction, but I may refer to other art forms.

I am not an expert on these technologies, but nor am I a complete ignoramus. My claims are based on my reading on the topics of neuroscience and machine learning.

Wherever you sit in this debate, some of my opinions may surprise you.

So let’s get to today’s topic. (Oh, one other thing. I can’t predict the speed with which I’ll publish these; it could be weeks until I follow each one up.)

A frequent claim is that AI is stealing/plagiarizing from existing fiction. It’s often alleged that OpenAI’s ChatGPT (one of the big AI tools) swallowed up a huge corpus of text and now spits out regurgitated chunks whenever prompted. The New York Times even has an ongoing lawsuit alleging ChatGPT unfairly used its articles as training materials. (I don’t think there’s any doubt the material was used; the question is whether it was unfair. This Guardian article notes:

“Courts have not yet addressed the key question of whether AI training qualifies as fair use under copyright law.”)

I think to really address this, we need to step back and take a detailed look at how AI works.

There’s only one problem.

I don’t know how AI works.

Nobody really does. It’s considered a “black box” technology, which means how it does its magic is unclear. This is a byproduct of using neural networks, a fascinating design approach that is worth reading about but that I’m not going to describe here.

But, despite my ignorance of the details, I think my general understanding of how AI is trained will work for our purposes. Let me break down the steps for AI to “learn” something.

INPUT: First, AI is fed a huge amount of data. Maybe it’s weather patterns, maybe it’s MRIs of potential cancer lesions, or, of interest to writers, maybe it’s a huge collection of text.

ANALYSIS: AI chews into this data and observes the connections between the data points. So it might note that a temperature drop in weather data indicates a storm will arrive in two days. Or that some pattern in MRIs correlates with a developing tumor. With writing, it will learn the rules of grammar and style. And logic (e.g. the adjective “reptilian” is seldom used to describe a cat).

I want to pause here and clarify one thing. Above, I say AI is “observing” and “noting”, which implies it is a conscious entity of some sort. Is it? I dunno… I personally doubt it, though maybe it will be one day. But that language is just a shorthand way of describing AI’s processing functions.

MODELING: After this analysis, AI might have a “model” of the mathematical relationships between data points. So it could say, “in the 500 individual weather pattern reports I was fed, this data point appeared with this other data point 23% of the time.” Or, “in the collection of text I was fed, the word ‘frog’s’ was followed by the word ‘tongue’ 12% of the time.” (Again, it’s not actually consciously thinking this… I think.)
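For the curious, here’s a minimal sketch of the “frog’s is followed by tongue X% of the time” idea. To be clear, this is my own toy illustration, not how ChatGPT or any real system is built; the function name bigram_frequencies and the sample sentence are made up for the example, and real models track far subtler relationships than adjacent word pairs.

```python
from collections import Counter, defaultdict


def bigram_frequencies(text):
    """Return, for each word, the share of times each following word appears.

    Hypothetical helper for illustration only: it counts adjacent word pairs
    in a small text, which is a drastic simplification of a real model.
    """
    words = text.lower().split()
    follow_counts = defaultdict(Counter)

    # Count how often each word is immediately followed by each other word.
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1

    # Turn the raw counts into proportions (0.5 means "50% of the time").
    freqs = {}
    for word, counter in follow_counts.items():
        total = sum(counter.values())
        freqs[word] = {nxt: count / total for nxt, count in counter.items()}
    return freqs


# Made-up sample text; a real training corpus would be billions of words.
sample = "the frog's tongue caught the fly and the frog's eyes blinked"
freqs = bigram_frequencies(sample)

# In this tiny sample, "frog's" is followed by "tongue" half the time.
print(freqs["frog's"])
```

Notice that what gets stored is a table of proportions, not the sample sentence itself.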

TRAINING: But that’s not enough for the AI to output anything useful. It has to be trained, which is basically humans asking it to answer questions/produce material and judging the results. A person asks, “show me coming storms/cancerous tumors/cat descriptions” and the AI responds. Incorrect results are met with a “try again,” and correct results are met with “good boy.” Slowly, the data in the AI is “sorted” into a model of the world.
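The “try again” / “good boy” loop can be sketched in a few lines. This is a deliberately crude toy under my own assumptions: the function train_with_feedback, the candidate words, and the plus-one/minus-one scoring are all invented for illustration. Real training adjusts huge numbers of numeric weights rather than a little score table, so treat this as a cartoon of the idea and nothing more.

```python
def train_with_feedback(candidates, is_correct, rounds=20, step=1.0):
    """Toy feedback loop: reward good answers, penalize bad ones.

    Hypothetical sketch only. `is_correct` stands in for the human judge.
    """
    scores = {candidate: 0.0 for candidate in candidates}

    for _ in range(rounds):
        # The "model" offers whichever answer it currently rates highest
        # (ties go to the earlier item in the list).
        guess = max(candidates, key=lambda c: scores[c])

        if is_correct(guess):
            scores[guess] += step  # "good boy"
        else:
            scores[guess] -= step  # "try again"

    return scores


# Assumed task: pick an adjective to describe a cat.
adjectives = ["reptilian", "scaly", "fluffy"]
scores = train_with_feedback(adjectives, lambda word: word == "fluffy")

best = max(adjectives, key=lambda word: scores[word])
print(best, scores)
```

After a couple of wrong guesses get marked down, the loop settles on the answer the judge keeps rewarding.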

Worth noting: AI can now train itself (no humans needed) though this Atlantic article implies there are limitations. (I’m presenting the archive version to get around the paywall.)

Now, some people seem to think AI is filled with actual copies of the books it was trained on. Like, if you asked it to spit up Stephen King’s “The Shining”, it could. My understanding is this is not the case. What’s of interest to AI is the relationships between the words (how often is word X followed by word Y, etc.), not the actual text.

Regardless, if AI “learns” these relationships via the analysis and training steps, it can be prompted to produce readable text. It’s not great fiction or prose—I’d say it’s usually awful*—but it’s not complete nonsense either.

*Current AI text really requires human tweaking to be worth reading.
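To make “producing text from word relationships” a bit more concrete, here’s a small sketch that generates a sentence by repeatedly picking a word that followed the previous word somewhere in a sample text. Everything here (build_followers, generate, the sample corpus, the fixed random seed) is my own invention for illustration. It’s a simple word-chain generator, which is far cruder than what modern AI tools do, but it shows how output can come from learned relationships rather than from a stored copy.

```python
import random
from collections import defaultdict


def build_followers(text):
    """Map each word to the list of words seen directly after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, seed=0):
    """Build up to `length` words by hopping from word to likely next word."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    output = [start]
    for _ in range(length - 1):
        options = followers.get(output[-1])
        if not options:
            break  # dead end: nothing ever followed this word
        # Words that followed more often appear more times in the list,
        # so they are proportionally more likely to be chosen.
        output.append(rng.choice(options))
    return " ".join(output)


# Made-up sample corpus for the example.
corpus = "the cat sat on the mat and the dog sat on the rug"
followers = build_followers(corpus)
print(generate(followers, start="the", length=8))
```

The result is grammatical-ish gibberish, which is roughly what you’d expect from a model this simple.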

Now’s a good point to ask: will AI produce quality fiction unassisted by humans (say, in the next thirty years)? The more I think about this, the more I doubt it. But we’ll see.

So is this plagiarism?

If AI was regurgitating substantial chunks of other authors’ text, I would say yes. But, while I’ve heard people say this has happened, I’ve yet to see examples. (That’s not quite true. The New York Times suit against OpenAI alleges that ChatGPT did spit up copyrighted material, but OpenAI claims it was “hacked.” The details are interesting and worth reading, and hopefully the truth will come out in court.)

Also, as I’ve noted before, it’s not easy to determine when textual similarities count as plagiarism (though there are some “smoking gun” cases).

Now, some might say concrete examples of AI plagiarizing text are not necessary. The very fact that it’s trained on existing data means it has to be replicating (in some sense) human-penned prose.

Some people might respond by saying, look, that’s what human authors do. They read a lot, then write, and their writing is certainly influenced by what they read. They will deliberately or inadvertently use metaphors, sentence rhythms, and plot devices picked up from other writers. Why is it different when AI does it? (These contrarians ask.)

I think that’s a worthwhile but imprecise argument. What AI is doing is like the human process of learning, but not exactly the same.

Here’s where I end up. It’s possible AI could plagiarize, i.e. spit out text that is similar enough to existing text that it would be called plagiarism if humans did it. It’s possible it’s done this in the past. But if plagiarism detection software is at all functional, all an AI company will need to do is filter its output with that detection software and the problem is solved. They may already be doing that.
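Here’s a rough sketch of what filtering output for copied passages might look like. I have no idea what AI companies actually run, so this is purely my assumption of one simple approach: check how many five-word runs in the output also appear in a known source. The names ngrams, overlap_ratio and passes_filter, the sample sentences, and the 20% threshold are all hypothetical choices for the example.

```python
def ngrams(text, n):
    """Return the set of every run of n consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate, source, n=5):
    """Share of the candidate's n-word runs that also appear in the source."""
    candidate_runs = ngrams(candidate, n)
    if not candidate_runs:
        return 0.0  # candidate is shorter than n words
    return len(candidate_runs & ngrams(source, n)) / len(candidate_runs)


def passes_filter(candidate, sources, n=5, threshold=0.2):
    """True if the candidate stays under the (arbitrary) overlap threshold."""
    return all(overlap_ratio(candidate, s, n) < threshold for s in sources)


# Made-up "existing" text and two made-up AI outputs.
known = ["the old lighthouse keeper counted every ship that passed the rocks"]
copied = "he said the old lighthouse keeper counted every ship that passed the rocks again"
fresh = "a tired sailor watched gulls circle above the quiet harbor at dawn"

print(passes_filter(copied, known))
print(passes_filter(fresh, known))
```

A filter like this only catches near-verbatim copying. It says nothing about borrowed plots or style, which is where the harder arguments live.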

To make the case that AI training is theft, it needs to be clear what is being stolen. Plots? Humans “steal” those all the time. A particular, unique combination of words? Yeah, maybe, though per my article on plagiarism, it’s not that easy to define when a combination is truly unique. Some other essence of writing? That’s too vague to be helpful.

My suspicion is that this division of opinion is about metaphysics. Some people think when they write they are literally putting some core essence of their self on the page (and AI may steal that.) If you know me, you know I shy away from metaphysical claims—as writers, we’re arranging words in various patterns, influenced by what we’ve read, the rules of logic and various other concerns. It’s a process, and fundamentally a rather boring one. (Don’t get me wrong, it’s a process I love, but it’s not magical rainbows coming out of unicorn buttholes.)

So, for the most part, I don’t call AI training theft. It may occasionally spit up something that would be condemned as plagiarism if done by humans, and I imagine we can come up with a legal mechanism to address those instances. But for the most part, AI is grouping words together based on patterns it’s derived from the corpus of human-generated text. No magic, no essence-stealing.

Man, I didn’t expect this to go on so long. Feel free to yell at me in the comments.

The Horror of Wil Forbis
