doesn’t it follow that AI-generated CSAM can only be generated if the AI has been trained on CSAM?

This article even explicitly says as much.

My question is: why aren’t OpenAI, Google, Microsoft, Anthropic… sued for possession of CSAM? It’s clearly in their training datasets.

  • hendrik@palaver.p3x.de · 6 months ago

    Well, it can draw an astronaut on a horse, and I doubt it had seen lots of astronauts on horses…

    • ExtremeDullard@lemmy.sdf.org (OP) · 6 months ago

      Yeah, but the article suggests that pedos train their local AI on existing CSAM, which would indicate that it’s somehow needed to generate AI-generated CSAM. Otherwise why would they bother? They’d just feed it images of children in innocent settings and images of ordinary porn to get their local AI to generate CSAM.

          • Rikudou_Sage@lemmings.world · 6 months ago

            And again, what’s the source? The great thing with articles about CSAM is that you don’t need sources: everyone just assumes you have them but obviously can’t share them.

            Did at least one pedo try that? Most likely yes. Is it the best way to get good quality fake CSAM? Not at all.

            • ExtremeDullard@lemmy.sdf.org (OP) · 6 months ago

              I don’t know, man. But I assume associations concerned with child abuse are all over that shit and checking it out. I’m not a CSAM specialist, but I assume an article that says old victims show up in previously-unseen images doesn’t lie, because why would it? It’s not like Wired is a pedo outlet…

              Also, it was just a question. I’m not trying to convince you of anything 🙂

              • hendrik@palaver.p3x.de · 6 months ago

                I think that article lacks nuance. It’s a bit baity and runs through the usual talking points without contextualizing the numbers or what’s actually happening out there, the consequences or the harm. That makes me believe the author just wants to push some predetermined point across.

                But I’ve yet to read a good article on this; most articles are like this one. But yeah, are a few thousand images a lot in the context of the crime that’s happening online? Where are these numbers from, and what’s with the claim that there are more actual pictures out there? I seriously doubt that at this point, if it’s so easy to generate images. And what consequences does all of this have? Does it mean an increase or a decrease in abuse? And lots of services have implemented filters… Are the platforms doing their due diligence? Is this a general societal issue, or criminals doing crime?

      • GBU_28@lemm.ee · 6 months ago

        Training an existing model on a specific set of new data is known as “fine-tuning”.

        A base model has broad world knowledge and the ability to generate outputs of things it hasn’t specifically seen, but a tuned model will provide “better” (fucking yuck to even write it) results.

        The closer your training data is to your desired result, the better.

  • BradleyUffner@lemmy.world · 6 months ago

    The AI can generate a picture of cows dancing with roombas on the moon. Do you think it was trained on images of cows dancing with roombas on the moon?

    • xia@lemmy.sdf.org · 5 months ago

      Individually, yes. Thousands of cows, thousands of "dancing"s, thousands of roombas, and thousands of "on the moon"s.

  • Hawk@lemmynsfw.com · 6 months ago

    It doesn’t need CSAM in the dataset to generate images that would be considered CSAM.

    I’m sure they put real effort into staying away from that stuff, as it’s bad for business.

  • Ragdoll X@lemmy.world · 6 months ago

    doesn’t it follow that AI-generated CSAM can only be generated if the AI has been trained on CSAM?

    Not quite, since the whole thing with image generators is that they’re able to combine different concepts to create new images. That’s why DALL-E 2 was able to create images of an astronaut riding a horse on the moon, even though it never saw such images, and probably never even saw astronauts and horses in the same image. So in theory these models can combine the concept of porn and children even if they never actually saw any CSAM during training, though I’m not gonna thoroughly test this possibility myself.

    Still, as the article says, since Stable Diffusion is publicly available someone can train it on CSAM images on their own computer specifically to make the model better at generating them. Based on my limited understanding of the litigations that Stability AI is currently dealing with (1, 2), whether they can be sued for how users employ their models will depend on how exactly these cases play out, and if the plaintiffs do win, whether their arguments can be applied outside of copyright law to include harmful content generated with SD.

    My question is: why aren’t OpenAI, Google, Microsoft, Anthropic… sued for possession of CSAM? It’s clearly in their training datasets.

    Well they don’t own the LAION dataset, which is what their image generators are trained on. And to sue either LAION or the companies that use their datasets you’d probably have to clear a very high bar of proving that they have CSAM images downloaded, know that they are there and have not removed them. It’s similar to how social media companies can’t be held liable for users posting CSAM to their website if they can show that they’re actually trying to remove these images. Some things will slip through the cracks, but if you show that you’re actually trying to deal with the problem you won’t get sued.

    LAION actually doesn’t even provide the images themselves; it only links to images on the internet, and they do a lot of screening to remove potentially illegal content. As they mention in this article, there was a report showing that 3,226 suspected CSAM images were linked in the dataset, of which 1,008 were confirmed by the Canadian Centre for Child Protection to be known instances of CSAM, and others were potential matches based on further analysis by the authors of the report. As they point out, there are valid arguments that this 3.2K number could be either an overestimate or an underestimate of the true number of CSAM images in the dataset.
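    For what it’s worth, the screening part is conceptually just hash matching: child-protection organizations distribute hashes of known abuse images, and a dataset curator drops any entry whose image hashes to a value on that list. Here’s a minimal sketch of that idea in Python; the file names and record fields are made up for illustration, and real pipelines use perceptual hashes (e.g. PhotoDNA) rather than plain SHA-256, since exact hashes miss re-encoded or resized copies:

        import hashlib
        import json

        # Hypothetical inputs: a text file of known-bad SHA-256 hashes (one per line,
        # as distributed by a child-protection hotline) and a JSONL dataset manifest
        # whose records point at locally cached image files. Both names are invented.
        with open("known_bad_sha256.txt") as f:
            blocklist = {line.strip() for line in f if line.strip()}

        clean_entries = []
        with open("dataset_manifest.jsonl") as f:
            for line in f:
                entry = json.loads(line)
                with open(entry["image_path"], "rb") as img:
                    digest = hashlib.sha256(img.read()).hexdigest()
                if digest in blocklist:
                    continue  # drop the entry; real systems would also report the match
                clean_entries.append(entry)

        with open("dataset_manifest_clean.jsonl", "w") as out:
            for entry in clean_entries:
                out.write(json.dumps(entry) + "\n")

    Exact-hash matching only catches byte-identical copies, which is part of why some known images can still slip through even with screening in place.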

    The question then is whether any image generators were trained on these CSAM images before they were taken down from the internet, or whether there is unidentified CSAM in the datasets that these models are being trained on. The truth is that we’ll likely never know for sure unless the aforementioned trials reveal some email where someone at Stability AI admitted that they didn’t filter potentially unsafe images, knew about CSAM in the data and refused to remove it, though for obvious reasons that’s unlikely to happen. Still, since the LAION dataset has billions of images, even if they are as thorough as possible in filtering CSAM, chances are that at least something slipped through the cracks, so I wouldn’t bet my money on them actually being able to infallibly remove 100% of it. Whether some of these AI models were trained on those images then depends on how they filtered potentially harmful content, or whether they filtered adult content in general.

  • frightful_hobgoblin@lemmy.ml · 6 months ago

    A GPT can produce things it’s never seen.

    It can produce a galaxy made out of dog food; doesn’t mean it was trained on pictures of galaxies made out of dog food.

  • justOnePersistentKbinPlease@fedia.io · 6 months ago

    A fun anecdote: my friends and I tried the then-brand-new MS image-gen AI built into Bing (for the purpose of a fake Tinder profile, long story).

    The generator kept hitting walls because it had been fed so much porn that the model averaged women as nude by default. You had to specify what clothes a woman was wearing. Not even just “clothed” worked; that defaulted to lingerie or bikinis.

    Not men, though. Men defaulted to being clothed.

    • thatKamGuy@sh.itjust.works · 6 months ago

      I mean, Bing has proven itself to be the best search engine for porn, so it kinda stands to reason that their AI model would have a particular knack for generating even more of the stuff!

  • PM_ME_VINTAGE_30S [he/him]@lemmy.sdf.org · 6 months ago

    If AI spits out stuff it’s been trained on

    For Stable Diffusion, it really doesn’t just spit out what it’s trained on. Very loosely, it starts with pure noise, then repeatedly denoises it based on your prompt (adding a little noise back in between steps), and it keeps doing this until it converges to a representation of your prompt.
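    To make that loop concrete, here’s a toy sketch; the step size, the noise scale and the “denoiser” itself are placeholders I made up, not anything from a real model, but the shape of the algorithm is the same: start from pure noise, repeatedly subtract the predicted noise, and (for ancestral-style samplers) add a little noise back in between steps.

        import numpy as np

        def toy_denoiser(x, t, prompt_vec):
            # Stand-in for the learned U-Net: in Stable Diffusion this network predicts
            # the noise present in x at timestep t, conditioned on the prompt embedding.
            # Here we just pretend the "noise" is whatever separates x from the prompt.
            return x - prompt_vec

        def sample(prompt_vec, steps=50):
            x = np.random.randn(*prompt_vec.shape)            # start from pure Gaussian noise
            for t in reversed(range(1, steps + 1)):
                noise_pred = toy_denoiser(x, t, prompt_vec)
                x = x - 0.1 * noise_pred                      # remove a fraction of the predicted noise
                if t > 1:
                    x = x + 0.02 * np.random.randn(*x.shape)  # add a little noise back between steps
            return x                                          # ends up near a "representation" of the prompt

        result = sample(prompt_vec=np.full((8, 8), 0.5))

    The real thing runs in a learned latent space with a carefully derived noise schedule instead of these made-up constants, but the loop structure is the same, and nothing in it replays stored training images.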

    IMO your premise is closer to true in practice for large language models, but still not strictly true.

    • notfromhere@lemmy.ml · 6 months ago

      It’s akin to virtually starting with a block of marble and removing every part (pixel) that isn’t the resulting image. Crazy how it works.

  • Free_Opinions@feddit.uk · 6 months ago

    First of all, it’s by definition not CSAM if it’s AI-generated. It’s simulated CSAM; no people were harmed making it. That harm happened when the training data was created.

    However, it’s not necessary that such content even exists in the training data. Just like ChatGPT can generate sentences it has never seen before, image generators can also generate pictures they have not seen before. Of course the results will be more accurate if that’s what the model was trained on, but it’s not strictly necessary. It just takes a skilled person to write the prompt.

    My understanding is that the simulated CSAM content you’re talking about has been made by people running the software locally and providing the training data themselves.

    • Buffalox@lemmy.world · 6 months ago

      First of all, it’s by definition not CSAM if it’s AI generated. It’s simulated CSAM

      This is blatantly false. It’s also illegal, and you can go to prison for owning, selling, or making child Lolita dolls.

      I don’t know why this is the legal position in most places, because, as you mention, no one is harmed.

        • Buffalox@lemmy.world · 6 months ago

          CSAM = Child sexual abuse material
          Even virtual material is still legally considered CSAM in most places. Although no children were hurt, it’s a depiction of it, and that’s enough.

          • Free_Opinions@feddit.uk · 6 months ago

            Being legally considered CSAM and actually being CSAM are two different things. I stand behind what I said, which wasn’t legal advice. By definition it’s not abuse material, because nobody has been abused.

            • Buffalox@lemmy.world · 6 months ago

              There’s a reason it’s legally considered CSAM: as I explained, it’s material that depicts it.
              You can’t have your own facts, especially not contrary to what’s legally determined, because that means acting on your definition or understanding would actually be ILLEGAL!

              • Free_Opinions@feddit.uk · 6 months ago

                I already told you that I’m not speaking from a legal point of view. CSAM means a specific thing, and AI-generated content doesn’t fit under this definition. The only way to generate CSAM is by abusing children and taking pictures/videos of it. AI content doesn’t count any more than stick-figure drawings do. The justice system may not differentiate the two, but that is not what I’m talking about.

                • Buffalox@lemmy.world · 6 months ago

                  The only way to generate CSAM is by abusing children and taking pictures/videos of it.

                  Society has decided otherwise; as I wrote, you can’t have your own facts or definitions. You might as well claim that in traffic red means go, because you have your own interpretation of how traffic lights should work.
                  Red has been legally defined to mean stop, so that’s how it is; that’s how our society works, by definition.