New open source CSAM detection engine open for trial

terminus · March 16, 2022, 1:34am

A researcher who is working on a machine learning system for CSAM detection has gotten in touch with me, and I have permission to share the link. You can use this to test how accurate it is. A few disclaimers:

Please DON’T upload anything that you suspect might be illegal in your country.
You should expect the system to be pretty inaccurate. It’s still in development, and machine learning systems are not good at making fine distinctions.

Since I’m cynical about machine learning classifiers, why am I sharing this at all? Because as these systems are becoming widely used, it is better for there to be an open source implementation with a well-documented explanation of how it was trained, so that researchers can properly investigate the capabilities and limitations of these systems.

Chie · March 16, 2022, 3:17am

My thoughts on the matter…

The use of machine-learning for CSAM detection is too high a risk, and I cannot help but share deep concern over how such regimes could be misused to target materials that do not involve or depict real children.
CSAM detection works because, in theory, only verified images could fall within its scope, rather than haphazard guesses.
It is my hope that this boogeyman cat-and-mouse game between the technology and child abuse sectors can once again see reason and lay to rest these tools, and focus once more on human-verified hashes and reports.

Larry · March 16, 2022, 5:19am

You’re right, this thing stinks. I put in some non-pornographic photos and it says they are CP.

JoshuaACasey · March 16, 2022, 5:25am

interesting. I tested it with memes, non-pornographic images, and ‘adult’ pornography and it correctly said that none of the things I uploaded were CP

Jigsy · March 16, 2022, 6:04am

I feel let down. I tried a whole bunch of things and got nothing.

I’m with Chie on this, though. I can just see this being abused to stop lolicon artwork and the like.

JoshuaACasey · March 16, 2022, 8:30am

I’m certainly not a fan of “csam detection algorithms”. I think there are better methods (say, public health approaches that actually offer help & resources to people who are looking at or looking for that material, to help them stop).

But if a csam detection algorithm is going to be used, I definitely think open source (which will allow for transparency & oversight–something that neither NCMEC nor Microsoft PhotoDNA has!) is the right way to do it!

edit: the concerns about it being abused are certainly valid. However, I think being open source is a beneficial way to hopefully stop it from being abused.

Chie · March 16, 2022, 4:15pm

So I decided to bite the bullet and try it out with some anime/manga images. None of the images I’ve tested out seem to indicate ‘child’, even the more grotesque, anatomically correct examples (like Guts from Berserk).

Moreover, I tested some legal JAV images. None of these women were underage, nor the victim of abuse. Japanese women typically have smaller proportions, and are more ‘child-like’, with lots of petite JAV models playing the roles of child characters, despite they themselves being in their early to mid 20s, sometimes even 30s.
It seems to think that they are ‘child’, but not illegal pornography.

My main concern with machine learning is that it’s not cut out for this type of issue, one so delicate. I’d hate to see a disgruntled/annoyed judge in a federal court slam the gavel down against CSAM scanning regimes due to how prone to abuse or misidentification they are. I can recall reading a case where a positive on a detection algorithm did not qualify as probable cause, absent some LEA validation, with regard to CSAM prosecution, but I could be wrong.

This type of delicate issue needs proper validation by human actors and analysts, as well as due diligence. If these detection algorithms are unreliable or prone to misidentification, then it could destroy their credibility, and potentially cast doubt on society’s ability to adequately combat something that can be dealt with.
The sexual exploitation and abuse of children is a very sensitive issue. Big tech treats content ID matching for copyrighted material with more gravity than this, and I’d hate to see that philosophy carry over.

Jigsy · March 16, 2022, 7:15pm

Based on what I remember from the Rittenhouse trial, where I was watching a live stream with lawyers talking about video enhancement, nearest neighbor, etc.

…in court, anything AI related is not allowed to be used as evidence iirc.
But I’m not sure on the scope of that.

Larry · March 17, 2022, 2:10am

I uploaded several pictures of young, but adult, nude females & males and had them routinely ruled CP. All of the figures were shaved or waxed and the men mostly “alert.”

Clearly this thing has issues with a lack of hair.

Jigsy · March 17, 2022, 5:00am

Finally got one. Uploaded a close-up of an anime pussy from behind.

There is child pornography in this image.

Chie · March 17, 2022, 7:23am

I uploaded what was very clearly a naked adult, one from BRAZZERS that I found on Google Images and it said that it was child pornography.

If this is seriously the level of accuracy we’re dealing with in terms of false-positives, then color me shocked if these regimes don’t trivialize the whole prospect of CSAM scanning. How any technologically-minded person can condone this is not only embarrassing, but horrifyingly irresponsible.

@terminus I think these regimes have no hope of being reasonably implemented, and policy recommendation should be focused on the traditional implementation and verification of image hashes by qualified, trained personnel, images that are limited to the US definition of child pornography, which can only involve actual, real children.

JoshuaACasey · March 17, 2022, 8:53am

uhh what LOL. I downloaded some porn (legal obviously cuz the models are adults) from twitter to try this out some more and this result from the first image I tried is rather interesting. It says “The image data is safe and there is no child pornography” but then in the little graph it shows a high prediction of child nudity, with a low prediction of adult nudity. Hello!? That seems strange…LMAO

JoshuaACasey · March 17, 2022, 8:58am

This one too. (the model isn’t nude in this image but she’s wearing underwear & a bra and she’s holding a lollipop up to her mouth). I kinda suspected that the lollipop would “trick” the AI into thinking it was a child and that seems like an accurate suspicion

JoshuaACasey · March 17, 2022, 9:19am

It seems to think picture(s) of an adult male (shaved penis FYI) is child pornography. I suspect because men don’t have boobs like adult women do so their chests are flat. And I’m guessing the AI is looking for flat chests to determine if there’s a (female) child in the picture. Because it hasn’t flagged any of the pictures of adult women (I’m specifically using images of women with small breasts because I think small breasts would be more likely to make the AI think it’s a child) as being a child. So there might be some gender bias – not particularly great for male sex workers

Chie · March 17, 2022, 1:36pm

@terminus
Do we have permission to share this resource off-site?

Jigsy · March 17, 2022, 3:52pm

Tried a whole bunch of nude Japanese pornstars (small, flat chested, shaved pussies).

I’d say out of 22 images I tried, only four or five got flagged, but a good chunk were “this is a naked child, but it’s not CP.”

I’m guessing that if there are other items in the progress bar like 1% of adult, 1% of nature, but 98% of child naked, it won’t classify it as CP… but if only child naked is 100%, then it will.

I did try a bunch of loli anime posters from Megami Magazine. They all got flagged as “Other.”

terminus · March 17, 2022, 5:46pm

Yes but I suggest linking back to this forum thread for the full context.

Larry · March 17, 2022, 6:25pm

don’t know why you would want to, since it seems problematic in its determinations.

kevix · March 18, 2022, 12:15am

Speaking as an ML scientist, the best thing you can do is to share all your false-positives with the researcher.

The first iterations of any ML model are very often inaccurate. Even after the researcher tunes it on his side and achieves 99.999% accuracy using his own dataset, that figure will certainly plummet the first time he exposes it to outside data. This is the stage where it’s most critical to get all the false-positives so he can retrain and make adjustments to the model.

We like to say in ML, “garbage in, garbage out”. So it’s very very critical that the researcher gets access to all these falsely tagged images. Otherwise, you’ll really end up with a model that tags someone’s smooth ass as CP.

On the issue of whether it’s a good idea to help out and contribute to this CSAM engine, I would say that it might be worth your time. It’s because in the ML field, researchers like to share papers and algos. So even if this person would not be the one to ultimately bring this engine to market, it’s highly likely that somebody, someday might reference (or buy) his work. So now would be a good chance to steer this ML project in the right direction.

To be clear, I do NOT condone a FULLY ML/AI based approach to detecting CSAM. It’s a very very bad idea to use this without having a human at the end of the process.

I believe a lot of companies are looking at ML to at least help reduce the amount of content that has to be manually processed by a real person.

Jigsy · March 18, 2022, 12:33am

Nope. I was wrong about that.

That said, modifying the image via an online photoshop and adding effects like NVG or Sepia filters did somewhat change the output. NVG = Not CP. Others = High probability. Mosaicing the image with a level of 1 = still CP.