Scary: GOP revives ISP-tracking legislation

You won’t miss your privacy until it’s gone. The bill comes from Republicans, with one Democratic proponent, and it also includes mandatory web labeling. A snip from CNET’s article:

“Details about data retention requirements would be left to Gonzales. At a minimum, the bill says, the regulations must require storing records ‘such as the name and address of the subscriber or registered user to whom an Internet Protocol address, user identification or telephone number was assigned, in order to permit compliance with court orders.’

Because there is no limit on how broad the rules can be, Gonzales would be permitted to force Internet providers to keep logs of Web browsing, instant message exchanges, or e-mail conversations indefinitely. (The bill does not, however, explicitly cover search engines or Web hosting companies, which officials have talked about before as targets of regulation.)” Link. (thanks, Kimo!)

Declan McCullagh, the article’s author, is following the story closely on his invaluable site, Politech.

Update: A reader explains *exactly* how this bill won’t work, and I’ll bet the people pushing this bill have no idea. Snip:

“This isn’t workable. And even if it were — even if someone could identify all of the originals, compute all the permutations, compute hashes of all those, store all those, and rig up a service that scaled to Internet size…it would fail the first time someone figured out and used a permutation not accounted for. Or created another picture.

Epilogue part 1
—————

There’s something else scary about this, and about the requirement that such images be reported. One of the other properties of hash functions is that they’re one-way: that is, if I take image I1, and compute its hash S1, and give you S1, you can’t work backwards to I1. So if you’re in the field, using such a hypothetical service, and your software checks a newly-seen image and discovers that its hash matches S123456 from the database, and you automatically report this to Fatherland Security…then unless *you* stored the image (possibly a crime) and looked at the image (possibly a crime), you do not know what is actually in that picture. You don’t know whether it’s really kiddie porn or whether it’s a picture of the Shrub standing under the ‘Mission Accomplished’ banner.

In other words, there is no way for an independent observer to verify that every hash in the database was actually generated from kiddie porn input. So this same mechanism could be easily used to track *any* images that its operators take an interest in.”

Read the whole explanation after the jump.


* * * * * * *

Declan’s article contains this nugget:

> Afterward, the center is authorized to compile that information into a form
> that can be sent back to ISPs and used to assemble a database of “unique
> identification numbers generated from the data contained in the image file.”
> That could be a unique ID created by a hash function, which yields something
> akin to a digital fingerprint of a file.

Heh. This isn’t going to work. Not that that’ll stop the usual grandstanding
with a healthy dose of “we must do it for the chilllllll-drrrrrunnnnnnn”
drumbeating, but it’s not going to work.

How hashes work
—————

Presume that a hash function H(x) is applied to an image I1, and yields
a signature S1. S1 is just a string of bits — maybe 128 or 256 or 1024.
And presume that the hash function H() is chosen to have the properties
we want hash functions to have: that is, that when it’s applied to other
images, e.g. H(I2), H(I3), … H(In), that none of those results
S2, S3, … Sn will equal S1 or each other.

Which means…we now have a way to reduce I1 (big) to S1 (small), we can
store S1, we can compute H(Ix) where Ix is an unknown image that we see
flying across our network, and if we get S1, then we know that Ix = I1,
and if I1 is naughty badness, then we push the big red button and sound
the klaxon, etc.
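As a sketch of the mechanism described above (assuming SHA-256 as the hash function and in-memory byte strings as stand-ins for image files; both are illustrative choices, not anything the bill specifies):

```python
import hashlib

def signature(image_bytes: bytes) -> str:
    """Compute S = H(I): a fixed-size fingerprint of an image's raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()

# A database of signatures of known "naughty" images (illustrative data).
known_bad = {signature(b"some contraband image bytes")}

def check(image_bytes: bytes) -> bool:
    """Push the big red button iff the hash matches a stored signature."""
    return signature(image_bytes) in known_bad

print(check(b"some contraband image bytes"))   # an exact copy matches: True
print(check(b"a completely different image"))  # anything else: False
```

Note that the match only fires on a bit-for-bit identical file, which is exactly the weakness the rest of the explanation exploits.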

Note that one property of hashes is that changing a single bit of
the input changes the output dramatically.

[ Declan pointed out

http://www.codemonkeyramblings.com/2006/07/yet_another_ineffective_way_to.php

which has visual examples of this and makes much the same
point I’m about to make. ]

How hashes don’t work
———————

Except…we can modify a single bit in any of the pixels of I1 without
materially affecting the perceived image. (We can probably modify more
than one, but let’s just stick with one for now.) Let’s pick the least
significant bit and not worry about which color that maps to. That means
for a small pseudocolor image — say 256 × 256 at 8 bits — that we
can trivially create 256 × 256 = 65,536 new images, each of which
will have a different signature but none of which are distinguishable
to the eye — for example, S1234 and S4321 will be VERY different while
I1234 and I4321 will look about the same.
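A minimal sketch of that evasion, using a small raw byte buffer as a stand-in for a 256 × 256 8-bit image (the buffer is scaled down to 64 pixels here only to keep the demo fast; the counting argument is identical):

```python
import hashlib

# Stand-in "image": a raw 8-bit pixel buffer. A real 256x256 image would
# have 65,536 pixels; 64 pixels keeps the demo quick without changing the idea.
pixels = bytearray(b"\x80" * 64)
signatures = {hashlib.sha256(pixels).hexdigest()}

for i in range(len(pixels)):
    variant = bytearray(pixels)
    variant[i] ^= 0x01  # flip the least significant bit of one pixel
    signatures.add(hashlib.sha256(variant).hexdigest())

# One LSB flip per pixel position gives one new, visually identical image
# each, and every one of them hashes to a completely different signature.
print(len(signatures))  # 65: the original plus 64 one-bit variants
```

Scaled up, the same loop over a real 256 × 256 buffer yields the 65,536 distinct signatures described above.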

If we move up to 512×512 pixel, 24-bit images (a somewhat more realistic
choice for porn), and presume we can modify the lowest-order 1 bit each
of RGB without anyone noticing, then we can generate 512 × 512 × 3, or
786,432, new images, and thus roughly three-quarters of a million new signatures.

And we’re still just modifying a single bit of RGB in a single pixel.

If we consider modifying two pixels, or three, or cropping a line of
pixels, or adding a line of pixels, or modifying a block of pixels, or
mirror-reversing the image, or modifying other than the least significant
bit, or dealing with larger images, or resizing images, or changing the
image format (e.g. JPG to GIF) or or or or…then it quickly becomes
possible to create enormous numbers of new images that are visually
indistinguishable…but will have very different hashes.

Where “enormous” == “hundreds of millions” and that’s probably too small.

[ It is. Last night I did the math a bit more carefully, and
worked out that I could create 34 billion variations on a
512×512 24-bit image while still only changing two pixels
and without getting into the other methods listed above. ]
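The bracketed figure checks out if “changing two pixels” means flipping one fixed low-order bit in each of two distinct pixels of a 512 × 512 image — the count is then just the number of ways to choose the pixel pair (my reading of the setup above, not something spelled out in it):

```python
import math

pixels = 512 * 512                         # 262,144 pixel positions
two_pixel_variants = math.comb(pixels, 2)  # ways to choose which two pixels to touch
print(two_pixel_variants)                  # 34,359,607,296 -- about 34 billion
```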

Practical considerations
————————

This is not hypothetical speculation. Spammers have been doing
this with images for some time now, in order to avoid content-scanning
anti-spam programs that check images. Code to perform these operations
exists in numerous programs, libraries, closed-source and open-source
products, and could be written from scratch in a day by any programmer
with even a little background in graphics or image processing.

So. If we posit a set of “original images” that’s only 1 million (10^6)
and a permutation algorithm that can generate only 1 billion (10^9)
(and both of these are vastly too conservative) then we’re looking at
computing, storing, distributing, and checking 1 million billion
(10^15) hashes.

I think 10^24 or still higher is probably more realistic, but let’s
stick with 10^15 for now. If the hashes are 128 bits long (also
conservative), then 16,000 terabytes of storage will be needed for
the hashes.

That’s 3.2 million DVDs at 5 Gbytes per DVD.
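The storage arithmetic is easy to reproduce, using the same round numbers as above (10^15 hashes, 128-bit hashes, 5 Gbytes per DVD):

```python
hashes = 10**15            # 10^6 originals x 10^9 permutations each
bytes_per_hash = 128 // 8  # a 128-bit hash is 16 bytes
total_bytes = hashes * bytes_per_hash   # 1.6e16 bytes of hashes to store
terabytes = total_bytes / 10**12        # 16,000 terabytes
dvds = total_bytes / (5 * 10**9)        # 3.2 million DVDs
print(terabytes, dvds)                  # 16000.0 3200000.0
```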

If we repeat this calculation with larger images (e.g. consider those
from a 6-megapixel digital camera, roughly 3072×2048), more images (more
realistic), longer hashes (probably more realistic), and more image
variations (also more realistic), then we pretty quickly arrive at
numbers that exceed all data storage ever manufactured on this planet.

Endgame
——-

This isn’t workable. And even if it were — even if someone could
identify all of the originals, compute all the permutations, compute
hashes of all those, store all those, and rig up a service that
scaled to Internet size…it would fail the first time someone
figured out and used a permutation not accounted for. Or created
another picture.

Epilogue part 1
—————

There’s something else scary about this, and about the requirement that
such images be reported. One of the other properties of hash functions is
that they’re one-way: that is, if I take image I1, and compute its hash
S1, and give you S1, you can’t work backwards to I1. So if you’re in
the field, using such a hypothetical service, and your software checks
a newly-seen image and discovers that its hash matches S123456 from the
database, and you automatically report this to Fatherland Security…then
unless *you* stored the image (possibly a crime) and looked at the image
(possibly a crime), you do not know what is actually in that picture.
You don’t know whether it’s really kiddie porn or whether it’s a picture
of the Shrub standing under the “Mission Accomplished” banner.

In other words, there is no way for an independent observer to verify that
every hash in the database was actually generated from kiddie porn input.
So this same mechanism could be easily used to track *any* images that
its operators take an interest in.

Epilogue part 2
—————

Consider also the Julie Amero case, where it seems pretty clear that
this poor woman has had her life destroyed and faces prison because
malware got into an undefended school computer and subsequently pulled
in some combination of (more malware, porn, popups, etc.). A NYTimes
article (by Markoff) about a month ago cited an estimate of 70M
fully-compromised systems on the ‘net; an Ars Technica interview
with Vint Cerf

Vint Cerf: one quarter of all computers part of a botnet
http://arstechnica.com/news.ars/post/20070125-8707.html

cites an estimate of 150M. Personally, I think 300M is closer,
but regardless of who’s right, this means that something on the
order of 100M people could be turned into instant kiddie porn
suspects the moment the operators of the botnets feel inclined to
download it onto their systems.
