The following statement was read by Edward W. Felten today at the Fourth International Information Hiding Workshop, in Pittsburgh.
===============
On behalf of the authors of the paper "Reading Between the Lines: Lessons from the SDMI Challenge," I am disappointed to tell you that we will not be presenting our paper today.
Our paper was submitted via the normal academic peer-review process. The reviewers, who were chosen for their scientific reputations and credentials, enthusiastically recommended the paper for publication, due to their judgment of the paper's scientific merit.
Nevertheless, the Recording Industry Association of America, the SDMI Foundation, and the Verance Corporation threatened to bring a lawsuit if we proceeded with our presentation or the publication of our paper. Threats were made against the authors, against the conference organizers, and against their respective employers.
Litigation is costly, time-consuming, and uncertain, regardless of the merits of the other side's case. Ultimately we, the authors, reached a collective decision not to expose ourselves, our employers, and the conference organizers to litigation at this time.
We remain committed to free speech and to the value of scientific debate to our country and the world. We believe that people benefit from learning the truth about the products they are asked to buy. We will continue to fight for these values, and for the right to publish our paper.
We look forward to the day when we can present the results of our
research to you, our colleagues, through the normal scientific publication
process, so that you can judge our work for yourselves.
25 April 2001:
Date: Wed, 25 Apr 2001 14:43:09 -0400 (EDT)
From: Jeremy A Erwin <jerwin@osf1.gmu.edu>
To: dvd-discuss@eon.law.harvard.edu
Subject: [dvd-discuss] SDMI Challenge Information

The Secure Internet Programming Laboratory (http://www.cs.princeton.edu/sip/) posted this on their website (http://www.cs.princeton.edu/sip/sdmi/):

Update: Wednesday April 25, 2001, 1:30 PM EDT

No decision has yet been announced regarding whether our presentation at the Pittsburgh conference will go ahead. The presentation is scheduled for 10:00 AM on Thursday. We will post any updated information here, as it becomes available.

We have created a mailing list for people who are interested in receiving any announcements relating to the status of our paper and presentation. To subscribe, send email to majordomo@cs.princeton.edu; the message body should contain the line "subscribe sdmi-paper-info".
1 The Verance Watermark is currently used for DVD-Audio and SDMI Phase I products and certain portions of that technology are trade secrets.
Scott A. Craver1, John P. McGregor1, Min Wu1, Bede Liu1,
Adam Stubblefield2, Ben Swartzlander2, Dan S. Wallach2,
Drew Dean3, and Edward W. Felten4
1 Dept. of Electrical Engineering, Princeton University
2 Dept. of Computer Science, Rice University
3 Computer Science Laboratory, Xerox Palo Alto Research Center
4 Dept. of Computer Science, Princeton University
Abstract. The Secure Digital Music Initiative is a consortium of parties interested in preventing piracy of digital music, and to this end they are developing architectures for content protection on untrusted platforms. SDMI recently held a challenge to test the strength of 4 watermarking technologies, and 2 other security technologies. No documentation explained the implementations of the technologies, and neither watermark embedding nor detecting software was directly accessible to challenge participants. We nevertheless accepted the challenge, and learned a great deal about the inner workings of the technologies. We report on our results here.
1 The SDMI challenge offered a small cash payment to be shared among everyone who broke at least one of the technologies and was willing to sign a confidentiality agreement giving up all rights to discuss their findings. The cash prize amounted to the price of a few days of time from a skilled computer security consultant, and it was to be split among all successful entrants, a group that we suspected might be significant in size. We chose to forgo the payment and retain our right to publish this paper.
- File 1: an unwatermarked song;
- File 2: File 1, with a watermark added; and
- File 3: another watermarked song.
Fig. 1. The SDMI watermark attack
problem. For each of the four watermark challenges, sample-1, sample-2, and
sample-3 are provided by SDMI; sample-4 is generated by participants in the
challenge and submitted to the SDMI oracle for testing.
Figure 1 provides an overview of the challenge goal. As mentioned earlier,
there are three audio files per watermark challenge: an original and watermarked
version of one clip, and then a watermarked version of a second clip, from which
the mark is to be removed. All clips were 2 minutes long, sampled at 44.1kHz
with 16-bit precision.
The reader should note one serious flaw with this challenge arrangement. The
goal is to remove a robust mark, while these proposals appear to be Phase II
watermark screening technologies [4]. As we mentioned earlier, a Phase II screen
is intended to reject audio clips if they have been compressed, and presumably
compression degrades a fragile component of the watermark. An attacker need not
remove the robust watermark to foil the Phase II screen, but could instead
repair the modified fragile component in compressed audio. This attack was not
possible under the challenge setup.
3.1 Attack and Analysis of Technology A
A reasonable first step in analyzing watermarked content with original,
unmarked samples is differencing the original and marked versions in some way.
Initially, we used sample-by-sample differences in order to determine roughly
what kinds of watermarking methods were taking place. Unfortunately,
technology A involved a slowly varying phase distortion which masked any other
cues in a sample-by-sample difference. We ultimately decided this distortion was
a pre-processing step separate from the watermark, in part because undoing the
distortion alone did not foil the oracle.
The phase distortion nevertheless led us to attempt an attack in which both
the phase and magnitude changes between sample 1 and sample 2 are applied to
sample 3. This attack was confirmed by SDMI's oracle as successful, and
illustrates the general attack approach of imposing the difference in an
original-watermark pair upon another media clip. Here, the "difference" is taken
in the FFT domain rather than the time domain, based on our suspicions regarding
the domain of embedding. Note that this attack did not require much information
about the watermarking scheme itself, and conversely did not provide much extra
insight into its workings.
A next step, then, is to compute the frequency response H(w) =
W(w)/O(w) of the watermarking process for segments of audio, where W(w) and
O(w) are the spectra of the watermarked and original segments, and to
observe both |H(w)| and the corresponding impulse response
h(t). If the watermark is based on some kind of linear filter
whose properties change slowly enough relative to the size of a frame of
samples, then this approach is ideal.
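The following is a minimal sketch of this step on a toy frame, not the analysis code used for the challenge. It assumes a circular (wrap-around) echo model so that the division W(w)/O(w) is exact; the frame length, echo delays, and gains are hypothetical values chosen for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def impulse_response(original, marked):
    """Estimate h(t) for one frame from H(w) = W(w) / O(w)."""
    O = dft(original)
    W = dft(marked)
    H = [w / o for w, o in zip(W, O)]  # assumes O(w) has no zero bins
    return [v.real for v in dft(H, inverse=True)]

# Toy frame: a decaying "note" plus two circular echoes (hypothetical values).
N = 64
original = [0.8 ** t for t in range(N)]
echoes = {7: 0.3, 12: -0.2}  # delay in samples -> echo gain
marked = [original[t] + sum(g * original[(t - d) % N] for d, g in echoes.items())
          for t in range(N)]

h = impulse_response(original, marked)
# The two largest taps besides t = 0 sit at the echo delays.
peaks = sorted(range(1, N), key=lambda t: abs(h[t]), reverse=True)[:2]
print(sorted(peaks))
```

On real audio the echoes are not circular and O(w) can be close to zero in some bins, so the estimate would be noisier than in this idealized frame.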
Figure 2 illustrates one frequency response and impulse response about 0.3
seconds into the music. These responses are based on FFTs of 882 samples, or one
fiftieth of a second of music. As can be clearly seen, a pair of sinusoidal ripples
is present within a certain frequency band, approximately 8-16 kHz. Ripples in
the frequency domain are indicative of echoes in the time domain, and a sum of
sinusoids suggested the presence of multiple echoes. The corresponding impulse
response h(t) confirms this. This pattern of ripples changes quite
rapidly from frame to frame.
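The link between echoes and ripples can be made explicit with a short derivation. This is the standard single-echo model, stated here as an illustration; the symbols g (echo gain) and d (echo delay) are introduced for this sketch.

```latex
w(t) = o(t) + g\,o(t - d)
\;\Longrightarrow\;
H(\omega) = \frac{W(\omega)}{O(\omega)} = 1 + g\,e^{-j\omega d},
\qquad
|H(\omega)|^2 = 1 + g^2 + 2g\cos(\omega d).

% Two echoes (g_1, d_1) and (g_2, d_2):
|H(\omega)|^2 = 1 + g_1^2 + g_2^2
  + 2g_1\cos(\omega d_1) + 2g_2\cos(\omega d_2)
  + 2g_1 g_2\cos\bigl(\omega (d_1 - d_2)\bigr).
```

A single echo of delay d seconds therefore produces a ripple that repeats every 1/d Hz; for example, a delay of 0.5 ms gives a ripple period of 2 kHz. With two echoes the response is a sum of sinusoids in frequency, as observed in Figure 2.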
Thus, we had reason to suspect a complex echo hiding system, involving
multiple time-varying echoes. It was at this point that we considered a patent
search, knowing enough about the data hiding method that we could look for
specific search terms, and we were pleased to discover that this particular
scheme appears to be listed as an alternative embodiment in US patent number
05940135, awarded to Aris corporation, now part of Verance [5]. This provided us
with little more detail than we had already discovered, but confirmed that we
were on the right track, as well as providing the probable identity of the
company which developed the scheme. It also spurred no small amount of
discussion of the validity of Kerckhoffs's criterion, the driving principle in
security that one must not rely upon the obscurity of an algorithm. This is,
surely, doubly true when the algorithm is patented.
Fig. 2. A short-term complex echo.
Above, the frequency response between the watermarked and original music, taken
over 1/50 second, showing a sinusoidal ripple between 8 and 16 kHz. Below, the
corresponding impulse response. The sinusoidal pattern in the frequency domain
corresponds to a pair of echoes in the time domain.
The most useful technical detail provided by the patent was that
the "delay hopping" pattern was likely discrete rather than continuous, allowing
us to search for appropriate frame sizes during which the echo parameters were
constant. Data collection from the first second of audio showed a frame size of
approximately 882 samples, or 1/50 second. We also observed that the mark did
not begin until 10 frames after the start of the music, and that activity also
existed in a band of lower frequency, approximately 4-8 kHz. This could be the
same echo obscured by other operations, or could be a second band used for
another component in the watermarking scheme. A very clear ripple in this band,
indicating a single echo with a delay of about 34 samples, appears shortly
before the main echo-hopping pattern begins.
The next step in our analysis was the determination of the delay hopping
pattern used in the watermarking method, as this appeared to be the "secret key"
of the data embedding scheme. It is reasonable to suspect that the pattern
repeats itself in short order, since a watermark detector should be able to find
a mark in a subclip of music, without any assistance initially aligning the mark
with the detector's hopping pattern. Again, an analysis of the first second
revealed a pattern of echo pairs that appeared to repeat every 16 frames, as
outlined in figure 3. The delays appear to fall within six general categories,
each delay approximately a multiple of 1/4 millisecond. The exact values of the
delays vary slightly, but this could be the result of the phase distortion
present in the music.
Fig. 3. The hypothesized delay
hopping pattern of technology A. Here two stretches of 16 frames are illustrated
side-by-side, with observed echoes in each frame categorized by six distinct
delays: 2, 3, 4, 5, 6 or 7 times 0.00025 sec. Aside from several missing echoes,
a pattern appears to repeat every 16 frames. Note also that in each frame the
echo gain is the same for both echoes.
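As a quick arithmetic check of the figures quoted above, the frame size and the six delay categories can be converted between seconds and samples at the 44.1 kHz sampling rate. The constant names below are ours.

```python
SAMPLE_RATE = 44_100   # Hz, from the challenge clip format
FRAME = 882            # samples per frame observed for technology A
DELAY_UNIT = 0.00025   # seconds; observed delays are 2..7 multiples of this

frame_seconds = FRAME / SAMPLE_RATE      # one frame, i.e. 1/50 second
pattern_seconds = 16 * frame_seconds     # one repetition of the hopping pattern
delays_in_samples = {m: m * DELAY_UNIT * SAMPLE_RATE for m in range(2, 8)}

for m, d in delays_in_samples.items():
    print(f"{m} x 0.25 ms = {d:.3f} samples")
```

The unit delay comes to roughly 11 samples, so the single echoes of about 33 to 34 samples reported elsewhere in this section would fall in the "3 times 0.00025 sec" category.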
The reader will also note that two frames apparently contain only one
echo. If this pattern were the union of two pseudorandom patterns chosen from
six possible delay choices, two "collisions" would be within what is expected by
chance.
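A back-of-the-envelope estimate supports this, under the assumption (ours) that the two patterns choose independently and uniformly among the six delays in each frame:

```latex
P(\text{collision in one frame}) = 6 \cdot \left(\tfrac{1}{6}\right)^2 = \tfrac{1}{6},
\qquad
E[\text{collisions in 16 frames}] = 16 \cdot \tfrac{1}{6} \approx 2.67 .
```

Observing two single-echo frames per 16-frame period is therefore close to the expected count.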
Next, there is the issue of the actual encoded bits. Further work shows the
sign of the echo gain does not repeat with the delay-hopping pattern, and so is
likely at least part of an embedded message. Extracting such data without the
help of an original can be problematic, although the patent, of course, outlines
numerous detector structures which can be used to this end. We developed several
tools for cepstral analysis to assist us in the process. See [2] for an
introduction to cepstral analysis; Anderson and Petitcolas [1] illustrate its
use in attacks on echo hiding watermark systems.
With a rapidly changing delay, normal cepstral analysis does not seem a good
choice. However, if we know that the same echo is likely to occur at multiples
of 16/50 of a second, we can improve detector capability by combining the
information of multiple liftered2 log spectra.
____________________
2 In accordance with the flopped vocabulary used with cepstral analysis, "liftering" refers to the process of filtering data in the frequency domain rather than the time domain. Similarly, "quefrencies" are frequencies of ripples which occur in the frequency domain rather than the time domain.
Fig. 4. Three cepstral detector
structures. In each case we have a collection of distinct frames, each believed
to possess echoes of the same delay. The first two compute cepstral data for
each frame, and sum their squares (or squared magnitudes) to constructively
combine the echo signal in all frames. The third structure illustrates a method
for testing a hypothesized pattern of positive and negative gains, possibly
useful for brute-forcing or testing for the presence of a known "ciphertext."
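The sum-of-squares idea can be sketched as follows. This is an illustrative toy, not the CepstroMatic utility: the frames are synthetic decaying sequences with a circular echo, and the frame length, delay, and gains are hypothetical. Squaring each cepstrum before summing lets echoes of opposite sign reinforce rather than cancel.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def cepstrum(frame):
    """Real cepstrum: inverse DFT of the log power spectrum."""
    log_power = [math.log(abs(v) ** 2) for v in dft(frame)]
    return [v.real for v in dft(log_power, inverse=True)]

def combined_detector(frames):
    """Sum squared cepstra over frames believed to share one echo delay."""
    total = [0.0] * len(frames[0])
    for frame in frames:
        for q, c in enumerate(cepstrum(frame)):
            total[q] += c * c
    return total

def add_echo(x, delay, gain):
    """Circular echo, so the toy model is exact in the DFT domain."""
    n = len(x)
    return [x[t] + gain * x[(t - delay) % n] for t in range(n)]

N, DELAY = 128, 20                 # hypothetical frame length and echo delay
decays = [0.3, 0.5, 0.7, -0.4]     # four different synthetic "music" frames
gains = [0.5, -0.5, 0.5, -0.5]     # echo sign differs from frame to frame
frames = [add_echo([a ** t for t in range(N)], DELAY, g)
          for a, g in zip(decays, gains)]

score = combined_detector(frames)
# Skip the lowest quefrencies, which are dominated by the signal itself.
best = max(range(5, N // 2), key=lambda q: score[q])
print(best)
```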
In the final structure, one cepstrum is taken using a guess of the gain sign
for each suspect frame. With the correct guess, the ripple should be strongest,
resulting in the largest spike from the cepstral detector. Figure 5 shows the
output of this detector on several sets of suspect frames. While this requires
an amount of work exponential in the number of frames, it has a different
intended purpose: this is a brute-forcing tool, a utility for determining the
most probable among a set of suspected short strings of gain signs as an aid to
extracting possible ciphertext values.
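One way such a sign-guessing detector could work is sketched below. The combination rule is our assumption: each frame's log power spectrum is multiplied by the guessed sign, the results are summed, and the cepstral value at the known delay is measured. The text does not specify the actual structure at this level of detail, and all frame parameters here are hypothetical.

```python
import cmath
import itertools
import math

def log_power_spectrum(frame):
    """Log of the squared DFT magnitude, computed with a naive DFT."""
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
                for k in range(n)]
    return [math.log(abs(v) ** 2) for v in spectrum]

def cepstral_value(log_power, quefrency):
    """Single cepstral coefficient: ripple strength at one quefrency."""
    n = len(log_power)
    return sum(lp * math.cos(2 * math.pi * k * quefrency / n)
               for k, lp in enumerate(log_power)) / n

def best_sign_guess(frames, delay):
    """Try every +/- pattern; keep the one with the largest combined spike."""
    spectra = [log_power_spectrum(f) for f in frames]
    n = len(spectra[0])
    best = None
    for signs in itertools.product((1, -1), repeat=len(frames)):
        combined = [sum(s * sp[k] for s, sp in zip(signs, spectra)) for k in range(n)]
        spike = abs(cepstral_value(combined, delay))
        if best is None or spike > best[0]:
            best = (spike, signs)
    return best[1]

def add_echo(x, delay, gain):
    """Circular echo, so the toy model is exact in the DFT domain."""
    n = len(x)
    return [x[t] + gain * x[(t - delay) % n] for t in range(n)]

N, DELAY = 128, 20                   # hypothetical frame length and echo delay
decays = [0.3, 0.5, 0.7, -0.4]       # four synthetic frames
gains = [0.5, -0.4, -0.5, 0.4]       # the "secret" gain signs to recover
frames = [add_echo([a ** t for t in range(N)], DELAY, g)
          for a, g in zip(decays, gains)]

guess = best_sign_guess(frames, DELAY)
print(guess)
```

Because only the magnitude of the spike is compared, a pattern and its complete negation score identically, so the signs are recovered only up to a global flip. The loop over 2^n patterns is the exponential cost mentioned above.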
Fig. 5. Detection of an echo. A
screenshot of our CepstroMatic utility shows a combination of 4 separate frames
of music, each a fiftieth of a second long, in which the same echo delay was
believed to exist. Their combination shows a very clear ripple on the right,
corresponding to a clear cepstral spike on the left. This is a single echo at a
delay of 33 samples, the delay suggested for these intervals by the
hypothesized delay-hopping pattern.
Finally, there is the issue of what this embedded watermark means. Again, we
are uncertain about a possible signalling band below 8 kHz. This could be a
robust mark, signalling presence of a fragile mark of echoes between 8 and 16
kHz. The 8-16 kHz band does seem like an unusual place to hide robust data,
unless it does indeed extend further down, and so this could very easily be
hidden information whose degradation is used to determine if music has already
been compressed.
Of course, knowledge of either the robust or fragile component of the
mark is enough for an attacker to circumvent the scheme, because one can either
remove the robust mark, or repair or reinstate the fragile mark after
compression has damaged it. As mentioned earlier, this possible attack of
repairing the fragile component appears to have been ruled out by the nature of
the SDMI challenge oracles. One must wait and see if real-world attackers will
attempt such an approach, or resort to more brute methods or oracle attacks to
remove the robust component.
3.2 Attack on Challenge B
We analyzed samp1b.wav and samp2b.wav using short-time FFT. Shown in Fig. 6
are the two FFT magnitudes for 1000 samples at 98.67 sec. Also shown is the
difference of the two magnitudes. A spectral notch around 2800 Hz is observed for
some segments of samp2b.wav and another notch around 3500 Hz is observed for some
other segments of samp2b.wav. Similar notches are observed in samp3b.wav. The
attack fills in those notches of samp3b.wav with random but bounded coefficient
values. We also submitted a variation of this attack involving different
parameters for notch description. Both attacks were confirmed by the SDMI oracle as
successful.
Fig. 6. Technology-B: FFT
magnitudes of samp1b.wav and samp2b.wav and their difference for 1000 samples at
98.67 sec.
3.3 Attacks on Challenge C
By taking the difference of samp1c.wav and samp2c.wav, bursts of narrowband
signal are observed, as shown in Fig. 7. These narrowband bursts appear to be
centered around 1350 Hz. Two different attacks were applied to Challenge C. In
the first attack, we shifted the pitch of the audio by about a quartertone. In
the second attack, we passed the signal through a bandstop filter centered
around 1350 Hz. Our submissions were confirmed by the SDMI oracle as successful. In
addition, the perceptual quality of both attacks passed the "golden ear"
testing conducted by SDMI after the 3-week challenge.
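Both attacks can be illustrated numerically. The sketch below computes the frequency ratio of a quartertone and applies a standard biquad notch filter centered at 1350 Hz to two test tones. The filter design and its Q value are our assumptions; the text does not give the parameters of the bandstop filter actually submitted.

```python
import math

QUARTERTONE = 2 ** (1 / 24)   # half a semitone, about a 2.9% frequency change

def notch_coefficients(center_hz, sample_rate, q):
    """Biquad notch (band-stop) coefficients, normalized so that a[0] == 1."""
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a

def biquad(x, b, a):
    """Direct-form I second-order filter."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

RATE = 44_100
b, a = notch_coefficients(1350, RATE, q=5.0)   # Q = 5 is a guess, not from the paper

def tone(freq, n=8820):
    """A unit-amplitude sine tone, 0.2 seconds long by default."""
    return [math.sin(2 * math.pi * freq * t / RATE) for t in range(n)]

# Measure steady-state output, discarding the first 0.1 s of filter transient.
in_band = rms(biquad(tone(1350), b, a)[4410:])
out_band = rms(biquad(tone(5000), b, a)[4410:])
print(f"shifted centre: {1350 * QUARTERTONE:.1f} Hz")
print(f"1350 Hz tone rms after notch: {in_band:.2e}")
print(f"5000 Hz tone rms after notch: {out_band:.3f}")
```

A tone at the notch centre is suppressed almost completely while a tone well outside the stop band passes nearly unchanged, which is why such a filter can remove a narrowband mark with limited audible damage.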
Fig. 7. Challenge-C: Waveform of
the difference between samp1c.wav and samp2c.wav.
3.4 Attack on Challenge F
For Challenge F, we warped the time axis by inserting a periodically varying
delay. The delay function comes from our study of Technology-A, and was in fact
initially intended to undo the phase distortion applied by technology A.
Therefore the perceptual quality of our attacked audio is expected to be better
than or comparable to that of the audio watermarked by Technology-A. We also
submitted variations of this attack involving different warping parameters and
different delay functions. They were confirmed by the SDMI oracle as successful.
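A time-axis warp of this kind can be sketched as resampling the clip at positions offset by a periodic delay. The sinusoidal delay function, its depth and period, and the use of linear interpolation are all assumptions made for illustration; the delay functions actually submitted are not described in the text.

```python
import math

def warp_time_axis(x, depth, period):
    """Resample x at t - depth*sin(2*pi*t/period) using linear interpolation.

    depth and period are in samples (hypothetical parameters).
    """
    n = len(x)
    out = []
    for t in range(n):
        pos = t - depth * math.sin(2 * math.pi * t / period)
        pos = min(max(pos, 0.0), n - 1.0)   # clamp at the clip boundaries
        i = int(pos)
        frac = pos - i
        nxt = x[min(i + 1, n - 1)]
        out.append((1 - frac) * x[i] + frac * nxt)
    return out

# A ramp makes the warp visible: the output value equals the read position.
ramp = [float(t) for t in range(200)]
warped = warp_time_axis(ramp, depth=2.0, period=100.0)
unchanged = warp_time_axis(ramp, depth=0.0, period=100.0)
print(warped[25], unchanged[25])
```

The output has the same length as the input, so the clip duration is preserved while the local timing drifts back and forth by up to `depth` samples.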
3 Specifically, Netscape Navigator and Mozilla under Linux, Netscape Navigator under Windows NT, and Internet Explorer under Windows 98 and 2000.
Fig. 8. In a Technology D
Authenticator, the signal fades in, repeats, and fades out.
Extracting the Data. Frequency analysis on the 1024 sample block shows
that almost all of the signal energy is concentrated in the 16-20 kHz range, as
shown in Figure 9. We believe this range was chosen because these frequencies
are less audible to the human ear. Closer examination shows that this 16-20 kHz
range is divided up into 80 discrete bins, each of which appears to carry one
bit of information. As shown in Figure 10, these bits can be manually counted by
a human using a graph of the magnitude of signal in the frequency domain.
Fig. 9. Magnitude vs. Frequency of
Technology D Authenticator
Fig. 10. Individual Bits From a
Technology D Authenticator
Close inspection and pattern matching on these 80 bits of information reveal
that there are only 16 bits of information repeated 5 times using different
permutations. Using the letters A-P to symbolize the 16 bits, these 5
permutations are described in Figure 11.
ABCDEFGHIJKLMNOP
OMILANHGPBDCKJFE
PKINHODFMJBCAGLE
FCKLGMEPNOADJBHI
PMGHLECAKDONIFJB
Fig. 11. The encoding of the 16
bits of data in Technology D
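The structure in Figure 11 can be expressed as a small encoder and decoder. The five permutation strings are taken directly from the figure. The assumption that the rows occupy the 80 bins in order, and the use of majority voting across the five copies, are ours.

```python
PERMUTATIONS = [
    "ABCDEFGHIJKLMNOP",
    "OMILANHGPBDCKJFE",
    "PKINHODFMJBCAGLE",
    "FCKLGMEPNOADJBHI",
    "PMGHLECAKDONIFJB",
]

def encode(bits16):
    """Spread 16 bits over 80 positions, one permuted copy per row."""
    return [bits16[ord(letter) - ord("A")]
            for row in PERMUTATIONS for letter in row]

def decode(bits80):
    """Majority-vote each of the 16 logical bits over its 5 copies."""
    votes = [0] * 16
    for pos, letter in enumerate("".join(PERMUTATIONS)):
        votes[ord(letter) - ord("A")] += bits80[pos]
    return [1 if v >= 3 else 0 for v in votes]

payload = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]   # arbitrary example
channel = encode(payload)
channel[37] ^= 1   # flip one of the 80 bits to mimic a misread bin
recovered = decode(channel)
print(recovered == payload)
```

With five copies of every bit, a decoder of this kind tolerates up to two misread copies per logical bit, which may be why the designers chose repetition.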
Because of the malfunctioning oracle, we were unable to determine the
function used to map TOCs to authenticators, but given an actual SDMI device, it
would be trivial to brute force all 2^16 possibilities. Likewise,
without the oracle, we could not determine if there was any other signal present
in the authenticator (e.g., in the phase of the frequency components with
nonzero magnitude).
For the moment, let us assume that the hash function used in Technology D has
only 16 bits of output. Given the number of distinct CDs available, an attacker
should be able to acquire almost all, if not all, of the authenticators. We note
that at 9 kilobytes each, a collection of 65,536 files would fit nicely on a
single CD. Many people have CD collections of 300+ discs, which by the birthday
paradox makes it more likely than not that there is a hash collision among their
own collection.
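Both figures in this paragraph can be checked with a few lines of arithmetic, assuming the 16-bit hash values are uniformly distributed:

```python
def collision_probability(n, space):
    """P(at least two of n uniform samples from `space` values coincide)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= 1 - i / space
    return 1 - p_distinct

SPACE = 2 ** 16
# Smallest collection size at which a collision becomes more likely than not.
threshold = next(n for n in range(1, SPACE)
                 if collision_probability(n, SPACE) > 0.5)
archive_mib = SPACE * 9 / 1024   # 65,536 authenticators at 9 kilobytes each

print(threshold, archive_mib)
```

The crossover point lands a little over 300 discs, in line with the "300+" figure, and the complete set of authenticators comes to 576 megabytes, within the capacity of a standard CD.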
Our results indicated that the hash function used in Technology D could be
weak or may have less than 16 bits of output. In the 100 authenticator samples
provided in the Technology D challenge, there were 2 pairs of 16-bit hash
collisions. We will not step through the derivation here, but the probability of
two or more collisions occurring in n samples of X equally likely
possibilities is:
If the 16-bit hash function output has 16 bits of entropy, the probability of
2 collisions occurring in n = 100 samples of X = 2^16
possibilities is 0.00254 (by the above 1.5 equation). If X ~
2^11.5, the chances of two collisions occurring are about even. This
suggests that either 4 bits of the 16-bit hash output may be outputs of
functions of the other 12 bits or the hash function used to generate the 16-bit
signature is weak. It is also possible that the challenge designers purposefully
selected TOCs that yield collisions. The designers could gauge the progress of
the contestants by observing whether anyone submits authenticator A with TOC B
to the oracle, where authenticator A is equal to authenticator B. Besides the
relatively large number of collisions in the provided authenticators, it appears
that there are no strong biases in the authenticator bits such as significantly
more or fewer 1's than 0's.
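As a rough cross-check, the sketch below computes one plausible reading of "two or more collisions": the probability that n uniform samples are neither all distinct nor distinct except for a single coinciding pair. This is our interpretation and not necessarily the formula the authors used, so small differences from the quoted value are to be expected.

```python
def p_at_least_two_collisions(n, space):
    """1 - P(all n distinct) - P(exactly one pair coincides, rest distinct)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= 1 - i / space
    # Exactly one repeated pair: choose the pair, then n-1 distinct values.
    p_one_pair = (n * (n - 1) / 2) / space
    for i in range(n - 1):
        p_one_pair *= 1 - i / space
    return 1 - p_distinct - p_one_pair

full_entropy = p_at_least_two_collisions(100, 2 ** 16)
reduced_entropy = p_at_least_two_collisions(100, 2 ** 11.5)
print(f"X = 2^16:   {full_entropy:.5f}")
print(f"X = 2^11.5: {reduced_entropy:.3f}")
```

Under this reading the full-entropy case comes out at a few tenths of a percent and the X = 2^11.5 case at roughly even odds, which agrees in magnitude with the figures quoted above.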
4.2 Technology E
Technology E is designed to fix a specific bug in Technology D: the TOC only
mentions the length of each song but says nothing about the contents of
that song. As such, an attacker wishing to produce a mix CD would only need to
find a TOC approximately the same as the desired mix CD, then copy the TOC and
authenticator from that CD onto the mix CD. If the TOC does not perfectly match
the CD, the track skipping functionality will still work but will only get
"close" to track boundaries rather than reaching them precisely. Likewise, if a
TOC specified a track length longer than the track we wished to put there, we
could pad the track with digital silence (or properly SDMI-watermarked silence,
copied from another valid track). Regardless, a mix CD played from start to end
would work perfectly. Technology E is designed to counter this attack, using the
audio data itself as part of the authentication process.
The Technology E challenge presented insufficient information to be properly
studied. Rather than giving us the original audio tracks (from which we might
study the unspecified watermarking scheme), we were instead given the tables of
contents for 1000 CDs and a simple scripting language to specify a concatenation
of music clips from any of these CDs. The oracle would process one of these
scripts and then state whether the resulting CD would be rejected.
While we could have mounted a detailed statistical analysis, submitting
hundreds or thousands of queries to the oracle, we believe the challenge was
fundamentally flawed. In practice, given a functioning SDMI device and actual
SDMI-protected content, we could study the audio tracks in detail and determine
the structure of the watermarking scheme.