crazy idea with military/intelligence applications
I'm not sure how I ended up there, but I found myself reading about linear predictive coding (LPC) of speech just now. My crazy idea is a system that transforms a voice so that the content of the speech stays clear but the identity of the speaker is destroyed. The "scramble suits" in A Scanner Darkly performed a similar function, but in a fictional world.

For a quick rundown of LPC: it is a fairly simple process that breaks down an incoming audio signal under the assumption that it was produced by a buzzer in a resonating tube -- not a bad approximation for the vocal cords and the mouth and nose cavities in the human head. LPC has already been used in speech compression, in vocoder-style effects in music, and in other applications. For each "frame" -- a window of a few milliseconds -- the LPC encoder gives you a handful of coefficients representing the shape of the resonating filter (important for preserving formants) and a base frequency for the buzzer source.

In my system these two pieces of data would be regularized so that many speakers' voices map to the same output. The buzzer frequency for high- and low-pitched voices would be normalized by taking the current frequency's difference from a moving average and applying that difference to a fixed base frequency defined in the algorithm. This way, meaning-conveying pitch variations (the rising tone of a question, etc.) would be preserved while the true pitch of the speaker's voice is obscured. The shift between the canonical frequency and the speaker's buzzer frequency could also be used to shift the formants represented by the filter coefficients up or down, preserving their relative locations without leaking further pitch information.
Furthermore, the space of normalized filter coefficients could be segmented into bins that allow enough variation for good intelligibility but collect several speakers' variations in vocalization into the same buckets (although a simple VQ isn't immediately applicable in this space). So far I only have ways to normalize the voice with respect to overall pitch, but several other identifying features of a voice would still be perceivable after this process: accent, pace, vocabulary, grammar. To defeat those as well, one would probably need a system that read in a large window of speech, correctly extracted and interpreted the natural language, and resynthesized its meaning with a canonical grammar -- certainly not feasible for real-time communication, nor even possible with any current technology I know of. My system provides a first line of defense against voice identification while introducing delays only on the order of a frame, and it is general enough to apply to several languages without an extensive database -- systems of far greater complexity already exist in hardware inside the average cell phone.
How well does it really work? I have no idea -- I just thought of it. Maybe I can string together a minimal prototype in PureData when this fast-paced working lifestyle passes over.
Wait, isn't this hiding stuff just the opposite of what Adam is always talking about? Hmm, very true... Well, the normalized voice would be much more compressible -- you'd only need to send the pitch-differential signal (which could be heavily mutilated while still retaining clarity; it'd just sound a little robotic) and an identifier for the filter-coefficient bucket used. Sure, it's no good for hi-fi music compression, but the application here is encoding a single speaker's words and expression, nothing more. Plenty of shortcuts to make!
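As a back-of-the-envelope sketch of how small a frame could get: pack the pitch differential into one coarsely quantized signed byte and the bucket identifier into two bytes. The wire format here is entirely made up -- half-Hz steps and a 16-bit bucket space are guesses -- but three bytes per frame at, say, 50 frames a second is only 150 bytes/s, a tiny fraction of the raw waveform.

```python
import struct

def encode_frame(pitch_delta_hz, bucket_id):
    """Pack one normalized-voice frame: pitch differential + filter bucket.

    The differential is quantized to half-Hz steps in a signed byte
    (about +/-63 Hz of intonation range), followed by a 16-bit bucket id.
    """
    q = max(-127, min(127, int(round(pitch_delta_hz * 2))))
    return struct.pack('>bH', q, bucket_id)

def decode_frame(data):
    q, bucket_id = struct.unpack('>bH', data)
    return q / 2.0, bucket_id
```

Mutilating the differential channel, as suggested above, would just mean quantizing even more coarsely -- intonation survives surprisingly rough treatment.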