Soundmosaic constructs an approximation of one sound out of small
pieces of other sounds.
Starting with a target file and a set of source files, soundmosaic
splits the target file up into equal-sized segments, or "tiles". For
each tile in the target file, it finds the closest match in the source
files, and replaces the target tile with the tile from the source
file.
I've made some sample mp3s:
For the first demo, the target sound was a recording of a chimpanzee screaming, and the source files were a
few short recordings from George W. Bush's public speeches.
The final product is a concatenation of soundmosaic results for
decreasing tile sizes, starting at a few seconds per tile (such that
the first sound is a direct clip of GW's speech), and decreasing to
one microsecond per tile (such that the last sound is a perfect
reproduction of the chimp's scream).
The second demo is based on a recording of the Beatles introducing
themselves, replaced by snippets of John Coltrane performing "A Love
Supreme". The tile size is about half a centisecond. You can hear
the sax pretty clearly, especially when one of the Beatles whistles
after George's introduction. Some of the clicks you hear are
artifacts of the concatenation, but others are drums, mostly the ride
cymbal.
The similarity between two tiles is defined as the correlation of the
normalized vectors, so a higher value means a closer match. This is
the cosine of the angle between the vectors, and can be calculated
with a dot product once the vectors have been scaled to any common
length.
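As a rough illustration of that metric (a minimal sketch, not
soundmosaic's actual code), the cosine of the angle between two tiles
can be computed as below. The function name and the handling of
silent tiles are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length tiles (lists of samples)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero (silent) tile has no direction, so
        # treat it as uncorrelated with everything.
        return 0.0
    return dot / (norm_a * norm_b)

quiet = [0.1, 0.2, -0.1, -0.2]
loud = [x * 5 for x in quiet]      # same shape, higher volume
inverted = [-x for x in quiet]     # same shape, opposite phase

print(cosine_similarity(quiet, loud))      # close to 1: volume is ignored
print(cosine_similarity(quiet, inverted))  # close to -1: worst possible match
```

Because both vectors are divided by their own length, multiplying a
tile by any positive gain leaves the result unchanged.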
In fact, the prospective match is scaled to the volume of the original
tile before comparison, and it is written to the output file at that
volume. Normalization before comparison means that the overall volume
of tiles does not affect the comparison. This also serves to make the
output sound a little bit more like the target, since it follows the
same broad amplitude changes.
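The volume matching described above might look something like the
following sketch. It assumes "volume" means RMS level, which for
equal-sized tiles is proportional to the Euclidean length of the
vector; the helper names are hypothetical.

```python
import math

def rms(tile):
    """Root-mean-square level of a tile (assumed here to stand for 'volume')."""
    return math.sqrt(sum(x * x for x in tile) / len(tile))

def scale_to_volume(source_tile, target_tile):
    """Return source_tile rescaled so its RMS level matches target_tile."""
    src_level = rms(source_tile)
    if src_level == 0.0:
        # A silent source tile cannot be scaled up; return it unchanged.
        return list(source_tile)
    gain = rms(target_tile) / src_level
    return [x * gain for x in source_tile]

source_tile = [0.5, -0.5, 0.5, -0.5]
target_tile = [0.1, 0.1, -0.1, -0.1]
matched = scale_to_volume(source_tile, target_tile)
```

The scaled tile keeps the shape of the source but takes the level of
the target, which is why the output follows the target's amplitude
envelope.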
Before 1.1, soundmosaic used the Manhattan distance between the
"normalized" vectors, where "normalization" was done in the common
audio sense of increasing the volume as much as possible without
clipping (this corresponds to mapping onto the surface of a hypercube
rather than a hypersphere). The old metric worked reasonably well,
but the new metric is much better.
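For comparison, the pre-1.1 metric could be sketched as follows. This
is a reconstruction from the description above, not the original
code, and the function names are made up for the example.

```python
def peak_normalize(tile):
    """Scale a tile so its largest absolute sample is 1.0 (no clipping)."""
    peak = max(abs(x) for x in tile)
    if peak == 0:
        return list(tile)  # silent tile: nothing to scale
    return [x / peak for x in tile]

def manhattan_distance(a, b):
    """Sum of absolute sample-by-sample differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def old_metric(a, b):
    """Manhattan distance between peak-normalized tiles; lower is closer."""
    return manhattan_distance(peak_normalize(a), peak_normalize(b))

print(old_metric([0.1, -0.2], [0.5, -1.0]))  # same shape, so close to 0
```

Peak normalization puts the largest coordinate at exactly +1 or -1,
which is what places every tile on the surface of a hypercube.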
Soundmosaic automatically resamples the source files to match the
sample rate of the target file. It does this using a simple
zero-order hold / drop-sample resampler, which is low quality and
introduces all kinds of artifacts -- it doesn't even low-pass filter
at the relevant Nyquist frequency. If resampling quality is important
to you, you should use a higher quality resampler to adjust all of
your source material to the same sample rate as the target file before
you run soundmosaic.
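To show what a zero-order hold / drop-sample resampler does, here is a
minimal sketch. It assumes integer sample rates and a list of samples,
and is only an illustration of the general technique, not the code
soundmosaic uses.

```python
def resample_zero_order_hold(samples, src_rate, dst_rate):
    """Resample by repeating samples (upsampling) or dropping them
    (downsampling). No filtering is done, so aliasing is expected."""
    n_out = len(samples) * dst_rate // src_rate
    # Each output sample copies the most recent input sample at or
    # before the same point in time.
    return [samples[i * src_rate // dst_rate] for i in range(n_out)]

print(resample_zero_order_hold([1, 2, 3], 22050, 44100))     # [1, 1, 2, 2, 3, 3]
print(resample_zero_order_hold([1, 2, 3, 4], 44100, 22050))  # [1, 3]
```

When downsampling, every other sample is simply thrown away, which is
where the missing low-pass filter matters most.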
Dealing with Large Amounts of Data:
In order to find matches good enough to make both the target and
source inputs recognizable in the output, it helps to have a
tremendous amount of source data, and a tremendous amount of data
storage and processing to go with it. Distributing the system across
multiple machines using the --master and --slave options helps to
handle that load so that a decent result can be achieved in a more
reasonable amount of time.
Normally, we compare each target tile with every possible tile in the
source files (one beginning at the first sample, another beginning at
the second, and so on). That's very time consuming, though, even for
a small amount of data, so the --partition flag is provided to merely
partition the source file into non-overlapping tiles, the same as is
done with the target file. This method produces lower quality
results, but it allows for a variety of source tiles, and prevents the
processing time from getting out of hand. It can be a useful way to
"test run" a soundmosaic project to get an idea of what the results
might be like.
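The difference between the two search modes comes down to which
starting offsets in the source are considered. The sketch below
assumes a single in-memory source and uses hypothetical function
names; the partition argument stands in for the --partition flag.

```python
import math

def candidate_offsets(n_samples, tile_size, partition=False):
    """Start offsets of the source tiles to compare against."""
    step = tile_size if partition else 1  # non-overlapping vs. every sample
    return range(0, n_samples - tile_size + 1, step)

def similarity(a, b):
    """Cosine of the angle between two tiles (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(target_tile, source, partition=False):
    """Return (offset, score) of the most similar source tile."""
    size = len(target_tile)
    best_offset, best_score = None, -2.0  # below any possible cosine
    for off in candidate_offsets(len(source), size, partition):
        score = similarity(target_tile, source[off:off + size])
        if score > best_score:
            best_offset, best_score = off, score
    return best_offset, best_score

source = [0, 0, 1, 2, 3, 0, 0, 0]
target_tile = [1, 2, 3]
print(best_match(target_tile, source))                  # finds offset 2
print(best_match(target_tile, source, partition=True))  # only tries 0 and 3
```

With a tile size of N samples, the partitioned search looks at roughly
1/N as many candidates, which is where both the speedup and the loss
in match quality come from.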
I'm interested in ways of speeding up the calculation of distance --
I'm not sure whether soundmosaic can use the standard DSP techniques
for calculating correlation more efficiently, because I think the
per-tile normalization probably gets in the way.
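One possible way around the normalization problem, offered only as a
sketch and not as something soundmosaic does: the length of every
sliding source tile can be computed in a single pass from a running
sum of squares. That would leave only the raw dot products, which are
the quantity that FFT-based cross-correlation produces. The function
name below is hypothetical.

```python
import math

def sliding_norms(source, tile_size):
    """Euclidean length of every tile_size-long window of source,
    computed in O(n) with a prefix sum of squared samples."""
    prefix = [0.0]
    for x in source:
        prefix.append(prefix[-1] + x * x)
    return [math.sqrt(prefix[i + tile_size] - prefix[i])
            for i in range(len(source) - tile_size + 1)]

print(sliding_norms([3, 4, 0, 0], 2))  # [5.0, 4.0, 0.0]
```

Dividing each raw correlation value by the matching window norm (and
by the norm of the target tile) would then give the same cosine as
normalizing every tile individually. Note that subtracting large
prefix sums can lose floating-point precision on long files.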
I'm also interested in distance metrics which are more relevant to the
sounds which are important to the human ear. It might be helpful to
filter some frequency ranges before doing the comparison, or to use
mp3 compression to strip out less important information.
Soundmosaic usually produces output that clicks loudly at the edges of
tiles. I'd like to fix that. I could fade the ends of every output
tile, but I'm not sure that would sound any better for small tile
sizes, and I don't know what the falloff curve should be or how
quickly to fade the edges. Or I could split tiles at the nearest
zero crossing, but I don't like the idea of having variable-length
tiles.
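For experimenting with the first idea, a fade on both ends of a tile
could be sketched like this. The linear ramp and the fade_len
parameter are assumptions chosen for simplicity; nothing here settles
which curve or fade time actually sounds best.

```python
def fade_edges(tile, fade_len):
    """Apply a linear fade-in and fade-out of fade_len samples."""
    # Never let the two fades overlap on very short tiles.
    fade_len = min(fade_len, len(tile) // 2)
    out = list(tile)
    for i in range(fade_len):
        gain = i / fade_len      # ramps from 0.0 up towards 1.0
        out[i] *= gain           # fade in at the start
        out[-1 - i] *= gain      # mirror-image fade out at the end
    return out

print(fade_edges([1.0] * 6, 2))  # [0.0, 0.5, 1.0, 1.0, 0.5, 0.0]
```

Since every faded tile starts and ends at zero, adjacent tiles meet
without a jump, at the cost of a brief dip in level at each boundary.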