
[Speech-reco] Comparison of Voxin to Espeak for short phrase


From: Bill Cox
Subject: [Speech-reco] Comparison of Voxin to Espeak for short phrase
Date: Tue, 31 Aug 2010 12:44:52 -0400

I've gone off and learned a bit about speech synthesis, just enough to
be dangerous.  Overall, I have to say I have increased respect for
Espeak.  It's an amazing tool.  That said, I still understand Voxin at
higher speed than Espeak, and I wanted to know why.  There are
probably a lot of reasons, but I now think I partly understand why
Voxin can say "With the pro golfer" faster, yet more clearly, than
Espeak.

I've linked to two .wav files and the resulting spectrograms, which I
created using a new "speechbox" toolkit I'm hacking together for
examining and manipulating speech samples.  The code is at
vinux-project.org/gitosis if anyone is interested, but it's not ready
for prime time.  The links to the data are:

http://vinux-project.org/espeak_golfer_fast.wav
http://vinux-project.org/voxin_golfer_fast.wav
http://vinux-project.org/espeak.pgm
http://vinux-project.org/voxin.pgm

I ran Voxin at speed "110" and Espeak at speed "450", both at pitch
65.  I up-sampled the 11.025 kHz Voxin output to 22.05 kHz to match
Espeak.  I took FFTs of both samples with a window size of 2048, an
"envelope width" of 128, and a "step size" of 8.  I couldn't find
these terms used anywhere on the Internet, and I had to write my own
FFT code, but I found that the "Snack 2" library does almost the same
thing.  The Voxin phrase ends around pixel 1026, and there are 8 input
samples per output column in the spectrogram, meaning the phrase is
spoken in 1026*8/22050 = 0.372 seconds.  The Espeak sample plays in
1095 pixels, or 0.397 seconds.  I perceive the Voxin voice to be
easier to comprehend.  I compared the spectrograms and see the
following:
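
For anyone who wants to reproduce the numbers, here is a minimal
sketch of the framing described above, in plain Python.  It is not the
speechbox code.  The Hann window and the radix-2 FFT are assumptions
on my part, the "envelope width" smoothing is not modeled, and every
function name is made up for illustration.  It assumes a sample rate
of exactly 22050 Hz (11025 doubled), which gives durations within a
millisecond of the ones quoted.

```python
import cmath
import math

SAMPLE_RATE = 22050   # Hz, assumed: 11.025 kHz doubled
WINDOW_SIZE = 2048    # samples per FFT window, as in the post
STEP_SIZE = 8         # input samples per spectrogram column, as in the post


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def hann(n):
    """Periodic Hann window (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(samples, window_size=WINDOW_SIZE, step=STEP_SIZE):
    """Return one list of magnitudes per column, advancing `step` samples."""
    window = hann(window_size)
    columns = []
    for start in range(0, len(samples) - window_size + 1, step):
        frame = [s * w for s, w in
                 zip(samples[start:start + window_size], window)]
        spectrum = fft(frame)
        # Keep only the non-negative frequencies of the real input.
        columns.append([abs(c) for c in spectrum[:window_size // 2 + 1]])
    return columns


def columns_to_seconds(columns, step=STEP_SIZE, rate=SAMPLE_RATE):
    """Convert a pixel (column) count in the spectrogram to seconds."""
    return columns * step / rate


print("Voxin :", round(columns_to_seconds(1026), 3), "s")
print("Espeak:", round(columns_to_seconds(1095), 3), "s")
```

With a step of 8 and a window of 2048, neighboring columns overlap
almost entirely, which is why the pictures are so wide: one pixel is
only about 0.36 ms of audio.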

Espeak says "With the" slower than Voxin, but says "pro golfer"
slightly faster.  Voxin does "with the" in 303 pixels, while Espeak
takes 463.  There are several things I think are going on here.

First, Voxin eliminates the voiced "th" sound completely, saying "wi -
the" as one word, rather than "with - the" as Espeak does.  (Sorry...
I'm still learning IPA, so for now I'm just spelling phonetically.)
Second, Voxin completes the "wi" part very rapidly, by pixel 98,
versus pixel 169 for Espeak.  I believe this is mostly an optimization
in Voxin to slur common short words together.  It may also be that I
don't need to hear the "wi" for as long as Espeak plays it.  Finally,
Voxin has no exact silence between "With the" and "pro", like Espeak
does.  Instead, the voicing reduces and the formants fade to near zero
when the "p" starts.  This seems to allow Voxin to eliminate a silence
gap which Espeak currently requires.

Neither Voxin nor Espeak introduces any further silences, but Voxin
does several interesting things.  First, phonemes don't seem to be
whole multiples of the fundamental period, allowing Voxin to better
tune for comprehension and speed.  I think this effect becomes more
significant at very high speeds.  For example, the voiced "g" in
"golfer" seems to last about 1.7 fundamental periods in Voxin, while
Espeak takes 3.  I think that makes the "g" a bit harsh, but I see
there's a lot going on in there, and doing it in 1.7 periods requires
subtle sub-period manipulation.

Voxin slows down dramatically at the end, with the fundamental
frequency lowering quite a bit.  Espeak does this, too, but to a
lesser extent.  Also, the volume drops off significantly in Voxin.
The overall effect is that Voxin speeds past Espeak in the first part
of "pro golfer", but then drags out the "fer".  I perceive Espeak as
stopping abruptly, compared to Voxin's smoother stop.  Also, I suspect
the Voxin approach would allow a smaller or even zero silence before
the next phrase begins, while I'd need a silence with Espeak.
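
The pixel positions quoted above can be turned into rough segment
timings with a few lines of Python.  This is only a sketch: it assumes
8 samples per pixel at 22050 Hz, the helper names are invented, and
the "pro golfer" figure is simply everything after "with the", so for
Espeak it includes the silence gap.  The 120 Hz pitch in the last
lines is purely illustrative; I have not measured the actual
fundamental.

```python
SAMPLE_RATE = 22050   # Hz, assumed
STEP_SIZE = 8         # input samples per spectrogram pixel

# Pixel positions read off the two spectrograms.
MARKS = {
    "Voxin":  {"wi": 98,  "with the": 303, "phrase": 1026},
    "Espeak": {"wi": 169, "with the": 463, "phrase": 1095},
}


def px_to_ms(pixels):
    """Convert a pixel count to milliseconds of audio."""
    return pixels * STEP_SIZE / SAMPLE_RATE * 1000.0


def segment_durations(marks):
    """Durations in ms of each part of the phrase for one synthesizer."""
    return {
        "wi": px_to_ms(marks["wi"]),
        "with the": px_to_ms(marks["with the"]),
        # Everything after "with the", including any silence gap.
        "pro golfer": px_to_ms(marks["phrase"] - marks["with the"]),
    }


def periods_to_ms(periods, f0_hz):
    """Length in ms of a number of fundamental periods at pitch f0_hz."""
    return periods / f0_hz * 1000.0


for name, marks in MARKS.items():
    for segment, ms in segment_durations(marks).items():
        print(f"{name:6s} {segment:10s} {ms:6.1f} ms")

# Hypothetical 120 Hz fundamental, only to show the scale of the "g".
print("g, 1.7 periods:", round(periods_to_ms(1.7, 120.0), 1), "ms")
print("g, 3 periods:  ", round(periods_to_ms(3, 120.0), 1), "ms")
```

Run this way, Voxin's "with the" comes out at roughly 110 ms against
roughly 168 ms for Espeak, while the remainder of the phrase is longer
in Voxin, which matches what I see in the pictures.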

That's all I've found out for now.  I'm hoping that over the next
weeks to months I'll be able to begin diving into the espeak code and
see if I can figure out how to tweak things.  Now that I can begin to
see what's slowing down Espeak, I'm rather excited to have a try at
speeding it up.

Thanks,
Bill


