Development Page--Not for Official Use
The voice modem protocol described here is based on documents describing the voice protocol for Rockwell voice modem chipsets. Pm hasn't been able to find the original document he used to develop the voiceinfo software, but another version is available at http://www.zoltrix.com/PUBLIC/MODEM/ATmanual/ATVROCK.HTM. A bound dead-tree version of this manual is available somewhere in the DNR IT library.
After some googling, I found the manual "AT Commands for RCV56ACx, RCV336ACx, RCV288ACx, and RCV144ACx Modems" (375K PDF). I believe it is a more recent version of the manual above.
A reminder when reading the manual: DTE stands for "Data Terminal Equipment" and represents the computer connected to the voice modem (running voiceinfo, in this case). DCE stands for "Data Communication Equipment" and is the voicemodem itself.
To play audio out the voice modem, the computer (DTE) does the following:
If the voice modem detects a touch-tone (DTMF) signal while the audio is playing, it reports the tone to the computer as a <DLE> character followed by a byte identifying the specific DTMF signal.
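As a rough illustration of the DTMF reporting described above, here is a minimal Python sketch that scans bytes read from the modem for <DLE>-prefixed codes. The function name `extract_dtmf` is hypothetical and is not part of voiceinfo. The sketch assumes that the byte after <DLE> is the ASCII character of the key pressed (0-9, *, #, A-D) and that <DLE><DLE> stands for a literal DLE byte; both assumptions should be checked against the manual.

```python
DLE = 0x10  # ASCII Data Link Escape

# Assumption: the modem reports each key as its ASCII character.
DTMF_KEYS = b"0123456789*#ABCD"


def extract_dtmf(data):
    """Return the DTMF keys reported in a chunk of bytes from the modem."""
    keys = []
    i = 0
    while i < len(data) - 1:
        if data[i] == DLE:
            code = data[i + 1]
            if code in DTMF_KEYS:
                keys.append(chr(code))
            # Skip both bytes of the pair, so <DLE><DLE> is not re-read
            # as the start of a new code.
            i += 2
        else:
            i += 1
    return keys
```

A <DLE> that arrives as the last byte of a chunk is ignored here; a real reader would need to carry it over to the next read.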
Ideally, voiceinfo (PlayPhrases) should flush the serial port buffer and send the abort command <DLE><CAN> ("\x10\x18", I think) to the modem so that it stops playing the audio immediately. If voiceinfo can clear the playback buffers as soon as it detects a DTMF tone, there is no need for the delay between packets, and stuttering will no longer be a problem.
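The abort idea might look something like the following sketch. It is not the current voiceinfo code: `abort_playback` is a hypothetical helper, and `port` is assumed to be an object with pySerial-style `reset_output_buffer()`, `write()`, and `flush()` methods (for example a `serial.Serial` instance). Whether <DLE><CAN> is the right abort sequence for a particular chipset still needs to be confirmed against the manual.

```python
DLE = b"\x10"  # ASCII Data Link Escape
CAN = b"\x18"  # ASCII Cancel
ABORT_PLAYBACK = DLE + CAN


def abort_playback(port):
    """Drop queued audio and ask the modem to stop playing immediately."""
    # Discard audio bytes still waiting in the computer's output buffer.
    port.reset_output_buffer()
    # Send the abort sequence and push it out to the modem right away.
    port.write(ABORT_PLAYBACK)
    port.flush()
```

PlayPhrases could call this as soon as it sees a DTMF code in the bytes coming back from the modem, instead of relying on a delay between audio packets.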