



Full system


Festival offers a general framework for building speech synthesis systems, as well as examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and through an Emacs interface. Festival is multi-lingual (currently British and American English, and Spanish), though English is the most advanced. Tools and documentation for building new voices are available through Carnegie Mellon's FestVox project.

  • Last update: 2015/01/06
  • Reference:
     title        = {The Festival speech synthesis system, version 1.4.2},
     author       = {Black, Alan and Taylor, Paul and Caley, Richard and Clark, Rob and Richmond, Korin and King, Simon and Strom, Volker and Zen, Heiga},
     journal      = {Unpublished document available via},
     year         = {2001}


FreeTTS is a speech synthesis system written entirely in the Java programming language. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.

  • Last update: 2009-03-09
  • Reference:
     title        = {FreeTTS 1.2: A speech synthesizer written entirely in the Java programming language},
     author       = {Walker, Willie and Lamere, Paul and Kwok, Philip},
     year	       = {2010}


The aim of the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to obtain a set of diphone-based speech synthesizers for as many languages as possible, and provide them free for non-commercial applications.

  • Last update:
  • Reference:
     title        = {The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes},
     author       = {Dutoit, Thierry and Pagel, Vincent and Pierret, Nicolas and Bataille, Fran{\c{c}}ois and Van der Vrecken, Olivier},
     booktitle    = {Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on},
     volume       = {3},
     pages	       = {1393--1396},
     year	       = {1996},
     organization = {IEEE}


MARY is a multi-lingual (German, English, Tibetan) and multi-platform (Windows, Linux, Mac OS X and Solaris) speech synthesis system. It comes with an easy-to-use installer - no technical expertise should be required for installation. It enables expressive speech synthesis, using both diphone and unit-selection synthesis.

  • Last update: 2017/09/26
  • Reference:
     title        = {The German text-to-speech synthesis system MARY: A tool for research, development and teaching},
     author       = {Schr{\"o}der, Marc and Trouvain, J{\"u}rgen},
     journal      = {International Journal of Speech Technology},
     volume       = {6},
     number       = {4},
     pages	       = {365--377},
     year	       = {2003},
     publisher    = {Springer}

Front end (NLP part)

Front end including G2P


(Si)mply a (Re)search front-end for Text-To-Speech Synthesis. This is a research front-end for TTS. It is incomplete, inconsistent, badly coded and slow. But it is useful for me and should slowly develop into something useful to others.

  • Last update: 2016/10/11


This repository contains scripts suitable for training, evaluating and using grapheme-to-phoneme models for speech recognition using the OpenFst framework. The current build requires OpenFst version 1.6.0 or later, and the examples below use version 1.6.2.

The repository includes C++ binaries suitable for training, compiling, and evaluating G2P models. It also includes some simple Python bindings, which may be used to extract individual multigram scores and alignments, and to dump the raw lattices in .fst format for each word.

  • Last update: 2017/09/17


Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. Work on it started with funding from the EU FP7 Project Simple4All, and this repository contains a version which is considerably more up-to-date than that previously available. In particular, the original version of the toolkit relied on HTS to perform acoustic modelling. Although it is still possible to use HTS, it now supports the use of neural nets trained with the Merlin toolkit as duration and acoustic models. All comments and feedback about ways to improve it are very welcome.

  • Last update: 2017/09/15


The SALB system is a software framework for speech synthesis using HMM-based voice models built by HTS.

The package currently includes:

A C++ framework that abstracts the backend functionality and provides a SAPI5 interface, a command line interface and a C++ API.

Backend functionality is provided by

  • an internal text analysis module for (Austrian) German,
  • flite as text analysis module for English and
  • htsengine for parameter generation/synthesis. (see COPYING for information on 3rd party libraries)

Also included is an Austrian German male voice model.

  • Last update: 2016/11/14

Sequence-to-Sequence G2P toolkit

The tool does Grapheme-to-Phoneme (G2P) conversion using a recurrent neural network (RNN) with long short-term memory (LSTM) units. LSTM sequence-to-sequence models have been successfully applied to various tasks, including machine translation [1] and grapheme-to-phoneme conversion [2].

This implementation is based on Python TensorFlow, which allows efficient training on both CPUs and GPUs.

  • Last update: 2017/03/28

Text normalization


Sparrowhawk is an open-source implementation of Google's Kestrel text-to-speech text normalization system. It follows the discussion of the Kestrel system as described in:

Ebden, Peter and Sproat, Richard. 2015. The Kestrel TTS text normalization system. Natural Language Engineering, Issue 03, pp 333-353.

After sentence segmentation (sentenceboundary.h), the individual sentences are first tokenized, with each token being classified, and then passed to the normalizer. The system can output an unannotated string of words; richer annotation with links between input tokens, their input string positions, and the output words is also available.
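To make the flow concrete, here is a minimal Python sketch of the tokenize → classify → verbalize pipeline described above. The token classes and verbalizers are hypothetical toy stand-ins; Sparrowhawk itself is C++ and uses weighted FST grammars per semiotic class.

```python
import re

# Toy semiotic-class detector: digits become "cardinal", everything else "word".
def classify(token):
    if re.fullmatch(r"\d+", token):
        return "cardinal"
    return "word"

UNITS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

# Toy verbalizer: cardinals are read digit-by-digit for simplicity.
def verbalize(token, cls):
    if cls == "cardinal":
        return " ".join(UNITS[int(d)] for d in token)
    return token

# Normalize one already-segmented sentence: tokenize, classify, verbalize.
def normalize(sentence):
    tokens = sentence.split()
    return " ".join(verbalize(t, classify(t)) for t in tokens)
```

Under this toy verbalizer, normalize("room 42") yields "room four two"; a Kestrel-style system would instead select among context-dependent readings (e.g. "forty-two") via its grammars.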

  • Last update: 2017/07/25


This is the README for the Automatic Speech Recognition Tools.

This project contains various scripts that facilitate the preparation of ASR-related tasks.

Current tasks are:

  1. Sentence extraction from PDF files
  2. Sentence classification by language
  3. Sentence filtering and cleaning

Sentences can be extracted in single-document or batch mode.

For examples of how to extract sentences in batch mode or in single-document mode, please have a look at the scripts located in the examples/bash directory.

There is also an API to be used in Python code; it is located in the common package.

  • Last update: 2017/09/20

Dictionary related tools

CMU Pronunciation Dictionary Tools

Tools for working with the CMU Pronunciation Dictionary

  • Last update: 2015/02/23
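As a minimal illustration of the plain-text format such tools operate on, the Python sketch below (an assumption about the standard CMUdict entry layout, not part of the CMU tools themselves) parses entries of the form `WORD  PH1 PH2 ...` and strips the numeric stress markers:

```python
def parse_cmudict_line(line):
    """Parse one CMUdict-format entry, e.g. 'HELLO  HH AH0 L OW1'.
    Returns (word, phones) with digit stress markers stripped."""
    word, *phones = line.split()
    # Drop alternate-pronunciation suffixes like 'HELLO(2)'.
    if "(" in word:
        word = word[:word.index("(")]
    phones = [p.rstrip("012") for p in phones]
    return word, phones

def load_dictionary(lines):
    """Build a word -> list-of-pronunciations mapping, skipping ';;;' comments."""
    entries = {}
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        word, phones = parse_cmudict_line(line)
        entries.setdefault(word, []).append(phones)
    return entries
```

For example, the two entries "HELLO  HH AH0 L OW1" and "HELLO(2)  HH EH0 L OW1" collapse to a single headword with two stress-free pronunciations.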

ISS scripts for dictionary maintenance

These scripts are sufficient to convert the distributed forms of dictionaries into forms useful for our tools (notably HTK and ISS). Once a dictionary is in a standard form, the generic tools in ISS can be used to manipulate it further.

  • Last update: 2017/07/04

Backend (Acoustic part)

Unit selection

HMM based


MAGE is a C/C++ software toolkit for reactive implementation of HMM-based speech and singing synthesis.

  • Last update: 2014/07/18

HMM-Based Speech Synthesis System (HTS)

The basic core system of HTS, available from NITECH, is implemented as a modified version of HTK together with SPTK (see below), and is released as the HMM-Based Speech Synthesis System (HTS) in the form of a patch to HTK.

  • Last update: 2016/12/25

HTS Engine

htsengine is a small run-time synthesis engine (less than 1 MB including acoustic models) which can run without the HTK library. The current version does not include any text analyzer, but the Festival Speech Synthesis System can be used as a text analyzer.

  • Last update: 2015/12/25

DNN based


Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).

The system is written in Python and relies on the Theano numerical computation library.

Merlin comes with recipes (in the spirit of the Kaldi automatic speech recognition toolkit) to show you how to build state-of-the-art systems.

  • Last update: 2017/09/29
  • Reference:
     title	       = {Merlin: An open source neural network speech synthesis system},
     author       = {Wu, Zhizheng and Watts, Oliver and King, Simon},
     journal      = {Proc. SSW, Sunnyvale, USA},
     year	       = {2016}


Idlak is a project to build an end-to-end parametric TTS system within Kaldi, to be distributed with the same licence.

It contains a robust front-end, voice building tools, speech analysis utilities, and DNN tools suitable for parametric synthesis. It also contains an example of using Idlak as an end-to-end TTS system, in egs/ttsdnnarctic/s1.

Note that the Kaldi structure has been maintained and the tool-building procedure is identical.

  • Last update: 2017/07/03
  • Reference:
     title	       = {Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser Based on DNN.},
     author       = {Potard, Blaise and Aylett, Matthew P and Baude, David A and Motlicek, Petr},
     booktitle    = {INTERSPEECH},
     pages	       = {2293--2297},
     year	       = {2016}

Wavenet based


Signal processing

Vocoder, Glottal modelling


STRAIGHT is a tool for flexibly manipulating voice quality, timbre, pitch, speed and other attributes. It is a continually evolving system for attaining better sound quality, close to the original natural speech, by introducing advanced signal processing algorithms and findings in computational aspects of auditory processing.

STRAIGHT decomposes sounds into source information and resonator (filter) information. This conceptually simple decomposition makes it easy to conduct experiments on speech perception using STRAIGHT, the initial design objective of this tool, and to interpret experimental results in terms of the huge body of classical studies.

  • Last update:


WORLD is free software for high-quality speech analysis, manipulation and synthesis. It can estimate fundamental frequency (F0), aperiodicity and the spectral envelope, and can also resynthesize speech close to the input using only the estimated parameters.

This source code is released under the modified-BSD license. None of the algorithms in WORLD is covered by patents.

  • Last update: 2017/08/23

Covarep - A Cooperative Voice Analysis Repository for Speech Technologies

Covarep is an open-source repository of advanced speech processing algorithms, stored as a GitHub project, where researchers in speech processing can store original implementations of published algorithms.

Over the past few decades a vast array of advanced speech processing algorithms have been developed, often offering significant improvements over the existing state-of-the-art. Such algorithms can have a reasonably high degree of complexity and, hence, can be difficult to accurately re-implement based on article descriptions. Another issue is the so-called 'bug magnet effect' with re-implementations frequently having significant differences from the original ones. The consequence of all this has been that many promising developments have been under-exploited or discarded, with researchers tending to stick to conventional analysis methods.

By developing Covarep we are hoping to address this by encouraging authors to include original implementations of their algorithms, thus resulting in a single de facto version for the speech community to refer to.

  • Last update: 2016/10/16
  • Reference:
     title	       = {COVAREP: A Cooperative Voice Analysis Repository for Speech Technologies},
     author       = {Degottex, Gilles},
     year	       = {2014}

MagPhase Vocoder

Speech analysis/synthesis system for TTS and related applications.

This software is based on the method described in the paper:

F. Espic, C. Valentini-Botinhao, and S. King, "Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis," in Proc. Interspeech, Stockholm, Sweden, August 2017.

  • Last update: 2017/08/30
  • Link:
  • Reference:
     title        = {Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and King, Simon},
     journal      = {Proc. Interspeech, Stockholm, Sweden},
     year         = {2017}


Waveform generator based on signal reshaping for statistical parametric speech synthesis.

  • Last update: 2017/08/30
  • Reference:
     title	       = {Waveform Generation Based on Signal Reshaping for Statistical Parametric Speech Synthesis.},
     author       = {Espic, Felipe and Valentini-Botinhao, Cassia and Wu, Zhizheng and King, Simon},
     booktitle    = {INTERSPEECH},
     pages	       = {2263--2267},
     year	       = {2016}

Pulse model analysis and synthesis

It is basically the vocoder described in:

G. Degottex, P. Lanchantin, and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer," in Proc. 9th Speech Synthesis Workshop (SSW9), 2016.

  • Last update: 2017/09/07
  • Link:
  • Reference:
     title        = {A pulse model in log-domain for a uniform synthesizer},
     author       = {Degottex, Gilles and Lanchantin, Pierre and Gales, Mark},
     year         = {2016},
     publisher    = {International Speech Communication Association}

YANG VOCODER: Yet-ANother-Generalized VOCODER

Yet another vocoder that is not STRAIGHT.

This project is a state-of-the-art vocoder that parameterizes the speech signal into a representation amenable to statistical manipulation.

The vocoder was developed by Hideki Kawahara during his internship at Google.

  • Last update: 2017/01/02


Ahocoder parameterizes speech waveforms into three different streams: log-f0, cepstral representation of the spectral envelope, and maximum voiced frequency. It provides high accuracy during analysis and high quality during reconstruction. It is adequate for statistical parametric speech synthesis and voice conversion. Furthermore, it can be used just for basic speech manipulation and transformation (pitch level and variance, speaking rate, vocal tract length…).

Ahocoder is reported to be a very good complement for HTS. The output files generated by Ahocoder contain float numbers without a header, so they are fully compatible with the HTS demo scripts on the HTS website. You can use the same configuration as in the STRAIGHT-based demo, using the "bap" stream to handle maximum voiced frequency (set its dimension to 1 both in data/Makefile and in scripts/).
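Because the parameter files are headerless streams of floats, they are easy to read from any language. The Python sketch below assumes 32-bit little-endian floats and a known stream dimension; check the byte order and dimension used by your own setup, as neither is stated here.

```python
import struct

def read_headerless_params(path, dim=1):
    """Read a headerless stream of 32-bit little-endian floats
    (e.g. an lf0 or bap parameter file) and group it into frames
    of `dim` values each."""
    with open(path, "rb") as f:
        data = f.read()
    n = len(data) // 4  # number of complete float32 values
    values = struct.unpack("<%df" % n, data[:4 * n])
    return [values[i:i + dim] for i in range(0, n, dim)]
```

For a 1-dimensional stream such as log-F0, each returned frame is a 1-tuple; for a 40-dimensional cepstral stream, pass dim=40.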

  • Last update: 2014

PhonVoc: Phonetic and Phonological vocoding

This is a computational platform for phonetic and phonological vocoding, released under the BSD licence; see the file COPYING for details. The software is based on Kaldi (v. 489a1f5) and Idiap SSP. For training of the analysis and synthesis models, please follow train/README.txt.

  • Last update: 2016/11/23

Pitch extractor

REAPER: Robust Epoch And Pitch EstimatoR

This is a speech processing system. The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCIs), the voicing state (voiced or unvoiced) and the fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCIs.
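That definition of local F0 can be sketched directly (a toy illustration of the formula, not REAPER's EpochTracker code):

```python
def f0_from_epochs(epoch_times):
    """Instantaneous F0 as the inverse of the interval between
    successive glottal closure instants (epoch times in seconds)."""
    return [1.0 / (t2 - t1)
            for t1, t2 in zip(epoch_times, epoch_times[1:])]
```

Epochs spaced 10 ms apart thus give 100 Hz; shrinking the interval to 5 ms doubles the local F0 to 200 Hz.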

This code was developed by David Talkin at Google. This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

  • Last update: 2015/03/04

SSP - Speech Signal Processing module

SSP is a package for doing signal processing in Python; the functionality is biased towards speech signals. Top-level programs include a feature extractor for speech recognition, and a vocoder for both coding and speech synthesis. The vocoder is based on linear prediction, but with several experimental excitation models. A continuous pitch extraction algorithm is also provided, built around standard components and a Kalman filter.
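As a reminder of what a linear-prediction vocoder computes per frame, here is a pure-Python sketch of the autocorrelation method with the Levinson-Durbin recursion (an illustration of the general technique, not SSP's implementation):

```python
def lpc(signal, order):
    """Linear-prediction coefficients via the autocorrelation method
    and the Levinson-Durbin recursion. Returns (a, e) where a[0] == 1
    and e is the residual prediction-error energy."""
    n = len(signal)
    # Autocorrelation up to lag `order`.
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    e = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for this order.
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / e
        # Update the coefficient vector.
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e
```

For a decaying exponential x[n] = 0.5**n, a first-order fit recovers a predictor coefficient close to -0.5, i.e. the model x[n] ≈ 0.5·x[n-1].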

There is a "sister" package, libssp, that includes translations of some algorithms into C++. Libssp is built around libube, which makes this translation easier.

SSP is released under a BSD licence. See the file COPYING for details.

  • Last update: 2017/04/16

Diverse useful tools

SPTK - Speech Signal Processing Toolkit

The main feature of the Speech Signal Processing Toolkit, available from NITECH, is that it provides not only standard speech analysis and synthesis techniques (e.g., LPC analysis, PARCOR analysis, LSP analysis, PARCOR synthesis filters, LSP synthesis filters, and vector quantization) but also speech analysis and synthesis techniques developed by the research group, all of which can easily be used.

  • Last update: 2016/12/25

Singing synthesizer


Sinsy is an HMM-based singing voice synthesis system.

  • Last update: 2015/12/25

Ebook reader

Bard Storyteller ebook reader

Bard Storyteller is a text reader. Bard not only allows a user to read books, but can also read books to the user using text-to-speech. It supports txt, epub and (x)html files.

  • Last update: 2014/07

Various tools


Matlab realtime speech tools and voice production tools

  • Last update: 2017/06/29

Articulatory synthesizer

KLAIR - A virtual infant for spoken language acquisition research

The KLAIR project aims to build and develop a computational platform to assist research into the acquisition of spoken language. The main part of KLAIR is a sensori-motor server that displays a virtual infant on screen that can see, hear and speak. Behind the scenes, the server can talk to one or more client applications. Each client can monitor the audio visual input to the server and can send articulatory gestures to the head for it to speak through an articulatory synthesizer. Clients can also control the position of the head and the eyes as well as setting facial expressions. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.

  • Last update:
  • Reference:
     title	       = {KLAIR: a virtual infant for spoken language acquisition research.},
     author       = {Huckvale, Mark and Howard, Ian S and Fagel, Sascha},
     booktitle    = {INTERSPEECH},
     pages	       = {696--699},
     year	       = {2009}


VocalTractLab stands for "Vocal Tract Laboratory" and is an interactive multimedia software tool to demonstrate the mechanism of speech production. It is meant to facilitate an intuitive understanding of speech production for students of phonetics and related disciplines.

The current versions of VocalTractLab are free of charge. Only a registration code, which you can request by email, is necessary to activate the software. VocalTractLab is written for Windows operating systems (XP or higher), but a port to Linux/Unix is conceivable in the future.

  • Last update: 2016

Visualization & annotation tools


Praat is a system for doing phonetics by computer. The computer program Praat is a research, publication, and productivity tool for phoneticians. With it, you can analyse, synthesize, and manipulate speech, and create high-quality pictures for your articles and thesis.

  • Last update:
  • Reference:
     title        = {Praat: doing phonetics by computer},
     author       = {Boersma, Paul},
     year         = {2006}


KPE provides a graphical interface for the implementation of the Klatt 1980 formant synthesiser. The interface allows users to display and edit Klatt parameters using a graphical display which includes the time-amplitude waveform of both the original speech and its synthetic copy, and some signal analysis facilities.

  • Last update:
  • Reference:


WaveSurfer is a tool for doing speech analysis. The analysis features include formant and pitch extraction and real-time spectrograms. The WaveSurfer tool, built on top of the Snack speech visualization module, is highly modular and extensible at several levels.

  • Last update:
  • Reference:

SynSIG is a Special Interest Group of ISCA, the International Speech Communication Association.

SynSIG 1998-2017