Commit 682d47c
Adding an Iban recipe
1 parent 9e8ff73 commit 682d47c

23 files changed: +978 −3 lines changed

egs/iban/README

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
###
# Iban data collected by Sarah Samson Juan and Laurent Besacier
# Prepared by Sarah Samson Juan and Laurent Besacier
# Created in GETALP, Grenoble, France
###


## INTRODUCTION ##
This package contains Iban text and speech corpora used for Automatic Speech Recognition (ASR) experiments. The data is available in the subdirectories of /data:
a. train - training transcripts for training an ASR system with Kaldi (http://kaldi.sourceforge.net/)
b. test - test transcripts for evaluating the ASR system (also in Kaldi format)
c. wav - speech corpus

The text corpus and language model are provided in the /LM directory, and the pronunciation dictionary in the /lang directory.

### PUBLICATIONS ON IBAN DATA AND ASR ###
Details on the corpora and our experiments on Iban ASR can be found in the following publications. We would appreciate it if you cite them when publishing work based on this data.
@inproceedings{Juan14,
  Author = {Sarah Samson Juan and Laurent Besacier and Solange Rossato},
  Booktitle = {Proceedings of the Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU)},
  Month = {May},
  Title = {Semi-supervised G2P bootstrapping and its application to ASR for a very under-resourced language: Iban},
  Year = {2014}}

@inproceedings{Juan2015,
  Title = {Using resources from a closely-related language to develop ASR for a very under-resourced language: A case study for Iban},
  Author = {Sarah Samson Juan and Laurent Besacier and Benjamin Lecouteux and Mohamed Dyab},
  Booktitle = {Proceedings of INTERSPEECH},
  Year = {2015},
  Address = {Dresden, Germany},
  Month = {September}}

### IBAN SPEECH CORPUS ###
News data provided by a local radio station in Sarawak, Malaysia.

Directory: data/train
Files: text (training transcript), wav.scp (file id and path to audio file), utt2spk (file id and speaker id), spk2utt (speaker id and file ids), wav (.wav files).
For more information about the format, please refer to the Kaldi website: http://kaldi.sourceforge.net/data_prep.html
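
As a sketch of what these files look like (the utterance and speaker ids below are hypothetical, following the corpus naming scheme), and of how spk2utt relates to utt2spk:

```shell
# Hypothetical Kaldi data-dir entries; real ids come from the corpus.
cat > utt2spk <<'EOF'
ibf_001_001 ibf_001
ibf_001_002 ibf_001
ibm_002_001 ibm_002
EOF

# spk2utt is just utt2spk inverted: one speaker per line, followed by all
# of that speaker's utterance ids. Kaldi ships utils/utt2spk_to_spk2utt.pl
# for this; plain awk does the same thing.
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' \
  utt2spk | sort > spk2utt

cat spk2utt
```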
Description: training data in Kaldi format, about 7 hours. Note: the paths of the wav files in wav.scp MUST BE MODIFIED to point to their actual location.

Directory: data/test
Files: text (test transcript), wav.scp (file id and path to audio file), utt2spk (file id and speaker id), spk2utt (speaker id and file ids), wav (.wav files).
Description: test data in Kaldi format, about 1 hour. Note: the paths of the wav files in wav.scp MUST BE MODIFIED to point to their actual location.
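
A minimal sketch of fixing those paths with sed (both the stale prefix /home/sarah/iban and the new one are hypothetical; substitute whatever prefixes your copy of wav.scp actually contains):

```shell
# Hypothetical wav.scp entries with stale paths.
cat > wav.scp <<'EOF'
ibf_001_001 /home/sarah/iban/data/wav/ibf_001_001.wav
ibm_002_001 /home/sarah/iban/data/wav/ibm_002_001.wav
EOF

# Rewrite the path prefix to point at the corpus location on this machine
# (GNU sed; on BSD/macOS use: sed -i '' ...).
sed -i 's|/home/sarah/iban|/my/path/to/iban|' wav.scp

cat wav.scp
```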

The audio files are named:
ib[m|f]_SPK_UTT, where m refers to a male and f to a female speaker, SPK denotes the speaker id and UTT the utterance id.
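
As a sketch, this naming convention can be unpacked with plain shell parameter expansion (the id ibf_003_017 below is made up for illustration):

```shell
utt=ibf_003_017

prefix=${utt%%_*}   # "ibf": corpus tag plus gender letter
gender=${prefix#ib} # "f": m = male, f = female
rest=${utt#*_}      # "003_017"
spk=${rest%%_*}     # "003": speaker id
uttid=${rest#*_}    # "017": utterance id

echo "$gender $spk $uttid"
```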

#### IBAN TEXT CORPUS ####
Directory: /LM/
Files: iban-bp-2012.txt, iban-lm-o3.arpa

# /iban-bp-2012.txt
Contains 2M words. Full text data crawled from an online newspaper and cleaned as much as we could.

# /iban-lm-o3.arpa
The language model built with SRILM (http://www.speech.sri.com/projects/srilm/) using iban-bp-2012.txt.


#### LEXICON/PRONUNCIATION DICTIONARY ####
Directory: /lang
Files: lexicon.txt (lexicon), nonsilence_phones.txt (speech phones), optional_silence.txt (silence phone)
Description: the lexicon contains words and their respective pronunciations, plus non-speech sounds and noise, in Kaldi format. Details on the development of the dictionary can be found in our papers. (For this package, we provide the Iban-Hybrid version.)


# TO DOWNLOAD THE REPOSITORY #

svn co https://github.com/sarahjuan/iban

### SCRIPTS ###
In /kaldi-scripts you can find all the scripts used to train and test models from the existing data and lang directories. Note: paths need to be changed to make them work in your own directory.

You can launch run.sh to prepare the data and language model, compute MFCCs and train the acoustic models.


## WER RESULTS OBTAINED USING OUR CORPORA AND SETTINGS ##
The results below were obtained after updating the test transcript; the ones reported in our papers were obtained before this update.

See the latest results in the s5/RESULTS file (they will not match the results from the papers).

## ACKNOWLEDGEMENT ##
We would like to thank the Ministry of Higher Education Malaysia for providing financial support to conduct this study. We also thank The Borneo Post news agency for providing online materials for building the text corpus, and Radio Televisyen Malaysia (RTM), Sarawak, Malaysia, for providing the news data.

egs/iban/s5/RESULTS

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
%WER 15.32 [ 1686 / 11006, 220 ins, 338 del, 1128 sub ] exp/sgmm2_5b2/decode_dev.big/wer_18_0.0
%WER 15.36 [ 1691 / 11006, 214 ins, 322 del, 1155 sub ] exp/nnet3/nnet_tdnn_h_sp_4_850_170/decode_dev.big/wer_18_0.0
%WER 15.50 [ 1706 / 11006, 212 ins, 327 del, 1167 sub ] exp/nnet3/nnet_tdnn_h_sp_4_850_170/decode_dev.rescored/wer_18_0.0
%WER 15.84 [ 1743 / 11006, 242 ins, 332 del, 1169 sub ] exp/sgmm2_5b2/decode_dev.rescored/wer_15_0.0
%WER 17.45 [ 1921 / 11006, 252 ins, 326 del, 1343 sub ] exp/nnet3/nnet_tdnn_h_sp_4_850_170/decode_dev/wer_15_0.0
%WER 17.55 [ 1932 / 11006, 266 ins, 323 del, 1343 sub ] exp/sgmm2_5b2/decode_dev/wer_13_0.0
%WER 19.08 [ 2100 / 11006, 245 ins, 503 del, 1352 sub ] exp/tri3b/decode_dev.rescored/wer_20_0.0
%WER 20.92 [ 2302 / 11006, 263 ins, 518 del, 1521 sub ] exp/tri3b/decode_dev/wer_19_0.0
%WER 24.19 [ 2662 / 11006, 243 ins, 900 del, 1519 sub ] exp/tri2b/decode_dev.rescored/wer_14_0.0
%WER 25.26 [ 2780 / 11006, 294 ins, 736 del, 1750 sub ] exp/tri3b/decode_dev.si/wer_16_0.0
%WER 26.44 [ 2910 / 11006, 292 ins, 832 del, 1786 sub ] exp/tri2b/decode_dev/wer_13_0.0
%WER 30.99 [ 3411 / 11006, 245 ins, 1391 del, 1775 sub ] exp/tri1/decode_dev.rescored/wer_12_0.0
%WER 33.31 [ 3666 / 11006, 260 ins, 1428 del, 1978 sub ] exp/tri1/decode_dev/wer_12_0.0
%WER 33.81 [ 3721 / 11006, 241 ins, 1585 del, 1895 sub ] exp/tri2a/decode_dev.rescored/wer_11_0.0
%WER 35.69 [ 3928 / 11006, 243 ins, 1750 del, 1935 sub ] exp/tri2a/decode_dev/wer_12_0.0
%WER 39.41 [ 4338 / 11006, 190 ins, 1237 del, 2911 sub ] exp/mono/decode_dev/wer_11_0.0
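
Each line follows the standard Kaldi scoring format: %WER, the error percentage, then [ errors / words, ins, del, sub ] and the decode directory. A quick way to rank systems in such a file (the sample reuses three lines from above):

```shell
# A few result lines copied from the RESULTS file.
cat > RESULTS.sample <<'EOF'
%WER 15.32 [ 1686 / 11006, 220 ins, 338 del, 1128 sub ] exp/sgmm2_5b2/decode_dev.big/wer_18_0.0
%WER 39.41 [ 4338 / 11006, 190 ins, 1237 del, 2911 sub ] exp/mono/decode_dev/wer_11_0.0
%WER 20.92 [ 2302 / 11006, 263 ins, 518 del, 1521 sub ] exp/tri3b/decode_dev/wer_19_0.0
EOF

# Sort numerically on the WER field (column 2); the first line is the
# best-performing system.
sort -k2,2n RESULTS.sample | head -n1
```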

egs/iban/s5/cmd.sh

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
export train_cmd="run.pl --max-jobs-run 32"
export decode_cmd="run.pl --max-jobs-run 32"

egs/iban/s5/conf/decode.config

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Use wider-than-normal decoding beams (settings carried over from the RM recipe).
first_beam=16.0
beam=20.0
lattice_beam=10.0

egs/iban/s5/conf/decode_dnn.config

Whitespace-only changes.

egs/iban/s5/conf/mfcc.conf

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
--use-energy=false # the only non-default option.

egs/iban/s5/conf/mfcc_hires.conf

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated), which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=20 # low cutoff frequency for mel bins... this is high-bandwidth data, so
              # there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to the Nyquist rate of 8000 (= 7600)

egs/iban/s5/conf/online_cmvn.conf

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# configuration file for apply-cmvn-online, used in the script ../local/run_online_decoding.sh

egs/iban/s5/local/arpa2G.sh

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
#!/bin/bash
# Copyright 2013-2014 Johns Hopkins University (authors: Yenda Trmal, Daniel Povey)

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.

# Simple utility script to convert the gzipped ARPA lm into a G.fst file.

oov_prob_file=
unk_fraction=
cleanup=true
# end configuration section.

echo $0 $@

[ -f ./path.sh ] && . ./path.sh
[ -f ./cmd.sh ] && . ./cmd.sh
. parse_options.sh || exit 1;

if [ $# -ne 3 ]; then
  echo "Usage: $0 [options] <arpa-lm-file> <lang-dir> <dest-dir>"
  echo "Options: --oov-prob-file <oov-prob-file>  # e.g. data/local/oov2prob"
  echo "  # with this option it will replace <unk> with OOVs in G.fst."
  exit 1;
fi

set -e          # Exit on non-zero return code from any command
set -o pipefail # Exit if any of the commands in the pipeline returns
                # a non-zero return code

lmfile=$1
langdir=$2
destdir=$3

mkdir $destdir 2>/dev/null || true

if [ ! -z "$oov_prob_file" ]; then
  if [ ! -s "$oov_prob_file" ]; then
    echo "$0: oov-prob file $oov_prob_file does not exist"
    exit 1;
  fi
  if [ -z "$unk_fraction" ]; then
    echo "--oov-prob option requires --unk-fraction option";
    exit 1;
  fi

  min_prob=$(gunzip -c $lmfile | perl -e ' $minlogprob = 0.0;
    while(<STDIN>) { if (m/\\(\d)-grams:/) { $order = $1; }
      if ($order == 1) { @A = split;
        if ($A[0] < $minlogprob && $A[0] != -99) { $minlogprob = $A[0]; }}} print $minlogprob')
  echo "Minimum prob in LM file is $min_prob"

  echo "$0: creating LM file with unk words, using $oov_prob_file, in $destdir/lm_tmp.gz"
  gunzip -c $lmfile | \
    perl -e ' ($oov_prob_file,$min_prob,$unk_fraction) = @ARGV; $ceilinged=0;
      $min_prob < 0.0 || die "Bad min_prob"; # this is a log-prob
      $unk_fraction > 0.0 || die "Bad unk_fraction"; # this is a prob
      open(F, "<$oov_prob_file") || die "opening oov file";
      while (<F>) { push @OOVS, $_; }
      $num_oovs = @OOVS; # count of OOV entries just read
      while(<STDIN>) {
        if (m/^ngram 1=(\d+)/) { $n = $1 + $num_oovs; print "ngram 1=$n\n"; }
        else { print; } # print all lines unchanged except the one that says ngram 1=X.
        if (m/^\\1-grams:$/) {
          foreach $l (@OOVS) {
            @A = split(" ", $l);
            @A == 2 || die "bad line in oov2prob: $_;";
            ($word, $prob) = @A;
            $log10prob = (log($prob * $unk_fraction) / log(10.0));
            if ($log10prob > $min_prob) { $log10prob = $min_prob; $ceilinged++;}
            print "$log10prob $word\n";
          }
        }} print STDERR "Ceilinged $ceilinged unk-probs\n";' \
      $oov_prob_file $min_prob $unk_fraction | gzip -c > $destdir/lm_tmp.gz
  lmfile=$destdir/lm_tmp.gz
fi

if [[ $lmfile == *.bz2 ]] ; then
  decompress="bunzip2 -c $lmfile"
elif [[ $lmfile == *.gz ]] ; then
  decompress="gunzip -c $lmfile"
else
  decompress="cat $lmfile"
fi

$decompress | \
  grep -v '<s> <s>' | grep -v '</s> <s>' | grep -v '</s> </s>' | \
  arpa2fst - | \
  fstprint | \
  utils/eps2disambig.pl | \
  utils/s2eps.pl | \
  fstcompile --isymbols=$langdir/words.txt \
    --osymbols=$langdir/words.txt --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=olabel > $destdir/G.fst || exit 1
fstisstochastic $destdir/G.fst || true;

if $cleanup; then
  rm $destdir/lm_tmp.gz 2>/dev/null || true;
fi

exit 0
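
The perl one-liner in the script scans the \1-grams section for the most negative unigram log10 probability, ignoring the -99 back-off placeholder. The same scan on a toy ARPA fragment (the fragment below is made up; real LMs are gzipped and much larger):

```shell
# Toy ARPA fragment for illustration.
cat > toy.arpa <<'EOF'
\data\
ngram 1=3

\1-grams:
-1.5 foo
-99 <s>
-2.7 bar

\end\
EOF

# Track the current n-gram order; within the 1-grams section keep the
# most negative log10 prob, skipping the -99 placeholder.
awk '/\\[0-9]-grams:/ {order = substr($0, 2, 1)}
     order == 1 && NF >= 2 && $1 != -99 && $1 < min {min = $1}
     END {print min}' toy.arpa
```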
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
#!/bin/bash

## Script was adapted from WSJ and RM (some settings)

. ./cmd.sh
mfccdir=mfcc

stage=1

. ./path.sh
. ./utils/parse_options.sh

if [ $stage -le 1 ]; then
  for datadir in train; do
    utils/perturb_data_dir_speed.sh 0.9 data/${datadir} data/temp1
    utils/perturb_data_dir_speed.sh 1.1 data/${datadir} data/temp2
    utils/combine_data.sh data/${datadir}_tmp data/temp1 data/temp2
    utils/validate_data_dir.sh --no-feats data/${datadir}_tmp
    rm -r data/temp1 data/temp2

    mfccdir=mfcc_perturbed
    steps/make_mfcc.sh --cmd "$train_cmd" --nj 17 \
      data/${datadir}_tmp exp/make_mfcc/${datadir}_tmp $mfccdir || exit 1;
    steps/compute_cmvn_stats.sh data/${datadir}_tmp exp/make_mfcc/${datadir}_tmp $mfccdir || exit 1;
    utils/fix_data_dir.sh data/${datadir}_tmp

    utils/copy_data_dir.sh --spk-prefix sp1.0- --utt-prefix sp1.0- data/${datadir} data/temp0
    utils/combine_data.sh data/${datadir}_sp data/${datadir}_tmp data/temp0
    utils/fix_data_dir.sh data/${datadir}_sp
    rm -r data/temp0 data/${datadir}_tmp
  done
fi

mkdir -p exp/nnet3

if [ $stage -le 2 ]; then
  steps/align_fmllr.sh --nj 16 --cmd "$train_cmd" \
    data/train_sp data/lang exp/tri3b exp/nnet3/tri3b_ali_sp || exit 1
fi

mfccdir=mfcc_hires
if [ $stage -le 3 ]; then
  utils/copy_data_dir.sh data/train_sp data/train_hires || exit 1
  steps/make_mfcc.sh --nj 16 --mfcc-config conf/mfcc_hires.conf \
    --cmd "$train_cmd" data/train_hires exp/make_hires/train $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/train_hires exp/make_hires/train $mfccdir || exit 1;

  for datadir in dev; do
    utils/copy_data_dir.sh data/$datadir data/${datadir}_hires || exit 1
    steps/make_mfcc.sh --nj 6 --mfcc-config conf/mfcc_hires.conf \
      --cmd "$train_cmd" data/${datadir}_hires exp/make_hires/$datadir $mfccdir || exit 1;
    steps/compute_cmvn_stats.sh data/${datadir}_hires exp/make_hires/$datadir $mfccdir || exit 1;
  done
fi

if [ $stage -le 4 ]; then
  # Train a small system just for its LDA+MLLT transform. We use --num-iters 13
  # because after we get the transform (12th iter is the last), any further
  # training is pointless.
  steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
    --realign-iters "" --splice-opts "--left-context=3 --right-context=3" \
    5000 10000 data/train_hires data/lang \
    exp/nnet3/tri3b_ali_sp exp/nnet3/tri5b || exit 1
fi

if [ $stage -le 5 ]; then
  steps/online/nnet2/train_diag_ubm.sh --cmd "$train_cmd" --nj 16 --num-frames 200000 \
    data/train_hires 256 exp/nnet3/tri5b exp/nnet3/diag_ubm || exit 1
fi

if [ $stage -le 6 ]; then
  # even though $nj is just 10, each job uses multiple processes and threads.
  steps/online/nnet2/train_ivector_extractor.sh --cmd "$train_cmd" \
    --nj 10 --num-processes 1 --num-threads 2 --ivector-dim 50 \
    data/train_hires exp/nnet3/diag_ubm exp/nnet3/extractor || exit 1;
fi

if [ $stage -le 7 ]; then
  # having a larger number of speakers is helpful for generalization, and to
  # handle per-utterance decoding well (iVector starts at zero).
  steps/online/nnet2/copy_data_dir.sh --utts-per-spk-max 2 data/train_hires \
    data/train_hires_max2 || exit 1

  steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj 16 \
    data/train_hires_max2 exp/nnet3/extractor exp/nnet3/ivectors_train || exit 1
fi

if [ $stage -le 8 ]; then
  steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj 6 \
    data/dev_hires exp/nnet3/extractor exp/nnet3/ivectors_dev || exit 1
fi

exit 0;
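
Stage 1 above triples the training data by adding 0.9x and 1.1x speed-perturbed copies; utils/perturb_data_dir_speed.sh and utils/copy_data_dir.sh keep the copies distinct by prefixing every utterance and speaker id (sp0.9-, sp1.1-, sp1.0-). A sketch of that id bookkeeping, using made-up ids:

```shell
# Hypothetical utt2spk from the original training directory.
cat > utt2spk <<'EOF'
ibf_001_001 ibf_001
ibm_002_001 ibm_002
EOF

# Emulate the prefixing done for each perturbed copy, then combine:
# the merged directory holds three distinct entries per original utterance.
for factor in 0.9 1.0 1.1; do
  sed "s/^/sp${factor}-/; s/ / sp${factor}-/" utt2spk
done | sort > utt2spk_sp

wc -l < utt2spk_sp
```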
