A recipe is given in Demos/Speech for using HCopy to transform the TIMIT .wav files into .zda files, which are used as features for CNTK. A master label file, TimitLabels.mlf, is provided, but no recipe for generating it is given. This HTK MLF file is evidently derived from the TIMIT .PHN files.
For example, the TIMIT file SI1027.PHN begins:
0 1513 h#
1513 2994 q
2994 4440 iy
and the file TimitLabels.mlf begins:
0 200000 h#_s2 -136.655975 h# -589.680481 h#
200000 400000 h#_s3 -145.780716
400000 800000 h#_s4 -307.243774
800000 1200000 q_s2 -349.529327 q -897.429504 q
1200000 1500000 q_s3 -280.568817
1500000 1800000 q_s4 -267.331390
1800000 1900000 iy_s2 -76.825096 iy -673.892883 iy
1900000 2400000 iy_s3 -305.832458
2400000 2800000 iy_s4 -291.235352
The times are in units of 100 ns (as per the HTK standard). Dividing them by 10^5 gives the analysis frame number (frames are 10 ms long). They appear to be derived from the TIMIT PHN files by multiplying the times given there, in samples at 16 kHz, by 625 and rounding down to an exact frame boundary.
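The conversion described above can be sketched in a few lines of Python. The function name and constants are hypothetical and only restate the arithmetic (one 16 kHz sample is 62.5 µs, i.e. 625 units of 100 ns). Applied to the three end times in the SI1027.PHN excerpt, it gives 900000, 1800000 and 2700000, whereas the phone boundaries in the MLF excerpt are 800000, 1800000 and 2800000, so the rule should be read as approximate rather than as the exact procedure used.

```python
# Hypothetical helper: restates the "multiply by 625 and round down" rule.
HTK_UNITS_PER_SAMPLE = 625      # 1/16000 s = 62.5 us = 625 * 100 ns
HTK_UNITS_PER_FRAME = 100_000   # 10 ms = 100000 * 100 ns


def sample_to_htk_time(sample_index):
    """Convert a 16 kHz sample index to HTK time, floored to a frame boundary."""
    htk_time = sample_index * HTK_UNITS_PER_SAMPLE
    return (htk_time // HTK_UNITS_PER_FRAME) * HTK_UNITS_PER_FRAME


# End times (in samples) of h#, q and iy from the SI1027.PHN excerpt.
phn_end_samples = [1513, 2994, 4440]
htk_end_times = [sample_to_htk_time(s) for s in phn_end_samples]
print(htk_end_times)
```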
Each phone (h#, q, iy, etc.) is split into three parts, for example iy_s2, iy_s3, iy_s4. This is what gives the 183 labels used in the Speech demo (the TIMIT PHN files use 61 distinct phones, and 61 × 3 = 183).
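To make the three-way split concrete, here is a minimal sketch that parses the MLF excerpt quoted above, converts the times to frame numbers, and adds up the frames belonging to each phone. The function names are hypothetical, and the parser assumes only the line layout visible in the excerpt (start, end, state label, score, then optional phone-level fields).

```python
MLF_EXCERPT = """\
0 200000 h#_s2 -136.655975 h# -589.680481 h#
200000 400000 h#_s3 -145.780716
400000 800000 h#_s4 -307.243774
800000 1200000 q_s2 -349.529327 q -897.429504 q
1200000 1500000 q_s3 -280.568817
1500000 1800000 q_s4 -267.331390
1800000 1900000 iy_s2 -76.825096 iy -673.892883 iy
1900000 2400000 iy_s3 -305.832458
2400000 2800000 iy_s4 -291.235352
"""

HTK_UNITS_PER_FRAME = 100_000  # 10 ms frames, times in units of 100 ns


def parse_mlf_line(line):
    """Return (start_frame, end_frame, state_label, state_score) for one line."""
    fields = line.split()
    start_frame = int(fields[0]) // HTK_UNITS_PER_FRAME
    end_frame = int(fields[1]) // HTK_UNITS_PER_FRAME
    return start_frame, end_frame, fields[2], float(fields[3])


def frames_per_phone(text):
    """Sum the frame counts of the _s2/_s3/_s4 parts of each phone."""
    totals = {}
    for line in text.strip().splitlines():
        start_frame, end_frame, state, _ = parse_mlf_line(line)
        phone = state.rsplit("_", 1)[0]  # "iy_s3" -> "iy"
        totals[phone] = totals.get(phone, 0) + (end_frame - start_frame)
    return totals


phone_frames = frames_per_phone(MLF_EXCERPT)
print(phone_frames)
```

In this excerpt the three parts of a phone are not of equal length (h# is split 2, 2 and 4 frames), so the split is not a simple division of the phone duration by three.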
How were the PHN files transformed and concatenated to produce the file TimitLabels.mlf?
How were the durations of the individual phones in the PHN files split up into three sections each?
How are the scores derived (the log probability numbers after each phone label, as per the HTK MLF file format)? Are these scores used in CNTK?
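One relationship can be checked from the excerpt alone: the phone-level score on the first line of each phone is, to within rounding, the sum of the three per-part scores. The sketch below only verifies that arithmetic on the numbers quoted above; the helper name and tolerance are hypothetical, and it says nothing about how the scores were produced.

```python
import math

# (phone, per-part scores, phone-level score), copied from the MLF excerpt.
EXCERPT_SCORES = [
    ("h#", [-136.655975, -145.780716, -307.243774], -589.680481),
    ("q", [-349.529327, -280.568817, -267.331390], -897.429504),
    ("iy", [-76.825096, -305.832458, -291.235352], -673.892883),
]


def part_sum_matches(part_scores, phone_score, rel_tol=1e-6):
    """True if the per-part scores add up to the phone score within rel_tol."""
    return math.isclose(sum(part_scores), phone_score, rel_tol=rel_tol)


results = {
    phone: part_sum_matches(parts, total)
    for phone, parts, total in EXCERPT_SCORES
}
print(results)
```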
Thank you for any help,