DSSM with CNTK

Jun 3, 2015 at 10:13 AM
Hello,

I started using CNTK for some of my tasks. It is quite interesting, and training NNs with it is very simple.

One of the things that I have been trying is to build word semantic embeddings using DSSM.

You can find my (non working, experimental) configuration and NDL files here.

Compared to the original work, I am using a small dataset to begin with. My dataset has about 10K documents and a vocabulary of 17K words. Word hashing reduces the one-hot input dimensionality to about 8K with the tri-gram character representation. This is the input to the model described in the configuration and NDL files.
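For illustration, the letter tri-gram extraction behind word hashing can be sketched as follows (a minimal sketch; the function name and the `#` boundary marker are my own choices, not CNTK code):

```cpp
#include <string>
#include <vector>

// Extract letter tri-grams from a word after adding boundary markers,
// e.g. "good" -> #go, goo, ood, od#. Each distinct tri-gram corresponds
// to one dimension of the hashed input vector.
std::vector<std::string> LetterTrigrams(const std::string& word)
{
    std::string marked = "#" + word + "#";
    std::vector<std::string> grams;
    for (size_t i = 0; i + 3 <= marked.size(); ++i)
        grams.push_back(marked.substr(i, 3));
    return grams;
}
```

Because the number of distinct letter tri-grams is far smaller than the word vocabulary, this is how a 17K-word vocabulary can collapse to roughly 8K input dimensions.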

For the moment I am facing the following problems to train the model:
- The DSSMReader available in the CNTK source uses some Windows-specific APIs. I am trying to port it to Linux by replacing the Windows file-mapping APIs with mmap; has this already been tried by someone? Also, what is the format of the data expected by DSSMReader?
- As an alternative, I am trying to use the UCIFastReader in my configuration. The UCI file stores the concatenation of the query and document input vectors, which are sliced apart in the NDL file. The labels in the file are dummies and are not used by the model. However, with this approach I found that UCIFastReader does not read all of the lines/records in the UCI input file. A little debugging of the source code showed that UCIParser uses _ftelli64() in the GetFilePosition() function to calculate the size of the UCI file, and this fails to give the true file size on my machine (Linux 3.13.0-52-generic #86-Ubuntu 64-bit, gcc 4.8.2). I replaced _ftelli64() with ftell(), and now it can read the entire file. (This is not the most portable solution; something like stat() would be better.) As I debug further, it seems some variables in UCIParser (like m_totalNumbersConverted) should also be changed to int64_t. Before making these changes for myself, I would like to confirm whether this is really an issue or something wrong with my setup.
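As a sketch of the stat() alternative mentioned above (the function name is mine; on Windows, _stat64 would be needed for files larger than 2 GB):

```cpp
#include <cstdio>
#include <sys/stat.h>

// Get a file's size without relying on the return type of ftell():
// stat() reports st_size as off_t, which is 64-bit on modern Linux.
// Returns -1 on error.
long long FileSize(const char* path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return static_cast<long long>(st.st_size);
}
```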

Further, I would be glad to learn about my mistakes and any inefficient steps in my network configuration (especially in the calculation of the loss function).

Thank you,

Imran
Jun 4, 2015 at 11:58 PM
I am not a Linux expert, but I was told that something similar to memory-mapped files is available on Linux. I will need to consult the author of the DSSMReader to find out the format it requires.

For the UCIFastReader, I think your observation is right. Would you mind sending us the patch once it works on your task?
Jun 10, 2015 at 3:34 PM
Edited Jun 10, 2015 at 3:37 PM
Please find below the patch for UCIFastReader that enables reading of large data files.
I have tested this on my Ubuntu and Win7 machines by:
(1) verifying the number of records read and numbers parsed
(2) asking CNTK to dump the values of the input nodes
/*
 * lines 93-94 in UCIParser.h
 */
int64_t m_totalNumbersConverted;
int64_t m_totalLabelsConverted;
/*
 * add the definition of ftell64() near the top of UCIParser.cpp,
 * before its first use (_WIN32 is predefined by the MSVC compiler)
 */
#ifdef _WIN32
#define ftell64 _ftelli64
#else
#define ftell64 ftell
#endif
/*
 * line 380 in UCIParser.cpp
 */
int64_t position = ftell64(m_pFile);
I am using UCIFastReader with this patch to train the DSSM word embeddings.
For now, the word embeddings are such that only words with similar spellings group together.
I am checking what is going wrong and trying variations in the training configuration.
Jun 11, 2015 at 2:10 AM
Thanks. I will port the changes tomorrow.

When using DSSM, please note that the input is a list of positive samples (feature-label pairs), and the negative samples are generated automatically by permuting the labels inside the minibatch. This means you get better results when the permutation has a very small probability of generating positive (instead of negative) samples.
Jun 12, 2015 at 6:29 PM
Expected binary format for DSSMReader:

Header:
    int64 numRows
    int32 numCols
    int64 nnz_total
---- the header offset is here ----
    int64[numRows] instanceOffsets // offsets relative to the header offset, i.e., headerOffset + instanceOffsets[i] is the offset of instance i
Each instance (numRows instances in total):
    int32 nnz          // number of nonzero values
    float[nnz] values  // the actual nonzero values, nnz in total
    int32[nnz] colIds  // column ids (in ascending order) corresponding to the nnz values
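Reading one instance from this layout could be sketched as follows (error handling omitted; struct and function names are mine, not the actual DSSMReader code; for files beyond 2 GB, fseeko/_fseeki64 would be needed instead of fseek):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One sparse row in the format described above: nnz values plus their
// ascending column ids.
struct SparseInstance
{
    std::vector<float>   values;
    std::vector<int32_t> colIds;
};

// Seek to headerOffset + instanceOffset (as the spec defines) and read
// the nnz count, the values, and the column ids for one instance.
SparseInstance ReadInstance(FILE* f, int64_t headerOffset, int64_t instanceOffset)
{
    SparseInstance inst;
    fseek(f, static_cast<long>(headerOffset + instanceOffset), SEEK_SET);
    int32_t nnz = 0;
    fread(&nnz, sizeof(nnz), 1, f);
    inst.values.resize(nnz);
    inst.colIds.resize(nnz);
    fread(inst.values.data(), sizeof(float), nnz, f);
    fread(inst.colIds.data(), sizeof(int32_t), nnz, f);
    return inst;
}
```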