A New Large-Scale RNNLM System Based on Distributed Neurons
A Recurrent Neural Network Language Model (RNNLM) retains the history of the training data in its previous hidden layer and feeds that state back as input during training, and it has become an active topic in Natural Language Processing research. However, its immense training time is a serious problem. The large output layer, hidden layer, previous hidden layer, and the connections among them produce enormous matrices during training, which is the main factor limiting efficiency and scalability. At the same time, common remedies such as class-based output layers and small hidden layers reduce the accuracy of the RNNLM. In general, the lack of parallelism among artificial neurons is the root cause of these problems. We therefore restructure the RNNLM and design a new large-scale RNNLM centered on distributed artificial neurons in the hidden layer, emulating the parallelism of biological neural systems. We also modify the training method and present a coordination strategy for the distributed neurons. Finally, we implement a prototype of the new large-scale RNNLM system on Spark. Testing and analysis show that the training time grows far more slowly than the number of distributed neurons in the hidden layer and the size of the training dataset, which indicates that our large-scale RNNLM system has efficiency and scalability advantages.
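As background for the size argument (a standard Mikolov-style formulation, not stated explicitly in the abstract; the notation below is ours), the RNNLM forward pass is

\[
s(t) = f\big(U\,w(t) + W\,s(t-1)\big), \qquad y(t) = g\big(V\,s(t)\big),
\]

where w(t) is the one-hot input word, s(t) the hidden state, y(t) the output distribution, f the sigmoid, and g the softmax. With vocabulary size |V_w| and hidden size H, U is H x |V_w|, W is H x H, and V is |V_w| x H, so the vocabulary-sized input and output matrices dominate the per-step training cost, which is what motivates distributing the hidden-layer neurons.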