This article is produced by Emotibot (竹间智能科技). The authors are Emotibot scientists Guan-Ju Peng (彭冠举) and Chu-Yu Hsu (许储羽).

Emotibot is dedicated to building China's first artificial-intelligence companion, with affective computing research at its core and cutting-edge technologies such as deep learning as its foundation, serving users' everyday life and work needs.

For reprints, please contact the Emotibot WeChat official account (Emotibot_tech) and credit the source.

Author introduction:

Guan-Ju Peng is a research scientist at Emotibot. His research interests include mathematical modeling in image/video processing and machine learning applications.

Chu-Yu Hsu is currently a research engineer at Emotibot. His research interests include machine learning and natural language processing.

The Distributional Representation of Language:

The Unsupervised Way

Abstract

In this article, we briefly review some important prior work on the distributional representation of language, with a focus on unsupervised methods. We also discuss their possible uses in natural language processing and in building a conversational robot.

1 Introduction

How we represent language is a fundamental problem in many natural language processing applications, such as named entity recognition (NER), semantic role labeling (SRL), and sentiment analysis. Since machine learning methods for these tasks have recently proven powerful and promising, selecting the features they learn from has become an important issue. Traditional features, such as bag-of-words (BOW) and n-grams, are simple to construct [1], but they are not adjustable and are thus of limited use in machine learning algorithms. In contrast, a distributional representation transforms documents into vectors and/or tensors that can be adjusted during learning. This flexibility greatly benefits the performance of many machine learning algorithms in natural language processing [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
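To make the contrast concrete, the toy sketch below (with a hypothetical five-word dictionary, not taken from any of the cited works) compares a fixed bag-of-words count vector with a dense, trainable embedding table whose entries can be adjusted during learning:

import numpy as np

# Hypothetical toy dictionary; in practice it is built from the corpus.
dictionary = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(dictionary)}
sentence = ["the", "cat", "sat", "on", "the", "mat"]

# Bag-of-words: a fixed, non-adjustable count vector of size |S|.
bow = np.zeros(len(dictionary))
for w in sentence:
    bow[word_to_id[w]] += 1.0
print("BOW:", bow)                      # [2. 1. 1. 1. 1.]

# Distributional representation: a dense, trainable embedding matrix.
# Each row is a tensor d_s; the entries are free parameters adjusted
# during learning rather than fixed counts.
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(len(dictionary), 4))
print("g('cat'):", embedding[word_to_id["cat"]])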

The distributional representation can be learned from the language itself. In this case, two assumptions are usually adopted: 1) the distributional hypothesis, which states that similar words appear in similar contexts [12], and 2) that the meaning of a word is determined by the words surrounding it [13]. Two methods adopting these assumptions have been proposed to learn representations of words [10] and of sentences [14], respectively. Later, in [11], the global co-occurrence count of any two words is considered as an additional feature; in this case, higher weights are given to neighboring documents that co-occur with the center one more frequently across the whole corpus.
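As a small illustration of these assumptions, the sketch below extracts the (center, neighbor) pairs defined by a context window from a toy two-sentence corpus and accumulates the global co-occurrence counts that are weighted in [11]; the corpus and window size are illustrative choices:

from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 2                      # size of the window on each side

pairs = []                      # (center, neighbor) pairs
cooccurrence = Counter()        # global co-occurrence counts X_ij
for sentence in corpus:
    words = sentence.split()
    for t, center in enumerate(words):
        lo, hi = max(0, t - window), min(len(words), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, words[j]))
                cooccurrence[(center, words[j])] += 1

print(pairs[:4])
print(cooccurrence[("sat", "on")])   # how often "on" appears near "sat"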

In contrast to the unsupervised methods mentioned above, when labeled data, which can be very expensive, is available, the distributional representation can also be learned from these labels. A simple neural probabilistic model for this purpose was first proposed in [2]. Building on this foundation, Collobert et al. introduced multi-task learning to enhance its generality across different NLP applications [5]. In [8], Socher et al. use an additional matrix per word to represent its relation to the neighboring words, and, by parsing the dependency tree, the structure used to learn the word representations is dynamically constructed for each sentence.

In recent years, to model longer dependencies among documents, recurrent neural networks, especially the Long Short-Term Memory (LSTM) [15] and its variants [16], have drawn tremendous attention in the natural language processing field. An LSTM provides not only the representation of words but also that of sentences through its state, and it shows strong performance when used to train language models [17].

In this article, we briefly introduce several important unsupervised methods for learning language representations [10, 11, 14]. Although the supervised methods [2, 5] are also very important, and new models for NLP tasks appear rapidly in this vigorous field, understanding the methods described here can still serve as a solid foundation for getting into the field.

2 Notations

In this section, we define the notation used in the rest of the article.

(1) Hierarchy of Language: A document, from high level to low level, can be an article, a paragraph, a sentence, or a word. We use a function h to denote the decomposition from one level to the next lower level. For example, if s is a sentence, then h(s) is the sequence of words in that sentence.

(2) Window: Let s denote a document; then N(s) is the set of its neighboring documents at the same level. The window size indicates the number of neighbors in the set.

(3) Tensor Mapping: We use S to denote the set containing all possible documents, which is also called the dictionary. A function g maps a document s to a tensor ds, and D is the set containing all tensors mapped from S. The mapping function g is usually learned from the data and, depending on the complexity of s, can be a complicated multi-layer neural network or a very simple look-up table. For example, the tensor representation of a sentence can be produced by tangled LSTMs, while that of a word is usually obtained from a simple look-up table.

(4) The Probability Model: A fundamental problem in any machine learning approach is how to model the probability of events. A general form of such a model can be written as

P(s; Φ) = f(s; Φ, Ω),

where Φ denotes the conditioning events, P(s; Φ) is the probability of s, and f and Ω are the modeling function and its parameters, respectively. For example, the occurrence of a document s can be modeled by a simple softmax function with its neighbors N(s) as the known condition, as in the skip-gram model of the next section.
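As a concrete illustration of items (3) and (4), the sketch below implements g as a random look-up table and models P(s; N(s)) with a softmax whose score is the dot product of g(s) with the averaged neighbor vectors. The dictionary, the tensor dimension, and the score function are our own illustrative assumptions, not the exact parameterization of any specific model:

import numpy as np

class LookupTable:
    """A minimal mapping g: S -> D implemented as a trainable matrix."""

    def __init__(self, dictionary, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.dictionary = list(dictionary)
        self.word_to_id = {w: i for i, w in enumerate(self.dictionary)}
        # One row per document in S; each row is the tensor d_s.
        self.table = rng.normal(scale=0.1, size=(len(self.dictionary), dim))

    def __call__(self, s):
        return self.table[self.word_to_id[s]]

def softmax_probability(s, neighbors, g):
    """P(s; N(s)) modeled as a softmax over the dictionary S.

    The score of a candidate is assumed to be the dot product of its tensor
    with the averaged neighbor representations; other score functions fit
    the same general form f(s; Phi, Omega).
    """
    context = np.mean([g(c) for c in neighbors], axis=0)
    scores = np.array([g(w) @ context for w in g.dictionary])
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[g.word_to_id[s]]

g = LookupTable(["the", "cat", "sat", "on", "mat"])
print(g("cat").shape)                           # (8,)
print(softmax_probability("sat", ["cat", "on"], g))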

3 Skip-Gram Model for Language Representation

The skip-gram model is founded on the assumption in [13] that the meaning of a document is determined by its neighbors. In the model, for each document s, the probability of its co-occurrence with its neighbors is considered:

where T is the whole document set. If we assume that the neighboring documents are independent of each other, the probability in (3) becomes

Our objective is to find an appropriate modeling function f(s; c, Ω) for P(s; c) and to maximize (4) with respect to the model parameters Ω.

3.1 Skip-Gram Model for Word Representation

Using the model in (4) to obtain word representations was first proposed in [18], where the probability P(s; c) is modeled as the following softmax function:

The cost function of the skip-gram model for word representation is then the negative log-likelihood of the observed (center, neighbor) pairs. Since the document set T, the window N, and the dictionary S are given, the optimization is to find the mapping g that minimizes this cost.
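The following sketch writes this cost as a negative log-likelihood over observed (center, context) pairs, assuming the softmax in (5) scores a candidate by the dot product of an input table and an output table, as is common practice following [18]; the toy tables and pairs are illustrative:

import numpy as np

rng = np.random.default_rng(0)
dictionary = ["the", "cat", "sat", "on", "mat"]
dim = 8
# Separate input (center) and output (context) look-up tables; sharing a
# single table is also possible.
g_in = {w: rng.normal(scale=0.1, size=dim) for w in dictionary}
g_out = {w: rng.normal(scale=0.1, size=dim) for w in dictionary}

def log_softmax_prob(s, c):
    """log P(s; c): softmax over the dictionary of the score g_in(s') . g_out(c)."""
    scores = np.array([g_in[w] @ g_out[c] for w in dictionary])
    scores -= scores.max()                      # numerical stability
    return scores[dictionary.index(s)] - np.log(np.exp(scores).sum())

def skip_gram_cost(pairs):
    """Negative log-likelihood over all observed (center, context) pairs."""
    return -sum(log_softmax_prob(s, c) for s, c in pairs)

pairs = [("cat", "the"), ("cat", "sat"), ("sat", "cat"), ("sat", "on")]
print(skip_gram_cost(pairs))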

A critical problem in solving this optimization is the high complexity caused by the probability model in (5), which requires access to every document listed in the dictionary S. The concept of noise contrastive estimation is proposed in [19] to solve this problem: the softmax model in (5) is replaced by cascading multiple sigmoid functions.

When we observe c ∈ N(s), the pattern (s, c) is considered a positive sample, and several random draws from the dictionary (the draws can be uniformly distributed) are considered negative samples, or noise. Then, instead of maximizing P(s; c), we simultaneously maximize the probability of the positive sample and minimize those of the negative samples, so the term log P(s; c) in (4) is replaced by

where k is the number of negative samples, Pn(s) is the probability distribution of the random draws, and si is the i-th negative sample. In [10], negative sampling (NEG) is reported to perform excellently when k is about 5 to 15, which is much smaller than the dictionary's size.
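A sketch of this negative-sampling replacement for log P(s; c): the positive pair is pushed through a sigmoid toward probability one while k noise draws are pushed toward zero. The uniform noise distribution and the toy tables below are simplifications of our own ([10] draws negatives from a unigram distribution raised to the 3/4 power):

import numpy as np

rng = np.random.default_rng(0)
dictionary = ["the", "cat", "sat", "on", "mat"]
dim = 8
g_in = {w: rng.normal(scale=0.1, size=dim) for w in dictionary}
g_out = {w: rng.normal(scale=0.1, size=dim) for w in dictionary}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(s, c, k=5):
    """Replacement for log P(s; c): one positive pair plus k noise samples.

    The noise distribution Pn(s) is uniform over the dictionary here for
    simplicity; [10] uses a unigram distribution raised to the 3/4 power.
    """
    positive = np.log(sigmoid(g_in[s] @ g_out[c]))
    noise = rng.choice(dictionary, size=k)          # the negative samples s_i
    negative = sum(np.log(sigmoid(-(g_in[si] @ g_out[c]))) for si in noise)
    return positive + negative

print(neg_sampling_objective("sat", "cat", k=5))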

3.2 Skip-Gram Model for Document Representation

We may generalize the skip-gram model from representing words to representing multi-level documents. For example, a sentence can be decomposed into words, and the generalized model should be able to obtain the representations of words and sentences simultaneously. We use S to denote the set of high-level documents and h(S) the sequences of low-level documents decomposed from S. For example, (S, h(S)) can be the sentences and their corresponding word sequences, or the articles and their sentence sequences.

For simplicity, we first consider representation learning in a two-level language hierarchy. We use S0 and S1 to represent the sets containing the high-level and low-level documents, respectively. The functions g0 and g1 are the mappings from S0 and S1 to their distributional representations. For a high-level document s0 ∈ S0, the function h0 maps s0 to a unique sequence consisting of the low-level documents in S1. Then, with the data set T, the skip-gram model that considers the two-level language hierarchy can be written as follows:

where N0 and N1 are their same-level neighboring documents.

For simpler notation, we use s to denote a high-level document and w a low-level document. Two core settings must be determined before we can apply (9) to obtain the representations: 1) the mappings from documents to vectors, gs and gw, and 2) the probability model for P(w; Ns(s), Nw(w)). In [14], a long short-term memory network and a look-up table are used for gs and gw, respectively. We again use h to denote the decomposition of s, so that h(s) is a sequence denoted by {wt}, where t is the index. The LSTM for gs sequentially reads the low-level documents from h(s) (with t as the index) and updates its state. The state at time stamp t, denoted by ht, is computed from the previous state and the input wt as:

where xt = gw(wt) is the vector representing wt from the look-up table, and (Wr, Wz, W, Ur, Uz, U) are the model parameters. The state after reading the last token in h(s) is the distributional representation of s and is denoted by gs(s).
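A minimal sketch of this recurrent encoder gs: it consumes the word vectors xt = gw(wt) in order and returns the final state as gs(s). The gate structure below follows the parameter names (Wr, Wz, W, Ur, Uz, U), which correspond to a GRU-style update; the dimensions and initialization are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RecurrentEncoder:
    """g_s: reads the word vectors x_t = g_w(w_t) in order, returns the final state."""

    def __init__(self, dim_x, dim_h, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
        self.Wr, self.Wz, self.W = (init(dim_h, dim_x) for _ in range(3))
        self.Ur, self.Uz, self.U = (init(dim_h, dim_h) for _ in range(3))

    def step(self, h, x):
        r = sigmoid(self.Wr @ x + self.Ur @ h)              # reset gate
        z = sigmoid(self.Wz @ x + self.Uz @ h)              # update gate
        h_tilde = np.tanh(self.W @ x + self.U @ (r * h))    # candidate state
        return (1.0 - z) * h + z * h_tilde

    def encode(self, word_vectors):
        h = np.zeros(self.U.shape[0])
        for x in word_vectors:           # read h(s) = {w_t} in order
            h = self.step(h, x)
        return h                         # g_s(s): the state after the last token

# Usage with a hypothetical look-up table g_w producing 8-dim word vectors:
rng = np.random.default_rng(1)
sentence_vectors = [rng.normal(size=8) for _ in range(5)]
encoder = RecurrentEncoder(dim_x=8, dim_h=16)
print(encoder.encode(sentence_vectors).shape)    # (16,)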

To build the model for P(w; Ns(s), Nw(w)), we first assume that the documents in Ns(s) are independent of each other and, in addition, that the window Nw includes only the documents prior to w in h(s). These assumptions lead to the following equations:

where w, having index t, is written as bt, and {b0, b1, ..., bt-1} are the documents prior to w in h(s). If we use a vector vΦ, where Φ = {a, {b0, b1, ..., bt-1}} in this case, to represent the condition, and gw(bt) for bt, the probability P(bt; a, {b0, b1, ..., bt-1}) can be modeled by the softmax function as:

where the set Sw contains all possible low-level documents. The remaining problem is how to obtain vΦ. Here, an LSTM, with parameters different from those of the previous one, is again used to obtain vΦ:

where Cr, Cz, and C are the parameters introduced to condition on the neighboring high-level document a. The LSTM sequentially iterates over the low-level documents in {b0, b1, ..., bt-1}, and the vector vΦ is the state, i.e., ht in (13), after the last iteration. Please note that the two LSTMs do not share their parameters, although we use the same notation for simplicity.
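A sketch of this conditioned recurrence, under the same GRU-style reading of the parameter names: the representation of the neighboring high-level document a enters every gate through the Cr, Cz, C terms, and the state after the last prior word is taken as vΦ for the softmax over Sw. Dimensions, initialization, and the separate output table are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConditionedDecoder:
    """Produces v_Phi from the prior words {b_0, ..., b_{t-1}} and the encoded
    neighboring high-level document a, which enters through Cr, Cz, C."""

    def __init__(self, dim_x, dim_h, dim_a, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
        self.Wr, self.Wz, self.W = (init(dim_h, dim_x) for _ in range(3))
        self.Ur, self.Uz, self.U = (init(dim_h, dim_h) for _ in range(3))
        self.Cr, self.Cz, self.C = (init(dim_h, dim_a) for _ in range(3))

    def state(self, prior_word_vectors, a):
        h = np.zeros(self.U.shape[0])
        for x in prior_word_vectors:                 # g_w(b_0), ..., g_w(b_{t-1})
            r = sigmoid(self.Wr @ x + self.Ur @ h + self.Cr @ a)   # reset gate
            z = sigmoid(self.Wz @ x + self.Uz @ h + self.Cz @ a)   # update gate
            h_tilde = np.tanh(self.W @ x + self.U @ (r * h) + self.C @ a)
            h = (1.0 - z) * h + z * h_tilde
        return h                                     # v_Phi

def softmax_over_Sw(v_phi, output_table):
    """P(b_t; a, {b_0, ..., b_{t-1}}) over S_w; each row of output_table stands
    for one low-level document and must have the same dimension as v_phi."""
    scores = output_table @ v_phi
    scores -= scores.max()
    return np.exp(scores) / np.exp(scores).sum()

# Usage: 8-dim word vectors, 16-dim states, the neighbor a encoded to 16 dims.
rng = np.random.default_rng(1)
decoder = ConditionedDecoder(dim_x=8, dim_h=16, dim_a=16)
prior = [rng.normal(size=8) for _ in range(3)]
a = rng.normal(size=16)
v_phi = decoder.state(prior, a)
output_table = rng.normal(size=(5, 16))              # one row per document in S_w
print(softmax_over_Sw(v_phi, output_table).sum())    # 1.0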

4 Cross-Entropy Model for Language Representation

We follow the ideas in [11] to explain the cross-entropy model. For any two documents i, j ∈ S, we use Xij to denote the number of times they co-occur, i.e., i ∈ T and j ∈ N(i); then (4) can be rewritten as

where H(Zi, Pi) is the cross-entropy between the two distributions Zi and Pi. We may regard Zi as the measured frequency of document i's occurrences, while Pi models the probability of document i's occurrences. Maximizing (15) amounts to minimizing the cross-entropy, and the minimum is attained when the two distributions are identical. This observation indicates that we can replace the maximization of (15) with the minimization of the difference between the distributions:

where the Euclidean distance is used to measure the difference between the distributions. We may use a function exp(g(i)^T g(j) + bi + bj) to model P(j; i), and generalize Xi from a weight on document i to a weighting function q(Xij) for each (i, j) pair. The cost function of the cross-entropy model is then a weighted least-squares fit of g(i)^T g(j) + bi + bj to log Xij over all co-occurring pairs.
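A minimal sketch of this cost, with the weighting function q(Xij) = min(1, (Xij/xmax)^0.75) used in [11]; the toy counts, vectors, and biases below are illustrative:

import numpy as np

def weighting(x, x_max=100.0, alpha=0.75):
    """q(X_ij): the weighting function used in [11]."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def cross_entropy_model_cost(X, g, b):
    """Weighted least-squares cost over all co-occurring pairs (i, j).

    X is a dict {(i, j): X_ij}, g maps documents to vectors, b maps documents
    to scalar biases; the score g(i)^T g(j) + b_i + b_j is fit to log X_ij.
    """
    cost = 0.0
    for (i, j), x_ij in X.items():
        if x_ij > 0:
            pred = g[i] @ g[j] + b[i] + b[j]
            cost += weighting(x_ij) * (pred - np.log(x_ij)) ** 2
    return cost

# Toy example with hypothetical counts and random parameters:
rng = np.random.default_rng(0)
docs = ["the", "cat", "sat"]
X = {("the", "cat"): 4.0, ("cat", "sat"): 2.0, ("the", "sat"): 1.0}
g = {w: rng.normal(scale=0.1, size=8) for w in docs}
b = {w: 0.0 for w in docs}
print(cross_entropy_model_cost(X, g, b))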

5 Applications

A language representation learned in an unsupervised way has the advantage of supporting search for documents with similar semantics. This property can be widely used in many NLP applications and even in human-robot conversation. We introduce several applications below; a small code sketch of the first two follows the list.

1) The analogical reasoning task: Given three words, for example x = "France", y = "Paris", and p = "Germany", what is the word q that minimizes the cosine distance between (g(x) - g(y)) and (g(p) - g(q))? The answer is q = "Berlin" if the mapping function g is learned by the skip-gram model or a similar one.

2) The document similarity task: When two documents have similar semantics, we may expect their representations under g to have a smaller cosine or Euclidean distance than those of documents with different semantics.

3) The question-answering system: When building a conversational robot, a simple but brute-force approach is to construct a huge database containing as many question-answer pairs as possible. When a document from the human is received, the most similar question in the database is retrieved, and its bound answer becomes the robot's response. In this system, the role of the mapping g is to support measuring the similarity between any two documents.
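Below is a small sketch of applications 1) and 2), using cosine similarity over a hypothetical embedding table g; a table trained with one of the models in Sections 3 and 4 would replace the random vectors here:

import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(x, y, p, g):
    """Find q minimizing the cosine distance between g(x)-g(y) and g(p)-g(q)."""
    target = g[x] - g[y]
    candidates = [w for w in g if w not in (x, y, p)]
    return max(candidates, key=lambda q: cosine(target, g[p] - g[q]))

def most_similar(query, g, topn=3):
    """Application 2): rank documents by cosine similarity of their representations."""
    others = [w for w in g if w != query]
    return sorted(others, key=lambda w: cosine(g[query], g[w]), reverse=True)[:topn]

# Hypothetical random table; with vectors actually trained by the skip-gram
# model, solve_analogy("France", "Paris", "Germany", g) is expected to return "Berlin".
rng = np.random.default_rng(0)
g = {w: rng.normal(size=8) for w in ["France", "Paris", "Germany", "Berlin", "cat"]}
print(solve_analogy("France", "Paris", "Germany", g))
print(most_similar("cat", g))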

6 Concluding Remarks

In this brief article, we reviewed three important unsupervised methods for learning distributional language representations. We focused on describing them in a unified and theoretical way so that their concepts are easy to understand and compare; the implementation and experimental details, which can be found in the corresponding references, are omitted.

Although in many applications, such as language classification, the representations learned by supervised methods may outperform those learned by unsupervised ones, the unsupervised methods still have the advantage of avoiding the expensive data-labeling task. Finally, we hope that readers will find this material useful for getting into the field.

References

[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, July 1948.

[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.

[3] J. M. Cabrera, H. J. Escalante, and M. Montes-y Gomez, "Distributional term representations for short-text categorization," in 14th International Conference in Computational Linguistics and Intelligent Text Processing, 2013, pp. 335-346.

[4] C. Ma, X. Wan, Z. Zhang, T. Li, and Y. Zhang, "Short text classification based on semantics," in Advanced Intelligent Computing Theories and Applications - 11th International Conference, 2015, pp. 463-470.

[5] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in International Conference on Machine Learning (ICML), 2008.

[6] D. Chen, R. Socher, C. D. Manning, and A. Y. Ng, "Learning new facts from knowledge bases with neural tensor networks and semantic word vectors," CoRR, vol. abs/1301.3618, 2013.

[7] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532-1543.

[8] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng, "Semantic compositionality through recursive matrix-vector spaces," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1201-1211.

[9] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, "Semi-supervised recursive autoencoders for predicting sentiment distributions," in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 151-161.

[10] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in 27th Annual Conference on Neural Information Processing Systems, 2013, pp. 3111-3119.

[11] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp.1532-1543.

[12] Z. Harris, "Distributional structure," Word, vol. 10, no. 23, pp. 146-162, 1954.

[13] J. Firth, "A synopsis of linguistic theory 1930-1955," Studies in Linguistic Analysis., pp. 1-32, 1957.

[14] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-thought vectors," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 3294-3302.

[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 2067-2075.

[17] M. Sundermeyer, H. Ney, and R. Schluter, "From feedforward to recurrent LSTM neural networks for language modeling," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 23, no. 3, pp. 517-529, 2015.

[18] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.

[19] M. Gutmann and A. Hyvarinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, pp. 307-361, 2012.
