DHE(Deep Hash Embedding)
Published:
DHE(Deep Hash Embedding)
Abstract
This paper proposed a new method to generate embedding vector for recommendation models. Compared to looking up embedding vectors from an embedding table, the deep hash embedding can reduce the model size by 75% with similar AUC, and better generalization ability.
One-hot based embedding learning
For a categorical feature (e.g country), the recommendation models usually generate an one-hot vector of a feature value, and looks up the embedding vector from an embedding table (with random initialization values). If a feature value is “US”, then it will be converted into an one-hot vector $[0, 0, …, 1, 0, …, 0]$.
The one-hot vector can be denoted as $x$. The dimensions of $x$ is $1 \times m$ where m is the cardinality of the feature. Then it will lookup an embedding vector from an embedding table $W$ with dimensions of $m \times n$ where n is the embedding width. The process of “looking up” an embedding vector can be denoted as $xW$, which is a linear feature transformation. The $W$ here can be learned by back propagation (detailed computation).
Deep hash embedding
The limitations of one-hot based embedding learning are: (1) When m is very large, we need a lot of memory to store the embedding table (e.g. a 1 billion x 100 matrix takes 400GB of memory). Even though we can use hashing tricks to reduce the original cardinality of the feature, it still cannot be generalized to unseen feature values during inference.
The innovation of this paper is, it substitudes the traditional shallow, wide layer with a DNN. It reduces the number of parameter because DNN is more efficient, and improves the generalization ability because the weights of the DNN can be applied to any unseened feature values.
The limitation of this method is, as the author mentioned, it usually underfits the dataset. They achieved best performance by using the Mish activation function. Maybe it is because the original hashing encoding (input for the DNN) is too complex.
An interesting future direction is to jointly model multiple features using DHE.