The Hashing Trick
The hashing trick is a machine learning technique for representing high-dimensional data, such as text or categorical features, as fixed-length vectors: a hash function maps each feature to an index in a lower-dimensional space. Machine learning algorithms often require fixed-length inputs, so this makes the data easier and cheaper to process. Because distinct features can collide (hash to the same index), the mapping is not one-to-one and some information is lost, but in practice the hashing trick is effective and widely used.
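To make the idea concrete, here is a minimal sketch in plain Python. The function name `hash_features`, the choice of MD5, and the bucket count of 16 are all assumptions made for illustration; real implementations typically use a faster non-cryptographic hash.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map a list of string tokens to a fixed-length count vector.

    Hypothetical helper for illustration only.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # Use a stable hash (Python's built-in hash() is randomized per process for str)
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        vec[idx] += 1  # tokens that collide simply add up in the same slot
    return vec

doc_a = hash_features("the cat sat on the mat".split())
doc_b = hash_features("a completely different and much longer sentence about dogs".split())

# Both vectors have the same length, whatever the input size
print(len(doc_a), len(doc_b))
```

Inputs of different lengths end up as vectors of the same length, and no vocabulary is ever stored.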
It sounds like embedding, except:
- There is no ML involved to “learn” a latent representation of your data points
- That means no model training or inference: it works out of the box, even if the number of categories increases over time
- It’s super fast and computationally very simple/cheap
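The point about new categories can be sketched as follows. This is only an illustration: the `bucket` helper, the use of `zlib.crc32`, the bucket count, and the category names are all assumptions, contrasted with a hypothetical vocabulary-based lookup.

```python
import zlib

N_BUCKETS = 32  # hypothetical fixed output width

def bucket(category: str) -> int:
    """Return a stable bucket index for any string, seen before or not."""
    # crc32 gives the same value across runs, unlike built-in hash() on str
    return zlib.crc32(category.encode("utf-8")) % N_BUCKETS

# A vocabulary-based encoder only knows the categories it was built with
vocab = {"shoes": 0, "hats": 1}
new_category = "scarves"

print(new_category in vocab)                  # the lookup table has no entry for it
print(0 <= bucket(new_category) < N_BUCKETS)  # hashing still yields a valid index
```

The lookup table would need to be rebuilt to handle the new category, while the hash-based index needs no update.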
This is perfect if:
- The categorical features you are trying to encode have a very large number of categories (e.g., products in e-commerce)
- Resources are scarce
The rationale behind this technique is quite intricate, and I had to go through many resources before I understood it well. Here’s a pretty good resource, using text as an example. scikit-learn also provides a class for this (sklearn.feature_extraction.FeatureHasher).
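As a rough usage sketch of `FeatureHasher`, assuming scikit-learn is installed: the feature strings and the value `n_features=8` below are made up for illustration (real use cases usually pick a much larger width to limit collisions).

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is the fixed output width; 8 is an arbitrary value for illustration
hasher = FeatureHasher(n_features=8, input_type="string")

# Each sample is a list of string features (hypothetical product attributes)
samples = [
    ["product=shoes", "color=red"],
    ["product=scarf"],
]

X = hasher.transform(samples)  # stateless: no fit step is needed
print(X.shape)      # (2, 8)
print(X.toarray())  # dense view of the sparse result
```

The result is a SciPy sparse matrix. Some entries may be negative, because `alternate_sign=True` by default, which lets colliding features partly cancel out instead of always adding up.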