The Hashing Trick

05 Apr, 2023

The hashing trick is a machine learning technique for representing high-dimensional data, such as text or categorical features, as fixed-length vectors by mapping each feature to an index in a lower-dimensional space with a hash function. This lets machine learning algorithms, which often require fixed-length inputs, process the data more efficiently. Because different features can collide on the same index, the resulting vectors are not guaranteed to be unique, but in practice the hashing trick is effective and widely used.
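To make the idea concrete, here is a minimal sketch of the mapping described above. The function name `hash_features`, the choice of MD5 as the hash, and the default vector size are all assumptions made for illustration; real implementations typically use a faster non-cryptographic hash.

```python
import hashlib

def hash_features(tokens, n_features=16):
    """Map a list of string tokens to a fixed-length count vector.

    Illustrative helper (hypothetical name), not a library function.
    """
    vec = [0] * n_features
    for tok in tokens:
        # Use a stable hash: Python's built-in hash() for strings is
        # salted per process, so it would not be reproducible across runs.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Reduce the hash to a bucket index in [0, n_features).
        index = int.from_bytes(digest[:8], "little") % n_features
        # Tokens that collide simply add up in the same bucket.
        vec[index] += 1
    return vec

vector = hash_features(["red", "shoes", "red"], n_features=8)
print(len(vector))  # always n_features, whatever the vocabulary size
```

The output length depends only on `n_features`, never on how many distinct tokens exist, which is why no vocabulary has to be built or stored.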

It sounds like embedding, except:

There is no ML involved to “learn” a latent representation of your data points
That means no model training or inference: it works out of the box, even if the number of categories increases over time
It’s super fast and computationally very cheap

This is perfect if:

The categorical features you are trying to encode have a large number of categories (e.g. products in e-commerce)
Resources are scarce

The rationale behind this technique is quite intricate, and I had to go through many resources before I understood it well. Here’s a pretty good resource, using text as an example. There’s also a scikit-learn class for it (sklearn.feature_extraction.FeatureHasher).
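As a rough usage sketch of that scikit-learn class, the snippet below hashes two made-up product descriptions. The sample tokens and the small `n_features=8` are assumptions chosen for readability; in practice you would pick a much larger size to limit collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# input_type="string" means each sample is an iterable of string tokens.
hasher = FeatureHasher(n_features=8, input_type="string")

# Made-up samples, purely for illustration.
samples = [["shoes", "red", "size_42"], ["laptop", "grey"]]

# The hasher is stateless, so transform() can be called without fitting.
X = hasher.transform(samples)  # SciPy sparse matrix
print(X.shape)  # (2, 8)

# Categories that were never seen before still map to the same 8 columns.
X_new = hasher.transform([["brand_new_category"]])
print(X_new.shape)  # (1, 8)
```

Note that by default FeatureHasher applies an alternating sign to the hashed values, so entries in the output can be negative.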

#AI #english