Categorizing Products the STYLIGHT Way
Machine learning is quite a hot topic nowadays and is used for a variety of purposes, from recommending products and movies to analyzing business data. Here at STYLIGHT, we use machine learning for a completely different purpose.
Our partner shops send us thousands of products every day, and we sort them into our own fashion categories, ranging from very broad ones, e.g. clothing, to quite detailed ones, e.g. mini skirts. With almost 800 classes, this task is very time-consuming when done manually. We have therefore implemented a machine learning system which predicts the fashion categories of the products in a fast, scalable and automated way.
A classical machine learning architecture consists of the steps shown in Figure 1. For each block, an individually specialized algorithm can be chosen depending on which type of data is available (images, documents, numbers, etc.).
Figure 1: Classical machine learning architecture
The first step, feature extraction, is the process of collecting features which might help to predict the target value for a given product. For example, if one wanted to predict the size of a T-shirt, one would collect the name, brand, width, height, color and price of the T-shirt. These attributes form a feature vector, as can be seen in Figure 2. However, some of this information might be completely unnecessary, and getting rid of it is part of the next step.
[name, brand, width, height, color, price]^T
Figure 2: Feature vector of a T-shirt
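As a minimal sketch, the feature vector from Figure 2 could be represented as follows. The class name `TShirt`, the example values and the units (centimetres, euros) are assumptions made for illustration only; they are not taken from our production system.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class TShirt:
    """One product with the six attributes listed in Figure 2."""
    name: str
    brand: str
    width: float   # assumed to be in centimetres
    height: float  # assumed to be in centimetres
    color: str
    price: float   # assumed to be in euros


# Hypothetical example product.
shirt = TShirt("Basic Tee", "ExampleBrand", 50.0, 70.0, "white", 19.99)

# The feature vector is simply the attributes in a fixed order.
feature_vector = astuple(shirt)
print(feature_vector)
```

Keeping the attributes in a fixed order matters because later steps refer to features by position.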
As stated above, the feature selection step keeps only the important features. This means features that carry no information are discarded. Imagine we want to predict the size of a T-shirt (S, M, L) and let us assume the size depends only on the width and height of the T-shirt. All additional information, like the price or brand, is then completely unnecessary. Fewer features usually lead to less noise, lower memory usage and faster classification. However, the tricky part is finding these relevant features.
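One very simple way to discard features that carry no information is to drop every feature whose value never changes across the products. The sketch below shows this variance filter. The feature names, the sample values and the helper `select_informative` are made up for illustration. Note that such a filter only removes constant features; deciding that a varying feature such as the price is irrelevant for the size requires a criterion that also looks at the labels.

```python
from statistics import pvariance

# Hypothetical numeric features of three T-shirts.
FEATURE_NAMES = ["width", "height", "price", "in_stock"]
samples = [
    [46.0, 66.0, 19.99, 1.0],
    [50.0, 70.0, 19.99, 1.0],
    [54.0, 74.0, 24.99, 1.0],
]


def select_informative(rows, names, min_variance=0.0):
    """Return the names of all features whose variance exceeds min_variance."""
    columns = zip(*rows)  # turn the list of rows into one tuple per feature
    return [
        name
        for name, column in zip(names, columns)
        if pvariance(column) > min_variance
    ]


selected = select_informative(samples, FEATURE_NAMES)
print(selected)  # "in_stock" is constant and is therefore dropped
```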
Well-known and frequently used classifiers are k-NN, Naive Bayes, Decision Trees, Random Forests, Support Vector Machines and Neural Networks, just to name a few. Each of these classifiers has its own strengths and weaknesses. Some have long training phases, some cannot deal with high-dimensional feature spaces, and others make assumptions about the a priori distribution of the data points. The selection of the classifier is a crucial part of the machine learning pipeline and can greatly influence the final outcome. Therefore, one should think carefully before choosing one of them.
Figure 3: Example of a classifier
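To make the idea of a classifier concrete, here is a minimal k-NN sketch for the T-shirt example. The training points and the function `knn_predict` are hypothetical and only serve as an illustration; this is not the classifier we run in production.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (width, height) in centimetres and the size label.
training = [
    ((46.0, 66.0), "S"),
    ((47.0, 67.0), "S"),
    ((50.0, 70.0), "M"),
    ((51.0, 71.0), "M"),
    ((54.0, 74.0), "L"),
    ((55.0, 75.0), "L"),
]


def knn_predict(train, point, k=3):
    """Predict the label of point by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(training, (50.5, 70.5)))
```

k-NN has no training phase at all, but every prediction has to look at the whole training set, which illustrates the kind of trade-off mentioned above.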
While the basic principles of machine learning shown above are quite obvious, the devil is, as always, in the details.
For example, the outcome of a machine learning experiment is hardly predictable, meaning you have to run an actual experiment in order to know how well your system will perform. Designing and conducting such tests is very time-consuming, but absolutely necessary for proper model selection and cross-validation. It is especially important for being able to make statements about how well the machine learning system generalizes.
Here at STYLIGHT, we perform a nested cross-validation to find the best model and its expected performance on new, unseen data. We tried different splits for training and testing sets and found that a classical 80/20 ratio works best for us. This number can vary depending on the data, but it can be seen as a rule of thumb.
As we have an enormous number of products to process, we developed our own fully automated testing and validation environment in order to generate new models. It is heavily parallelized and has a low memory footprint. This allows us to test new ideas very quickly. In addition to that, we use confusion matrices, amongst other things, for visualizing our results. There we can directly see the effect of the changes we make and identify fashion categories which might be harder to predict than others.
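The following sketch shows the general idea of a nested cross-validation with a toy k-NN classifier. It is built on several assumptions: the data is synthetic, all function names are hypothetical, and five outer folds are used so that every outer split has an 80/20 ratio. How the 80/20 split is wired into our actual environment is not shown here.

```python
import random
from collections import Counter
from math import dist
from statistics import mean


def knn_predict(train, point, k):
    """Majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], point))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test items that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def k_folds(items, n_folds):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    for i in range(n_folds):
        test = items[i::n_folds]
        train = [item for j, item in enumerate(items) if j % n_folds != i]
        yield train, test


def nested_cv(data, candidate_ks, outer_folds=5, inner_folds=4):
    """Inner loop selects k, outer loop estimates performance on unseen data."""
    outer_scores = []
    for outer_train, outer_test in k_folds(data, outer_folds):

        def inner_score(k):
            # Model selection only ever sees the outer training part.
            return mean(
                accuracy(tr, va, k) for tr, va in k_folds(outer_train, inner_folds)
            )

        best_k = max(candidate_ks, key=inner_score)
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return mean(outer_scores)


# Synthetic T-shirt data: 20 noisy (width, height) points per size.
rng = random.Random(0)
centres = {"S": (46.0, 66.0), "M": (50.0, 70.0), "L": (54.0, 74.0)}
data = [
    ((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)), label)
    for label, (cx, cy) in centres.items()
    for _ in range(20)
]
rng.shuffle(data)

estimate = nested_cv(data, candidate_ks=[1, 3, 5])
print(f"estimated accuracy on unseen data: {estimate:.2f}")
```

The important design point is that the outer test fold is never used for choosing the model, so the outer score is not biased by the model selection.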
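A confusion matrix simply counts how often each actual category was predicted as each category. The sketch below uses a handful of invented labels; apart from mini skirts, the category names, the predictions and the helper functions are all hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the occurrences of each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))


# Invented example: true categories and the categories a model predicted.
actual = ["mini skirts", "mini skirts", "maxi skirts", "jeans", "jeans", "jeans"]
predicted = ["mini skirts", "maxi skirts", "maxi skirts", "jeans", "jeans", "mini skirts"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))


def recall(label):
    """Share of products of this category that were predicted correctly."""
    total = sum(count for (a, _), count in matrix.items() if a == label)
    return matrix[(label, label)] / total if total else 0.0


# Print one row per actual category, one column per predicted category.
for a in labels:
    print(a.ljust(12), [matrix[(a, p)] for p in labels])

# The category with the lowest recall is the hardest one to predict here.
hardest = min(labels, key=recall)
print("hardest category:", hardest)
```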
Our results with machine learning so far are quite promising, and we definitely want to do more in this direction. This is why we are continually exploring new ways in which we can apply machine learning to make our daily work more efficient.