0 votes
by, in 02 - L’enfer des données, des modèles et des représentations...
It was mentioned that the dataset can be small. How small is small? Do you have any further literature pointers on this? Thank you

1 Answer

0 votes
by Vétéran du GPU 🐋 (46.6k points)
selected by
Best answer
There is no single answer to this question, since it depends on your data and your objective. For MNIST, for instance, a few tens of thousands of examples work just fine; for a large language model you need hundreds of billions. You have to look at your data, and sometimes only an expert can assess the complexity of the problem and therefore the amount of data required.
by
That is a bit frustrating. I thought you could help me with further literature pointers on this; I have zero experience with it. An expert is not an expert from the start either.
by (2.2k points)
this Kaggle notebook explains why this is a problem: https://www.kaggle.com/code/rafjaa/dealing-with-very-small-datasets
by (5.6k points)
Usually it is hard to determine whether a dataset is too small; intuitively, we compare it to another dataset. If that other dataset is of similar complexity and can be learned by an AI model, we can say it is of reasonable size, and that our comparable new dataset isn't too small either. Another way to judge the smallness of a dataset, more costly in time, is simply to try to solve the task with an AI model and see whether the result is good enough (see the sketch below).
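To illustrate that second approach, a learning curve trained on growing subsets of the data shows whether performance has plateaued. This is only a minimal sketch, assuming scikit-learn is available; its digits dataset and the logistic-regression estimator are stand-ins for your own data and model.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data; substitute your own X, y here.
X, y = load_digits(return_X_y=True)

# Train on 10%, 32.5%, ..., 100% of the training split, 5-fold CV each time.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> validation accuracy {score:.3f}")

# If accuracy is still climbing at the largest size, more data would likely
# help; if it has flattened out, the dataset is probably "big enough" for
# this model and task.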
by
Thank you very much. May I suggest adding further literature pointers to the slides.
by
This is not necessarily about dataset size but about the available features in your dataset. Take SVMs: you can well have a working SVM model with a dataset of 100-200 samples and only 2-3 features; if you raise the number of features to, say, 20, that will probably not work (see the sketch below). In ML everything is relative, and there is no black-and-white answer. One of the most important aspects of ML is actually understanding the data that will be used, not necessarily its size.
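To make that concrete, here is a hedged sketch assuming scikit-learn: the same SVM on ~150 synthetic samples, first with 3 features, then with 20 of which only 3 carry signal. The dataset and numbers are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for n_features in (3, 20):
    # Synthetic binary classification problem; only 3 features are
    # informative in both settings, the rest are pure noise.
    X, y = make_classification(
        n_samples=150,
        n_features=n_features,
        n_informative=3,
        n_redundant=0,
        random_state=0,
    )
    score = cross_val_score(SVC(), X, y, cv=5).mean()
    print(f"{n_features:2d} features -> CV accuracy {score:.3f}")

# With only 150 samples, the 17 extra noise features dilute the signal and
# the cross-validated accuracy typically drops: more features demand more
# data for the same model.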
...