Do you need big data to train a machine learning model? Not necessarily. A large set of labeled data is the usual requirement in ML, but several techniques can reduce the amount of data needed: lightweight algorithms, feature engineering, fine-tuning pre-trained models, and active learning. Depending on your specific needs, you may not have to rely solely on big data for training.
Datasets are collections of objects that have been labeled by a human. For example, if you want to find cats in photos, you need photos labeled “cat” along with the coordinates of the cat in each picture. A dataset also has to be representative: if all of your photos come from a single fan forum, the model will not perform well on other photos, no matter how large the pool is.
A central challenge in ML is overfitting: the algorithm memorizes the training dataset and fails on unseen data. The usual remedy is to add more data, so that the algorithm stops fitting random noise.
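A minimal sketch of overfitting, using only the Python standard library. Everything here is an assumption made for illustration: the synthetic data (a number in [0, 1) whose label is "greater than 0.5", flipped 20% of the time), the function names, and the two models. A nearest-neighbour model memorizes every training point, noise included, while a one-parameter threshold rule cannot.

```python
import random

random.seed(0)

def make_data(n, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # simulated labeling noise
        data.append((x, label))
    return data

def accuracy(predict, data):
    """Fraction of examples for which predict(x) matches the label."""
    return sum(predict(x) == label for x, label in data) / len(data)

def fit_memorizer(train):
    """1-nearest-neighbour: answer with the label of the closest training point."""
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict

def fit_threshold(train):
    """Lightweight model: pick the single cut-off that scores best on the training set."""
    candidates = [i / 20 for i in range(21)]
    best = max(candidates, key=lambda t: accuracy(lambda x: x > t, train))
    return lambda x: x > best

train = make_data(50)
test = make_data(500)

memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

memorizer_train_acc = accuracy(memorizer, train)
memorizer_test_acc = accuracy(memorizer, test)
threshold_test_acc = accuracy(threshold, test)

print(f"memorizer: train={memorizer_train_acc:.2f} test={memorizer_test_acc:.2f}")
print(f"threshold: test={threshold_test_acc:.2f}")
```

The memorizer scores perfectly on the data it has seen and noticeably worse on new data; that gap between training and test accuracy is the symptom of overfitting.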
Another option is to use lightweight algorithms: algorithms that cannot capture complex dependencies but are also less prone to overfitting. The trade-off is that you have to search for patterns in the data manually. For example, when predicting store sales you may only have the address, date, and list of purchases, but if the date is a holiday, customers will likely buy more and bring in more revenue, so a holiday indicator is a useful input. This process is called feature engineering, and it is helpful when the features are easy to create.
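The holiday example can be sketched as a small feature function. The holiday calendar, the function name, and the extra weekend and month features are hypothetical, chosen only for illustration; a real project would use a calendar for its own region.

```python
from datetime import date

# Hypothetical holiday calendar (assumption for this sketch).
HOLIDAYS = {date(2023, 12, 25), date(2024, 1, 1)}

def engineer_features(sale_date: date) -> dict:
    """Turn a raw sale date into hand-crafted features a simple model can use."""
    return {
        "is_holiday": sale_date in HOLIDAYS,
        "is_weekend": sale_date.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
        "month": sale_date.month,
    }

print(engineer_features(date(2023, 12, 25)))
print(engineer_features(date(2023, 12, 23)))
```

A lightweight model cannot discover "holiday" from a raw date on its own, but it can use the explicit `is_holiday` column directly.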
However, there are tasks where this is not applicable, such as image processing. This is where deep neural networks come in: they are high-capacity algorithms that can find non-trivial dependencies. Recent advances in computer vision have been credited to neural networks, which often need more data. But you do not have to train them from scratch: you can take a pre-trained model and fine-tune it for your own task.
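A toy sketch of the simplest form of fine-tuning: keep the pre-trained part frozen and train only a small new output layer on a handful of labeled examples. This is not a real neural network. `pretrained_features` is a hand-written, hypothetical stand-in for a frozen pre-trained model, and the task (is a 2-D point inside the unit circle?) and all names are assumptions made for illustration.

```python
import math

def pretrained_features(point):
    """Hypothetical stand-in for a frozen pre-trained network: maps a raw
    2-D input to one feature, its squared distance from the origin."""
    x, y = point
    return x * x + y * y

# Small labeled set for the new task: 1 = inside the unit circle, 0 = outside.
labeled = [
    ((0.0, 0.0), 1), ((-0.5, 0.0), 1), ((0.5, 0.5), 1), ((0.0, -0.8), 1),
    ((-1.2, 0.0), 0), ((1.0, -1.0), 0), ((0.0, 1.5), 0), ((2.0, 0.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The only trainable parameters: a one-feature logistic-regression head.
weight, bias = 0.0, 0.0
learning_rate = 0.5

for _ in range(5000):
    grad_w = grad_b = 0.0
    for point, label in labeled:
        feature = pretrained_features(point)  # frozen: never updated
        error = sigmoid(weight * feature + bias) - label
        grad_w += error * feature
        grad_b += error
    # Batch gradient descent step on the head only.
    weight -= learning_rate * grad_w / len(labeled)
    bias -= learning_rate * grad_b / len(labeled)

def predict(point):
    """True if the fine-tuned head classifies the point as inside the circle."""
    return sigmoid(weight * pretrained_features(point) + bias) >= 0.5

train_accuracy = sum(predict(p) == bool(y) for p, y in labeled) / len(labeled)
print(f"training accuracy: {train_accuracy:.2f}")
print(predict((0.3, 0.3)), predict((1.5, 1.5)))
```

Because the frozen part already supplies a useful representation, eight labeled points are enough to fit the two remaining parameters. With a real pre-trained vision model the idea is the same, only the feature extractor is a network rather than a formula.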
When labeling is difficult, for example when classifying body cells, active learning can be used. The neural network suggests which examples it most needs labeled, and it can also detect examples that are labeled incorrectly. It also reports its confidence in each result, so running it on unseen data shows you where it is unsure.
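One common way to implement this is uncertainty sampling, sketched below under stated assumptions: the model is a binary classifier that outputs a probability per example, the probabilities are made-up numbers, and the function names and the 0.9 confidence threshold are hypothetical. The source does not say which selection strategy it has in mind.

```python
def select_for_labeling(probabilities, k):
    """Return indices of the k unlabeled examples the model is least sure about,
    i.e. whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

def flag_suspect_labels(probabilities, labels, threshold=0.9):
    """Return indices where the model confidently disagrees with the human label."""
    suspects = []
    for i, (p, label) in enumerate(zip(probabilities, labels)):
        confident_positive = p >= threshold
        confident_negative = p <= 1 - threshold
        if (confident_positive and label == 0) or (confident_negative and label == 1):
            suspects.append(i)
    return suspects

# Made-up model outputs on an unlabeled pool.
pool_probabilities = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_for_labeling(pool_probabilities, k=2)

# Made-up model outputs on already-labeled examples, with their human labels.
labeled_probabilities = [0.95, 0.05, 0.60, 0.98]
human_labels = [1, 1, 0, 0]
suspects = flag_suspect_labels(labeled_probabilities, human_labels)

print("ask a human to label:", to_label)
print("re-check these labels:", suspects)
```

Sending only the uncertain examples to an expert spends the scarce labeling effort where the model is likely to learn the most.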
As you can see, there are many options when it comes to training a machine learning algorithm, even if you don’t have access to large datasets.
Now, let’s turn to the special issue mentioned in the article, “The quest for Nirvana: Applying AI at scale”. It is a special issue created by Emerj (formerly Emerging Technology Research), an intelligence platform that corporations and public institutions use for rapid-cycle AI strategy, tactics and research. It aims to help companies find the best way to implement AI by providing a framework and knowledge drawn from over 5,000 analytics, research reports, and machine learning case studies.
Lastly, the person mentioned in the article is Emerj CEO and Co-Founder Daniel Faggella. Daniel has an MBA from the University of Massachusetts and has been featured in MIT Sloan Magazine and The Next Web. He’s an AI research and strategy expert, an international keynote speaker, and the co-author of XAI: Reusable Explainable Artificial Intelligence Systems. He started Emerj in 2018 to create comprehensive AI strategy and research solutions. Daniel’s experience helps companies understand where AI could apply to them and, more specifically, how to use it in their current operations.