Daniel's Blog

The ML Foundation

Personal Guide Through a Machine Learning Project. (Originally published: February 02, 2024)

A Pro is A Master of the Basics

Hello world, it's been a while since I published my post declaring my goal to become the best data scientist in the world. I've been in the lab, if I do say so myself, mixing Coursera courses & Kaggle classes in volumetric flasks. I wanted to share some of what I've learnt with you guys, the basics more specifically.

The basics are very important. You remember what your primary school teacher said about how "the foundation is important"? Well, if you don't, imagine me saying it now. Whether it's an advanced course or some shortcuts afterwards, the first things you learn are always what everything else is built upon.

Yay, please teach me Daniel-Sensei...What are we learning again?

Good question, my student. Well, to answer it I'd have to take you on a trip in my favourite rocket ship, so let's start here. I want my expertise to touch on the blurry line between Data Science & Machine Learning Operations. I'd like to provide business solutions using data and create notebooks and reports (hence Data Science), and I'm also interested in how ML models are maintained, scaled and deployed (hence MLOps).

I want to be skilled on both fronts, although I understand that the more proficiency and experience I gain, the more specialised my role will probably become. But both are tied together by Machine Learning (from what I understand at least). So to conclude my long-winded answer, what we're learning about here is Machine Learning. It's an interesting concept, and I have some views on it beyond what you can learn about it elsewhere, so let's get into it: who is this Machine Learning superstar?

The A-list Celebrity, Machine Learning

As you may all know, Machine Learning, henceforth termed "ML", has become a modern-day buzzword (hate that word lol), which actually comes with its pros and cons.

It's actually a good thing because it's leading to a lot of advancements in the space, with tech like everyone's best friend ChatGPT, image generators like Dall-E (which I've actually never used; my AI art views are quite conservative), and soon Sora for videos. There's more accessible material for me to get my hands on to learn about ML and ML processes, and community support is only growing.

The negative aspect affects people like me. Being all the rage has created an unprecedented amount of pressure for people to stand out in this field; actually, apart from being a hobby, standing out is part of why I do this blog. So I guess it's "Just give it time, we'll see who's still around a decade from now" for all the competition...but I can't wait a decade.

A lot of people come into this under the impression that they're going to build cool models from scratch when they enter the field, but that doesn't bother me at all. In fact, I love the idea of outsourcing that task to genius PhD students who have probably run the calcs and delivered profoundly accurate models, so I can pick them up and use them as tools to create my solutions.

So as much as I learn the fundamentals of how models work (because again, the fundamentals are super important), I plan on using models as tools or nodes within a system. Also, most companies already have their own models they work with, and MLOps also teaches you about No Code/Low Code solutions that will easily get you the ML model of your dreams.

Enough red carpet, onto the Main Event

ML is basically when a computer program performs a task without being explicitly programmed to do it. Traditionally, you would have to write a ton of loops and conditional statements to get specific results or outputs for a given input, but ML closes the gap on that front. And also, that's the noob definition; we're nerds here, so let's use my favourite definition.

ML is when a computer program is said to learn from experience (E) with respect to some task (T) and some performance measure (P), if its performance on (T), as measured by (P), improves with experience (E). — Tom Mitchell, 1997

What that means is that a machine learning model has actually acquired some knowledge of, or means for, how to do a task (it could be sorting spam emails or driving a car without human assistance), and the way we know it's going about its job the right way is by measuring its performance. Over time, as the model gains more experience, it should continue to improve and get more accurate at the task.

We could use something like a confusion matrix for a classification task or fun math equations for regression tasks (predicting a continuous value).
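If you want to see what that looks like without leaving your chair, here's a minimal sketch using scikit-learn (assuming you have it installed); the spam labels are completely made up, just to show the call.

```python
# A minimal sketch of a confusion matrix, assuming scikit-learn is installed.
# The spam/not-spam labels below are made up purely for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam (what actually happened)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the model guessed

print(confusion_matrix(y_true, y_pred))
# rows = actual classes, columns = predicted classes
```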

This is the core concept of ML, and everything else is built upon this small foundation. The task covers (but is not limited to) what problem you're trying to solve, how to frame it and the type of model you're using. The experience is basically the data you're feeding the model, which includes the features (like columns in tabular data) for it to learn from, and the performance is how accurate your results are and how well the model adjusts to new/future data.

A whole lot can go into each aspect of ML, and it's the reason a PhD is so often listed as a requirement for jobs in this field; so many things can affect any of these three steps. We have biases (biased data & results), overfitting & underfitting, how to do exploratory data analysis (EDA) to choose the right features, how to preprocess data, ethics and so on. And it gets even deeper with deep learning (it's literally called deep learning, maybe they got tired of naming things) and neural networks.

But today I have a movie starring ML & my gorgeous self, front row tickets too, just for you.

The Plot

Your parents are desperate for you to get out of their basement and meet the love of your life and now they've asked me if I can use my lean, mean machine learning skills to find you your soulmate. Well, they came to the right place that's for sure, but I'll let you in on how we'll go about it.

Introducing LoveDoctorAI!

We hired an astute data collecting team to contact you and gather a lot of relevant data on people you might date. They used age, height, weight, income, sex, etc. to determine if you "would date" someone, to which you responded with various "Yes" or "No" answers. (These can be encoded into a binary 0 or 1, but we'll get there)...cool. Huh, you have quite a unique taste...wow, this is getting peculiar...okay, okay, sorry for being unethical, I won't continue snooping. Let's talk data again.

So these "age", "sex", "income", "Would date" columns are called features we're going to separate these features into inputs (or input features) and a target which in this cause would be "Would date", the target is basically what we are trying to predict.

Now let's see how this is going to play out, act by act, after all I know your popcorn is ready now.

The Setup: EDA

Short for exploratory data analysis (you probably missed that I said this earlier, anyway). Remember the input features we just spoke about? Yeah, let's start from there. We can't just feed any input into the model; in fact, how do we know what to feed the model?

We can use some data visualisation techniques to get a good view of the dataset, and we can also call in some nifty feature engineering. Through feature engineering we can find out which input features have a good correlation with the target.

So let's say after we do this we find out that you only date a certain sex. We can drop all candidates who are not that sex and remove "sex" from the input features (since they're now all the same value). We also drop "name" and "age" because we find out there's no real proof that you choose people because of their name, and to you age is just a number, or in more scientific terms, there's a weak correlation between "name" or "age" and the target feature "would date".
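If you like receipts, here's a hedged little sketch of checking correlations with pandas (assuming it's installed); the numbers are invented and "age" being the weak one is purely hypothetical.

```python
# A sketch of checking feature-target correlations, assuming pandas is installed
# and the DataFrame is already numeric. All values here are made up.
import pandas as pd

data = pd.DataFrame({
    "age":        [24, 31, 27, 45, 22],
    "height":     [1.65, 1.80, 1.72, 1.90, 1.60],
    "income":     [30_000, 85_000, 52_000, 120_000, 25_000],
    "would_date": [1, 0, 1, 0, 1],
})

# Correlation of every numeric feature with the target
print(data.corr()["would_date"].sort_values())

# Drop a feature that turns out to be weakly correlated (here, hypothetically, "age")
data = data.drop(columns=["age"])
```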

We can also employ unsupervised learning techniques to find trends in the data, but calm down, you nerd, not today!

Preprocessing data

The data collecting team might have collected some data with missing values, some numerical values that don't scale well, or made the "would date" answers "Yes" and "No" instead of 0 and 1. Thanks a lot, data collecting team, tsk. Now we have to clean this data. For the missing values we can either make them 0, fill them with the median or mean taken from the other values, or even remove the whole feature, which can be a bit extreme if the correlation is good. We also need to convert categorical data (words, basically) into numerical data because machines love numbers, y'know. And we can also scale the numerical values, because if the "income" values range between 0 and 250,000 and there is another feature like "height (in metres)" ranging from 1.5 to 2, it can create some complications during training.
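Here's a small, hedged sketch of that cleaning routine with pandas and scikit-learn (assuming both are installed); the tiny dataset and its missing values are made up for illustration.

```python
# A minimal preprocessing sketch, assuming pandas and scikit-learn are installed.
# The tiny DataFrame and column names are invented for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    "income":     [30_000, None, 52_000, 120_000],   # a missing value to clean up
    "height":     [1.65, 1.80, None, 1.90],
    "would_date": ["Yes", "No", "Yes", "No"],
})

# Fill missing numerical values with the median of each column
data["income"] = data["income"].fillna(data["income"].median())
data["height"] = data["height"].fillna(data["height"].median())

# Convert the categorical target into numbers
data["would_date"] = data["would_date"].map({"No": 0, "Yes": 1})

# Scale the numeric features so "income" doesn't dwarf "height"
scaler = StandardScaler()
data[["income", "height"]] = scaler.fit_transform(data[["income", "height"]])
print(data)
```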

Some for learning, some for revision

Now we have a clean, full, salacious (wink wink) dataset. We want to split this dataset into a training dataset, which the model will learn from, and a validation dataset, which we'll use to check whether the model guesses the right answers. What if you don't have a large enough dataset? Let's assume they collected data on 5,000 people for this project (sorry, that must have been exhausting!), but to our model 5,000 is pretty small. We can use cross-validation to chop up the dataset into multiple subsets (or folds), training the model on some folds while evaluating it on the remaining fold.
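A quick sketch of both ideas with scikit-learn (assuming it's installed); the 200 fake candidates below are randomly generated stand-ins for the real dataset.

```python
# A sketch of a train/validation split and cross-validation, assuming scikit-learn
# is installed; X and y stand in for the preprocessed LoveDoctorAI data.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))          # 200 fake candidates, 3 numeric features
y = rng.integers(0, 2, size=200)       # fake 0/1 "would date" answers

# Simple hold-out split: train on 80%, validate on the remaining 20%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Or, when the dataset is small, 5-fold cross-validation
model = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```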

Confrontation: The Model

There are a lot of no code/low code or open source models out there right now, and you would hardly ever have to make your own models from scratch if you're working for an already established company with senior staff. But for the sake of clarity, you can import a model from a Python library, like a decision tree, a random forest, XGBRegressor (for regression tasks) or XGBClassifier (for classification tasks). You can then set some hyperparameters to fine-tune the model for your specific use case (make sure you read the documentation to really understand what you're doing).
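As a hedged example (not the way, just a way), here's what importing and fitting a scikit-learn random forest might look like; the training data is randomly generated and the hyperparameter values are just placeholders.

```python
# A hedged sketch of importing and fitting a model, assuming scikit-learn is installed.
# X_train and y_train stand in for the (made-up) training data from the previous step.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(160, 3))
y_train = rng.integers(0, 2, size=160)

# Hyperparameters like n_estimators and max_depth are the knobs you tune
# for your specific use case (read the documentation!)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

print(model.predict(X_train[:5]))  # predictions for the first five candidates
```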

What?? There are two gatecrashers? Show yourselves!

There are two scoundrels we have to be careful of.

Our first hooligan is overfitting, always underdressed this one. What's overfitting's deal? Overfitting is basically when a Machine Learning model learns the training data too closely, quirks and all, instead of the general task at hand. This can happen in several ways:

The model might be learning from unrelated or unnecessary features to predict our target. For instance, it might decide someone called "John" stands a better chance at love simply because the model saw that a lot of other "Johns" were suitable.

Or there might not be enough data for the model to train on, so it can't get a good grasp of all possible or feasible future inputs. An example of that would be if we only used 10 people to find your match; something as small as a candidate being 1.7m instead of 1.8m tall could make the model give a wrong prediction.

Overfitting is a bad boy because when it happens it doesn't leave room for the model to generalise properly, so it's too strict with all types of incoming data. This is bad for predicting new input.

Thing 2 of our delinquent duo is our satirically overdressed underfitting. Who is this underfitting guy anyway? Well, if you have at least two brain cells more than me (I just got my 7th one, ha, beat that!), you may have figured out that underfitting is at the opposite end of the spectrum: it's when the model doesn't learn enough from the input features to properly grasp the target and ends up generalising way too much. Let me give you a glimpse of how that can happen:

The model might not have enough training time. When you don't let the model iterate through the dataset enough to extract and match the features to the target, it won't properly learn the patterns that matter.

Another cause could be that it needs more input features. For example, if we only put "height" and "weight" into the model, these two features wouldn't be enough to determine a good partner for you (what if you don't like their eye colour!).
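One practical way to catch both of these troublemakers is to compare the model's score on the training data with its score on the validation data. Here's a rough sketch with scikit-learn (assuming it's installed); the data is random, so treat the exact numbers as illustrative only.

```python
# A rough sketch of spotting overfitting and underfitting, assuming scikit-learn is
# installed; the data is randomly generated, so the scores are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

for depth in (1, None):  # a very shallow tree vs. an unrestricted one
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")

# A big gap (train much higher than val) hints at overfitting;
# both scores being low hints at underfitting.
```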

Resolution: Performance

Now we have our output, so how do we know we're on the right path? We have equations for that, more or less. In a classification task we can use a formula (which usually lives in a function in a Python library, not something you have to hard code) to test the accuracy of the predictions. Or in a regression task we can use another function to calculate the distance between the regression line and the predicted result; the further the point is from the line, the higher the error.
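In practice those "functions in a Python library" could look something like this sketch with scikit-learn (assuming it's installed); the true and predicted values are invented just to show the calls.

```python
# A hedged sketch of the usual "don't hard code it" metric functions, assuming
# scikit-learn is installed; the values are invented just to show the calls.
from sklearn.metrics import accuracy_score, mean_absolute_error

# Classification: fraction of predictions that were right
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred))        # 0.8

# Regression: average distance between predictions and the real values
actual    = [1.60, 1.75, 1.82]
predicted = [1.65, 1.70, 1.90]
print(mean_absolute_error(actual, predicted))
```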

Easter Eggs

There are a lot of other things to consider when making a model. Sometimes you want the model to have very high recall and not miss anything (like you don't want to miss explicit photos on a messaging app suited for kids), or you want it to be very precise (like in a medical diagnosis). This is usually a tradeoff called the Precision-Recall Tradeoff and can be visualised with a confusion matrix (as if it's not confusing enough), so make sure you understand the requirements of the project. For LoveDoctorAI, recall would be more important than precision because we don't want to miss any good fit, and wrong predictions are not fatal; we can just say "It's me, not you, but we can always be friends."
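If you want to check the two sides of that tradeoff yourself, here's a tiny sketch with scikit-learn (assuming it's installed); the "good match" labels are invented for illustration.

```python
# A small sketch of checking precision and recall separately, assuming scikit-learn
# is installed; the "would date" labels are invented for illustration.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # 1 = actually a good match
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]   # what LoveDoctorAI predicted

# Precision: of everyone we predicted as a match, how many really were?
print(precision_score(y_true, y_pred))
# Recall: of all the real matches, how many did we actually catch?
print(recall_score(y_true, y_pred))
```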

Cliffhanger: So who's my match made in heaven?

That was fun. After new data is inputted, the model can now make a prediction and even add those details to the dataset if it's an online model.

Nice! Now we can export the model, host it online and hook it up to the website we made just for you so hot singles can put in their details and see if they stand a chance with the gorgeous being that you are.
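To give you a feel for that last step, here's a hedged sketch of exporting the model with joblib and loading it inside the (hypothetical) website code; the filename, the toy training data and the new candidate's details are all made up.

```python
# A hedged sketch of saving the trained model so the website can load it later,
# assuming scikit-learn and joblib are installed; "lovedoctor.joblib" is a made-up name.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(rng.normal(size=(100, 3)), rng.integers(0, 2, size=100))

joblib.dump(model, "lovedoctor.joblib")          # export the trained model

# ...later, inside the web app that hot singles fill in:
loaded = joblib.load("lovedoctor.joblib")
new_candidate = [[1.72, 52_000, 27]]             # height, income, age (made up)
print(loaded.predict(new_candidate))             # 1 = they stand a chance!
```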

Oh and to answer your question, it was me. Pfft, are you reallyyy choosing someone over me? Let's be real. Although "income" had the strongest correlation to the target so... yeah maybe it's not me yet. I'll come back for you when I reach six figures. Pinkie promise!