Math Series: Why Unearthing Mathematical Concepts Is Core to Deciphering Data Science Problems!
Well, you guessed it! I have clickbaited you into opening my article.
Given that Data Science is the new fashion statement of the job industry, it wasn’t hard to bait you into reading this article. But please hear me out, for I have a topic that is far more important if you really want to succeed in donning this fancy title. A basic understanding of mathematical concepts such as statistics, probability or multivariate calculus will assist aspiring data scientists in choosing how best to build a model that fits their data.
I know you will come across scores of articles highlighting the obvious necessity of the other components, i.e. programming, data engineering, visualizations etc., but know that I am not here to start a forum war over which component is the most important. My view is that they are complementary, not competing.
So beware, keyboard warriors 😜!
The Premise of my Early Struggles
During my high school years, Mathematics was my favorite subject. Growing up in a South Asian household, I had to be seen studying by my parents. However, I could not stand being bored by Social Sciences, Biology and Languages. As a result, in went the earphones and out came the Mathematics book. It was an extremely fun time, until Statistics & Probability (S&P)!
As I realized in later years, my early struggle with S&P came purely down to never applying its concepts in practice, for example by conducting data analysis for research projects. After years of laboring, crying and clamoring for help over mathematical equations that looked simple but were mind-numbing to understand, I have come across an approach that makes learning statistics & probability enjoyable: coding.
Hint: Do it by coding statistical & probabilistic concepts in R/Python.
That is not to say that my journey learning S&P is over, or that I have mastered it all. I am still building my confidence and daring myself to choose statistical learning methods that are complex and solve multi-pronged problems (case in point: anything other than regression).
Also, learning S&P from books is sometimes tedious, and at times downright confusing. The majority of topics are taught without any before-or-after comparison between different statistical methods, which makes it hard for readers to apply them in practice. One of my other articles (link), where I describe the stages of a data science project, highlights how the different concepts of S&P relate to each other. The early phases place emphasis on historical data, whereas the later stages emphasize what historical trends imply for the future. That is, in essence, data science.
Another eye opener was the need to adjust myself to work with real-world problems. I mean, I learned how RSS (residual sum of squares) tells us how far the observed values are from the values predicted by the regression line, but I did not know how that would look in a real-life comparison of the relationship between a product's demand and its pricing when factoring in elasticity of demand.
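To give a feel for what RSS looks like on a pricing problem, here is a minimal sketch in Python. The price and demand numbers are made up purely for illustration, and the straight-line demand curve and the elasticity measured at the average price are simplifying assumptions of mine, not a recipe for real pricing work.

```python
import numpy as np

# Hypothetical data, invented for illustration: unit price and units sold.
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
demand = np.array([95.0, 88.0, 79.0, 74.0, 63.0, 58.0])

# Fit a straight line demand = slope * price + intercept by least squares.
slope, intercept = np.polyfit(price, demand, deg=1)

# Residuals are the gaps between observed demand and the fitted line.
predicted = slope * price + intercept
residuals = demand - predicted

# RSS adds up the squared gaps; smaller means the line hugs the data more closely.
rss = float(np.sum(residuals ** 2))

# Point elasticity at the average price, assuming the linear fit holds:
# (change in demand per unit price) * (average price / average demand).
elasticity = float(slope * price.mean() / demand.mean())

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"RSS={rss:.2f}, elasticity at mean price={elasticity:.2f}")
```

A negative slope here means demand falls as price rises, and an elasticity between -1 and 0 would suggest that demand reacts less than proportionally to a price change, at least on this toy dataset.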
The Difference between Statistics & Probability
Before we move on to the more complex concepts of mathematics in data science, it is important to build a good working foundation of Statistics & Probability (S&P). The difference between the two is the difference between past and future. In statistics, we collect historical data and use it to make certain assumptions about how the data came to be the way it is. Probability, on the other hand, answers questions about the likelihood of a certain scenario. In short, statistics is about what happened in our past, while probability is about what could happen in our future. Hence it is imperative to recognize that statistics will form the basis of exploratory data analysis (EDA), whereas probability is for predictive modelling (except for non-binary regression models, curse my obsession with regression!!!).
Keeping that in mind, we will split the series into two parts: EDA and Predictive Modelling. In the part on EDA, we will explore statistics that summarize the main characteristics of our datasets, and also engage in a few fun data visualizations along the way. In the part on predictive modelling, we will examine probabilistic theories and eventually employ visualizations to probe our findings.
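To make the past-versus-future split a bit more concrete, here is a small sketch using only Python's standard library. The daily sales figures are hypothetical, and treating them as normally distributed is an assumption I am making just to keep the example short.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical daily sales for the last ten days, invented for illustration.
daily_sales = [52, 48, 55, 60, 47, 53, 58, 50, 49, 56]

# Statistics: summarize what already happened.
avg = mean(daily_sales)
spread = stdev(daily_sales)  # sample standard deviation

# Probability: ask what could happen next, assuming sales follow a
# normal distribution with the mean and spread observed so far.
model = NormalDist(mu=avg, sigma=spread)
p_above_60 = 1 - model.cdf(60)

print(f"Average daily sales so far: {avg:.1f}")
print(f"Estimated chance of selling more than 60 tomorrow: {p_above_60:.1%}")
```

The first half only describes the data we already have, which is the EDA mindset. The second half leans on a probability model to say something about a day we have not seen yet, which is the predictive modelling mindset.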
The End Game here
I have gone through too many articles, books and university coursework on these complex theories that made me cry for help at night. It would be unfair to make my readers go through such suffering. My aim is to build this into a fun and interactive series that uses real-life examples to help readers learn as they code. Below is a list of the topics I will be covering in the series; it will be updated as I write each article.
- Measures of Central Tendencies (link)
Please help me improve myself and this series by sharing your 2¢ on the topic in the comments/DMs after you manage to survive till the end. This piece will be updated from time to time as my journey takes form. So enjoy the read!