Linear Regression with Python

This article is written for developers who have an intermediate level of coding experience and already have their Python environment set up. Go here if you do not have that set up:
Visual Studio Code: Python, Sklearn, numpy, panda and matplot


The last time I heard about Linear Regression was probably in my undergraduate years, when I was taking exams on probability, differential equations, derivatives and, worst of all, abstract mathematics.

Little did I know I would come back to calculating the slope and intercept in my career, under what they now call machine learning! Let's get to the point:

Situation - Data was coming in from a separate channel and was considered "garbage data" compared to the source of truth, so calculations on that data were misrepresented.

Task - Apply linear regression to perform predictive analysis: calculate a correction factor from the predicted percent difference between the source of truth and the separate channel on a given day, using only the previous 30 days of data for that day.

Action - Research a best practice for automating the calculation instead of performing it manually in Excel.

Result - Used Python with the Matplotlib, NumPy, pandas and scikit-learn (sklearn) libraries to loop through the data in 10 lines of code and calculate the offset.

Time taken: 4h of research, meeting with a data scientist and environment setup; 6h of development.

Firstly: Import the data and fit the x and y data using sklearn's linear regression -

1. Use the NumPy library to manipulate the data and put it into an array that sklearn can analyze. Define your x and y axis data.

2. Fit the x and y axis data with sklearn's LinearRegression.
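A minimal sketch of these two steps. The sample values are made up for illustration (x as a day number, y as a percent difference) and are not the original data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only: x is a day number, y is a percent difference.
# sklearn expects the features as a 2-D array, hence reshape(-1, 1).
x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]        # one coefficient per feature; we only have one
intercept = model.intercept_
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The reshape is the part that usually trips people up: a flat 1-D array of x values raises an error in fit(), because sklearn wants one row per sample and one column per feature.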

matplotlib.pyplot will draw a scatter plot of your data for you, if you need a visualization.
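Something along these lines would plot the points together with the fitted line. Again the data is illustrative, and the output file name regression_fit.png is just a placeholder I picked:

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file; remove this line and call plt.show() for a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only.
x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
model = LinearRegression().fit(x, y)

fig, ax = plt.subplots()
ax.scatter(x[:, 0], y, label="observed")                              # raw points
ax.plot(x[:, 0], model.predict(x), color="red", label="fitted line")  # regression line
ax.set_xlabel("day")
ax.set_ylabel("percent difference")
ax.legend()
fig.savefig("regression_fit.png")  # placeholder file name
```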

Secondly: Was it doing what I expected? Let's validate in Excel using the SLOPE() and INTERCEPT() functions. BAM, it matches... but wait, my purpose is to automate this thing so that, for each day, the predictive analysis runs on the previous 30 days.
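If you would rather cross-check without leaving Python, the least-squares slope and intercept (the same quantities SLOPE() and INTERCEPT() return) can be computed by hand and compared against sklearn. A small sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares for a single feature:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_manual = y.mean() - slope_manual * x.mean()

model = LinearRegression().fit(x.reshape(-1, 1), y)

# Compare with a tolerance rather than ==, since both are floating point.
matches = bool(
    np.isclose(slope_manual, model.coef_[0])
    and np.isclose(intercept_manual, model.intercept_)
)
print(matches)
```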

1. pandas has a .tail(n) function that returns the last n rows, so on data sliced up to the current day it grabs the current row plus the rows before it.

2. It also has an itertuples() method that helps me loop through each day.

I also don't want to include the current day in the analysis, so instead I go back 31 days and remove the very last row for correct predictions. Since that row is deleted, the first day is left with no data at all, so I use the len() function to make sure I only start where the remaining length is > 0.
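Here is a rough sketch of that windowing logic. The DataFrame, its date index and the pct_diff column name are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical frame: one row per day, indexed by date.
dates = pd.date_range("2019-01-01", periods=40, freq="D")
df = pd.DataFrame({"pct_diff": range(40)}, index=dates)

window_sizes = {}
for row in df.itertuples():
    current_date = row.Index  # itertuples() exposes the index as .Index
    # Everything up to and including today, keep the last 31 rows,
    # then drop the final row (today) so only prior days remain.
    history = df.loc[:current_date].tail(31).iloc[:-1]
    if len(history) > 0:  # the very first day has no history at all
        window_sizes[current_date] = len(history)

print(window_sizes[dates[-1]])
```

Early days simply get a shorter window (1 row, then 2, and so on) until a full 30 days of history is available.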

Third: Don't forget you need two extra columns to store the slope and intercept, so I created those columns (lines 10 and 13) with pandas as well. Then, in my loop, I update those values using the date as my index and the coef_ and intercept_ attributes from sklearn's library.
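Putting it together, the loop might look roughly like this. This is not the original script: the data and the pct_diff column name are invented, and I am assuming x is simply the row's position within the window. I also skip windows with fewer than two rows, since a single point cannot define a line:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: pct_diff rises by exactly 0.5 per day,
# so every fitted slope should come out as 0.5.
dates = pd.date_range("2019-01-01", periods=45, freq="D")
df = pd.DataFrame({"pct_diff": 0.5 * np.arange(45) + 2.0}, index=dates)

# Two extra columns to hold the fitted values for each day.
df["slope"] = np.nan
df["intercept"] = np.nan

for row in df.itertuples():
    # Previous 30 days, current day excluded.
    history = df.loc[:row.Index].tail(31).iloc[:-1]
    if len(history) < 2:  # assumption: require two points for a meaningful fit
        continue
    # Assumption: x is the position within the window (0, 1, 2, ...).
    x = np.arange(len(history), dtype=float).reshape(-1, 1)
    y = history["pct_diff"].to_numpy()
    model = LinearRegression().fit(x, y)
    # Write the results back using the date as the index label.
    df.loc[row.Index, "slope"] = model.coef_[0]
    df.loc[row.Index, "intercept"] = model.intercept_

print(df.tail(3))
```

Refitting a fresh model per day is fine at this size; for a very large number of rows a vectorized or rolling formulation would likely be worth looking into.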

Voila!! The last thing to do is save this back into my folder with to_csv. In reality, when working with big data, you would want to run it on the cluster and make sure you are using machines that can handle the number of rows! I hope I was able to share some insight into machine learning and predictive analysis!!

- Ena Vu 5/29/2019