
Build a Regression Model Using Scikit-learn: Prepare and Visualize Data
Introduction
Once the right tools for ML are set up, the next important step is asking the right question of the data. This lesson focuses on:
- How to prepare data for model building
- How to use Matplotlib for data visualisation
Asking the right question of your data
The questions we can answer with ML depend on two things: the algorithm (model) we adopt, and the quality and type of our data. The quality and nature of the available data determine how well it can be put to ML use. Despite the importance of data quality, clean data is rarely available for modelling, so we usually need to clean and transform the data from its raw state into a usable one.
Case study: 'The Pumpkin Market'
The data contains 1757 rows of information about the market for pumpkins in the US, grouped by city. A closer look at the data reveals a mix of strings, numbers, blanks and strange values that need to be worked on before we can use the data for ML.
Looking at the columns available in the data, we can start to think about what kinds of questions the data can answer. Regression analysis is a good analytical tool for such questions.
One question that we want to answer: "Predict the price of a pumpkin for sale during a given month".
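A first look for blanks and mixed types could be sketched as follows. The rows and column names here (`City Name`, `Package`, `Date`, `Low Price`, `High Price`) are hypothetical stand-ins made up for illustration, not values taken from the real file.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real pumpkin data.
sample = pd.DataFrame({
    "City Name": ["BALTIMORE", "BALTIMORE", "BOSTON", "BOSTON"],
    "Package": ["24 inch bins", "1 1/9 bushel cartons", "36 inch bins", None],
    "Date": ["9/24/16", "10/1/16", "10/8/16", "11/5/16"],
    "Low Price": [270.0, 15.0, None, 160.0],
    "High Price": [280.0, 18.0, 200.0, 160.0],
})

# Count the blanks (missing values) in each column.
missing_per_column = sample.isnull().sum()

# Show which columns hold text and which hold numbers.
column_types = sample.dtypes

print(missing_per_column)
print(column_types)
```

Columns with a non-zero missing count, or with a text type where numbers were expected, are the ones that likely need cleaning first.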
Exercise - analyze the pumpkin data
Using Pandas (the Python Data Analysis library), we can reshape the data and prepare it for ML use.
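One possible way to prepare such data is sketched below, assuming hypothetical column names and made-up rows. The choice to average the low and high price into a single `Price` column is an assumption for illustration, not necessarily the approach the exercise takes.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are assumptions.
raw = pd.DataFrame({
    "City Name": ["BALTIMORE", "BALTIMORE", "BOSTON", "BOSTON", "BOSTON"],
    "Package": ["24 inch bins", "1 1/9 bushel cartons", "36 inch bins",
                "1/2 bushel cartons", "24 inch bins"],
    "Date": ["9/24/16", "10/1/16", "10/8/16", "11/5/16", "11/12/16"],
    "Low Price": [270.0, 15.0, None, 16.0, 160.0],
    "High Price": [280.0, 18.0, 200.0, 18.0, 160.0],
})

# Keep only the columns relevant to the question, and drop rows with blanks.
columns_to_keep = ["Package", "Date", "Low Price", "High Price"]
prepared = raw[columns_to_keep].dropna().copy()

# Turn the date string into a numeric month, since the question is per month.
prepared["Month"] = pd.to_datetime(prepared["Date"], format="%m/%d/%y").dt.month

# Assumption: use the midpoint of the low and high price as the target.
prepared["Price"] = (prepared["Low Price"] + prepared["High Price"]) / 2

prepared = prepared[["Month", "Package", "Price"]].reset_index(drop=True)
print(prepared)
```

The result has one row per complete observation, with a numeric `Month` and `Price` that a regression model could use.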
Note 1: US-Pumpinks.pynb.
MICROSOFT CLASS:
~EXPLORE AND ANALYSE DATA WITH PYTHON~
Data exploration and analysis is at the core of data science. Data scientists require skills in programming languages like Python to explore, visualize, and manipulate data.
Learning objectives
In this module, you'll learn:
- Common data exploration and analysis tasks.
- How to use Python packages like NumPy, Pandas, and Matplotlib to analyze data.
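As a rough sketch of how the three packages fit together, the example below uses Pandas to hold a small table, NumPy to summarise it, and Matplotlib to plot it. The prices and the output file name `monthly_prices.png` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical prices observed in three months.
prices = pd.DataFrame({
    "Month": [9, 9, 10, 10, 11],
    "Price": [20.0, 24.0, 18.0, 22.0, 16.0],
})

# NumPy: summary statistics on the raw values.
price_array = prices["Price"].to_numpy()
mean_price = float(np.mean(price_array))
std_price = float(np.std(price_array))

# Pandas: average price per month.
monthly_mean = prices.groupby("Month")["Price"].mean()

# Matplotlib: bar chart of the monthly averages.
fig, ax = plt.subplots()
ax.bar(monthly_mean.index.astype(str), monthly_mean.values)
ax.set_xlabel("Month")
ax.set_ylabel("Average price")
ax.set_title("Average price per month (hypothetical data)")
fig.savefig("monthly_prices.png")

print(mean_price, std_price)
print(monthly_mean)
```

In a notebook, the `matplotlib.use("Agg")` line and the `savefig` call can be dropped, since the figure is displayed inline.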