INTRODUCTION
The following are essential Python libraries for data science:
- NumPy
- Pandas
- Matplotlib
- Scikit-learn
- Statsmodels
- TensorFlow
- Keras
- PyTorch
NumPy
Description: NumPy (Numerical Python) is a fundamental package for scientific computing with Python. It provides support for arrays, matrices, and many mathematical functions to operate on these arrays.
Key Features:
- Efficient array computation.
- Mathematical functions for linear algebra, Fourier transforms, and random number generation.
- Integration with C/C++ and Fortran code.
Necessary Modules:
import numpy as np
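As a minimal sketch of the features listed above (the array values are made up for illustration), the following uses element-wise arithmetic, a linear algebra routine, and the random number generator:

```python
import numpy as np

# Element-wise (vectorized) arithmetic on a small array
a = np.array([1.0, 2.0, 3.0])
doubled = a * 2
print(doubled)

# Linear algebra: solve the system M @ x = b
M = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(M, b)
print(x)

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)
print(samples.shape)
```

The multiplication applies to every element at once, without an explicit Python loop.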
Pandas
Description: Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library built on top of the Python programming language.
Key Features:
- Data manipulation and data analysis.
- Data structures like Series and DataFrame.
- Time-series functionality.
Necessary Modules:
import pandas as pd
Matplotlib
Description: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.
Key Features:
- Supports various plots like line plots, bar plots, scatter plots, and histograms.
- Customizable plots.
Necessary Modules:
import matplotlib.pyplot as plt
Scikit-learn
Description: Scikit-learn is a free machine learning library for Python. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.
Key Features:
- Simple and efficient tools for data mining and data analysis.
- Built on NumPy, SciPy, and Matplotlib.
- Open-source and commercially usable under the BSD license.
Necessary Modules:
import sklearn
Statsmodels
Description: Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and exploring data. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator.
Key Features:
- Regression models.
- Time-series analysis.
- Nonparametric methods.
Necessary Modules:
import statsmodels.api as sm
TensorFlow
Description: TensorFlow is an open-source machine learning library developed by Google. It is used for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
Key Features:
- Highly efficient computation.
- Support for deep learning and machine learning.
Necessary Modules:
import tensorflow as tf
Keras
Description: Keras is an open-source neural network library written in Python. Its earlier multi-backend releases could run on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.
Key Features:
- User-friendly API.
- Modular and extensible.
- Support for convolutional and recurrent networks.
Necessary Modules:
import keras
PyTorch
Description: PyTorch is an open-source machine learning library originally developed by Facebook (now Meta). It is based on the Torch library and is primarily used for applications such as computer vision and natural language processing.
Key Features:
- Support for dynamic computation graphs.
- Highly efficient tensor computation.
Necessary Modules:
import torch
5 GETTING STARTED WITH PANDAS
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed numerical array data. The two main data structures in pandas are the Series and the DataFrame.
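To make the distinction concrete, here is a small sketch with made-up values: a Series holds one labeled column of data, while a DataFrame holds several columns that may each have a different type.

```python
import pandas as pd

# A Series: one-dimensional and labeled
ages = pd.Series([31, 45, 27], name="age")

# A DataFrame: two-dimensional, and columns may have different dtypes
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, 45, 27],
    "score": [88.5, 92.0, 79.5],
})

print(ages.ndim, people.ndim)
print(people.dtypes)

# Selecting one column of a DataFrame gives back a Series
print(type(people["age"]).__name__)
```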
Series
In Pandas, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a database table. Each element in a Series has an index, which is used to label the data.
The syntax for creating a Pandas Series is:
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Series Parameters:
- data (array-like, Iterable, dict, or scalar value, optional): The data for the Series. This can be a list, NumPy array, dictionary, or scalar value. If data is a dictionary, the keys are used as the index. If data is a scalar value, it is repeated for every label in index, so an index is normally provided.
- index (array-like, optional): The index (row labels) for the Series. Values must be hashable, and the index must have the same length as data; labels do not have to be unique. If not provided, a default integer index is used (0, 1, 2, …, n - 1).
- dtype (str or numpy.dtype, optional): Data type for the output Series. If not specified, the data type is inferred from the data.
- name (str, optional): The name to give to the Series.
- copy (bool, default False): Whether to copy the input data. This mainly matters for array inputs such as a NumPy array or another Series.
- fastpath (bool, default False): An internal parameter that should generally not be used.
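The following sketch exercises the data and index parameters with illustrative values: a scalar repeated over an explicit index, and the error raised when the index length does not match the data.

```python
import pandas as pd

# A scalar value is repeated for every label in the index
constant = pd.Series(7, index=["x", "y", "z"])
print(constant)

# The index must have the same length as the data
length_error = None
try:
    pd.Series([1, 2, 3], index=["a", "b"])
except ValueError as err:
    length_error = err
    print("ValueError:", err)
```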
Creating a Series:
Here's a simple example that imports pandas and creates a Series from a list:
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
In this example, the Series series contains integers with the default integer index ranging from 0 to 4.
Creating a Series with Custom Index:
You can also specify a custom index for the Series:
data = [10, 20, 30, 40, 50]
info = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=info)
print(series)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Creating a Series from a Dictionary:
A Series can also be created from a dictionary, where the keys become the index:
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
series = pd.Series(data)
print(series)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
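When a dictionary is passed together with an explicit index, pandas looks up each index label among the keys. The sketch below, with made-up labels, shows that this reorders the values, leaves out keys that are not requested, and fills labels missing from the dictionary with NaN.

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30}

# Request the labels in a different order, skip 'b', and ask for an unknown label 'z'
selected = pd.Series(data, index=['c', 'a', 'z'])
print(selected)

# Missing labels become NaN, which forces a floating-point dtype
print(selected.isna())
print(selected.dtype)
```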
Creating a Series with a Specified Data Type:
data = [1, 2, 3, 4, 5]
series = pd.Series(data, dtype='float64')
print(series)
Output:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Creating a Series with a Name:
data = [1, 2, 3, 4, 5]
series = pd.Series(data, name='my_series')
print(series)
print(series.name)
Output:
0 1
1 2
2 3
3 4
4 5
Name: my_series, dtype: int64
my_series
Accessing Data in a Series:
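You can access data in a Series using either the index label or the integer position. The example below uses the Series with the custom index created earlier:
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
# Using the index label
print(series['c'])  # Output: 30
# Using the position (.iloc makes positional access explicit)
print(series.iloc[2])  # Output: 30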
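Because square brackets can mean either a label or a position, pandas also offers the explicit accessors .loc (label-based) and .iloc (position-based). A brief sketch, reusing the same illustrative values:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Label-based access
print(series.loc['c'])

# Position-based access
print(series.iloc[2])

# Label slices include the end label; position slices exclude the end position
by_label = series.loc['b':'d']
by_position = series.iloc[1:3]
print(by_label)
print(by_position)
```

Using .loc and .iloc avoids ambiguity when the index labels are themselves integers.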
Basic Operations on Series:
A pandas Series supports many operations, such as arithmetic, applying functions, and filtering. The examples below use the Series with values 10 to 50 and index labels 'a' to 'e'.
- Arithmetic Operations:
series2 = series + 5
print(series2)
Output:
a 15
b 25
c 35
d 45
e 55
dtype: int64
- Applying Functions:
series3 = series.apply(lambda x: x * 2)
print(series3)
Output:
a 20
b 40
c 60
d 80
e 100
dtype: int64
- Filtering:
series4 = series[series > 30]
print(series4)
Output:
d 40
e 50
dtype: int64
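For simple arithmetic, a vectorized expression gives the same result as apply without calling a Python function for each element, and boolean conditions can be combined with & and |. A small sketch using the same illustrative Series:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Vectorized multiplication matches the element-wise apply
vectorized = series * 2
applied = series.apply(lambda x: x * 2)
print(vectorized.equals(applied))

# Combine conditions with & (and) or | (or); each condition needs parentheses
middle = series[(series > 10) & (series < 50)]
print(middle)

# Aggregations reduce the Series to a single value
total = series.sum()
average = series.mean()
print(total, average)
```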
Accessing Array Representation and Index of a Pandas Series:
In pandas, you can access the array representation and the index object of a Series using its .array and .index attributes, respectively. These attributes provide useful ways to work with the underlying data and the labels.
Array Representation
The .array attribute returns the underlying data of the Series as a pandas ExtensionArray, which is an abstraction over the actual data array (e.g., a NumPy array or another array-like object).
Index Object
The .index attribute returns the index (labels) of the Series, which can be used to access or modify the index labels.
# Accessing the array
obj = pd.Series([4, 7, -5, 3])
obj.array
Output:
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
# Accessing the index
obj = pd.Series([4, 7, -5, 3])
obj.index
Output:
RangeIndex(start=0, stop=4, step=1)
Another Example:
import pandas as pd
# Creating a Series
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
# Accessing the array representation
array_representation = series.array
print("Array Representation:")
print(array_representation)
# Accessing the index object
index_object = series.index
print("\nIndex Object:")
print(index_object)
Output:
Array Representation:
<NumpyExtensionArray>
[10, 20, 30, 40, 50]
Length: 5, dtype: int64
Index Object:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
The .array attribute allows you to work directly with the data values, while the .index attribute gives you access to the index labels; both can be crucial for various data operations. Note that attributes are accessed without parentheses, e.g. obj.array and obj.index.
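As a sketch of working with both attributes (the new labels are arbitrary), the index can be replaced by assigning a sequence of the same length, and to_numpy() returns the values as a plain NumPy array:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Replace the default RangeIndex by assignment; the length must match
obj.index = ['w', 'x', 'y', 'z']
print(obj)

# Get the values as a NumPy array
values = obj.to_numpy()
print(type(values).__name__, values)

# The index supports membership tests on labels
has_y = 'y' in obj.index
print(has_y)
```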