📚

Python for Data Analysis

INTRODUCTION

The following are the essential Python libraries for data science:

NumPy
Pandas
Matplotlib
Scikit-learn
Statsmodels
TensorFlow
Keras
PyTorch

NumPy

Description:

NumPy (Numerical Python) is a fundamental package for scientific computing with Python. It provides support for arrays, matrices, and many mathematical functions to operate on these arrays.

Key Features:

Efficient array computation.

Mathematical functions for linear algebra, Fourier transform, and random number generation.

Integration with C/C++ and Fortran code.

Necessary Modules:

import numpy as np
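The features listed above can be illustrated with a minimal sketch. The arrays and variable names here are made up for illustration and do not come from the book.

```python
import numpy as np

# Element-wise arithmetic on a whole array, with no explicit Python loop
a = np.array([1.0, 2.0, 3.0])
scaled = a * 2 + 1

# Linear algebra: solve the 2x2 system A @ x = b
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)  # five draws from a standard normal

# Fourier transform of a constant signal: only the zero-frequency bin is non-zero
spectrum = np.fft.fft(np.ones(4))

print(scaled, x, samples.shape, spectrum.real)
```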

Pandas

Description: Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library built on top of the Python programming language.

Key Features:

Data manipulation and data analysis.

Data structures like Series and DataFrame.

Time-series functionality.

Necessary Modules:

import pandas as pd
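As a rough illustration of the time-series functionality mentioned above, the sketch below builds a small Series on a daily date index and averages it in two-day bins. The dates and values are invented for the example.

```python
import pandas as pd

# A Series indexed by four consecutive days
dates = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# Downsample to two-day bins and take the mean of each bin
two_day_mean = ts.resample("2D").mean()
print(two_day_mean)
```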

Matplotlib

Description: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.

Key Features:

Supports various plots like line plots, bar plots, scatter plots, and histograms.

Customizable plots.

Necessary Modules:

import matplotlib.pyplot as plt

Scikit-learn

Description: Scikit-learn is a free machine learning library for Python. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.

Key Features:

Simple and efficient tools for data mining and data analysis.

Built on NumPy, SciPy, and matplotlib.

Open-source and commercially usable (BSD license).

Necessary Modules:

import sklearn

Statsmodels

Description: Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and exploring data. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

Key Features:

Regression models.

Time-series analysis.

Nonparametric methods.

Necessary Modules:

import statsmodels.api as sm

TensorFlow

Description: TensorFlow is an open-source machine learning library developed by Google. It is used for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.

Key Features:

Highly efficient computation.

Support for deep learning and machine learning.

Necessary Modules:

import tensorflow as tf

Keras

Description: Keras is an open-source neural network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

Key Features:

User-friendly API.

Modular and extensible.

Support for convolutional and recurrent networks.

Necessary Modules:

import keras

PyTorch

Description: PyTorch is an open-source machine learning library developed by Facebook. It is based on the Torch library and is primarily used for applications such as computer vision and natural language processing.

Key Features:

Support for dynamic computation graphs.

Highly efficient tensor computation.

Necessary Modules:

import torch

5 GETTING STARTED WITH PANDAS

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed numerical array data. The two main data structures in pandas are Series and DataFrame.
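The contrast between homogeneous and heterogeneous data can be seen in a small sketch. The column names and values are hypothetical and chosen only to show the dtypes.

```python
import numpy as np
import pandas as pd

# NumPy: one dtype for the whole array, so the integers are upcast to float64
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# pandas: each DataFrame column keeps its own dtype
df = pd.DataFrame({
    "name": ["Ann", "Bob"],   # text column
    "age": [31, 45],          # integer column
    "score": [88.5, 92.0],    # float column
})
print(df.dtypes)
```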

Series

In Pandas, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a database table. Each element in a Series has an index, which is used to label the data.

💡

Note that Series is written with a capital “S”.

The syntax for creating a Pandas Series is:

pandas.Series(data = None, index = None, dtype = None, name = None, copy = False, fastpath = False)

Series Parameters:

data (array-like, Iterable, dict, or scalar value, optional):

The data for the Series. This can be a list, NumPy array, dictionary, or scalar value. If data is a dictionary, the keys will be used as the index. If data is a scalar value, an index must be provided.

index (array-like, optional):

Values must be hashable and have the same length as data; non-unique values are allowed. This is the index (row labels) for the Series. If not provided, a default integer index is used (0, 1, 2, …, n - 1).

dtype (numpy.dtype, optional):

Data type for the output Series. If not specified, the data type will be inferred.

name (str, optional):

The name to give to the Series.

copy (bool, default False):

Copy the data. This is relevant for array-like or dictionary inputs.

fastpath (bool, default False):

This is an internal parameter and should generally not be used.
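A short sketch of how the main parameters behave in practice is given below. The variable names are illustrative only, and the internal fastpath parameter is deliberately left out.

```python
import pandas as pd

# Scalar data: the value is repeated for every label in the index
const = pd.Series(7, index=["x", "y", "z"])

# dtype and name set explicitly instead of being inferred
prices = pd.Series([1, 2, 3], dtype="float64", name="price")

# Repeated labels are accepted; selecting "a" returns both matching rows
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

print(const)
print(prices)
print(dup["a"])
```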

Creating a Series:

Here's a simple example to create a Pandas Series:

Import Pandas:

import pandas as pd

Create a Series from a list:

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

In this example, the Series series contains integers with the default integer index ranging from 0 to 4.

Creating a Series with Custom Index:

You can also specify a custom index for the Series:

data = [10, 20, 30, 40, 50]
info = ['a', 'b', 'c', 'd', 'e']  # custom labels, one per value
series = pd.Series(data, index=info)
print(series)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64

Creating a Series from a Dictionary:

A Series can also be created from a dictionary, where the keys become the index:

data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
series = pd.Series(data)
print(series)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64
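When a dictionary is combined with an explicit index, pandas selects and orders the values by that index and fills labels missing from the dictionary with NaN. The following is a minimal sketch with a made-up label 'z':

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30}

# 'c' and 'a' are looked up in the dict; 'z' is not a key, so it becomes NaN
subset = pd.Series(data, index=['c', 'a', 'z'])
print(subset)
```

Because NaN is a floating-point value, the resulting Series has dtype float64 rather than int64.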

Creating a Series with a Specified Data Type:

data = [1, 2, 3, 4, 5]
series = pd.Series(data, dtype='float64')
print(series)

Output:

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

Creating a Series with a Name:

data = [1, 2, 3, 4, 5]
series = pd.Series(data, name='my_series')
print(series)
print(series.name)

Output:

0    1
1    2
2    3
3    4
4    5
Name: my_series, dtype: int64

my_series

Accessing Data in a Series:

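You can access data in a Series by index label or by integer position:

# Recreate the labeled Series from the custom-index example
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Using the index label
print(series['c'])       # Output: 30

# Using the position; .iloc is explicit, while series[2] on a labeled index is deprecated
print(series.iloc[2])    # Output: 30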
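Label-based and position-based access can be made explicit with .loc and .iloc. The sketch below assumes the same five-element labeled Series used in the earlier examples.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['c']          # single value by label
by_pos = series.iloc[2]             # single value by position
several = series.loc[['a', 'e']]    # a list of labels returns a Series

# Label slices include both endpoints; position slices exclude the stop
label_slice = series.loc['b':'d']
pos_slice = series.iloc[1:3]

print(by_label, by_pos)
print(several)
print(label_slice)
print(pos_slice)
```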

Basic Operations on Series:

Pandas Series supports various operations like arithmetic operations, applying functions, filtering, etc.

Arithmetic Operations:

series2 = series + 5
print(series2)

Output:

a    15
b    25
c    35
d    45
e    55
dtype: int64
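Arithmetic between two Series aligns on index labels rather than on position. The sketch below uses two small made-up Series with partially overlapping labels to show the effect.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one operand produce NaN in the result
total = s1 + s2

# The add method with fill_value treats a missing label as 0 instead
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```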

Applying Functions:

series3 = series.apply(lambda x: x * 2)
print(series3)

Output:

a     20
b     40
c     60
d     80
e    100
dtype: int64
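For simple arithmetic like this, a vectorized expression usually gives the same result as apply without calling a Python function per element. The comparison below is a sketch on the same labeled Series, and it also shows that a NumPy function applied to a Series keeps the index.

```python
import numpy as np
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

doubled_apply = series.apply(lambda x: x * 2)  # calls the lambda once per element
doubled_vec = series * 2                       # one vectorized operation

# NumPy universal functions accept a Series and preserve its labels
roots = np.sqrt(series)

print((doubled_apply == doubled_vec).all())
print(roots.index.tolist())
```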

Filtering:

series4 = series[series > 30]
print(series4)

Output:

d    40
e    50
dtype: int64
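The filter works because the comparison produces a boolean Series that is then used as a mask. The sketch below, on the same labeled Series, shows the mask itself and how conditions can be combined with & (each condition needs its own parentheses).

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# The comparison alone yields one True/False per element
mask = series > 30

# Combine conditions with & (and) or | (or)
middle = series[(series >= 20) & (series < 50)]

# isin keeps elements whose value appears in the given list
picked = series[series.isin([10, 50])]

print(mask)
print(middle)
print(picked)
```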

Accessing Array Representation and Index of a Pandas Series:

In Pandas, you can access the array representation and the index object of a Series using its .array and .index attributes, respectively. These attributes provide useful ways to work with the underlying data and the labels.

Array Representation

The .array attribute returns the underlying data of the Series as a pandas ExtensionArray, which is an abstraction over the actual data array (e.g., a NumPy array or another array-like object).

Index Object

The .index attribute returns the index (labels) of the Series, which can be used to access or modify the index labels.

# Accessing the array
obj = pd.Series([4, 7, -5, 3])
obj.array

Output:

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

# Accessing the index
obj = pd.Series([4, 7, -5, 3])
obj.index

Output:

RangeIndex(start=0, stop=4, step=1)

Another Example:

import pandas as pd

# Creating a Series
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)

# Accessing the array representation
array_representation = series.array
print("Array Representation:")
print(array_representation)

# Accessing the index object
index_object = series.index
print("\nIndex Object:")
print(index_object)

Output:

Array Representation:
<NumpyExtensionArray>
[10, 20, 30, 40, 50]
Length: 5, dtype: int64

Index Object:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

💡

These attributes are useful for more advanced manipulations and for understanding the Series data structure. The .array attribute lets you work directly with the data values, while the .index attribute gives you access to the index labels; both can be crucial for various data operations. Note that attributes are written without parentheses, e.g., obj.array and obj.index.
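Two related operations are sketched below under the same small example: converting the values to a plain NumPy array with to_numpy(), which unlike .array and .index is a method and so takes parentheses, and replacing the index labels by assignment. The new labels are made up for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# to_numpy() is a method call and returns a NumPy ndarray
values = obj.to_numpy()
print(type(values))

# Assigning to .index replaces the labels in place; the length must match
obj.index = ['w', 'x', 'y', 'z']
print(obj)
```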