Python Tools for Data Analysis
lecturer: Yinghao Li
1. Python Installation: Anaconda
Info
If you don't feel like using the local environment, you can try Google Colab for a free online Python environment.
The examples are also available on Colab.
Skip this section if you already have a Python environment.
Anaconda is a complete, open source data science package with a community of over 6 million users. It is easy to download and install; and it supports Linux, macOS, and Windows (source).
In this tutorial, we'll use Miniconda for minimal installation. Please refer to this page for the difference between Anaconda and Miniconda and which one to choose.
1.1. Windows and macOS
- Download the latest Miniconda installer from the official website.
- Install the package according to the instructions.
- Start using the conda environment with Anaconda Prompt, or with other shells if you enabled that feature during installation.
Warning
Notice: to use the conda command in other shells/prompts, you need to add the conda directory to your PATH environment variable.
Info
Please refer to this page for more information about Anaconda installation on Windows and this page on macOS.
1.2. Linux with terminal
- Start the terminal.
- Switch to ~/Download/ with the command cd ~/Download/. If the path does not exist, create one using mkdir ~/Download/.
- Download the latest Linux Miniconda distribution using wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh.
- Once the download is complete, install Miniconda using bash Miniconda3-latest-Linux-x86_64.sh.
- Follow the prompts on the installer screens. If you are unsure about any setting, accept the defaults. You can change them later.
- To make the changes take effect, close and then re-open your terminal window, or use the command source ~/.bashrc.
- If you are using zsh or another shell, make sure conda is initialized. To do this, switch back to bash and type the command conda init <shell name>.
Info
Please refer to this page for more information about Anaconda installation on Linux.
1.3. Verify your installation
You can use the command conda list to check your conda installation.
If the terminal returns a list of Python packages, your installation is successful.
Info
Please refer to this page for more information.
1.4. Conda environment
With conda, you can easily create, remove, and update environments, each with an independent version of Python interpreter and Python packages. This is always desirable when you work on different Python projects with different (often conflicting) package dependencies. In this tutorial, we will use the default base environment. For more information on environment management, please refer to conda: managing environments.
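As a quick sketch of the typical workflow (the environment name my-env is just an illustration), the basic commands look like this:
conda create --name my-env python=3.9
conda activate my-env
conda deactivate
conda env list
conda env remove --name my-env
The first command creates an environment named my-env with its own Python 3.9 interpreter; the last one deletes it.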
2. Package Installation
If you are using Anaconda or Miniconda, you can use the Anaconda package manager conda.
You can also use other managers such as pip when the packages are not provided by any conda channels.
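For example, a package that is only distributed on PyPI can be installed with pip (the placeholder below stands for the actual package name):
pip install <package name>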
To look for a specific package, you can visit this website and type the name of that package in the search box.
For today's instruction, we need to install numpy, matplotlib, scikit-learn, and pandas.
First, switch to your conda environment using conda activate <env name> (not necessary if you are using the default base environment), then install those packages using the following command:
conda install -c conda-forge numpy matplotlib scikit-learn pandas
The package manager will automatically install the dependencies.
If you install scikit-learn first, which depends on numpy, you don't have to install numpy manually; the conda package solver will do it for you.
If you prefer a fancier and more powerful Python shell, you can choose to install ipython and jupyter notebook.
conda install -c conda-forge ipython
conda install jupyter
Jupyter notebook allows you to run your commands using the browser as an interface instead of the terminal.
3. Basic Python Concepts
Info
A more comprehensive tutorial can be found on the Stanford CS231n website.
We use Python >= 3.9 in this tutorial.
Warning
Notice that previous Python interpreter versions may behave differently. Please refer to the official document for more details.
First, in your terminal, type python, ipython, or jupyter notebook to start an interactive Python shell. ipython or jupyter notebook is recommended.
3.1. Variable definition, input, and output (print)
We do not need to specify the variable type while defining a variable. The interpreter will automatically infer the data type from the assigned value.
a = 123
b = '123'
c = "1234"
print(a, b, c, type(a), type(b), type(c))
A variable can be overwritten by a different type:
a = 123.456
print(type(a))
a = '123'
print(type(a))
The input function allows you to interactively pass information to the program through the CLI:
x = input('Input something: ')
print(x, type(x))
Info
The input function is rarely used unless you are developing CLI programs.
A more practical way to handle program arguments is argparse.
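A minimal argparse sketch (the script and argument names are illustrative, not part of this tutorial):
import argparse

parser = argparse.ArgumentParser(description='A minimal CLI example')
parser.add_argument('--name', default='world', help='who to greet')
args = parser.parse_args()
print(f'Hello, {args.name}!')
Saved as demo.py, it can be run with python demo.py --name Alice.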
3.2. List, tuple, set and dictionary
- List is a collection that is ordered and changeable. It allows duplicate members.
- Tuple is a collection that is ordered but not changeable. It also allows duplicate members.
- Set is a collection that is unordered and unindexed. It does not allow duplicate members. Elements in a set cannot be retrieved by index.
- Dictionary is a collection that is ordered, changeable, and indexed by keys. It does not allow duplicate keys.
Warning
Notice that dictionaries were unordered before Python 3.7.
_list = [1, 2, 1.2, '1', '2', 1] # this is a list
_tuple = (1, 2, 1.2, '1', '2', 1) # this is a tuple
_set = {1, 2, 1.2, '1', '2', 1} # this is a set
_dict = {  # this is a dict
    1: '111',
    2: '222',
    '1': 567,
    2.2: ['J', 'Q', 'K']
}
print(_list, '\n', _tuple, '\n', _set, '\n', _dict)
Access elements:
print(_list[0], _list[-2], _list[1: 3])
print(_tuple[1], _tuple[-2])
# Sets do not support indexing; the following line would raise a TypeError:
# print(_set[0], _set[-1])
print(_dict[1], _dict['1'], _dict[2.2])
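As a brief aside beyond the original example, dict.get lets you supply a default value instead of raising a KeyError for a missing key:
print(_dict.get(999, 'not found'))  # prints 'not found' since 999 is not a key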
Assignment does not copy a list; both names refer to the same object:
a = _list
a[0] = 888
print(a, '\n', _list)
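To get an independent copy instead, one option (a short sketch beyond the original example) is the list's copy method:
b = _list.copy()  # equivalent: list(_list) or _list[:]
b[0] = 777
print(b, '\n', _list)  # _list keeps its old value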
3.3. If else
if 888 not in _dict.keys():
    _dict[888] = '???'
elif 999 not in _dict.keys():
    _dict[999] = '!@#$%'
else:
    _dict['qwert'] = 'poiuy'
3.4. Loops
Info
Note: in Python, indentation is used to define a scope instead of curly brackets {}.
Usually, people use 4 spaces or one tab character \t as one level of indentation.
Be sure to keep it consistent throughout the file.
for loop:
for x in _list:
    print(x)

for i in range(len(_list)):
    print(_list[i])
while loop:
i = 0
while i != len(_list):
    print(_list[i])
    i += 1
3.5. Function
Define a function:
def my_func(x):
    x += 1
    print('in function: ', x)
    return x
Call a function:
t = 10
tt = my_func(t)
print(f'out of function, t: {t}, tt: {tt}')
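Note that t is unchanged because integers are immutable in Python. As a hedged aside (not part of the original example), mutable arguments such as lists can be modified in place, and the change is visible to the caller:
def append_one(lst):
    lst.append(1)  # modifies the caller's list in place

items = [0]
append_one(items)
print(items)  # prints [0, 1]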
4. Basic Numpy Usage
4.1. Array creation
A numpy array is a grid of values, all of the same type, indexed by a tuple of integers.
The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
We can initialize numpy arrays from nested Python lists, and access elements using square brackets:
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print(type(a), a.dtype)
print(a.shape)
print(a[1])
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print(b.shape)
print(b[0, 0], b[0, 1], b[1, 0])
Change the type of an array:
print(a.dtype)
a = a.astype(float)
print(a.dtype)
Other array creation methods:
a = np.zeros((2,2)) # Create an array of all zeros
print(a)
b = np.ones((1,2)) # Create an array of all ones
print(b)
c = np.full((2,2), 7, dtype=np.float32) # Create a constant array
print(c)
d = np.eye(3) # Create a 3x3 identity matrix
print(d)
e = np.random.random((3,3)) # Create an array filled with random values
print(e)
4.2. Array indexing
Similar to Python lists, numpy
arrays can be sliced.
# Create a rank 1 array and reshape it to a 3x4 matrix
a = np.arange(12).reshape(3, 4)
b = a[:2, 1:3]
print(a)
print(b)
# Slicing returns a view, not a copy: modifying b also modifies a
b[0, 0] = 888
print(a)
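If you need an independent sub-array rather than a view, you can copy the slice explicitly (a small sketch beyond the original example):
c = a[:2, 1:3].copy()  # c owns its own data
c[0, 0] = -1
print(a)  # a is unchanged this time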
You can mix integer indexing with slice indexing. However, integer indexing will yield an array of lower rank than the original array:
row_r1 = a[1, :] # Rank 1 view of the second row of a
row_r2 = a[1:2, :] # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)
You can also access array elements through lists of indices (integer array indexing):
x = [0, 1, 2]
y = [3, 1, 0]
print(a[x, y])  # selects a[0, 3], a[1, 1], a[2, 0]
Or through a boolean array:
b = a > 4
print(b)
print(a[b])
4.3. Array math
Basic mathematical functions operate element-wise on arrays, and are available both as operator overloads and as functions in the numpy module:
x = np.arange(1, 5, dtype=float).reshape(2, 2)
y = np.arange(5, 9, dtype=float).reshape(2, 2)
print(x)
print(y)
# Elementwise sum
print(x + y)
print(np.add(x, y))
# Elementwise difference
print(x - y)
print(np.subtract(x, y))
# Elementwise product
print(x * y)
print(np.multiply(x, y))
# Elementwise division
print(x / y)
print(np.divide(x, y))
# Elementwise square
print(x ** 2)
print(np.power(x, 2))
# Elementwise square root
print(x ** 0.5)
print(np.sqrt(x))
Matrix multiplication is realized by np.dot or the @ operator:
x = np.arange(1, 5, dtype=float).reshape(2, 2)
y = np.arange(5, 9, dtype=float).reshape(2, 2)
print(x)
print(y)
v = np.array([9, 10], dtype=float)
w = np.array([11, 12], dtype=float)
# Inner product
print(v.dot(w))
print(np.dot(v, w))
print(v @ w)
# Matrix / vector product
print(x.dot(v))
print(np.dot(x, v))
print(x @ v)
# Matrix / matrix product
print(x.dot(y))
print(np.dot(x, y))
print(x @ y)
Warning
Attention: np.dot() and @ behave differently when the arrays have rank larger than 2.
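A small shape experiment illustrates the difference (a sketch using all-ones arrays for simplicity):
a3 = np.ones((2, 3, 4))
b3 = np.ones((2, 4, 5))
print((a3 @ b3).shape)       # (2, 3, 5): @ performs batched matrix multiplication
print(np.dot(a3, b3).shape)  # (2, 3, 2, 5): dot sums over the last axis of a3 and the second-to-last axis of b3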
Numpy also provides functions for performing computations within an array:
print(np.sum(x)) # Compute sum of all elements; prints "10.0"
print(x.sum()) # same as above
print(np.sum(x, axis=0)) # Compute sum of each column; prints "[4. 6.]"
print(np.sum(x, axis=1)) # Compute sum of each row; prints "[3. 7.]"
To transpose a matrix, use the T attribute of an array object:
print(x.T)
If you have an array of rank greater than 2, you can use np.transpose to specify how to permute the axes:
x = np.arange(24).reshape(2, 3, 4)
print(x.transpose(1, 0, 2).shape)
5. Using Matplotlib for Visualization
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib qt
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y)
plt.show() # You must call plt.show() to make graphics appear.
Info
Note: for jupyter notebook, you can use the command %matplotlib inline
to make the graphics embedded in the editor or %matplotlib qt
to make them pop out.
To plot multiple lines at once, and add a title, legend, and axis labels:
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()
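If you also want to write the figure to a file, matplotlib provides plt.savefig; call it before plt.show(), since some backends clear the figure after showing it (the file name here is just an illustration):
plt.savefig('sine_cosine.png', dpi=150)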
You can plot different things in the same figure using the subplot
function. Here is an example:
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)
# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')
# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')
# Show the figure.
plt.show()
6. Pandas and Scikit-Learn for Data Science
In this section, we will look at a data science example that uses pandas as the data management tool and scikit-learn (sklearn) for the algorithm implementations. This section is modified from this tutorial.
6.1. Import packages
import numpy as np
import pandas as pd
# automatically split the data into training and test set
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# classifiers and regressors
from sklearn.ensemble import RandomForestRegressor
# Construct a Pipeline from the given estimators
from sklearn.pipeline import make_pipeline
# Exhaustive search over specified parameter values for an estimator.
from sklearn.model_selection import GridSearchCV
# Training objective and evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
# For model persistence
# you can use `from sklearn.externals import joblib` if your sklearn version is earlier than 0.23
import joblib
6.2. Load data
You can download the data by clicking the link or by using wget:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
and move the file to your current folder.
Then, load the csv data into memory through pandas:
data = pd.read_csv('winequality-red.csv', sep=';')
Or, you can load the data directly through its URL:
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
You can also load datasets stored in other formats with pandas.
A detailed document is at pandas: io.
6.3. Take a look at the loaded data
The loaded data is stored as a pandas.core.frame.DataFrame.
To get a peek at the data, we can use
print(data)
This will return a nice-looking preview of the elements in the DataFrame.
To view the name of the features of a DataFrame, one can use
print(data.keys())
To access one column, i.e., all instances of a feature (e.g., pH), one can use
# These will return the same result
print(data['pH'])
print(data.pH)
To access a row, you need the DataFrame.iloc attribute:
print(data.iloc[10])
We can also easily print some summary statistics:
print(data.describe())
6.4. Split data
First, let's separate our target (y) feature from our input (X) features and divide the dataset into training and test sets using the train_test_split function:
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
Stratifying your sample by the target variable will ensure your training set looks similar to your test set, making your evaluation metrics more reliable.
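As an illustrative check (not part of the original tutorial), you can compare the quality distributions of the two splits and confirm they are nearly identical:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))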
6.5. Pre-processing
Standardization is the process of subtracting the mean of each feature and then dividing by the feature's standard deviation. It is a common requirement for machine learning tasks: many algorithms assume that all features are centered around zero and have approximately the same variance.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# To verify that the training and test sets have (nearly) zero mean and unit standard deviation
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))
6.6. Fit the model
If we do not need to fine-tune the hyperparameters, we can define a random forest regression model with the default hyperparameters and fit the model using
regr = RandomForestRegressor()
regr.fit(X_train_scaled, y_train)
To examine the performance, we use the test set to calculate the scores:
pred = regr.predict(X_test_scaled)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))
6.7. Define the cross-validation pipeline
Fine-tuning hyperparameters is an important task in machine learning, since a set of carefully chosen hyperparameters may greatly improve the performance of the model.
In practice, when we set up the cross-validation pipeline, we won't even need to manually fit the data. Instead, we'll simply declare the class object, like so:
pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100)
)
To check the hyperparameters, we may use
print(pipeline.get_params())
or refer to the official document.
Now, let's declare the hyperparameters we want to tune through cross-validation.
hyperparameters = {
    # note: 'auto' may not be accepted by newer scikit-learn versions; drop it if you see an error
    'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1]
}
Then, setting up a 10-fold cross-validation is as simple as
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
Finally, we can automatically fine-tune the model using
clf.fit(X_train, y_train)
After fitting, if we want to check the best hyperparameters, we can use
print(clf.best_params_)
As before, we evaluate the fitted model on the test set. By default, GridSearchCV refits the model on the entire training set with the best hyperparameters (refit=True), so we can call clf.predict directly:
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))
6.8. Save and load models
After training, we may want to save the trained model for future use. For this purpose, we can use
joblib.dump(clf, 'rf_regressor.pkl')
When you want to load the model again, simply use this function:
clf2 = joblib.load('rf_regressor.pkl')
# Predict data set using loaded model
clf2.predict(X_test)
Info
A more comprehensive example of scikit-learn can be found here.