Introduction to NumPy, TensorFlow, and scikit-learn

Thomas Daniels

Jun 23, 2020

CPOL

In this article, we take a quick look at NumPy and TensorFlow and give a short overview of the scikit-learn library.

Download Module8.zip - 1.4 KB

This is the eighth and last module in our series on Python and its use in machine learning and AI. In the previous one, we discussed neural networks with Keras. Now we’re going to take a quick look at NumPy and TensorFlow. Because they’re the building blocks of machine learning libraries, you'll definitely come across them at some point. If you’re an enterprise developer, you won't be writing complete solutions with just these libraries (it takes much longer and is harder to maintain). That would be more for data scientists, dedicated AI/ML engineers, and developers of higher-level ML libraries. Nevertheless, it's a good idea to take a look at the lower-level libraries to see what they're about.

In this module, I'll also give a short overview of the scikit-learn library, because it's the most complete machine learning (excluding deep learning) library in the Python ecosystem.

Installation

If you went through the previous modules, everything you need is already installed!

NumPy

As noted in Module 4, the core of NumPy is its N-dimensional arrays, and it also offers features such as linear algebra and Fourier transforms. A NumPy array is a very common input value in functions of machine learning libraries. Therefore, you’ll often use NumPy directly when you have a dataset in one specific format and you have to transform it into another format. You might also receive a NumPy array as the result of a library function call.

A NumPy array, in as many dimensions as you want, can be directly created from nested lists, nested tuples, or a combination of those, as long as the dimensions make sense.

import numpy as np
arr = np.array([ [1, 2, 3], (4, 5, 6) ])
print(arr[0, 1])

Here, we're importing numpy under the shorter alias np, which is an accepted and very common practice.

Also, the index 0, 1 in arr[0, 1] is a tuple: the first value selects the row and the second selects the column.
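To make the tuple index concrete, here is a minimal sketch that rebuilds the same array and inspects it. The shape and ndim attributes are standard NumPy; the variable name index is only chosen for illustration.

```python
import numpy as np

# Same array as in the snippet above: two rows, three columns.
arr = np.array([[1, 2, 3], (4, 5, 6)])

print(arr.shape)  # (2, 3)
print(arr.ndim)   # 2
print(arr[0, 1])  # 2

# Passing the tuple explicitly selects the same element.
index = (0, 1)
print(arr[index])  # 2
```

Writing arr[0, 1] is simply shorthand for arr[(0, 1)], which is why both lines print the same value.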

NumPy arrays have slices that let you take a row or a column:

# returns the first row as a one-dimensional vector
print(arr[0, :])
# returns the first column as a one-dimensional vector 
print(arr[:, 0])

The same syntax works with a greater number of dimensions as well (though it's harder to speak of "rows" and "columns" here):

arr = np.array([ [ [1, 2, 3], [4, 5, 6] ], 
                 [ [7, 8, 9], [10, 11, 12] ] ])
print(arr[:, :, 0]) # [[ 1,  4], [ 7, 10]]
print(arr[1, :, 0]) # [ 7, 10]

NumPy's indexing and slicing are even more powerful than this. Check out the NumPy indexing reference for a more complete overview.
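As a small taste of that extra power, the sketch below shows three further indexing forms on a hypothetical 2x3 array: a boolean mask, a list of column indices, and a slice with a step. It is only a sample, not a full tour of what the reference covers.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Boolean mask: keep only the elements greater than 2.
# The result is a one-dimensional array.
greater_than_two = arr[arr > 2]
print(greater_than_two)  # [3 4 5 6]

# Integer-array indexing: pick columns 2 and 0, in that order.
reordered = arr[:, [2, 0]]
print(reordered)  # rows: [3 1] and [6 4]

# Slice with a step: every second column.
every_second = arr[:, ::2]
print(every_second)  # rows: [1 3] and [4 6]
```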

NumPy arrays can be stacked horizontally or vertically (if the dimensions are correct) with hstack and vstack, both taking a tuple of arrays as the argument (get the number of parentheses right!):

arr1 = np.array([ [ 1, 1 ], [ 1, 1 ]])
arr2 = np.array([ [ 2, 2 ], [2, 2]])
print(np.hstack((arr1, arr2)))
print(np.vstack((arr1, arr2)))
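The following sketch spells out what "if the dimensions are correct" means in practice. It reuses arr1 and arr2 from above and adds a hypothetical arr3 with a single row, which can be stacked vertically but not horizontally.

```python
import numpy as np

arr1 = np.array([[1, 1], [1, 1]])
arr2 = np.array([[2, 2], [2, 2]])

side_by_side = np.hstack((arr1, arr2))
on_top = np.vstack((arr1, arr2))
print(side_by_side.shape)  # (2, 4)
print(on_top.shape)        # (4, 2)

# Hypothetical array with one row and two columns.
arr3 = np.array([[3, 3]])

# Vertical stacking only needs matching column counts.
print(np.vstack((arr1, arr3)).shape)  # (3, 2)

# Horizontal stacking needs matching row counts, so this fails.
try:
    np.hstack((arr1, arr3))
    hstack_failed = False
except ValueError as error:
    hstack_failed = True
    print("ValueError:", error)
```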

A powerful method of NumPy is reshape. As the name implies, it changes the shape of an array. Here is a reshape example:

vector = np.array([ 1, 2, 3, 4, 5, 6, 7, 8, 9 ])
matrix = vector.reshape((3, 3))

The argument to reshape is the new shape, a tuple of the desired dimensions. This is a rather simple example, but you can also use it for reshaping from and to arrays with more dimensions. Elements are read from the original array in a certain index order and written to a new array in the same index order. Refer to the reshape documentation to learn more about index orders.
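To illustrate the index order, here is a short sketch on a hypothetical six-element vector. It compares the default row-major order ("C") with column-major order ("F"), and also uses -1 to let NumPy infer one dimension.

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5, 6])

# Default order "C": the last index changes fastest (rows are filled first).
row_major = vector.reshape((2, 3))
print(row_major)  # rows: [1 2 3] and [4 5 6]

# Order "F": the first index changes fastest (columns are filled first).
column_major = vector.reshape((2, 3), order="F")
print(column_major)  # rows: [1 3 5] and [2 4 6]

# A dimension given as -1 is inferred from the number of elements.
inferred = vector.reshape((3, -1))
print(inferred.shape)  # (3, 2)
```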

TensorFlow

For working with neural networks at a high level, we looked at Keras in the previous module, Introduction to Keras. TensorFlow sits at a lower level: at its core, it is a library for tensor computations.

A tensor is a generalization of scalars, vectors, and matrices to any number of dimensions:

A 0-Tensor is a scalar
A 1-Tensor is a vector
A 2-Tensor is a matrix
A 3-Tensor is... just a 3-Tensor.

And so on.
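The list above can be sketched with NumPy arrays, assuming that the number of dimensions of an array (its ndim attribute) plays the role of the tensor rank. This is an analogy for illustration, not TensorFlow code.

```python
import numpy as np

scalar = np.array(5)                   # 0 dimensions
vector = np.array([1, 2, 3])           # 1 dimension
matrix = np.array([[1, 2], [3, 4]])    # 2 dimensions
cube = np.zeros((2, 3, 4))             # 3 dimensions

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(name, value.ndim, value.shape)
```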

Tensors can hold any kind of data: integers, floats, strings, and more. Although you usually won’t encounter these when using a high-level library such as Keras, it's still interesting to look at them because they’re the foundational building block of TensorFlow.

What's the difference, then, between a NumPy array and a tensor? Both objects represent more or less the same data, but a tensor is immutable.
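A minimal sketch of that difference, assuming TensorFlow is installed: assigning to an element works on a NumPy array, while the same assignment on a tensor created with tf.constant raises a TypeError.

```python
import numpy as np
import tensorflow as tf

array = np.array([1, 2, 3])
array[0] = 10  # in-place assignment works on a NumPy array
print(array)   # [10  2  3]

tensor = tf.constant([1, 2, 3])
try:
    tensor[0] = 10  # tensors do not support item assignment
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print("TypeError:", error)
```

When a value does need to change over time, such as a model weight, TensorFlow provides tf.Variable for that purpose.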

TensorFlow can perform various operations on tensors. Here is an example that starts with three matrices, performs a matrix multiplication on the first two, adds the third matrix to that, and inverts the result.

import tensorflow as tf
a = tf.constant([ [ 0.6, 0.1 ], [ 0.4, -0.3 ] ])
b = tf.constant([ [ 1.2, 0.7 ], [ 0.9, 1.1 ] ])
c = tf.constant([ [ -0.1, 0.2 ], [ 0.3, 0.1 ] ])

d = tf.matmul(a, b)
e = tf.add(c, d)
f = tf.linalg.inv(e)

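sess = tf.Session() # TensorFlow 1.x API
result = sess.run(f) # a NumPy array

In TensorFlow 1.x, the operations are not performed immediately. The code before the session creation only constructs a graph of operations; the result is computed when the session runs that graph. TensorFlow 2.x executes operations eagerly by default and no longer exposes tf.Session at the top level (it lives on as tf.compat.v1.Session), so there each operation runs as soon as its line is reached and f.numpy() returns the result as a NumPy array.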
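The three steps of the TensorFlow example (matrix multiplication, addition, inversion) can be cross-checked in plain NumPy. This sketch is not how TensorFlow evaluates the graph; it only reproduces the arithmetic with the same input values so that the result can be verified.

```python
import numpy as np

a = np.array([[0.6, 0.1], [0.4, -0.3]])
b = np.array([[1.2, 0.7], [0.9, 1.1]])
c = np.array([[-0.1, 0.2], [0.3, 0.1]])

d = a @ b             # matrix multiplication, like tf.matmul
e = c + d             # element-wise addition, like tf.add
f = np.linalg.inv(e)  # matrix inverse, like tf.linalg.inv

print(f)

# Sanity check: a matrix times its inverse gives the identity matrix.
print(np.allclose(e @ f, np.eye(2)))  # True
```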

scikit-learn

scikit-learn is a broad library offering many traditional machine learning methods (very roughly said: everything except deep learning). If it isn't installed yet, you can install it with pip in a Jupyter Notebook cell:

!pip install scikit-learn

Considering the breadth of the library, we won’t focus on one specific code example, but instead give an overview of what you can expect from this library.

scikit-learn offers both supervised and unsupervised learning methods. Supervised means you have an expected output for every input in your training set; unsupervised means you don't, and you let the algorithm draw its own conclusions. The main features for supervised learning are classification (identifying categories) and regression (predicting a continuous value), by means of algorithms such as support-vector machines, random forests/decision trees, nearest neighbors, naive Bayes, and more. Unsupervised learning is mainly focused on clustering (automatic grouping based on features), using algorithms such as k-means and mean-shift.

Aside from the learning functionality itself, scikit-learn offers ways to validate, evaluate, and compare models, as well as tools to preprocess your input data. A lot is left out here, so I invite you to take a look at its User Guide for a complete overview.
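To give a feel for the API, here is a minimal sketch that touches the areas mentioned above: a supervised classifier, an evaluation metric, a preprocessing step, and an unsupervised clustering algorithm. The choice of the bundled iris dataset, the random forest, and the parameter values are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small example dataset that ships with scikit-learn:
# 150 flowers, 4 features each, 3 classes.
features, labels = load_iris(return_X_y=True)

# Supervised learning: hold out a test set, train on the rest, then evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Unsupervised learning: scale the features, then group the samples
# into three clusters without looking at the labels.
scaled = StandardScaler().fit_transform(features)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(clusters[:10])
```

Most scikit-learn estimators follow this same pattern of fit followed by predict or transform, so swapping in another algorithm usually changes only the line that constructs the estimator.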

Conclusion

We barely scratched the surface of NumPy, TensorFlow, and scikit-learn, but now you have an idea of what they can do and why they’re important in Python's machine learning ecosystem. With the end of this module, we’ve also reached the finish line of our series. You are now armed with the fundamental knowledge to leverage the various AI/ML-related libraries in Python. Thank you for reading!