Deep learning Note 2 — Part 2 Optimization

In the last post we talked about regularization methods to improve deep neural networks. This week I will summarize two more topics from the course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization: Setting up your optimization problem and Optimization Algorithms.

Setting up your optimization problem

Assuming that we have defined our cost function, it’s time to jump right into optimizing it, right? Well, not so fast. There are a few things to consider first.


Before we feed the training data into the network, we need to normalize it (subtract the mean and normalize the variance of each feature).


The advantage of normalization is that it makes the cost function easier to optimize: we can use a larger step size/learning rate, which speeds up convergence.


One thing worth noting: make sure to normalize the test and validation sets using the mean and variance obtained from the training set. There is usually no harm in normalizing the input data, so just do it.
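
As a rough illustration of the point about reusing the training statistics, here is a minimal NumPy sketch. The helper names (fit_normalizer, normalize) and the synthetic data are made up for this example, and it assumes one example per row.

```python
import numpy as np

def fit_normalizer(x_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set only."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0) + eps  # eps guards against constant features
    return mu, sigma

def normalize(x, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma

# Synthetic data (hypothetical): rows are examples, columns are features.
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))
x_test = rng.normal(loc=5.0, scale=3.0, size=(200, 2))

mu, sigma = fit_normalizer(x_train)
x_train_norm = normalize(x_train, mu, sigma)
x_test_norm = normalize(x_test, mu, sigma)  # reuse the training statistics
```

The test set is deliberately not re-fitted: it goes through exactly the same transformation as the training data.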

Random weights initialization

One problem with deep neural networks is vanishing/exploding gradients: the derivatives/slopes can become very large or very small, which makes training difficult.


The screenshot above shows that if W is greater than 1, y will explode, while if W is smaller than 1, y will vanish to a very small number. One way to (partially) solve this is to randomly initialize the weights and scale them according to the size of the layer.


Using these scaling methods, the activations will not explode or vanish too quickly, which allows us to train relatively deep neural networks.
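
The original figure with the scaling formulas is not available here, so the sketch below is an assumption about what it showed: the commonly used variance scaling, sqrt(2 / n_in) for ReLU layers (often called He initialization) and sqrt(1 / n_in) for tanh layers (Xavier-style). The function name init_layer is made up for this example.

```python
import numpy as np

def init_layer(n_out, n_in, activation="relu", seed=0):
    """Randomly initialize one layer, scaling by the number of inputs n_in."""
    rng = np.random.default_rng(seed)
    # Assumed scaling: 2/n_in for ReLU, 1/n_in for tanh.
    scale = np.sqrt(2.0 / n_in) if activation == "relu" else np.sqrt(1.0 / n_in)
    W = rng.standard_normal((n_out, n_in)) * scale
    b = np.zeros((n_out, 1))
    return W, b

W, b = init_layer(n_out=64, n_in=256)
```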

Gradient check

If we are implementing back-propagation ourselves, we need this check to make sure that our implementation is correct. To do that we compute a numerical approximation of the gradients. For a simple function f(theta), the gradient at theta can be approximated by the two-sided difference [f(theta + epsilon) - f(theta - epsilon)] / (2 * epsilon).


This version is more accurate than the one-sided [f(theta + epsilon) - f(theta)] / epsilon.


However, if we are just using deep learning frameworks, we don’t really have to do this. It is mainly for debugging a hand-written implementation of back-propagation.
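
Here is a minimal sketch of such a check, assuming a scalar-valued function of a parameter vector. The toy function, the helper names and the relative-difference measure are illustrative choices, not the course's exact code.

```python
import numpy as np

def numerical_gradient(f, theta, epsilon=1e-7):
    """Two-sided difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = epsilon
        grad.flat[i] = (f(theta + step) - f(theta - step)) / (2 * epsilon)
    return grad

def relative_difference(grad_a, grad_b):
    """Normalized distance between two gradients; small means they agree."""
    return np.linalg.norm(grad_a - grad_b) / (
        np.linalg.norm(grad_a) + np.linalg.norm(grad_b))

# Toy example (hypothetical): f(theta) = sum(theta^3), gradient 3 * theta^2.
f = lambda theta: np.sum(theta ** 3)
analytic_grad = lambda theta: 3 * theta ** 2

theta = np.array([1.0, -2.0, 0.5])
diff = relative_difference(numerical_gradient(f, theta), analytic_grad(theta))
```

A very small diff suggests the analytic gradient is right, while a large one usually points to a bug in the backward pass.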

Things to watch out for

  • Gradient Checking is slow! Approximating the gradient is computationally costly. For this reason, we don’t run gradient checking at every iteration during training. We run it just a few times to check that the gradient is correct, then turn it off and use backprop for the actual learning process.
  • Gradient Checking, at least as we’ve presented it, doesn’t work with dropout. We would usually run the gradient check algorithm without dropout to make sure backprop is correct, then add dropout.

Optimization Algorithms

We’ve learned about one optimization algorithm: Gradient Descent (GD). There are other algorithms out there, mostly performance improvements on GD.

Batch VS. Mini-Batch VS. Stochastic Gradient Descent


  • Batch gradient descent uses all the training data at once, so there is only one gradient descent step per epoch.
  • Mini-batch gradient descent uses a subset of the training data at a time. Therefore, there are multiple batches and thus multiple gradient descent steps per epoch.
  • SGD, at the extreme, uses 1 example at a time (batch_size = 1).

Why mini-batch GD?

Batch gradient descent is slow since it uses all the data for every step. SGD oscillates a lot and loses the speed benefit of vectorization. Mini-batch is somewhere in between: it shows progress relatively fast and still uses vectorization.


Why are there oscillations?


This is because one batch may be easy to learn while the next one could be difficult. If we are using batch gradient descent and the cost increases in even one iteration, that could mean we have a problem, for example a step size that is too big.

Mini-Batch size

  • If the training set is small, don’t bother using mini-batches; just use the whole batch to train the network.
  • Otherwise, shuffle the training data first and choose a batch size between 32 and 512 (a power of 2) or whatever fits in memory. If a particular mini-batch size causes problems, check the memory cost.
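
The shuffle-then-slice recipe can be sketched as follows. This is an illustrative helper (random_mini_batches is a made-up name) that assumes one example per column, so X is (n_features, m) and Y is (1, m).

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]  # same permutation keeps pairs aligned
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Tiny hypothetical dataset: 10 examples with 2 features each.
X = np.arange(20).reshape(2, 10).astype(float)
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, batch_size=4)
```

With 10 examples and a batch size of 4, the last mini-batch is smaller than the others; one epoch is one loop over the returned list.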

Faster than Gradient Descent?

There are optimization algorithms faster than Gradient Descent: Momentum, RMSProp, and Adam. To understand them we need something called the exponentially weighted average (EWA). It is the key component of these advanced optimization algorithms. The idea is to compute an EWA (or some other combination) of the gradients and use that to update your weights instead. These algorithms have proven effective in deep learning practice, especially the last two.

Exponentially weighted averages

I won’t go into much detail, but it is a way to approximate the average over the data of the last 1/(1-beta) days. In the screenshot below, the red line is the actual temperature while the green line is the moving average. The average line is smoother and adapts more slowly when the temperature changes since it averages over a larger window of days. The larger beta is, the more weight we give to the previous days’ average, which makes it less sensitive to changes in today’s data theta.


Note that this is just an approximation of the moving average (so it is less accurate) but it has the advantage of using less memory since we don’t have to store the past X days of data. For efficiency, it is widely used in machine learning.

Can it be more accurate?

There is a way to make it more accurate called bias correction. If we use the original equation above, the average at the very beginning could be much smaller than the real average. The fix is to divide the average by 1 - beta^t to compensate for the offset at the beginning of the averaging line. When t is small, 1 - beta^t is close to 0, so it scales up the average. As t becomes larger, 1 - beta^t gets closer to 1 and thus has less effect.


In practice, people often don’t bother to do bias correction. They would rather just wait for the initial period to pass. However, if you are concerned about the average at the beginning, bias correction is the way to go.
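
To make the effect of bias correction concrete, here is a small sketch of the EWA recurrence v = beta * v + (1 - beta) * theta, started from v = 0. The function name and the constant temperature series are made up for illustration.

```python
def ewa(values, beta=0.9, bias_correction=False):
    """Exponentially weighted average of a sequence, starting from v = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Optionally divide by (1 - beta^t) to undo the bias toward the zero start.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

temps = [10.0] * 5                        # hypothetical constant temperature
raw = ewa(temps)                          # starts far below 10 and climbs slowly
corrected = ewa(temps, bias_correction=True)
```

On this constant series the uncorrected average starts at about 1.0 even though every reading is 10, while the corrected one stays at about 10 from the first step.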

Gradient Descent with Momentum

The idea of momentum is to compute an EWA of the gradient and use that to update your weights instead. Let’s see an example to get an intuition for this algorithm.


Gradient descent (the blue line) takes a lot of steps to get to the minimum. These up-and-down oscillations slow gradient descent down and prevent you from using a large learning rate (you may shoot too far in the wrong direction and end up diverging). In the horizontal direction, we want faster learning. Using momentum, we can achieve that goal. Since momentum is based on an EWA, it smooths out the oscillation in the vertical direction but keeps the horizontal direction intact. In this way we move faster to the minimum, and we can use a higher learning rate. Vdw and Vdb are both initialized to zeros.

Hyperparameters in Momentum

There are two hyper-parameters for Momentum. One is the learning rate alpha and the other is the beta term we’ve seen in the EWA. A good default value for beta is 0.9 (about a 10-day average).


Momentum will almost always work better than gradient descent.
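
A minimal sketch of the momentum update, using the EWA form of the velocity described above. The elongated quadratic bowl, the learning rate and the function names are illustrative assumptions, not taken from the course.

```python
import numpy as np

def momentum_step(w, grad, v, alpha=0.1, beta=0.9):
    """One update: v is the EWA of the gradients, and w moves along v."""
    v = beta * v + (1 - beta) * grad
    w = w - alpha * v
    return w, v

# Toy cost (hypothetical): f(w) = 0.5 * (w0^2 + 25 * w1^2), steep along w1.
grad_f = lambda w: np.array([1.0, 25.0]) * w

w = np.array([10.0, 1.0])
v = np.zeros_like(w)        # velocity initialized to zeros
for _ in range(500):
    w, v = momentum_step(w, grad_f(w), v)
```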

RMSprop (Root Mean Square prop)

Similar to Momentum, RMSprop also updates the weight using some form of the EWA of the gradient, illustrated as follows:


See the square of dW and the square root of SdW in purple? That’s where the name Root Mean Square comes from. The intuition here is that for directions where the slope is large, Sd(W1, W2, W3…) will be large and cause a smaller update (damping the oscillation), while for directions where the slope is small, Sd(W4, W5, W6…) will be small and cause a larger update. This applies to both W and b, and note that W and b are both high dimensional. The example using W and b as the directions is a bit confusing though.
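
The following sketch shows one RMSprop step on a made-up gradient with one steep and one flat direction. The hyper-parameter values and names are illustrative assumptions. Epsilon is the usual small constant that avoids division by zero.

```python
import numpy as np

def rmsprop_step(w, grad, s, alpha=0.01, beta=0.9, epsilon=1e-8):
    """One update: s is the EWA of the squared gradients."""
    s = beta * s + (1 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(s) + epsilon)
    return w, s

w = np.array([1.0, 1.0])
s = np.zeros_like(w)
grad = np.array([10.0, 0.1])   # steep direction vs. flat direction (hypothetical)
w_new, s_new = rmsprop_step(w, grad, s)
step = w - w_new               # how far each coordinate actually moved
```

Although the two gradient components differ by a factor of 100, the resulting steps are nearly equal, which is the damping/boosting effect described above.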


Adam combines the two algorithms above together as shown below. Not much to say here but it has proven to be very effective.
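
As a sketch of how the two ideas combine, the step below keeps a momentum-style average v and an RMSprop-style average s, and applies bias correction to both. The default values (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) are the commonly cited ones. The toy cost and the learning rate of 0.05 are assumptions for this example.

```python
import numpy as np

def adam_step(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update at step t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * grad        # momentum part
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop part
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return w, v, s

# Toy cost (hypothetical): f(w) = sum(w^2), gradient 2 * w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.05)
```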



Based on the assignment of the course, we found that

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.

Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • Usually works well even with little tuning of hyperparameters (except α )

Hyper-parameter choices

These are the hyper-parameters for the Adam algorithm. In practice, people just use the default values of beta1, beta2 and epsilon. You still need to tune alpha by trying a range of values.


Learning Rate Decay

One way to speed up the learning process is to use learning rate decay. This sounds a bit counter-intuitive at first, but here’s the explanation: if we don’t reduce the learning rate, we may end up wandering around the minimum and never converge. A better way is to use a large learning rate at first and then slowly reduce it, so that near the minimum we oscillate in a tighter region instead of overshooting by a lot.


(Correction: the learning rate for the first epoch here should be the largest. Andrew may have made a mistake here.) There are other ways of decaying the learning rate. Please refer to the CS231N notes, which have a very detailed explanation.
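
One common schedule, and as far as I recall the one used in the course, is alpha = alpha0 / (1 + decay_rate * epoch_num). The sketch below assumes that formula and counts epochs from 0, so the first epoch gets the largest rate. The numbers are made up.

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate=1.0):
    """Inverse-time decay: the rate shrinks as the epoch number grows."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

# Hypothetical settings: alpha0 = 0.2, decay_rate = 1.0, epochs 0..3.
schedule = [decayed_learning_rate(0.2, epoch) for epoch in range(4)]
```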

Local Optima?

People worry about their algorithm getting stuck in a local minimum. However, our understanding of this issue is changing in the deep learning era.

Saddle Point

The issue is that if you are plotting a cost function with your weights in two dimensions, you do get a lot of local optima. But in training neural network, most points with zero gradients are not local optima. They are actually saddle points. See the difference between a bowl shape and a saddle shape on the left side of the slide in red.


In a high-dimensional space, say 20000 dimensions, a real local optimum requires the cost to curve like a bowl (rather than a saddle) in every dimension at a point with zero gradient. If each dimension were independently equally likely to curve up or down, the probability of that would be 0.5^20000, which is vanishingly small. So maybe high dimensionality does have an advantage.

But seriously, this is a dinosaur, not a horse…



Deep Learning Note 2 — Part 1 Regularization

This week I started the second course, Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization, in the Coursera Deep Learning track. The notes in this post focus on the content of Setting up your machine learning application and Regularizing the network. This post is also published on Steemit under the username @steelwings.


Applied ML is a highly iterative process of Ideas -> Code -> Experiment and repeat. We have to accept and deal with it, but we certainly should make this process easier. I guess that’s what the machine learning platforms out there are built for.

Before tuning the parameters

One question we have to answer is: what should we tune the parameters on? If you come from a machine learning background, you probably already know that the data should be split into 3 parts.

Train/Dev (Validation)/Test sets

  • Tune the parameters or try different algorithms on the Dev set
  • Once you are happy, evaluate on the final test set. The final test set should only be touched once (besides applying the same data transformation you did on the training set).

Split ratio

  • Traditionally, we might use 60/20/20.
  • In the era of big data, as long as we have enough validation and test records, the percentage doesn’t matter. If your task only needs 10000 records for validation and testing, then the rest can all be used for training. Although, how do you know how many records you need for validation or testing?

Mismatched train/test distribution

One thing to watch out for is that we want our training and testing data to come from the same source/distribution. For example, if we have a cat image recognition application that was trained on good-quality cat pics from the web, but most of our app users upload low quality/resolution pictures from their cellphones, then we have a mismatch between the training and test data and our model’s performance could suffer because of that. This also seems like a red flag in the requirement gathering and use case analysis.

Bias and Variance

One aspect of tuning the model is handling high bias and/or variance. Unlike in traditional machine learning, they are not necessarily a trade-off in the deep learning era. Traditionally people talked about the bias/variance tradeoff because there were no (or few) good ways to reduce one without increasing the other.

How high is high?

High bias or variance is a relative concept, using the optimal (Bayes) error as the bar. If humans get 1% error, then 15% from the model is too high. But if humans get 13% error, then 15% from the model is not that bad.


The worst of both worlds

It is possible for our model to have high bias and high variance at the same time, especially in a high dimensional space, where it can under-fit some regions but overfit others.


Basic recipe to handle the problem


  • We start with checking whether the model has high bias on the training data set.
    • If yes, we can try a bigger network, train longer (or use more advanced optimization algorithms without training longer), or search for an NN architecture that suits the problem.
    • If no, then we can move on to check if the model has high variance on the dev/validation set.
  • Checking high variance is to see if the model is overfitting the training data.
    • If yes, then we can try getting more data, applying regularization or doing a NN architecture search that suits the problem.

In traditional machine learning, people have to handle the situation carefully to lower bias or variance without increasing the other. In NN, as long as we have a well-regularized network, training a bigger network almost never hurts (except for computational time). Getting a bigger network almost always reduces the bias without necessarily hurting the variance as long as we regularize properly. On the other hand, getting more data for the network almost always reduces the variance and doesn’t hurt the bias.

Regularization to reduce high variance

L2 Regularization


Some people say L1 Regularization compresses the model, but in practice it only helps a bit. L2 is used much more often in practice. The lambda here is called the regularization parameter, which controls how much regularization we want to apply. The value of λ is a hyperparameter that you can tune using a dev set. L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to “oversmooth”, resulting in a model with high bias.

What is L2-regularization actually doing?

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost function to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes. L2 regularization is also called weight decay because the term (1 - alpha * lambda / m) * W in the update shows that W is pushed toward smaller values at each gradient descent step (See the screenshot below).
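
To see where the weight decay factor comes from, here is a small sketch assuming the usual L2 penalty lambda / (2m) * sum(W^2), whose gradient is (lambda / m) * W. The function names and numbers are made up. The data gradient is set to zero so that only the decay is visible.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Extra cost term: lambda / (2m) times the sum of squared weights."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def update_with_l2(W, dW_data, alpha, lambd, m):
    """Gradient step where the penalty adds (lambda / m) * W to the gradient."""
    dW = dW_data + (lambd / m) * W
    return W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW_data = np.zeros_like(W)   # pretend the data gradient is zero to isolate the decay
W_new = update_with_l2(W, dW_data, alpha=0.1, lambd=0.7, m=10)
penalty = l2_penalty([W], lambd=0.7, m=10)
```

With a zero data gradient, W_new is exactly (1 - alpha * lambda / m) * W, i.e. every weight shrinks by the same factor.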


Why does regularization help reduce overfitting?

One (not so accurate but easy to understand) intuition is that if we set a very large lambda, it will drive the W to near 0, much like zeroing out a lot of nodes in the network, resulting in a simpler network. In reality, we still use all the nodes but they just have a smaller effect and we end up with a simpler network.


Another intuition: say we use tanh as the activation function. If we increase lambda, W becomes smaller and so does the value of Z. If Z stays in a small range around 0, the tanh is close to linear. If every node or layer is like this, the whole network becomes close to linear, which is simpler.


Dropout Regularization

How does it work?

At each iteration, you shut down (= set to zero) each neuron of a layer with a probability 1−keep_prob or keep it with probability keep_prob. The dropped neurons don’t contribute to the training in both the forward and backward propagations of that iteration. Basically, when neurons are shut down, we treat their outputs as zeros and keep it that way during both the forward and backward propagation.

Why does it work?

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that neuron might be shut down at any time, so it has to spread out the weights. This has a similar effect to L2 regularization: it shrinks the weights.

Dropout is used almost by default in Computer Vision because there is usually not enough data for computer vision applications. But this doesn’t always apply to other applications. If your model is not overfitting, you shouldn’t apply dropout.


Inverted Dropout

One commonly used dropout technique is called inverted dropout. During training we divide the activations of each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. If we don’t scale up, the expected activations differ between training and testing, and the predictions will be negatively impacted.

In the testing phase, we do not apply dropout; otherwise we would just be adding noise to our predictions. In theory, we could add dropout to testing but that requires us to repeat the prediction many times and then take an average. It gives a similar result but at a higher computational cost.
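
A minimal sketch of inverted dropout for a single layer's activations. The function name dropout_forward and the all-ones activations are made up for illustration.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Inverted dropout: zero units at random, then rescale by 1 / keep_prob."""
    if not training:
        return A, None                       # no dropout at test time
    mask = rng.random(A.shape) < keep_prob   # True = keep the unit
    A = A * mask / keep_prob                 # rescale to keep the expected value
    return A, mask                           # reuse the mask in the backward pass

rng = np.random.default_rng(0)
A = np.ones((100, 1000))                     # hypothetical activations
A_train, mask = dropout_forward(A, keep_prob=0.5, rng=rng)
A_test, _ = dropout_forward(A, keep_prob=0.5, rng=rng, training=False)
```

With keep_prob = 0.5 the surviving activations are doubled, so the mean of A_train stays close to the mean of A.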

Other regularizations

Data augmentation

Suppose our model is overfitting and we want to get more data. However, it may be impossible or too expensive to do so. One way in computer vision is to take your images and randomly transform, rotate or flip them to generate “new” images. It is not as good as getting real new data (as the images are still duplicates in some sense) but it is a much cheaper way to get more data.

Early Stopping

Stop the training process early to get mid-size weights.
But it has a downside as explained below.


Orthogonalization (One task at a time)

In machine learning we already have so many hyper-parameters to tune, so it’s easier to think clearly when we have one set of tools for optimizing the cost function J and focus first on minimizing J. Then, as a completely separate task, we want the model not to overfit, and we have a separate set of tools for that.

  • Optimize a cost function J
    • Using optimization algorithms like gradient descent, momentum, Adam and so on.
  • Make sure your model does not overfit.
    • Regularization, getting more data and so on.

Early stopping couples the two problems mentioned above, so we can no longer work on them independently. An alternative to early stopping is to just use L2 regularization; then we can train the network as long as possible. We do have to try different values of lambda though, assuming we can afford the computation to do so. To be fair, an advantage of early stopping is that you have one less hyper-parameter to tune, lambda.


Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. Some key points:

  • Regularization will help you reduce overfitting.
  • Regularization will drive your weights to lower values.
  • L2 regularization and Dropout are two very effective regularization techniques.


Deep Learning Note – Neural Networks and Deep Learning

I recently signed up for the Deep Learning Specialization on Coursera and have just completed the first course, Neural Networks and Deep Learning. Although 4 weeks of study are recommended, with some background in Machine Learning and the help of 1.5x play speed, finishing it in 1 week is also achievable. In this post I just want to summarize some of the take-aways for myself and hope it also helps whoever’s reading it. If you are familiar with the implementation of neural networks from scratch, you can just skip to the last section for Tips and Best practices mentioned in the course. Note that this post is also posted on Medium and Steemit under the username @steelwings.

Scale of the problem


  • On a small training set, a neural network may not have a big advantage. If you can come up with good features, traditional machine learning algorithms can still achieve better results than a neural network.
  • However, as the amount of data grows, the traditional learning algorithms can hit a plateau while the neural network’s performance keeps increasing. The larger the neural network, the larger the increase.
  • As the neural network grows, it takes more resources and time to compute. Many innovations in neural network algorithms were initially driven by performance requirements.


Why does the traditional algorithm hit a plateau? To be continued…

Logistic Regression as a Neural Network

Almost all the books or courses on Deep Learning start with either logistic regression or linear regression. This is because they can be treated as the simplest neural network with no hidden layer and only 1 neuron.


Using logistic regression as an example, we introduce the following components:

  • Scoring function: Z = ⟨w, x⟩ + b (the inner product of w and x, plus the bias b)
  • Activation function: A = g(Z) where g here is the sigmoid function.
  • Loss/Error function: L(y^, y): the function we use to evaluate the difference between y^ and y where y^ is the prediction. Note that the loss function is applied to a single example.
  • Cost function: the average of the loss over the whole dataset.

Logistic regression computes the probability of y = 1 given x. There are many choices for the loss function. One of them is the mean squared error, but combined with the sigmoid it makes the cost non-convex, so it could have multiple local minima. The loss used in the course is therefore the cross-entropy loss.
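
The four components can be sketched in a few lines of NumPy. The shapes follow the column-per-example convention, and the tiny dataset and function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_and_cost(w, b, X, Y):
    """X: (n, m), Y: (1, m), w: (n, 1). Returns predictions and the cost."""
    m = X.shape[1]
    Z = w.T @ X + b                                      # scoring function
    A = sigmoid(Z)                                       # activation
    loss = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))    # cross-entropy per example
    cost = np.sum(loss) / m                              # average over the dataset
    return A, cost

# Hypothetical data: 2 features, 3 examples.
X = np.array([[1.0, 2.0, -1.0], [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
A, cost = forward_and_cost(w, 0.0, X, Y)
```

With all-zero parameters every prediction is 0.5, so the cost equals ln 2 regardless of the labels.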


Maximum Likelihood Estimation

If we look at m examples in the training set and assume that they are independent and identically distributed (iid), training a logistic regression model means computing the probability of the labels given all the x and finding the parameters that maximize it.


Neural Network

A neural network is really just a collection of connected neurons. Each neuron shares the same components:

  • Scoring function Z = WX + b
  • Activation A = g(Z)


Zoom out for a more complete view. This is the forward propagation for a single example through the first layer of the neural network.


Note that W[1] here represents the W matrix for the 1st layer, where each row is the transpose of w for one neuron. The number inside the brackets [] represents the layer index. The input layer can be treated as layer 0.

Shallow Neural Network

If the neural network has only 1 hidden layer, then we call it a shallow neural network.

Hidden layer

So why is it called hidden? We know the input and output in the training set. However, we do not know those for the layers between the input and output layer so we call them hidden layers.

Deep Neural Network

If the network has more than 1 hidden layer, it’s a deep neural network.

Tips and Best Practices

Notation Cheatsheet

Part 1

This part summarizes the notation and representations used for neural networks. I know that there are other forms out there but I really think this is the cleanest way, at least for my brain. The key point is that anything related to an example is represented in a COLUMN.


Let’s say that you have m examples and each example has n features. Then the input matrix X is n by m. The target matrix Y is 1 by m. Technically we can think of the target Y as a vector, but treating it as a 1 by m matrix will make our life easier down the road.

Another thing to pay attention to is that an index in parentheses refers to the ith example.

Part 2


One take-away here is the dimension of W and b for layer L. If layer L has n(L) units and layer L-1 has n(L-1) units, then:

  • W is in the shape of n(L) by n(L-1).
  • b is in the shape of n(L) by 1.
  • dW has the same dimension as W.
  • db has the same dimension as b.
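
These shape rules can be checked with a quick forward pass. The layer sizes, the tanh activation and the dictionary keys below are arbitrary choices for illustration.

```python
import numpy as np

layer_sizes = [4, 5, 3, 1]   # n(0) input features, then units per layer (hypothetical)
m = 7                        # number of examples, one per column
rng = np.random.default_rng(0)

params = {}
for l in range(1, len(layer_sizes)):
    # W is n(l) by n(l-1), b is n(l) by 1
    params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((layer_sizes[l], 1))

A = rng.standard_normal((layer_sizes[0], m))     # input X is n(0) by m
for l in range(1, len(layer_sizes)):
    Z = params[f"W{l}"] @ A + params[f"b{l}"]    # b broadcasts across the m columns
    A = np.tanh(Z)
    assert A.shape == (layer_sizes[l], m)        # activations are n(l) by m
```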

Building a neural network

Preprocessing the data (using image classification as an example)

Common steps for pre-processing a new dataset are:

  • Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
  • Reshape the datasets such that each example is now a vector of size (num_px * num_px * 3, 1). This is the image_to_vector processing.
  • “Standardize” the data. We center and standardize the dataset, meaning that we subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. This helps the gradient descent process.

The whole modelling process (and how to build it from scratch)

The main steps for building a Neural Network are:

  1. Define the model structure (such as number of input features, number of layers and other hyper parameters)
  2. Initialize the model’s parameters
  3. Loop:
    • Calculate current loss (forward propagation)
    • Calculate current gradient (backward propagation)
    • Update parameters (gradient descent)

You often build steps 1-3 separately and integrate them into one function we call model(). We can break it down even further by defining separate activation and propagate functions.

How to initialize the parameter

For logistic regression, we can initialize all the parameters (W and b) to zeros. However, we cannot do this for a neural network since all the neurons in a layer would then compute the same thing and stay identical. If all neurons are identical, there is no point in using a neural network. We have to randomly initialize the parameters to break the symmetry between neurons. As a general rule, we can initialize them in the following way:

  • W = np.random.randn(d1, d2) * 0.01 # we use small values (not necessarily 0.01) here because with sigmoid or tanh, large weights produce large scores, where the gradient is close to zero, which slows down training.
  • b = np.zeros((d1, 1)) # b does not have any symmetry issue.

Activation function

The scoring function WX + b is linear but it is the nonlinear output that makes neural networks powerful to learn complex structures in data.

Sigmoid and Tanh function

These two functions both suffer from the problem that when the input value z is very large or very small, the gradient of the function becomes close to 0, which slows down the training process. Tanh has one advantage over the sigmoid function: its activation values lie in (-1, 1) and are centred at 0, which is easier for gradient descent.

For binary classification, using the sigmoid function as the activation of the unit on the last layer is fine, but for other layers, we may want to consider the following two types.

Rectified Linear Unit (ReLu)

This is one of the most widely used activation functions in practice. A = max(0, Z)

Leaky ReLu
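
For comparison, here are both activations as a small NumPy sketch. The leaky slope of 0.01 is a common choice but an assumption here, since the original figure is not available.

```python
import numpy as np

def relu(z):
    """A = max(0, Z), applied element-wise."""
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of becoming 0."""
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)
leaky_out = leaky_relu(z)
```

The only difference is on the negative side: ReLU outputs 0 there, while the leaky version keeps a small non-zero gradient.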


Performance and Code Quality

  • Use vectorization whenever you can. It will speed up the computation drastically.
  • DO NOT use numpy 1-D arrays. They may cause unexpected bugs down the road. Even if the data is a single row or column, use a matrix to represent it (either 1-by-N or N-by-1).

Hyper-parameter tuning

Learning rate

A good learning rate can help the network avoid getting stuck in a local minimum. One mechanism is called Bold Driver as described below:

To resolve this you can check the value of the error function by using the estimated parameters of the model at the end of each iteration. If your error rate was reduced since the last iteration, you can try increasing the learning rate by 5%. If your error rate was actually increased (meaning that you skipped the optimal point) you should reset the values of Wj to the values of the previous iteration and decrease the learning rate by 50%. This technique is called Bold Driver.
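
A rough sketch of the Bold Driver rule on a one-dimensional toy problem. The 5% increase and 50% decrease come from the quote above, while the function names, the quadratic loss and the starting learning rate are made up for illustration.

```python
def bold_driver(loss_fn, grad_fn, w, lr=0.1, steps=50, grow=1.05, shrink=0.5):
    """Gradient descent whose learning rate adapts after every step."""
    prev_loss = loss_fn(w)
    for _ in range(steps):
        candidate = w - lr * grad_fn(w)
        loss = loss_fn(candidate)
        if loss < prev_loss:
            w, prev_loss = candidate, loss   # accept the step
            lr *= grow                       # and be a bit bolder next time
        else:
            lr *= shrink                     # reject the step: keep the old w, halve lr
    return w, lr

# Toy problem (hypothetical): minimize (w - 3)^2, starting with a too-large lr.
w_final, lr_final = bold_driver(lambda w: (w - 3.0) ** 2,
                                lambda w: 2.0 * (w - 3.0),
                                w=0.0, lr=1.5)
```

Rejecting a step by not updating w is equivalent to resetting the weights to their previous values.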

The number of units in the hidden layer

There are no hard rules about this, but here is some general guidance from this Quora post:

  • The number of hidden nodes in each layer should be somewhere between the size of the input and output layer, potentially the mean.
  • The number of hidden nodes shouldn’t need to exceed twice the number of input nodes, as you are probably grossly overfitting at this point.

Keras Tutorial Notes

I’ve recently finished the first pass of CS231N Convolutional Neural Networks for Visual Recognition. Now it’s time to try out a library to get my hands dirty. Keras seems to be an easy-to-use high-level library, which wraps 3 different backend engines: TensorFlow, CNTK and Theano. Just perfect for a beginner in Deep Learning.

The tutorial I picked is the one on the MNIST dataset. I’m adding some notes along the way to refresh my memory on what I have learned as well as some links so that I can find the references in CS231N quickly in the future.

Step 1 Importing libraries and prepare parameters for training

'''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

Not much to explain on the import statements, so let’s look at some of the parameters defined in this section.

What are batch_size and epochs?

A good explanation can be found in Training a Model from DL4J. An epoch means training the model on all of your data once, i.e. a single pass over the whole dataset. Why do we need to train the model for multiple epochs?

To answer this question, we need to know what happens in the training process of a Neural Network. Using the example in CS231N, it is minimizing the loss function using gradient descent. One gradient descent update most likely won’t give you the minimal loss, so we have to do multiple passes until it converges or hits a pre-set limit, for example the number of epochs. Of course, not all machine learning algorithms require multiple passes like this, for example the K-Nearest Neighbour (K-NN) algorithm.

Now let’s talk about batch_size. It relates to how we train the model, specifically how to optimize the loss function. In the naive form, we compute the loss function over the whole dataset. Quoted from CS231N:

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

However if we have millions of records, it becomes wasteful and inefficient to repeatedly compute the loss function to do a simple gradient update. Therefore, a common way to solve the scalability issue is to compute the gradient over batches of training data.

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

So why does this work? To quote from the course note:

“….the gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.”

Gradient descent using mini-batches like this is called Minibatch Gradient Descent (MGD). Strictly speaking, Stochastic Gradient Descent (SGD) is the special case where the batch size is 1, but in practice people usually say SGD even when they mean the mini-batch version.

  • One question I have: with epoch and batch_size, does this mean that we update the gradient with SGD multiple times in one epoch?
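One way to check the question above is to count the updates directly. The sketch below is plain Python, not Keras internals; `iterate_minibatches` and `updates_per_epoch` are hypothetical helpers, and the sketch assumes each epoch visits every example exactly once in shuffled order, with a smaller final batch.

```python
import math
import random


def iterate_minibatches(num_samples, batch_size, rng):
    """Yield lists of sample indices that together cover one epoch."""
    indices = list(range(num_samples))
    rng.shuffle(indices)  # visit the examples in a random order
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)


rng = random.Random(0)
# One gradient update would happen per yielded batch.
num_updates = sum(1 for _ in iterate_minibatches(60000, 128, rng))
print(num_updates)  # 469 for the 60,000 MNIST training images
```

Under these assumptions the weights are updated many times within a single epoch: 469 times here, and 12 × 469 = 5,628 times over the 12 epochs.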

Set the image dimension

We specified the image dimensions in the code, which raised two questions:

  • Do all the images in the dataset have to have the same dimensions?
  • I assume that if they don’t, we will have to resize them to the same size. How? Doesn’t the resizing distort the proportions of the subject in the image?

Step 2: Prepare the dataset for training and testing

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Specify the depth of the images

Since we are using a CNN, one important step is to arrange the neurons in 3D (width, height and depth). I’ll skip the details, but the depth here in the code is 1. That means our images have only 1 channel, instead of 3 (RGB channels).

Normalize the mean and standard-deviation

It seems that the code above doesn’t perform this processing except for the two lines below:

x_train /= 255
x_test /= 255

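These two lines only rescale the pixel values from the 0 to 255 range to the 0 to 1 range; they don’t subtract the mean or divide by the standard deviation. As a guideline: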

“Normally we would want to preprocess the dataset so that each feature has zero mean and unit standard deviation, but in this case the features are already in a nice range from -1 to 1, so we skip this step.”
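For comparison, here is a minimal sketch of the zero-mean, unit-standard-deviation preprocessing that the quote describes, applied to a single feature. It is not part of the Keras example: `fit_standardizer`, `standardize` and the toy values are made up for illustration, and the small `eps` is an assumption added to avoid dividing by zero for a feature that never changes.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute the mean and standard deviation from the training set only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    return [(v - mu) / (sigma + eps) for v in values]


# Toy values for one feature (one pixel position), already divided by 255.
train_feature = [0.0, 0.2, 0.4, 0.6, 0.8]
test_feature = [0.1, 0.9]

mu, sigma = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mu, sigma)
test_scaled = standardize(test_feature, mu, sigma)  # reuse the training statistics
print(round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

The test values are scaled with the mean and standard deviation obtained from the training values, which is the rule noted earlier for normalizing validation and test sets.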

Preprocess the class labels

We need the class label of each record to be a 10-dimensional array. This matches the output of the model: its scoring function produces a 10-dimensional array in which each value is the score assigned to a particular class. If we look at the raw labels, we find them in a 1-dimensional array of integers. Hence the conversion.

print(y_train[:10])  # raw labels, before the to_categorical conversion
# [5 0 4 1 9 2 1 3 1 4]
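To make the conversion concrete, here is a small pure-Python sketch of what a one-hot ("binary class matrix") encoding looks like. `to_one_hot` is a hypothetical stand-in written for illustration; it is not the Keras implementation of `to_categorical`.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


one_hot = to_one_hot([5, 0, 4], num_classes=10)
print(one_hot[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Each row has the same length as the 10-dimensional score array the model outputs, so the two can be compared element by element in the loss function.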

Step 3: Define the model structure

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))


A Sequential model is a linear stack of layers. Here we added 8 layers. Why do we add these layers but not others? I don’t know. People spend a great deal of time trying out different network architectures. If we are just starting out, we might just rely on architectures that have proven to be useful, like the examples provided by Keras.

Layer Patterns

……A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters……We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture…..

So why do we use Convolutional layers instead of the regular ones? In short, to solve performance and scalability issues as well as to avoid overfitting when processing full images.

The links above covered Conv2D, MaxPooling, and Dense layers. What about Dropout and Flatten here?

At this point, the model structure is defined. We then specify the loss function, the way to optimize it and the measurement metric in the compile method.

Step 4: train the model and test it

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Not much to explain here. We train the model with the training data. However, one concern I have with this piece is that the model validates itself on the test data after each epoch and is then evaluated on the same test data to get the score. What we should do is to have a dedicated validation set split from the training set (as suggested in the course notes, data used for validation is considered already burned during training; see the last point in the summary). Therefore, using the option validation_split may be a better idea.
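As far as I know, Keras takes the validation_split fraction from the end of the arrays passed to fit, before any shuffling. The sketch below shows the same idea in plain Python so the test set stays untouched until the final evaluation; `hold_out_tail` and the stand-in data are hypothetical, not Keras code.

```python
def hold_out_tail(samples, labels, validation_fraction):
    """Split off the last fraction of the data as a validation set."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    split_at = int(len(samples) * (1.0 - validation_fraction))
    train = (samples[:split_at], labels[:split_at])
    validation = (samples[split_at:], labels[split_at:])
    return train, validation


samples = list(range(60000))        # stand-ins for the 60,000 training images
labels = [i % 10 for i in samples]  # stand-in labels
(train_x, train_y), (val_x, val_y) = hold_out_tail(samples, labels, 0.1)
print(len(train_x), len(val_x))  # 54000 6000
```

With a split like this, the per-epoch validation numbers come from held-out training data, and the test set is only used once for the final score.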

Root-causing the random failures in the integration tests with ElasticSearch

In our recent development we were creating an integration test framework and some tests for manipulating data in the ElasticSearch cluster. Strangely the tests could succeed or fail randomly, even though we never made any changes to the code on the business logic at that time.

What did we have in the test cases?

  • @BeforeClass: load the test data into ElasticSearch cluster through ElasticSearch TransportClient.
  • @Test: retrieve test data and check equality on some fields.
  • @AfterClass: clean up the test data through ElasticSearch TransportClient.

Really, it was just as simple as this.

What did the error message say when the tests failed? Well, it complained about not being able to find the test data.

Strange. The @BeforeClass annotated method should always load the data into the cluster before executing the test cases, and there were no errors about failing to load data. Feeling a bit stuck, I commented out the cleanup code in the @AfterClass method. Now the tests passed on every run, but once I added the cleanup code back, they started failing occasionally again, especially when I started a run right after the previous one finished.

This got me thinking: “Could it be possible that the test data was cleaned up at the end of the previous test but not loaded into the cluster in the next run even though @BeforeClass method was executed? ”

My suspicion was confirmed after some reading on how ElasticSearch loads data. Why did this happen? Because loading data into an ElasticSearch cluster takes time, and so does deleting it. The test cases were executed right after the load request was issued in the @BeforeClass method, but not necessarily after the request had been processed by the cluster. In other words, the load is asynchronous. We made the false assumption that the load request was processed and the data was present in the cluster immediately. This mindset may be OK in unit tests, but in integration tests it can be problematic.

Stupid solution: Add a buffer before actually executing the tests, for example, Thread.sleep(30000) in the @BeforeClass method. However, this does not always guarantee the data was loaded if the data size is large.

Better solution: Send a follow-up request to verify that the load request has actually been processed, given the request id. Wait in the @BeforeClass method until it has finished.

Whatever you do, make sure that the test data are actually in the cluster before moving on.
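The waiting idea can be sketched as a small polling helper. This is an illustration in Python rather than the Java test code, and everything in it is hypothetical: `wait_until` is a made-up helper, and `data_is_searchable` stands in for whatever request checks that the test data is visible in the cluster.

```python
import time


def wait_until(condition, timeout_seconds=30.0, poll_interval_seconds=0.5):
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Stand-in for the cluster: the data becomes visible on the third check.
checks = {"count": 0}


def data_is_searchable():
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_until(data_is_searchable, timeout_seconds=5.0, poll_interval_seconds=0.01)
print(ready, checks["count"])  # True 3
```

Compared with a fixed sleep, polling returns as soon as the data is there and fails explicitly when the timeout is reached, so the setup method can abort instead of letting the tests fail later with a confusing message.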



How to use JavaConfig Bean in Spring XML

Our current project is at the first stage, where we wire all the components together and do a simple integration test. When I took on this task, I found that all beans were defined in XML. Given the number of beans I have to create, it would be tedious to write them all in XML. Personally I prefer JavaConfig to XML files, as navigation is easier for me in JavaConfig. But I don’t want to change the XML configurations into JavaConfig all at once. Can I define JavaConfig beans and use them in the XML?

A bit of searching revealed a simple way. Now assume that we have a provider class as follows:


@Configuration
public class ResourceProvider {
    @Bean
    public SQSWrapper sqsWrapper() {
        return new SQSWrapper();
    }
}

Assume that we have an application.xml file and we want to use the SQSWrapper Bean in a bean definition in the file:

<bean id="SQSConsumer" class="">
    <constructor-arg ref="THE_ID_OF_THE_SQSWRAPPER_BEAN"/>
</bean>

To do that we need to add two extra lines to the file and then we specify the id of the SQSWrapper bean by using the method name sqsWrapper. The complete xml file looks like this:


<context:annotation-config/>
<!-- The following line brings in the beans defined in the ResourceProvider -->
<bean class="" />

<bean id="SQSConsumer" class="">
    <constructor-arg ref="sqsWrapper"/>
</bean>

The first line, “annotation-config”, is crucial, as noted in this Stack Overflow answer: “while annotation-config is switched on, the container will recognize the @Configuration annotation and process the @Bean methods declared in JavaConfig properly”.

Now that saved me from creating more xml files!

Randomly Draw k unique integers out of an array of N unique integers

Given an array of n unique integers (1 to n), write an algorithm to draw a random combination of k distinct numbers (n >= k). (This problem comes from Core Java Vol I as an example.)

Unknown: A way to draw k distinct integers out of  n distinct integers.

Data: An array of integers 1 to n.

Constraint: k numbers must be distinct and randomly picked.

A straightforward solution would be:

  1. Randomly pick one number out of the array.
  2. If this number has not been picked before, add it to the result. Return the result if we have k numbers.
  3. Otherwise, back to step 1.
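The three steps above can be sketched as follows. This is an illustration in Python, not the code from Core Java or from the link at the bottom; `draw_with_retries` is a hypothetical name.

```python
import random


def draw_with_retries(numbers, k, rng=random):
    """Draw k distinct values by retrying whenever a pick is a repeat."""
    if k > len(numbers):
        raise ValueError("k must not exceed the number of values")
    result = []
    while len(result) < k:
        candidate = numbers[rng.randrange(len(numbers))]  # step 1: random pick
        if candidate not in result:  # step 2: linear scan of the result so far
            result.append(candidate)
        # step 3: otherwise loop around and pick again
    return result


print(draw_with_retries(list(range(1, 11)), 4, random.Random(42)))
```

The membership check and the retries are the parts that get more expensive as the result set fills up.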

Q: So what is the time complexity of this solution?

A: If we are unlucky, in the worst case, O(k^2), and if k is close to n, O(n^2).

Q: How so?

A: At some point, we will have a problem selecting a number that’s not in the result set. The first number is easy: just one pick. The second, if unlucky, takes two. The third, if unlucky, takes three, since the first two picks hit something already in the result set. So for k numbers it can take 1+2+3+…+k picks, which is approximately O(k^2). If k is close to n, then we have an O(n^2) algorithm. Check the link at the bottom for the code.

Q: Alright, can we make it faster? Say let’s make it O(n) time and you cannot use additional space except for the result set.

A: Hmm, the bottleneck of the previous solution is that every time we pick a number, we have to check if it exists in the result set. If it does, we have to go back and pick again. If we can skip this step it will be faster.

Q: What do you mean by skipping this step?

A: I mean that every time we pick a number, it is guaranteed not picked before.

Q: How do we do that?

A: Hmm, we need to keep track of what has not been picked instead. Since we cannot use additional space, I assume that we have to do something to the original array. I could replace the picked number with some special value like n+1, but this sounds useless: if I happened to pick this value, I would have to choose again, exactly like before. I don’t know…

Q: OK, in what situation can we safely draw an unpicked number?

A: If the array only contains unpicked numbers, we can do that safely. But again, I don’t think we can recreate a new array to exclude the picked one in every pass. That’s O(n^2) again.

Q: True. So why can’t we draw the numbers safely now? What’s the matter?

A: Because there are picked values between unpicked ones.

Q: Good. You mentioned excluding them. Is there a way to do that without creating a new array?

A: I suppose I can re-arrange the array? For example, if I picked the number at index i, I can move the numbers from i+1 to n-1 forward by one. Then next time I should pick a random index between 0 and n-1 (exclusive). Wait, this is still O(n^2)…

Q: Do we have to move all the elements after index i? Can we reduce this O(n) move to an O(1) move?

A: O(1)? So I should move only 1 element instead. But which one…

Q: Let’s use an example: 1,2,3,4,5 Say we picked 3. In your case, we change the array to 1,2,4,5,5 and then we pick from index 0 to 3 next time. We do the move because we want to make sure next time we are choosing from 1,2,4,5. So is there another way to do it?

A: Yes! I can move the last element to that position to achieve the same effect! So every time after the pick, I move the last element within the current range to the picked position then reduce the range by 1.

Q: That’s right 🙂
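The idea the dialogue arrives at can be sketched like this. It is an illustration in Python rather than the linked code; `draw_by_swapping` is a hypothetical name, and the sketch deliberately modifies the input array, since no extra space is allowed apart from the result.

```python
import random


def draw_by_swapping(numbers, k, rng=random):
    """Draw k distinct values with one O(1) move per pick."""
    if k > len(numbers):
        raise ValueError("k must not exceed the number of values")
    result = []
    end = len(numbers)  # numbers[:end] holds the values not picked yet
    for _ in range(k):
        i = rng.randrange(end)         # every index in range is unpicked
        result.append(numbers[i])
        numbers[i] = numbers[end - 1]  # move the last unpicked value into the gap
        end -= 1                       # shrink the range by one
    return result


values = [1, 2, 3, 4, 5]
print(draw_by_swapping(values, 3, random.Random(0)))
```

With the example from the dialogue, picking 3 turns 1,2,3,4,5 into 1,2,5,4,5, and the next pick comes from the first four positions. There are no retries and no membership checks, so drawing k numbers takes k picks.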

Link to the code: