Data 310 project summaries
Final Project:
When training a CNN, one must consider which optimizer to use. The optimizer works with the loss function by making guesses about the data, and the loss function judges how accurate those guesses are. In the past, I've used either the Adam or RMSprop optimizer, and in this case RMSprop is the ideal choice. To explain why, I'll first explain the concept of stochastic gradient descent (SGD). SGD is the 'guess and check' process the optimizer and loss function carry out when training a model. Depending on the weights assigned within the model, the model will have a different measured error. The goal of SGD is to reduce this error. Picture an empty pond: the ground slopes up and down, with several local minima and one global minimum (the lowest point). This surface represents the multidimensional graph that SGD analyzes by 'blindly walking around,' guessing and checking whether it has reached the minimum possible error. Different optimizers work better depending on the shape of this graph; Adam, for example, works best when the minima are flat. RMSprop uses a specified learning rate to control how the model's weights are updated during SGD. In other words, the learning rate defines how large a step the optimizer takes around our graph, so by setting a low learning rate (0.001) we can ensure that the model does not train too quickly, which improves accuracy (especially since we are working with the large cats-and-dogs dataset). Moreover, RMSprop tends to perform well on larger datasets, while Adam does not.
Sources: Source 1, Source 2
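To make the pond analogy concrete, here is a minimal sketch of plain gradient descent on a bumpy one-dimensional 'pond' (the `pond` function, its shape, and the starting point are made up for illustration; they are not from the project code). The learning rate is literally the step size of the walk:

```python
import numpy as np

def pond(w):
    # A bumpy 1-D "pond": several local minima and one global minimum.
    return np.sin(3 * w) + 0.1 * w ** 2

def pond_grad(w):
    # The slope of the ground at w (the derivative of pond).
    return 3 * np.cos(3 * w) + 0.2 * w

w = 2.0                 # start "blindly walking" from an arbitrary spot
learning_rate = 0.001   # small steps, like the rate used in the project

for _ in range(5000):
    w -= learning_rate * pond_grad(w)   # step downhill, scaled by the rate

print(f"settled at w = {w:.3f}, error = {pond(w):.3f}")
```

With such a small step size the walker settles into the nearest minimum rather than leaping over it. In Keras, picking this optimizer and rate is one line of the compile step, e.g. something like `model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])`.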
In this model, we used the binary_crossentropy loss function. As the name suggests, this loss function is appropriate when working with binary classification data (that is, data with two classes, such as cats and dogs). The goal of the loss function is to evaluate how good or bad the predictions the optimizer makes during training are, and to penalize the model accordingly. This penalty, or loss, when points are misclassified is derived from how 'bad' the prediction is. For example, consider a model that classifies points into groups A and B. If a point c belongs to class A, then we have two possible extremes for the associated cases. First, if the model predicts that point c belongs to group A with a high degree of certainty (high predicted probability), the loss will be very low. Conversely, if the model predicts that point c belongs to class B, the predicted probability that the point is actually in A is low, so the loss will be very high. As you may notice, these two measures are inversely related. The goal of training is to minimize the loss function, which means the model makes more accurate predictions.
Source
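To put numbers on those two extremes, here is a quick sketch of the standard binary cross-entropy of a single point (the labels and predicted probabilities are invented for illustration):

```python
import numpy as np

def binary_crossentropy(y_true, p_pred):
    # Standard binary cross-entropy for one point:
    # loss = -[y * log(p) + (1 - y) * log(1 - p)]
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Point c truly belongs to class A (label 1).
print(binary_crossentropy(1, 0.99))  # confident and correct -> ~0.01 (tiny loss)
print(binary_crossentropy(1, 0.01))  # confident but wrong   -> ~4.61 (huge loss)
```

The -log(p) shape is what makes a confidently wrong prediction so much more expensive than a confidently right one.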