Homework 5
ESE 402/542
Due on 11/20/2019

(For Problems 1 and 2, no package other than numpy and matplotlib should be used for the programming questions. For Problem 3 you may use the packages of your choice.)

Problem 1.

(a) In this problem we will analyze the logistic regression model learned in class. The sigmoid function can be written as

    S(x) = 1 / (1 + e^(-x))

• For a given variable X, assume P(Y = +1 | X) is modeled as P(Y = +1 | X) = S(β0 + β1 X). Plot a 3-d figure showing the relation between the output and the variables β0 and β1 when X = 1. Take values in [-2, 2] for both β0 and β1 with a step size of 0.1 to make the 3-d plot.

(b) In class we have done binary classification with the labels Y ∈ {0, 1}. In this problem we will use the labels Y ∈ {-1, 1}, as this makes it easier to derive the likelihood P(Y | X).

• Show that if Y ∈ {-1, 1}, the probability of Y given X can be written as (not programming)

    P(Y | X) = 1 / (1 + e^(-y(β0 + β1 x)))

• We have learned that the coefficients β0 and β1 can be found using MLE estimates. Show that the log-likelihood function for m data points can be written as (not programming)

    ln L(β0, β1) = - Σ_{i=1}^m ln(1 + e^(-y_i(β0 + β1 x_i)))

• Plot a 3-d figure showing the relation between the log-likelihood function and the variables β0, β1 when (X = 1, Y = -1) and when (X = 1, Y = 1). Take values in [-2, 2] for both β0 and β1 with a step size of 0.1 to make the 3-d plot.

• Based on the graph, is it possible to maximize this function?

Problem 2.

1. While we can formalize the likelihood function, there is no closed-form expression for the coefficients β0, β1 that maximize the log-likelihood in Problem 1. Hence, we will use an iterative algorithm to solve for the coefficients. Note that

    max ( - Σ_{i=1}^m ln(1 + e^(-y_i(β0 + β1 x_i))) ) = min ( Σ_{i=1}^m ln(1 + e^(-y_i(β0 + β1 x_i))) )

We define our loss function as

    L = (1/m) Σ_{i=1}^m ln(1 + e^(-y_i(β0 + β1 x_i)))

Our objective is to iteratively decrease this loss as we compute the optimal coefficients.
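The Problem 1(a) surface can be produced along the following lines. This is only a sketch, not the required solution: the grid construction and axis labels are one possible choice, and the `Agg` backend line is there only so the script runs headless (drop it and call `plt.show()` to view the figure interactively).

```python
# Sketch for Problem 1(a): surface of P(Y = +1 | X = 1) over (beta_0, beta_1).
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to display interactively
from matplotlib import pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = 1.0
betas = np.arange(-2.0, 2.0 + 1e-9, 0.1)   # [-2, 2] with step size 0.1 (41 values)
B0, B1 = np.meshgrid(betas, betas)         # grid of (beta_0, beta_1) pairs
P = sigmoid(B0 + B1 * x)                   # P(Y = +1 | X = 1) at each grid point

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(B0, B1, P)
ax.set_xlabel("beta_0")
ax.set_ylabel("beta_1")
ax.set_zlabel("P(Y=+1 | X=1)")
# plt.show()  # or fig.savefig("sigmoid_surface.png")
plt.close(fig)
```

The Problem 1(b) log-likelihood surfaces follow the same pattern, with the z-values replaced by -ln(1 + e^(-y(β0 + β1 x))) for the chosen (x, y) pair.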
Here each x_i ∈ R.

In this problem we will work with real image data, where the goal is to classify whether an image is a 0 or a 1 using logistic regression. The input X ∈ R^(m x d) is a matrix with dimensions [m x d], where a single data point x_i ∈ R^d with d = 784. The label vector Y ∈ R^m, where each label y_i ∈ {0, 1}.

• Load the data into memory and visualize one input as an image for each of label 0 and label 1. (The data should be reshaped back to [28 x 28] to be able to visualize it.)

• The data values lie between 0 and 255. Normalize the data to the range [0, 1].

• Set y_i = 1 for images labeled 0 and y_i = -1 for images labeled 1. Split the data randomly into train and test with a ratio of 80:20. Why is random splitting better than sequential splitting in our case?

• Initialize the coefficients using a univariate normal (Gaussian) distribution with mean 0 and variance 1. (Remember that the coefficients form a vector [β0, β1, ..., βd], where d is the dimension of the input.)

• Compute the loss using the above-mentioned loss L. The loss can be written as

    L = (1/m) Σ_{i=1}^m ln(1 + e^(-y_i(β0 + Σ_{j=0}^{d-1} β_{j+1} · x_{i,j})))

where i indexes the i-th data point, i ∈ {1, 2, ..., m}, and j indexes the j-th dimension of data point x_i, j ∈ {0, ..., d-1}.

• A widely known algorithm for minimizing the loss function is to move in the direction opposite to its gradients. (It is helpful to write the coefficients [β1, ..., βd] as a vector β and β0 as a scalar, so that β ∈ R^d and β0 ∈ R.) The gradients of the loss function can be written as matrix operations:

    ∂L/∂β0 = -(1/m) Σ_{i=1}^m [ e^(-y_i(β0 + β·x_i^T)) / (1 + e^(-y_i(β0 + β·x_i^T))) ] y_i = dβ0

    ∂L/∂β = -(1/m) Σ_{i=1}^m [ e^(-y_i(β0 + β·x_i^T)) / (1 + e^(-y_i(β0 + β·x_i^T))) ] y_i x_i = dβ

Write a function to compute the gradients.

• Update the parameters as β = β - 0.05·dβ and β0 = β0 - 0.05·dβ0. (Gradient updates should be computed on the train set.)

• Repeat the process for 50 iterations and report the loss after the 50th epoch.

• Plot the loss at each iteration for the train and test sets.

• Logistic regression is a classification problem.
We classify a point as +1 if P(Y = 1 | X) ≥ 0.5. Derive the classification rule for the threshold 0.5. (Not a programming question.)

• Using the classification rule you derived, compute the accuracy on the test set at each iteration and plot the accuracy.

The final code should follow this format:

    import numpy as np
    from matplotlib import pyplot as plt

    def compute_loss(data, labels, B, B_0):
        ...
        return logloss

    def compute_gradients(data, labels, B, B_0):
        ...
        return dB, dB_0

    if __name__ == '__main__':
        x = np.load(data)
        y = np.load(label)

        ## Split the data into train and test
        x_train, y_train, x_test, y_test = ...  # split_data

        B = np.random.randn(1, x.shape[1])
        B_0 = np.random.randn(1)

        lr = 0.05
        for _ in range(50):
            ## Compute loss
            loss = compute_loss(x_train, y_train, B, B_0)

            ## Compute gradients
            dB, dB_0 = compute_gradients(x_train, y_train, B, B_0)

            ## Update parameters
            B = B - lr * dB
            B_0 = B_0 - lr * dB_0

            ## Compute accuracy and loss on the test set (x_test, y_test)
            accuracy_test = ...
            loss_test = ...

        ## Plot loss and accuracy

Make sure to vectorize the code. Ideally, 50 iterations should run in 10 seconds or less. If possible, avoid for loops, except for the 50 iterations of gradient updates given in the sample code.

Problem 3.

Recall that in classification we assume that each data point is an i.i.d. sample from an (unknown) distribution P(X = x, Y = y). In this question, we are going to design the data distribution P and evaluate the performance of logistic regression on data generated using P. Keep in mind that we would like to make P as simple as we can. In the following, we assume x ∈ R and y ∈ {0, 1}, i.e. the data is one-dimensional and the label is binary. Write P(X = x, Y = y) = P(X = x)P(Y = y | X = x). We will generate X = x according to the uniform distribution on the interval [0, 1] (thus P(X = x) is just the pdf of the uniform distribution).

1.
Design P(Y = y | X = x) such that (i) P(y = 0) = P(y = 1) = 0.5; (ii) the classification accuracy of any classifier is at most 0.9; and (iii) the accuracy of the Bayes optimal classifier is at least 0.8.

2. Using Python, generate n = 100 training data points according to the distribution you designed above and train a binary classifier using logistic regression on the training data.

3. Generate n = 100 test data points according to the distribution you designed in part 1 and compute the prediction accuracy (on the test data) of the classifier you trained in part 2. Also compute the accuracy of the Bayes optimal classifier on the test data. Why do you think the Bayes optimal classifier performs better?

4. Redo parts 2 and 3 with n = 1000. Are the results any different from part 3? Why?

Problem 4.

K-means clustering can be viewed as an optimization problem that attempts to minimize some objective function. For the given objectives, determine the update rule for the centroid c_k of the k-th cluster C_k. In other words, find the optimal c_k that minimizes the objective function. Each data point x contains p features.

1. Show that setting the objective to the sum of the squared Euclidean distances of points from the center of their clusters,

    Σ_{k=1}^K Σ_{x ∈ C_k} Σ_{i=1}^p (c_{ki} - x_i)^2,

results in an update rule where the optimal centroid is the mean of the points in the cluster.

2. Show that setting the objective to the sum of the Manhattan distances of points from the center of their clusters,

    Σ_{k=1}^K Σ_{x ∈ C_k} Σ_{i=1}^p |c_{ki} - x_i|,

results in an update rule where the optimal centroid is the median of the points in the cluster.
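Before writing the Problem 4 proofs, the two claims can be sanity-checked numerically. The sketch below (not the requested derivation) uses invented data: a single 1-D "cluster" of random points and a dense grid of candidate centroid values, then checks which candidate minimizes each objective.

```python
# Numerical sanity check for Problem 4 (not a proof): on 1-D data, the sum of
# squared distances is minimized near the mean, and the sum of absolute
# (Manhattan) distances near the median.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=201)                      # odd count -> unique median

grid = np.linspace(x.min(), x.max(), 2001)    # candidate centroid values
sq_obj = ((grid[:, None] - x[None, :]) ** 2).sum(axis=1)   # squared Euclidean objective
ab_obj = np.abs(grid[:, None] - x[None, :]).sum(axis=1)    # Manhattan objective

best_sq = grid[sq_obj.argmin()]               # should be close to x.mean()
best_ab = grid[ab_obj.argmin()]               # should be close to np.median(x)
```

With p > 1 features, both objectives separate across coordinates, so the same check applies dimension by dimension.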

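As a closing aid for Problem 2, here is one possible vectorized implementation of `compute_loss` and `compute_gradients` matching the skeleton's signatures. It is a sketch, not the reference solution: the use of `np.logaddexp` for overflow-safe evaluation of ln(1 + e^(-z)) and the tiny synthetic data at the bottom are choices of this sketch, and assume `labels` are in {-1, +1}, `B` has shape [1 x d], and `B_0` is a scalar (or shape-(1,) array).

```python
# Sketch of the Problem 2 loss and gradient functions, vectorized with numpy.
import numpy as np

def compute_loss(data, labels, B, B_0):
    # z_i = y_i * (beta_0 + beta . x_i); loss = mean of ln(1 + e^{-z_i})
    z = labels * (B_0 + data @ B.ravel())
    return np.mean(np.logaddexp(0.0, -z))     # overflow-safe ln(1 + e^{-z})

def compute_gradients(data, labels, B, B_0):
    m = data.shape[0]
    z = labels * (B_0 + data @ B.ravel())
    w = np.exp(-np.logaddexp(0.0, z))         # e^{-z} / (1 + e^{-z}), overflow-safe
    dB_0 = -np.sum(w * labels) / m            # scalar partial w.r.t. beta_0
    dB = -(w * labels) @ data / m             # length-d vector of partials w.r.t. beta
    return dB.reshape(B.shape), dB_0

# Tiny synthetic demo (invented data, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
B = rng.normal(size=(1, 3))
B_0 = np.zeros(1)

dB, dB_0 = compute_gradients(X, y, B, B_0)
# One small step against the gradient should not increase the loss.
loss_after = compute_loss(X, y, B - 0.05 * dB, B_0 - 0.05 * dB_0)
```

A useful self-check when writing your own version is to compare each analytic partial derivative against a centered finite difference of `compute_loss`.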