
Homework 2 - Optimization and Kepler's Laws

This homework reviews optimization techniques including Newton’s method, explores the gradient descent proof, and applies proportional relationships to real-world astronomical data.

(a) Read the article found here: https://www.quantamagazine.org/three-hundred-years-later-a-tool-from-isaac-newton-gets-an-update-20250324/. What date was this article published?

Solution

It was published on March 24, 2025.

(b) True or False? Most problems that involve finding an optimal solution boil down to minimizing a function.

Solution

True. The article states: “Mathematically, these problems get translated into a search for the minimum values of functions.”

(c) In the third paragraph, does the description of Newton’s method in optimization sound like any topic we discussed? If so, what topic? (Note that Newton’s method is sometimes called Newton-Raphson).

Solution

Yes, it sounds similar to gradient descent. Newton’s method works by using derivative information to iteratively find a better approximation of the minimum, similar to how gradient descent uses the slope to guide the search.

(d) In using Newton’s method in optimization, how many derivatives are required?

Solution

Two derivatives are required: the first derivative (slope) and the second derivative (the rate at which the slope is changing).

(e) What converges faster, Newton’s method or gradient descent? Note that the article states this answer and Wikipedia also has a nice visual: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization.

Solution

Newton’s method converges faster. It converges at a quadratic rate, while gradient descent converges at a linear rate. This means Newton’s method can identify the minimum value in fewer iterations.
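To see the difference in convergence rates concretely, here is a small sketch (not from the article) comparing the two methods on the convex function $f(x) = e^x - 2x$, whose minimum is at $x = \ln 2$. The function, starting point, tolerance, and learning rate are all illustrative choices:

```python
import math

# f(x) = e^x - 2x is convex with a unique minimum at x = ln(2),
# where f'(x) = e^x - 2 = 0. The second derivative is f''(x) = e^x.

def newton(x, tol=1e-10, max_iter=1000):
    """Newton's method: uses both f' and f'' at each iteration."""
    for i in range(max_iter):
        g = math.exp(x) - 2          # first derivative f'(x)
        if abs(g) < tol:
            return x, i
        x -= g / math.exp(x)         # step = f'(x) / f''(x)
    return x, max_iter

def grad_descent(x, alpha=0.1, tol=1e-10, max_iter=100_000):
    """Gradient descent: uses only f', scaled by a learning rate alpha."""
    for i in range(max_iter):
        g = math.exp(x) - 2
        if abs(g) < tol:
            return x, i
        x -= alpha * g
    return x, max_iter

x_newton, steps_newton = newton(0.0)
x_gd, steps_gd = grad_descent(0.0)
# Both land near ln(2) ~ 0.6931, but Newton's method needs far fewer iterations.
```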

(f) Given your answer above, why might someone opt to use gradient descent versus Newton’s method?

Solution

Newton’s method is more computationally expensive than gradient descent. Therefore, researchers prefer gradient descent for certain applications, such as training neural networks in machine learning, where the computational cost per iteration matters more.

(g) True or False? Ahmadi, Chaudhry and Zhang’s work “piggybacks” off previous work.

Solution

True. The article states: “Now Ahmadi, Chaudhry and Zhang have taken Nesterov’s result another step further.” Their work builds upon Yurii Nesterov’s 2021 breakthrough on approximating functions with cubic equations, extending it to work for arbitrarily many derivatives.

(h) What properties make an equation easy to minimize?

Solution

Two properties make an equation easy to minimize: (1) the equation should be bowl-shaped, or “convex,” with just one valley instead of many, so you don’t have to worry about mistaking a local valley for the global minimum; (2) the equation can be written as a sum of squares, for example $5x^2 + 16x + 13 = (x+2)^2 + (2x+3)^2$.
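The sum-of-squares identity above can be checked numerically. Since both sides are degree-2 polynomials, agreeing at three or more distinct points forces them to agree everywhere; the sample points below are arbitrary illustrative choices:

```python
# Verify 5x^2 + 16x + 13 == (x + 2)^2 + (2x + 3)^2 at several sample points.
# Two quadratics that agree at 3+ distinct points are identical.
for x in [-3, -1, 0, 0.5, 2, 10]:
    lhs = 5 * x**2 + 16 * x + 13
    rhs = (x + 2) ** 2 + (2 * x + 3) ** 2
    assert abs(lhs - rhs) < 1e-9
```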

(i) What did Ahmadi, Chaudhry and Zhang do to the Taylor expansion?

Solution

They added a “fudge factor” to the Taylor expansion using a technique called semidefinite programming. This modification made the Taylor approximation both convex and expressible as a sum of squares, without letting it drift too far from the original function it was meant to approximate.

(j) As of 2026, which technique is our best bet for optimization? Is it Newton’s, gradient descent or Ahmadi, Chaudhry and Zhang’s technique?

Solution

As of 2026, gradient descent is still the best bet for practical applications such as self-driving cars, machine learning algorithms, and air traffic control systems. While Ahmadi, Chaudhry and Zhang’s algorithm is theoretically faster, each iteration is still more computationally expensive than gradient descent.

(k) True or False? It is possible that Ahmadi, Chaudhry and Zhang’s technique can replace gradient descent in the future.

Solution

True. If computational technology becomes more efficient in the future, each iteration of their algorithm will become less computationally expensive. Ahmadi hopes that in 10 to 20 years, their algorithm will also be faster in practice, not just in theory.

Problem 2: Proof of Gradient Descent Update Formula


Recall Mitchell’s definition of machine learning. A program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

(a) What is our goal in gradient descent? Specifically, if we are at $f(x_{old})$, then what can we say about $f(x_{new})$ in relation to $f(x_{old})$?

Solution

Our goal is to minimize the function. We want $f(x_{new})$ to be less than $f(x_{old})$, meaning we want to move to a point where the function value is smaller.

(b) Geometrically, what does it mean that $x_{new} = x_{old} + h$?

Solution

It means we are moving from the point $x_{old}$ by $h$ units. The sign of $h$ indicates whether we move to the left ($h < 0$) or to the right ($h > 0$) on the number line.

(c) Suppose $f(x)$ is a differentiable function and we draw the tangent line to $f(x)$ at the point $(a, f(a))$. What is the equation of this tangent line? Call the equation $L(x)$ where $L$ stands for linearization.

Solution

The equation of the tangent line is $L(x) = f(a) + f'(a)(x - a)$. This is the linearization of $f(x)$ at the point $a$.

(d) If $x$ is near $a$, then what can we say about $f(x)$ in relation to $L(x)$?

Solution

If $x$ is near $a$, then $f(x) \approx L(x)$. The linearization is a good approximation of the function near the point $a$.

(e) Let’s define $x := x_{old} + h = x_{new}$ and define $a := x_{old}$. What does your answer in part (d) become?

Solution

$$f(x) \approx L(x)$$
$$\Rightarrow f(x) \approx f(a) + f'(a)(x - a)$$
$$\Rightarrow f(x_{new}) \approx f(x_{old}) + f'(x_{old})(x_{old} + h - x_{old})$$
$$\Rightarrow f(x_{new}) \approx f(x_{old}) + h f'(x_{old})$$
$$\Rightarrow f(x_{new}) - f(x_{old}) \approx h f'(x_{old})$$

(f) Explain why $h f'(x_{old})$ is negative.

Solution

Since we need $f(x_{new}) < f(x_{old})$, subtracting $f(x_{old})$ from both sides gives $f(x_{new}) - f(x_{old}) < 0$. Since $f(x_{new}) - f(x_{old}) \approx h f'(x_{old})$ and $f(x_{new}) - f(x_{old})$ is negative, $h f'(x_{old})$ must also be negative.

(g) Given your answer above, what does this imply about $h$?

Solution

For the inequality $h f'(x_{old}) < 0$ to hold, $h$ must always have the opposite sign of $f'(x_{old})$. For example, if both $h$ and $f'(x_{old})$ were negative, their product would be positive, violating our requirement that $h f'(x_{old}) < 0$.

(h) Why don’t we take $h = -f'(x_{old})$?

Solution

Taking $h = -f'(x_{old})$ directly can be an issue because the derivative $f'(x_{old})$ may be very large, resulting in excessively large jumps between consecutive points. This can cause us to skip over the minimum entirely.

(i) What should we take hh to be?

Solution

We should take $h$ to be a small number: $h = -\alpha f'(x_{old})$, where $\alpha$ is a small positive scalar (like $0.01$ or smaller).

(j) Substitute your answer above into $x_{new} = x_{old} + h$. What formula do we get?

Solution

Plugging $h = -\alpha f'(x_{old})$ into $x_{new} = x_{old} + h$ yields:

$$x_{new} = x_{old} - \alpha f'(x_{old})$$

(k) Intuitively, what does $\alpha$ represent?

Solution

$\alpha$ is the learning rate, which determines how big of a step we take at each iteration when updating $x$.
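The update rule derived above can be sketched in a few lines of Python. The example function $f(x) = (x-4)^2$, starting point, learning rate, and step count below are illustrative choices, not part of the problem:

```python
def gradient_descent(f_prime, x0, alpha=0.1, steps=200):
    """Repeatedly apply the update x_new = x_old - alpha * f'(x_old)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x

# Illustrative example: f(x) = (x - 4)^2 has f'(x) = 2(x - 4) and minimum at x = 4.
x_min = gradient_descent(lambda x: 2 * (x - 4), x0=0.0)
```

Each iteration moves $x$ in the direction opposite the derivative, so the iterates settle near the minimizer $x = 4$.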

Problem 3: Mitchell’s Definition and Linear Models


Recall Mitchell’s definition of machine learning. A program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

Suppose we are interested in predicting a student’s score on a final exam.

For student $i$, let:

  • $x_{i1}$ = hours spent studying (over the semester),
  • $x_{i2}$ = number of class meetings attended,
  • $x_{i3}$ = homework average (as a percent),
  • $y_i$ = final exam score (from $0$ to $100$).

Assume we use the linear model

$$\hat{y}_i = b + m_1 x_{i1} + m_2 x_{i2} + m_3 x_{i3}$$

(a) What would the dataset, $\mathbb{D}$, look like?

Solution

The dataset $\mathbb{D}$ would contain all the tuples $(x_{i1}, x_{i2}, x_{i3}, y_i)_{i=1}^n$ for each of the $n$ students, where $x_{i1}$ is the hours they studied, $x_{i2}$ is the number of class meetings they attended, $x_{i3}$ is their homework average as a percent, and $y_i$ is their final exam score. In table form, the columns would represent each variable, and each row would represent one student’s data:

| Hours Studied | Classes Attended | Homework Avg (%) | Exam Score |
| --- | --- | --- | --- |
| 40 | 23 | 90 | 85 |
| 30 | 20 | 85 | 80 |
| 50 | 25 | 95 | 95 |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |

(b) In the context of Mitchell’s definition, what is $T$?

Solution

The task $T$ in Mitchell’s definition is predicting a student’s final exam score from their hours studied, number of classes attended, and homework average.

(c) In the context of Mitchell’s definition, what is a good way to define $P$?

Solution

A good way to define the performance measure $P$ is to use a metric that quantifies how well the model’s predictions match the actual exam scores. For example, we could use the mean squared error (MSE) between the predicted scores $\hat{y}_i$ and the actual scores $y_i$:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = b + m_1 x_{i1} + m_2 x_{i2} + m_3 x_{i3}$.
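As a sketch of how this performance measure could be computed, here is a minimal implementation. The student rows and parameter values below are hypothetical, chosen only for illustration (they are not fitted to anything):

```python
def predict(x, b, m):
    """Linear model: y_hat = b + m1*x1 + m2*x2 + m3*x3."""
    return b + sum(mj * xj for mj, xj in zip(m, x))

def mse(data, b, m):
    """Mean squared error of model (b, m) over data = [((x1, x2, x3), y), ...]."""
    return sum((y - predict(x, b, m)) ** 2 for x, y in data) / len(data)

# Hypothetical students: (hours studied, classes attended, homework avg) -> exam score
students = [((40, 23, 90), 85), ((30, 20, 85), 80), ((50, 25, 95), 95)]
# Hypothetical parameter values, for illustration only
error = mse(students, b=5.0, m=(0.5, 0.5, 0.5))
```

A smaller `error` means the model’s predictions track the actual exam scores more closely, which is exactly what $P$ is meant to measure.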

(d) In the context of Mitchell’s definition, what is $E$?

Solution

The experience $E$ is all the data points $(x_{i1}, x_{i2}, x_{i3}, y_i)_{i=1}^n$ from the dataset $\mathbb{D}$ that we use to train the model.

(e) What happens in the training step? Try to be as detailed as possible.

Solution

We start by dividing $\mathbb{D}$ into three separate parts: a training set (80% of $\mathbb{D}$), a validation set (10% of $\mathbb{D}$), and a test set (10% of $\mathbb{D}$). Using the training set, we try different learning rates $\alpha_1, \alpha_2, \ldots, \alpha_t$. For each learning rate $\alpha_i$, we perform gradient descent to find the optimal parameters $b_i, m_{1i}, m_{2i}, m_{3i}$. For example, $\alpha_1$ yields $(b_1, m_{11}, m_{21}, m_{31})$, $\alpha_2$ yields $(b_2, m_{12}, m_{22}, m_{32})$, and so on until $\alpha_t$ yields $(b_t, m_{1t}, m_{2t}, m_{3t})$.

(f) What happens in the validation step?

Solution

In the validation step, we calculate the performance measure $P$ for every candidate model $(\alpha_i, b_i, m_{1i}, m_{2i}, m_{3i})_{i=1}^t$. We then select the parameters that yield the best performance, meaning the lowest MSE on the validation set.

(g) What happens in the test step?

Solution

In the test step, we use the best model we selected from the validation step and evaluate its performance on the test set, which contains data that the model has never seen before.
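The split described in parts (e)–(g) can be sketched as follows. The 80/10/10 proportions follow the solution to part (e), while the shuffling and the fixed seed are illustrative choices:

```python
import random

def split_dataset(data, seed=0):
    """Shuffle, then split into 80% train / 10% validation / 10% test."""
    data = list(data)
    random.Random(seed).shuffle(data)  # shuffle so the splits are representative
    n = len(data)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```

Only the training set is used to fit parameters, the validation set picks the best candidate model, and the test set is held out until the very end so it reflects truly unseen data.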

Problem 4: Kepler’s Laws of Planetary Motion


In the house pricing example discussed in class, we made the erroneous assumption that the house price was a linear function of square footage. During office hours, a student had asked “Does linear data actually appear in real life?” Allow us to now study a different phenomenon.

We say that $y$ is proportional to $x$, written $y \propto x$, if $y = kx$ for some constant $k$. For instance, $y = 2x$, $y = 3x$, and $y = 245.01x$ are all proportional relationships. How can we use this definition in modeling? Suppose we see the following data of the form $(x_i, y_i)$:

$$\mathbb{D} = \{ (1, 3.1),\ (2, 5.8),\ (3, 8.85),\ (4, 12.12) \}$$

and we want to find out if this data follows a proportional relationship. What we could do is take each value of $y$ and divide by the corresponding $x$. If every quotient gives roughly the same constant $k$, then it is reasonable to build a model of the form $\hat{y}_i = k x_i$. Let’s investigate:

| $x$ | $y$ | $k = \frac{y}{x}$ |
| --- | --- | --- |
| 1 | 3.1 | 3.1 |
| 2 | 5.8 | 2.9 |
| 3 | 8.85 | 2.95 |
| 4 | 12.12 | 3.03 |

Since all of the quotients are near $3$, it’s perhaps reasonable to say that $\hat{y}_i = 3x_i$. With this example in mind, let’s see the following application of proportionality, which was discovered approximately 400 years ago.
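The quotient check above is easy to carry out in code. A minimal sketch using the dataset $\mathbb{D}$ from this section:

```python
# Dataset D from above, as (x_i, y_i) pairs
D = [(1, 3.1), (2, 5.8), (3, 8.85), (4, 12.12)]

ratios = [y / x for x, y in D]   # each quotient is near 3
k = sum(ratios) / len(ratios)    # average quotient, roughly 3
```

Averaging the quotients is just one reasonable way to estimate $k$; since every ratio sits close to $3$, the model $\hat{y}_i = 3x_i$ is a sensible fit.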

(a) In astronomy, the orbital period refers to the amount of time it takes for a planet to complete one orbit around the Sun. For instance, the orbital period of Earth is about 365 days since it takes the Earth 365 days to move $360°$ around the Sun. Using the Internet, find out the orbital period for the following planets and fill out the chart. Please include your source by providing a link.

Solution
| Planet | Orbital Period (days) |
| --- | --- |
| Mercury | 88 |
| Venus | 225 |
| Earth | 365 |
| Mars | 687 |
| Jupiter | 4,333 |
| Saturn | 10,759 |
| Uranus | 30,687 |
| Neptune | 60,190 |

Source: https://spaceplace.nasa.gov/years-on-other-planets/en/

(b) Kepler’s First Law of Planetary Motion posits that each planet’s orbit about the Sun is an ellipse. Given an ellipse, consider the line segment which runs through the two foci and touches the perimeter of the ellipse. We call this line segment the major axis and half of the major axis is called the semi-major axis.

Ellipse diagram

Using the Internet, find the length of the semi-major axis for each planet’s ellipse. Please include your source by providing a link.

Solution
| Planet | Semi-Major Axis (AU, astronomical units) |
| --- | --- |
| Mercury | 0.387 |
| Venus | 0.723 |
| Earth | 1.0 |
| Mars | 1.524 |
| Jupiter | 5.204 |
| Saturn | 9.5826 |
| Uranus | 19.218 |
| Neptune | 30.110 |

Source: https://en.wikipedia.org/wiki/Semi-major_and_semi-minor_axes

(c) Suppose $x$ denotes the length of the semi-major axis while $y$ denotes the orbital period. Does it seem likely that $y \propto x$? If so, then estimate the proportionality factor, $k$. Be sure to explain your answer.

Solution
| $x$ (Semi-Major Axis, AU) | $y$ (Orbital Period, days) | $k = y/x$ |
| --- | --- | --- |
| 0.387 | 88 | 227.4 |
| 0.723 | 225 | 311.2 |
| 1.0 | 365 | 365.0 |
| 1.524 | 687 | 451.1 |
| 5.203 | 4,333 | 832.7 |
| 9.537 | 10,759 | 1,127.6 |
| 19.191 | 30,688 | 1,599.9 |
| 30.069 | 60,182 | 2,001.4 |

Based on the table above, $y \propto x$ does not hold, because the ratio $k = y/x$ changes by a large amount from planet to planet.

(d) Does it seem likely that $y^2 \propto x^3$? If so, then estimate the proportionality factor, $k$. What does this imply?

Solution
| $x^3$ (Semi-Major Axis Cubed) | $y^2$ (Orbital Period Squared, days²) | $k = y^2/x^3$ |
| --- | --- | --- |
| 0.0579 | 7,744 | 133,780.9 |
| 0.3777 | 50,625 | 134,126.9 |
| 1.0 | 133,225 | 133,225.0 |
| 3.5398 | 471,969 | 133,298.1 |
| 140.7644 | 18,774,889 | 133,341.4 |
| 867.6277 | 115,755,081 | 133,438.8 |
| 7,070.3024 | 941,750,144 | 133,221.7 |
| 27,181.7117 | 3,621,873,124 | 133,309.1 |

Yes, according to the table it seems that $y^2 \propto x^3$ holds up well. The proportionality factor $k = y^2/x^3$ stays roughly constant at about $133{,}320$. This means we can represent the relationship as $y^2 = 133{,}320 \cdot x^3$. Even though $y \propto x$ didn’t work, a hidden relationship was found by transforming the variables.
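This check can be reproduced in a few lines. The sketch below uses the semi-major axes and orbital periods from the part (c) table and confirms that $y^2/x^3$ is nearly constant across all eight planets:

```python
# Semi-major axis x (AU) and orbital period y (days), per the part (c) table
planets = {
    "Mercury": (0.387, 88),
    "Venus": (0.723, 225),
    "Earth": (1.0, 365),
    "Mars": (1.524, 687),
    "Jupiter": (5.203, 4333),
    "Saturn": (9.537, 10759),
    "Uranus": (19.191, 30688),
    "Neptune": (30.069, 60182),
}

ratios = [y**2 / x**3 for x, y in planets.values()]
spread = max(ratios) / min(ratios)  # close to 1 means y^2/x^3 is nearly constant
```

Unlike the $y/x$ ratios, which grow roughly nine-fold from Mercury to Neptune, these $y^2/x^3$ ratios differ by well under one percent, which is Kepler's Third Law in action.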