Lecture 24 (05/06/2026) - Prove MWU Expert's Theorem (Potential Function); Introduce Paging Problem

Scribes: Moinuddin Rahat and Michelle Lam

Topics covered:

  • Introducing the Multiplicative Weight Updates (MWU) framework and the Weighted Majority algorithm
  • Experts are assigned weights that decrease when they make mistakes
  • Proving a mistake bound: the algorithm performs nearly as well as the best expert in hindsight using a potential function argument
  • Introducing the paging problem (cache management in online settings)
  • Discussing eviction strategies: LRU, LFU, FIFO, and optimal Farthest-in-Future
  • Applying competitive analysis to evaluate paging algorithms
  • Showing LRU and FIFO are $k$-competitive and discussing resource augmentation

In many situations, we need to make decisions repeatedly over time.

  • We are given $n$ experts, where each expert provides a suggestion at every step.
  • The decisions are binary, meaning there are only two possible choices (e.g., buy or sell).
  • At each step, we choose an action based on the experts' suggestions.
  • At the end of each step, we receive feedback and can determine whether expert $i$'s suggestion was correct or a mistake.

Goal: Design a strategy that performs nearly as well as the best expert in hindsight, even though we do not know this expert in advance.

Is there a strategy whose performance is provably close to that of the best expert in hindsight?

We are given $n$ experts, indexed by $i \in [n]$.

  • Initially, all experts are assigned equal weights:
$$w_i = 1 \quad \forall\, i \in [n]$$
  • At each step $t$, every expert has a weight $w_i^{(t)}$.

At step $t+1$, the weights are updated as follows:

  • If expert $i$ was correct:
$$w_i^{(t+1)} = w_i^{(t)}$$
  • If expert $i$ was incorrect:
$$w_i^{(t+1)} = (1 - \varepsilon)\, w_i^{(t)}$$

Here, $\varepsilon > 0$ is a small constant; the analysis of the mistake bound assumes $\varepsilon \le 1/2$.

At time step $t$, we use the weighted votes of all experts:

  • Compute the total weight of experts recommending each action.
  • Choose Sell if:
$$\sum_{i \in \text{Sell}} w_i^{(t)} \ge \frac{1}{2} \sum_{i=1}^{n} w_i^{(t)}$$
  • Otherwise, choose Buy.
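The update and voting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `weighted_majority`, the encoding of Sell as `1` and Buy as `0`, and the toy input are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    advice: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    eps: float = 0.1,
) -> Tuple[int, List[float]]:
    """Run Weighted Majority on binary advice.

    advice[t][i] is expert i's suggestion at step t (1 = Sell, 0 = Buy, an
    encoding assumed for this sketch); outcomes[t] is the correct action.
    Returns (number of mistakes made by the algorithm, final weights).
    """
    n = len(advice[0])
    weights = [1.0] * n  # all experts start with weight 1
    mistakes = 0
    for suggestions, truth in zip(advice, outcomes):
        total = sum(weights)
        sell_weight = sum(w for w, s in zip(weights, suggestions) if s == 1)
        # Choose Sell when at least half of the total weight recommends it.
        choice = 1 if sell_weight >= total / 2 else 0
        if choice != truth:
            mistakes += 1
        # Multiply the weight of every expert that was wrong by (1 - eps).
        weights = [
            w * (1 - eps) if s != truth else w
            for w, s in zip(weights, suggestions)
        ]
    return mistakes, weights


# Hypothetical toy input: three experts, four steps, expert 0 is always right.
advice = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
outcomes = [1, 0, 1, 0]
mistakes, weights = weighted_majority(advice, outcomes, eps=0.5)
```

On this toy input the algorithm errs in the first two steps, after which the always-correct expert holds most of the weight and the weighted vote follows it.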

Roughly, we want the number of mistakes made by the Weighted Majority (WM) algorithm to satisfy a bound of the form:

$$\text{Mistakes(WM)} \le \alpha \cdot \text{Mistakes(best expert)}$$

More generally, for all experts $i \in [n]$:

$$\text{Mistakes(WM)} \le \alpha \cdot \text{Mistakes}(i)$$

Here $\alpha$ is a small constant factor. In particular, this inequality holds when $i$ is the best expert in hindsight. The precise statement, which also carries an additive term, follows.

After $t$ steps, let:

  • $m^{(t)}$: number of mistakes made by the WM algorithm so far
  • $m_i^{(t)}$: number of mistakes made by expert $i$ so far

Then, for all experts $i \in [n]$:

$$m^{(t)} \le 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}$$

In particular, for the best expert:

$$m^{(t)} \le 2(1+\varepsilon)\, m_{\text{best}}^{(t)} + \frac{2 \ln n}{\varepsilon}$$

Meaning of symbols:

  • $t$: number of rounds (time steps)
  • $m^{(t)}$: total mistakes made by the algorithm up to time $t$
  • $m_i^{(t)}$: mistakes made by expert $i$ up to time $t$
  • $n$: number of experts
  • $\varepsilon > 0$: a small parameter used in weight updates
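To get a feel for the trade-off in $\varepsilon$, the right-hand side of the bound can be evaluated directly. The helper name `mistake_bound` and the sample numbers (100 experts, a best expert with 10 mistakes) are hypothetical and chosen only for illustration.

```python
import math


def mistake_bound(expert_mistakes: int, n: int, eps: float) -> float:
    """Evaluate 2(1 + eps) * m_i + 2 ln(n) / eps, the bound on WM's mistakes."""
    return 2 * (1 + eps) * expert_mistakes + 2 * math.log(n) / eps


# Hypothetical setting: n = 100 experts, best expert makes 10 mistakes.
for eps in (0.05, 0.1, 0.5):
    print(f"eps={eps}: bound={mistake_bound(10, 100, eps):.1f}")
```

A smaller $\varepsilon$ pushes the multiplicative factor toward 2 but inflates the additive term $2 \ln n / \varepsilon$, so the best choice depends on how many mistakes the best expert is expected to make.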

Proof: Weighted Majority via Potential Function

We prove the mistake bound using a potential function argument.

  • Initially, each expert has weight $w_i^{(1)} = 1$.
  • After $t$ steps:
$$w_i^{(t)} = (1 - \varepsilon)^{m_i^{(t)}}$$

Each time expert $i$ makes a mistake, its weight is multiplied by $(1 - \varepsilon)$. After $m_i^{(t)}$ mistakes, this results in an exponential decrease in weight.

Define the potential function:

$$\Phi^{(t)} = \sum_{i=1}^{n} w_i^{(t)}$$

Initially: $\Phi^{(1)} = n$.

The potential function measures the total “trust” we place in all experts.

Suppose the algorithm makes a mistake at time $t$.

Partition the experts into two groups:

  • $P$: total weight of correct experts
  • $Y$: total weight of incorrect experts

Then $P + Y = \Phi^{(t)}$.

After updating the weights:

$$\Phi^{(t+1)} = P + (1 - \varepsilon)Y = \Phi^{(t)} - \varepsilon Y$$

Since incorrect experts are penalized, the total weight decreases by $\varepsilon Y$.

If the algorithm makes a mistake at time $t$, then at least half of the total weight must have supported the wrong decision. Hence:

$$Y \ge \frac{\Phi^{(t)}}{2}$$

Substituting into the update $\Phi^{(t+1)} = \Phi^{(t)} - \varepsilon Y$:

$$\Phi^{(t+1)} \le \Phi^{(t)} - \varepsilon \cdot \frac{\Phi^{(t)}}{2} = \Phi^{(t)}\left(1 - \frac{\varepsilon}{2}\right)$$

Step 5: After Multiple Mistakes (Upper Bound)

If the algorithm makes $m^{(t)}$ mistakes in total, then each mistake multiplies the potential by at most $\left(1 - \frac{\varepsilon}{2}\right)$. In rounds without a mistake the potential never increases, since weights are only ever scaled down. Starting from $\Phi^{(1)} = n$:

$$\Phi^{(t)} \le n \left(1 - \frac{\varepsilon}{2}\right)^{m^{(t)}}$$

Thus, each mistake causes the potential to shrink multiplicatively.

For any expert $i$:

$$\Phi^{(t)} \ge w_i^{(t)} = (1 - \varepsilon)^{m_i^{(t)}}$$

The total weight is at least as large as any individual expert's weight.

Combining the upper and lower bounds on the potential function:

$$n \left(1 - \frac{\varepsilon}{2}\right)^{m^{(t)}} \ge (1 - \varepsilon)^{m_i^{(t)}}$$

This relates the algorithm's mistakes to the mistakes of any expert.
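The two bounds on the potential can be checked empirically. The sketch below simulates Weighted Majority on random advice and asserts both bounds after every step; the function name `check_potential_bounds`, the random inputs, and the floating-point tolerance are assumptions of this example rather than part of the lecture.

```python
import random


def check_potential_bounds(n: int, steps: int, eps: float, seed: int = 0) -> int:
    """Simulate WM on random binary advice and assert both potential bounds.

    Returns the number of mistakes the algorithm made.
    """
    rng = random.Random(seed)
    weights = [1.0] * n
    expert_mistakes = [0] * n
    alg_mistakes = 0
    tol = 1e-9  # slack for floating-point rounding
    for _ in range(steps):
        advice = [rng.randint(0, 1) for _ in range(n)]
        truth = rng.randint(0, 1)
        total = sum(weights)
        yes_weight = sum(w for w, a in zip(weights, advice) if a == 1)
        choice = 1 if yes_weight >= total / 2 else 0
        alg_mistakes += int(choice != truth)
        for i, a in enumerate(advice):
            if a != truth:
                weights[i] *= 1 - eps
                expert_mistakes[i] += 1
        phi = sum(weights)
        # Upper bound: Phi <= n * (1 - eps/2)^m
        assert phi <= n * (1 - eps / 2) ** alg_mistakes * (1 + tol)
        # Lower bound: Phi >= (1 - eps)^(m_i) for every expert i
        assert all(phi >= (1 - eps) ** m * (1 - tol) for m in expert_mistakes)
    return alg_mistakes


mistakes = check_potential_bounds(n=5, steps=200, eps=0.1)
```

Random advice is a poor case for the experts themselves, but the sandwich on $\Phi^{(t)}$ holds for any input, which is exactly what the proof relies on.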

Taking the natural logarithm of both sides:

$$\ln n + m^{(t)} \ln\!\left(1 - \frac{\varepsilon}{2}\right) \ge m_i^{(t)} \ln(1 - \varepsilon)$$

Rearranging:

$$m^{(t)} \ln\!\left(1 - \frac{\varepsilon}{2}\right) \ge m_i^{(t)} \ln(1 - \varepsilon) - \ln n$$

Multiplying both sides by $-1$ (reversing the inequality):

$$m^{(t)} \big(-\ln(1 - \tfrac{\varepsilon}{2})\big) \le m_i^{(t)} \big(-\ln(1 - \varepsilon)\big) + \ln n$$

Step 10: Apply Logarithmic Inequalities (via Taylor Series)

Starting from the last inequality, divide both sides by $-\ln(1 - \varepsilon/2)$ (which is positive) to isolate $m^{(t)}$:

$$m^{(t)} \le \frac{m_i^{(t)} \cdot \big(-\ln(1 - \varepsilon)\big) + \ln n}{-\ln(1 - \varepsilon/2)}$$

To bound this, we need two things going in opposite directions:

  • A lower bound on the denominator $-\ln(1-\varepsilon/2)$, so the overall fraction stays an upper bound on $m^{(t)}$.
  • An upper bound on the numerator factor $-\ln(1-\varepsilon)$.

Both follow from the Taylor series, valid for $0 \le x < 1$:

$$\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots$$

Every term of $-\ln(1-x)$ is nonnegative, so keeping only the first term gives a lower bound. For the upper bound, the tail satisfies $\frac{x^2}{2} + \frac{x^3}{3} + \cdots \le \frac{x^2}{2}\left(1 + x + x^2 + \cdots\right) = \frac{x^2}{2(1-x)} \le x^2$ whenever $x \le 1/2$. Hence, for $0 \le x \le 1/2$:

$$-\ln(1 - x) \ge x \quad \text{(lower bound)}$$

$$-\ln(1 - x) \le x + x^2 \quad \text{(upper bound)}$$

Applying the lower bound to the denominator (with $x = \varepsilon/2$) and the upper bound to the numerator (with $x = \varepsilon \le 1/2$):

$$m^{(t)} \le \frac{m_i^{(t)} \cdot (\varepsilon + \varepsilon^2) + \ln n}{\varepsilon/2} = 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}$$

For all experts $i \in [n]$:

$$m^{(t)} \le 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}$$

In particular, this bound holds for the best expert in hindsight.
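The final step of the proof rests on the two logarithmic inequalities $x \le -\ln(1-x) \le x + x^2$. A quick numerical sanity check, on a grid chosen arbitrarily for this sketch, suggests where they hold; the helper name `log_bounds_hold` is hypothetical.

```python
import math


def log_bounds_hold(x: float) -> bool:
    """Check whether x <= -ln(1 - x) <= x + x^2 for the given x in [0, 1)."""
    value = -math.log(1 - x)
    return x <= value <= x + x * x


# Both bounds hold on a grid over [0, 0.5].
grid = [i / 1000 for i in range(0, 501)]
assert all(log_bounds_hold(x) for x in grid)

# The upper bound is not valid for every x < 1: it already fails at x = 0.9.
assert not log_bounds_hold(0.9)
```

This is why the analysis is usually stated for $\varepsilon \le 1/2$: the upper bound $-\ln(1-\varepsilon) \le \varepsilon + \varepsilon^2$ breaks down for larger values.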

Intuition Behind the Potential Function Proof

The proof tracks a quantity called the potential function, defined as the total weight of all experts. This potential represents how much overall “trust” we place in the experts at any time.

  • Good experts keep weight: Experts who make few mistakes retain relatively high weight over time.
  • Bad experts lose weight: Experts who make mistakes repeatedly see their weight decrease exponentially.
  • Mistakes reduce total weight: When the algorithm makes a mistake, at least half of the total weight supported the wrong decision. As a result, a significant portion of the weight is penalized, causing the total weight to decrease.

The key idea is to compare two views of the same quantity (the potential):

  • An upper bound, showing that the total weight decreases quickly as the algorithm makes mistakes.
  • A lower bound, showing that at least one good expert maintains relatively large weight.

By combining these two, we conclude that:

The total weight cannot decrease too fast unless the best expert also makes many mistakes.

In many computing systems, we maintain a small, fast-access cache alongside a large, slow memory. When a requested item is already in the cache, it is served immediately. When it is not, a cache miss occurs: the item must be fetched from slower memory, and if the cache is full, an existing item must be evicted to make room.

  • We are given a cache of size $k$ and a sequence of data requests $S = \{r_1, r_2, r_3, \ldots\}$.
  • At each step, we receive a request $r_t$. If $r_t$ is not in the cache, a cache miss occurs and we must evict some item to insert $r_t$.
  • The cost of an algorithm on a given sequence is the total number of cache misses it incurs.
  • Eviction decisions must be made without knowledge of future requests, making this an online algorithm problem.

Goal: Design an eviction strategy that minimizes the number of cache misses, even without seeing future requests.

We are given a cache of size $k$ and a request sequence $S$.

  • At each step $t$, the cache holds at most $k$ items.
  • When a cache miss occurs, the eviction policy determines which item to remove.

At step $t$, upon a cache miss, the following online policies can be applied:

  • LRU (Least Recently Used): Evict the item that has not been requested for the longest time. Concretely, for every item in the cache, we look at when it was last requested in the access sequence, and evict whichever was requested earliest.

  • LFU (Least Frequently Used): Track the cumulative access frequency of each cached item. Evict the item with the lowest frequency. Items accessed more often are assumed more likely to be needed again.

  • FIFO (First In, First Out): Evict the item that has been in the cache the longest — that is, the item that was brought into the cache earliest. This is different from LRU because LRU tracks when an item was last requested, while FIFO tracks when it was inserted into the cache. An item can be requested many times and still be the first evicted under FIFO if it was the first to enter.

All three policies are online algorithms because they operate using only information about past requests and require no knowledge of future requests.
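The three policies can be sketched as small miss counters. The function names are hypothetical, and two modelling choices are assumptions of this sketch: requests into a cache that is not yet full count as misses (no eviction needed), and LFU counts frequency only while an item is cached, breaking ties by evicting the smallest item.

```python
from collections import OrderedDict, deque
from typing import Dict, Sequence


def lru_misses(requests: Sequence[int], k: int) -> int:
    """Count misses under LRU with cache size k."""
    cache: "OrderedDict[int, None]" = OrderedDict()  # least recently used first
    misses = 0
    for r in requests:
        if r in cache:
            cache.move_to_end(r)  # r becomes the most recently used
            continue
        misses += 1
        if len(cache) == k:
            cache.popitem(last=False)  # evict the least recently used
        cache[r] = None
    return misses


def fifo_misses(requests: Sequence[int], k: int) -> int:
    """Count misses under FIFO with cache size k."""
    cache = set()
    order = deque()  # insertion order, oldest first
    misses = 0
    for r in requests:
        if r in cache:
            continue  # a hit does not change the insertion order
        misses += 1
        if len(cache) == k:
            cache.remove(order.popleft())  # evict the oldest insertion
        cache.add(r)
        order.append(r)
    return misses


def lfu_misses(requests: Sequence[int], k: int) -> int:
    """Count misses under LFU; ties go to the smallest item (arbitrary choice)."""
    freq: Dict[int, int] = {}  # frequency of each currently cached item
    misses = 0
    for r in requests:
        if r in freq:
            freq[r] += 1
            continue
        misses += 1
        if len(freq) == k:
            victim = min(freq, key=lambda item: (freq[item], item))
            del freq[victim]
        freq[r] = 1
    return misses


requests = [3, 8, 9, 1, 3, 9, 2, 1, 4]
counts = (lru_misses(requests, 3), fifo_misses(requests, 3), lfu_misses(requests, 3))
```

Each function only ever looks at requests it has already processed, which is what makes these policies online.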

Example (cache of size $k = 3$)

Consider the following request sequence:

$$S = \{3, 8, 9, 1, 3, 9, 2, 1, 4, \ldots\}$$

After the first three requests, the cache is fully populated and nothing has had to be evicted yet:

$$\text{Cache} = \{3, 8, 9\}$$

The next request is for item $1$, which is not in the cache, so a cache miss occurs and the cost increases by 1. We must evict one of $\{3, 8, 9\}$ to make room. Each policy decides as follows:

  • LRU evicts item $3$, since it was requested least recently among $\{3, 8, 9\}$.
  • LFU may evict any item, since all three have been requested exactly once. If it evicts $8$, then $3$ and $9$ each reach a request frequency of 2 when they are requested again, so neither is evicted when $2$ is inserted.
  • FIFO evicts item $3$, since it was the first item to enter the cache.
  • Farthest in Future (God’s Algorithm): When a cache miss occurs, look at all items currently in the cache and identify which one will be requested farthest in the future. Evict that item. If an item in the cache will never be requested again, evict it first.

This algorithm was proven optimal by Bélády: no algorithm, even one with full knowledge of the future, can incur fewer cache misses. Since it requires complete knowledge of the entire request sequence, it is an offline algorithm and cannot be implemented in practice. However, it serves as a benchmark for competitive analysis.

Example

Using the same example as earlier:

$$S = \{3, 8, 9, 1, 3, 9, 2, 1, 4, \ldots\}, \quad k = 3$$

After the first three requests: $\text{Cache} = \{3, 8, 9\}$.

OPT looks ahead in the sequence and evicts item $8$, since $8$ does not appear again in the portion of the sequence shown.

The cache after evicting $8$ and inserting $1$:

$$\text{Cache} = \{3, 1, 9\}$$
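The eviction rule can be written as a short offline routine. This is a sketch under stated assumptions: the names `next_use` and `farthest_in_future` are hypothetical, the sequence is truncated to the nine requests shown (so anything beyond them counts as "never requested again"), and ties are broken by evicting the smallest item.

```python
from typing import List, Sequence, Set, Tuple


def next_use(requests: Sequence[int], start: int, item: int) -> float:
    """Index of the first request for item at or after start, or infinity."""
    for j in range(start, len(requests)):
        if requests[j] == item:
            return j
    return float("inf")


def farthest_in_future(requests: Sequence[int], k: int) -> Tuple[int, List[int]]:
    """Return (number of misses, evicted items in order) for Farthest in Future."""
    cache: Set[int] = set()
    misses = 0
    evicted: List[int] = []
    for t, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) == k:
            # Evict the cached item whose next request lies farthest ahead.
            # sorted() makes tie-breaking deterministic (smallest item wins).
            victim = max(sorted(cache), key=lambda x: next_use(requests, t + 1, x))
            cache.remove(victim)
            evicted.append(victim)
        cache.add(r)
    return misses, evicted


requests = [3, 8, 9, 1, 3, 9, 2, 1, 4]
misses, evicted = farthest_in_future(requests, 3)
```

The first eviction is item 8, matching the example. Because the routine scans the whole remaining sequence on each miss, it only makes sense as an offline benchmark.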

Theorem (Sleator-Tarjan)

  • Both LRU and FIFO are $k$-competitive, where $k$ is the size of the cache.
  • No deterministic online algorithm can be better than $k$-competitive.
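One way to see why a factor of $k$ is unavoidable for LRU is to request $k+1$ distinct pages cyclically. The sketch below, which is an illustration rather than the lecture's proof, compares LRU against Farthest in Future on such a sequence; the function names and the parameters `k = 3` and `rounds = 100` are arbitrary choices for this example.

```python
from collections import OrderedDict
from typing import List


def lru_misses(requests: List[int], k: int) -> int:
    """Count misses under LRU with cache size k."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    misses = 0
    for r in requests:
        if r in cache:
            cache.move_to_end(r)
            continue
        misses += 1
        if len(cache) == k:
            cache.popitem(last=False)
        cache[r] = None
    return misses


def opt_misses(requests: List[int], k: int) -> int:
    """Count misses under Farthest in Future with cache size k."""
    cache = set()
    misses = 0
    for t, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) == k:
            future = requests[t + 1:]
            # Items never requested again count as infinitely far away.
            victim = max(
                sorted(cache),
                key=lambda x: future.index(x) if x in future else float("inf"),
            )
            cache.remove(victim)
        cache.add(r)
    return misses


k, rounds = 3, 100
# Pages 1..k+1 requested cyclically: one more page than the cache can hold.
cyclic = [page for _ in range(rounds) for page in range(1, k + 2)]
lru, opt = lru_misses(cyclic, k), opt_misses(cyclic, k)
ratio = lru / opt
```

On this sequence LRU always evicts exactly the page that is requested next, so every request is a miss, while Farthest in Future misses only about once every $k$ requests. The ratio approaches $k$ as the sequence grows.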

Since no deterministic online algorithm can beat $k$-competitiveness in the standard setting, we ask: how much more cache do we need to give an online algorithm to be able to compete with OPT? This idea is formalized by resource augmentation.

  • Suppose our online algorithm is given a cache of size $k$, while OPT is restricted to a cache of size $k' < k$.
  • Under this model, LRU and FIFO are $\dfrac{k}{k - k' + 1}$-competitive against OPT.

Example: Suppose we are given a cache of size $k = 2000$ and OPT is given $k' = 1000$. Then:

$$\frac{k}{k - k' + 1} = \frac{2000}{1001} \approx 2$$

So LRU and FIFO are roughly 2-competitive against an optimal algorithm with half the cache. We can compensate for not seeing the future by using a cache twice as large.
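The ratio is easy to tabulate for other cache sizes. The helper name `augmented_ratio` and the sample sizes in the loop are hypothetical; the formula is the one stated above.

```python
def augmented_ratio(k: int, k_opt: int) -> float:
    """Ratio k / (k - k' + 1) for LRU/FIFO with cache k against OPT with cache k'."""
    if not 1 <= k_opt <= k:
        raise ValueError("need 1 <= k_opt <= k")
    return k / (k - k_opt + 1)


# Hypothetical sweep: fix the online cache at 2000 and vary OPT's cache.
for k_opt in (2000, 1500, 1000, 500):
    print(k_opt, round(augmented_ratio(2000, k_opt), 3))
```

With $k' = k$ the formula recovers the standard ratio of $k$, and the ratio falls quickly as the online algorithm's advantage in cache size grows.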