Lecture 24 (05/06/2026) - Prove MWU Expert's Theorem (Potential Function); Introduce Paging Problem | CSCI 328

Scribes: Moinuddin Rahat and Michelle Lam

Topics covered:

Introducing the Multiplicative Weight Updates (MWU) framework and the Weighted Majority algorithm
Experts are assigned weights that decrease when they make mistakes
Proving a mistake bound: the algorithm performs nearly as well as the best expert in hindsight using a potential function argument
Introducing the paging problem (cache management in online settings)
Discussing eviction strategies: LRU, LFU, FIFO, and optimal Farthest-in-Future
Applying competitive analysis to evaluate paging algorithms
Showing LRU and FIFO are $k$-competitive and discussing resource augmentation

Experts’ Theorem

In many situations, we need to make decisions repeatedly over time.

We are given $n$ experts, where each expert provides a suggestion at every step.
The decisions are binary, meaning there are only two possible choices (e.g., buy or sell).
At each step, we choose an action based on the experts’ suggestions.
At the end of each step, we receive feedback and can determine whether expert $i$’s suggestion was correct or a mistake.

Goal: Design a strategy that performs nearly as well as the best expert in hindsight, even though we do not know this expert in advance.

Question

Is there a strategy that performs almost as well as the best expert? Can we design a method whose performance is close to the best expert in hindsight?

Weighted Majority Algorithm (MWU)

We are given $n$ experts, indexed by $i \in [n]$ .

Initially, all experts are assigned equal weights:

w_i^{(1)} = 1 \quad \forall\, i \in [n]

At each step $t$ , every expert has a weight $w_i^{(t)}$ .

Weight Update Rule

At step $t+1$ , the weights are updated as follows:

If expert $i$ was correct:

w_i^{(t+1)} = w_i^{(t)}

If expert $i$ was incorrect:

w_i^{(t+1)} = (1 - \varepsilon)\, w_i^{(t)}

Here, $\varepsilon$ is a small constant. We assume $0 < \varepsilon \le 1/2$, which the final step of the proof relies on.

Decision Rule

At time step $t$ , we use the weighted votes of all experts:

Compute the total weight of experts recommending each action.
Choose Sell if:

\sum_{i \in \text{Sell}} w_i^{(t)} \ge \frac{1}{2} \sum_{i=1}^{n} w_i^{(t)}

Otherwise, choose Buy.
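The update and decision rules can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `weighted_majority`, the encoding (1 = Sell, 0 = Buy), and the toy data are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    predictions: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    eps: float = 0.1,
) -> Tuple[int, List[int], List[float]]:
    """Run Weighted Majority on binary advice.

    predictions[t][i] is expert i's suggestion at step t (1 = Sell, 0 = Buy).
    outcomes[t] is the action that turned out to be correct at step t.
    Returns (algorithm mistakes, per-expert mistakes, final weights).
    """
    n = len(predictions[0])
    weights = [1.0] * n              # every expert starts with weight 1
    alg_mistakes = 0
    expert_mistakes = [0] * n

    for advice, truth in zip(predictions, outcomes):
        total = sum(weights)
        sell_weight = sum(w for w, a in zip(weights, advice) if a == 1)
        # Decision rule: Sell if Sell holds at least half of the total weight.
        choice = 1 if sell_weight >= total / 2 else 0
        if choice != truth:
            alg_mistakes += 1
        # Update rule: multiply the weight of each wrong expert by (1 - eps).
        for i, a in enumerate(advice):
            if a != truth:
                expert_mistakes[i] += 1
                weights[i] *= 1 - eps

    return alg_mistakes, expert_mistakes, weights


# Hypothetical toy data: 3 experts, 4 steps.
demo_predictions = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
demo_outcomes = [1, 1, 0, 1]
print(weighted_majority(demo_predictions, demo_outcomes, eps=0.5))
# -> (1, [0, 2, 3], [1.0, 0.25, 0.125])
```

In this toy run the algorithm errs only on the first step, when all weights are still equal. After that, the expert who is always right holds most of the weight.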

Mistake Bound (Performance Guarantee)

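We want a guarantee that the number of mistakes made by the Weighted Majority (WM) algorithm is within a constant factor $\alpha$ of the best expert’s, up to an additive term $\beta$ that does not grow with the number of steps:

\text{Mistakes(WM)} \le \alpha \cdot \text{Mistakes(best expert)} + \beta

More generally, we want this for all experts $i \in [n]$:

\text{Mistakes(WM)} \le \alpha \cdot \text{Mistakes}(i) + \beta

In particular, the inequality then holds when $i$ is the best expert in hindsight. The theorem below achieves $\alpha = 2(1+\varepsilon)$ and $\beta = \frac{2 \ln n}{\varepsilon}$.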

Theorem: Mistake Bound for MWU

After $t$ steps, let:

$m^{(t)}$ : number of mistakes made by the WM algorithm so far
$m_i^{(t)}$ : number of mistakes made by expert $i$ so far

Then, for all experts $i \in [n]$ :

m^{(t)} \le 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}

In particular, for the best expert:

m^{(t)} \le 2(1+\varepsilon)\, m_{\text{best}}^{(t)} + \frac{2 \ln n}{\varepsilon}

Meaning of symbols:

$t$ : number of rounds (time steps)
$m^{(t)}$ : total mistakes made by the algorithm up to time $t$
$m_i^{(t)}$ : mistakes made by expert $i$ up to time $t$
$n$ : number of experts
$\varepsilon > 0$ : a small parameter used in weight updates
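To get a feel for the two terms in the bound, it helps to plug in numbers. The helper below is a hypothetical illustration (the name `mwu_mistake_bound` and the sample values are assumptions); it simply evaluates the right-hand side of the theorem.

```python
import math


def mwu_mistake_bound(expert_mistakes: int, n: int, eps: float) -> float:
    """Right-hand side of the theorem: 2(1 + eps) * m_i + 2 ln(n) / eps."""
    return 2 * (1 + eps) * expert_mistakes + 2 * math.log(n) / eps


# Sample values (assumed for illustration): 100 experts, best expert errs 10 times.
print(round(mwu_mistake_bound(10, 100, 0.1), 1))   # -> 114.1
print(round(mwu_mistake_bound(10, 100, 0.5), 1))   # -> 48.4
```

A smaller $\varepsilon$ brings the multiplicative factor closer to 2 but inflates the additive term $\frac{2 \ln n}{\varepsilon}$. Which choice is better therefore depends on how many mistakes the best expert is expected to make.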

Proof: Weighted Majority via Potential Function

We prove the mistake bound using a potential function argument.

Step 1: Initialization and Weights

Initially, each expert has weight $w_i^{(1)} = 1$ .
After $t$ steps:

w_i^{(t)} = (1 - \varepsilon)^{m_i^{(t)}}

Each time expert $i$ makes a mistake, its weight is multiplied by $(1 - \varepsilon)$ . After $m_i^{(t)}$ mistakes, this results in an exponential decrease in weight.

Step 2: Potential Function

Define the potential function:

\Phi^{(t)} = \sum_{i=1}^{n} w_i^{(t)}

Initially: $\Phi^{(1)} = n$ .

The potential function measures the total “trust” we place in all experts.

Step 3: Effect of a Mistake

Suppose the algorithm makes a mistake at time $t$ .

Split the total weight according to how each expert did at time $t$:

$P$: total weight of the experts that were correct
$Y$: total weight of the experts that were incorrect

Then $P + Y = \Phi^{(t)}$ .

After updating the weights:

\Phi^{(t+1)} = P + (1 - \varepsilon)Y = \Phi^{(t)} - \varepsilon Y

Since incorrect experts are penalized, the total weight decreases by $\varepsilon Y$ .

Step 4: Key Inequality

If the algorithm makes a mistake at time $t$ , then at least half of the total weight must have supported the wrong decision. Hence:

Y \ge \frac{\Phi^{(t)}}{2}

Substituting into the update from Step 3:

\Phi^{(t+1)} \le \Phi^{(t)} - \varepsilon \cdot \frac{\Phi^{(t)}}{2} = \Phi^{(t)}\left(1 - \frac{\varepsilon}{2}\right)

Step 5: After Multiple Mistakes (Upper Bound)

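Weights never increase, so the potential cannot grow on steps where the algorithm is correct, and by Step 4 each mistake multiplies it by at most $\left(1 - \frac{\varepsilon}{2}\right)$. If the algorithm has made $m^{(t)}$ mistakes in total, then starting from $\Phi^{(1)} = n$: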

\Phi^{(t)} \le n \left(1 - \frac{\varepsilon}{2}\right)^{m^{(t)}}

Thus, each mistake causes the potential to shrink multiplicatively.

Step 6: Lower Bound via Any Expert

For any expert $i$ :

\Phi^{(t)} \ge w_i^{(t)} = (1 - \varepsilon)^{m_i^{(t)}}

The total weight is at least as large as any individual expert’s weight.

Step 7: Combine Bounds

Combining the upper and lower bounds on the potential function:

n \left(1 - \frac{\varepsilon}{2}\right)^{m^{(t)}} \ge (1 - \varepsilon)^{m_i^{(t)}}

This relates the algorithm’s mistakes to the mistakes of any expert.
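The upper and lower bounds on the potential can also be checked numerically. The sketch below is an illustration under assumptions: it uses experts that guess uniformly at random (a choice made only for the demo), a seeded generator, and a hypothetical function name. After every step it asserts that the potential lies between the largest single weight and $n(1 - \varepsilon/2)^{m}$.

```python
import random
from typing import Tuple


def check_potential_sandwich(
    n: int = 8, steps: int = 200, eps: float = 0.25, seed: int = 0
) -> Tuple[int, int]:
    """Simulate Weighted Majority and assert the Step 5 and Step 6 bounds.

    Returns (algorithm mistakes, mistakes of the best expert).
    """
    rng = random.Random(seed)
    weights = [1.0] * n
    alg_mistakes = 0
    expert_mistakes = [0] * n

    for _ in range(steps):
        truth = rng.randint(0, 1)
        advice = [rng.randint(0, 1) for _ in range(n)]   # random experts (assumption)

        total = sum(weights)
        sell_weight = sum(w for w, a in zip(weights, advice) if a == 1)
        choice = 1 if sell_weight >= total / 2 else 0
        if choice != truth:
            alg_mistakes += 1

        for i, a in enumerate(advice):
            if a != truth:
                expert_mistakes[i] += 1
                weights[i] *= 1 - eps

        phi = sum(weights)                                # potential after the update
        upper = n * (1 - eps / 2) ** alg_mistakes         # Step 5
        lower = max(weights)                              # Step 6, for the best expert
        # Small relative tolerance guards against floating-point rounding.
        assert lower <= phi <= upper * (1 + 1e-9)

    return alg_mistakes, min(expert_mistakes)


alg, best = check_potential_sandwich()
print(alg, best)
```

Because the assertion runs inside the loop, a violation of either bound at any step would stop the simulation immediately.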

Step 8: Take Logarithms

\ln n + m^{(t)} \ln\!\left(1 - \frac{\varepsilon}{2}\right) \ge m_i^{(t)} \ln(1 - \varepsilon)

Step 9: Rearrange

Rearranging:

m^{(t)} \ln\!\left(1 - \frac{\varepsilon}{2}\right) \ge m_i^{(t)} \ln(1 - \varepsilon) - \ln n

Multiplying both sides by $-1$ (reversing the inequality):

m^{(t)} \big(-\ln(1 - \tfrac{\varepsilon}{2})\big) \le m_i^{(t)} \big(-\ln(1 - \varepsilon)\big) + \ln n

Step 10: Apply Logarithmic Inequalities (via Taylor Series)

Starting from Step 9, divide both sides by $-\ln(1 - \varepsilon/2)$ (which is positive) to isolate $m^{(t)}$ :

m^{(t)} \le \frac{m_i^{(t)} \cdot \big(-\ln(1 - \varepsilon)\big) + \ln n}{-\ln(1 - \varepsilon/2)}

To bound this, we need two things going in opposite directions:

A lower bound on the denominator $-\ln(1-\varepsilon/2)$ , so the overall fraction stays an upper bound on $m^{(t)}$ .
An upper bound on the numerator factor $-\ln(1-\varepsilon)$ .

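Both follow from the Taylor series of $\ln(1 - x)$, valid for $0 \le x < 1$:

\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots

Every term of $-\ln(1 - x)$ is nonnegative, so keeping only the first term gives a lower bound:

-\ln(1 - x) \ge x

For the upper bound, assume $0 < x \le 1/2$ and compare the remaining terms with a geometric series:

-\ln(1 - x) = x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots \le x + \frac{x^2}{2}\left(1 + x + x^2 + \cdots\right) = x + \frac{x^2}{2(1 - x)} \le x + x^2

Applying the lower bound to the denominator (with $x = \varepsilon/2$) and the upper bound to the numerator (with $x = \varepsilon \le 1/2$):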

m^{(t)} \le \frac{m_i^{(t)} \cdot (\varepsilon + \varepsilon^2) + \ln n}{\varepsilon/2} = 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}
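The two logarithm bounds used in this step are easy to sanity-check numerically. The snippet below is a small illustrative check (the function name `log_bounds_hold` is an assumption); it confirms $x \le -\ln(1-x) \le x + x^2$ on a grid in $(0, 1/2]$ and shows that the upper bound can fail when $x$ is close to 1.

```python
import math


def log_bounds_hold(x: float) -> bool:
    """True if x <= -ln(1 - x) <= x + x^2 for the given x in (0, 1)."""
    value = -math.log(1 - x)
    return x <= value <= x + x * x


# Grid 0.001, 0.002, ..., 0.500: both bounds hold everywhere on it.
grid = [k / 1000 for k in range(1, 501)]
print(all(log_bounds_hold(x) for x in grid))   # -> True

# Near 1 the upper bound breaks: -ln(0.1) is about 2.30, but 0.9 + 0.81 = 1.71.
print(log_bounds_hold(0.9))                    # -> False
```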

Result

For all experts $i \in [n]$ :

m^{(t)} \le 2(1+\varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon}

In particular, this bound holds for the best expert in hindsight.

Intuition Behind the Potential Function Proof

The proof tracks a quantity called the potential function, defined as the total weight of all experts. This potential represents how much overall “trust” we place in the experts at any time.

Good experts keep weight: Experts who make few mistakes retain relatively high weight over time.
Bad experts lose weight: Experts who make mistakes repeatedly see their weight decrease exponentially.
Mistakes reduce total weight: When the algorithm makes a mistake, at least half of the total weight supported the wrong decision. As a result, a significant portion of the weight is penalized, causing the total weight to decrease.

The key idea is to compare two views of the same quantity (the potential):

An upper bound, showing that the total weight decreases quickly as the algorithm makes mistakes.
A lower bound, showing that at least one good expert maintains relatively large weight.

By combining these two, we conclude that:

The total weight cannot decrease too fast unless the best expert also makes many mistakes.

Paging and Cache Eviction

In many computing systems, we maintain a small, fast-access cache alongside a large, slow memory. When a requested item is already in the cache, it is served immediately. When it is not, a cache miss occurs: the item must be fetched from slower memory, and if the cache is full, an existing item must be evicted to make room.

We are given a cache of size $k$ and a sequence of data requests $S = \{r_1, r_2, r_3, \ldots\}$ .
At each step, we receive a request $r_t$. If $r_t$ is not in the cache, a cache miss occurs: $r_t$ is brought into the cache, and if the cache is already full we must first evict some item to make room.
The cost of an algorithm on a given sequence is the total number of cache misses it incurs.
Eviction decisions must be made without knowledge of future requests, making this an online algorithm problem.

Goal: Design an eviction strategy that minimizes the number of cache misses, even without seeing future requests.

Cache Eviction Heuristics

We are given a cache of size $k$ and a request sequence $S$ .

At each step $t$ , the cache holds at most $k$ items.
When a cache miss occurs, the eviction policy determines which item to remove.

Eviction Policies

At step $t$ , upon a cache miss, the following online policies can be applied:

LRU (Least Recently Used): Evict the item that has not been requested for the longest time. Concretely, for every item in the cache, we look at when it was last requested in the access sequence, and evict whichever was requested earliest.
LFU (Least Frequently Used): Track the cumulative access frequency of each cached item. Evict the item with the lowest frequency. Items accessed more often are assumed more likely to be needed again.
FIFO (First In, First Out): Evict the item that has been in the cache the longest — that is, the item that was brought into the cache earliest. This is different from LRU because LRU tracks when an item was last requested, while FIFO tracks when it was inserted into the cache. An item can be requested many times and still be the first evicted under FIFO if it was the first to enter.

All three policies are online algorithms because they operate using only information about past requests and require no knowledge of future requests.

Example (cache of size $k = 3$ )

Consider the following request sequence:

S = \{3, 8, 9, 1, 3, 9, 2, 1, 4, \ldots\}

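After the first three requests, the cache is fully populated and no evictions have been needed. (Each of these three requests fills an empty slot, a miss that every algorithm must incur.)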

\text{Cache} = \{3, 8, 9\}

The next request is for item $1$ , which is not in the cache — a cache miss occurs, so the cost increases by 1. We must evict one of $\{3, 8, 9\}$ to make room. Each policy makes a different decision:

LRU evicts item $3$ , since it was requested least recently among $\{3, 8, 9\}$ .
LFU may evict any of the three, since each has been requested exactly once; the choice depends on the tie-breaking rule. Later, once $3$ and $9$ have each been requested a second time, their frequency of 2 protects them when $2$ is inserted, and an item with frequency 1 is evicted instead.
FIFO evicts item $3$ , since it was the first item to enter the cache.
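The LRU and FIFO decisions above can be reproduced with a short simulation. This is a sketch under assumptions: the function names are hypothetical, the cache starts empty, every request for an absent item counts as a miss (including the first three that fill the cache), and the nine requests shown are treated as the whole sequence. LFU is left out because its behavior here depends on a tie-breaking rule the notes do not fix.

```python
from collections import OrderedDict, deque
from typing import Hashable, List, Sequence, Tuple


def lru(requests: Sequence[Hashable], k: int) -> Tuple[int, List[Hashable]]:
    """Simulate LRU; return (number of misses, evicted items in order)."""
    cache: "OrderedDict[Hashable, bool]" = OrderedDict()  # oldest use first
    misses, evicted = 0, []
    for r in requests:
        if r in cache:
            cache.move_to_end(r)           # a hit refreshes recency
            continue
        misses += 1
        if len(cache) == k:
            victim, _ = cache.popitem(last=False)   # least recently used
            evicted.append(victim)
        cache[r] = True
    return misses, evicted


def fifo(requests: Sequence[Hashable], k: int) -> Tuple[int, List[Hashable]]:
    """Simulate FIFO; return (number of misses, evicted items in order)."""
    cache = set()
    order = deque()                        # insertion order, oldest first
    misses, evicted = 0, []
    for r in requests:
        if r in cache:
            continue                       # a hit does NOT change the order
        misses += 1
        if len(cache) == k:
            victim = order.popleft()       # earliest inserted
            cache.remove(victim)
            evicted.append(victim)
        cache.add(r)
        order.append(r)
    return misses, evicted


S = [3, 8, 9, 1, 3, 9, 2, 1, 4]
print(lru(S, 3))    # -> (8, [3, 8, 1, 3, 9])
print(fifo(S, 3))   # -> (7, [3, 8, 9, 1])
```

Both policies evict $3$ first, as described above. They diverge later because a hit on $9$ refreshes its position under LRU but not under FIFO.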

Optimal Algorithm (OPT)

Farthest in Future (God’s Algorithm): When a cache miss occurs, look at all items currently in the cache and identify which one will be requested farthest in the future. Evict that item. If an item in the cache will never be requested again, evict it first.

This algorithm was proven optimal by Bélády: no algorithm, even one with full knowledge of the future, can incur fewer cache misses. Since it requires complete knowledge of the entire request sequence, it is an offline algorithm and cannot be implemented in practice. However, it serves as a benchmark for competitive analysis.

Example

Using the same example as earlier:

S = \{3, 8, 9, 1, 3, 9, 2, 1, 4, \ldots\} \quad k = 3

After the first three requests: $\text{Cache} = \{3, 8, 9\}$ .

OPT looks ahead in the sequence and evicts item $8$, since $8$ does not appear again in the portion of the sequence shown, while $3$ and $9$ are both requested again shortly afterwards.

The cache after this first eviction:

\text{Cache} = \{3, 1, 9\}
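Farthest in Future is simple to simulate offline once the whole sequence is known. The sketch below is illustrative only: the function name is hypothetical, the nine requests shown are treated as the entire sequence, the cache starts empty, and ties between items that are never requested again are broken arbitrarily.

```python
from typing import Hashable, List, Sequence, Tuple


def farthest_in_future(
    requests: Sequence[Hashable], k: int
) -> Tuple[int, List[Hashable]]:
    """Simulate Farthest in Future; return (misses, evicted items in order)."""
    cache = set()
    misses, evicted = 0, []
    for t, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) == k:
            def next_use(item: Hashable) -> float:
                # Index of the next request for `item`, or infinity if none.
                for j in range(t + 1, len(requests)):
                    if requests[j] == item:
                        return j
                return float("inf")

            victim = max(cache, key=next_use)   # farthest (or never) used again
            cache.remove(victim)
            evicted.append(victim)
        cache.add(r)
    return misses, evicted


S = [3, 8, 9, 1, 3, 9, 2, 1, 4]
misses, evicted = farthest_in_future(S, 3)
print(misses, evicted[0])   # -> 6 8
```

On this prefix the first eviction is $8$, matching the example, and the total is 6 misses including the three that fill the empty cache. The linear scan in `next_use` favors readability over speed.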

Resource Augmentation

Theorem (Sleator-Tarjan)

Both LRU and FIFO are $k$-competitive, where $k$ is the size of the cache. That is, on every request sequence they incur at most $k$ times as many cache misses as OPT, up to an additive constant.
No deterministic online algorithm can achieve a competitive ratio better than $k$.

Since no deterministic online algorithm can beat $k$-competitiveness in the standard setting, we ask: how much more cache do we need to give an online algorithm to be able to compete with OPT? This idea is formalized by resource augmentation.

Suppose our online algorithm is given a cache of size $k$, while OPT is restricted to a cache of size $k' \le k$.
Under this model, LRU and FIFO are $\dfrac{k}{k - k' + 1}$-competitive against OPT. Setting $k' = k$ recovers the ratio $k$ from the theorem above.

Example: Suppose we are given a cache of size $k = 2000$ and OPT is given $k' = 1000$ . Then:

\frac{k}{k - k' + 1} = \frac{2000}{2000 - 1000 + 1} = \frac{2000}{1001} \approx 2

So LRU and FIFO are roughly 2-competitive against an optimal algorithm with half the cache. We can compensate for not seeing the future by using a cache twice as large.
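The ratio is easy to tabulate for other cache sizes. The helper below is a hypothetical illustration (the name `augmented_ratio` is an assumption); it evaluates $\frac{k}{k - k' + 1}$ exactly using fractions.

```python
from fractions import Fraction


def augmented_ratio(k: int, k_opt: int) -> Fraction:
    """Competitive ratio k / (k - k' + 1) for online cache k vs. OPT cache k'."""
    if not 1 <= k_opt <= k:
        raise ValueError("need 1 <= k_opt <= k")
    return Fraction(k, k - k_opt + 1)


print(augmented_ratio(2000, 1000))          # -> 2000/1001
print(float(augmented_ratio(2000, 1000)))   # roughly 1.998
print(augmented_ratio(2000, 2000))          # -> 2000 (no augmentation)
```

With equal cache sizes the ratio is the full factor $k$, while doubling the online cache brings it down to about 2.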