
CS-450 Advanced Algorithms

A prerequisite for this course is CS-250 Algorithms.

When does the greedy algorithm work?

Maximum weight spanning trees

We’ll start with a problem called maximum weight spanning trees.

  • Input: A graph with edge weights
  • Output: A spanning tree of maximum weight

A spanning tree is a subgraph connecting all vertices of $G$ using the minimum possible number of edges. Being a tree, it is acyclic by definition.

Kruskal’s algorithm

We’ll consider the simplest greedy algorithm: Kruskal’s algorithm1, which greedily adds the largest weight edge if it doesn’t create a cycle.

def kruskal_greedy(G, w):
    """
    Input:   a connected undirected graph G=(V, E)
             edge weights w
    Output:  a maximum weight spanning tree S
    """
    S = set()
    sort edges in decreasing weight order
    for i = 1 to |E|:
        if S + E[i] is acyclic:
            S += E[i]
    return S
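
To make the pseudocode concrete, here is a small runnable sketch (my own, not the course's reference code) of the same greedy rule, using a union-find structure as the acyclicity test; the graph is assumed to be given as a list of weighted edges.

    def kruskal_max_spanning_tree(n, edges):
        """n: number of vertices (labeled 0..n-1);
        edges: list of (weight, u, v) tuples of a connected undirected graph.
        Returns a list of edges forming a maximum weight spanning tree."""
        parent = list(range(n))

        def find(x):
            # Union-find "find" with path halving.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        tree = []
        # Consider edges by decreasing weight; adding an edge is safe iff its
        # endpoints lie in different components (i.e. it creates no cycle).
        for w, u, v in sorted(edges, reverse=True):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv          # union the two components
                tree.append((u, v, w))
        return tree

    # Tiny example: a triangle plus one pendant vertex.
    print(kruskal_max_spanning_tree(4, [(3, 0, 1), (1, 1, 2), (2, 0, 2), (5, 2, 3)]))
    # -> [(2, 3, 5), (0, 1, 3), (0, 2, 2)]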

Running time

The running time of this algorithm is:

  • Sorting: $O(m \log m)$, where $m = |E|$.
  • For-loop: to determine whether a graph is acyclic, we can use union-find, which is $O(m\,\alpha(n))$ (where $\alpha(n)$, the inverse Ackermann function, is almost a constant)

Thus, the total running time is $O(m \log m)$.

Useful facts for the correctness proof

In the correctness proof below, we'll need a few useful facts, which we won't take the time to prove. We define $n := |V|$.

  • In a connected graph, the spanning tree has $n - 1$ edges.
  • Suppose we have a graph $G' = (V, F)$, where $F$ is an acyclic set of $k$ edges. Then the number of connected components of $G'$ is $n - k$. Intuitively, we get to this result by observing the following:
    • If $|F| = 0$, then there are no edges, and every node is a connected component, so there are $n$ connected components
    • If $|F| = 1$, then there's only one edge, so we have $n - 1$ components (the added edge connected two components)
    • If $|F| = k$, then there are $n - k$ connected components

Correctness proof

We will prove the following lemma by contradiction.

Lemma: Kruskal correctness

kruskal_greedy returns a maximum weight spanning tree.

We’ll suppose for the sake of contradiction that kruskal_greedy doesn’t return a maximum weight spanning tree. This means that we suppose it picks edges $S = \{s_1, s_2, \dots, s_{n-1}\}$ (in this order)2, but that there exists a spanning tree $T = \{t_1, t_2, \dots, t_{n-1}\}$ of higher weight, $w(T) > w(S)$.

Note that the edges of both sets are indexed in decreasing weight order. $T$ is of higher weight than $S$, so there exists a first index $p$ such that $w(t_p) > w(s_p)$. We can use this index to define two sets:

  • $A = \{t_1, \dots, t_p\}$ is the set of the first $p$ edges of $T$. Because the edges are indexed in decreasing weight order and because $w(t_p) > w(s_p)$, all these weights are $> w(s_p)$.
  • $B = \{s_1, \dots, s_{p-1}\}$ is the set of the first $p - 1$ edges of $S$. Because kruskal_greedy picked these before $s_p$, all these weights are $\ge w(s_p)$.

We can notice that $|A| > |B|$, which is useful because of the following key property:

Key property: As $|A| > |B|$, $\exists e \in A \setminus B$ such that $B + e$ is acyclic.3

Proof of the key property

This follows from the fact that an acyclic graph with $k$ edges has $n - k$ connected components, as we previously stated. Thus, for any acyclic edge sets $A, B$ with $|A| > |B|$, the graph $(V, A)$ has fewer connected components than $(V, B)$, so there must be an edge $e \in A$ connecting two different components of $(V, B)$; it follows that $e \notin B$ and that $B + e$ is acyclic.

More succinctly, $\exists e \in A \setminus B$ such that $B + e$ is acyclic.

This edge $e$ is in $A$, so $w(e) > w(s_p)$. Since it has higher weight, $e$ must have been considered by the algorithm before $s_p$.

When $e$ was considered, the algorithm checked whether $e$ could be added to the current set at the time, which we’ll call $S'$. In the above, we stated that $B + e$ is acyclic. Since $S' \subseteq B$ and since a subset of acyclic edges is also acyclic4, $S' + e$ is also acyclic.

Since adding the edge $e$ to $S'$ doesn’t create a cycle, according to the algorithm, it must have been selected to be a part of $S$. So $e$ is one of the first $p - 1$ edges picked, i.e. $e \in B$, but also $e \notin B$ (because of our initial assumption that $e \in A \setminus B$), which is a contradiction.

Conclusion

In the correctness proof above, we used two properties of acyclic graphs:

  • A subset of acyclic edges is acyclic
  • For two acyclic edge sets $A$ and $B$ such that $|A| > |B|$, there is $e \in A \setminus B$ such that $B + e$ is acyclic

The generalization of these two properties will lead us to matroids.

Matroids

Definition

Definition: Matroid

A matroid $M = (E, \mathcal{I})$ is defined on a finite ground set $E$ of elements, and a family $\mathcal{I} \subseteq 2^E$ of subsets of $E$, satisfying two properties:

  • $(I_1)$: If $X \subseteq Y$ and $Y \in \mathcal{I}$, then $X \in \mathcal{I}$
  • $(I_2)$: If $X \in \mathcal{I}$, $Y \in \mathcal{I}$ and $|Y| > |X|$, then $\exists e \in Y \setminus X$ such that $X + e \in \mathcal{I}$

The $(I_1)$ property is called downward-closedness, or the hereditary property; it tells us that by losing elements of a feasible solution, we still retain a feasible solution.

The sets of $\mathcal{I}$ are called independent sets: if $X \in \mathcal{I}$, we say that $X$ is independent.

The $(I_2)$ property is the augmentation property. It implies that every maximal5 independent set is of maximum cardinality6. This means that all maximal independent sets have the same cardinality. An independent set of maximum cardinality is called a base of the matroid.

Since $\mathcal{I}$ can be exponential in the size of the ground set $E$, we assume that we have a membership oracle which we can use to efficiently check whether $X \in \mathcal{I}$ for a set $X \subseteq E$.

Algorithm

def greedy(M, w):
    """
    Input:  a matroid M = (E, I),
            |E| weights w
    Output: a maximum weight base S
    """
    S = set()
    sort elements of E in decreasing weight order
    for i = 1 to |E|:
        if S + E[i] in I:
            S += E[i]
    return S
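
As a concrete (hypothetical) instance of the pseudocode above, the sketch below runs the same greedy loop with a membership oracle for a partition matroid: the ground set is split into color classes and at most $k_c$ elements may be picked from class $c$. Any other oracle could be plugged in instead.

    def matroid_greedy(elements, weight, is_independent):
        """Greedy for a matroid given by a membership oracle.
        elements: ground set E; weight: dict e -> weight;
        is_independent: function taking a set and returning True/False."""
        S = set()
        for e in sorted(elements, key=lambda e: weight[e], reverse=True):
            if is_independent(S | {e}):
                S.add(e)
        return S

    def partition_oracle(color, k):
        # Independent iff at most k[c] elements of each color c are picked.
        def is_independent(X):
            counts = {}
            for e in X:
                counts[color[e]] = counts.get(color[e], 0) + 1
            return all(counts[c] <= k[c] for c in counts)
        return is_independent

    color = {"a": 1, "b": 1, "c": 2, "d": 2}
    k = {1: 1, 2: 2}
    weight = {"a": 5, "b": 4, "c": 3, "d": 2}
    print(matroid_greedy(color.keys(), weight, partition_oracle(color, k)))
    # -> {'a', 'c', 'd'}  (only one element of color 1 may be picked)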

Correctness proof

We’d like to prove the following theorem:

Theorem

For any ground set $E$ and a family of subsets $\mathcal{I}$, greedy finds a maximum weight base for any set of weights $w: E \to \mathbb{R}$ if and only if $M = (E, \mathcal{I})$ is a matroid.

The if direction ($\Leftarrow$) follows from the correctness proof we did for Kruskal’s algorithm.

For the only if direction ($\Rightarrow$), we must prove the following claim, which we give in the contrapositive form7.

Claim

Suppose $M = (E, \mathcal{I})$ is not a matroid. Then there exists an assignment of weights $w$ such that greedy does not return a maximum weight base.

To prove this, we’re going to cook up some weights for which greedy doesn’t return a maximum weight base when $M$ isn’t a matroid (i.e. the pair $(E, \mathcal{I})$ violates either $(I_1)$ or $(I_2)$).

First, consider the case where $M$ isn’t a matroid because it violates $(I_1)$. Then there exist two sets $S \subset T$ such that $T \in \mathcal{I}$ but $S \notin \mathcal{I}$. Let’s pick the following weights:

This weight assignment has been chosen so that the algorithm considers the elements of $S$ first, then those of $T \setminus S$, then everything else. Since $S \notin \mathcal{I}$, the algorithm only selects a strict subset of $S$. Best case, that strict subset has size $|S| - 1$, so the solution’s weight is at most:

Second, consider the case where $(I_2)$ is violated (but $(I_1)$ isn’t). Let $S, T \in \mathcal{I}$ be two independent sets such that $|S| < |T|$, and $\forall e \in T \setminus S$, $S + e \notin \mathcal{I}$.

Let’s use the following weights, which again have been carefully chosen to have the algorithm consider elements of $S$, then of $T \setminus S$, then everything else. The value of the weights is also carefully chosen to get a nice closed form later.

Because of $(I_1)$ (and our assumption that $S \in \mathcal{I}$), greedy will first select all elements in $S$. But at this point, because we assumed $(I_2)$ to be violated, we have $S + e \notin \mathcal{I}$ for every $e \in T \setminus S$. That means that we cannot pick any element of $T \setminus S$ to add to the solution, so the solution would have weight:

However, since $T \in \mathcal{I}$ and $|T| > |S|$, the optimal solution has strictly greater value, so greedy does not return a maximum weight base.

Examples


Graphic matroid

In our initial example about maximum spanning trees, we were actually considering a matroid $M = (E, \mathcal{I})$ in which $E$ is the set of edges of the graph, and $\mathcal{I}$ is defined as:

$$\mathcal{I} = \{X \subseteq E : X \text{ is acyclic}\}$$

k-Uniform matroid

The k-Uniform matroid is a matroid $M = (E, \mathcal{I})$ in which $\mathcal{I}$ is given by:

$$\mathcal{I} = \{X \subseteq E : |X| \le k\}$$

$(I_1)$ is satisfied because dropping elements of a set of size $\le k$ still leaves a set of size $\le k$. $(I_2)$ is satisfied because $|X| < |Y| \le k$ implies that $|X + e| \le k$ for any $e \in Y \setminus X$.

Partition matroid

In a partition matroid, the ground set $E$ is partitioned into disjoint subsets $E_1, \dots, E_\ell$ (we can think of them as representing different colors, for example). Each such subset $E_i$ has an integer $k_i$ associated to it, stating how many elements can be picked from it at most: $\mathcal{I} = \{X \subseteq E : |X \cap E_i| \le k_i \ \forall i\}$.

Linear matroid

The linear matroid is defined for a matrix $A$. The ground set $E$ is the index set of the columns of $A$; a subset $X \subseteq E$ gives us a matrix $A_X$ containing the columns indexed by $X$. The matroid is then given by:

$$\mathcal{I} = \{X \subseteq E : \operatorname{rank}(A_X) = |X|\}$$

Truncated matroid

Matroids can be constructed from other matroids. For instance, the truncated matroid $M_k = (E, \mathcal{I}_k)$ is constructed from the matroid $M = (E, \mathcal{I})$, such that:

$$\mathcal{I}_k = \{X \in \mathcal{I} : |X| \le k\}$$

Laminar matroid

Laminar matroids are defined by a family $\mathcal{F}$ of subsets of the ground set $E$ that satisfy the following: if $A, B \in \mathcal{F}$, then either $A \cap B = \emptyset$, or $A \subseteq B$, or $B \subseteq A$.

We can think of these subsets as being part of a tree, where each node is the union of its children. For each set $A \in \mathcal{F}$, we define an integer $k_A$. The matroid is then defined by:

$$\mathcal{I} = \{X \subseteq E : |X \cap A| \le k_A \ \forall A \in \mathcal{F}\}$$

Gammoid (“Netflix matroid”)

A gammoid describes a set of vertices that can be reached by vertex-disjoint paths in a directed graph. An example of this is Netflix, which may want to know how many people they can reach from their server nodes if each link has a certain max capacity.

Matroid intersection

Matroids form a rich set of problems that can be solved by the greedy algorithm, but there are also many problems with efficient algorithms that aren’t matroids. This is the case for problems that aren’t matroids themselves, but can be defined as the intersection of two matroids.

The intersection of two matroids $M_1 = (E, \mathcal{I}_1)$ and $M_2 = (E, \mathcal{I}_2)$ is:

$$M_1 \cap M_2 = (E, \mathcal{I}_1 \cap \mathcal{I}_2)$$

The intersection of two matroids satisfies $(I_1)$, but generally not $(I_2)$.

The following theorem adds a lot of power to the concept of matroids.

Theorem

There is an efficient algorithm for finding a max-weight independent set in the intersection of two matroids.

Here, efficient means polynomial time, if we assume a polynomial time membership oracle. This is the case for all the matroid examples seen in class.

Definition of bipartite matching

For instance, we can consider the example of bipartite matching.

  • Input: A bipartite graph $G = (A \cup B, E)$, where $A$ and $B$ are two disjoint vertex sets
  • Output: A matching $M \subseteq E$ of maximum weight (or maximum cardinality8).

A matching means that each vertex is incident to at most one edge, i.e. that:

For this problem, we can define a structure $(E, \mathcal{I})$ with $E$ being the edge set, and $\mathcal{I}$ defined as:

$$\mathcal{I} = \{X \subseteq E : X \text{ is a matching}\}$$

This satisfies downward-closedness $(I_1)$ but not the augmentation property $(I_2)$.

Bipartite matching as a matroid intersection

The bipartite matching problem can be defined as the intersection of two matroids:

We’ll use two partition matroids $M_A$ and $M_B$ imposing the same restrictions, but on $A$ and $B$, respectively. The ground set for both matroids is $E$. The partitioning of the ground set in $M_A$ is defined as follows, and similarly for $M_B$: $E = \bigcup_{a \in A} \delta(a)$.

Here, $\delta(v)$ denotes the edges incident to a vertex $v$. Note that since $G$ is bipartite, none of the vertices in $A$ are connected to each other; this means that the above indeed creates a partitioning into disjoint sets. We let $k_a = 1$ for every $a$ in $A$, so the family of independent sets for the first matroid is:

$$\mathcal{I}_A = \{X \subseteq E : |X \cap \delta(a)| \le 1 \ \forall a \in A\}$$

In other words, a set of edges is independent for $M_A$ (i.e. in $\mathcal{I}_A$) if it has at most one edge incident to every vertex of $A$ (no restrictions on how many edges can be incident to vertices of $B$). Defining $M_B$ symmetrically, we see why the matroid intersection $\mathcal{I}_A \cap \mathcal{I}_B$ corresponds exactly to the matchings of $G$.

Colorful spanning trees

  • Input: A graph $G = (V, E)$, where every edge $e$ has one of $k$ colors, which we denote by $c(e)$
  • Output: Whether the graph has a spanning tree in which all edges have a different color

We want to find a spanning tree, so we’ll use a graphic matroid for $M_1$.

The color assignment gives us a partition of $E$ into disjoint sets $E_1, \dots, E_k$, where each $E_i$ contains all the edges of color $i$. This gives us a hint for the second matroid: we need a partition matroid. We let $M_2$ be defined by:

$$\mathcal{I}_2 = \{X \subseteq E : |X \cap E_i| \le 1 \ \forall i\}$$

Arborescences

  • Input: A directed graph $D = (V, A)$ and a special root vertex $r \in V$ that has no incoming edges
  • Output: An arborescence of $D$

An arborescence $T$ of a directed graph is a spanning tree (ignoring edge directions) directed away from $r$. In an arborescence $T$, every vertex in $V$ is reachable from $r$.

More formally, $T$ is an arborescence if:

  1. $T$ is a spanning tree (ignoring edge direction)
  2. $r$ has in-degree 0 and all other nodes have in-degree 1 in the edge set $T$ (considering edge direction)

These two criteria give us a good hint of which matroids to choose. We let $M_1$ be a graphic matroid for $D$ (disregarding the edge direction), and $M_2$ be a partition matroid in which:

$$\mathcal{I}_2 = \{X \subseteq A : |X \cap \delta^-(v)| \le 1 \ \forall v \in V \setminus \{r\}\}$$

Here, $\delta^-(v)$ denotes the set of arcs incoming to $v$. In this partition matroid, the independent edge sets are those that have at most one arc incoming to every vertex $v \neq r$.

Any arborescence is independent in both matroids, by definition. Conversely, a set independent in both matroids of cardinality $n - 1$ (so that it is a base in both) is an arborescence: being a spanning tree, it has a unique path between $r$ and any vertex $v$.

Linear Programming

Maximum cardinality bipartite matching

Definitions

We’ve seen the definition and one algorithm for bipartite matching. Let’s look into an alternative method.

Recall that a path is a collection of edges $\{(v_0, v_1), (v_1, v_2), \dots, (v_{k-1}, v_k)\}$ where all the $v_i$’s are distinct vertices. We can represent a path as $v_0 \to v_1 \to \dots \to v_k$.

A maximal path is a path to which we cannot add any edges to make it longer.

A maximum length path is the longest path. Note that it is also a maximal path, but that this is a stronger assertion.

An alternating path with respect to an edge set $M$ is a path that alternates between edges in $M$ and edges in $E \setminus M$.

An augmenting path with respect to an edge set $M$ is an alternating path in which the first and last vertices are unmatched, meaning that they are not incident to an edge in $M$.

In a bipartite graph, the matching may define alternating paths in which we cannot revisit a node (by definition of a matching). Also note that an augmenting path is one that increases the matching; this is the core idea behind the following algorithm.

Algorithm

def augmenting_path_algorithm(G):
    """
    Input:  Bipartite graph G = (V, E)
    Output: Matching M of maximum cardinality
    """
    M = set()
    while exists an augmenting path P wrt. M:
        M = M △ P   # symmetric difference
    return M

$\triangle$ is the symmetric difference, which we can also write as $M \triangle P = (M \setminus P) \cup (P \setminus M)$.

An efficient algorithm to find an augmenting path is to make the edges in $M$ directed from $B$ to $A$, and the edges in $E \setminus M$ directed from $A$ to $B$. We can then run BFS from the unmatched vertices in $A$, looking for unmatched vertices in $B$. This can be run in $O(|V| + |E|)$ time for bipartite graphs (it is harder in general graphs). Seeing that a matching can have up to $|V|/2$ edges and that each augmenting path increases the matching size by one, we have to run $O(|V|)$ iterations of the loop. Therefore, the total runtime of the algorithm is $O(|V| \cdot (|V| + |E|))$.
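A minimal sketch of this search (my own, not the course's reference code), assuming the bipartite graph is given as adjacency lists `adj` for the left vertices and the current matching as two dicts `match_left` / `match_right`:

    from collections import deque

    def augment(adj, match_left, match_right, free_left):
        """BFS from the free left vertices: unmatched edges are traversed
        left->right, matched edges right->left. If an unmatched right vertex
        is reached, flip the augmenting path and return True."""
        parent = {}                       # right vertex -> left vertex it was reached from
        queue = deque(free_left)
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if b in parent:
                    continue
                parent[b] = a
                if b not in match_right:  # b unmatched: augmenting path found
                    while True:           # flip edges back along the BFS tree
                        a = parent[b]
                        prev_b = match_left.get(a)
                        match_left[a], match_right[b] = b, a
                        if prev_b is None:
                            return True
                        b = prev_b
                else:
                    queue.append(match_right[b])  # follow the matched edge back
        return False

    # Same graph as in the figures below (A, B, C on the left; D, E, F on the right).
    adj = {"A": ["E", "D"], "B": ["E", "F"], "C": ["F"]}
    match_left, match_right = {}, {}
    while augment(adj, match_left, match_right,
                  [a for a in adj if a not in match_left]):
        pass
    print(match_left)  # e.g. {'A': 'D', 'B': 'E', 'C': 'F'}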

Here’s how the algorithm works. Suppose we already have a graph with a matching (in red):

(Figure: a bipartite graph with left vertices A, B, C and right vertices D, E, F, and edges A–D, A–E, B–E, B–F, C–F; the matching edges A–E and B–F are drawn in red.)

The augmenting path $P = D \to A \to E \to B \to F \to C$ corresponds to all the edges in the graph. The symmetric difference $M \triangle P$ corresponds to a new matching of cardinality 3:

(Figure: the new matching A–D, B–E, C–F.)

Correctness proof

We now prove the correctness of this algorithm, which is to say that it indeed finds a maximum matching. The algorithm returns a matching $M$ with respect to which there are no augmenting paths, so to prove correctness we must prove the following:

Theorem: Augmenting Path Algorithm Correctness

A matching $M$ is maximum if and only if there are no augmenting paths with respect to $M$.

The proof is by contradiction.

First, let’s prove the ($\Rightarrow$) direction. Suppose towards contradiction that $M$ is maximum, but that there exists an augmenting path $P$ with respect to $M$. Then $M \triangle P$ is a matching of greater cardinality than $M$, which contradicts the optimality of $M$.

Then, let’s prove the ($\Leftarrow$) direction. We must prove that the lack of augmenting paths implies that $M$ is maximum. We’ll prove the contrapositive, so suppose $M$ is not maximum, i.e. that there is a matching $M^*$ such that $|M^*| > |M|$. Let $Q = M \triangle M^*$; intuitively, this edge set represents the edges that $M$ and $M^*$ disagree on.

From there on, we reason as follows:

  • $Q$ has more edges from $M^*$ than from $M$ (since $|M^*| > |M|$, which implies that $|M^* \setminus M| > |M \setminus M^*|$)
  • In $Q$, every vertex has degree at most 2, with at most one edge from $M$, and at most one edge from $M^*$. Thus, every component in $Q$ is either:
    • a path (where middle nodes have degree two and the ends of the path have degree one), or
    • a cycle (where all nodes have degree two)
  • The cycles and paths that compose $Q$ alternate between edges from $M$ and $M^*$ (we cannot have vertices incident to two edges of the same set, as $M$ and $M^*$ are matchings). This leads us to the following observations:
    • In cycles, there is the same number of edges from $M$ and $M^*$
    • In paths, the numbers of edges from $M$ and from $M^*$ differ by at most one
  • Let’s remove cycles from consideration, and concentrate on paths. We can do so and still retain the property of having more edges from $M^*$ than from $M$. Since $|M^*| > |M|$, there must be at least one path with strictly more edges from $M^*$ than from $M$; it must start and end with an $M^*$ edge, and alternate between the sets in between. This path is an augmenting path with respect to $M$.

Therefore, there must exist an augmenting path with respect to $M$. This proves the contrapositive, so the direction stating that no augmenting path implies $M$ is maximum is proven.

Generalization

In the above, we’ve seen an algorithm for unweighted bipartite matching. If we’d like to generalize it to weighted bipartite matching8, we’ll have to introduce a powerful algorithmic tool: linear programming.

Definition of Linear Programming

A linear program (LP) is the problem of finding values for $n$ variables $x_1, \dots, x_n \in \mathbb{R}$ that minimize (or equivalently, maximize) a given linear objective function, subject to $m$ linear constraints, each of the form $\sum_j a_{i,j} x_j \ge b_i$, $\sum_j a_{i,j} x_j = b_i$ or $\sum_j a_{i,j} x_j \le b_i$.

Here the coefficients $a_{i,j}$, $b_i$ and $c_j$ (in the objective) are real numbers.

Strictly speaking, it isn’t necessary to allow equality and upper-bound constraints:

  • We can get an equality constraint with a lower-bound and an upper-bound.
  • We can transform an upper-bound into a lower-bound by multiplying all factors and the bound by $-1$.

Still, we allow them for simpler notation for now. Strict inequality, however, is disallowed (otherwise, we wouldn’t have a closed set, and there would never be an optimal solution, as we could always get closer to the bound by an infinitesimal amount).

Extreme points

A feasible solution is an extreme point if it cannot be written as a convex combination of other feasible solutions.

A convex combination of points $x_1, \dots, x_k$ is a point of the form $\sum_i \lambda_i x_i$, where the $\lambda_i \ge 0$ satisfy $\sum_i \lambda_i = 1$.

To gain some intuition about what an extreme point is, we can take a look at the diagram below:

(Figure: a feasible point that isn’t an extreme point.)

In the above, $x$ is not an extreme point because $x = \frac{1}{2}(x + \varepsilon) + \frac{1}{2}(x - \varepsilon)$. If we can walk an amount $\varepsilon$ in two opposing directions while staying feasible, then $x$ is not an extreme point.

Extreme points are important because they have useful structural properties that we can exploit to design algorithms and construct proofs. We can state the following theorem about extreme points:

Theorem

If the feasible region is bounded, then there always exists an optimum which is an extreme point.

The proof is as follows. As the feasible region is bounded, there is an optimal solution $x^*$. If $x^*$ happens to be an extreme point, we are done. The real work in this proof is for the case where $x^*$ isn’t an extreme point. To prove this, we’ll have to introduce a small lemma:

Lemma

Any feasible point can be written as a convex combination of the extreme points.

This is essentially proven by the following diagram:

(Figure: a feasible point and the extreme points from which it is constructed.)

Indeed, if we draw a line from a feasible point in a bounded domain, we’ll hit the bounds in two locations and , which are convex combination of the extreme points , and , , respectively. is itself a convex combination of and , and thus of the extreme points , , and .

With this lemma in place, we can write the feasible solution $x^*$ as a convex combination of extreme points: $x^* = \sum_i \lambda_i x_i$.

Where the $x_i$ are extreme points, and the $\lambda_i \ge 0$ satisfy $\sum_i \lambda_i = 1$.

Let $c$ be the vector defining the objective that we wish to maximize (the proof is the same if we minimize). Then we have:

$$c^T x^* = c^T \Big( \sum_i \lambda_i x_i \Big) = \sum_i \lambda_i \, (c^T x_i)$$

This proves that $c^T x^*$ is a weighted average of the $c^T x_i$. In other words, the value of the objective function at $x^*$ is a weighted average of its values at the extreme points.

We conclude the proof with the “tallest person in the class” argument. If we measured the height of all the people in a class, and got an average value of 170cm, we could say that at least one person has height at least 170cm. For the same reason, we can conclude from the above that $\exists i$ such that $c^T x_i \ge c^T x^*$.

This extreme point $x_i$ gives us a value for the objective function that is at least as high as that of $x^*$. Since $x^*$ was an optimal solution, this extreme point is also an optimal solution.

Maximum weight bipartite perfect matching

The problem corresponds to the following:

  • Input: A bipartite graph $G = (A \cup B, E)$ where $|A| = |B| = n$, and edge weights $w: E \to \mathbb{R}$
  • Output: A perfect matching $M$ maximizing $w(M) = \sum_{e \in M} w_e$

A matching is perfect if every vertex is incident to exactly one edge in the matching (i.e. every vertex is matched). We need $|A| = |B|$ for this to work.

This problem can be formulated as the following linear program:

$$\max \sum_{e \in E} w_e x_e \quad \text{subject to} \quad \sum_{e \in \delta(v)} x_e = 1 \ \ \forall v \in A \cup B, \qquad x_e \ge 0 \ \ \forall e \in E$$

The constraints say that every vertex is chosen exactly once. The intended meaning for the variables $x_e$ is that $x_e = 1$ if $e$ is in the matching, and $x_e = 0$ otherwise.

But in linear programming, variables cannot take on discrete values; instead, we can only have $x_e \in \mathbb{R}$, which gives us the constraint $x_e \ge 0$. The upper bound of 1 is implied by the other constraints, so for the sake of conciseness, we only give the minimal set of constraints above.

Still, we will see that relaxing $x_e$ to be in $[0, 1]$ still gives us an optimal solution for which $x_e \in \{0, 1\}$.

Claim

For bipartite graphs, any extreme point solution to the LP is integral.

Integral means that every $x_e$ is an integer; coupled with the constraint that $0 \le x_e \le 1$, being integral implies $x_e \in \{0, 1\}$.

We’ll prove this by contradiction. Let $x$ be an extreme point for the graph $G$, and let $E_f = \{e \in E : 0 < x_e < 1\}$ be the set of edges for which the extreme point solution is not integral. We suppose toward contradiction that the solution contains such edges, i.e. that $E_f \neq \emptyset$.

We have the constraint that for any given vertex $a \in A$ or $b \in B$, $\sum_{e \in \delta(v)} x_e = 1$. This means that either:

  • All the edges incident to the vertex have integral values $x_e \in \{0, 1\}$, and the vertex therefore has no incident edges in $E_f$.
  • At least one edge incident to the vertex has non-integral value $x_e$. But because of the above constraint, it takes at least two non-integral values to sum up to 1, so having one incident edge in $E_f$ implies having at least two incident edges in $E_f$.

This implies that $E_f$ must contain a cycle: because we only have a finite number of vertices and because no vertex can have degree 1 in $E_f$, a path in $E_f$ must eventually close back on itself.

Further, cycles in a bipartite graph must have even length. This follows from the fact that to get back to a vertex in $A$, we must go to $B$ and back to $A$ some number $k$ of times, for a total of $2k$ edges.

All these edges in the cycle are fractional (being in $E_f$). Since this is a proof by contradiction, we’ll try to find two feasible points $y$ and $z$ such that $x = \frac{1}{2}(y + z)$, which would contradict the assumption that $x$ is an extreme point.

Let $e_1, e_2, \dots, e_{2k}$ be the edges of the cycle. Let $y$ and $z$ be defined for each edge as follows: $y_{e_i} = x_{e_i} + (-1)^i \varepsilon$ and $z_{e_i} = x_{e_i} - (-1)^i \varepsilon$ on the cycle, and $y_e = z_e = x_e$ elsewhere.

The degree constraints are satisfied by $y$ and $z$, because they alternate between adding and subtracting $\varepsilon$ along a cycle of even length, and we assumed that $x$ satisfies them.

To ensure feasibility, we must choose a small $\varepsilon$ to guarantee that all $y_e, z_e \ge 0$. We can choose the following value to do so: $\varepsilon = \min_{e \in \text{cycle}} \min(x_e, 1 - x_e)$.

Note that this indeed satisfies $x = \frac{1}{2}(y + z)$ with $y \neq z$, so we have a contradiction of the assumption that $x$ is an extreme point.

Bipartite perfect matching polytope

The polytope9 corresponding to the bipartite perfect matching LP constraints is called the bipartite perfect matching polytope.

We can solve the maximum weight bipartite matching problem by solving the above linear program.

General graphs

Unfortunately, this is only for bipartite graphs; for general graphs, we have a problem with odd cycles. We could solve this by imposing an additional constraint on odd cycles:

Unfortunately, this adds exponentially many constraints. A proof from a few years ago established that there is no way around this.

Vertex cover for bipartite graphs

The vertex cover problem is defined as follows:

  • Input: A graph $G = (V, E)$ with node weights $w: V \to \mathbb{R}_{\ge 0}$
  • Output: A vertex cover $C \subseteq V$ that minimizes $w(C) = \sum_{v \in C} w_v$

A vertex cover is a vertex set $C$ that ensures that every edge has at least one endpoint in $C$. In other words, $C$ is a vertex cover if $\forall \{u, v\} \in E$, we have $u \in C$ or $v \in C$.

Just like maximum weight bipartite perfect matching, the vertex cover problem can be formulated as LP constraints:

$$\min \sum_{v \in V} w_v x_v \quad \text{subject to} \quad x_u + x_v \ge 1 \ \ \forall \{u, v\} \in E, \qquad x_v \ge 0 \ \ \forall v \in V$$

The constraints tell us that at least one of the endpoints should be in the vertex cover, as the intended meaning for $x_v$ is that $x_v = 1$ if $v \in C$, and $x_v = 0$ otherwise.

Claim

For bipartite graphs, any extreme point to the vertex cover LP is integral.

We’ll prove this by contradiction. Let $x$ be an extreme point, and $V_f = \{v \in V : 0 < x_v < 1\}$ be the vertices with fractional values in $x$. Suppose toward contradiction that $V_f \neq \emptyset$.

Since $G$ is bipartite, the vertices are split into disjoint subsets $A$ and $B$. In the same way, we can divide $V_f$ into disjoint subsets $A_f$ and $B_f$, which are the fractional vertices in $A$ and $B$, respectively.

As previously, we’ll define $y$ and $z$ such that $x = \frac{1}{2}(y + z)$, by setting a small $\varepsilon > 0$ and letting $y_v = x_v + \varepsilon$ for $v \in A_f$, $y_v = x_v - \varepsilon$ for $v \in B_f$, and $y_v = x_v$ otherwise ($z$ is defined with the signs flipped).

Let’s verify that these are feasible solutions. We’ll just verify $y$, but the proof for $z$ follows the same reasoning.

By the selection of $\varepsilon$, we have $0 \le y_v \le 1$ for all $v$, which satisfies the second constraint.

To verify the first constraint, we must verify that $y_u + y_v \ge 1$ for every edge $\{u, v\} \in E$.

  • If $u \notin V_f$ (or $v \notin V_f$), then the edge was covered integrally by $x$, i.e. $x_u = 1$ (or $x_v = 1$), so $y_u = 1$ (or $y_v = 1$), so the constraint holds.
  • If $u, v \in V_f$, then one endpoint gets $+\varepsilon$ and the other $-\varepsilon$ (the edge goes between $A_f$ and $B_f$), which also respects the constraint, because $y_u + y_v = x_u + x_v \ge 1$.

The last inequality holds because $x$ is a feasible solution and thus satisfies the first LP constraint.

These $y$ and $z$ are feasible solutions with $x = \frac{1}{2}(y + z)$, and therefore contradict our initial assumption that $x$ is an extreme point; the claim is therefore verified.

For general graphs, we have the same problem with odd cycles as for matchings. But the situation is actually even worse in this case, as the problem is NP-hard for general graphs, and we do not expect to have efficient algorithms for it. Still, we’ll see an approximation algorithm later on.

In the above, we have seen two integrality proofs. These both have the same general approach: construct $y$ and $z$ by adding and subtracting $\varepsilon$ from $x$ such that $y \neq z$, $x = \frac{1}{2}(y + z)$, and such that $y$ and $z$ are feasible solutions. To prove feasibility, we must choose a construction in which all the $\varepsilon$ cancel out. If that isn’t possible, then we must make sure the variables for which the $\varepsilon$ don’t cancel out are “on their own” (i.e. not with any other variables) in the constraint.

Duality

Intuition

Consider the following linear program:

We’re looking for an optimal solution OPT. To find this value, we can ask about the upper bound and about the lower bound.

  • Is there a solution of cost ? Is OPT ?
  • Is there no better solution? Is OPT ?

For instance, for the max weight bipartite perfect matching problem, we can ask10:

  • Is there a matching of value at least ? In other words, is ?
  • Do all matchings have value at most ? In other words, is ?

Answering the first kind of question is easy enough: we can just find a feasible solution to the LP with an objective function that fits the bill (i.e. is ).

To answer the second kind of question though, we have to be a little smarter. The intuition is that a linear combination of primal constraints cannot exceed the primal objective function. This leads us to define the following dual LP with variables and :

Let’s formalize this approach for the general case.

General case

Consider the following linear program with $n$ variables $x_j$, for $j \in [n]$, and $m$ constraints:

$$\min \sum_{j=1}^{n} c_j x_j \quad \text{subject to} \quad \sum_{j=1}^{n} A_{i,j} x_j \ge b_i \ \ \forall i \in [m], \qquad x \ge 0$$

The matrix of factors $A$ is an $m \times n$ matrix (with $j$ ranging over columns, and $i$ over rows). This means that the factors in the primal are in the same layout as in $A$; we’ll see that they’re transposed in the dual.

Indeed, the dual program has $m$ variables $y_i$, for $i \in [m]$, and $n$ constraints:

$$\max \sum_{i=1}^{m} b_i y_i \quad \text{subject to} \quad \sum_{i=1}^{m} A_{i,j} y_i \le c_j \ \ \forall j \in [n], \qquad y \ge 0$$
Note that the dual of the dual is the primal, so we can convert problems both ways.

Weak duality

Weak duality tells us that the value of every dual-feasible solution is a lower bound on the value of every primal-feasible solution.

Theorem: Weak duality

If $x$ is a feasible solution to the primal problem and $y$ is feasible to the dual, then:

$$\sum_{i=1}^{m} b_i y_i \le \sum_{j=1}^{n} c_j x_j$$

We can prove this by bounding the left-hand side:

$$b^T y \le (Ax)^T y = x^T (A^T y) \le x^T c = c^T x$$

The first inequality uses the constraints of the primal ($Ax \ge b$), and the second uses the constraints of the dual ($A^T y \le c$). We also use the fact that $x, y \ge 0$ throughout.

This leads us to the conclusion that the optimal solution of the primal is lower bounded by the optimal solution to the dual. In fact, we can make a stronger assertion than that, which is what we do in strong duality.

Strong duality

Strong duality tells us that the dual solutions aren’t just a lower bound, but that the optimal solutions to the dual and the primal are equal.

Theorem: Strong Duality

If $x^*$ is an optimal primal solution and $y^*$ is an optimal dual solution, then:

$$\sum_{j} c_j x_j^* = \sum_{i} b_i y_i^*$$

Furthermore, if the primal problem is unbounded, then the dual problem is infeasible, and vice versa: if the dual is unbounded, the primal is infeasible.

Example: Maximum Cardinality Matching and Vertex Cover on bipartite graphs

These two problems are the dual of each other. Remember that max cardinality matching is the following problem:

  • Input: a bipartite graph
  • Output: a matching

We define a variable $x_e$ for each edge $e$, with the intended meaning that $\sum_{e \in \delta(v)} x_e$ represents the number of edges incident to $v$ in the matching $M$. This leads us to defining the following LP:

$$\max \sum_{e \in E} x_e \quad \text{subject to} \quad \sum_{e \in \delta(v)} x_e \le 1 \ \ \forall v \in A \cup B, \qquad x_e \ge 0$$

The dual program is:

$$\min \sum_{v \in A \cup B} y_v \quad \text{subject to} \quad y_u + y_v \ge 1 \ \ \forall \{u, v\} \in E, \qquad y_v \ge 0$$

This dual problem corresponds to the LP relaxation of (unweighted) vertex cover, which returns a vertex set $C$. By weak duality, we have $|M| \le |C|$ for any matching $M$ and vertex cover $C$. Since both the primal and the dual are integral for bipartite graphs, we have strong duality, which implies Kőnig’s theorem.

Kőnig’s theorem

Kőnig’s theorem describes an equivalence between the maximum matching and the vertex cover problem in bipartite graphs. Another Hungarian mathematician, Jenő Egerváry, discovered it independently the same year11.

Theorem: Kőnig’s Theorem

Let $M^*$ be a maximum cardinality matching and $C^*$ be a minimum vertex cover of a bipartite graph. Then $|M^*| = |C^*|$.

Complementarity slackness

As a consequence of strong duality, we have a strong relationship between primal and dual optimal solutions:

Theorem: Complementarity Slackness

Let $x$ be a feasible solution to the primal, and let $y$ be a feasible solution to the dual. Then $x$ and $y$ are both optimal if and only if:

$$\forall j:\ x_j > 0 \implies \sum_i A_{i,j} y_i = c_j \qquad \text{and} \qquad \forall i:\ y_i > 0 \implies \sum_j A_{i,j} x_j = b_i$$

Note that we could equivalently write $x_j \neq 0$ instead of $x_j > 0$, because we assume to have the constraint that $x_j \ge 0$ (similarly for $y_i$).

The “space” between the value of a constraint and its bound is what we call “slackness”. In this case, since the right-hand side of the iff says that the variables being positive implies that the constraints are met exactly (and aren’t just bounds), we have no slackness.

Proof

We can prove this complementarity slackness by applying strong duality to weak duality.

First, let’s prove the ($\Rightarrow$) direction. Let $x$ be the optimal primal solution and $y$ the optimal dual solution. From the proof of weak duality, we have:

$$b^T y \le x^T (A^T y) \le c^T x$$

Since we’re assuming that $x$ and $y$ are optimal solutions, we also have strong duality, which tells us that $b^T y = c^T x$.

With this in mind, all the inequalities above become equalities:

From this, we can quite trivially arrive at the following conclusion:

This means that as long as $x_j > 0$, we have $\sum_i A_{i,j} y_i = c_j$ (and symmetrically for $y_i > 0$).

Now, let’s prove the ($\Leftarrow$) direction. We know that:

Therefore, we can do just as in the proof of weak duality:

This is equivalent to $c^T x = b^T y$, which means that $x$ and $y$ are optimal solutions, by weak duality.

Duality of Min-Cost Perfect Bipartite Matching

  • Input: $G = (A \cup B, E)$, a bipartite weighted graph with edge costs $c$
  • Output: A perfect matching of minimum cost

We assume that $G$ is a complete bipartite graph, meaning that all possible edges exist. This is without loss of generality: missing edges in an incomplete graph can be treated as edges of infinite cost in the complete graph.

In the LP for this problem, we let $x_{ab}$ be a variable for every edge $(a, b)$, taking value 1 if the edge is picked in the matching, and 0 if not. The problem is then:

$$\min \sum_{(a,b)} c_{ab}\, x_{ab} \quad \text{subject to} \quad \sum_{b \in B} x_{ab} = 1 \ \forall a \in A, \quad \sum_{a \in A} x_{ab} = 1 \ \forall b \in B, \quad x_{ab} \ge 0$$

As we saw previously, any extreme point of this problem is integral, so we can solve the min-cost perfect matching by solving the above LP. But that is not the most efficient approach. Using duality, we can reduce the problem to that of finding a perfect matching in an unweighted graph, which we saw how to solve using augmenting paths.

To obtain the dual, we’ll first write the above LP in a form using inequalities, which gives us the following primal:

This notation is a little tedious, so we’ll introduce a variable for each constraint. For each $a \in A$, we introduce $u_a^+$ for the first constraint and $u_a^-$ for the second; similarly, for each $b \in B$, we introduce $v_b^+$ and $v_b^-$ for the third and fourth constraint respectively. These will play the same role as the dual variables, in that every constraint in the primal has a variable in the dual. The dual is thus:

By complementarity slackness, feasible solutions $x$ and $(u^\pm, v^\pm)$ are both optimal if and only if the following holds:

We also get a series of less interesting implications, which are trivially true as their right-hand side is always true (because the original constraints already specified equality):

For the following discussion, we’ll define $u_a := u_a^+ - u_a^-$ and $v_b := v_b^+ - v_b^-$.

Note that these are not constrained to be $\ge 0$. If we need to go back from the $(u, v)$ notation to the $(u^\pm, v^\pm)$ notation, we can define $u_a^+ = u_a$ and $u_a^- = 0$ when $u_a \ge 0$, and $u_a^+ = 0$ and $u_a^- = -u_a$ if not (equivalently for $v_b$).

We can simplify the notation of our dual LP to the equivalent formulation:

$$\max \sum_{a \in A} u_a + \sum_{b \in B} v_b \quad \text{subject to} \quad u_a + v_b \le c_{ab} \ \ \forall (a, b)$$

With this simplified notation, complementarity slackness gives us that feasible solutions $x$ and $(u, v)$ are both optimal if and only if the following holds: $x_{ab} > 0 \implies u_a + v_b = c_{ab}$.

In other words, we can summarize this as the following lemma:

Lemma: Min-cost perfect matching and dual solution

A perfect matching $M$ is of minimum cost if and only if there is a feasible dual solution $(u, v)$ such that $u_a + v_b = c_{ab}$ for every $(a, b) \in M$.

In other words, if we can find a vertex weight assignment such that every edge of the matching has the same cost as the sum of its vertex weights, we’ve found a min-cost perfect matching. This is an interesting insight that will lead us to the Hungarian algorithm.

Hungarian algorithm

The Hungarian algorithm finds a min-cost perfect matching. It works with the dual problem to try to construct a feasible primal solution.

Example

Consider the following bipartite graph:

(Figure: a bipartite graph with left vertices A, B, C and right vertices D, E, F, and edges A–D, B–D, B–E, C–D, C–F; B–E is the thick red edge.)

The thin black edges have cost 1, and the thick red edge has cost 2. The Hungarian algorithm uses the lemma from above to keep a dual solution that is feasible at all times. For any fixed dual solution, the lemma tells us that the perfect matching can only contain tight edges, which are edges for which $u_a + v_b = c_{ab}$.

The Hungarian algorithm initializes the vertex weights to the following trivial solution: $v_b = 0$ for all $b \in B$, and $u_a = \min_{b \in B} c_{ab}$ for all $a \in A$.

The right vertices get weight 0, and the left vertices get the weight of the smallest edge they’re incident to. The weights $u$ and $v$, and the set of tight edges, are displayed below.

(Figure: dual values u_A = u_B = u_C = 1 and v_D = v_E = v_F = 0; the tight edges are A–D, B–D, C–D and C–F, and E is isolated in this tight-edge graph.)

Then, we try to find a perfect matching in this graph using the augmenting path algorithm, for instance. However, this graph has no perfect matching (node E is disconnected, and A and B are both only connected to D). Still, we can use this fact to improve the dual solution $(u, v)$, using Hall’s theorem:

Theorem: Hall’s theorem

An $n$ by $n$ bipartite graph $G = (A \cup B, E)$ has a perfect matching if and only if $|S| \le |N(S)|$ for all $S \subseteq A$.

Here, $N(S)$ is the neighborhood of $S$. In the example above, we have no perfect matching, and we have $S = \{A, B\}$ and $N(S) = \{D\}$.

(Figure: the same tight-edge graph, with S = {A, B} and N(S) = {D} highlighted.)

This set $S$ is a certificate that can be used to update the dual solution. If we pick an $\varepsilon > 0$ (we’ll see which value to pick later), we can increase $u_a$ by $\varepsilon$ for all vertices $a \in S$, and decrease $v_b$ by $\varepsilon$ for all $b \in N(S)$. Let’s take a look at which edges remain tight:

  • Edges between $S$ and $N(S)$ remain tight, as $(u_a + \varepsilon) + (v_b - \varepsilon) = u_a + v_b = c_{ab}$
  • Edges between $A \setminus S$ and $B \setminus N(S)$ are unaffected and remain tight
  • Any tight edge between $A \setminus S$ and $N(S)$ will stop being tight
  • By definition of the neighborhood, there are no tight edges from $S$ to $B \setminus N(S)$

Because we’ve changed the set of tight edges, we’ve also changed our solution set to something we can maybe find an augmenting path in. For instance, picking $\varepsilon = 1$ in the graph above gives us a new set of tight edges:

(Figure: updated dual values u_A = u_B = 2, u_C = 1, v_D = −1, v_E = v_F = 0; the tight edges now include A–D, B–E and C–F, which form a perfect matching.)

The augmenting path algorithm can find a perfect matching in this graph, which is optimal by the lemma.

Algorithm

Initialize: $v_b = 0$ for all $b \in B$, and $u_a = \min_{b \in B} c_{ab}$ for all $a \in A$.

Iterate:

  • Consider $G_t = (A \cup B, E_t)$, where $E_t$ is the set of all tight edges
  • Find a maximum-cardinality matching in $G_t$.
    • If it is a perfect matching, we are done, by the lemma derived from complementarity slackness.
    • Otherwise, we can find a “certificate” set $S \subseteq A$ such that $|S| > |N(S)|$. This is guaranteed to exist by Hall’s theorem.
      • Update the weights by $\varepsilon$, as described below
      • Go to step 1

The weights are updated as follows: $u_a \leftarrow u_a + \varepsilon$ for all $a \in S$, and $v_b \leftarrow v_b - \varepsilon$ for all $b \in N(S)$.

This remains feasible. The dual objective value increases by $\varepsilon \cdot (|S| - |N(S)|)$; as $|S| > |N(S)|$ by Hall’s theorem, this quantity is positive: we only increase the value!

To get the maximal amount of growth in a single step, we should choose $\varepsilon$ as large as possible while keeping a feasible solution: $\varepsilon = \min_{a \in S,\ b \in B \setminus N(S)} (c_{ab} - u_a - v_b)$.

This algorithm can be implemented in $O(n^3)$, but is more easily implemented in $O(n^4)$.

Approximation algorithms

Many optimization problems are NP-hard. Unless P = NP, there is no algorithm for these problems that has all of the following three properties:

  1. Efficiency (polynomial time)
  2. Reliability (works for any input instance)
  3. Optimality (finds an optimal solution)

Properties 2 and 3 are related to correctness. Perhaps we could relax property 3 a little to obtain property 1? Doing so will lead us to approximation algorithms.

An $\alpha$-approximation algorithm is an algorithm that runs in polynomial time and outputs a solution $S$ such that (for a minimization problem):

$$\text{cost}(S) \le \alpha \cdot \text{cost}(\text{OPT})$$

Alternatively, for maximization problems:

$$\text{profit}(S) \ge \alpha \cdot \text{profit}(\text{OPT})$$

Here, “cost” and “profit” refer to the objective functions. We will have $\alpha \ge 1$ for minimization problems, and $\alpha \le 1$ for maximization. If we have $\alpha = 1$, then we have an exact algorithm (not really an approximation algorithm).

Giving a value for $\alpha$ can be hard: we don’t know the cost of the optimal solution, which is what we’re stuck trying to compute in the first place. Instead, for minimization problems, we can compare ourselves to a lower bound on the optimum, which gives us:

$$\text{cost}(S) \le \alpha \cdot (\text{lower bound on OPT}) \le \alpha \cdot \text{cost}(\text{OPT})$$

To get this bound, we typically proceed as follows:

  1. Give an exact formulation of the problem as an Integer Linear Program (ILP), usually with $x_i \in \{0, 1\}$.
  2. Relax the ILP to an LP with $x_i \in [0, 1]$
  3. Solve the LP to get an optimal solution $x^*_{\text{LP}}$, which is a lower bound (or upper bound for a maximization problem) on the optimal solution to the ILP, and thus also on the original problem.
  4. Somehow round $x^*_{\text{LP}}$ “without losing too much”, which will determine $\alpha$.

Vertex Cover for general graphs

As we saw previously, vertex cover can be formulated as the following LP:

$$\min \sum_{v \in V} w_v x_v \quad \text{subject to} \quad x_u + x_v \ge 1 \ \ \forall \{u, v\} \in E, \qquad x_v \ge 0$$

Letting $x_v \in \{0, 1\}$ gives us our ILP. If we relax this to $x_v \in [0, 1]$, we get our LP. We proved that this LP works for bipartite graphs (i.e. that any extreme point for bipartite graphs is integral), but we’re now considering the general case, in which we know we won’t always get integral solutions.

Therefore, to go from the LP back to the ILP, we must define a rounding scheme: given an optimal solution $x^*$ to the LP, we will return:

$$C = \left\{ v \in V : x_v^* \ge \tfrac{1}{2} \right\}$$

This is still a feasible solution, because for any edge $\{u, v\}$, we have $x_u^* + x_v^* \ge 1$, so at least one of $x_u^*$ and $x_v^*$ is $\ge \frac{1}{2}$.

We’ll now talk about the value of $\alpha$.

Claim

The weight of $C$ is at most twice the value of the optimal solution of vertex cover.

We prove the claim as follows (the first inequality holds because every $v \in C$ has $2 x_v^* \ge 1$):

$$w(C) = \sum_{v \in C} w_v \le \sum_{v \in V} w_v \cdot 2 x_v^* = 2 \cdot \text{OPT}_{\text{LP}} \le 2 \cdot \text{OPT}$$

Therefore, we have a 2-approximation algorithm for Vertex Cover.
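As an illustration (my own sketch, not part of the course), the whole relax-and-round recipe can be written in a few lines using scipy's LP solver; the vertex weights and edge list below are made-up inputs.

    import numpy as np
    from scipy.optimize import linprog

    def vertex_cover_2approx(weights, edges):
        """LP relax-and-round 2-approximation for weighted vertex cover.
        weights: array of vertex weights; edges: list of (u, v) index pairs."""
        n = len(weights)
        # Constraint x_u + x_v >= 1 written as -x_u - x_v <= -1 for linprog.
        A_ub = np.zeros((len(edges), n))
        for row, (u, v) in enumerate(edges):
            A_ub[row, u] = A_ub[row, v] = -1.0
        res = linprog(c=weights, A_ub=A_ub, b_ub=-np.ones(len(edges)),
                      bounds=[(0, 1)] * n, method="highs")
        # Rounding: keep every vertex whose fractional value is at least 1/2.
        return {v for v in range(n) if res.x[v] >= 0.5}

    print(vertex_cover_2approx(np.array([1.0, 2.0, 1.0, 3.0]),
                               [(0, 1), (1, 2), (2, 3), (3, 0)]))  # e.g. {0, 2}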

Note that the above LP formulation can be generalized to $k$-uniform hypergraphs12 by summing over all vertices of an edge in the constraints. In the rounding step, we select all vertices that have weight $\ge \frac{1}{k}$. The algorithm then becomes a $k$-approximation algorithm.

Integrality gap

Definition

As a recap, we’re talking about the following programs:

  • ILP: Integral LP, the original definition of the problem.
  • LP: Relaxation of the ILP. Its optimal solution has the lowest cost of all the programs; it’s lower or equal to that of the ILP because the LP allows variables to take on more values than the ILP ($[0, 1]$ instead of $\{0, 1\}$).
  • Approximation algorithm: Rounding of the LP. Rounding increases the cost by a factor $\alpha$.

The solutions we’re looking at are:

  • OPTLP: The optimal solution to the LP
  • OPT: The (hypothetical) optimal solution

Because the LP is a relaxation, it admits more solutions. So for a minimization problem, $\text{OPT}_{\text{LP}} \le \text{OPT}$.

The notion of integrality gap allows us to bound the power of our LP relaxation. It measures the “gap” in cost between the optimal solution to the LP ($\text{OPT}_{\text{LP}}$) and the hypothetical perfect, optimal solution (OPT).

Let $\mathcal{J}$ be the set of all instances of a given problem. For minimization problems, the integrality gap is defined as:

$$g = \max_{I \in \mathcal{J}} \frac{\text{OPT}(I)}{\text{OPT}_{\text{LP}}(I)}$$

This allows us to give some guarantees on bounds: for instance, suppose $g = 2$ and our LP found some value $\text{OPT}_{\text{LP}}$. Since the problem instance might have been the one maximizing the integrality gap, all we can guarantee is that $\text{OPT} \le 2 \cdot \text{OPT}_{\text{LP}}$. In this case, we cannot use this relaxation to build an approximation algorithm with a factor better than $g$.

Integrality gap of Vertex Cover

Claim: Vertex cover integrality gap

Let $g$ be the integrality gap for the Vertex Cover problem on graphs of $n$ vertices. Then $g \ge 2 - \frac{2}{n}$.

On a complete graph $K_n$ with $n$ vertices, we have $\text{OPT} = n - 1$, because if there are two vertices we don’t choose, the edge between them isn’t covered.

Assigning $x_v = \frac{1}{2}$ to every vertex is a feasible solution to the LP of cost $\frac{n}{2}$, so the LP optimum must be smaller or equal: $\text{OPT}_{\text{LP}} \le \frac{n}{2}$.

We can use these two facts to compute the integrality gap:

$$g \ge \frac{\text{OPT}(K_n)}{\text{OPT}_{\text{LP}}(K_n)} \ge \frac{n - 1}{n / 2} = 2 - \frac{2}{n}$$

Our 2-approximation algorithm for vertex cover implies that the integrality gap is at most 2, so we have $2 - \frac{2}{n} \le g \le 2$.

Set cover

Problem definition

Set cover is a generalization of vertex cover.

  • Input:
    • A universe $U$ of $n$ elements
    • A family $\mathcal{T} = \{S_1, \dots, S_m\}$ of subsets of $U$
    • A cost function $c: \mathcal{T} \to \mathbb{R}_+$
  • Output: A collection $C \subseteq \mathcal{T}$ of subsets of minimum cost that covers all elements

More formally, the constraint on $C$ is: $\bigcup_{S_i \in C} S_i = U$.

First attempt at an approximation algorithm

We can formulate this as an ILP with $n$ constraints and $m$ variables. We’ll let each $x_i$ be 1 if $S_i \in C$, and 0 otherwise. The LP relaxation is:

$$\min \sum_{i=1}^{m} c(S_i)\, x_i \quad \text{subject to} \quad \sum_{i : e \in S_i} x_i \ge 1 \ \ \forall e \in U, \qquad x_i \ge 0$$

As for the rounding, suppose that the element that belongs to the most sets is contained in $f$ sets; then every element belongs to at most $f$ sets. We’ll do the following rounding to ensure feasibility: pick every set $S_i$ with $x_i \ge \frac{1}{f}$.

Analyzing this in the same way as we did for Vertex Cover would show us that this is an approximation within a factor of $f$.

Better approximation algorithm

If we introduce some randomness and allow our algorithm to sometimes output non-feasible solutions, we can get much better results in expectation.

We’ll run the following algorithm:

  1. Solve the LP to get an optimal solution $x^*$
  2. Choose some positive integer constant $d$
  3. Start with an empty result set $C$ and run the following $d$ times:
    • For $i = 1, \dots, m$, add set $S_i$ to the solution $C$ with probability $x_i^*$ (choosing independently for each set).

The algorithm gives each set $d$ chances to be picked. This randomness introduces the possibility of two “bad events”, the probabilities of which we’ll study later:

  • The solution has too high cost
  • The solution is not feasible: there is at least one element that isn’t covered by $C$

As we’ll see in the two following claims, we can bound both of them.

Claim: Expected cost in one execution

The expected cost of all sets added in one execution of Step 3 is $\sum_{i=1}^{m} c(S_i)\, x_i^*$.

For each set $S_i$, we let $Y_i$ be a random indicator variable telling us whether $S_i$ was picked to be in $C$:

Unsurprisingly, the expected value of $Y_i$ is $\mathbb{E}[Y_i] = \Pr[S_i \text{ is picked}] = x_i^*$.

The expected cost is therefore:

$$\mathbb{E}\Big[\sum_i c(S_i)\, Y_i\Big] = \sum_i c(S_i)\, \mathbb{E}[Y_i] = \sum_i c(S_i)\, x_i^*$$

From this, we can immediately derive the following corollary:

Corollary: Expected cost in d executions

The expected cost of $C$ after $d$ executions of Step 3 is at most:

$$d \cdot \sum_i c(S_i)\, x_i^* \le d \cdot \text{OPT}$$

Note that $\sum_i c(S_i)\, x_i^* \le \text{OPT}$ because the LP is a relaxation of the ILP, so its optimum can only be better (i.e. lower for minimization problems like this one). This gives us some probabilistic bound on the first “bad event” of the cost being high.

We also need to look into the other “bad event” of the solution not being feasible:

Claim: Probability of unsatisfied constraint

The probability that a constraint is unsatisfied after a single execution is at most $\frac{1}{e}$.

A constraint is unsatisfied if its element $e$ is not covered by $C$. Suppose that the unsatisfied constraint contains $k$ variables:

Let $S_1, \dots, S_k$ be the sets covering $e$. Then:

$$\Pr[e \text{ not covered}] = \prod_{i=1}^{k} (1 - x_i^*) \le \prod_{i=1}^{k} e^{-x_i^*} = e^{-\sum_i x_i^*} \le e^{-1}$$

Note that:

  • The second step uses the independence of the picks; while their probabilities are linked, each set’s inclusion in $C$ is determined by an independent “coin flip”.
  • The first inequality step uses $1 - x \le e^{-x}$. This is a small lemma that follows from the Taylor expansion (it’s good to know, but it’s not vital to do so).
  • The second inequality uses the constraints of the problem, which tell us that $\sum_{i : e \in S_i} x_i^* \ge 1$.

So we have probability $\frac{1}{e} \approx 0.37$ of leaving a constraint unsatisfied. This is pretty bad!

To get a better bound, we could have our algorithm pick set with probability , which would make , and the probability would be upper-bounded by .

Another technique to “boost” the probability of having a feasible solution is to run the loop $d$ times instead of once. In the following, we’ll pick $d = \Theta(\ln n)$ because it will enable us to get nice closed forms. But these proofs could be done for any other choice of $d$.

Claim: Probability of feasible solution

The output after executions is a feasible solution with probability at least:

The probability that a given constraint is unsatisfied after $d$ executions of step 3 is at most $\left(\frac{1}{e}\right)^d = e^{-d}$.

If we pick , we get:

By union bound, we get that the probability of any constraint being unsatisfied is at most:

At this point, we have an idea of the probabilities of the two “bad events”: we have an expected value of the cost, and a bound of the probability of the solution not being feasible. Still, there might be a bad correlation between the two: maybe the feasible outputs have very high cost? Maybe all infeasible solutions have low cost? The following claim deals with that worry.

Claim: Probability of both good events

The algorithm outputs a feasible solution of cost at most $4d \cdot \text{OPT}$ with probability greater than $\frac{1}{2}$.

Let $\mu$ be the expected cost, which by the corollary is at most $d \cdot \text{OPT}$. We can upper-bound the “bad event” of the cost being very bad: by Markov’s inequality we have $\Pr[\text{cost} \ge 4\mu] \le \frac{1}{4}$. We chose a factor of 4 here because this will give us a nice bound later on; we could pick any number to obtain another bound. We can also upper-bound the “bad event” of the solution being infeasible, which we know (thanks to the previous claim) to be small after $d$ iterations. By union bound, the probability that no bad event happens is at least $1 - \frac{1}{4} - \Pr[\text{infeasible}]$. Supposing $n$ is large enough, this probability is indeed greater than $\frac{1}{2}$.

This claim tells us that choosing $d = \Theta(\log n)$, we have an $O(\log n)$-approximation algorithm for the set cover problem.
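A small sketch of the randomized rounding step (my own illustration; it assumes the LP has already been solved and that its values x_star are given, and the inputs and the choice of d are made up):

    import random

    def randomized_round_set_cover(sets, x_star, universe, d):
        """sets: list of Python sets; x_star: fractional LP values, one per set;
        d: number of rounding passes. Returns the picked indices and whether
        they happen to cover the universe (the output may be infeasible)."""
        picked = set()
        for _ in range(d):
            for i, xi in enumerate(x_star):
                if random.random() < xi:      # include set i with probability x*_i
                    picked.add(i)
        covered = set().union(*[sets[i] for i in picked])
        return picked, covered >= universe

    sets = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
    x_star = [0.5, 0.5, 0.5, 0.5]             # a feasible fractional solution
    print(randomized_round_set_cover(sets, x_star, {1, 2, 3, 4}, d=3))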

Multiplicative Weights Algorithm

Warm-up

Suppose that you want to invest money on the stock market. To help you invest this money, you listen to experts. Every day, each of them tells you whether the stock will go up or down.

Day Expert 1 Expert 2 Expert 3 Expert 4 Actual
1 📈 📈 📈 📈 📈
2 📈 📈 📉 📈 📈
3 📈 📉 📈 📉 📉

All the experts were right on day 1, but expert 3 was wrong on day 2, and experts 1 and 3 were wrong on day 3.

Your goal is to devise a strategy that allows you to do as well as the best expert. For this warm-up example, we’ll work under the assumption that there is a “perfect expert” who is never wrong.

Our strategy is then the following. At day $t$, let $S_t$ be the set of experts who have been right so far. We will follow the majority in $S_t$. This ensures that we’ll be wrong at most $\log_2 n$ times, where $n$ is the number of experts (we follow the majority, so the pool of trusted experts is at least halved whenever we make a mistake).

Formalization of the problem

We’ll generalize and formalize the problem we saw above. The following game is played between an omniscient Adversary (the stock market) and an Aggregator (you), who is advised by $n$ experts.

For $t = 1, \dots, T$ (days on the stock market), each expert advises “yes” or “no” (“up” or “down”). The Aggregator predicts either yes or no. The Adversary, with knowledge of the Aggregator’s prediction and of the expert advice, decides the yes-or-no outcome. The Aggregator observes the outcome, and suffers some cost if his prediction was incorrect.

Note that the Adversary’s role is not to make the Aggregator be wrong all the time, which would be easy: seeing that it’s omniscient, the Adversary could just observe the opposite outcome of what the Aggregator predicted. Instead, we can consider that the Adversary’s role is to make the Aggregator look as bad as possible compared to the best expert.

The Aggregator’s role is to make as few mistakes as possible, but since the experts may be unhelpful and the outcomes can be bad, the Aggregator can only hope for a performance guarantee relative to the best expert, in hindsight. To do so, it can track which experts are helpful; we’ll see some good strategies for this.

The number of mistakes in excess of the best expert is called (external) regret.

Weighted Majority (WM)

In a world where experts can be wrong (yes, really), we can somewhat “soften” our strategy. Instead of discarding them completely on their first mistake, we can just discount their advice. This leads us to the Weighted Majority algorithm, which is defined as follows.

We begin by assigning a weight $w_i$ to each expert $i$, initialized at 1. Thereafter, for each day $t$:

  • Aggregator predicts yes or no based on a majority vote, weighted by the $w_i$
  • After observing the outcome, for every mistaken expert $i$, set $w_i \leftarrow w_i / 2$

We chose to update the experts’ weights by a factor $\frac{1}{2}$, which leads us to some fast learning (maybe too fast!).
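A compact sketch of this rule (my own; the halving factor is hard-coded and the advice/outcome data are just placeholders, encoded as 1 = "up" and 0 = "down"):

    def weighted_majority(advice, outcomes):
        """advice: list of days, each a list of 0/1 predictions (one per expert);
        outcomes: list of true 0/1 outcomes. Returns the number of mistakes."""
        n = len(advice[0])
        w = [1.0] * n
        mistakes = 0
        for preds, truth in zip(advice, outcomes):
            # Weighted vote: compare the weight behind "1" to half the total weight.
            yes = sum(wi for wi, p in zip(w, preds) if p == 1)
            guess = 1 if yes >= sum(w) / 2 else 0
            mistakes += (guess != truth)
            # Halve the weight of every expert that was wrong today.
            w = [wi / 2 if p != truth else wi for wi, p in zip(w, preds)]
        return mistakes

    advice = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0]]  # the table above
    outcomes = [1, 1, 0]
    print(weighted_majority(advice, outcomes))  # 0 mistakes on this tiny example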

Theorem: 16.1

For any sequence of outcomes, any duration $T$ and any expert $i$:

$$\#(\text{mistakes of WM}) \le 2.41 \cdot \big( \#(\text{mistakes of expert } i) + \log_2 n \big)$$

The 2.41 seems very arbitrary, but we’ll show where it comes from.

Let $i$ be any expert. Let the “potential function” $\Phi^{(t)}$ be the total weight at day $t$: $\Phi^{(t)} = \sum_{j=1}^{n} w_j^{(t)}$.

Our strategy will be to bound $\Phi^{(T+1)}$ from below with expert $i$’s mistakes, and from above with WM’s mistakes.

Lower bound: We can observe that:

$$\Phi^{(T+1)} \ge w_i^{(T+1)} = \left(\tfrac{1}{2}\right)^{m_i}$$

where $m_i$ is the number of mistakes of expert $i$. The inequality stems from the fact that all weights are always positive.

Upper bound: let’s start by observing that $\Phi^{(1)} = n$, as all weights are initialized to 1. Additionally, whenever WM errs, we halve the weights of experts representing at least half of the total weight (since we follow the weighted majority). This means that we lose at least $\frac{1}{4}$ of the total weight: $\Phi^{(t+1)} \le \frac{3}{4} \Phi^{(t)}$ whenever WM makes a mistake.

This implies that we can bound the final value of the potential function as follows, where $M$ is the number of mistakes made by WM:

$$\Phi^{(T+1)} \le n \left(\tfrac{3}{4}\right)^{M}$$

Combining these bounds together, we get:

$$\left(\tfrac{1}{2}\right)^{m_i} \le \Phi^{(T+1)} \le n \left(\tfrac{3}{4}\right)^{M}$$

Taking the $\log_2$ of both bounds yields:

$$-m_i \le \log_2 n + M \log_2 \tfrac{3}{4}$$

So finally:

$$M \le \frac{m_i + \log_2 n}{\log_2 \frac{4}{3}}$$

We have $\frac{1}{\log_2 (4/3)} \approx 2.41$, which proves the theorem.

The 2.41 constant is a consequence of our arbitrary choice to halve the weights; if instead we choose to divide by a factor $(1 + \varepsilon)$ in the update rule:

Then we achieve a more general formulation:

Generalized Game with Randomized Strategies

We now allow the Aggregator to play a random strategy instead of always making deterministic predictions. Randomization is in fact often a good strategy to limit the effect of adversaries.

In this variation of the problem, we still have days, on each of which all experts give their advice. Previously, the answer could only be one of two options (up/down or yes/no), but we will generalize this to any number of possible options.

The Aggregator (whom we’ll now call the Allocator) picks some distribution $p^{(t)} = (p_1^{(t)}, \dots, p_n^{(t)})$ over the experts. Now, $p_i^{(t)}$ represents the probability of following expert $i$’s advice.

The Adversary is still omniscient: with knowledge of the expert advice and of $p^{(t)}$, it determines the cost vector $m^{(t)} \in [-1, 1]^n$. The intended meaning of $m_i^{(t)}$ is the cost13 of following expert $i$’s advice on day $t$.

The expected cost is therefore $\sum_{i=1}^{n} p_i^{(t)} m_i^{(t)} = p^{(t)} \cdot m^{(t)}$.

Hedge Strategy

The Hedge strategy operates in the problem setting described above. It is parametrized by a learning parameter $\varepsilon > 0$.

  • Initially, set all expert weights $w_i^{(1)}$ to 1.
  • For each day $t$:
    • Pick the distribution $p_i^{(t)} = \frac{w_i^{(t)}}{\Phi^{(t)}}$, where $\Phi^{(t)} = \sum_j w_j^{(t)}$ is the potential function as defined previously.
    • After observing the cost vector $m^{(t)}$, set $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\varepsilon m_i^{(t)}}$

Note that:

  • $w_i^{(t+1)} > w_i^{(t)}$ if $m_i^{(t)} < 0$
  • $w_i^{(t+1)} < w_i^{(t)}$ if $m_i^{(t)} > 0$

This means that the weights increase when it was profitable to follow the expert, and decrease when it wasn’t.
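A minimal sketch of these updates (my own; the cost vectors below are made-up and eps stands for the learning parameter ε):

    import math

    def hedge(costs, eps):
        """costs: list of days, each a list of per-expert costs in [-1, 1].
        Returns the total expected cost incurred by Hedge."""
        n = len(costs[0])
        w = [1.0] * n
        total = 0.0
        for m in costs:
            phi = sum(w)
            p = [wi / phi for wi in w]                     # distribution p^(t)
            total += sum(pi * mi for pi, mi in zip(p, m))  # expected cost p . m
            w = [wi * math.exp(-eps * mi) for wi, mi in zip(w, m)]
        return total

    costs = [[1, -1, 0], [1, 1, -1], [-1, 1, 1]]
    print(hedge(costs, eps=0.5))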

An alternative strategy is the Multiplicative Weights Update (MWU) strategy, which uses $w_i^{(t+1)} = w_i^{(t)} (1 - \varepsilon m_i^{(t)})$ instead. This is just as good, as it leads us to the same guarantee:

Theorem

Suppose $\varepsilon \le 1$, and $p^{(t)}$ is chosen by Hedge for $t \in [T]$. Then for any expert $i$:

$$\sum_{t=1}^{T} p^{(t)} \cdot m^{(t)} \le \sum_{t=1}^{T} m_i^{(t)} + \frac{\ln n}{\varepsilon} + \varepsilon T$$

Note that this inequality holds for any expert , and in particular, for the best expert. Let’s take a look at the terms in this inequality:

This means that Hedge does as well as the best expert, within a small additive term, which is the external regret. Note that this small error is minimized to $2\sqrt{T \ln n}$ when $\varepsilon = \sqrt{\ln(n)/T}$.

Let’s prove this theorem. Like before, our proof strategy will be to upper bound and lower bound the potential function .

Lower bound: We can lower-bound the final potential as before:

$$\Phi^{(T+1)} \ge w_i^{(T+1)} = e^{-\varepsilon \sum_{t=1}^{T} m_i^{(t)}}$$

Upper bound: the proof for the upper bound will be somewhat longer. Let’s start by bounding the individual update rule of the potential function:

This step uses the Taylor expansion of $e^x$:

For $x \in [-1, 1]$, we have that $e^x \le 1 + x + x^2$. We can use this fact to our advantage in the following, as the exponentiated term is in $[-1, 1]$ (because we supposed $\varepsilon \le 1$ in the theorem).

Step uses the fact that as by definition.

The next step uses $p_i^{(t)} = \frac{w_i^{(t)}}{\Phi^{(t)}}$, which stems from the definition of the Hedge strategy.

The last step uses the same Taylor expansion, but in the other direction, using the fact that $1 + x \le e^x$.

This gives us an upper bound on $\Phi^{(t+1)}$ in terms of $\Phi^{(t)}$. We can use this bound repeatedly to find a bound for $\Phi^{(T+1)}$:

The last step uses the fact that all weights were initialized to 1, and thus that $\Phi^{(1)} = n$. Putting the bounds together, we get:

Taking the natural logarithm on both sides of the bound yields:

We get to the final result by dividing by $\varepsilon$ and rearranging the terms.

From this, we can infer a corollary that will be useful for solving covering LPs. For these problems, it’s useful to consider the average cost incurred per day. We will also generalize the cost vector so that it can take values in $[-\rho, \rho]$ instead of just $[-1, 1]$; the value $\rho$ is called the width.

Corollary: Long-term average regret

Suppose $\varepsilon \le 1$. For $t \in [T]$, let $p^{(t)}$ be picked by Hedge, and assume the cost vectors are $m^{(t)} \in [-\rho, \rho]^n$.

If $T \ge \frac{4 \rho^2 \ln n}{\varepsilon^2}$, then for any expert $i$:

$$\frac{1}{T} \sum_{t=1}^{T} p^{(t)} \cdot m^{(t)} \le \frac{1}{T} \sum_{t=1}^{T} m_i^{(t)} + 2\varepsilon$$

This tells us that the average daily performance is as good as the best expert’s average daily performance, within a small additive term $2\varepsilon$. The “average regret” is the difference between the algorithm’s and the expert’s average daily performances.

Covering LPs

Covering LPs are defined as follows:

Definition

A linear program of the form:

$$\min \sum_j c_j x_j \quad \text{subject to} \quad \sum_j A_{i,j} x_j \ge b_i \ \ \forall i, \qquad x \in [0, 1]^n$$

is a covering LP if all coefficients of the constraints and objective function are non-negative: $A_{i,j} \ge 0$, $b_i \ge 0$ and $c_j \ge 0$.

Note that covering LPs require the variables to be in $[0, 1]$ (whereas the more general form of LPs simply requires $x \ge 0$). Vertex cover and set cover, for instance, satisfy this requirement.

Hedging for LPs

Example

An example of a covering LP is:

This LP has two constraints: we could simplify it by adding up those constraints into one, with weights , which would give us the following LP:

This has an optimal solution of , but that doesn’t work in the original LP. In general, any solution that is valid in the original LP will be valid in the simplified summed LP, but that implication does not hold the other way around. This means that the cost of the simplified LP is lower (as it allows for more solutions).

To remedy this, we can perhaps update the weights: we need to increase the weight of the unsatisfied constraint, and decrease the weight of the satisfied one. This will lead us to using the Hedge strategy to solve LPs.

General idea

When using Hedge for LPs, the “experts” are the constraints of the LP. Hedge maintains a weight distribution over the constraints, and iteratively updates those weights based on the cost function at each step.

The “simplified problem” as we defined it above can more strictly be defined as:

$$\min \sum_j c_j x_j \quad \text{subject to} \quad \sum_i p_i \Big( \sum_j A_{i,j} x_j \Big) \ge \sum_i p_i b_i, \qquad x \in [0, 1]^n$$

The oracle we need takes a probability distribution $p$ over the experts (i.e. $p_i \ge 0$ and $\sum_i p_i = 1$), and outputs an optimal solution to the above “simplified problem”.

Hedge Algorithm for Covering Linear Programs

The Hedge Algorithm plays the role of the Aggregator/Allocator, determining the distribution $p^{(t)}$, but also that of the Adversary, determining the cost vector $m^{(t)}$ using the result of the oracle.

Initially, we assign each constraint $i$ a weight $w_i^{(1)} = 1$.

Then, for each step $t$, we pick the distribution $p^{(t)}$ as follows: $p_i^{(t)} = \frac{w_i^{(t)}}{\sum_j w_j^{(t)}}$.

Now, we need to play the role of the Adversary. We let $x^{(t)}$ be the solution returned by the oracle on the LP created using the convex combination $p^{(t)}$ of constraints. As we said above, the optimal cost of the “simplified” LP is at most that of the original LP. We can define the cost of constraint $i$ as:

$$m_i^{(t)} = \sum_j A_{i,j} x_j^{(t)} - b_i$$

We have a positive cost if the constraint is satisfied (so the weight will be decreased by Hedge), and a negative cost if not (which increases the weight).

After observing the cost, we can go back to playing the Allocator and update the weights as always in Hedge: $w_i^{(t+1)} = w_i^{(t)} e^{-\varepsilon m_i^{(t)}}$.

After $T$ steps, the algorithm should output the average of the constructed solutions: $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x^{(t)}$.
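A sketch of the whole loop on a tiny covering LP (my own illustration; it uses scipy's LP solver as the oracle for the one-constraint "simplified problem", and A, b, c, T and eps are made-up inputs):

    import numpy as np
    from scipy.optimize import linprog

    def hedge_covering_lp(A, b, c, T, eps):
        """Approximately solve min c.x s.t. A x >= b, 0 <= x <= 1 by running
        Hedge over the constraints and averaging the oracle's answers."""
        m, n = A.shape
        w = np.ones(m)
        xs = []
        for _ in range(T):
            p = w / w.sum()                     # distribution over constraints
            # Oracle: optimum of the single combined constraint p^T A x >= p^T b.
            res = linprog(c=c, A_ub=-(p @ A).reshape(1, n), b_ub=[-(p @ b)],
                          bounds=[(0, 1)] * n, method="highs")
            x = res.x
            xs.append(x)
            cost = A @ x - b                    # m_i: positive iff constraint i is satisfied
            w = w * np.exp(-eps * cost)         # Hedge update
        return np.mean(xs, axis=0)

    A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    b = np.array([0.5, 0.5, 1.5])
    c = np.array([1.0, 2.0])
    print(hedge_covering_lp(A, b, c, T=200, eps=0.1))
    # approximately feasible, with cost at most the true LP optimum (2.0)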

Analysis

Let’s now analyze how we should pick and see the properties of the algorithm.

First off, since we proved that Hedge works for an adversarial, omniscient construction of the cost vectors, it will definitely work for this construction.

Let’s define $\rho$ to be a bound on the costs, basically answering “how big can $|m_i^{(t)}|$ be?”. Let $\rho \ge \max_{i, t} |m_i^{(t)}|$.

This means that $m^{(t)} \in [-\rho, \rho]^n$ for all $t$. By the corollary, for $\varepsilon \le 1$ and $T \ge \frac{4\rho^2 \ln n}{\varepsilon^2}$, and for any constraint $i$, we have:

Let’s consider the sum in the left-hand side of this expression, corresponding to the “final performance”:

Note that $p^{(t)} \cdot m^{(t)} = \sum_i p_i^{(t)} \big( \sum_j A_{i,j} x_j^{(t)} - b_i \big)$ is non-negative, since we’re working with a covering LP and because the oracle only outputs solutions $x^{(t)}$ that are feasible for the combined constraint; this allows us to conclude that the whole left-hand side is non-negative. The inequality we derived from the corollary is therefore:

Rearranging the terms, we get that:

This implies that for every constraint $i$: $\sum_j A_{i,j} \bar{x}_j \ge b_i - 2\varepsilon$.

This means that the solution $\bar{x}$ is always almost feasible: it might violate some constraints by up to $2\varepsilon$.

The cost $c^T \bar{x}$ is at most the cost of an optimal solution to the original LP, since each solution to the “simplified LP” has cost at most that of the original LP’s optimum.

How do we set the right parameters? Let’s look at the set cover LP relaxation. We have that $|m_i^{(t)}| \le n$, since the LP uses $x \in [0, 1]^n$ and every coefficient $A_{i,j}$ is either 0 or 1. Therefore, if we define $\rho = n$ as above, this holds. We can then set $\varepsilon$ appropriately14: this gives us an “almost feasible” solution $\bar{x}$:

We can obtain a feasible (almost optimal) solution by using:

Simplex method

The simplex method is an algorithm to solve LPs. Its running time is exponential in the worst case, but it runs very fast in practice. This makes it a widely used algorithm, more so than other polynomial-time algorithms that are often slower in practice.

A small overview of the steps of the simplex method is given below. This is a very rough overview; the lecture notes go over example runs of the simplex method, so check those for the details of how to run simplex.

  1. Rewrite the constraints as equality, introducing “slack variables”
  2. Set to be equal to the objective function
  3. Set , so ,
  4. Maintain a simplex tableau with non-zero variables to the left of the equality, and the rest to the right.
  5. Select a variable ( or ) with positive weight (or negative if we’re minimizing) in the formulation of to increase. Say we chose . Increase it as much as the constraints will allow.
  6. Compensate by modifying the value of all left-hand side constraints in which appears.
  7. Pivot15: in the constraint that dictated the new value of , swap the left-hand side (call it ) with : .
  8. Rewrite the constraints to use the newly determined formulation of . This will also update .
  9. Go to step 5; stop when we can no longer increase any variables.

A few problems can arise when trying to apply the simplex method, which we’ll discuss here:

  • Unboundedness: If the polytope defined by the given constraints is not bounded, the optimal solution is infinite.
  • Degeneracy: it might happen that we cannot increase any variable but still need to pivot two of them to proceed. This doesn’t mean that the simplex method won’t terminate, but this is an edge case that must be handled. One possible solution is to use a lexicographic ordering of variables.
  • Initial vertex: Sometimes, it’s easy to find that is a feasible solution, but other times it may be harder. To find a feasible starting point, we may have to solve another LP to find a starting point (Phase I), and then do what we described above to solve the original LP (Phase II).
  • Infeasibility: If the constraints define an empty polytope, there is no feasible starting point; Phase I will detect this.

Randomized Algorithms

As before, when we discussed the possible “bad events” in approximation algorithms, there are two kinds of claims that we want to make about randomized algorithms:

  • It gives (close to) correct solutions with good probability
  • The expected cost (value of the objective function) is close to the cost of the optimal solution.

If we have an algorithm that generates the correct (optimal) solution with probability , we can obtain an algorithm that succeeds with high probability, even when is low. The trick is to “boost” the probabilities by running the algorithm times: the probability that one run succeeds is .

For small values of , we’ll generally use the approximation of for . It follows that one of the repeated runs will find the correct (optimal) solution with probability at least . We can pick so that this number is close to 1.

Minimum cut problem

  • Input: An undirected graph with vertices
  • Output: An edge set that is a min-cut of the graph

To define this min-cut more formally, we’ll need to introduce , the set of edges that have exactly one endpoint in :

In the min-cut problem, we’re looking for the partition of where the two parts are joined together with the minimum number of edges:

Karger’s algorithm

Algorithm

Karger’s algorithm solves the min-cut problem in . A single run, as defined below, is , but as we’ll see later, we may have to run it times to find a min-cut with high probability.

def karger(G, n):
    """
    Input: Graph G = (V, E) of size n
    Output: Size of the min-cut
    """
    for i in range(n - 2):  # n - 2 contractions leave exactly two super-nodes
        choose edge (u, v) uniformly at random
        contract it
    size_min_cut = number of edges between the two final super-nodes
    return size_min_cut

Let’s define how the contracting procedure happens. Consider the following graph:

Example graph with vertices a, b, c, d and edges a-b, b-c, c-d, d-a (some of them parallel edges)

If we contract edge , we get the following graph:

The same graph after contracting a-b: super-node ab, vertices c and d, with edges ab-c (parallel edges), c-d (parallel edges) and d-ab

We’ve created a new super-node . This reduces the total number of nodes in the graph by 1. Parallel edges are kept when contracting, but the self-loops created by the contraction are removed.
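As an illustration of the contraction step (a sketch, not the notes’ implementation), contraction can be carried out on an edge list by relabeling one endpoint; parallel edges are kept, and self-loops created by the merge are dropped:

import random

def contract(edges, u, v):
    """Merge super-node v into u: keep parallel edges, drop new self-loops."""
    merged = []
    for (a, b) in edges:
        a2 = u if a == v else a
        b2 = u if b == v else b
        if a2 != b2:                 # self-loop created by the merge: discard
            merged.append((a2, b2))
    return merged

def karger_run(vertices, edges):
    """One run of Karger: contract random edges until two super-nodes remain."""
    vertices = set(vertices)
    while len(vertices) > 2:
        u, v = random.choice(edges)  # uniform over the remaining (multi-)edges
        edges = contract(edges, u, v)
        vertices.discard(v)
    return len(edges)                # number of edges crossing the final cut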

Analysis

Let be the optimal minimum cut of , of size . We’ll work our way towards analyzing the probability that the algorithm finds the optimal cut.

First, let’s take a look at the probability that a single edge is in the optimal min-cut. This corresponds to the probability that the algorithm chose “the wrong edge” when picking uniformly at random: if it contracts an edge that should have been in , then it will not output the optimal solution.

Claim: Probability of picking an edge in the cut

The probability that a uniformly random edge is in is at most

Consider the first contraction of the algorithm (this is without loss of generality). Let be a uniformly random edge in :

We do not know in advance what the value of is, so we’ll upper-bound by lower-bounding .

Let be the degree of a vertex . The handshaking lemma tells us that:

If the min-cut is of size , we can lower bound by . This is because if the size of the min-cut is , no vertex can have degree less than ; otherwise, we’d have chosen and achieved a smaller min-cut. Therefore, .

It follows then that:

With this, we’ve bounded our probability:

In the following, we’ll need to use the following property of the algorithm:

Claim: Min-cut size does not decrease

For any graph , when we contract an edge , the size of the minimum cut does not decrease.

We won’t prove this, but it seems intuitive. If is a version of where has been contracted, then .

Now, let’s try to analyze the probability that the full algorithm returns the correct solution:

Theorem: Karger probability of success

For any graph with nodes and a min-cut , Karger’s algorithm returns with probability at least

Let be the probability that the edge picked in step of the loop is not in . The algorithm succeeds in finding the min-cut if all occur. By the chain rule, the probability of this is:

By the two previous claims, seeing that there are at most edges to choose from at step , then for all :

Therefore:

Step uses the inequality we defined above.

Step rewrites as .

Step uses the structure of these fractions to cancel out terms (notice how the numerator appears in the denominator two fractions later).

Finally, uses the definition of the binomial coefficient.

This leads us to the following corollary:

Corollary: Maximum number of min-cuts

Any graph has at most min-cuts.

The proof for this is short and sweet: suppose that there is a graph that has more than min-cuts. Then, for one of those cuts, the algorithm would find that exact cut with probability less than , which is a contradiction.

An example of a graph with min-cuts is a cycle. We can choose any two vertices, and define a min-cut of size 2 by letting be all the nodes between them.

Boosting the probability of success

The algorithm runs in , but the probability of success is , which is . As we said above, we can “boost” this probability of success to by running the same algorithm a bunch of times. If we run it times, we can get a probability of success of :
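A sketch of this boosting step, reusing the hypothetical karger_run helper from the contraction example above and repeating on the order of n² log n times:

import math

def boosted_karger(vertices, edges):
    n = len(vertices)
    repetitions = n * n * int(math.log(n) + 1)   # on the order of n^2 log n runs
    best = float("inf")
    for _ in range(repetitions):
        best = min(best, karger_run(list(vertices), list(edges)))
    return best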

Karger-Stein’s algorithm

Karger’s algorithm only fails when it contracts an edge of the min-cut. In the beginning, that probability is , and towards the end it goes up to a constant; in the very last step, the probability that we contract an edge of the min-cut is .

This means that we’ll often run the first steps correctly, only to make mistakes towards the very end. Karger-Stein fixes this by running more copies as the graph gets smaller.

def karger_stein(G, n):
    """
    Input: Graph G = (V, E) of size n
    Output: (min cut, size)
    """
    if n == 2:
        return E, len(E)
    for i in range(n - ceil(n / sqrt(2))):  # contract down to about n/sqrt(2) super-nodes
        choose edge (u, v) uniformly at random
        contract it
    let G2 be the contracted graph of size m
    E1, n1 = karger_stein(G2, m)
    E2, n2 = karger_stein(G2, m)
    return whichever of (E1, n1) and (E2, n2) has the smaller cut size

To analyze the running time, let be the time that it takes to compute the min-cut of a graph of size :

Using the master theorem to solve this recursion, we find that the algorithm runs in .

Theorem: Karger-Stein probability of success

The Karger-Stein algorithm finds a min-cut with probability at least .

Suppose we have a graph , which is a version of on which some contractions have been made. Suppose it has vertices, and let be a min-cut of . Just as in the proof of success probability of Karger, the probability of not picking an edge in in any of the steps of the loop is:

As before, the terms of the fraction cancel out, and we’re left with two terms in the numerator, and two terms in the denominator, hence the above conclusion.

As long as we don’t contract an edge of in the loop, we’re guaranteed success. We’ll prove inductively that the probability of success is .

The algorithm calls itself recursively with , the graph resulting from our contractions on . Let be the probability of success on the recursive call on . Our induction hypothesis is that the algorithm succeeds on with probability .

The probability that either (or both) of the recursive calls succeeds is (because , and because the two events are independent).

The probability of the algorithm succeeding in is therefore that of not picking any edge in , and of either of the recursive calls succeeding:

To prove this theorem, we must show:

Using the induction hypothesis, we have that in the worst case. Plugging this into the above leaves us to prove:

Multiplying by 2 and rearranging the terms, we get:

However, we can observe that:

as desired.

We can run independent copies of the algorithm to achieve a probability of success, in a total runtime of .

Polynomial identity testing

Suppose that we’re given two polynomials and of degree . We only have access to an oracle that allows us to evaluate the polynomial, but not to see its definition. We’d like to know if and are identical, i.e. whether for all inputs .

Schwartz-Zippel lemma

For a polynomial of a single variable, it’s quite easy to test whether it is zero. By the fundamental theorem of algebra, a degree polynomial of a single variable has at most real roots. We can therefore evaluate it at points: the polynomial is zero if all evaluations return 0.
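As a tiny sketch of this univariate test, assuming the polynomial is only available as an evaluation oracle p and its degree bound d is known:

def is_zero_univariate(p, d):
    """A nonzero polynomial of degree at most d has at most d roots,
    so evaluating at d + 1 distinct points decides whether p is identically zero."""
    return all(p(x) == 0 for x in range(d + 1))

# Example: (x - 1)(x - 2) is not identically zero
print(is_zero_univariate(lambda x: (x - 1) * (x - 2), 2))   # False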

For the multivariate case, things aren’t so simple. In fact, a multivariate polynomial may have infinitely many roots. However, not all hope is lost. The Schwartz-Zippel lemma tells us:

Lemma: Schwartz-Zippel

Let be a nonzero polynomial of variables with degree . Let be a finite set, with at least elements in it. If we assign values from independently and uniformly at random, then:

Proof for one dimension

Let’s start by proving this for . We want to know whether for all inputs . As we saw above, this is equivalent to answering whether . When the polynomial is one-dimensional, we can write it as follows:

If then . Suppose is a non-zero polynomial; we can then write the probability of it evaluating to zero as:

Writing the probability in this form is called the principle of deferred decision. We set all the except arbitrarily; this means that we can now see the sum in the last line as a constant . What’s left is the following:

There are choices for , and at most one satisfies , so the final result is:

General proof

We proceed by strong induction.

Base case: , which is the univariate case, which we talked about above. The fundamental theorem of algebra proves this case.

Inductive step: Suppose that the lemma holds for any polynomial with less than variables. We need to show that it also holds for .

Fix arbitrarily. All values in are known except , so becomes a univariate polynomial of degree .

We’ve reduced it to the univariate case again, but this is actually not enough. An adversary could select such that the resulting polynomial is 0, despite being nonzero. To salvage the argument, we must prove that this adversarial scenario is rare. We’ll need to make use of long division for this.

Definition: Long division for polynomials

Let be a polynomial of degree and be the polynomial with degree . Then we can write as:

Where the quotient has degree at most , the remainder has degree at most . The polynomial is the divisor.

Let be the largest degree of in all monomials in . This means can be “long divided” by as follows:

Using the principle of deferred decision, we assign values to uniformly at random from . We save the randomness of for later use. Under our induction hypothesis, we have:

If is nonzero, then it follows from the formulation of the long division that is a univariate polynomial in , as the coefficient of is nonzero. We therefore have:

Using these two probabilities, by Bayes rule, we have:

Note that in the second step, we don’t need to know , we just upper-bound it by 1.

Matrix identity

We can use Schwartz-Zippel for identity testing of matrices too; suppose we are given three matrices . We’d like to test whether . Matrix multiplication is expensive ( or slightly less). Instead, we can use Schwartz-Zippel:

Theorem: Schwartz-Zippel for matrices

If then:

Rather than infer it from Schwartz-Zippel, we’ll give a direct proof of this. Let’s write and as row vectors:

If then there exists at least one row where . The inner products and are likely different:

This follows from the proof for one dimension, as and are degree 1 polynomials of variables .
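This is the idea behind Freivalds-style verification: each trial multiplies by a random vector and costs only O(n²), and repeating trials drives the failure probability down. A minimal sketch of such a verifier (the 0/1 vectors and the number of trials are choices of this sketch, not of the notes):

import random

def probably_equal(A, B, C, trials=20):
    """Randomized test of whether A @ B == C, using random 0/1 vectors."""
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False   # a witness vector was found: definitely not equal
    return True            # no witness found: equal with high probability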

Bipartite perfect matching

Let be a bipartite graph. We can represent the graph as an adjacency matrix having an entry if nodes are connected by an edge:

For instance:

A bipartite graph with left vertices a, b, c and right vertices d, e, f, and edges a-d, b-e, c-f, c-e and b-f

The above graph would have the following matrix representation:

We have a nice theorem on this adjacency matrix:

Theorem: Perfect Matching and Determinant

A graph has a perfect matching if and only if the determinant is not identically zero.

Let’s first prove the direction. Suppose has a perfect matching, i.e. a bijection mapping every to a unique . We can see as a permutation on the set of integers .

This is one term (monomial) of the polynomial . One way to define the determinant is by summing over all possible permutations :

The sign of a permutation is determined by the parity of the number of swaps needed to order it1617.

In particular, when in the sum (we sum over all possible permutations, so we’re guaranteed to get this case), we get a monomial with non-zero coefficient (we know that is non-zero as describes edges, so there’s a variable in that location of the matrix). This monomial is different from all other monomials in , so there are no cancellations. This means is non-zero.

Now, onto the direction. Suppose is non-zero. Therefore, using the above definition of determinant, there is some permutation for which all terms are non-zero. Since is a bijection, this indicates a perfect matching.
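Combined with the Schwartz-Zippel lemma, this yields a randomized test: substitute independent random values for the variables and compute an ordinary numeric determinant. A rough sketch using numpy (the edge-list format and the floating-point determinant are assumptions of this sketch, not of the notes):

import numpy as np

def has_perfect_matching(n, edges, trials=10):
    """Randomized test for a perfect matching in a bipartite graph with parts of size n.
    edges is a list of pairs (i, j) with 0 <= i, j < n."""
    rng = np.random.default_rng()
    for _ in range(trials):
        M = np.zeros((n, n))
        for (i, j) in edges:
            M[i, j] = rng.integers(1, 10 * n * n)   # random value from a large set S
        if abs(np.linalg.det(M)) > 1e-6:
            return True    # nonzero determinant witnesses a perfect matching
    return False           # determinant vanished on every trial: probably no matching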

However, this algorithm doesn’t give us the matching, it only tells us whether we have one. To fix this, we can run the following procedure. We pick a big set and set where is chosen independently and uniformly at random in . With high probability, there is a unique minimum weight perfect matching18, call it . We won’t prove it here, but we can write the determinant in the following form to reveal :

The ensures that the term exists in the sum, and the even number ensures that it is the smallest (i.e. that all other powers of 2 in the determinant can be divided by ).

With this in hand, for every edge in we can check whether it is part of by deleting the edge, computing the determinant as above, and seeing whether the newly found is equal to . If yes, then , otherwise not.

This algorithm can be implemented in using polynomial parallelism.

General graphs

This result actually generalizes to general graphs, though the proof is much more difficult. For general graphs, we must construct the following matrix (called the skew-symmetric matrix, or Tutte matrix) instead of the adjacency matrix.

Sampling and Concentration Inequalities

Law of large numbers

Suppose that there is an unknown distribution , and that we draw independent samples from and return the mean:

The law of large numbers tells us that as goes to infinity, the empirical average converges to the mean . In this chapter, we address the question of how large should be to get an -additive error in our empirical measure.

Markov’s inequality

Consider the following simple example. Suppose that the average Swiss salary is 6000 CHF per month. What fraction of the working population receives at most 8000 CHF per month?

One extreme is that everyone earns exactly the same amount of 6000 CHF, in which case the answer is 100%. The other extreme is that a fraction of all workers earn 0 CHF/month; in the worst case, of all workers earn 0 CHF/month, while the rest earns 8000.

Markov’s inequality tells us about this worst-case scenario. It allows us to bound this fraction, given the average salary, for the “worst possible” distribution.

Theorem: Markov’s inequality

Let be a random variable. Then, for all :

Equivalently:

In the salary example, denotes the average salary. We know that ; we want to look at the probability that , which we can write as for . This confirms what we said above. The probability that the salary is at most is at most .

The proof can be stated in a single line:

This proof is tight, meaning that all inequalities are equalities if the distribution of only has two point masses:

Variance

Markov’s inequality is a bound we can give if we don’t know much about the distribution . Indeed, if all we know is the expectation of , Markov’s inequality is the best bound we can give.

We can get a stronger bound if we know the variance of the distribution.

Definition: Variance

The variance of a random variable is defined as:

Chebyshev’s inequality

Using the variance of the distribution, we can give the following bound on how much the empirical average will diverge from the true mean:

Theorem: Chebyshev’s inequality

For any random variable ,

Equivalently:

where is the standard deviation of .

Note that we used strict inequality, but that for continuous random variables, these can be replaced with non-strict inequalities without loss of generality.

It turns out that Chebyshev’s inequality is just Markov’s inequality applied to the variance random variable . Indeed, by Markov:

In other words:

Taking the square root on both sides yields:

Taking , where trivially gives us the equivalent formulation.

Polling

We can use Chebyshev’s inequality to answer the question we raised in the beginning of this section, namely of estimating a mean using independent samples of the distribution .

By linearity of expectation, the mean is the same thing as :

We can use Chebyshev’s inequality to upper bound the following:

To use Chebyshev’s inequality, we first need to calculate the variance. To do that, we will first introduce the idea of pairwise independence.

Definition: Pairwise independence

A set of random variables are pairwise independent if for all :

Note that full independence implies pairwise independence.

Lemma: Variance of a sum

For any set of pairwise independent :

We can write:

In step we used pairwise independence as follows:

This means that the terms where cancel out with the subtracted sum. We are thus left with terms where after step .

Going back to the polling example, we use the above lemma. Since the samples are independent, they’re also pairwise independent, so by the lemma:

By Chebyshev’s inequality:

For the following discussion, let’s suppose that our poll was about whether people would vote yes or no in a referendum. This means that the random variables are independent Bernoulli variables:

Here, represents the fraction of the population that would vote yes. We want to estimate (the expectation of the Bernoulli trial) within additive error. To do this, we calculate the variance of , for which we need to know and .

We know that the expectation of a Bernoulli variable is . The second moment is:

Therefore, the variance is:

This confirms that the variance of a Bernoulli variable is , which we maybe already knew. In the worst case, gives us an upper-bound of on the variance.

By the bound we obtained with Chebyshev, we have:

This means that to achieve an -additive error with probability , we need many samples. The important part of the above inequality is that the size of the sample is independent of the size of the population.

For instance, suppose we chose 10 000 individuals randomly from the population, and calculated the empirical mean. By the above inequality, with probability 15/16, our estimate is within 2% of the true mean19.
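As a quick numeric check of this claim, using the worst-case variance bound of 1/4 from above:

def chebyshev_failure_bound(n, eps):
    """Chebyshev bound on P(|empirical mean - p| >= eps) with Var(X_i) <= 1/4."""
    return 1.0 / (4 * n * eps ** 2)

print(chebyshev_failure_bound(10_000, 0.02))   # 0.0625 = 1/16, so success probability 15/16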

Chernoff bounds

The Chernoff bounds, also called strong concentration bounds, give us quantitative bounds for the convergence described by the law of large numbers. If is an average of independent random variables with standard deviation and satisfy other properties, then:

This is an exponentially improved bound compared to Chebyshev’s inequality. To get this strong bound, we need to be an average of mutually independent random variables, which is a stronger assumption than the pairwise independence needed for Chebyshev.

There are different formulations of Chernoff bounds, each tuned to different assumptions. We start with the statement for a sum of independent Bernoulli trials:

Theorem: Chernoff bounds for a sum of independent Bernoulli trials

Let where = 1 with probability and with probability , and all are independent. Let . Then:

  1. Upper tail: for all
  2. Lower tail: for all

Another formulation of the Chernoff bounds, called Hoeffding’s inequality, applies to bounded random variables, regardless of their distribution:

Theorem: Hoeffding’s Inequality

Let be independent random variables such that for all . Let and set . Then:

  1. Upper tail: for all
  2. Lower tail: for all


Polling with Chernoff bounds

We can use the Chernoff bounds in our polling example:

If we want to estimate within an additive error with probability we can simply let:

To illustrate the difference between Chernoff and Chebyshev, suppose we wanted to estimate with additive error with probability . If we wanted this probability of success to be , with Chebyshev we would need samples, whereas we only need with Chernoff.
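To see the gap numerically, we can compare the sample sizes the two bounds suggest. The Hoeffding-style count below uses the standard form n ≥ ln(2/δ) / (2ε²), which may differ by constants from the exact statement used in the course:

import math

def samples_chebyshev(eps, delta):
    return math.ceil(1 / (4 * eps ** 2 * delta))      # from P_fail <= 1 / (4 n eps^2)

def samples_chernoff(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_chebyshev(0.01, 1e-6))   # about 2.5 billion samples
print(samples_chernoff(0.01, 1e-6))    # about 73 000 samples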

Proof sketch of Chernoff bounds

We will give the proof for the upper tail bound; the lower tail is analogous. For any :

The “magic trick” here is to exponentiate both sides. We can then use Markov to get an upper bound. Let us analyze the numerator. Seeing that the variables are independent:

We can use the fact that is a Bernoulli random variable taking value 1 with probability , and 0 with probability . This means that the random variable takes value with probability , and 1 with probability . This allows us to give a bound for this product of expectations:

The inequality step uses the same fact we’ve used in Hedge, that because of the Taylor expansion of , we have that . We simply plug to get this result.

Now, we can set and , and get20

This can be simplified to be at most .

Hashing

Introduction

Hashing is a technique of mapping from an input domain to a domain of size , where typically . To do so, we use a hash function . The question we want to study is how to choose .

A typical application for hashing is that we want to store a subset of the large universe , where . For each , we want to support three operations: insert(x), delete(x) and query(x).

A hash table supports these operations; is placed in , where is a table of size . For collisions (i.e. values that hash to the same bucket), we use a linked list.

An ideal hash function should have the property that the probability that two or more elements hash to the same value is low, in order to use the table efficiently and keep the linked lists short.

However, the following statement proves to be problematic if we pick a fixed hash function: for a fixed ,

In other words, if we fix a hash function , an adversary could select a “bad dataset” for which all values hash to the same location, thus rendering hashing useless. We cannot prove any worst-case guarantees for hashing in that case. This will lead us to choose our function at random from a family .

The question of how to choose remains. We want to choose it such that a random function from maps to a given location of the hash table uniformly and independently at random. Choosing according to this goal ensures that the number of collisions is low. To see that, consider a value , and let be the length of the linked list containing . Let be a random variable:

We will hash , and also every and see how many collisions that brings about. After doing this, the length of the linked list will be one for , plus the number of values in that hash to the same value:

Since we’re choosing at random from , we need to look at the expectation rather than a single realization. By linearity of expectation:

Usually we’ll choose , in which case we expect less than one collision (so 2 items in the linked list).

The last step in the above uses our assumption that if we choose a function uniformly at random in , then a given value hashes to a uniformly random location in the table. This means that for a value :

Let’s suppose that the family contains all possible permutations of mappings. That would mean . To store it, we would need bits, which is too big.

Luckily, we can save on storage by using the following observation: in the above calculations, we don’t need full mutual independence between all values and ; we only need pairwise independence between and each . This means that we don’t have to choose uniformly at random from all possible mappings, but instead from a smaller set of functions with limited independence.

2-universal hash families

Let’s start by defining 2-universality:

Definition: 2-universal hash families (Carter Wegman 1979)

A family of hash functions is 2-universal if for any , the following inequality holds:

We can design 2-universal hash families in the following way. Choose a prime and let:

We let the hash function be:

This works because the integers modulo form a field when is prime, so we can define addition, multiplication and division among them.
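A minimal sketch of this family, assuming the standard Carter-Wegman form h(x) = ((a·x + b) mod p) mod m with a chosen in {1, ..., p-1} and b in {0, ..., p-1}:

import random

def sample_hash(p, m):
    """Sample one function from the 2-universal family; only a and b need to be stored."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m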

Lemma: Lemma 2, 2-Universal Hash Families

For any and , the following system has exactly one solution:

Since constitutes a finite field, we have that and .

Note that this proof also explains why we require : if we had the only solution would be , in which case would return for all , thus rendering hashing useless.

Moreover, there are possible choices for the pair and . Therefore, over a uniformly at random choice of and , if we select any , , and such that :

This probability describes two cases, which we comment on below:

  • If then the probability we’re looking at is equivalent to . Since is prime, is a field and thus implies , which cannot happen by assumption (we chose ).
  • If , then we’re looking at the probability that and hash to different values. The probability that they both hash to the given and is that of having chosen the one correct pair producing and for and . Since there are possible pairs , the probability is .

These observations lead us to the following lemma:

Lemma: Lemma 3, 2-Universal Hash Families

is universal.

To prove this, we will have to check the definition, i.e. whether the probability of two different values hashing to the same value is less than . For any :

Step uses an indicator variable to capture whether so that we can reformulate the probability in the same form as in . Step uses the . Step follows from the fact that for each , we have at most different such that and .

The importance of all of the above is that we can now create a 2-universal hash function solely by picking two numbers and uniformly at random, and we only need to store and , which requires space instead of the bits required by a completely random hash function.

Let us calculate the expected number of collisions:

being a 2-universal hash family, the probability of collision is . There are possible pairs such that , which leads us to the above result.

Two-layer hash tables

If we select to be greater than , we can have no collisions with high probability. In practice, such a large table is unrealistic, so we use linked lists or a second layer of hash table for collisions.

The two-layer table works as follows. Let denote the actual number of collisions in bucket . If we construct a second layer for bucket , of size , we can easily find separate locations in the second layer for all values that collided in the first layer.

The total size of the second layers is:

We can compute the expected total size of the second layers by decomposing the squares as follows:

Step is true if we assume . Step places a (large) upper bound using the fact that we cannot expect more collisions than there are elements in the set of hashed values .

k-wise independence

Intuitively, a collection of events is -wise independent if any subset of of them are mutually independent. We can formalize this as follows:

Definition: k-wise independent hash family

A family of hash functions is -wise independent if for any distinct elements and any numbers we have:

Recall the definition we gave previously for pairwise independence. Generalizing this to -wise independence, we get:

Lemma: Expectation of product of k-wise independent variables

For any set of -wise independent :

For some prime number consider the family of functions constructed by choosing uniformly at random in , and letting the function be defined as:

This function is -wise independent. We can store it with memory.
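A sketch of this construction, assuming the standard form of a uniformly random polynomial of degree k - 1 over the field of integers modulo a prime p:

import random

def sample_kwise_hash(p, k):
    """h(x) = (a_0 + a_1 x + ... + a_{k-1} x^{k-1}) mod p with random coefficients."""
    coeffs = [random.randrange(p) for _ in range(k)]
    def h(x):
        value = 0
        for a in reversed(coeffs):   # Horner's rule
            value = (value * x + a) % p
        return value
    return h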

We can adapt the discussion on 2-universality to pairwise independence (also called 2-wise independence).

Definition: 2-wise independent hash family

We say that is 2-wise independent if for any and any pair of ,

Note that 2-wise independence implies 1-wise independence.

Definition: 1-wise independent hash family

We say that is 1-wise independent if for any and any pair of ,

Load balancing

Let’s discuss how large the linked lists can get. For simplicity, we’ll consider a situation in which we hash keys into a hash table of size . We also assume that the function is completely random, rather than just 2-universal as above.

This situation can be explained by the analogy of balls-and-bins. We throw balls at random, and they each land in a bin, independently and uniformly at random. Clearly, we expect one ball in each bin, but the maximum number of balls in a single bin can be higher. For a given :

The first step uses the union bound: the probability that any of the events happens is at most the sum of their individual probabilities. Stirling’s formula allows us to approximate the factorial:

Choosing ensures . Thus:

Long story short, with probability larger than , the max load is less than . We can even boost this probability of success to by changing the parameters a little.

Power of two choices

This is not bad, but we can do even better using the following cool fact (that we won’t prove).

The trick we use at the supermarket is to go to the checkout counter with the shortest queue. This is a linear process in the number of checkout registers, so consider this simpler trick: when throwing a ball, pick two random bins, and select the one with the fewest balls (shortest linked list). This ensures that the maximal load drops to , which is a huge improvement on . This is called the power of two choices.
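A small simulation (not from the notes) illustrates the difference between one and two choices:

import random

def max_load(n, choices):
    """Throw n balls into n bins; each ball inspects `choices` random bins
    and goes into the least loaded one. Returns the maximum load."""
    bins = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(choices)]
        target = min(candidates, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins)

print(max_load(100_000, 1), max_load(100_000, 2))   # the second number is typically much smaller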

Streaming algorithms

Streaming algorithms take as input a long stream consisting of elements taking values from the universe .

We can assume that and are huge; if we were Facebook, could be the number of profiles and the number of different cities in the world.

The algorithm must be super fast, and cannot store all the data in memory. Our goal is to process the input stream (left to right) using a small amount of space, while calculating some interesting function .

Ideally, we would achieve this in pass with memory, but we may have to allow some error margins to achieve these.

Finding Frequent Items Deterministically

We have a stream where each . This implicitly defines a frequency vector describing the number of times each value has been represented in the stream. Note that this is not the input of the algorithm, but simply another way of denoting the stream. Also note that .

We will consider the following two problems:

  • Majority: if there exists a such that , output . Otherwise, output “NONE”.
  • Frequent: Output the set

Unfortunately, there is no deterministic one-pass algorithm that doesn’t need linear memory .

However, it is possible to define an algorithm that only needs logarithmic memory if we allow for some margin of error. The Misra-Gries algorithm is a single-pass () algorithm that solves the related problem of estimating the frequencies , and can thus also be used to solve the above problems.

The algorithm uses a parameter that controls the quality of the answers it gives.

def initialize():
    A = dict() # associative key-value array of counters

def process(j):
    if j in A:
        A[j] += 1
    elif len(A) < k - 1:           # room for a new counter
        A[j] = 1
    else:                          # no room: decrement all stored counters
        for l in list(A.keys()):   # copy the keys so we can delete while iterating
            A[l] -= 1
            if A[l] == 0:
                del A[l]

def result(a):
    if a in A:
        return A[a]
    else:
        return 0

The algorithm stores at most key/value pairs. Each key requires bits to store, and each value at most bits, so the total space is .

To analyze the quality of the solutions, we’ll proceed in two steps: first an upper bound, then a lower bound. To make things easier, we’ll pretend that the A map consists of key-value pairs, where A[j] == 0 if j is not actually stored in A by the algorithm.

Let denote the answer output by the algorithm, and be the true answer.

We only increase the counter for when we have seen .

We need to look into how often we decrement the counters. To simplify the argumentation, we allow ourselves a small re-interpretation, but everything stays exactly equivalent. We’ll consider that whenever A[j] is decremented, we also decrement other counters, for a total of decrements at a time.

Since the stream consists of elements, there can be at most decrements. This can be explained by the “elevator argument”. Consider an elevator that, every time it goes up, goes up by 1 floor. Every time it goes down, it goes down by floors. Let be the number of times it goes up in a day. How many times can it go down? The answer is that it must have gone down times.

We can summarize these two claims in a theorem about Misra-Gries:

Theorem: Misra-Gries

The Misra-Gries algorithm with parameter uses one pass and bits of memory. For any token , it provides an estimate satisfying:

We can use the Misra-Gries algorithm to solve the Frequent problem with one additional pass. If some token has then its counter A[j] will be at the end of the algorithm. Thus we can make a second pass over the input stream counting the exact frequencies of all elements and output the set of desired elements.

Estimating the number of distinct elements

We consider the following problem:

  • Distinct elements: Output an approximation to the number of distinct elements that appear in the stream

This is the “Facebook problem” of estimating the number of different cities on the site. The function is also called the norm.

It is not possible to solve this in sublinear space with deterministic or exact algorithms. We therefore look for a randomized approximation algorithm. The algorithm, which we’ll call , should give a guarantee on the quality and failure probability of its output , of the following type:

That is, with probability we have a 3-approximate solution. The amount of space we use will be .

Ingredients

We need a pairwise independent hash family . The following fact will be useful:

Lemma: Lemma 2: Pairwise independent hash family

There exists a pairwise independent hash family so that can be sampled by picking random bits. Moreover, can be calculated in space .

We will also need the zeros function. For an integer , let denote the number of zeros that the binary representation of ends with (trailing 0s). Formally:

For instance:

i    Binary representation    zeros(i)
1    1                         0
2    10                        1
3    11                        0
4    100                       2
5    101                       0
6    110                       1
7    111                       0
8    1000                      3


Algorithm

Let be a number of bits, i.e. . There are possible values for . Considering all possible values, we can see that each bit is 0 or 1 with probability , independently. Therefore, to have , the independent event of a bit being 0 must happen times in a row, so:

Equivalently, we have:

Therefore, if we have distinct numbers, we expect at least one number to have .

With this intuition in mind, the algorithm is:

def initialize():
    h = random [n] -> [n] hash function from pairwise indep. family.
    z = 0

def process(j):
    z = max(z, zeros(h(j)))

def output():
    return 2 ** (z + 1/2)
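The hash function h and the helper zeros are left abstract in the pseudocode; a minimal way to instantiate them (a sketch, using the (a·x + b) mod p construction for the pairwise independent family, with p a prime at least n) could be:

import random

def zeros(x):
    """Number of trailing zeros in the binary representation of x (0 for x = 0 here)."""
    count = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        count += 1
    return count

def sample_pairwise_hash(p):
    """h(x) = (a*x + b) mod p; with a and b uniform in {0, ..., p-1} this is pairwise independent."""
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: (a * x + b) % p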

This algorithm uses space to store the binary representation of , which counts the number of zeros in the binary representation of , and for the hash function. Let’s analyze the quality of the output.

Analysis

For each and each integer , let be an indicator variable for the event “”:

Let count the number of values that hash to something with at least zeros:

Let denote the value of when the algorithm terminates. By definition, iff there’s at least one value that hashes to more than zeros, then the final is .

This can be equivalently stated as:

Since is uniformly distributed over -bit strings we have as before. This allows us to give the expectation of . Let be the solution to the problem, i.e. . Then:

The variance is:

Step uses the fact that is an indicator variable that is either 0 or 1, so . Step uses pairwise independence as follows:

Applying this in the left-hand side of cancels out the and where , as those terms are subtracted by sum of expectations. We’re left only with terms where , which is why we can rewrite the sum to only sum over instead of over and . Note that this is exactly the same argument as in our proof of the lemma on the variance of a sum.

By using Markov’s inequality, we get:

Since we also know the variance, we can use Chebyshev’s inequality21:

Let be the algorithm’s output, the approximate estimation of . Recall that was the final value of . Then .

Let be the smallest integer such that . Using and :

Let be the largest integer such that . Using and :

This gives us failure bounds on both sides of , which is quite high. To decrease this, we can adjust our goal to allow for better than a 3-approximation. Alternatively, we can use the median trick to boost the probability of success.

Median trick

If we run copies of the above algorithm in parallel, using mutually independent random hash functions, we can output the median of the answers.

If this median exceeds then individual answers must exceed . We can expect of them to exceed , so by the Chernoff bounds, the probability of this event is . This works similarly for the lower bound.

If we choose we can get to a probability of success of . This means increasing total memory to .

Note that using the average instead of the median would not work! Messing up the average only requires one bad sample, and we have no guarantee of how far off each wrong answer can be; we only have guarantees on answers being within a successful 3-approximation.

F2 estimation

We’re now interested in estimating the second frequency moment :

Naive attempt

Let’s consider a first, naive attempt.

def initialize():
    z = 0

def process(j):
    z = z + 1

def output():
    return z ** 2

Let’s look at the expectation:

To get rid of this additional term, instead of incrementing for each element , the solution is to flip a coin on whether to increment or decrement . This is the basis for the following algorithm.

Alon Matias Szegedy algorithm

def initialize():
    h = random 4-wise independent hash function, h: [n] -> {-1, +1}
    z = 0

def process(j):
    z = z + h(j)

def output():
    return z ** 2
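The 4-wise independent ±1 hash is left abstract above. As a simple stand-in for experiments, a fully random sign function is also 4-wise independent, though it costs O(n) memory; a real streaming implementation would use something like a degree-3 polynomial over a prime field instead. A sketch under that caveat:

import random

def make_sign_hash():
    """Lazily memoized fully random mapping j -> {-1, +1}.
    Full randomness implies 4-wise independence, but needs O(n) memory."""
    signs = {}
    def h(j):
        if j not in signs:
            signs[j] = random.choice((-1, 1))
        return signs[j]
    return h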

At the end of the algorithm, we have .

Claim: unbiased estimator

is an unbiased estimator:

Let’s prove this.

In step , we simply expand the square. Step separates products of identical terms from those of non-identical terms using linearity of expectation. Step uses the fact that as , as well as linearity of expectation and 4-wise independence to push down the expectation into the sum22. Finally, step uses the fact that .

Claim: variance

Recall that we can compute the variance using the following formula:

We already have the expectation, so let’s compute the first term:

Let’s consider several types of terms resulting from this multiplication:

  • All the indices are equal:

  • The indices are matched 2 by 2:

  • Terms with a single unmatched multiplier (meaning at least one index different from all others) can be ignored. Indeed, when we compute the expectation, we will get a term containing . Suppose is different from , and (but we impose no restrictions on these three: they could be the same or they could be different). Since is 4-wise independent, we can “push down” the expectation by the lemma on products of independent variables, at the very least separating the term:

    We’ll get for any , which confirms that the term can be ignored.

Therefore:

The variance of is thus:

Step uses the following:

Boosting the precision

We can improve the precision of the estimate by repeating the algorithm a sufficient number of times independently, and outputting the average.

In this case, the algorithm is:

  1. Maintain identical and independent copies of the above algorithm, and let denote the output of these copies.
  2. Output

We make the following claim about this algorithm:

Claim

Let denote the exact correct answer. The algorithm outputs an answer satisfying:

By linearity of expectation, remains an unbiased estimator as:

Since we know the variance of a single run, the variance of the average of all runs is:

By Chebyshev’s inequality, this allows us to prove that we’re achieving our goal :

Note that we didn’t use the median trick in this case, but we used the average instead. Using the median is especially problematic when the failure probability is . As an analogy, when you flip a coin that gives tails 60% of the time, the median will always give tails as , so be careful!

Locality sensitive hashing

Definition

Intuitively, a locality sensitive hashing function (LSH function) is a hash function that hashes “close points” to the same value, and “distant points” to different values.

To be more precise, if we want to map and to the same value with high probability; if , we want and to hash to different values with high probability. More formally:

Definition: Locality sensitive hashing families

Let be the universe containing the points . Suppose we have a family of functions mapping from to . We say is -LSH if for all :

Here, the distance function could be anything, but we’ll just consider it to be Euclidean distance for the sake of simplicity.

Let’s try to see what this means visually:

Visualization of the distances and areas involved

We consider to be at the center of the diagram. Let’s consider the guarantees we make for different positions of :

  • If is within the red circle, a function selected uniformly at random from will hash and to the same value with probability at least .
  • If is within the blue circle (but not within the red), we make no guarantees.
  • If is outside both circles (in the black square), then the two values collide with probability at most .

Boosting probabilities

Given an -LSH family (under the assumption ), we can boost it to get and . We do this in two steps.

Reducing

To reduce , we simply draw functions from independently. The overall hashing function maps a point to a -dimensional vector.

By independence of the functions , we have:

The last step happens because each term is since we’re considering the case .

Augmenting

Unfortunately, the above also reduced : the above hash function hashes close points to the same vector with probability , which is not what we want. We want to be able to make a better guarantee than that. To boost this probability, we run independent copies of the above -dimensional vector. This defines something like a matrix of hashes:

For a sufficiently large , there is a high probability that there is an such that . Indeed, if , then:

Step is by independence of the functions, and step is because each individual probability term is as it’s the complement of the event of two vectors being identical, which we previously said happened with probability .

Tuning parameters

In the above, we want to find any function , , that hashes and to the same vector of size . In other words, we want any one of the functions to produce all the same values; we want a disjunction of a conjunction ( of ) to be true in order to consider that a collision has happened.

How do we tune the parameters and in this setting?

First, we choose according to either one of these equivalent equations:

We’ll assume for some . We’ll see that this parameter is of paramount importance, determining the running time and memory of the algorithm. We choose .

Example: LSH for binary vectors

Let’s give an example of a LSH family for binary vectors. Consider , and let be the Manhattan distance (i.e. the number of bits that are different between the two binary vectors).

Claim

The family , where selects the th bit of , is -LSH.

Observe that for each :

Therefore:

Therefore, is -LSH.
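A sketch of sampling from this family, combined with the k sampled bits and l repetitions described in the boosting section (the parameter names follow that discussion):

import random

def sample_bit_lsh(d, k, l):
    """Return a function mapping a binary vector to l signatures of k sampled bits each."""
    tables = [[random.randrange(d) for _ in range(k)] for _ in range(l)]
    def signatures(x):
        return [tuple(x[i] for i in coords) for coords in tables]
    return signatures

g = sample_bit_lsh(d=16, k=4, l=8)
print(g([0, 1] * 8))   # 8 signatures of 4 bits each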

Consider the following problem, called the Nearest Neighbor Search (NNS): consider a set of points in (potentially) high-dimensional space. For any query , find the closest point :

The naive solution would be to loop over all points in , but this takes time and space.

If we had , we could pre-process the points by sorting them, and then run a binary search to find the closest element. Generalizing this to higher dimensions leads us to k-d trees, but these unfortunately suffer from the curse of dimensionality, and fail to beat the naive solution when .

If we allow approximate solutions instead of searching for the exact nearest neighbor, we can reduce this problem to that of finding an LHS family. We’ll call the approximate variant Approximate Nearest Neighbor Search (ANNS); in this problem, instead of finding the closest point to a query point , we’re happy to find a point such that:

Here, is the approximation factor.

Reduction to LSH

With the following algorithm, the ANNS problem is reduced to that of finding an LSH family.

The idea of the algorithm is to take a -LSH family with and (we can boost the probabilities if we need to), and to build a hash table in which we store for each . For a query point , we can do a lookup in the hash table in : we can then traverse the linked list of points hashing to the same value, and pick the first point satisfying .

def preprocess():
    choose k * l functions from H
    construct l hash tables
    for i in range(l):
        for p in P:
            store p at location f[i](p) = (h[i, 1](p), ..., h[i, k](p)) in i-th hash table

def query(q):
    for i in range(l):
        compute f[i](q)
        go over all points p stored at location f[i](q)
        if some such p satisfies dist(p, q) <= c * r, return it

To see how this solves the ANNS problem, consider a point such that . In this case, we have with probability . If we have , we have with probability .

Probability of success

Let’s fix a query point , and consider the “good case”, i.e. a point in the inner circle, meaning . Then:

Step follows from what we proved when augmenting . Step follows from our assumption that for some . Step comes from our choice of that satisfies , and step from the (by now) usual trick of , which stems from the Taylor expansion of . Finally, we simplify the expression in step .

This all means that for any point within distance , we output with probability .

Space and time complexity of the reduction

The algorithm maintains hash tables, each of which contains values, where each is a -dimensional vector. The space complexity is thus:

since we picked the parameters and .

For the query time, we need to compute the -dimensional vectors for , which is . Then, we need to look at candidate points: there is overhead in considering ineligible (“far away”) points. To know how much overhead this represents, we must consider how many “far away” points we expect to find in a linked list.

To compute this expectation, we fix a query point , and consider that , i.e. the point is outside both circles. For any fixed , the expected number of points that hash to the same value as but are outside both circles is:

The first step happens by linearity of expectation, and the second one by our choice of .

Summing up over all , in expectation there are far-away points in our data set mapping to the same value as for some . To measure their distance (and thus know their ineligibility), we must take time for each such point. This implies an overhead of to examine these.

In total, we have a runtime of:

Ignoring lower-order terms, the algorithm runs with memory and query time .

Submodularity

Definition

Definition: Set function

A set function is a function assigning a real value to every subset of a ground set .

In many cases, the value of an item may depend on whether we already have selected some other item. For instance, just buying one shoe is much less valuable than buying both. Or in the other direction, in a stamp collection, once we already have a stamp of a particular type, an additional stamp of the same type is less valuable.

In this latter example, the function that assigns a (monetary) value to the stamp collection corresponds to a submodular function.

Definition: Classic definition of submodularity

A set function is submodular if, for all :

An alternative (but equivalent) definition of submodularity may make more intuitive sense, but requires the introduction of the concept of marginal contribution:

Definition: Marginal contribution

Given a set function , a set and an element , the marginal contribution of to is defined as:

The marginal contribution measures how much value an individual element adds if added to a set . This allows us to give the alternative definition of submodularity:

Definition: Diminishing returns definition of submodularity

A set function is submodular if and only if for all and each the following holds:

Intuitively, this means that a function is submodular if adding elements to a smaller set has a larger marginal contribution than if added to a larger set. Let’s prove that this is equivalent to the classic definition.

First, we’ll prove the direction, namely that the classic definition implies the diminishing returns definition. Suppose that is submodular and let be two sets. We’ll consider the sets and . According to the classic definition:

The first step uses the fact that , thus ; this is also the case if we consider instead of , regardless of whether . The second step rearranges the terms, and the third one uses the definition of marginal contribution.

Now, let’s prove the direction, namely that the diminishing returns definition implies the classic definition. Let’s consider two sets . Let be the number of elements in . Let be the set of the first elements .

Step follows from the fact that and that , and step from the fact that since . Step introduces a reformulation as a telescoping sum23, and step rewrites the terms as a marginal contribution. The inequality in step holds by assumption, since as . Step expands the marginal contributions according to the definition; this sum is telescoping, so we can do step .

Rearranging the terms of the above inequality yields as required.

Examples

Cut size

The function measuring the size of a cut induced by in a graph is:

This function maps from the ground set to the reals. To see that it is submodular, we can look at the marginal contribution of a single point to a cut . Let be the number of edges between some node and a set of nodes :

In other words, the contribution of adding a node to a cut is the number of edges that now cross the cut by the inclusion of , minus the number of edges that no longer cross the cut, i.e. those that have been “absorbed” into the set by inclusion of .

Observe that the first term is decreasing in , and the second is increasing in , but is subtracted. Hence, the whole expression is decreasing in , which proves submodularity.

Coverage function

Consider a finite collection of sets with each . We consider a ground set of indices of these sets and a set function defined by:

The function counts how many elements in are covered by the sets specified by the indices in . To show that this function is submodular, we analyze the marginal contribution of a set (represented by the addition of index to a set ):

This is decreasing in so is submodular.

Matroid rank

Let be a matroid on a ground set . Let be the rank function, defined by:

In words, is the size of a maximal independent set containing only elements from .

This function is submodular: consider two sets . Let be any maximal independent set of contained in . By the matroid augmentation property , we can extend to a maximal independent set contained in . Then:

This satisfies the classic definition of submodularity.

Summarization

A function summarizing data is often submodular, much for the same reason as the stamp collection example.

Influence maximization

Suppose we want to select people to give free samples to in order to launch a product. We want to select people that have the maximum influence, meaning that they tell the most people about the product. Solving this is basically a maximization of a submodular function under a cardinality constraint, which we’ll see more about later.

Entropy

Entropy is another example of a submodular function. If we already have a piece of information, then receiving it a second time adds no entropy. Or perhaps, if we have some information already, additional information adds less entropy than if we had no information at all.

Submodular function minimization

In the following section, we’ll assume that we can evaluate in constant time. We will show that minimizing is equivalent to finding the global minimum of a certain convex continuous function, which we can do in polynomial time using (sub)gradient descent. To be able to use the whole framework of convex optimization, we need to extend our submodular function on discrete sets to a convex function on a continuous domain. This is what the Lovász extension does.

Lovász extension

Say we have a submodular function . We want to extend it to .

To do that, we can start by thinking of as , where , which maps indicator vectors to real values. This indicator vector indicates which elements of are contained in the considered set .

We also introduce an equivalence between and (we can just let the elements in be numbered).

Definition: Lovász extension

Let . Define , the Lovász extension of , as:

where denotes a uniformly random sample on .

In other words, given a vector , the extension returns the expectation of placing a threshold uniformly at random in , and evaluating with the set of terms in that are .

Notice that for any , if , then , so will agree with over the hypercube (all integral points). At fractional points, is some kind of average of .

We can actually give a more closed form of this averaging representation of . To do so, we order the elements of :

Let for any equal . Then:

Note that for any (the probability of this event being since is selected from ), for any , we have .

All of this leads us to the following formulation:

Definition: Lovász extension formulation

Let be the non-increasingly ordered components of .

Let where . We let be ordered by the same permutation that ordered the .

Then:

This stems from the fact that the probability of is equal to , for any , as is uniformly distributed in . Additionally, if this event happens, then .

This means that we can evaluate at any using evaluations of (we assumed a call to take constant time).

For instance, let . Then gives rise to the following values:

The result of the function can be computed from evaluations of :
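A sketch of this evaluation, assuming f is given as a Python function on frozensets over the ground set {0, ..., n-1}, with f of the empty set equal to 0, and x a vector in [0, 1]^n:

def lovasz_extension(f, x):
    """Closed-form evaluation: sort the coordinates non-increasingly and take a
    weighted sum of f on the prefix sets, with weights given by consecutive gaps."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    value, prefix = 0.0, set()
    for j, i in enumerate(order):
        prefix.add(i)
        next_coord = x[order[j + 1]] if j + 1 < len(order) else 0.0
        value += (x[i] - next_coord) * f(frozenset(prefix))
    return value

# For the modular function f(S) = |S|, the extension is just the sum of the coordinates
print(lovasz_extension(len, [0.5, 0.2]))   # 0.7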

Convexity of the Lovász extension

If the Lovász extension is convex, we have a whole host of convex optimization methods that we can apply (subgradient descent, ellipsoid, …). Therefore, the following theorem is crucial.

Theorem: Convexity of the Lovász extension

Let be the Lovász extension of . Then:

This proof is quite long, so we will only show one direction, namely that submodular implies convex. Let’s start with an outline of the proof:

  1. Redefine as a maximization problem
  2. Observe that the problem is convex
  3. Prove that is equal to the objective function by upper-bounding and lower-bounding it.

For this proof, we assume that is normalized, meaning (without loss of generality, as we could just “shift” to achieve this). We previously gave a formulation of the Lovász extension, and we’ll give a second one here, as an LP. For , and submodular, is the solution to the following LP:

This may seem to come out of nowhere, but we’ll see the rationale once we see the dual formulation. This LP has exponentially many constraints, as is equivalent to .
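For reference, the standard way this LP is written in the literature (our reconstruction; the notes’ notation may differ slightly) is as a maximization over the submodular polyhedron of f:

$$\hat{f}(x) \;=\; \max\Big\{ \textstyle\sum_{e \in N} x_e\, y_e \;\Big|\; y \in \mathbb{R}^N,\ \sum_{e \in S} y_e \le f(S) \ \ \forall S \subseteq N \Big\}$$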

Our main claim is that the objective function , . Note that the objective function is convex, as it is a pointwise maximum of linear functions, one for each feasible point of the polytope. To prove this, we observe that for and :

Now, we need to prove to finish the proof. We’ll do so using weak duality. The dual program is given by:

By weak duality, for any feasible in the primal, and feasible in the dual, we have:

By strong duality, we achieve equality if and are optimal solutions.

We’ll assume the same notation as in the closed formulation of the Lovász extension, meaning that we have . We also have where . Once again, we let be ordered by the same permutation that ordered the .

We define:

Now, we’ll make a few claims that complete the proof. To prove these claims, we’ll first need:

This follows from the closed formulation of the Lovász extension and from .

Claim: Convexity proof claim 2

This is proven by .

Claim: Convexity proof claim 3

is feasible for the dual.

Observe that has all non-negative entries because . Further, for any :

The third step happens because the sum is telescoping. This exactly satisfies the dual LP, as required.

Claim: Convexity proof claim 4

is feasible for the primal.

Observe that:

Here, we used that is normalized so . Let . Then, for to be feasible, we must show:

This can be done by induction on the size of the set. The base case trivially holds as we sum over , which gives us 0, which is equal to . For the inductive step, we argue that if is the largest index in . The classic definition of submodularity can be rearranged to:

The final inequality uses the induction hypothesis.

With these four claims proven, we have proven the theorem in the direction.

Submodular function maximization

For this part, we’ll assume that all set functions we deal with are:

  • Non-negative:
  • Normalized:

Monotone cardinality-constrained submodular maximization

For this section, we’ll also assume that is:

  • Monotone:

Monotonicity tells us that the value of cannot decrease as we add elements; the marginal contribution is always non-negative. This means that , so unconstrained maximization is trivial (just pick the ground set).

Instead, we’ll look at the less trivial constrained maximization problem, in which we’re looking for a set of size at most that maximizes .

At the beginning of the course, we saw that weighted maximization problems with linear objectives can be solved optimally by the greedy algorithm. This is even true if we generalize the constraint to a general matroid constraint.

What happens if we try to generalize this to submodular functions? The natural approach is, at each step, to greedily add the element that has the maximal marginal gain:

def greedy_submodular_maximization(N, f, k):
    """
    Input:  N   ground set (a Python set)
            f   submodular function 2^N -> R, called as f(S) on a set S
            k   size constraint 0 <= k <= |N|
    Output: S ⊆ N with |S| <= k
    """
    S = set()
    for i in range(k):
        # add the element with the largest marginal gain f(u | S) = f(S + u) - f(S)
        u_i = max(N - S, key=lambda u: f(S | {u}) - f(S))
        S = S | {u_i}
    return S

However, this algorithm can produce suboptimal results. For instance, if we consider a coverage function with sets , , , a greedy algorithm will first select the set with the largest coverage and then either of the two others (covering 5 elements), whereas the optimal solution is to pick the two other sets (covering 6 elements).
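Here is a hedged, concrete instance consistent with the description above (the exact sets used in the lecture may differ), run through the greedy algorithm and compared against the optimum:

# Hypothetical coverage instance: greedy covers 5 elements, the optimum covers 6
sets = {
    "T1": {1, 2, 3},
    "T2": {4, 5, 6},
    "T3": {1, 2, 4, 5},   # the set with the largest coverage, so greedy picks it first
}

def coverage(S):
    """f(S) = number of ground elements covered by the chosen sets."""
    return len(set().union(*(sets[name] for name in S))) if S else 0

k, S = 2, set()
for _ in range(k):
    # greedily add the set with the largest marginal coverage
    u = max(sets.keys() - S, key=lambda name: coverage(S | {name}) - coverage(S))
    S.add(u)

print(S, coverage(S))            # greedy: T3 plus either of the others, covering 5
print(coverage({"T1", "T2"}))    # optimum: 6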

Still, we can show that the greedy algorithm gives a constant factor approximation. To show this, we’ll need to introduce a lemma about marginal contributions:

Lemma: Sum of marginal contributions

Let be a submodular function, and let . Then:

Equivalently:

This tells us that the individual contributions of to are more than the contribution of all of . In this instance, “the sum of the parts” is greater than the whole.

To prove this, order the elements of as .

The inequality follows from the submodularity of .

This allows us to prove the following theorem:

Theorem: Approximation guarantee of greedy maximization of a submodular function

Let be the set produced by the greedy algorithm for maximizing a monotone submodular function subject to a cardinality constraint . Let be any set of at most elements. Then:

If this is true for any set , it is also true for the optimal solution; therefore, the theorem tells us that the greedy algorithm is a approximation algorithm.

Let us prove this. Since is monotone, we can assume ; if we had we could always add elements to achieve , without decreasing . Our proof outline is the following:

  1. Do some work to restate the element selection criterion in terms of the optimal solution
  2. Set up a recurrence defining in terms of
  3. Solve the recurrence to find a closed form of the bound

Part 1. Let be the th element selected by the greedy algorithm. Let be the set of the first elements selected. Note that and . Consider an iteration and let be any element in (i.e. in the optimal solution, but that we haven’t picked yet). Then, by our greedy choice criterion:

In other words, the optimal solution increases our current choice less than the element that we are going to pick at iteration .

We have one such inequality for each . If we put all these inequalities together, i.e. we multiply by , we get:

Step uses submodularity of , step uses the previous lemma on the sum of marginal contributions, and step uses monotonicity (since we have ).

Now, note that since we assumed , and since is monotone. Therefore, as a small recap, we currently have the following inequality:

Now, if we divide by in the above, and recall how we defined and , we get:

Part 2. Rearranging the terms, we get:

This gives a recurrence for , where it is defined in terms of . By induction (a proof we will omit), we can show that:

Part 3. To complete the proof, it suffices to plug in and note that using the usual trick of because of the Taylor expansion of the exponential function.
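For concreteness, here is the standard way these last steps are usually written out (a hedged reconstruction; we write O for the comparison set and S_i for the greedy solution after i steps, which may differ slightly from the notes’ notation):

$$f(S_{i+1}) - f(S_i) \ \ge\ \frac{1}{k}\Big(f(O) - f(S_i)\Big)
\quad\Longrightarrow\quad
f(O) - f(S_{i+1}) \ \le\ \Big(1 - \frac{1}{k}\Big)\Big(f(O) - f(S_i)\Big)$$

$$\text{so by induction}\quad
f(S_k) \ \ge\ \Big(1 - \big(1 - \tfrac{1}{k}\big)^k\Big) f(O) \ \ge\ \Big(1 - \frac{1}{e}\Big) f(O),$$

using $f(\emptyset) = 0$ and the inequality $1 - x \le e^{-x}$.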

Note that this is the best possible bound. It is NP-hard to do better.

Unconstrained submodular maximization

Now, suppose that is not monotone. Then, the unconstrained maximization is relevant again, because we have no guarantee that the solution should be .

The greedy algorithm that stops when no element gives any positive marginal gain does not perform well. For instance, consider the following example, where the ground set is , and the function is:

Here, is indeed submodular, but the greedy algorithm would pick immediately, as it has marginal contribution 2 when , whereas any other option only has marginal contribution 1. However, the optimal solution would be , which would have value .

Instead of the naive greedy approach, we’ll see the double-greedy approach that allows us to get a approximation. This algorithm is “greedy from two sides”, one side starting at and the other at .

def double_greedy_submodular_maximization(N, f):
    """
    Input:  N   ground set (a Python set)
            f   submodular function 2^N -> R, called as f(S) on a set S

    Output: S ⊆ N with f(S) >= 1/3 * f(Opt)
    """
    u = list(N)                        # fix an arbitrary order u[0], ..., u[n-1]
    n = len(u)
    X = set()                          # X_0 = {}
    Y = set(N)                         # Y_0 = N
    for i in range(n):
        a = f(X | {u[i]}) - f(X)       # marginal contribution of adding u[i] to X
        b = f(Y - {u[i]}) - f(Y)       # marginal contribution of removing u[i] from Y
        if a >= b:
            X = X | {u[i]}             # keep u[i]; Y stays the same
        else:
            Y = Y - {u[i]}             # discard u[i]; X stays the same
    return X                           # at this point X == Y

Before we can prove that this algorithm is a approximation, we need some lemmas.

This should be clear for a linear function (which is a special case of a submodular function), in which . We aim to prove this for the more general case of submodular functions.

First, note that for all , as they start at and respectively, and meet in the middle. Also, , since at step we have to make the decision of whether to keep or remove from . Therefore:

By definition of and , we have:

The last step follows because we have something of the shape:

Where and . By the classic definition of submodularity, this is positive.

Next, we’ll need a smooth way of walking between our solution and the optimal one. We’ll do this by introducing , which we define as:

In other words, is obtained by adding all elements in to the optimal solution, and then removing all elements in . This means that is a set that agrees with what the algorithm already has done in the first steps, and that contains the optimal choices in the following positions. For instance, if we let the previous decisions be denoted by a , we have:

Note that and that .

Lemma: Decrease in

At each step :

Intuitively, this tells us that each step brings about a decrease in the value of , but this decrease is always matched by an increase in the value of either or .

If we rearrange the terms of this lemma, we get:

We will just prove this lemma for the case , but the other case is similar. In this case, the algorithm sets and .

By the lemma on , we have . This means that the above inequality is:

Let’s now consider . Since , we take , and thus have . There are two possible cases for this assignment:

  • , in which case . The right-hand side of the lemma is empty and we are done.
  • , in which case . By submodularity, we can replace by any superset to obtain this for the right-hand side of the lemma:

    Multiplying by on both sides, we get the following, as required:

With these lemmas in place, we can get to the approximation theorem for this algorithm.

Theorem: Double greedy approximation

The above algorithm is a approximation for unconstrained submodular maximization.

If we sum the inequality of the lemma on over :

Simplifying this telescoping sum, and using that is non-negative we get:

Rearranging the terms, and using , , we get:

Online algorithms

Online algorithms are slightly different from streaming algorithms. They are similar in that they receive elements one at a time, and do not know the future. But while a streaming algorithm computes some kind of metric over the stream with limited memory, the goal of an online algorithm is to output some decision about each element before receiving the next.

Rental problems

We’ll introduce online algorithms with a very Swiss example, called the “ski rental problem”, or more generally the rent/buy problem.

Suppose we go skiing every winter, and thus need skis. We can either buy a pair for a cost and use them forever, or rent them every winter and pay cost per year. If we knew how many times we would ski in our entire lives up-front, the calculation would be very easy. But the reality of it is that we don’t know in advance. Instead, every winter we must make the choice of buying or renting. If we will ski less than times, then the optimal solution would be to always rent; if we ski more, we should buy from year 1.

An easy algorithm is to rent every winter until we have paid in rent, and then buy at that point. If we ski less than times, this is optimal. If we ski more, it’s within a factor 2 of the optimal solution (which would have been to buy on day 1).
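A small hedged sketch of this break-even strategy, with the cost of buying written as B and the yearly rent as r (names and numbers are ours):

def ski_rental_cost(B, r, seasons):
    """Break-even strategy: rent until the total rent paid would reach B, then buy.
    Returns the total cost paid by the online strategy."""
    cost = 0
    for _ in range(seasons):
        if cost + r < B:      # keep renting while we've paid less than B in rent
            cost += r
        else:                 # buy at this point and stop paying
            cost += B
            break
    return cost

def offline_optimum(B, r, seasons):
    """With hindsight: either rent every season or buy on day 1."""
    return min(B, r * seasons)

# Example: buying costs as much as 10 rents, and we end up skiing 30 seasons
B, r, seasons = 100, 10, 30
print(ski_rental_cost(B, r, seasons), offline_optimum(B, r, seasons))   # 190 vs 100

In this run the online strategy pays 190 against an optimum of 100, a ratio of 1.9, within the promised factor 2.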

Competitive ratio

Note that the factor 2 in the above example isn’t an approximation ratio, but something we call a competitive ratio. A competitive ratio is the performance loss we incur by going to an online algorithm, while an approximation ratio is the loss we incur by going from NP to P.

Definition: Competitive ratio

Suppose we have an online problem and some algorithm that, given an instance of the problem, gives the cost of the solution. Assume is the best possible solution of instance . Then:

is called the competitive ratio of the algorithm.

The competitive ratio is how far off our algorithm is from the optimal solution in the worst case. An alternative definition of competitive ratio allows for some warm-up cost:

Definition: Strong competitive ratio

Assume the same setting as in the definition of competitive ratio above. We call the competitive ratio if:

for all and some constant independent of the length of the sequence . If , is the strong competitive ratio.

Caching

In computers, processors usually have a cache that sits in front of a main memory. When the processor needs to load information, it first goes to the cache. If the page is in cache, it’s a hit; otherwise, it’s a miss, and we go to main memory, and bring the page into the cache. If the cache is full at this point, one page must be evicted to be replaced. To decide which one to evict, a number of different caching algorithms exist; we’ll go into the different options later.

Suppose the cache has a capacity of pages, and the main memory has pages. A sequence of reads could look like this:

With pages, we would fill up the cache with pages 4, 1, 2 on the first three reads. Then we’d read 1, which is a cache hit. Reading 5 would be a cache miss, so we’d read from main memory and bring it in. Which page do we evict at this point though? This is up to the caching algorithm to determine.

Deterministic caching

A few examples of deterministic caching algorithms are:

  • LRU (Least Recently Used): the page that has been in the cache the longest without being used is evicted
  • FIFO (First In First Out): the cache is like a queue, where we evict the page at the head, and enqueue new pages
  • LFU (Least Frequently Used): the page in the cache that has been used the least gets evicted
  • LIFO (Last In First Out): the cache is like a stack, where we push new pages, and pop to evict
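As a hedged illustration of the first two policies, here is a tiny simulator (structure and names are ours) that counts misses for an LRU or FIFO cache on a request sequence in the spirit of the example above:

from collections import OrderedDict, deque

def count_misses(requests, k, policy="LRU"):
    """Simulate an LRU or FIFO cache of capacity k and count the cache misses."""
    misses = 0
    if policy == "LRU":
        cache = OrderedDict()                  # ordered from least to most recently used
        for p in requests:
            if p in cache:
                cache.move_to_end(p)           # hit: p becomes most recently used
            else:
                misses += 1
                if len(cache) == k:
                    cache.popitem(last=False)  # evict the least recently used page
                cache[p] = True
    else:                                      # FIFO
        cache, queue = set(), deque()
        for p in requests:
            if p not in cache:
                misses += 1
                if len(cache) == k:
                    cache.discard(queue.popleft())   # evict the oldest page
                cache.add(p)
                queue.append(p)
    return misses

requests = [4, 1, 2, 1, 5, 3, 4, 4, 1, 2, 3]   # illustrative sequence
print(count_misses(requests, 3, "LRU"), count_misses(requests, 3, "FIFO"))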

Claim: Competitive ratio of LRU and FIFO

LRU and FIFO have a competitive ratio of (where is the size of the cache).

We’ll divide the request sequence into phases as follows:

  • Phase 1 begins at the first page of ,
  • Phase begins when we see the th distinct page in phase .

For instance, for , the following stream would be divided as follows:

We will now show that makes at least one cache miss each time a new phase begins. Let denote the th distinct page in phase . Consider pages to and page (i.e. pages 2 to in phase , and the first page of phase ). These are distinct pages.

If none of the pages to have a cache miss, then must have one (because we’ve now seen more distinct pages than we can fit in our cache). Let be the number of phases. Then we have (the best we can do is no misses in pages 2 to of the phase, and then a miss in the first one).

On the other hand, LRU and FIFO make at most misses per phase, so , and they are thus -competitive.

Claim: Competitive ratio of LFU and LIFO

LFU and LIFO have unbounded competitive ratio. This means that the competitive ratio isn’t bounded in terms of the parameters of the problem ( and ), but rather grows with the size of the input.

Having unbounded competitive ratios is bad. To prove this, we’ll consider an input stream on which they perform particularly badly. Suppose we have a cache of size , which originally contains pages through . Suppose main memory has pages.

Let’s start with LIFO. Consider the request sequence alternating between pages and :

Since is the last page that was put in the cache, it will be evicted to make space for . Then, since was the last page placed in cache, it will be evicted to make space for , and so on. We have a cache miss for each page in the request sequence, whereas an optimal strategy would only have one cache miss.

Now, let’s consider LFU. Consider the same setup, but with the following request sequence, which has a “warm-up phase” that requests pages through , times. Then it alternates between requesting and , times each.

In the warm-up phase, we have no cache misses because we assume the cache to already have values through . Then, page is requested, which is a cache hit for the same reason. From there on out, LFU evicts to make space for and vice versa. This means cache misses, while an optimal strategy would only have a single cache miss, for the first request for page (it would evict any other page than ). Making large allows us to get any competitive ratio we want.

Lemma: Best competitive ratio for caching

No deterministic online algorithm for caching can achieve a better competitive ratio than , where is the size of the cache.

Let be a deterministic online caching algorithm. Suppose the cache has size and currently holds pages to . Suppose is the number of pages in memory.

Since we know the replacement policy of , we can construct an adversary that causes it to miss every element of the request sequence. This adversary doesn’t need all pages to construct a “bad sequence” for the algorithm, but only . The requested element in this “bad sequence” is simply the one that isn’t in the cache at that moment.

In comparison, when the optimal algorithm has a cache miss, it means that the evicted page won’t be requested for the next rounds. This means that has a -competitive ratio (every page compared to one in ).

It may seem counter-intuitive that a policy that can miss on every request is considered to be "the best possible strategy". But the important thing to understand here is that it’s good because it’s only times worse than the optimal solution in the worst of cases, where even the optimal algorithm has some cache misses. The bad caching policies have a large number of misses even when the optimal algorithm has practically none.

Randomized caching

The following randomized caching strategy is known as the marking algorithm. Similarly to LRU, if a cache page is recently used, it’s marked with a bit of value 1; otherwise, it’s unmarked (bit with value 0).

  • Initially all pages are unmarked
  • Whenever a page is requested:
    • If the page is in the cache, mark it
    • Otherwise:
      • If there is an unmarked page in the cache, evict an unmarked page chosen uniformly at random, bring the requested page in, mark it.
      • Otherwise, unmark all pages, start a new phase, and then proceed as above: evict a uniformly random (now unmarked) page, bring the requested page in, and mark it.
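A hedged Python sketch of the marking algorithm just described (helper names are ours):

import random

def marking_algorithm(requests, k):
    """Randomized marking algorithm: returns the number of cache misses."""
    cache = {}          # page -> marked? ; holds at most k pages
    misses = 0
    for p in requests:
        if p in cache:
            cache[p] = True                      # hit: mark the page
            continue
        misses += 1                              # miss
        if len(cache) == k:
            unmarked = [q for q, marked in cache.items() if not marked]
            if not unmarked:                     # everything is marked: new phase
                for q in cache:
                    cache[q] = False
                unmarked = list(cache)
            cache.pop(random.choice(unmarked))   # evict a uniformly random unmarked page
        cache[p] = True                          # bring the requested page in, marked
    return misses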

Let’s state the following about the competitive ratio of this strategy:

Lemma: Competitive ratio of the marking algorithm

The above strategy achieves a competitive ratio of:

The above algorithm is almost the best we can do:

Lemma: Best possible competitive ratio

No randomized online algorithm has a competitive ratio better than

Secretary problem

The following problem is called the secretary problem:

  • candidates arrive in random order
  • When a candidate arrives, we must take an irrevocable decision on whether to hire the candidate.

We’d like to hire the best candidate (we can assume they all have an objective and precise score). Exactly one candidate must be hired. If the algorithm we devise only returns negative decisions, we’ll just pick the last candidate regardless of what the algorithm says (this is equivalent to specifying that the algorithm must give us at most one candidate).

Simple strategies

If we select the first candidate, regardless of their score, we get:

Another strategy would be to interview but not hire the first candidates (“sampling phase”), and then pick whoever is better than the best candidate in the sampling phase (“hiring phase”).

Note that the second inequality holds despite the two events not quite being independent (if we know the second best is in the first half, that takes up a spot in the first half and it’s slightly more likely that the best is in the second half). There are also other ways of fulfilling the event of hiring the best candidate (i.e. if the third best is in the first half, and the best is before the second best in the second half, etc), so the first inequality is not quite tight, but the above bound is good enough for us.

Optimal strategy

We can optimize the previous strategy by changing the point at which we switch from sampling phase to hiring phase. Instead of observing , we can observe . Then:

We want to maximize the above. For large enough , this is at , which gives us a probability of selecting the best candidate of at least , which is optimal!
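A hedged Monte-Carlo sketch of this strategy (parameters and names are ours), which empirically recovers the roughly 1/e success probability when the sampling phase has length about n/e:

import math
import random

def secretary_trial(n, sample_size):
    """One run: n candidates arrive in random order; after the sampling phase we hire
    the first candidate beating the best seen so far. Returns True iff we hire the best."""
    scores = list(range(n))                  # distinct scores; n - 1 is the best candidate
    random.shuffle(scores)
    threshold = max(scores[:sample_size], default=-1)
    for s in scores[sample_size:]:
        if s > threshold:
            return s == n - 1                # irrevocably hire this candidate
    return scores[-1] == n - 1               # nobody was hired: take the last candidate

n, trials = 100, 100_000
sample_size = round(n / math.e)
wins = sum(secretary_trial(n, sample_size) for _ in range(trials))
print(wins / trials)                         # roughly 1/e ≈ 0.37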

Spectral Graph Theory

In spectral graph theory, we look at the eigenvectors and eigenvalues of the (normalized) adjacency matrix of a graph.

Adjacency matrix

Definition: Adjacency matrix

The adjacency matrix of a graph of vertices is a matrix defined by:

for every pair

For this course, we’ll assume without loss of generality that all graphs are -regular (all vertices have edges, degree ). This will greatly simplify notation.

Definition: Normalized adjacency matrix

The normalized adjacency matrix of a -regular graph is , where is the adjacency matrix.

This matrix is also called the random walk matrix of the graph. To see why, consider the following graph:

(Figure: the cycle graph on vertices A, B, C, D, with edges A-B, B-C, C-D and D-A.)

The normalized adjacency matrix will look like this:

Consider that we currently stand at . Our position can be summarized by the following position vector:

Notice now that is the probability distribution of our position after a single step:

More generally, is the probability distribution after steps.
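A hedged numpy sketch of this on the 4-cycle from the figure above (note that this particular graph is bipartite, so the walk alternates between the two sides rather than converging):

import numpy as np

# Adjacency matrix of the cycle A - B - C - D - A (2-regular), vertices ordered A, B, C, D
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
d = 2
M = A / d                                    # normalized adjacency / random walk matrix

p0 = np.array([1.0, 0.0, 0.0, 0.0])          # we start at vertex A
print(M @ p0)                                # after one step: [0, 0.5, 0, 0.5]
print(np.linalg.matrix_power(M, 3) @ p0)     # after three steps, still on {B, D}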

Eigenvalues and eigenvectors

Definition: Eigenvalues and eigenvectors

A vector is an eigenvector of a matrix , with eigenvalue , if:

We’ll state the following as a fact of linear algebra. We can use this because our normalized adjacency matrix is real and symmetric.

Lemma: Eigenvalues

If is symmetric, then:

  1. has real eigenvalues (not necessarily distinct)
  2. If are eigenvectors for then . If there are multiple vectors satisfying the above, then any such vector can be selected to be the eigenvector corresponding to .

The second point in particular means that we can always find an orthonormal basis, corresponding to the eigenvectors.

Relating eigenvalues to graph properties

Let’s state the following observation without proof. This will be very important for the rest of this section.

Lemma: Observation on product with normalized adjacency matrix

Consider a vector , which assigns a value to each vertex . Let , where is the normalized adjacency matrix of a graph . Then:

That is, the value that assigns to a vertex is the average of the values assigned to its neighbors. Using this observation, we can prove the following properties:

Lemma: Eigenvalues and graph properties

Let be the normalized adjacency matrix of a -regular graph and let be its eigenvalues. Then:

  1. is disconnected
  2. one component of is bipartite

Let’s prove these properties.

Proof of 1

We’ll prove this in two steps: and .

Since , is an eigenvalue, and since is the greatest eigenvalue,

Additionally, we consider any eigenvector and vertex such that is maximized. We let . Then, by our observation

The inequality follows from the fact that is the maximal value in the vector . From this inequality , we conclude that (as since we considered to be an eigenvector).

This proof not only tells us that , but also that we can select , which we will do from now on24.

Proof of 2

We will show that there is a vector such that iff is disconnected.

Let’s start with the direction. Suppose is disconnected. This means that there is a subset of vertices that are not connected to vertices in . Let be defined by:

Notice that , where . We now show . To do so, we fix an , and let :

The first step uses the observation we previously noted. The second uses the fact that every neighbor of has , by the definition we gave of (wherein vertices only have different values when they are not neighbors).

This shows that if is disconnected.

Now, let’s prove the direction. We’ll prove the contrapositive, so suppose that is connected. Let be an eigenvector corresponding to the second eigenvalue . We have . We must now show .

Since where , cannot assign the same value to all vertices. Therefore, as is connected, there must be a vertex that has at least one neighbor for which . Select such a vertex that maximizes . By selection of we have:

Since doesn’t have the same value for all vertices, for at least one neighbor of , we have . Again, we let . It follows that

It follows that . Since we were proving the contrapositive, this successfully proves the direction.

Proof of 3

First, let us prove the direction, namely that bipartite . For that, it suffices to find a vector such that . Suppose is a bipartite graph. Let be defined as follows:

Recall our observation of . It says that is the average of the neighboring . Now, vertices in only have neighbors in , and vertices in only have neighbors in . Therefore, by multiplying by this vector , vertices in get value , and vertices in get . In other words:

Now, let us prove the direction, namely that one component of must be bipartite.

Let be the eigenvector associated to . We therefore have . Let be the vertex maximizing . Let be the signed value behind the maximal absolute value. Since , by our observation, it means that the neighbors of must all have value . The same goes for the neighbors of , whose neighbors must have value , and so on.

This means that we can split the graph into a vertex set of nodes with value , and a set of nodes with values . All nodes only have neighbors in and vice versa. Therefore, is bipartite, with .

Mixing time of random walks

As we said earlier, we can use the normalized adjacency matrix to get the probability distribution after taking a single step by computing . If we want the probability distribution after steps, we compute .

Intuitively, mixing time is about how many steps one must take to have a uniform probability of being at any given node . More formally, the question is which should be picked to have close to a uniform distribution.

The mixing time is highly dependent on graph structure. Some observations:

  • If the graph is not connected () then we will only reach vertices in the same component as the starting node and will never come close to the uniform distribution, no matter the choice of .
  • If the graph is bipartite () then the random walk will alternate between left and right, and will never converge to a uniform distribution (the side we are currently on will only have 0 components in ).

To fix the second point, we can do a “lazy random walk” in which we have a probability of staying where we are at each step.

In any case, the above observations tell us that the mixing time has to do with the values and . The following lemma tells us that the quantity is important to mixing time:

Lemma: Mixing time

Let be a -regular graph and let be the eigenvalues of the normalized adjacency matrix of .

If then no matter from which vertex we start, we will be at any vertex with probability after steps.

More precisely, if is the vector of our starting position (a one-hot vector taking value 1 only on the vertex on which we started), then:

when for some constant .

This lemma tells us that after steps, we’ll be within a distance from a truly uniform distribution. The parameter is controlled by and , so the number of steps to do will depend on the graph structure, as anticipated.

To prove this, let , i.e. we assume (w.l.o.g) that the vertices are ordered and our start vertex is first.

Let be the eigenvectors corresponding to the eigenvalues . These form an orthonormal basis, meaning that we can write in terms of this basis:

where .

Let’s use this to determine how long it takes for the random walk to be mixed; to get a quantitative measure of this, we measure the distance to a uniform distribution. We could use any norm, but we choose the 2-norm here.

Seeing that , we have in particular that . Our goal of can be written , where is the normalized first eigenvector. In practice, this means that we can also write our goal in the eigenvector basis, and thus subtract it from the sum resulting from . This will be step below.

Step stems from the fact that the eigenvectors are orthogonal. Step comes from our assumption in the lemma that , which means that for all . Finally, step follows from the orthogonality of the eigenvectors.

If we take and let , we have:

With , we get the result we wanted.

Conductance

Definition: Conductance

Let be a -regular graph with vertices. We define the conductance of a cut :

where denotes the set of edges crossing the cut. We also define the conductance of the graph :

In other words, the conductance of a cut is the number of edges crossing the cut, divided by the number of edges that could cross the cut (if the smaller component’s vertices all used their edges to cross the cut, and not internally).

The conductance of the graph is that of the cut with the smallest conductance.
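A hedged sketch of this definition in code, using the d · min(|S|, |V \ S|) normalization described above (names are ours; the brute-force version is only feasible for tiny graphs since it enumerates all cuts):

from itertools import combinations

def cut_conductance(S, edges, d, n):
    """h(S) = |E(S, V \\ S)| / (d * min(|S|, n - |S|)) for a d-regular graph on n vertices."""
    S = set(S)
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / (d * min(len(S), n - len(S)))

def graph_conductance(vertices, edges, d):
    """h(G): smallest conductance over all non-trivial cuts (exponential time)."""
    vertices = list(vertices)
    n = len(vertices)
    best = 1.0
    for size in range(1, n // 2 + 1):        # h(S) = h(V \ S), so half the sizes suffice
        for S in combinations(vertices, size):
            best = min(best, cut_conductance(S, edges, d, n))
    return best

# The 4-cycle A-B-C-D-A again: the cut {A, B} is crossed by 2 edges, so h = 2 / (2 * 2)
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(cut_conductance({"A", "B"}, edges, d=2, n=4))   # 0.5
print(graph_conductance("ABCD", edges, d=2))          # 0.5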

Cheeger’s inequalities

Note that a disconnected graph has conductance 0, and that a fully connected graph has conductance 1. Indeed, the conductance is closely related to . Cheeger’s inequalities give us a quantified version of the fact that is disconnected:

Theorem: Cheeger’s inequalities

We’ll only prove the lower bound. The lecture notes contain a very long and tedious proof of the upper bound. For the lower bound, it’ll be useful to introduce an alternative way to define eigenvalues of , namely as an optimization problem on the Rayleigh coefficient .

Lemma: in Rayleigh form

We’ll prove this by upper-bounding and lower-bounding . Let’s start with :

This proves the upper bound, because if is equal to some value of the maximization problem, it’s less than the maximal one.

For the lower bound , we let be the vector that attains the maximal value. Since is a basis, we can write in that basis using factors :

Then:

The inequality follows from the “tallest person in the class” argument which we’ve used previously.

Lemma: in Rayleigh form

Let be the eigenvector corresponding to . Then:

The proof is very similar to that of the previous lemma. We will first upper-bound . Let be the eigenvector associated to . We have , so:

Once again, this is because the maximum is greater than any value in the maximization problem.

The lower bound of is a little harder. Let’s look at the search space of this maximization problem:

Let be the vector that attains the maximum objective value in this search space. has dimension (seeing that a degree of freedom is removed by the constraint of orthogonality to ), so we can find a basis for ; these are the eigenvectors corresponding to the eigenvalues . With that basis, we can describe vectors in by using factors :

With this in mind, we now have:

The inequality step holds because a single term of a sum of non-negative terms is at most the whole sum.

With this proven, we can go on to prove the lower bound of Cheeger’s inequality. Because , we consider that the associated eigenvector is . By the above lemma on , we have:

The matrix is the normalized Laplacian matrix. This is another matrix that is nice to study in spectral graph theory. With this matrix, we have the following identity:

Using this in the previous equality, we get:

Note

What happens if we let be a vector taking value 1 for vertices in , and 0 everywhere else?

Numerator. To determine the value of the numerator, let’s look at the three possible cases in the sum:

  • Edges where have
  • Edges where have
  • Edges where and (or vice versa) have

So in total, only edges crossing the cut count towards the total, and the numerator thus sums up to .

Denominator. For the denominator, . Again, since the values are all 0 or 1, this sums up to .

Result. In summary, we get:

Notice that this is the formulation of conductance. This is obviously a special case of the previous more general formulation, which we can consider as a continuous relaxation of the cut problem. That is indeed the intuition behind Cheeger’s inequalities. Let’s therefore generalize this result for the continuous case.

We consider (w.l.o.g.) that (otherwise we could swap the set we consider to be called , i.e. consider ). We define as follows:

We define it this way to make sure that (recall that ). Hence:

Indeed, if is equal to the minimal value, it’s less than any value that is part of the minimization problem. Let’s now analyze what this fraction is.

Numerator. With our selection of , the numerator is actually equal to :

This is because, as before, all edges not in the cut cancel out. Only edges in the cut contribute by 1, so the final result of the sum is the number of edges in the cut.

Denominator. For the denominator, we have:

The last inequality holds because holds by assumption.

Result. Combining the two results above, we get:

It follows that the conductance of every cut is at least , so as required.

Spectral partitioning algorithm

The spectral partitioning algorithm outputs a cut that cuts relatively few edges, with a small conductance.

def spectral_partitioning(G, v2):
    """
    Input   G   graph G = (V, E)
            v2  second eigenvector of the normalized adjacency matrix

    Output  S   cut with low conductance
    """
    V, E = G
    # Sort vertices in non-decreasing order of their value in v2:
    vertices = sorted(V, key=lambda v: v2[v])

    # Find the non-empty, proper prefix cut of smallest conductance,
    # where h(S) denotes the conductance of the cut (S, V \ S):
    minimum = 1
    best_cut = []
    for i in range(1, len(vertices)):
        cut = vertices[:i]
        if h(cut) < minimum:
            minimum = h(cut)
            best_cut = cut
    return best_cut

Assuming that we are given eigenvalues and eigenvectors, sorting is the bottleneck here; everything else is in linear time. This algorithm is therefore .

Note that the upper bound of Cheeger tells us that the returned set satisfies .

  1. Note that in the Algorithms course we saw it for minimum spanning trees, but that we’re now looking at a variation for maximum spanning trees. These are equivalent problems, because we can just take instead of

  2. As we said previously, the spanning tree has edges. 

  3. We use the notation as a shorthand for  

  4. We cannot create a cycle by removing edges 

  5. Maximal means that we cannot add any elements to the set and retain independence. 

  6. Suppose toward contradiction that two sets and are maximal, but that . By , , which means that isn’t maximal, which is a contradiction. 

  7. If we want to prove we can equivalently prove the contrapositive

  8. Maximum cardinality is a special case of maximum weight, where all the weights are equal. 

  9. A polytope is a geometric object with flat sides, which is exactly what linear constraints form. 

  10. This particular example is a maximization problem, but in general we’ll be considering minimization problems. It all still holds, just don’t get confused if things are switched in this example. 

  11. The Hungarian algorithm bears its name in honor of Kőnig and Egerváry, the two Hungarian mathematicians whose work it is based on. 

  12. A -uniform hypergraph is defined over a vertex set and an edge set , where each edge contains vertices. A normal graph is just a 2-uniform hypergraph. 

  13. A negative value of thus means that it was profitable to follow the expert’s advice. 

  14. A more careful analysis can tell us to use

  15. There’s a Friends reference to be made here, but I wouldn’t ever be the type of person to make that kind of joke 

  16. No matter the sort algorithm, the parity of the number of swaps to sort is always the same 🤯 

  17. This is not hugely important for the following discussion, so you can safely disregard it if it doesn’t make a lot of sense. Just understand that some terms may cancel out because of the sign of the permutation. 

  18. See the isolation lemma from the 1987 paper “Matching is as easy as a matrix inversion” by Mulmuley, Vazirani and Vazirani. 

  19. We obtain this by setting and . This tells us the probability that our empirical average is off by more than , which is

  20. For reasons we won’t go into, it just turns out that this choice of makes the upper bound for the tail probability as small as possible 

  21. Note that Chebyshev bounds the probability of being further than from in either direction. This is a little overkill, because we’re only interested in one direction. There may be room to make this bound tighter, but it’s good enough for us. 

  22. 4-independence implies 2-independence. We can then use the lemma for expectation of the product of random variables to “push down” expectation. 

  23. Telescoping means that if we expand the sum, everything cancels out except the very first and last terms. 

  24. Eigenvectors cannot be the zero vector. Any scalar multiple of an eigenvector is also considered an eigenvector, so we can normalize eigenvectors without loss of generality. For instance, we can consider or if we need to. 
