Pie Chart Callout Layout with Graph Algorithms

Introduction

Recently I dug up some notes from 2 years ago, about a problem I encountered at a previous job. It is a challenge of responsiveness layout for a simple visualization type: pie chart. Here’s a quick demo:

Try resize the chart area (make it smaller)
Notice not all pie slices have a callout labels displayed
Labels are dynamically removed if available areas are too small

The code for this d3 chart is adapted from: http://bl.ocks.org/dbuezas/9306799

We had found a pretty good divide and conquer solution, and iterated on it quite a bit. However, going through my notes, I have come up another formulation for this problem, albeit, with worse time complexity. This new idea is to model it as a graph, I find it quite neat (though I am certain this problem and its more generalized forms have been well-covered in CS literatures).

I wrote this version of the solution in rust and WebAssembly, and it is used in the demo above.

Problem Description

For a simple pie chart like above, we have two columns of callout labels. Notice on each side, labels are aligned horizontally (either left or right-aligned). For either side, we want the following:

Labels should never overlap, regardless of available screen estate (height-wise)
We’d like to keep as many “important” labels as possible - by some criteria of importance

In context of pie charts, it makes sense we would prefer labels for slices with larger arcs. People probably usually want to know about components with the largest shares among the total. Alternatively, we may want to ignore the share of pie for each label, and aim to maximize number of labels kept.

In both situations, we could provide a weight function for each label (using integer as weights for simplicity): . For maximizing remaining number of labels , this function is simply a constant function: . The inputs (labels) can be represented as a set of 1-d intervals, corresponding to their y coordinates. Given a set of weighted labels , we want to find a non-overlapping subset of :

where

Here we define operator as non-overlapping binary relation, . We do not require the intersection of two intervals to be empty for them to be considered non-overlapping, they could share a single boundary point (eg. and are considered non-overlapping).

Simple Algebraic Properties of Intervals

The operator defined here is not a very nice binary relation. It is symmetric, that is, . But it is not transitive, . Here’s an example:

does not overlap with , and does not overlap with , but does overlap with .

On the other hand, we could discard symmetry, and try to come up with another relation for intervals that is in fact transitive. This is also very straight-forward, let’s call this new relation , defined as:

Clearly, . Also, . The relation states the right hand side interval strictly follows the left hand side without any overlaps. And it’s easy to verify is transitive. Furthermore is not defined for two overlapping intervals.

What’s good about , well, it is a strict partial order:

It’s irreflexive, that is is not defined for any interval (an interval always overlaps with itself)
It is antisymmetric, if , then must not hold
It is transitive

The Algorithm

A partial order relation is closely tied to DAGs (Directed acyclic graph). Any finite strict partially ordered set can be represented as a DAG. Note our input intervals is exactly this. Recall the desired output must satisfy:

Expressed using :

Looking at it this way, we are basically looking for a chain or a strictly total ordered subset of the input partially ordered set. Stated in graph terms, we are looking for a path in a given DAG.

With this realization, now it’s time to construct this DAG:

Each node in the graph is an interval, it has a non-negative weight
Each edge corresponds to the relation, that is node_i connects to node_j if and only if node_i node_j
In addition, we add two dummy nodes:
- start_node, it has a weight 0, and connects to all the nodes in the input set
- end_node, it has a weight 0, and all input nodes connect to it

To illustrate, this set of three labels:

Translates into this graph:

The problem becomes finding a path from start_node to end_node, with largest sum of node weights - a variant of the longest path problem.

A quick google search for “longest path graph” points to this link: http://www.mathcs.emory.edu/~cheung/Courses/171/Syllabus/11-Graph/Docs/longest-path-in-dag.pdf

It gives a very succinct solution, keyword dynamic programming:

Conduct a topological sort on the graph, to find out a linearized order of its nodes
Finding a path with largest node weights ending at node_i requires finding paths ending in all its predecessors node_j with same property up to j (based on the linearized order)
Pick the path from node_j that has the largest accumulated weight w_j, and mark the total weight up to node_i as w_j + weight(node_i)
Continue until reaching end_node

Pausing for a second, it’s very easy to identify this is essentially the same dynamic programming solution to the longest increasing sequence problem - a popular algorithm tutorial topic. There is a more efficient () solution, but it’s a bit over my head. It would be interesting to know if that could be adapted for the labels layout challenge at hand.

Brief Look at the Code

I chose rust for implementing this for two reasons:

I have been learning rust for quite a while now, it’s fun and addicting
rust has a PartialOrd trait that’s particularly suitable in expressing this problem

To illustrate the second point, I used this trait to represent an input label/weighted interval:

trait Weighted: PartialOrd {

    fn weight(&self) -> u32;

}

Three lines of code and that’s it, all the solution code just needs to work with this trait in a generic manner, and does not even need to have any notion of intervals or labels.

Here is the entry function, it returns a list of indices to keep:

fn solve<W: Weighted>(inputs: &[W]) -> Vec<uszie> {

    let g = build_graph(inputs);

    g.longest_path()

}

struct Graph {

    nodes: Vec<Node>

}

impl Graph {

    fn longest_path(mut self) -> Vec<usize> {

        //...

    }

}

fn build_graph<W: Weighted>(inputs: &[W]) -> Graph {

    // ..

}

Graph is a simple adjacency list representation storing a Vec of nodes, referencing each node with its index.

Node struct:

struct Node {

    predecessors: IntHashSet<usize>,

    weight: u32,

    dist: u32,

    path_predecessor: Option<usize>,

}

Field predecessors is a list of indices for its predecessor nodes, I am using an IntHashSet here not necessarily for any performance reason, but for making certain part of the code a little less verbose, I suspect a plain Vec might actually perform better for realistic work loads.

All nodes’ dist field are initialized to be 0, and its path_predecessor as None. Calling longest_path method consumes the graph, runs the dynamic programming loop, and populates these fields for each node in linear (O(V + E)) time.

Topological Sort

I have tried a few graph algorithm crates for this task, and ended up picking pathfinding, as it is easiest to use, and has an unassuming api - just a plain function, not requiring any specific graph representation or traits implementations.

These few lines of code does the job:

impl Graph {

    fn longest_path(mut self) -> Vec<usize> {

        //...

        let mut sorted = topological_sort(&node_ids, |node_id| {

            let node = &self.nodes[*node_id];

            node.predecessors.iter().map(|id| *id)

        }).unwrap();



        sorted.reverse();



        // dynamic programming loop

        for i in 0..n {

            self.mark_longest_path_for(sorted[i]);

        }        

        // ...

    }

}

Dynamic Programming

This is the method solving a single sub-problem, assign a path_predecessor for a single node, and update its accumulated path weight (node.dist).

impl Graph {

    fn mark_longest_path_for(&mut self, node_id: usize) {

        // start_node

        if node_id == 0 {

            self.nodes[0].dist = 0;

            self.nodes[0].path_predecessor = None;

            return;

        }



        let pred_dist;

        let path_predecessor;

        {

            let (i, pred_node) = self

                .predecessors(node_id)

                 // pick predecessor with largests dist

                .max_by_key(|(_, node)| node.dist) 

                .unwrap(); // assume all sub-problems before have been solved



            pred_dist = pred_node.dist;

            path_predecessor = i;

        }



        let node = &mut self.nodes[node_id];

        let weight = node.weight;



        node.dist = pred_dist + weight;

        node.path_predecessor = Some(path_predecessor);

    }

}

Implementing Traits for Intervals

Finally with the functions ready, we just need to implement PartialOrd and Weight for intervals, first, a simple representation of Interval:

#[derive(Debug, Copy, Clone, PartialEq)]

pub struct Interval {

    lower: f64,

    upper: f64,

    weight: u32,

}

There is no StrictPartialOrd trait in rust, thus we’d settle for using PartialOrd, which requires PartialEq, and thankfully it’s auto-derive-able. Next, we provide our partial_cmp implementation,

impl PartialOrd for Interval {

    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {

        match self.upper.partial_cmp(&other.lower) {

            Some(Less) | Some(Equal) => return Some(Less),

            _ => {}

        };

        match other.upper.partial_cmp(&self.lower) {

            Some(Less) | Some(Equal) => return Some(Greater),

            _ => {}

        };



        None

    }

}

Notice, partial_cmp has a type signature of -> Option<Ordering>, but only ever returns one of these three:

Some(Ordering::Greater)
Some(Ordering::Less)
Node

And never Some(Ordering::Equal). Thus we are using it at value level as a true strict order. In addition, it is also NaN safe, using f64’s own PartialOrd trait implementation to guard against it.

Oh, and finally,

impl Weighted for Interval {

    fn weight(&self) -> u32 {

        self.weight

    }

}

Testing with Quickcheck

When it comes to testing, the first thing we’d like to verify: the solution set indeed does not overlap. Recall from earlier, this condition is formularized using relation:

In terms of code, this translates quite directly. Here, we use tuple_combinations from itertools crate to enumerate over pairs of indices.

fn path_overlaps<W: Weighted>(xs: &[W]) -> bool {

    let n = xs.len();

    if n < 2 {

        return false;

    }

    for (i, j) in (0..n).tuple_combinations() {

        match xs[i].partial_cmp(&xs[j]) {

            Some(Greater) | Some(Less) => {}

            _ => return true,

        }

    }

    false

}

Quickcheck is a very useful and fun crate for randomized property testing. Here, we will use it to generate random sets of Intervals with arbitrary positions and sizes. Next we check each solution produced from them does not overlap, using path_overlaps function we just wrote.

quickcheck! {

    fn removes_overlapping_spans(spans: Vec<WeightedSpan>) -> bool {

        let g = build_graph(&spans);

        let path = g.longest_path();

        let retained_spans: Vec<_> =

            path.into_iter().map(|i| spans[i]).collect();



        !path_overlaps(&retained_spans)

    }

}

For this to work, we need to implement the Arbitrary trait from quickcheck - basically to specify how to randomly generate individual WeightedSpans. Since quickcheck implements Arbitrary for Vec<T> if T: Arbitrary, it is sufficient to perform this test.

Next, we would also like to know our algorithm produces a solution with best possible total weight. This is hard to verify for arbitrary inputs, however, we could do a comparison with a brute force solver for small enough input sizes. The brute force approach is of course, to check all subsets of input spans.

I have used SmallVec<[WeightedSpan; 16]> for this task. smallvec crate provides array backed Vec-like structs that is suitable for this use case. To enumerate through all subsets for any collection of arbitrary 16 spans, we use all u16 values (65536 total) to serve as bit masks. This function below performs the exhaustive search using some basic iterator operations.

fn solve_brute(spans: SmallSpans<WeightedSpan>) -> SmallSpans<WeightedSpan> {

    (0..u16::max_value())

        .map(|mask| spans.subset(mask))

        .filter(|subset| !path_overlaps(subset.as_slice()))

        .max_by_key(|subset| score(subset.as_slice()))

        .unwrap_or(SmallSpans {

            spans: SmallVec::new(),

        })

}

Since previous quickcheck test ensures our algorithm does not produce any solution that overlaps, this time, we just need to check its total weight is optimal. There’s no need to require an identical solution to the brute force outcome, as there could be multiple solutions sharing the same score.

quickcheck! {

    fn check_correctness(spans: SmallSpans<WeightedSpan>) -> bool {

        let g = build_graph(spans.as_slice());

        let path = g.longest_path();

        let retained_spans: Vec<_> =

            path.into_iter().map(|i| spans.spans[i]).collect();



        let solution = solve_brute(spans);



        let score_1 = score(retained_spans.as_slice());

        let score_2 = score(solution.as_slice());



        score_1 == score_2

    }

}

With relative little effort, we now have a fairly good test suite that gives us some confidence about the algorithm. In fact, when making the demo, it worked the first time after I got all rust tests to pass. In general, I had way more bugs on js side creating it, and the wasm code has been much more solid.

End Note

If you finished reading this, I hope it has been interesting. Whenever someone claims frontend developers don’t need algorithm or data structure knowledge, this has been my go to counter example. The code for this small project is on github here. It was a fun weekend exercise.