Automine¶
Mawhirter, Daniel, and Bo Wu. "Automine: harmonizing high-level abstraction and high performance for graph mining." Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.
Link:https://dl.acm.org/doi/abs/10.1145/3341301.3359633
Review¶
Questions¶
- without friendly interfaces
- large intermediate data
- do not consider the subgraph pattern
facing two conflicting goals:
- providing a high-level abstraction
- high performance
Now solution¶
Their performance is far from ideal,
medium sized graph that has 1.1 million edges takes 43 sec(Arabe)
Arabesque¶
A distributed system, purposed "think like an embedding" paradigm
- high overhead for small graphs
- require enterprise clusters for large graphs
RStream¶
A single-machine graph mining system
Contribution¶
- both high level abstraction & high performance(however, I didn't see the high level abstraction)
- propose a space-efficient representation of embeddings(lays the foundation for automatic mining algorithm)
- some modeling & optimization techniques to generate efficient programs(in C++)
Code¶
Base API¶
Pattern definePattern(Edge[] edgelist);
Program countPatterns(Pattern[] patterns);
Program enumeratePatterns(Pattern[] patterns);
There are three applications that they implement:
- Clique(团) Counting(CC)
- Motif Counting(MC)
- Frequent Subgraph Mining(FSM)
definition of triangle pattern like: Pattern p = definePattern([(a,b), (b,c), (c,a)]);
Principle¶
A Simple Triangle Counting¶
Some properties:
- exploit the local structure of the input graph
- can discard the edge embedding immediately (found in the same loop)
- each vertex od the discovered set corresponds to a distinct triangle incident on \((v_0,v_1)\)
- the intersection operation performed on the neighbour set of \(v_0\) & \(v_1\) generates the structure
- [not very understood]
Algorithm¶
\(\mathcal P_k\) is the pattern on \(k(k>2)\) vertices.
\(\mathcal P_{k-1}\) is a sub-pattern of \(\mathcal P_k\) .
\(E_{\mathcal P_k}\) is a instance of \(P_k\)'s embeded graph, composing of vertices \((v_0, \cdots, v_{k-1})\) .
They introduce function \(\mathcal F^k(E_{\mathcal P_{k-1}})\), satisfying:
- return: \(\mathbf V_k\) , which is all nodes that can extends \(E(\mathcal P_{k-1})\) to \(E(\mathcal P_{k})\)
- process: operation can be applied only on the neibour sets of the vertices in \(E_{\mathcal P_{k-1}}\).
They prove that:
in which \(\mathbf V_T\) is vertices that are neighbours to the vertex that extends \(\mathcal P_{k-1}\) to \(\mathcal P_{k}\), \(\mathbf V_T\) & \(\mathbf V_F\) is a partition of \(E_{\mathcal P_{k-1}}\) 's vertices..
For example, if we want to count a tailed triangle, the schedule can be:
- \(\mathcal F^2 = \mathcal N(v_0) \cap \mathcal N(v_1)\)
- \(\mathcal F^3 = \mathcal N(v_2) - \mathcal N(v_0) - \mathcal N(v_1)\)
Then, we call the sequence \(\{\mathcal F^1, \cdots, \mathcal F^k\}\) a schedule.
There is a one-to-one mapping between the sequence & the schedule.
They use a colored(black for existing & red for absent ones) complete graph to present the pattern.
Then, they give every edge a direction to make it a tournament graph.
This tournament graph encodes define the sequence \(\{\mathcal F^1, \cdots, \mathcal F^k\}\).
As long as this tournament graph is acyclic, we can get a topology order, which is exactly the "adding" order, or the "schedule".
Notice that, it is the color that distinguish graph from isomorph.
We can iterate over all possible tournaments to search in the schedule space.
Multiplicity¶
Code Generating¶
By multiple loops that applies \(\mathcal F^k\) to vertices set (???), we can count in the innermost set.
Multi-Pattern scheduling¶
These patterns can share the same prefixes of their own sequences \(\{\mathcal F^1, \cdots, \mathcal F^k\}\) .
Other Features¶
Parallelization: using OpenMP
Memory Management: using VertexSet Class, a readonly reference;
Set Operation: also using VersexSet class
Out of core Processing: can work on single machine