# A Mathematical Introduction to Concept Graphs and Concept Dependency

A key distinction in the TechKnAcq project is the notion of a "concept graph". We describe our core formulation of this idea on this page.

## Preliminary Definitions

Co-occurrence: Given two concepts $c_i$ and $c_j$, if document $d$ is in the top-$k$ relevant documents for both $c_i$ and $c_j$, we say that $c_i$ and $c_j$ co-occur in the document $d$.

Cross entropy: Cross entropy measures the difference between two distributions. Specifically, the cross entropy for the distributions $X$ and $Y$ over a given set is defined as

$H(X; Y) = H(X) + D_{KL}(X \| Y)$

where $H(X)$ is the entropy of $X$, and $D_{KL}(X \| Y)$ is the Kullback–Leibler divergence of an estimated distribution $Y$ from the true distribution $X$. Therefore, $H(X; Y)$ measures how well the distribution of $Y$ approximates that of $X$: the better the approximation, the lower the cross entropy.
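A small numeric check of this decomposition, using toy three-outcome distributions in plain Python:

```python
import math

def entropy(p):
    """H(X) = -sum_x P(x) log2 P(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(X || Y): the extra bits paid for modeling X with Y."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(X; Y) = -sum_x P(x) log2 Q(x)."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p, q = [0.5, 0.25, 0.25], [0.25, 0.5, 0.25]
print(cross_entropy(p, q))               # 1.75 bits
print(entropy(p) + kl_divergence(p, q))  # same value, via the decomposition
```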

Joint entropy: Joint entropy measures the information obtained when we observe both $X$ and $Y$. The joint Shannon entropy of two variables $X$ and $Y$ is defined as:

$H(X, Y) = -\sum_{x} \sum_{y} P(x, y) \log_2 P(x, y)$ where $P(x, y)$ is the joint probability of these values occurring together.
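For instance, two independent fair binary variables have four equally likely joint outcomes and hence a joint entropy of 2 bits:

```python
import math

def joint_entropy(joint):
    """H(X, Y) = -sum over (x, y) of P(x, y) log2 P(x, y),
    with `joint` given as a 2-D table of joint probabilities."""
    return -sum(p * math.log2(p) for row in joint for p in row if p > 0)

# Two independent fair coins: four outcomes at probability 0.25 each.
print(joint_entropy([[0.25, 0.25], [0.25, 0.25]]))  # 2.0
```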

Jaccard similarity coefficient: The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
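This is straightforward to compute on sets; the only edge case is two empty sets, which we take to have similarity 0:

```python
def jaccard(a, b):
    """|A intersect B| / |A union B|; 0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(jaccard({"tree", "parse", "grammar"}, {"tree", "parse", "token"}))  # 0.5
```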

## Approaches to Computing Concept Dependency Relations

### Co-occurrence

The more often two concepts, $c_i$ and $c_j$, co-occur, the more strongly correlated the concepts are. We add an undirected edge between them when they co-occur in more than some threshold number of documents.
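A minimal sketch of this edge-building step, assuming each concept is represented by its list of top-$k$ relevant document IDs (the concept names and IDs below are hypothetical):

```python
from itertools import combinations

def cooccurrence_edges(top_docs, threshold):
    """Add an undirected edge (c_i, c_j) whenever the two concepts'
    top-k relevant-document sets share more than `threshold` documents."""
    edges = set()
    for ci, cj in combinations(sorted(top_docs), 2):
        if len(set(top_docs[ci]) & set(top_docs[cj])) > threshold:
            edges.add((ci, cj))
    return edges

top_docs = {
    "parsing": ["d1", "d2", "d3", "d4"],
    "tagging": ["d2", "d3", "d4", "d5"],
    "robotics": ["d8", "d9"],
}
print(cooccurrence_edges(top_docs, threshold=2))  # {('parsing', 'tagging')}
```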

### Word similarity

The more overlap there is in the words or phrases that are most strongly associated with each concept, the more likely it is that one concept can substitute for another. Given a pair of concepts, we calculate the Jaccard similarity coefficient between the sets of the top-20 words or phrases associated with each one.

### Document similarity

The more overlap there is in the documents that are most strongly associated with each concept, the more likely it is that one concept can substitute for another. Given a pair of concepts, we calculate the Jaccard similarity coefficient between the sets of the top-$k$ documents associated with each one.

### Cross-entropy

Given two concepts $c_i$ and $c_j$, the cross-entropy approach predicts that $c_i$ depends on $c_j$ if they satisfy these conditions:

1. The distribution of $c_i$ is better approximated by that of $c_j$ than the distribution of $c_j$ is approximated by that of $c_i$.

2. The co-occurrence frequency of instances of $c_i$ and $c_j$ is relatively higher than that of a known non-dependency pair.

Where each concept is represented as a distribution over its top-$k$ relevant documents, we say $c_i$ depends on $c_j$ if and only if they satisfy the following conditions:

$H(c_i; c_j) < H(c_j; c_i) \quad \text{and} \quad H(c_i, c_j) > \theta$

with $\theta$ as a threshold value, which can be interpreted as "the average joint entropy of any pair of non-dependent concepts". The weight of the dependency is defined as the asymmetry between the two cross entropies:

$w(c_i \to c_j) = H(c_j; c_i) - H(c_i; c_j)$
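As a sketch, one plausible reading of the two conditions — assuming condition 1 compares the two cross entropies and condition 2 compares the pair's joint entropy against $\theta$; the distributions below are toy values:

```python
import math

def cross_entropy(p, q):
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def joint_entropy(joint):
    return -sum(p * math.log2(p) for row in joint for p in row if p > 0)

def depends_on(p_i, p_j, joint_ij, theta):
    """c_i depends on c_j when (1) c_j approximates c_i better than the
    reverse, i.e. H(c_i; c_j) < H(c_j; c_i), and (2) the pair co-occurs
    more than a typical non-dependent pair, i.e. H(c_i, c_j) > theta.
    (This decision rule is an assumption read off the prose above.)"""
    better_approx = cross_entropy(p_i, p_j) < cross_entropy(p_j, p_i)
    return better_approx and joint_entropy(joint_ij) > theta

p_i = [0.5, 0.25, 0.25]          # toy document distribution for c_i
p_j = [1 / 3, 1 / 3, 1 / 3]      # toy document distribution for c_j
joint_ij = [[0.25, 0.25], [0.25, 0.25]]
print(depends_on(p_i, p_j, joint_ij, theta=1.5))  # True
print(depends_on(p_j, p_i, joint_ij, theta=1.5))  # False: the relation is asymmetric
```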

The cross-entropy method is general and can be applied to different distributions used to model concepts, such as the distributions of relevant words, of relevant documents, or of the documents that are cited by relevant documents.

This use of cross-entropy to derive asymmetric weights in a concept graph was inspired by this paper:

> Intagorn, S. and Lerman, K. (2012) 'Probabilistic Approach to Mining Geospatial Knowledge from Social Annotations', CIKM '12, Maui, HI, USA.

### Information flow

Given two concepts $c_i$ and $c_j$, the information flow score between $c_i$ and $c_j$ evaluates the likelihood of reaching $c_j$ when starting a random walk from $c_i$. Therefore, the higher the information flow score between $c_i$ and $c_j$, the more likely it is that $c_i$ can represent $c_j$. We say $c_i$ depends on $c_j$ if they satisfy these conditions:

a. The concept $c_i$ receives relatively fewer navigation hits than $c_j$.

b. The number of navigation traces from concept $c_i$ to $c_j$ is much larger than the number to another, non-dependent concept $c_k$.

For each node $u$, the volume of information flow that enters it can be calculated by:

$IF_{in}(u) = \frac{1 - \alpha}{N} + \alpha \sum_{v \in N_{in}(u)} \frac{IF_{in}(v)}{d_{out}(v)}$

where $\alpha$ is the damping parameter, $N$ is the number of nodes, $N_{in}(u)$ denotes the set of neighbors of $u$ that point to $u$, $N_{out}(u)$ denotes the set of neighbors of $u$ that $u$ links to, and $d_{out}(v)$ denotes the outgoing degree of node $v$.

The information flow score for each node can be calculated via the following iterative algorithm:

```
Randomly initialize a score array with length equal to the number of nodes N
Repeat:
    For each topic t:
        temp[t] = (1 - alpha) / N
        For each incoming neighbor t' of t:
            temp[t] += alpha * score[t'] / deg_out(t')
    Normalize temp to unit L2 norm
    score = temp
Until score converges
```
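The incoming-flow iteration above can be sketched in Python (uniform rather than random initialization, for reproducibility; the tiny topic graph is hypothetical):

```python
import math

def incoming_flow(in_neighbors, out_degree, alpha=0.85, tol=1e-10):
    """score[u] = (1 - alpha)/N + alpha * sum_{v -> u} score[v] / d_out(v),
    L2-normalized after each sweep, iterated until convergence."""
    nodes = list(in_neighbors)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    while True:
        temp = {u: (1 - alpha) / n
                   + alpha * sum(score[v] / out_degree[v] for v in in_neighbors[u])
                for u in nodes}
        norm = math.sqrt(sum(s * s for s in temp.values()))
        temp = {u: s / norm for u, s in temp.items()}
        if max(abs(temp[u] - score[u]) for u in nodes) < tol:
            return temp
        score = temp

# Tiny topic graph: a -> b, a -> c, b -> c; c collects the most flow.
in_nb = {"a": [], "b": ["a"], "c": ["a", "b"]}
d_out = {"a": 2, "b": 1, "c": 0}
flow = incoming_flow(in_nb, d_out)
```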


For each node $u$, the volume of information flow that exits from it can be calculated by

$IF_{out}(u) = \frac{1 - \alpha}{N} + \alpha \, \frac{\sum_{v \in N_{out}(u)} w(u, v) \, IF_{out}(v)}{\sum_{v \in N_{out}(u)} w(u, v)}$

where $w(u, v)$ denotes the weight of the edge $(u, v)$.

The equation above illustrates how each topic node distributes the information flow it receives to all other nodes through its outgoing neighbors.

The outgoing information flow can be calculated similarly with the following iterative algorithm:

```
Randomly initialize a score array with length equal to the number of nodes N
Repeat:
    For each topic t:
        temp[t] = 0
        weightdeg = 0
        For each outgoing neighbor t' of t:
            temp[t] += alpha * w(t, t') * score[t']
            weightdeg += w(t, t')
        temp[t] /= weightdeg
        temp[t] += (1 - alpha) / N
    Normalize temp to unit L2 norm
    score = temp
Until score converges
```
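Similarly, a Python sketch of the outgoing-flow iteration; a zero-division guard is added for nodes with no outgoing neighbors, which the pseudocode leaves implicit, and the toy graph is hypothetical:

```python
import math

def outgoing_flow(out_neighbors, w, alpha=0.85, tol=1e-10):
    """score[u] = (1 - alpha)/N
                  + alpha * sum_v w(u,v) * score[v] / sum_v w(u,v),
    a weighted average over u's outgoing neighbors, L2-normalized each sweep."""
    nodes = list(out_neighbors)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    while True:
        temp = {}
        for u in nodes:
            nbrs = out_neighbors[u]
            weightdeg = sum(w[(u, v)] for v in nbrs)
            spread = (alpha * sum(w[(u, v)] * score[v] for v in nbrs) / weightdeg
                      if weightdeg else 0.0)  # guard: node with no outgoing edges
            temp[u] = spread + (1 - alpha) / n
        norm = math.sqrt(sum(s * s for s in temp.values()))
        temp = {u: s / norm for u, s in temp.items()}
        if max(abs(temp[u] - score[u]) for u in nodes) < tol:
            return temp
        score = temp

# Toy graph: a -> {b, c}, b -> {c}; edge (a, c) carries the heaviest weight.
out_nb = {"a": ["b", "c"], "b": ["c"], "c": []}
w = {("a", "b"): 1.0, ("a", "c"): 3.0, ("b", "c"): 1.0}
flow = outgoing_flow(out_nb, w)
```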


With these information flow scores, the information flow between two concepts $c_i$ and $c_j$ in a co-occurrence graph is derived from their scores together with the edge weight $w(c_i, c_j)$, which again is the co-occurrence count of the two concepts within documents.

That is, we first compute the information flow scores of the individual topics, and then use them to evaluate the strength of information flow between the two topics.