A key distinction in the TechKnAcq project is the notion of a "concept graph". We describe our core formulation of this idea in this page.
Co-occurrence: Given two concepts and , if document is in the top- relevant documents for both and , we say that and co-occur in the document .
Cross entropy: Cross entropy measures the difference between two distributions. Specifically, the cross entropy for the distributions and over a given set is defined as
where is the entropy of , and is the Kullback–Leibler divergence of an estimated distribution from true distribution . Therefore, examines how well the distribution of approximates that of .
Joint entropy: Joint entropy measures the information we obtained when we observe both and . The joint Shannon entropy of two variables and is defined as:
where is the joint probability of these values occurring together.
Jaccard similarity coefficient: The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
The more often two concepts, and , co-occur, the more strongly correlated the concepts are. We add an undirected edge when they occur in more than some threshold number of documents.
The more overlap there is in the words or phrases that are most strongly associated with each concept, the more likely it is that one concept can substitute for another. Given a pair of concepts, we calculate the Jaccard similarity coefficient between the sets of the top-20 words or phrases associated with each one.
The more overlap there is in the documents that are most strongly associated with each concept, the more likely it is that one concept can substitute for another. Given a pair of concepts, we calculate the Jaccard similarity coefficient between the sets of the top- documents associated with each one.
Given two concepts and , the cross-entropy approach predicts that depends on if they satisfy these conditions:
The distribution of is better approximated by that of than the distribution of is approximated by that of .
The co-occurrence frequency of instances of and is relatively higher than that of a known non-dependency pair.
Where each concept is represented as a set of top- relevant documents, we say depends on if and only if they satisfy the following conditions:
with as a threshold value, which can be interpreted as “the average joint entropy of any non-dependence concepts”. The weight of the dependency is defined as:
The cross-entropy method is general and can be applied to different distributions used to model concepts, such as the distributions of relevant words, of relevant documents, or of the documents that are cited by relevant documents.
This use of cross-entropy to derive asymmetric weights in a concept graph was inspired by this paper: > Intagorn, S. and Lerman K. (2012) ‘Probabilistic Approach to Mining Geospatial Knowledge from Social Annotations’ CIKM’12, Maui, HI, USA.
Given two concepts and , the information flow score between and evaluates the likelihood of reaching when staring random walk from . Therefore, the higher information flow score between and , the more likely that is able to represent . Given concepts and , depends on if they satisfy these conditions:
a. The concept receives relatively lower navigation hits than .
b. The number of navigation traces from concept to is much stronger than that to another, non-dependent concept .
For each node , the volume of information flow that enters into it can be calculated by:
where is the damping parameter, denotes the set of neighbors of that are pointing to ; while denotes the set of neighbors of that are linked to; and denotes the out-going degree of node .
The information flow score for each node can be calculated via the following iterative algorithm:
Randomly initialize a score array with length equal to number of nodes N
Repeat:
For each topic t:
temp[t] = (1 - alpha) / N;
For each incoming neighbor t' of t
temp[t] += alpha * score[t'] / deg_{out}(t')
Normalize temp to L2 Norm
score = temp
Until score converges.
For each node , the volume of information flow that exits from it can be calculated by
where denotes the weight of an edge .
In the above equation, it illustrates how each topic node distributes the information flow it received to all other remaining nodes through its outgoing neighbors.
The outgoing information flow can be calculated similarly with the following iterative algorithm:
Random initialize a score array with length equal to number of nodes N
Repeat
For each topic t:
temp[t] = 0
weightdeg = 0;
For each outgoing neighbor t' of t:
temp[t] += alpha * w(t,t') * score[t']
weightdeg += w(t,t')
temp[t] /= weightdeg;
temp[t] += (1 - alpha) / N
Normalize temp to L2 Norm
score = temp
Until score converges.
With the information flow score, the information flow between two concepts and in a co-occurrence graph is:
where again is the co-occurrence counts within documents.
In the above equation, we first examine the information flow relativity scores between two topics, and then evaluate the strength of information flow between two topics.