Core Idea of the Project

TechKnAcq asks a simple question: given a large collection of documents, can we analyse it so that an end user can query the collection for a well-defined reading list that helps them learn the main concepts described in the corpus?

Consider a corpus C comprising all the documents for the topic T for which we wish to construct a reading list. The figure below shows this as a partitioned circle in which each partition is a topic representing a concept. Constructing a reading list for T breaks down into finding the best available documents from each topic, linking them appropriately, and then traversing the resulting graph to produce the reading list.
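The traversal step can be illustrated with a minimal sketch. It assumes the links between topics form an acyclic "read this first" graph and that a best document has already been chosen for each topic; the function name, the toy topics and the document identifiers are all hypothetical, and a topological sort is only one possible traversal, not necessarily the one the project uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_reading_list(prerequisites, best_doc):
    """Order the chosen documents so that prerequisite topics come first.

    prerequisites: maps each topic to the set of topics to read before it
                   (assumed to be acyclic).
    best_doc:      maps each topic to the identifier of its chosen document.
    """
    order = TopologicalSorter(prerequisites).static_order()
    return [best_doc[topic] for topic in order if topic in best_doc]


# Hypothetical toy data: a simple chain of three topics.
prerequisites = {
    "probability": set(),
    "topic models": {"probability"},
    "LDA": {"topic models"},
}
best_doc = {
    "probability": "doc_prob_intro",
    "topic models": "doc_tm_survey",
    "LDA": "doc_lda_tutorial",
}

reading_list = build_reading_list(prerequisites, best_doc)
print(reading_list)
```

If the graph contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, so a real system would need a strategy for breaking or collapsing cyclic dependencies.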

A key issue is then to select the best possible document for a given user, based on the related concepts of pedagogical value (how useful is this document as a teaching tool?) and knowledge complexity (how well does this document’s difficulty match the user’s level of understanding?). This matching between documents and users comes down to learning to predict the features of the required documents (Fdoc) from the features of users (Fuser).
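To make the matching concrete, here is a minimal sketch under strong assumptions: pedagogical value and difficulty are reduced to single numbers in [0, 1], the user is described only by a level of understanding in [0, 1], and the score is a hand-written formula. The class, the function names and the weighting are hypothetical; in the approach described above, this mapping would be learned rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    pedagogical_value: float  # assumed scale: 0 (poor teaching tool) to 1 (excellent)
    difficulty: float         # assumed scale: 0 (introductory) to 1 (advanced)


def match_score(doc, user_level, mismatch_weight=1.0):
    """Reward pedagogical value and penalise the gap between the document's
    difficulty and the user's level (a hypothetical hand-written score)."""
    return doc.pedagogical_value - mismatch_weight * abs(doc.difficulty - user_level)


def select_best(docs, user_level):
    """Return the document with the highest match score for this user."""
    return max(docs, key=lambda d: match_score(d, user_level))


# Hypothetical documents for a single topic.
docs = [
    Document("survey", pedagogical_value=0.9, difficulty=0.8),
    Document("tutorial", pedagogical_value=0.7, difficulty=0.2),
    Document("reference", pedagogical_value=0.4, difficulty=0.5),
]

novice_pick = select_best(docs, user_level=0.1)
expert_pick = select_best(docs, user_level=0.9)
print(novice_pick.doc_id, expert_pick.doc_id)
```

With these toy numbers the novice is matched to the tutorial and the expert to the survey, which shows how the same topic can yield different reading-list entries for different users.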

Possible document features relevant to this research question include impact, reliability, recency (how ‘up to date’ the document is), and availability. We need to be able to characterize documents in these terms in order to assess their usefulness as teaching tools.

Within our approach, we have built a preliminary implementation of this core concept to spur research in the field and to further develop our core model. This involves the following steps:

  1. Use topic modeling to automatically generate a statistical model of the corpus with minimal use of human experts for annotation.
  2. Develop automatic, information-theoretic methods to construct dependency relationships between concepts.
  3. Develop heuristic and machine learning methods to construct a reading list in response to a given query.
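As an illustration of what an information-theoretic signal for step 2 might look like, the sketch below computes the Kullback-Leibler divergence between the word distributions of two topics in both directions. This is an assumption made for illustration only: the text does not say which measure the project uses, the toy distributions are invented, and a separate rule (not shown) would be needed to turn the asymmetry into a dependency edge.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two word distributions given as {word: probability} dicts.

    eps is a small smoothing constant (an assumption) that keeps the result
    finite when q assigns zero probability to a word that p uses.
    """
    return sum(
        pw * math.log((pw + eps) / (q.get(word, 0.0) + eps))
        for word, pw in p.items()
        if pw > 0
    )


# Hypothetical topics: one broad, one concentrated on a subset of its words.
broad = {"model": 0.25, "data": 0.25, "topic": 0.25, "word": 0.25}
specific = {"topic": 0.5, "word": 0.5}

specific_given_broad = kl_divergence(specific, broad)
broad_given_specific = kl_divergence(broad, specific)
print(specific_given_broad, broad_given_specific)
```

The two directions differ sharply here: the narrow topic is described reasonably well by the broad one, but not the reverse. An asymmetry of this kind is the sort of evidence a dependency-construction method could exploit.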