Principal Investigator: Peter Kraker
Key contributors: Scott Chamberlain, Björn Brembs, Maxi Schramm, Christopher Kittel, Rainer Bachleitner and Asura Enkhbayar
The Human Cell Atlas (HCA) will work with data that is huge in volume, diverse in type, and generated at high velocity. Navigating this data quickly and efficiently will be a major challenge, yet it will be crucial for the success of the HCA. To aid the construction of the atlas and to support exploration and discovery of the data, we propose a novel tool for human data interaction: the Cognitive Data Interface (CDI). The CDI will enable users to get a quick overview of the immense amount of data ingested by the Data Coordination Platform. The CDI is based on knowledge maps, which provide clustered overviews of large amounts of resources. Using intelligent layering mechanisms that draw on insights from cognitive science and state-of-the-art machine learning techniques, we will provide dynamically evolving overviews that keep users up to date with new data as it comes into the platform.
The CDI is intended for use by all stakeholders of the HCA, including researchers, physicians, and coordinators. To accommodate the needs of different groups, the CDI will provide a range of different views on the data, structured along existing ontologies. With intelligent search and exploration facilities, the CDI will support users in identifying collections of data objects for their current data needs, revealing possible links between objects, and accelerating the generation of insight from data. In addition, the CDI will serve as an important coordination tool among labs, as it shows which data is being generated at any point in time.
The Human Cell Atlas will organize and standardize terabytes of data for billions of cells, across multiple modalities, generated by hundreds of labs around the world. The HCA will work with big data: huge in volume, diverse in type, and generated at high velocity. Navigating the data quickly and efficiently without any blind spots will therefore be a major challenge, yet it will be crucial for the construction of the HCA.
Without appropriate data discovery tools, findability and reuse in such a large and dynamic data source will be an issue. Recent bibliometric research suggests that while the number of openly available datasets has skyrocketed in recent years, reuse is below 20% (Peters et al. 2016, Kraker et al. 2015a). For the Human Cell Atlas to succeed, such a low rate of reuse would not be acceptable.
To aid the construction of the atlas and to support exploration and discovery, we propose a novel tool for human data interaction: the Cognitive Data Interface (CDI). The CDI will enable users to get a visual overview of the immense amount of data ingested by the Data Coordination Platform. Based on knowledge maps, which provide clustered overviews of large amounts of resources, the CDI provides an interface to the data objects that is optimized for human cognition. Using intelligent layering mechanisms involving state-of-the-art machine learning techniques, we will create visualizations that give a detailed overview while keeping the cognitive load at an appropriate level (see Figure 1). A key feature of the CDI is that it will evolve dynamically to accommodate new data objects as they come into the platform.
The CDI is intended for use by all stakeholders of the HCA, including researchers, physicians, and coordinators. To accommodate the needs of different groups, the CDI will provide a range of different views on the data, structured along existing ontologies. With intelligent search and exploration facilities, the CDI will support users in identifying collections of data objects for their current data needs, revealing possible links between objects, and accelerating the generation of insight from data. In addition, the CDI will serve as an important coordination tool among labs, as it shows which data is being generated at any point in time.
The proposed project is based on prior work on structuring and visualizing scholarly outputs (Kraker et al. 2015b, Kraker et al. 2012), as well as mapping the evolution of knowledge domains (Kraker et al. 2014). This work has been translated into the powerful open source knowledge mapping software Head Start (Kraker et al. 2017a). Head Start provides an interactive, web-based interface for exploration and discovery. Head Start is designed to keep cognitive load at a manageable level, while still providing an overview of large numbers of items in a single view. It has been validated by user testing (Kraker 2015) and is continuously updated and improved.
Head Start includes a sophisticated backend that is capable of automatically producing knowledge maps from a variety of data, including text, metadata and references (Kraker et al. 2016). Head Start has been optimized to work in real-time settings with sparse, incomplete and inconsistent metadata and contents. The backend is complemented by an evaluation framework, which monitors aggregate metrics so that algorithms can be fine-tuned to accommodate changes and trends in the input data.
Head Start is being used in various systems and projects, including Conference Navigator 3 and the H2020 project OpenUP. Head Start is also the main software used in the Open Knowledge Maps platform. This platform enables users to create a visual overview of a research topic based on more than 100 million scientific documents. The platform has become a popular discovery service with over 170,000 visits since its launch in May 2016.
The Cognitive Data Interface will connect to the consumer API of the Data Coordination Platform (DCP), as depicted in Figure 2. New datasets will be ingested on a daily basis, then transformed and optimized for display by an analysis and mapping service. They will be integrated into the existing structure for each ontology, which is persisted in a separate data store. Each interface will connect to the interface API to retrieve a representation adapted to its user group. The resulting structures will be continuously analyzed by an evaluation framework, which feeds its results back to improve analysis and mapping. In addition, we will collect user interaction data, which will also be used to improve analysis and mapping.
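The data flow described above can be illustrated with a minimal sketch. It is not the planned implementation: the `Dataset` fields, the `OntologyStore` class, the `DEFAULT_VIEW` mapping of user groups to ontologies, and all function names are hypothetical, and the call to the DCP consumer API is replaced by in-memory batches. The evaluation framework and the collection of user interaction data are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Dataset:
    """A dataset as it might arrive from the DCP consumer API (fields are assumed)."""
    dataset_id: str
    ontology_terms: dict  # ontology name -> term, e.g. {"tissue": "lung"}


class OntologyStore:
    """Hypothetical store for one ontology: term -> ids of datasets filed under it."""

    def __init__(self):
        self.clusters = defaultdict(list)

    def integrate(self, term, dataset_id):
        # Add the dataset to the existing structure without duplicating it.
        if dataset_id not in self.clusters[term]:
            self.clusters[term].append(dataset_id)


def ingest_batch(batch, stores):
    """Stand-in for analysis and mapping: file each dataset under every ontology it is tagged with."""
    for dataset in batch:
        for ontology, term in dataset.ontology_terms.items():
            stores.setdefault(ontology, OntologyStore()).integrate(term, dataset.dataset_id)
    return stores


# Assumed mapping of user groups to the ontology view they are shown.
DEFAULT_VIEW = {"researcher": "cell_type", "physician": "tissue"}


def interface_api(stores, user_group):
    """Return the clustered representation adapted to one user group."""
    store = stores.get(DEFAULT_VIEW[user_group])
    if store is None:
        return {}
    return {term: sorted(ids) for term, ids in store.clusters.items()}


# Two simulated daily batches, integrated into the same stores.
stores = {}
ingest_batch([Dataset("d1", {"tissue": "lung", "cell_type": "T cell"}),
              Dataset("d2", {"tissue": "lung", "cell_type": "B cell"})], stores)
ingest_batch([Dataset("d3", {"tissue": "liver"})], stores)

physician_view = interface_api(stores, "physician")
researcher_view = interface_api(stores, "researcher")
```

Keeping one store per ontology means a new batch only touches the structures it is tagged for, and each interface reads a single store rather than the full dataset collection.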
Work is divided into five work packages:
WP1: Project management and dissemination
WP1 will coordinate all activities of the project and ensure effective communication within the project and with CZI, including reporting. WP1 will disseminate project results, and support the exploitation of outcomes by other project participants.
D1.1 Releases of the Cognitive Data Interface (M7, M11)
D1.2 Final project report including financial report (M12)
WP2: User requirements and evaluation
WP2 will carry out the requirements analysis and evaluation in a collaborative process with relevant stakeholders of the HCA. We will use metrics and user feedback to ensure usability and validity of the CDI, and suggest improvements where necessary.
D2.1 Initial set of requirements (M2)
D2.2 Evaluation reports (M8, M10)
WP3: Visual interface
WP3 will develop a web-based interface by adapting and extending the existing knowledge mapping framework Head Start to the HCA, introducing a layering mechanism and data-specific interface modes.
D3.1 User interface design (M3)
D3.2 User interface implementation (M6)
WP4: Analysis and mapping
In WP4 the backend of Head Start will be adapted to the HCA by integrating ontology-based tagging, text summarization, ranking and filtering, as well as manifold-learning based mapping.
D4.1 Backend architecture design (M3)
D4.2 Backend implementation (M6)
WP5: Data storage and connection
WP5 will develop a data streaming pipeline connecting to the HCA consumer API, separate databases for tabular metadata, and graph-based dataset enrichment. It will also develop a Data Fusion model in conjunction with WP4.
D5.1 Connection to HCA Consumer API (M2)
D5.2 Data Fusion model (M6)
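The layering mechanism named in WP3, combined with the ontology-based tagging named in WP4, could work along the following lines. This is a sketch under stated assumptions: the toy `PARENT` hierarchy, the sample tags, and the function names are illustrative and are not taken from a real ontology release or from Head Start. Each layer shows datasets grouped by the ontology concept at a chosen depth, so that a coarse layer shows few clusters and deeper layers show progressively more detail.

```python
from collections import Counter

# Toy ontology as child -> parent links (illustrative terms only).
PARENT = {
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "monocyte": "leukocyte",
    "leukocyte": "cell",
}


def path_from_root(term):
    """Return the chain of concepts from the ontology root down to `term`."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def layer(tags, depth):
    """Count datasets per concept at `depth` (0 = root).

    A tag that sits higher in the hierarchy than `depth` is counted
    under its own term, so no dataset disappears from a layer.
    """
    counts = Counter()
    for term in tags:
        path = path_from_root(term)
        counts[path[min(depth, len(path) - 1)]] += 1
    return dict(counts)


# One tag per dataset, as an ontology-based tagging step might produce.
tags = ["T cell", "T cell", "B cell", "monocyte"]

coarse_layer = layer(tags, 1)
detailed_layer = layer(tags, 2)
```

In this toy case, depth 1 collapses all four datasets into a single cluster, while depth 2 separates lymphocytes from monocytes.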
Evaluation will be carried out both quantitatively, by using and extending the evaluation framework, and through user testing. As there is no ground truth for open-ended knowledge discovery, we need to continuously balance metrics such as precision (providing relevant results), recall (reliable search over the whole dataset), and serendipity (providing the user with novel results and leaving the filter bubble). The development process will be conducted with experienced computational biologists on the team and on the advisory board of the project.
With regard to dissemination, we will publish the code on GitHub. The Cognitive Data Interfaces themselves will be accessible from openknowledgemaps.org, a popular discovery platform. Because they are written in HTML5, they can also be easily integrated into other platforms and services. In addition, we will leverage our extensive network of partners, advisors and users for dissemination.
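To make the trade-off between these metrics concrete, the sketch below computes all three for a single result list. Precision and recall follow their standard set-based definitions. The serendipity measure is an assumption made for illustration only (the share of returned items that are relevant and not already known to the user); the proposal does not fix a formula, and the function names and sample identifiers are hypothetical.

```python
def precision(returned, relevant):
    """Share of returned items that are relevant."""
    returned = set(returned)
    if not returned:
        return 0.0
    return len(returned & set(relevant)) / len(returned)


def recall(returned, relevant):
    """Share of relevant items that were returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(returned) & relevant) / len(relevant)


def serendipity(returned, relevant, known):
    """Assumed definition: share of returned items that are relevant and new to the user."""
    returned = set(returned)
    if not returned:
        return 0.0
    novel_hits = (returned & set(relevant)) - set(known)
    return len(novel_hits) / len(returned)


# Hypothetical judgments for one query.
returned = ["a", "b", "c", "d"]   # datasets shown to the user
relevant = {"a", "b", "e"}        # datasets judged relevant
known = {"a"}                     # datasets the user had already seen

p = precision(returned, relevant)           # 2 of 4 returned are relevant
r = recall(returned, relevant)              # 2 of 3 relevant were returned
s = serendipity(returned, relevant, known)  # only "b" is relevant and new
```

Returning more items can raise recall and serendipity while lowering precision, which is why the three values have to be monitored together rather than optimized one at a time.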
The motto of Open Knowledge Maps is “Open science, all the way”. In all of our endeavours, our goal is to create a public good that can be freely used, modified, and shared. In addition, open, participatory and collaborative processes are a big part of our identity. Our roadmap, for example, has always been openly shared on GitHub (Open Knowledge Maps Team 2017), and we engage with our community of partners, advisors and users in human-centered design.
We will bring our collaborative mindset to the project and seek active exchange with project partners, our collaborative network and beyond. Data, code and content created within the project will be released under a license that is compatible with the Open Definition (Open Knowledge International 2015). This includes the present proposal, which has already been published on GitHub (Kraker et al. 2017b).
Fig. 1: Schematic diagram of layered knowledge maps along ontology concept levels in the CDI
Fig. 2: Diagram of the key software components of the CDI and their connection to the DCP