Motivation
CK readme (brief introduction, target audience and our solution).
The first step in our approach is the gradual organization and systematization of existing code and data on local machines. Users can gradually classify their files by assigning them to a new or existing CK module written in Python. Such modules serve as wrappers (containers) to abstract, describe and manage related data. Modules have common actions (such as add, delete, load, update, find, etc.) and internal actions specific to a given class.
Modules and related data entries are always assigned a unique identifier (UID) and may also have a user-friendly alias (together referenced as UOA: UID Or Alias) and a brief description. Any data entry in the system can be found and cross-linked using its CID (module_uoa:data_uoa).
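For illustration, here is a minimal sketch of these common actions through the CK Python kernel. The module alias `experiment.demo` and entry alias `my-entry` are hypothetical, and the exact output keys may differ between CK versions:

```python
# Sketch (assumed CK kernel API): add a data entry under a module,
# then find it again by its module_uoa:data_uoa reference.
import ck.kernel as ck

# Add a new entry 'my-entry' under a hypothetical module 'experiment.demo'.
r = ck.access({'action': 'add',
               'module_uoa': 'experiment.demo',  # hypothetical module alias
               'data_uoa': 'my-entry',           # user-friendly alias of the new entry
               'dict': {'tags': ['demo']}})      # initial JSON meta-description
if r['return'] > 0:
    ck.err(r)                                    # print the error and exit on failure

# Find the entry again via its CID-like reference (module_uoa:data_uoa).
r = ck.access({'action': 'find',
               'module_uoa': 'experiment.demo',
               'data_uoa': 'my-entry'})
if r['return'] > 0:
    ck.err(r)

print(r.get('path', ''))      # location of the entry on the native file system
print(r.get('data_uid', ''))  # automatically assigned UID
```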
All data entries are kept on the native file system, thus ensuring platform portability. Users can add or gradually update the meta-description of any data entry in a popular and human-readable JSON format that can be modified with any available editor. This meta-description can be transparently indexed by the third-party ElasticSearch tool to enable fast and powerful search capabilities. Furthermore, data entries in our format can now be easily archived, shared via Git and moved between different local repositories and user machines.
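A hedged sketch of how such a meta-description might be loaded and gradually updated through the same kernel API (the module and entry names are the hypothetical ones from the previous example, and the merge behavior of the `update` action may vary between CK versions):

```python
# Sketch (assumed CK kernel API): load the JSON meta-description of an entry,
# then update it with an extra key. Module/entry names are hypothetical.
import ck.kernel as ck

r = ck.access({'action': 'load',
               'module_uoa': 'experiment.demo',
               'data_uoa': 'my-entry'})
if r['return'] > 0:
    ck.err(r)

meta = r['dict']   # JSON meta-description (also editable by hand inside the entry's directory)
path = r['path']   # where the entry lives on the native file system

meta['dataset'] = 'image-jpeg-0001'   # hypothetical extra key added by the user

r = ck.access({'action': 'update',
               'module_uoa': 'experiment.demo',
               'data_uoa': 'my-entry',
               'dict': meta})
if r['return'] > 0:
    ck.err(r)
```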
Such a relatively simple approach has already allowed us to gradually abstract, organize and share all our past knowledge (not only results and data but also code) along with publications, protect it from continuous changes in the system, make it searchable, and connect it together to implement various research scenarios, as conceptually shown in the figures below:
In the end, CK is just a small Python module with a JSON API that glues together the user's code and data in local directories registered as CK repositories (so that any code and data can always be found by CID).
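The whole API follows one convention: a JSON-like dictionary goes in, a JSON-like dictionary comes out, with a 'return' code and an optional 'error' string. A hedged sketch of this glue pattern (the module alias and output keys are assumptions):

```python
# Sketch of the unified JSON-in/JSON-out convention assumed throughout CK.
import ck.kernel as ck

def ck_call(inp):
    """Call a CK module action with a JSON-like dict and fail loudly on error."""
    r = ck.access(inp)
    if r['return'] > 0:
        raise RuntimeError(r.get('error', 'unknown CK error'))
    return r

# Example: list all entries of a hypothetical module.
out = ck_call({'action': 'list', 'module_uoa': 'experiment.demo'})
for entry in out.get('lst', []):
    print(entry.get('data_uoa'), entry.get('data_uid'))
```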
Furthermore, CK can help deal with ever-changing and possibly proprietary software and hardware by abstracting access to them via CK tool wrappers, as conceptually shown below:
For example, users just need to set up a CK environment for a given version of pre-installed software (such as Intel compilers or SPEC benchmarks) and then use CK modules with the JSON API to access such software, as described in detail in this section. This allows researchers to share their experimental setups while excluding proprietary software, providing instead a simple recipe for installing it and setting up the CK environment for unified communication. It also allows the easy co-existence of multiple versions of related tools, such as different versions of the LLVM and GCC compilers.
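The exact commands vary between CK versions, but the workflow typically looks like the following heavily hedged sketch; the 'soft' and 'env' modules and the 'compiler.gcc' descriptor are assumptions based on public CK repositories, not guaranteed names:

```python
# Heavily hedged sketch: detect a pre-installed compiler and register a CK
# environment entry for it, then list the registered environments.
# Module names ('soft', 'env') and the descriptor 'compiler.gcc' are assumptions;
# action names may differ between CK versions.
import ck.kernel as ck

r = ck.access({'action': 'detect',
               'module_uoa': 'soft',
               'data_uoa': 'compiler.gcc'})   # assumed software descriptor
if r['return'] > 0:
    ck.err(r)

r = ck.access({'action': 'show',
               'module_uoa': 'env'})          # environment entries; multiple versions can co-exist
if r['return'] > 0:
    ck.err(r)
```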
Such an organization allows users to gradually convert any ad-hoc and hardwired analysis and experimental setups into unified pipelines (or workflows) assembled, LEGO(R)-style, from interconnected CK modules and data entries. Furthermore, the simple CK API with unified input and output makes it possible to expose the information flow to existing and powerful statistical analysis, classification and predictive modeling tools, including R and SciPy. For example, CK helped us convert and share all hardwired and script-based experimental setups from our past and current R&D on program auto-tuning and machine learning as shareable CK pipelines, conceptually presented in the following figure:
Furthermore, such pipelines can be replayed (repeated) at any later time given the original JSON input, module_uoa and action, thus supporting our initiative on collaborative and reproducible R&D. At the same time, whenever any unexpected behavior is detected, the community can help improve modules, provide missing descriptions or add more tools, modules and data to the pipelines to gradually and collaboratively explain the unexpected behavior, ensure reproducibility and improve collective knowledge.
Internally, modules and data should always be referenced by UID rather than by alias to ensure compatibility between various modules: whenever the API or data format of a given module becomes backward-incompatible, we may keep the same alias (or add a version) but must change its UID. Thus, new and old modules can co-exist without breaking shared experimental workflows.
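As a hedged illustration of such replay, the same JSON input can simply be stored and fed back to the same module action later; the module alias 'pipeline.demo' and its input keys below are hypothetical placeholders:

```python
# Sketch: record the JSON input of a pipeline call and replay it later.
# The module 'pipeline.demo' and its input keys are hypothetical placeholders.
import json
import ck.kernel as ck

pipeline_input = {'action': 'run',
                  'module_uoa': 'pipeline.demo',
                  'iterations': 10}

# Save the input so that anyone can repeat the experiment later.
with open('replay.json', 'w') as f:
    json.dump(pipeline_input, f, indent=2)

# ... any time later, on any machine with the same CK repositories:
with open('replay.json') as f:
    replay_input = json.load(f)

r = ck.access(replay_input)
if r['return'] > 0:
    ck.err(r)
```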
Finally, the implementation of CK as a simple and open knowledge management SDK makes it easy to integrate with other third-party technology such as IPython, web services, file managers, GUI frameworks, MediaWiki, Drupal, Visual Studio, Android Studio, Eclipse, etc. It can also be extended with higher-level and user-friendly tools similar to TortoiseGit, IPython, phpMyAdmin, etc. We expect that if our community finds CK useful, it will help us improve it and develop extensions for various practical research and experimentation scenarios.
We hope that our approach will let industry, academia and volunteers work together to gradually improve research techniques and continuously share realistic benchmarks and data sets. We believe that this can eventually enable truly open, collaborative, interdisciplinary and reproducible research, to some extent similar to physics and other natural sciences, Wikipedia, literate programming and the open-source movement. It may also help computer engineers and researchers become data scientists and focus on innovation while liberating them from ad-hoc, repetitive, boring and time-consuming tasks. It should also help solve some of the big data problems we have faced since 1993 by preserving predictive models (knowledge) and finding missing features rather than keeping large and possibly useless amounts of raw data.
List of public repositories, data, modules and actions (customizable and multi-objective autotuning, realistic benchmarks and workloads, experiment crowdsourcing, predictive analytics, co-existence of multiple tools, interactive graphs and articles, etc.):
We gradually convert and share all artifacts and experimental workflows from our past publications as reusable and customizable CK components.
You can read more about our motivation behind Collective Knowledge and about previous versions of our collaborative experimentation and knowledge management frameworks (Collective Mind, cTuning) in the following recent publications:
- "Collective Knowledge: towards R&D sustainability", Grigori Fursin, Anton Lokhmotov and Ed Plowman, to appear at DATE'16 (Design, Automation and Test in Europe), Dresden, March 2016 [PDF]
- "Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science", 18th International Workshop on Compilers for Parallel Computing (CPC'15), London, UK, January 2015 [CK-powered interactive article with PDF and BIBTEX]
- "Collective Mind: Towards practical and collaborative auto-tuning", Journal of Scientific Programming 22 (4), 2014 [CK-powered interactive article with PDF and BIBTEX]
- "Community-driven reviewing and validation of publications", TRUST'14 at PLDI'14, Edinburgh, UK, 2014 [PDF and BibTex]
- "Collective Tuning Initiative: automating and accelerating development and optimization of computing systems", Proceedings of the GCC Summit, Montreal, Canada, 2009 [PDF and BibTex]
If you find CK useful, feel free to reference any of the above publications in your reports.
CK development is coordinated by the non-profit cTuning Foundation and dividiti.