Motivation
Over the past 20 years, we have worked on many issues related to improving the performance, power consumption and reliability of computer systems via machine-learning-based autotuning and run-time adaptation (1, 2, 3, 4, 5).
By now, we expected to have solved at least some of them and moved on to addressing important societal issues. Instead of unleashing our creativity and innovating, however, we have had to spend more and more time on ad-hoc management of a growing number of research artifacts: numerous and ever-changing platforms, tools, benchmarks, datasets, scripts, predictive models, experimental results and publications. At the same time, we have been struggling to find convenient ways to share, reproduce and reuse knowledge across the research community. Furthermore, since we often work with proprietary software (compilers, benchmarks), we could not use virtual machines or Docker to publicly share our experimental setups and results.
In the end, we decided to develop a lightweight and portable knowledge management system that runs on Linux, Windows, MacOS and Android and can evolve along with ever-changing hardware and software artifacts, while taking advantage of the best modern techniques, including open JSON-based APIs, Git, ElasticSearch, SciPy and scikit-learn for statistical analysis and predictive analytics, web services, and an agile R&D methodology. We have called this system "Collective Knowledge", for knowledge is most powerful when it is created and shared collectively.
You may find Collective Knowledge Framework (CK) useful if you are a researcher or organization performing many experiments on a regular basis and experiencing similar problems:
- spending more and more time adapting your experimental setups to ever-changing software, hardware, interfaces and numerous data formats, rather than innovating;
- having difficulty finding, managing and reusing your own past scripts, tools, data sets and experimental results;
- experiencing problems sharing whole experimental setups and knowledge with your colleagues so that they can be reproduced, customized and built upon (particularly if they use different operating systems, software and hardware);
- suffering from a rising number of experimental choices to explore (design and optimization spaces) and experimental results to process (a big data problem);
- receiving increasing bills for computational resources (particularly in data centers and supercomputers);
- finding it too complex or time-consuming to master powerful predictive analytics techniques that could help automatically process your "big data" (experimental results);
- finding it very time-consuming to rebuild papers and reports whenever new experimental results become available.
We are developing CK with the help of the interdisciplinary community to:
- abstract and unify access to any software, hardware and data using CK modules (wrappers) with a simple JSON API while handling all continuous low-level changes;
- use CK modules to slightly reorganize local artifacts (scripts, benchmarks, data sets, tools, predictive models, graphs, articles) into searchable, customizable, reusable and interconnected components with unique IDs and JSON-based meta information which can be easily shared via GitHub, BitBucket and other public or private services;
- quickly prototype research ideas (experimental workflows) from shared components as LEGO(TM) and exchange results in schema-free JSON format (for example, performance benchmarking and empirical multi-objective program autotuning);
- involve the interdisciplinary community to crowdsource and reproduce experiments using spare computational resources, report unexpected behavior, collaboratively explain and solve it by improving shared CK modules, and immediately push improvements back to the community (similar to Wikipedia);
- easily reuse statistical analysis and predictive analytics techniques via CK modules to process large volumes of experimental results (possibly on the fly via active learning), share and improve predictive models (knowledge), and effectively compact "big data";
- use a built-in web server to enable interactive graphs and articles.
The first step in our approach is the gradual organization and systematization of existing code and data on local machines. Users can gradually classify their files by assigning them to a new or existing CK module written in Python. Such modules are used as wrappers (containers) to abstract, describe and manage related data. Modules have common actions (such as add, delete, load, update and find) as well as internal actions specific to a given class.
Modules and related data entries are always assigned a unique identifier (UID) and may also have a user-friendly alias (referenced by UOA, i.e. UID Or Alias) and a brief description. Any data entry in the system can then be found and cross-linked using CID=module_uoa:data_uoa.
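To make this concrete, here is a minimal, self-contained sketch of the idea (not the actual CK implementation; all names here are illustrative): a module acts as a wrapper that manages its data entries and exposes every action through a single dict-in/dict-out (JSON-compatible) entry point, with entries addressable by UID or alias:

```python
import uuid


class Module:
    """Hypothetical sketch of a CK-style module: a wrapper that manages
    related data entries and exposes actions via a JSON-style API."""

    def __init__(self, alias):
        self.uid = uuid.uuid4().hex[:16]  # unique identifier of the module
        self.alias = alias                # optional human-friendly alias
        self.entries = {}                 # data entries managed by this module

    def access(self, i):
        """Single entry point: both input and output are JSON-compatible dicts."""
        action = i.get('action', '')
        if action == 'add':
            uid = uuid.uuid4().hex[:16]
            self.entries[uid] = {'alias': i.get('data_uoa', ''),
                                 'meta': i.get('meta', {})}
            return {'return': 0, 'uid': uid}
        elif action == 'load':
            uoa = i.get('data_uoa', '')
            # UOA resolution: match either the UID or the alias.
            for uid, e in self.entries.items():
                if uid == uoa or e['alias'] == uoa:
                    return {'return': 0, 'uid': uid, 'meta': e['meta']}
            return {'return': 1, 'error': 'entry not found'}
        return {'return': 1, 'error': 'unknown action'}


# A CID of the form module_uoa:data_uoa then locates any entry in the system:
experiment = Module('experiment')
r = experiment.access({'action': 'add', 'data_uoa': 'gcc-autotuning',
                       'meta': {'compiler': 'GCC 4.9', 'objective': 'speed'}})
cid = experiment.alias + ':' + 'gcc-autotuning'
```

Because every action shares the same dict-in/dict-out contract, modules can be composed and their inputs/outputs serialized as JSON without any per-module glue code.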
All data entries are kept on the native file system, thus ensuring platform portability. Users can add or gradually update the meta-description of any data entry in the popular and human-readable JSON format, which can be modified using any available editor. This meta-description can be transparently indexed by the third-party ElasticSearch tool to enable fast and powerful search capabilities. Furthermore, data entries in our format can now be easily archived, shared via Git and moved between different local repositories and user machines.
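For illustration, the meta-description of a data entry might look like the following (the keys here are hypothetical examples, not a prescribed schema); being plain JSON, it can be edited by hand, indexed by ElasticSearch and versioned with Git:

```json
{
  "data_name": "susan corner detection benchmark",
  "tags": ["benchmark", "image-processing"],
  "build": {"compiler": "GCC", "version": "4.9.1", "flags": "-O3"},
  "characteristics": {"run_time_s": 1.42, "binary_size_bytes": 52416}
}
```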
Such a relatively simple approach has already allowed us to gradually abstract, organize and share all our past knowledge (not only results and data but also code) along with publications, protect it from continuous changes in the system, make it searchable, and connect it together to implement various research scenarios, as conceptually shown in the figures below:
In the end, CK is just a small Python module with a JSON API that glues together a user's code and data in local directories registered as CK repositories (so that any code and data can always be found by CID).
Furthermore, CK can help deal with ever-changing and possibly proprietary software and hardware by abstracting access to them via CK tool wrappers, as conceptually shown below:
For example, users just need to set up a CK environment for a given version of already pre-installed software (such as Intel compilers or SPEC benchmarks) and can then use CK modules with the JSON API to access such software, as described in detail in this section. This allows researchers to share their experimental setups while excluding proprietary software, providing instead a simple recipe for installing it and setting up the CK environment for unified communication. It also allows easy co-existence of multiple versions of related tools, such as different versions of the LLVM and GCC compilers.
Such organization allows users to gradually convert any ad-hoc and hardwired analysis and experimental setups into unified pipelines (or workflows) assembled, like LEGO (R), from interconnected CK modules and data entries. Furthermore, the simple CK API with unified input and output makes it possible to expose the information flow to existing and powerful statistical analysis, classification and predictive modeling tools, including R and SciPy. For example, CK helped us convert and share all the hardwired and script-based experimental setups from our past and current R&D on program auto-tuning and machine learning as shareable CK pipelines, conceptually presented in the following figure:
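As an illustration of the idea (again a hypothetical sketch with made-up step names, not the real CK pipeline API), steps with a unified dict-in/dict-out interface can be chained into a pipeline whose output feeds directly into standard statistical tools:

```python
import statistics

# Hypothetical pipeline steps: each takes and returns a JSON-compatible dict,
# so steps can be freely recombined and the information flow exposed to
# analysis tools such as SciPy or R (here, the stdlib statistics module).
def compile_step(i):
    return {**i, 'binary': i['program'] + '.bin'}

def run_step(i):
    # Stand-in for actually running the binary several times and
    # recording the measured execution times.
    timings = [1.27, 1.31, 1.25, 1.29]
    return {**i, 'timings': timings}

def analyze_step(i):
    t = i['timings']
    return {**i, 'mean_s': statistics.mean(t), 'stdev_s': statistics.stdev(t)}

def pipeline(i, steps):
    """Run a chain of steps, threading one JSON-compatible dict through all of them."""
    for step in steps:
        i = step(i)
    return i

r = pipeline({'program': 'susan'}, [compile_step, run_step, analyze_step])
```

Because the whole pipeline is just a JSON input plus a list of steps, the same structure can be serialized, shared and replayed elsewhere.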
Furthermore, such pipelines can be replayed (repeated) at any later time given their JSON input, module_uoa and action, thus supporting our initiative on collaborative and reproducible R&D. At the same time, whenever unexpected behavior is detected, the community can help improve modules, provide missing descriptions, or add more tools, modules and data to the pipelines in order to gradually and collaboratively explain the unexpected behavior, ensure reproducibility and improve collective knowledge.
Internally, modules and data should always be referenced by UID rather than by alias to ensure compatibility between various modules, i.e. whenever the API or data format of a given module becomes backward-incompatible, we may keep the same alias (or add a version) but must change its UID. Thus, new and old modules can co-exist without breaking shared experimental workflows.
Finally, CK's implementation as a simple and open knowledge management SDK makes it easy to integrate with other third-party technology such as IPython, web services, file managers, GUI frameworks, MediaWiki, Drupal, Visual Studio, Android Studio and Eclipse. It can also be extended through higher-level and user-friendly tools similar to TortoiseGit, IPython and phpMyAdmin. We expect that if the community finds CK useful, it will help us improve CK and develop extensions for various practical research and experimentation scenarios.
We hope that our approach will let industry, academia and volunteers work together to gradually improve research techniques and continuously share realistic benchmarks and data sets. We believe that this can eventually enable truly open, collaborative, interdisciplinary and reproducible research, to some extent similar to physics and other natural sciences, Wikipedia, literate programming and the open-source movement. It may also help computer engineers and researchers become data scientists and focus on innovation, while liberating them from ad-hoc, repetitive, boring and time-consuming tasks. It should also help solve some of the big data problems we have faced since 1993 by preserving predictive models (knowledge) and finding missing features, rather than keeping large and possibly useless amounts of raw data.
List of public repositories, data, modules and actions (customizable and multi-objective autotuning, realistic benchmarks and workloads, experiment crowdsourcing, predictive analytics, co-existence of multiple tools, interactive graphs and articles, etc.):
You can read more on our motivation behind Collective Knowledge and previous versions of our collaborative experimentation and knowledge management frameworks (Collective Mind, cTuning) in the following recent publications:
- "Collective Knowledge: towards R&D sustainability", Grigori Fursin, Anton Lokhmotov and Ed Plowman, to appear at DATE'16 (Design, Automation and Test in Europe), Dresden, March 2016 [PDF]
- "Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science", 18th International Workshop on Compilers for Parallel Computing (CPC'15), London, UK, January 2015 [CK-powered interactive article with PDF and BIBTEX]
- "Collective Mind: Towards practical and collaborative auto-tuning", Journal of Scientific Programming 22 (4), 2014 [CK-powered interactive article with PDF and BIBTEX]
- "Community-driven reviewing and validation of publications", TRUST'14 at PLDI'14, Edinburgh, UK, 2014 [PDF and BibTex]
- "Collective Tuning Initiative: automating and accelerating development and optimization of computing systems", Proceedings of the GCC Summit, Montreal, Canada, 2009 [PDF and BibTex]
We are gradually sharing all artifacts and experimental workflows from our past publications as reusable and customizable CK components.
If you find CK useful, feel free to reference any of the above publications in your reports.
CK development is coordinated by the non-profit cTuning foundation and dividiti.