Open Scientific Data
The concept of scientific data as we know it today developed over the last century in tandem with advances in computing technology, in particular the desktop PC and network connectivity. As computing power increased over the decades, researchers were able to process and analyze experimental data ever faster, while growing data availability made it possible to connect with remote scientific teams.
Open science refers to a contemporary research paradigm characterized by collaborative, publicly accessible scientific inquiry. This approach emphasizes the importance, and the advantages, of unrestricted sharing of research data, methodologies, and findings throughout all stages of the scientific process, from conception to publication. It encourages the adoption of open-access principles, making scholarly literature freely available to scientific communities and to the general public on a planetary scale. Publicly accessible scientific knowledge promotes the reproducibility of scientific results and their overall quality and rigor, enabling continuous verification of research results while fostering knowledge dissemination, enabling open innovation, and increasing public trust in scientific methods.
Infrastructure for open scientific data primarily comprises data repositories, data analysis platforms, indexes, digitized libraries, and digitized archives [1][2]. A data repository is a digital place where experimental data and results can be deposited and stored with the purpose of preserving them and sharing them with other researchers.
Before the computing age, until the middle of the last century, data repositories were physical archives, for instance a library inside a building. As computing power and electronics advanced during the last quarter of the century, libraries began to be digitized [1][2] and stored in data centers. Experimental data, previously kept on paper, started being collected, automatically converted into digital formats, and stored in scientific databases. This changed how science could be done using computing electronics.
Nowadays there is a plethora of platforms whose purpose is to automate and facilitate experimental data analysis. One example is Google Colaboratory (https://colab.research.google.com/), a cloud-hosted notebook platform on which anyone can upload large experimental datasets and model them using artificial intelligence and machine learning algorithms. Another is AutoML (https://www.automl.org/), which provides tools that search for the best machine learning or AI model from a set of predefined candidates.
These resources play a crucial role in facilitating the sharing and accessibility of scientific data and in the implementation and advancement of data-sharing policies[1].
Open scientific data, also known as open research data, is a type of open data in which the observations and outcomes of scientific activities are made available in a publicly accessible format to anyone who wishes to analyze or reuse them. The primary motivation when advocating for open data is to enable broader and faster verification of scientific claims, by allowing others, elsewhere, to examine the reproducibility of results.
According to Christine Borgman [1], there are four main rationales for data sharing commonly advanced in regulatory and public debates about open scientific data: research reproducibility, public accessibility, research valorization, and increased research and innovation.
One of the cornerstones of the scientific method is the ability to reproduce an experiment. This requires a researcher to obtain consistent results when using the same experimental setup, environmental conditions, input data, computational procedures, methods, and code, as well as the same analytical and mathematical conditions. Moreover, research results need to be replicated and reproduced by other researchers who copy and mimic the procedure that produced the original results. In practice, however, this is not an easy task to achieve [5]. A survey by Monya Baker [3] in the field of biology found that more than 70% of researchers were unable to replicate the results of other scientists, and about 60% were unable to replicate their own findings. An article on the Nature website by the American Type Culture Collection (ATCC), titled “Six factors affecting reproducibility in life science research and how to handle them” [4], lists the following factors contributing to the lack of reproducibility:
- A lack of access to methodologies and experiment procedure details, raw data, and research materials
- Use of misidentified or cross-contaminated specimens
- Inability to manage complex datasets
- Poor research practices and experimental design
- Cognitive bias
- Rewarding novel findings and undervaluing negative results
The research work presented here is a step forward in addressing the main reproducibility concerns, in particular the first four items on the list above, by enabling near real-time data auditing through cryptographic identification tagged to experimental data directly at the source, as measurements are collected.
With the turn of the millennium, data, information, and knowledge in general underwent a process of mass diffusion to the everyday citizen. Where in past centuries knowledge was understood as an advantage and a source of power, it came to be valued as a way to a better and healthier life and lifestyle, in particular with regard to work, both as defined in physics and as defined in economics (labour). One result of this unrestricted public diffusion of knowledge can be seen today in the development and deployment of what is known as “renewable energies”, in particular photovoltaics and electricity storage: the ability to reduce the amount of work and power required to complete a task. At the base of this availability of knowledge is one of mankind’s greatest achievements, a planetary communications network operating at light speed, in tandem with the ability to collect, store, and share data instantly. This detail is key to understanding the value of unrestricted public access to information and knowledge, even when knowledge is treated as a competitive advantage. Science is no different: real-time, live exchange of experimental data can only produce more complete scientific work and results, and will more firmly safeguard the authorship of its owners.
Nowadays, the main barriers found regarding access to scientific information are: difficulty accessing experimental raw data; difficulty verifying the authenticity of its origins, where it was produced, and under which boundary conditions; the publication of incomplete data analyses and results; and the difficulty of linking experimental data and results with other similar, or even identical, experimental campaigns.
The research work presented here is a step forward in addressing one of the main concerns stated above, that of linking experimental raw data. By tagging a cryptographic identifier to experimental data directly at the origin, during measurements on individual sensors, it becomes possible to automate further linkage of experimental raw data from the instant it is published in any public data repository, for instance Harvard University’s Dataverse (www.dataverse.org). With the technologies proposed in this work, this can happen almost at the same instant the sensor data is collected in an experimental campaign, with a delay of a few seconds under normal internet connectivity. When the proposed SDAD is configured this way, researchers advertise their raw data findings publicly and in near real time, while establishing a more trustworthy claim of authorship over the data they produce.
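To make the tagging idea concrete, here is a minimal Python sketch of how a cryptographic identifier could be attached to a single sensor reading at the source, assuming the SDAD holds a device-specific secret key. The device ID, key, and record layout are illustrative only, not the actual SDAD firmware:

```python
# Minimal sketch: tagging a sensor reading with a cryptographic
# identifier at the source. Assumes the device holds a secret key;
# all names and values here are illustrative.
import hashlib
import hmac
import json
import time

DEVICE_ID = "sdad-0001"          # hypothetical device identifier
DEVICE_KEY = b"device-secret"    # provisioned secret key (placeholder)

def tag_measurement(sensor: str, value: float) -> dict:
    """Return a measurement record carrying an HMAC-SHA256 tag."""
    record = {
        "device": DEVICE_ID,
        "sensor": sensor,
        "value": value,
        "timestamp": time.time(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["tag"] = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return record

print(tag_measurement("temperature_C", 21.37))
```

Anyone who later obtains the device key (or a corresponding verification service) can recompute the tag and confirm the reading really originated from that device, unmodified.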
In 2023, it makes more sense to store experimental data in individual files rather than on a common centralized database server. The advantages of individual files, such as Microsoft Excel or SQLite files, over a common database stem from the centralized nature of the latter: in a centralized form it is more difficult and more complex to share experimental data, and accessing it offline requires additional software and setup. Another advantage over a centralized DB server is that individual files can be stored in a censorship-resistant manner, for instance on the InterPlanetary File System (IPFS) network. Accessing data stored in individual files is simple and easy nowadays. For big-data records exceeding Excel’s maximum number of rows, SQLite database files are well suited and fully compatible with Python or any other programming language, with more than enough open-source libraries available on GitHub.
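As a minimal illustration of the individual-file approach, the following Python sketch appends sensor readings to a standalone SQLite file using only the standard library; the table and column names are illustrative:

```python
# Minimal sketch: appending sensor readings to a standalone SQLite
# file. No database server is required; the file itself is the store.
import sqlite3

conn = sqlite3.connect("experiment.sqlite")
conn.execute(
    """CREATE TABLE IF NOT EXISTS measurements (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           timestamp REAL,
           sensor TEXT,
           value REAL
       )"""
)
conn.execute(
    "INSERT INTO measurements (timestamp, sensor, value) VALUES (?, ?, ?)",
    (1672531200.0, "temperature_C", 21.37),
)
conn.commit()

# Reading the file back needs nothing beyond the sqlite3 module.
for row in conn.execute("SELECT * FROM measurements"):
    print(row)
conn.close()
```

Because the result is a single ordinary file, it can be shared, mirrored, or pinned on IPFS exactly like any other dataset file.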
Table 1: List of data repositories with an API (January 2023)

| Cloud Host | API | Open Source |
|---|---|---|
| Zenodo | Yes (REST) | Yes |
| OSF | Yes | Yes |
| Harvard's Dataverse | Yes | Yes |
| DRYAD | No | No |
| Mendeley | Yes | No |
| Data in Brief | No | No |
| Scientific Data | No | No |
Moreover, in addition to the “big data” advantage of larger datasets when designing an experiment, it is possible to upload in near real time to a data repository (see the list of well-known cloud-based data repositories in Table 1 above). This enables further scientific collaboration and cooperation, with less latency, with anyone working on the same subject in any country on the planet. A research team is no longer bound by limitations of proximity, language, or secrecy.
Furthermore, the proposed experimental data collection method automates many back-office tasks, for instance data authentication, data storage, and data management, and reduces the need to schedule synchronization among collaborating researchers. A data repository’s API (see Table 1) needs to be capable of real-time synchronization of raw sensor data; for instance, it should accept (a sketch of such a call follows this list):
- an inline string with sensor data, to be appended to an existing CSV file on the repository.
- sensor data plus information on where it should be added in an Excel document file, for instance a specific column/row.
- sensor data plus the table and table fields where it should be added in an SQLite database file.
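A minimal Python sketch of the first kind of call appears below. The endpoint, token, and append semantics are hypothetical; current repository APIs (e.g., Dataverse) typically replace whole files rather than appending to them, which is exactly the gap this requirement points at:

```python
# Minimal sketch: POSTing an inline CSV string so that a repository
# appends it to an existing file. The URL, header names, and append
# behavior are hypothetical, not an existing repository API.
import requests

REPO_URL = "https://repository.example.org/api/files/1234/append"  # hypothetical
API_TOKEN = "xxxx-xxxx"  # placeholder credential

csv_line = "2023-01-01T12:00:00Z,temperature_C,21.37\n"
response = requests.post(
    REPO_URL,
    headers={"X-API-Key": API_TOKEN, "Content-Type": "text/csv"},
    data=csv_line,
    timeout=10,
)
response.raise_for_status()  # fail loudly if the repository rejects the append
```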
In summary, the smart data acquisition device (SDAD) presented here enables a researcher to upload, automatically and in near real time, the experimental data produced by any experiment instrumented with analog or digital sensors connected to this data logger.
In the 1990s, when scientific publishing transitioned to this digital reality, editors and publishers began promoting the sharing of research data by accepting what became known as “supplementary data files”, adding a file upload option to the submission form. These supplementary files were not part of the published article and served only to facilitate the review and evaluation of the work. They were also associated with a single journal publication, and sharing only happened upon formal contact and request. Supplementary data files have always had an ambiguous status as part of a publication, and over time the released datasets became carefully curated, often specifically for publication.
A Data Availability Statement, also known as a Data Access Statement, is a statement about the availability of the data utilized in a particular scientific communication. It includes the type of access, where to access the data, and the type of licensing. In 2018 Lisa Federer et al. [6], and again in 2020 Colavizza et al. [7][8], identified three main categories of data access:
- DAS 1: "Data available on request or similar"
- DAS 2: "Data available with the paper and its supplementary files"
- DAS 3: "Data available in a repository"
Elsevier recommends on its website (August 2023) that authors include publicly accessible data repositories. Springer Nature recommends (August 2023) the same level of access. Sage Publishing’s online platform describes (August 2023) its “Research Data Sharing Policies” and defines three levels of publication acceptance according to the type of access authors give to their data. Many other publishers can be found with a Google search for “Data Accessibility Statement”.
Even when experimental data is publicly available, another researcher can only find it by actively searching the internet as part of the literature and bibliographic research process. Given the current volume of research works and publications, searching for similar works on a search engine severely limits a researcher’s scientific work, in particular in small research teams. Search results are typically displayed as a short list of ten entries and refer to static publication documents and data files in repositories, with no information about past or future updates and revisions. This limits the diffusion of experimental research knowledge: there is no information about document and data file revisions, and subsequent internet searches are required throughout the duration of a research project to keep the project’s local knowledge base up to date.
The first notification systems appeared in the 1980s on computers running operating systems such as AmigaOS and Windows 1.0, when the first taskbars were introduced. However, it was only with Windows 95 that the term became widely known and used. The notification area is usually located at the bottom or top of the screen and initially served to provide short text notifications about tasks running in the background of the system. Nowadays it serves all kinds of purposes, in particular the ability to receive a notification from another device connected to the network or the internet, referred to as a push notification. This type of notification is typically used to deliver updates, information that requires special attention, or requests for immediate human interaction. Notifications can be classified into update notifications, location-triggered notifications, real-time notifications, subscription notifications, achievement notifications, reminder notifications, personalized notifications, and time-aware notifications. Sahami et al. [9] analyzed approximately 200 million notifications from more than 40,000 users; their work included users’ subjective perceptions in the analysis, and five major findings are presented as recommendations for developers on how to use notifications effectively. Mehrotra et al. [10] present a classification model that predicts notification acceptance using the notification’s content and context as inputs. Pielot et al. [11] introduced a machine-learning model that can anticipate whether a user will read a message within a few minutes of receiving a notification. In another research work, Pielot et al. [12] developed a machine-learning model to predict whether a user would click on a notification and engage with its content; such a model can assist in determining the best moments to send notifications.
Publicly accessible research data enables researchers to include external experimental data in their own work; however, current data repositories serve only the purpose of storing data. For the data to be trustworthy, more functionality is required on those platforms: the ability to verify, in the datasets themselves, the origins of the experimental data, i.e., how the measurements were made and collected. Another requirement is the ability to advertise and disseminate uploaded datasets to individual researchers, research teams, and communities. In 2023 there is a plethora of tools and apps for that purpose. The most commonly used is the subscription method, where researchers add themselves to a subscription list by providing a personal email address. There is also the notification method, in particular push notifications, which are likewise subscription-based but, instead of an email address, require a piece of software installed on a device (a computer or any other) that listens for notifications broadcast from a specific notification server.
The research work presented here is a step forward in enabling automated, time-aware notifications about experimental data updates. For instance, it is possible to configure a notification subscription list so that an update to a public data repository triggers a notification to all subscribers of that list.
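As an illustration, the following minimal Python sketch pushes a notification to a subscription topic when a dataset upload completes, using the public ntfy.sh service as one possible transport. The topic name and DOI are placeholders, and this is not necessarily how the SDAD implements it:

```python
# Minimal sketch: notifying subscribers of a dataset update via the
# ntfy.sh push service. Subscribers listen on the same topic with the
# ntfy app or a simple HTTP client; topic and DOI are placeholders.
import requests

def notify_dataset_update(dataset_doi: str) -> None:
    requests.post(
        "https://ntfy.sh/my-experiment-updates",  # illustrative topic
        data=f"New experimental data uploaded: {dataset_doi}".encode(),
        headers={"Title": "Dataset update"},
        timeout=10,
    )

notify_dataset_update("doi:10.7910/DVN/EXAMPLE")
```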
Data redundancy is the term used for identical or duplicate data elements persistently stored and maintained in multiple locations within a database or system. The main advantage of keeping exact duplicate copies of a dataset is that the data remains available in the event of corruption, failure, or deletion of any other copy. Hosting multiple copies of the same experimental data on multiple data servers enables separate audits that examine variance and discrepancies, and reduces the risk of discrepancies and data forging that would otherwise go unnoticed.
With the data acquisition method proposed in this work, experimental data redundancy is built up during an experiment on every dataset uploaded to a data repository, for instance a Dataverse: previously uploaded dataset values are re-uploaded in each subsequent upload together with the newer data. On the repository side, as an experiment advances, the number of available dataset files increases, with measured sensor values overlapping between consecutive files. The last dataset upload holds all data values, each carrying a unique fingerprint ID, linked together in a blockchain-like fashion (see the sketch after the list below). It is important to mention the role of randomized time intervals between dataset uploads as a way to improve data trustworthiness during the acquisition of sensor data. With the proposed data acquisition electronics, the SDAD, it is therefore possible to implement the following functionalities on the device itself:
- randomized dataset uploads to a data repository, made by the SDAD during an experiment and without human intervention.
- dataset uploads upon request made remotely by the data repository.
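As an illustration of the blockchain-like linking described above, the following minimal Python sketch chains each upload’s fingerprint to the previous one, so that altering any earlier upload invalidates every later fingerprint. The sample data and the “genesis” seed are illustrative only:

```python
# Minimal sketch: chaining dataset-upload fingerprints. Each upload's
# fingerprint covers its contents plus the previous fingerprint, so
# tampering with any earlier upload breaks all later ones.
import hashlib

def chained_fingerprint(dataset_bytes: bytes, previous_fingerprint: str) -> str:
    return hashlib.sha256(previous_fingerprint.encode() + dataset_bytes).hexdigest()

# Each upload contains all previous values plus the newest rows,
# mirroring the overlapping-dataset scheme described above.
uploads = [
    b"t0,21.37\n",
    b"t0,21.37\nt1,21.41\n",
    b"t0,21.37\nt1,21.41\nt2,21.44\n",
]
fingerprint = "genesis"  # arbitrary seed for the first link
for i, data in enumerate(uploads):
    fingerprint = chained_fingerprint(data, fingerprint)
    print(f"upload {i}: {fingerprint}")
```

An auditor who re-downloads the dataset files can recompute the chain and detect any file that was altered or removed after the fact.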
(...)
🟢 Fully tested and working
A green circle means the hardware electronics or the programming code has been fully tested, in each of its functionalities and capabilities, and can be installed in a vehicle. Keep in mind this does not mean errors won't happen. As with everything related to electronics and software, there are revisions and updates. This open hardware is no different.
💯 Fully tested & working, no improvements necessary - already being sold online
🆓 Fully Open hardware \ source code
🤪 There are better options than this one; don't use it
🔐 Fully closed hardware \ source code
⚡️ Fully tested and working; however, it is a dangerous solution to deploy
🟡 Not tested. Working capability is unknown; it may or may not work.
A yellow circle means the hardware electronics or the programming code has not been fully tested in each of its functionalities and capabilities. This does not mean it is not working; it simply means testing is needed before giving it the green circle of approval.
🔴 Fully tested but not working.
A red circle means the hardware electronics or the programming code was fully tested and some kind of critical error or fault was found. This means the electronics or firmware code cannot be used in a vehicle.
⌛ Not started.
The hourglass means work on the hardware electronics or the programming hasn't started, most likely because it is waiting for the test components needed for reverse engineering and for engineering the new open solution.
🆕 New updated contents
The new icon means the link next to it was recently updated with new contents
💬 Comments on the Discussion page
The comments icon means there are useful, possibly new, comments on the repository's Discussions page that are important for what you are seeing or reading.
Join the beta program to test and debug, and to provide feedback, ideas, modifications, suggestions, and improvements. In return, write your own blog article or social media post about it. See participation conditions on the Wiki.
The Beta Participant Agreement is a legal document executed between you and AeonLabs that outlines the conditions of participation in the Beta Program.
Bug reports and pull requests are welcome on any of AeonLabs repositories. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
- Contributing
Please make sure tests pass before committing, and add new tests for new additions.
You can get in touch with me on my LinkedIn Profile:
You can also follow my GitHub Profile to stay updated about my latest projects:
The PCB design files I provide here are free for anyone to use. If you like this Smart Device or use it, please consider buying me a cup of coffee, a slice of pizza, or a book to help me study, eat, and think up new PCB design files.
Make a donation on PayPal and get a TAX refund*.
Liked any of my PCB KiCad Designs? Help and Support my open work to all by becoming a GitHub sponsor.
Before proceeding to download any of AeonLabs software solutions for open-source development and/or PCB hardware electronics development make sure you are choosing the right license for your project. See AeonLabs Solutions for Open Hardware & Source Development for more information.