\TOWRITE{NT/...}{Finalise}
\TOWRITE{ALL}{Proofread concept and approach pass 2}
\subsection{Concept and Methodology}\label{sec:concept_methodology}
\eucommentary{5-8 pages}
\eucommentary{
-- Describe and explain the overall concept underpinning the project.
Describe the main ideas, models or assumptions involved. Identify
any trans-disciplinary considerations;
-- Describe and explain the overall approach and methodology, distinguishing, as
appropriate, activities indicated in the relevant section of the work programme, e.g.
Networking Activities, Service Activities and Joint Research Activities, as detailed in
the Part E of the Specific features for Research Infrastructures of the Horizon 2020
European Research Infrastructures (including e-Infrastructures) Work Programme 2014-
2015;\\
-- Describe how the Networking Activities will foster a culture of co-operation between the
participants and other relevant stakeholders.\\
-- Describe how the Service activities will offer access to state-of-the-art infrastructures,
high quality services, and will enable users to conduct excellent research.\\
-- Describe how the Joint Research Activities will contribute to quantitative and qualitative
improvements of the services provided by the infrastructures.\\
-- As per Part E of the Work Programme, where relevant, describe how the project will
share and use existing basic operations services (e.g. authorisation and accounting
systems, service registry, etc.) with other e-infrastructure providers and justify why such
services should be (re)developed if they already exist in other e-infrastructures. Describe
how the developed services will be discoverable on-line.\\
-- Where relevant, describe how sex and/or gender analysis is taken into account in the
project's content.}
\subsubsection{Concept}\label{sec:concept}
Open Science is the principle that science, in order to be most
\textbf{impactful} and \textbf{socially responsible}, should be done \textbf{publicly}, with as
much of the scientific process and products \textbf{accessible, reviewable,
and reusable} by as many members of the global community as possible.
In the modern age of computational science, almost all academic
fields, from the humanities to the social sciences to biology and astronomy,
are presented with exciting opportunities for Open Science. As more and
more research takes the form of code and/or data, the opportunity to
share, reproduce, and reuse scientific work is greater than ever, even
enabling new forms of \textbf{interdisciplinary collaboration}.
Alongside these exciting opportunities come corresponding
challenges, technical and social, to making Open
Science a practical reality. We face big questions: if a researcher
has code and/or data to publicise, how is that best done? How do
researchers learn \textbf{Open Science best practices} in their field? How do
previously disconnected fields benefit from each other's work as the
same computational challenges are faced again and again by different
communities?
These are the questions that guide \TheProject.
With so much research aiming to be Open,
how can we make Open Science
\begin{enumerate}
\item as \textbf{easy} as possible to share?
\item as \textbf{useful} as possible to other researchers and the public?
\end{enumerate}
\noindent Our plan for \textbf{improving access and effectiveness of Open Science} can be summarised as:
\begin{enumerate}
\item improve and maintain \textbf{common software infrastructure} used for
Open Science,
\item develop the Jupyter ecosystem's capabilities to \textbf{better
  serve Open Science},
\item \textbf{guide, validate, and demonstrate} our developments through
collaboration with a wide variety of application domains,
\item enable students and researchers to perform Open Science through
\textbf{training and education}, improving inclusiveness by focusing
these efforts on under-served and under-represented communities, and
\item operate services to facilitate Open Science collaborations with
Jupyter software.
\end{enumerate}
\medskip
\subsubsection{Project Jupyter and the surrounding ecosystem}
\label{sec:project-jupyter}
\begin{figure}[htb]\centering
\includegraphics[width=0.9\textwidth]{use-cases-binder-logbook-solution.png}
\caption{A typical use case for Jupyter notebooks in research.
Image by Juliette Belin for the OpenDreamKit project, used under
CC-BY-SA.}\label{fig:use-cases-binder}
\end{figure}
\noindent\textbf{Jupyter ecosystem as the root of \TheProject}
\TheProject has chosen to centre its efforts on the Jupyter software
ecosystem. Figure~\ref{fig:use-cases-binder} summarises a typical use
case of Jupyter Notebook and Binder; both are described in more detail
below.
The Jupyter notebook and Jupyter ecosystem are of increasing
importance in computational science and data science, in academia,
industry, and services. In addition to supporting high productivity of
researchers, they have great potential to push Open Science forward:
the notebook provides a complete description of a computational and
data science study (Step 1 in figure~\ref{fig:use-cases-binder}), and the notebook can -- in principle -- be turned
into a publication, or can be used to provide the required computation
for a part of a publication, such as a figure
(Step 2 in figure~\ref{fig:use-cases-binder}). Once the researcher has
specified what software is required to execute the notebook (Step 3
in figure~\ref{fig:use-cases-binder}), the study is completely
reproducible by anyone (Step 4 in figure~\ref{fig:use-cases-binder}).
In this way, the notebook enables reproducibility of complex tasks
with hardly any additional effort on the user side.
The Binder project allows such notebooks to be executed in
tailored computational environments, an aspect of reproducibility that
is not yet widely supported
and a great opportunity for improving best practices in Open Science.
Furthermore, whether connecting to a Jupyter notebook server on their
own machine or to a server somewhere else on the Internet, users need
only a web browser to display and use the notebook, regardless of the
location of the notebook server,
allowing computation to run anywhere from a local laptop to a remote supercomputer or the cloud.
Because of these characteristics,
the Notebook is already planned to become an
important service on the European Open Science Cloud (EOSC) (for
example in \cite{panosc}),
and is an ideal component to use when building Open Science Services.
\medskip\noindent\textbf{Project Jupyter}
\emph{Project Jupyter} \cite{Jupyter} has become the \emph{lingua franca} of interactive
computing in both academia and industry. The main goal of Project Jupyter
is to provide a consistent set of tools to improve researchers'
workflows from the exploratory phase of the analysis to the communication
of the results \cite{Kluyver2016}.
Split in 2014 from the \emph{IPython Project} \cite{IPython}, Jupyter has grown rapidly in
popularity and adoption. We estimate the user
base of the Jupyter notebook to be in the millions \cite{jupyter-grant}. Users range from data
scientists to researchers, educators, and students from many fields,
including journalists and librarians. In 2017, the Jupyter
team was awarded the \emph{ACM Software System Award}, an annual award that
honors individuals or organizations \emph{``for developing a software system that had a
lasting influence''}. Prior recipients include \emph{Unix}, \emph{TCP/IP}, and
the \emph{World Wide Web} \cite{acm-award}.
A large number of discrete software components make up Project Jupyter.
While these interact with one another, many can be installed separately
to serve various use cases. For this proposal, we loosely divide the
software involved into \emph{Jupyter core} developed under the guidance
of the developers who started the project, and the broader \emph{Jupyter
ecosystem} including software developed by third parties,
which may interact or build upon core Jupyter components.
Some of the components and concepts important to \TheProject are detailed below.
\begin{figure}[ht]\centering
\centering
\includegraphics[width=0.9\textwidth]{spectrogram_smaller.png}
\caption{A notebook document in the Jupyter Notebook interface.}\label{fig:notebook-screenshot}
\end{figure}
\medskip\noindent\emph{Jupyter core}
\begin{itemize}
\item The \textbf{Jupyter Notebook} is the flagship application of Project Jupyter.
It allows the creation of notebook documents, containing a mixture of text and
interactively executable code, along with rich output from running that code.
Figure \ref{fig:notebook-screenshot} shows an open notebook including graphs
from an audio processing example. Notebook documents are readily shareable,
providing a popular way to describe and illustrate computational methods and
tools.
\textbf{JupyterLab} is the new, modular, extensible client application
for Jupyter notebooks, but the document format, server, and user model are the same.
\item \textbf{Jupyter kernels} are the backend software which allow Jupyter to execute
code in many different programming languages. The \textbf{IPython} kernel is
the reference kernel, supporting the Python programming language, and is
developed by the Jupyter core team. Kernels for other languages are maintained
by third parties.
\item \textbf{nbconvert} converts notebook files to a variety of other file
formats, including HTML and PDF, so that the content of a notebook can easily
be shared with people who don't have Jupyter software (see the sketch after
this list). nbconvert also powers
\textbf{nbviewer}, a web service which provides static HTML views of publicly
accessible notebooks.
\item \textbf{JupyterHub} is a multi-user extension of the Jupyter Notebook.
It runs on one or more notebook servers, for example at a research institution.
Users can log in to author and run notebooks securely through their web
browser, without needing to install any special software on their own
computer.
\end{itemize}
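As an illustration of how these components can be scripted, the following
minimal sketch (file names are placeholders) uses nbconvert's Python API to
render a notebook as a standalone HTML page:
\begin{verbatim}
from nbconvert import HTMLExporter

# Convert a notebook file into a standalone HTML document that can be
# read without any Jupyter software installed.
exporter = HTMLExporter()
body, resources = exporter.from_filename("notebook.ipynb")
with open("notebook.html", "w") as f:
    f.write(body)
\end{verbatim}
The same exporter classes power the nbviewer service mentioned above.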
\medskip\noindent\emph{Jupyter ecosystem}\label{jupyter-ecosystem}
While Jupyter is a large, distributed, coordinated project,
the wider community of Jupyter users develops a great deal of
software with Jupyter integration,
providing additional or domain-specific functionality,
building on top of Jupyter, or integrating core Jupyter components.
We call this the \textbf{Jupyter ecosystem}.
The broader Jupyter ecosystem includes many more projects than we will describe
here, but a selection of projects which are relevant to
\TheProject includes:
\begin{itemize}
\item \textbf{Binder} builds on JupyterHub to allow sharing executable
environments along with data files and a description of the software components
required to run the notebooks. When someone accesses a Binder repository,
the service builds the computational environment on demand, allowing them to
execute and modify a copy of the notebooks.
\textbf{repo2docker} \cite{repo2docker} and \textbf{BinderHub} \cite{binder} are components of the Binder
software.
\item \textbf{nbsphinx} \cite{Nbsphinx} integrates notebooks with the \emph{Sphinx}
documentation system, which is widely used for software documentation,
especially but not only for software written in Python.
This allows developers to write notebooks showing how to use their software,
then seamlessly make those notebooks part of their main documentation.
\item \textbf{nbval} \cite{nbval} is a plugin for the popular \emph{pytest} testing
framework to automatically execute notebooks and optionally check that the
output matches that saved in the file. While this is not a substitute for a
test suite, it's valuable for documentation with code examples in notebooks.
If changes to the underlying tools mean the example no longer
works, testing with nbval will quickly show this, so that either the software
or the example can be corrected. This ensures that example code and
documentation don't get outdated.
\item \textbf{nbdime} \cite{nbdime} provides tools for comparing and merging notebooks.
These integrate with version control systems such as \emph{git}, which
are designed for plain text files and typically don't handle notebook files
well.
\item \textbf{Widgets} allow interactive output in the notebook which can
communicate with the kernel, updating values in the kernel and updating the
displayed output as code runs. \textbf{ipywidgets} \cite{ipywidgets} provides the main
implementation for the IPython kernel, while other packages such as
\textbf{bqplot} \cite{bqplot}, \textbf{ipyvolume} \cite{ipyvolume} and
\textbf{K3D} \cite{K3D} extend the framework to provide 2D and 3D visualisations.
Figure \ref{fig:ipywidgets-example} shows a simple example of interactive
widgets in use; a corresponding code sketch is shown below.
\item The \textbf{Voila} package \cite{Voila} enables the
sharing of notebook-based interactive dashboards for non-technical users.
\item The \textbf{Xeus} infrastructure \cite{Corlay2017} supports writing kernels
in C++. \textbf{xeus-cling} is one such kernel, running user code in C++,
built upon CERN's C++ interpreter ``cling'' \cite{Vassilev2012},
which has significant adoption in the High Energy Physics community.
xeus-cling is already in use for teaching the C++ programming language.
\end{itemize}
\begin{figure}[ht]\centering
\includegraphics[width=0.5\textwidth]{ipywidgets_example.png}
\caption{An example of using two simple slider widgets to explore the
parameter space of a function. The \texttt{@interact} decorator creates
the widgets and connects them to the function.}
\label{fig:ipywidgets-example}
\end{figure}
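The pattern shown in Figure~\ref{fig:ipywidgets-example} takes only a few
lines of code. The following minimal sketch (function and parameter names are
illustrative) generates two sliders from the declared parameter ranges and
re-runs the function whenever a slider moves:
\begin{verbatim}
from ipywidgets import interact

# Tuples of (min, max) are turned into slider widgets automatically;
# the function body is re-executed on every slider change.
@interact(a=(0.0, 5.0), b=(0.0, 5.0))
def product(a=1.0, b=2.0):
    return a * b
\end{verbatim}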
\medskip
\noindent\textbf{Jupyter as a basis for web services}\\
Because the Jupyter notebook is a web-based application, it can be
deployed at computational facilities or in the cloud, and can function
as the basis for services exposing computational resources of all
kinds to researchers and the public. Because Jupyter is
\textbf{interactive}, it enables making scientific results and
communications more interactive than static publications. The
audience can follow their own initiative and ask their own questions
of published data without needing support from the publishing author,
greatly facilitating the \textbf{practicality of Open Science}.
\medskip
\noindent\textbf{Jupyter is generic}\\
\TheProject chose Jupyter because it is
generic: Jupyter makes no domain-specific or even language-specific
assumptions. Any application where mixing description, code, and
results is valuable can make use of Jupyter. This broad applicability
makes investment in the Jupyter ecosystem extremely effective, because
improvements to Jupyter can serve many communities simultaneously.
Jupyter is built from a collection of standard protocols and file
formats. Jupyter is not just a single, monolithic piece of
software, but a description of how such software can be built. The
result is the ability for a variety of communities and applications to
use components of Jupyter for their purposes, and/or reimplement pieces to
meet their needs.
%
For example:
\begin{enumerate}
\item The notebook file format is a well-specified JSON document,
which can be interpreted by many systems. This has facilitated the
development of different services providing rendering of notebooks, e.g. the code
hosting website GitHub, which renders notebooks for easy viewing by
anyone, without Jupyter software (see the sketch after this list).
\item The Jupyter protocol describes how execution is performed, which
has enabled the development of over one hundred kernel
implementations in dozens of languages\footnote{\url{https://github.com/jupyter/jupyter/wiki/Jupyter-kernels}}.
\item Output in the Jupyter protocol uses web-standard MIME types,
enabling any possible format to be an output in a Jupyter notebook.
\item The JupyterLab extension system provides a framework for building
applications from Jupyter components and others.
\item The Jupyter Widgets provide a system for customizing and
extending interactivity in Jupyter-based environments.
\end{enumerate}
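To make the first point concrete, the following minimal sketch (the file name
is a placeholder) reads a notebook with the \texttt{nbformat} reference
library and walks over its cells, which is essentially what any third-party
rendering service needs to do:
\begin{verbatim}
import nbformat

# A notebook file is a JSON document; nbformat parses it into a
# simple structure of typed cells.
nb = nbformat.read("notebook.ipynb", as_version=4)
for cell in nb.cells:
    print(cell.cell_type, cell.source[:40])
\end{verbatim}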
The popularity of Jupyter, with millions of users and hundreds of open
source contributors, is an indicator of the value and impact of this approach.
\medskip
\noindent\textbf{Improvement to the Jupyter ecosystem}\\
The benefits of focusing our work on a mature system like Jupyter include:
\begin{itemize}
\item a vibrant community ensures health and sustainability,
\item a large existing user base maximises the impact of contributions,
\item a mature software ecosystem maintains quality software through
industry standards such as version control, tests, continuous
integration, stable release cycles, roadmaps, and user support.
\end{itemize}
The Jupyter community aims to be inclusive, and \TheProject fully
embraces and supports that approach. Jupyter is inclusive across a number of axes.
By being applicable across numerous domains, Jupyter and \TheProject
encourage participation from individuals of diverse interests and
backgrounds, and have taken action to improve diversity in the project
by participating in ``Outreachy,'' a program of paid internships for
individuals from groups that face under-representation, systemic bias,
or discrimination. Jupyter has also operated workshops focused on
training contributors from under-represented groups. In being free,
public, open source software, Jupyter and \TheProject are accessible
to as many individuals as possible, and welcome users and contributors
regardless of origin, nationality, beliefs, or orientation. One area where
Jupyter has been lacking in this regard is user interface
accessibility, and we will help improve this in
\taskref{core}{accessibility}. Additionally, the project will
focus some of its workshops in \taskref{education}{workshops} on
under-represented communities.
\begin{figure}[ht!]\centering
\includegraphics[width=0.6\textwidth]{images/notebook_components.png}
\caption{The architecture of the Jupyter Notebook, kernels, and tools
which operate on notebook files.}
\label{fig:notebook-architecture}
\end{figure}
\medskip
\noindent\textbf{Related projects}
EOSC-hub is a 33 million Euro H2020 project that started in January 2018 with
the involvement of over 100 institutes. Over three years, the project is
establishing the first elements of the European Open Science Cloud. EOSC-hub
defines, creates and operates the integration and management system of the
EOSC. This integration and management system (the Hub) builds on mature
processes, policies and tools from the leading European e-infrastructures to
cover the whole life-cycle of services from planning to delivery. Through this
management system, online and `human' services, software and data are delivered
to researchers via a single EOSC Portal. The Marketplace already includes
nearly 50 services from EOSC-hub, provided by 3 e-infrastructure communities
(EGI, EUDAT, INDIGO-DataCloud) and by 18 Research Infrastructures and
scientific service providers. The catalogue of services is expected to
grow substantially in the coming years through national, regional and EU
initiatives.
Integrating Jupyter-based services into EOSC provides an excellent opportunity
for facilitating interoperability of EOSC services,
bringing data and computation together in a flexible environment.
\subsubsection{Methodology}\label{sec:methodology}
\textbf{Proposed improvements to core components of Jupyter (\WPref{core})}\\
We plan to make technical changes to Jupyter software to better support
real-time collaboration (\taskref{core}{collaboration}),
so that two or more people in different places or working on different
devices can work together
on the same notebook. This would significantly enhance the value of
notebooks for collaborative research.
We will also work on making Jupyter software accessible to as broad a
range of users as possible (\taskref{core}{accessibility}).
Further work to bring the code behind JupyterHub and Binder closer together
(\taskref{core}{jh-bh-conv}) will bring a range of benefits, allowing more
flexible sharing of notebooks along with access to remote computing resources
such as those available through EOSC.
Finally, we are explicitly allocating time in \WPref{core} for maintaining
Jupyter software, as well as for new development (\taskref{core}{maintenance}).
Maintenance is crucial to creating reliable, sustainable software,
but its cost is often swept under the rug in funding applications
because of the perceived pressure to focus on novelty.
Being up front and explicit about this cost is critical to the sustainability
of open source Open Science.
\medskip
\noindent\textbf{Proposed improvements to the Jupyter ecosystem (\WPref{ecosystem})}\\
We further propose improvements to the wider Jupyter ecosystem for
better scientific workflows. In particular, we have identified
possible improvements to:
\begin{itemize}
\item Binder and its crucial software component \emph{repo2docker}
(\taskref{ecosystem}{r2d-and-binder}).
\item Xeus, to better support the C++ programming language in notebooks
(\taskref{ecosystem}{xeus-cpp}).
\item Interactive widgets, including tools for 3D visualisation to help
people make sense of large amounts of data
(\taskref{ecosystem}{jupyter-widgets}).
\item Archiving of computational environments to allow reproducible research
with a focus on the long term (\taskref{ecosystem}{reproducibility}).
\item Tooling and guidelines
for using notebooks in education
(\taskref{ecosystem}{teaching-tools}).
\end{itemize}
We may create new open source software projects in these tasks,
but we will carefully review existing software, both in the
Jupyter ecosystem and beyond, to avoid unnecessary duplication of effort.
\medskip\noindent\textbf{Beyond improvements to Project Jupyter
(\WPref{applications}, \WPref{eosc}, \WPref{education})}\\
Beyond improving the Jupyter core and ecosystem software for EOSC, we plan on:
\begin{itemize}
\item Designing, implementing, applying, demonstrating and
evaluating new innovative EOSC services
in multiple demonstrators, covering research fields such as
health, astrophysics, photon and neutron science, geosciences and
mathematics, as well as the interests of participating SMEs (\WPref{applications}).
\item Operating a \emph{European Binder Service} on the EOSC-Hub and
enabling provision of Jupyter Services through the EOSC-Hub (\WPref{eosc}).
\item Producing \emph{training and education material} to disseminate
the ability to do reproducible computational science using the tools
we develop, among others (\WPref{education}).
\end{itemize}
\medskip
\noindent
\textbf{The science
demonstrators}\label{sec:science-demonstrators-in-concept}\\
We describe the context and challenges for each demonstrator in this
section. The particular planned activities are shown in the
corresponding tasks in \WPref{applications}.\\
\noindent\textbf{Demonstrator: Astronomy (\taskref{applications}{astro})}\label{sec:concept-demonstrator-astronomy}\\
The \href{http://cdsweb.u-strasbg.fr/}{Strasbourg Astronomical Data Center} (CDS) is a scientific data
center hosted by the Observatory of Strasbourg. The CDS plays a unique and
essential role in astronomy by adding value to published and reference data.
CDS runs astronomical services that
provide data for the world-wide astronomy research community. Its three main
services (SIMBAD, VizieR and Aladin) are heavily used with up to one million
queries per day. These services can be accessed through web interfaces, mainly
for human interaction, as well as through programmatic interfaces, including
the standardized protocols defined by the International Virtual Observatory
Alliance \cite{ivoa}.
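As one possible illustration of such programmatic access, the following
minimal sketch uses the community-maintained \texttt{astroquery} package
(used here purely for illustration, not a deliverable of this proposal) to
query SIMBAD from a notebook; the result is an astropy table ready for
further analysis:
\begin{verbatim}
from astroquery.simbad import Simbad

# Query the SIMBAD service for a well-known object; the result is
# returned as an astropy Table that can be filtered and plotted.
result = Simbad.query_object("M 31")
print(result)
\end{verbatim}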
\begin{figure}[ht!]\centering
\includegraphics[width=0.6\textwidth]{python-astro-citations}
\caption{Mentions of programming languages in refereed Astronomy papers, extracted from ADS. Python usage has increased dramatically in recent years.}\label{fig:python-astro-citations}
\end{figure}
Python and notebooks are rapidly increasing in importance for astronomy
research. Indeed, the Python software ecosystem for astronomy has seen
steady growth in recent years, as shown in
figure~\ref{fig:python-astro-citations}. As Python and notebooks integrate
well together, the Jupyter notebook as an analysis tool is becoming a hot
topic in the astronomical world: large surveys like the LSST (Large Synoptic
Survey Telescope) have endorsed the usage of the Jupyter platform for their
data access portal \cite{lsst2017scienceplatform}.\\
We will develop a Jupyter-based framework to efficiently access, explore,
visualize and analyze reference data that are available through CDS services,
as a real-world example of using open astronomy data.
We will provide scientific users with a set of customizable Jupyter notebooks
for visualization and analysis tasks, providing a new level of
interoperability with Python libraries and notebooks, which is in high demand
in the astronomy research community.
The focus is on the following two user stories:
\begin{compactitem}
\item analysis of catalogue data results, up to billions of rows.
Tabular data is the typical output of SIMBAD and VizieR.
\item modular dashboard-like interface providing a top level
interactive view of the available data for a given astronomical
object and enabling loading and analysis of those data.
\end{compactitem}
\begin{figure}[ht!]\centering
\includegraphics[width=1.0\textwidth]{astro-aladin-snapshot}
\caption{Example of astronomical data objects: Simbad sources, XMM and Hubble coverages overlaid on Digital Sky Survey imagery in the vicinity of the Horsehead nebula, and visualized in Aladin Desktop software.}\label{fig:astro-aladin-snapshot}
\end{figure}
Access to the notebooks will be provided as a one-click action from
SIMBAD and VizieR results pages.
Providing a one-click way of visualizing, filtering and analyzing
these potentially large tables will bridge the gap between access and analysis
of the data, with zero installation for the user.
For specific science cases, we will explore rendering of notebooks with
interactive widgets through Voila \cite{Voila}, so as to allow users not familiar with
Python to benefit from the Jupyter notebook framework.
Figure~\ref{fig:astro-aladin-snapshot} depicts typical data objects we want to analyse and interact with in the notebooks: images, catalogue data, and dataset coverages.
These new developments will be highly visible to the large number of astronomers who use the CDS services (50,000 unique visitors per month), and such tools are in high demand among these users.
The CDS expertise in astronomy data and interfaces will be profitably combined with the expertise of \TheProject partners to ensure the deployment of high quality widgets (Simula, WildTree Tech, QuantStack).
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{astro}.
\medskip
\noindent\textbf{Demonstrator: Enriched education with Jupyter (\taskref{applications}{teaching})}\label{sec:concept-demonstrator-teaching}\\
In recent years, Jupyter technologies have been widely adopted
worldwide in higher education -- and even in high schools -- for
teaching in all areas of sciences. The Jupyter notebook indeed
provides a very versatile environment -- with a smooth learning
curve -- for authoring interactive material such as class notes,
exercise sheets, dedicated applets; all the way to complete books
such as those produced by OpenDreamKit for biology, physics, and
mathematics. The interactivity engages the students to take an
active role, for example playing with code, exploring the effect of
tweaking the parameters in a simulation, changing visualizing tools,
adding personal notes. This lets them progressively take ownership
of the material and better understand the issues, and encourages
them to create their own documents and share their experience with
colleagues and teachers.
\begin{figure}[ht!]\centering
\includegraphics[width=.45\textwidth]{images/teaching-cling}\quad
\includegraphics[width=.45\textwidth]{images/teaching-graphs}
\caption{Jupyter based teaching material from Paris Sud. On the
left: an exercise sheet for the course \emph{Introduction to
programming}; this instructor version showcases interactive C++
execution and automatic grading configuration menus. On the right:
interactive slides for a graph theory course.}\label{fig:teaching-cling}
\end{figure}
Success stories include:
\begin{itemize}
\item Berkeley's ``Data 8: The foundations of Data Science'' open
course (\url{data8.org}) which is delivered yearly to thousands of
lower undergraduate students in all majors, scientific or not,
%\item OpenAI Gym (\url{https://gym.openai.com/}),
\item QuantEcon's open interactive book ``Lectures in Quantitative
Economics'' (\url{https://lectures.quantecon.org/}), entirely
authored with Jupyter,
\item 100+ IPython/Jupyter-based MOOCs (Massive Open Online
Courses) on Coursera
(\url{https://www.coursera.org/courses?query=ipython}),
\item Paris Sud's first-year course ``Info 111: Introduction to
Computer Science'' where each year 400 students write their first
lines of code in C++ in a Jupyter notebook (see Figure~\ref{fig:teaching-cling}).
\end{itemize}
École Polytechnique, Université Paris-Sud, and other participants from this
project have been early adopters of these tools (see the description of \site{EP}
and \site{UPSUD}, and also task~\taskref{ecosystem}{teaching-tools}).
We learned the hard way that deploying the Jupyter environment at a
large scale (e.g. for a university) requires specialized expertise
(DevOps, software development, ...), which impedes its adoption
by the broadest possible audience. High quality hosted solutions
(e.g. CoCalc\cite{Cocalc}, Gryd\cite{Gryd}) do exist, but they are not the final solution when one
wishes to exert greater control over private data, integrate with
the local infrastructure (authentication, shared drive, e-learning
environment, dedicated hardware, ...), or use available local computing
resources rather than paid services.
Further improving the Jupyter environment for education and
leveraging it to the greatest degree are therefore key motivations
for the following tasks of this proposal:
\begin{itemize}
\item Tasks~\taskref{core}{jh-bh-conv}
and~\taskref{eosc}{jh-bh-deployment} will greatly ease the
deployment of Jupyter environments, with tight integration into the
existing local infrastructure and full customizability by the
teachers.
\item Task~\taskref{ecosystem}{teaching-tools} will improve the
interoperability with existing e-learning systems, and further
develop teaching aids for, e.g., material sharing,
(self)-evaluation, and grade management.
\item Task~\taskref{applications}{math} will support teaching
in mathematics through better support for real-time interactivity.
\item Task~\taskref{ecosystem}{xeus-cpp} will support teaching
in computer science and scientific programming through
better C++ integration in the notebook, allowing first-year students to focus on the
syntax of the language without distractions such as compiling and
linking a program.
\item Task~\taskref{eosc}{eosc} will ease publication and FAIR
access to course material, which in turn will promote sharing
and collaboration in the education community.
\end{itemize}
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{teaching}.
\medskip
\noindent\textbf{Demonstrator: Visualisation and control of fluid dynamics in
the Jupyter notebook (\taskref{applications}{application-gpu})}\label{sec:concept-demonstrator-gpu}\\
In recent years, the lattice Boltzmann method (LBM) has emerged as an
interesting alternative to more established methods for fluid flow
simulations. Sailfish-cfd \cite{januszewski2014sailfish} is an open
source implementation of the LBM on General Purpose Graphical Processing
Unit (GPGPU) devices. It is written in Python with real-time
generation of CUDA-C code. In order to harness the capabilities of GPGPUs,
one needs access to specialized hardware, which is usually
available to researchers as remote HPC resources. The typical fluid
dynamics research workflow consists of three stages: preparing
boundary conditions, running a simulation, and data analysis. The
first and last stages require a capable and responsive user interface for
manipulation and inspection of 3D data. The Jupyter 3D visualisation
widgets developed in \taskref{ecosystem}{jupyter-widgets} can fulfil
such needs.
Based on previous experience with the K3D-jupyter~\cite{K3D}
widgets, we know that web browser based software can display moderately
sized datasets during the simulation. As datasets become larger,
visualisation in the browser turns out to be nontrivial due to
limitations of the browser itself and the large data transfers required. It is
an open question how much of the data processing should be performed
server-side and what can be done on the client hardware (i.e. in the
widget running in the user's browser). Our
experience suggests that there is no clear answer: it depends on
the size of the data and its nature. For example, the volume rendering
technique can be very effective on the browser side but incurs large data
transfers. One can perform it server-side, in a distributed way if
the simulation uses many nodes, but then the interactivity is limited by
network latency. We will attempt to provide practical
solutions to this issue.
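As a minimal illustration of browser-side volume rendering with
K3D-jupyter (the scalar field below is synthetic; a real workflow would
instead stream simulation output from the server):
\begin{verbatim}
import numpy as np
import k3d

# Synthetic scalar field on a 64^3 grid, standing in for LBM output.
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
field = np.exp(-8 * (x**2 + y**2 + z**2)).astype(np.float32)

# Volume-render the field in the browser; rotation and zooming then
# happen client-side, at the cost of transferring the data once.
plot = k3d.plot()
plot += k3d.volume(field)
plot.display()
\end{verbatim}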
%
\begin{figure}[ht!]\centering
\includegraphics[width=.95\textwidth]{images/k3d_cfd.png}
\caption{An example showing Jupyter based visualization of velocity
magnitude in the blood flow through an aneurysm. It demonstrates
the use of a small interactive widget for selecting the cutting
plane.
Visualization is done by a K3D-jupyter widget~\cite{K3D}.
In this case most of the computations are done using the VTK library on
the server side, and the K3D-jupyter widget is used to display the
colored surface mesh.
}\label{fig:k3d-cfd}
\end{figure}
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{application-gpu}.
\medskip
\noindent\textbf{Demonstrator: Geosciences (\taskref{applications}{geoscience})}\label{sec:concept-demonstrators-geo}\\
The amount of geospatial data from a variety of sources, including satellite observations, 4D simulations and in-situ observations, contributed by volunteers
or state agencies, keeps increasing. In many disciplines, managing this large volume
has become a challenge, and the old approach of downloading datasets for local
analysis has become intractable.
The heterogeneity of the tools used in different institutions to deal with
large geographical datasets makes it difficult for researchers to share the outcome
of their work in a reproducible or interoperable fashion.
In this context, Jupyter is now emerging as a standard exploration tool for
geospatial analysis, climate science and geology, and is being adopted by data providers in these areas.
To mention a few examples:
\begin{itemize}
\item
the \emph{PanGeo} platform \cite{Pangeo2018} (Funded by the NSF, NASA, and the
Alfred P. Sloan Foundation) is built upon Jupyter, JupyterHub, Binder, and Dask.
\item
the \emph{Joint Research Centre Earth Observation Data and Processing Platform}
(JEODPP) \cite{Soille2018} relies on Jupyter, JupyterHub and ipyleaflet as
its main user interface (see also figure~\ref{fig:jeodpp} on page \pageref{fig:jeodpp}).
\item
the \emph{Google Earth Engine} platform also offers a Jupyter-based user
interface allowing the visual exploration of the data with ipyleaflet
\cite{GEEJupyterLeaflet2017}.
\end{itemize}
In these three cases, deferred processing is used to restrict computation to
the extent of the area displayed in the map viewer, which has allowed these
platforms to scale up to petabytes of data. In all examples, interactive
visualization is a key feature of the platform. Beyond tile-based
2-D visualization, the ability to efficiently process and visualize vector
or 3-D data is also becoming critical.
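A minimal sketch of the kind of map-viewer interaction described above,
using ipyleaflet (the coordinates are illustrative); map tiles are fetched
lazily for the area currently displayed:
\begin{verbatim}
from ipyleaflet import Map, Marker

# Interactive map centred on Oslo; panning and zooming fetch only
# the tiles needed for the visible area.
m = Map(center=(59.94, 10.72), zoom=9)
m.add_layer(Marker(location=(59.94, 10.72)))
m  # displayed as an interactive widget in the notebook
\end{verbatim}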
The \TheProject team, which comprises the main authors of the technologies upon
which these platforms are built (Jupyter, JupyterHub, Binder, ipyleaflet),
together with the Department of Geosciences of the University of Oslo, is
in a unique position to bring these technologies together in the context of
EOSC.
This demonstrator will focus on tools for two transversal research projects:
\begin{itemize}
\item \href{https://www.mn.uio.no/geo/english/research/groups/latice/}{LATICE} (Land-Atmosphere Interactions in Cold Environments)
\item \href{https://www.mn.uio.no/geo/english/research/groups/earthflows/}{EarthFlows} (Interface Dynamics in Geophysical Flows)
\end{itemize}
The work items for this demonstrator fall into two main categories:
visualization and geographical data processing tools. Data will not be produced as part of \TheProject. The University of Oslo follows the ``open as standard'' policy and all data used for \TheProject will be publicly available through:
\begin{itemize}
\item the \href{https://archive.norstore.no/}{Norwegian Research Data archive},
\item \href{https://zenodo.org/}{Zenodo} for smaller datasets and for datasets used for teaching,
\item SQL requests for in-situ observations stored in local databases (for instance collected during field campaigns),
\item the Earth System Grid Federation (ESGF) for all climate data,
\item \href{https://www.copernicus.eu/en/access-data}{Copernicus data portal} for satellite observations.
\end{itemize}
Beyond their use in scientific research, these developments will be used in
the classroom for teaching master's students best practices in Open
Science.
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{geoscience}.
\medskip
\noindent\textbf{Demonstrator: Nuclear Medicine dosimetry (\taskref{applications}{opendose-analysis})}\label{sec:concept-demonstrators-opendose}\\
% Scientific description
Nuclear Medicine is a field of medicine where radioactive material
(radiopharmaceutical) is used for diagnostic and therapy. The OpenDose
project \cite{Chauvin2017} is a collaborative effort to generate a reference
database, freely available, proposing dosimetric data applicable in a context
of nuclear medicine dosimetry. A major aspect of the project is the
development of tools ensuring traceability and reproducibility of generated
results.
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{images/opendose_framework.png}
\caption{OpenDose project overall framework including the unified data
analysis to be developed in this demonstrator.}
\label{fig:opendose_framework}
\end{figure}
% Technical description
OpenDose data is produced using the five Monte Carlo simulation software
tools most widely used in medical applications: Geant4/GATE, MCNP, EGS, PENELOPE
and Fluka. Each simulation consists of calculating radiation transport in
anthropomorphic models for specific parameters (source organ, particle type,
energy, model and number of primaries to simulate). Every simulation produces
binary (3D matrices) and ASCII files, for a total of $\sim$150\,MB per simulation.
The 3D matrices contain the energy deposited per voxel, and the ASCII files contain
pre-processed data corresponding to the energy deposited per region, such as
organs and tissues. These raw outputs are later processed into dosimetric
data such as Specific Absorbed Fractions (SAFs) and S-values.
Producing data for one model (e.g. adult female) requires $\sim$30,000
simulations, with the workload shared between the different teams and
software.
The data produced by all the teams is currently centralised at the Cancer
Research Center of Toulouse (CRCT), processed and fed into a local SQL
database at CRCT.
This collaborative effort raises some challenges:
\begin{compactitem}
\item Data production: a total of 750,000 hours of CPU time is needed per
model.
\item Volume of data: one model represents terabytes of raw data, which can be
heterogeneous across the different teams.
\item Data analysis: raw data has to be processed into dosimetric data in a
robust and reproducible way.
\item Database: has to be efficient and handle all the data (raw and
processed).
\item Visualization: display and compare results from all teams.
\end{compactitem}
Figure \ref{fig:opendose_framework} shows the overall framework of the
project and how data will be managed.
By building a set of tools to access and process data within the Jupyter
ecosystem, we will ensure the production of traceable and reproducible
dosimetric data for the OpenDose project members.
Another major aspect of the OpenDose collaboration is to provide open
access to the generated dosimetric data. For that purpose a website is under
development to allow data download and simple dosimetry calculations. For
users who need more advanced calculations, a dedicated Jupyter workspace will
provide a set of tools to easily access, process and display the OpenDose
data.
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{opendose-analysis}.
\medskip
\noindent\textbf{Demonstrator: Interactive Mathematics with Jupyter Widgets (\taskref{applications}{math})}\label{sec:concept-demonstrator-math}\\
Computations have long played an ever increasing role in
research and teaching in (pure) mathematics: to explore, search for and
check conjectures, or to better understand algorithmic ideas. This
has led to the development of a whole ecosystem of mathematical
software, much of which is open source. Given the huge variety of
mathematical objects and workflows, the Read-Eval-Print-Loop (REPL)
paradigm -- on which Jupyter is based -- is particularly suitable:
the user interacts with the system by typing commands that use its
library of mathematical features, often combined with personal code.
In fact, the REPL and notebook paradigms of Jupyter as well as some
of its interactive features were largely inspired by those of
computer algebra systems such as Maple, Mathematica, or SageMath.
One major action of the OpenDreamKit project was to foster the
convergence between the Jupyter and math software ecosystems:
nowadays Jupyter can be used as a uniform user interface for most
major systems, e.g. GAP, OSCAR, Pari/GP, SageMath, Singular, and
even for C++ libraries. This interface is being widely adopted: for
example, Jupyter has become the standard user interface for
SageMath, enabling its former bespoke notebook to be phased out; by now,
thousands of Jupyter notebooks for SageMath are publicly shared
(6000+ on GitHub alone).
Thanks to this prior work, the mathematical community will
immediately enjoy all the benefits brought by EOSC-based generic
Jupyter services, including eased collaboration, sharing, archival,
and reproducibility.
\begin{figure}[h]
\begin{center}
\includegraphics[height=2.9cm]{images/partitions-of-5}
\includegraphics[height=2.9cm]{images/Wikipic}
\includegraphics[height=2.9cm]{images/hexominoes}
\includegraphics[height=2.9cm]{images/AztecDiamond}
\includegraphics[height=2.9cm]{images/619px-Tiling_Fibonacci_word_fractal}
\medskip
\includegraphics[height=2.5cm]{images/fractioncont}
\includegraphics[height=2.9cm]{images/elliptic-curve}
\includegraphics[height=2.9cm]{images/548px-Fermat's_spiral_01}
\includegraphics[height=2.9cm]{images/597px-E8Petrie}
\medskip
\includegraphics[height=2.9cm]{images/500px-McGee_graph}
\includegraphics[height=2.9cm]{images/nonominoes}
\includegraphics[height=2.9cm]{images/500px-Torus_with_seven_colours}
\includegraphics[height=2.9cm]{images/359px-F4HassePoset}
\end{center}
\caption{Graphical visualizations of a variety of mathematical objects}
\label{fig:math_viz}
\end{figure}
The next step to maximise attractiveness and impact in the
mathematical community, and the aim of this task, is to go
beyond the REPL paradigm and \textbf{leverage the real-time
interactivity and flexibility brought by Jupyter widgets for
mathematical purposes}. This will, for example, make it straightforward for a teacher or
researcher to build and disseminate via the EOSC mini-applications
or dashboards enabling the graphical exploration of a whole range of
mathematical inputs, with real-time visualization of the associated
outputs.
The unique challenge comes from the huge variety of mathematical
objects that the user may want to visualize and interact with, and
the variety of graphical representations (see Figure~\ref{fig:math_viz}). Co-design is central here,
as building a bespoke interactive visualization entails a
combination of technology skills (e.g. JavaScript development) and
business knowledge (designing the interaction and visualization).
The role of Research Software Engineers is to leverage the
technology by encapsulating the technical difficulties into flexible
and easy-to-use toolboxes from which mathematicians can build
mini-applications as innovative services that are tailored to their needs.
Within OpenDreamKit, we conducted experiments to explore this
avenue~\cite{ODK_D4.16}. One specific focus was to enable not only
\emph{interactive visualization}, but also \emph{interactive
editing}: being able to graphically modify the mathematical object
being visualized; this enables the interactive exploration of how the
modifications affect its properties, or to use the editor as an input
widget for a larger application or dashboard. The outcome of that
work was the development of two prototypes in SageMath
(\software{sage-combinat-widget}, a library of widgets for
combinatorics, and \software{sage-explorer}, a generic dashboard for
interactive browsing and introspection of mathematical objects), and
contributions to \software{Francy}, an Interactive Discrete Math
Framework for \software{GAP} and \software{SageMath}.
\TODO{... which are accessible as an EOSC service?}
The particular activities for this demonstrator are shown in
\WPref{applications} in \taskref{applications}{math}.
\medskip
\noindent\textbf{Demonstrator: Reproducible photon science workflows at
European XFEL (\taskref{applications}{reproducibility-xfel})}\label{sec:concept-demonstrator-photonscience}\\
European XFEL is a research facility that provides X-ray Free
Electron Laser (XFEL) light to image structures at the nanoscale. It
is currently the world's most brilliant laser, created in a 3.4\,km
long tunnel, and has supported user experiments since September
2017. The imaging capabilities of European XFEL, and similar
services available from synchrotron and neutron sources, underpin much
fundamental and applied research, in domains ranging from
physics and material science to biochemistry and drug design. Some
example data is shown in figure \ref{fig:photon-science-example}.
All of the data recorded at European XFEL will be made freely
available after an embargo period of three years
\cite{EuXFEL-datapolicy-2017}. This provides scientific transparency
and is expected to enable better exploitation of the data, as more
researchers than those conducting the experiments have access to the
results. If the analysis steps are not carefully recorded, there is a risk
that the necessary understanding of the data is lost by the time it
is made public or subsequently, greatly reducing its scientific
value.
We are keen to complement this open access to the actual data
with open access to reproducible data analysis, to confirm
conclusions drawn and to significantly lower the barriers for
re-analysis with new tools or for new research purposes.
A task in the EC-funded project Photon and Neutron Open Science
Cloud (PaNOSC) is using the Jupyter ecosystem tools as they are in
2019 to provide interactive data analysis services to complement the
data: through use of the Jupyter Notebook and exploitation of the
mybinder.org service, this activity will reduce the barrier to
interactively exploring, understanding and making use of
the data through a central portal such as EOSC.
Here, we combine and use the new developments (\WPref{core},
\WPref{ecosystem}) of this
proposal to enable new qualities of open science services, and to
demonstrate the potential impact of these improvements for a wide
set of EOSC services through a demonstrator in Photon Science.
\medskip
\begin{figure}[tb]
\centering
\includegraphics[height=0.27\textheight]{images/photon-science-prototype1.png}
\includegraphics[height=0.27\textheight]{images/photon-science-prototype2.png}
\caption{Prototypes for data analysis of 2D X-ray detector images
in the Jupyter notebook, relating to the
photon science use case.
% task reference in caption doesn't work
% \taskref{applications}{reproducibility-xfel}.
\emph{(Left)} Data from crystallography
scattering experiment. \emph{(Right)} Azimuthal integration of detector
data as one step in the data analysis workflow.}
\label{fig:photon-science-example}
\end{figure}
\emph{Context}: The very first experiments at European XFEL
produced as little as 45 terabytes of data on average, but as the
facility develops, the amount of data produced per unit time is expected
to grow substantially: given the rate of light pulses, there is the
potential to produce up to a petabyte of data within the beam time
of one experiment (typically one week). These significant amounts of
data need to be complemented by complicated workflows to convert the
data into insight through data analysis. Derived results of such
data analysis are typically much smaller in size and useful to
archive together with the raw data. To explain how they have been
obtained, the particular workflow of data analysis also needs to be
archived.
\medskip
\emph{Vision}:
At European XFEL, it is proposed to use Jupyter notebooks to facilitate
this workflow: the simplest model would be to use one notebook per
workflow. Once the data capture from the experiment is completed,
this notebook can be executed (without being displayed in a web
browser) to start processing the data. When the notebook has
completed execution, it is saved, and contains the analysis results
(it may of course also have created files on disk as part of the
process).
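A minimal sketch of such headless execution (file names and the timeout are
illustrative), using the standard \texttt{nbformat} and \texttt{nbconvert}
APIs:
\begin{verbatim}
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Execute an analysis notebook without a browser; outputs (and any
# error) are embedded in the saved notebook, giving a reproducible
# record of the analysis run.
nb = nbformat.read("analysis.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=3600)
ep.preprocess(nb, {"metadata": {"path": "."}})
nbformat.write(nb, "analysis-executed.ipynb")
\end{verbatim}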
A particularly useful aspect of the notebooks is that they mix data
analysis commands with outputs, and that the notebook provides a
complete (and thus reproducible) summary of the data analysis when
execution succeeds. Should the execution fail, for
example half-way through the notebook, then derived results obtained
prior to the error occurring are preserved and can be inspected. The
error is embedded in the notebook and appears after the command that
triggered it, which helps with debugging the process.
This is of particular interest as the data analysis processes at
European XFEL may fail not because of software errors but due to
variations in the data that require (manual) expert adjustments of
parameters. The ``failure'' of such an analysis workflow
(represented through the Notebook) is thus not exceptional, but a
common occurrence. The scientist conducting the experiment is
sufficiently skilled to modify the parameters and wants to either
re-execute the notebook from the beginning or to continue from the
point of failure. The notebook caters for both use cases. The
modified notebook would need to be preserved of course to provide
reproducibility of the derived results that the notebook has
computed.
We are aiming for re-executability of the notebook for the lifetime
of the data. The lifetime of the archived data at European XFEL is
currently guaranteed for 5 years and aimed to be 10 years
\cite{EuXFEL-datapolicy-2017}. It is possible, though, that data used
for publications will be preserved for longer, and it would be
highly desirable to keep the data analysis re-executable for the
same period of time, potentially well exceeding 10 years.