#!/usr/bin/env python
# coding: utf-8
# # The Future of OA: A large-scale analysis projecting Open Access publication and readership
#
#
#
# **Heather Piwowar*<sup>1</sup>, Jason Priem*<sup>1</sup>, Richard Orr<sup>1</sup>**
#
# * shared first authorship
# <sup>1</sup>_Our Research (team@ourresearch.org)_
#
# Preprint first submitted: October 6, 2019
#
# ------
# **Summary**
#
# *Will move the Summary ([Section 4.5](#section-4-5) right now) up to the top here. Is at the bottom right now so it can produce the graphs in it using the code below :)*
#
#
# ------
# <a id="section-1"></a>
# ## 1. Introduction
#
# The adoption of [open access (OA)](https://en.wikipedia.org/wiki/Open_access) publishing is changing scholarly communication. Predicting the future prevalence of OA is crucial for many stakeholders making decisions now, including:
#
# - libraries deciding which journals to subscribe to and how much they should pay
#
# - institutions and funders deciding what mandates they should adopt, and the implications of existing mandates
#
# - scholarly publishers deciding when to flip their business models to OA
#
# - scholarly societies deciding how best to serve their members.
#
# Despite how useful OA prediction would be, only a few studies have attempted to empirically predict open access rates. Lewis (2012) extrapolated the rate at which [gold OA](https://en.wikipedia.org/wiki/Open_access#Gold_OA) would replace subscription-based publishing using a simple log-linear extrapolation of gold vs subscription market share. Antelman (2017) used one empirically-derived growth rate for [green OA](https://en.wikipedia.org/wiki/Open_access#Green_OA) and another for all other kinds of OA combined. Both of these studies are based on data collected before 2012, and rely on relatively simple models. Moreover, these studies predict the number of papers that are OA. While this number is important, it is arguably less meaningful than the number of views that are OA, since the latter describes the prevalence of OA as experienced by actual readers.
# This paper aims to address this gap in the literature. In it, we build a detailed model using data extracted from the large and up-to-date Unpaywall dataset (https://unpaywall.org/). We use the model to predict the number of articles that will be OA (including gold, green, hybrid, and bronze OA) over the next five years, and also use data from the Unpaywall browser add-on (https://unpaywall.org/products/extension) to predict the proportion of scholarly article views that will lead readers to OA articles over time.
#
# This paper aims to provide models of OA growth, taking the following complexities into account:
#
# - some forms of OA include a delay between when a paper is first published and when it is first freely available
#
# - different forms of open access are being adopted at different rates
#
# - wide-sweeping policy changes, technical improvements, or cultural changes may cause disruptions in the growth rates of OA in the future
# <a id="section-2"></a>
# ## 2. Data
# In[1]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# hidden: code to import libraries, set up database connection, other initialization\nimport warnings\nwarnings.filterwarnings(\'ignore\')\n\nimport os\nimport sys\nimport datetime\nimport pandas as pd\nimport numpy as np\nimport scipy\nfrom scipy import signal\nfrom scipy.optimize import curve_fit\nfrom scipy.stats.distributions import t\nfrom matplotlib import pyplot as plt\nimport matplotlib as mpl\nfrom matplotlib import cm\nfrom matplotlib.colors import ListedColormap\nimport seaborn as sns\nfrom sqlalchemy import create_engine\nimport sqlalchemy\nimport psycopg2\nfrom datetime import timedelta\nfrom IPython.display import display, HTML, Markdown\nimport cache_magic\nfrom tabulate import tabulate\n\n# our database connection\nredshift_engine = create_engine(os.getenv("DATABASE_URL_REDSHIFT"))\n\n# graph style\nsns.set(style="ticks")\n\n# long print, wrap\npd.set_option(\'display.expand_frame_repr\', False)\n\n# read from file if available, else from db and save it in a file for next time\n# will also help have data files ready for archiving in zenodo \ndef read_from_file_or_db(varname, query, skip_cache=False):\n filename = "data/{}.csv".format(varname)\n my_dataframe = pd.DataFrame()\n try:\n if not skip_cache:\n my_dataframe = pd.read_csv(filename)\n except IOError:\n pass\n if my_dataframe.empty:\n global redshift_engine\n my_dataframe = pd.read_sql_query(sqlalchemy.text(query), redshift_engine)\n my_dataframe.to_csv(filename, index=False) # cache for the future\n\n return my_dataframe.copy()\n\n\n# make figure captions work. use like this: \n# make a code cell, and include\n# register_new_figure("my-figure-anchor-name") \n# before you want to refer to a figure. This is where the link will go to.\n# and then in text markdown to refer to the figure\n# {{figure_link("my-figure-anchor-name")}}\n\nglobal figures_so_far\nglobal figure_numbers\nfigures_so_far = 1\nfigure_numbers = {}\n\n# inspired by https://github.com/l-althueser/nbindex-jupyter/blob/master/nbindex/numbered.py\ndef leave_figure_anchor(anchor_text):\n key = u"figure-{}".format(anchor_text)\n """\n Adds numbered named object HTML anchors. \n Link to them in MarkDown using: [to keyword 1](#keyword-1)\n """\n return display(HTML(\'\'\'<div id="%s"></div>\n <script>\n var key = "%s"\n $("div").each(function(i){\n if (this.id === key){\n this.innerHTML = \'<a name="\' + key + \'"></a>\';\n }\n });\n </script>\n \'\'\' % (key,key)))\n\ndef register_new_figure(anchor_text):\n global figures_so_far\n global figure_numbers\n if not anchor_text in figure_numbers:\n figure_numbers[anchor_text] = figures_so_far\n leave_figure_anchor(anchor_text)\n figures_so_far += 1\n return figure_numbers[anchor_text]\n\ndef figure_link(anchor_text=None):\n if anchor_text:\n template = "[Figure {figure_number}](#figure-{anchor_text})"\n my_return = template.format(figure_number=figure_numbers[anchor_text], \n anchor_text=anchor_text)\n else:\n my_return = figure_numbers\n return my_return\n \n\n# set up colors\noa_status_order = ["green", "gold", "hybrid", "bronze", "closed"]\noa_status_colors = ["green", "gold", "orange", "brown", "grey"]\noa_color_lookup = pd.DataFrame(data = {"name": oa_status_order, "color": oa_status_colors, "order": range(0, len(oa_status_order))})\nmy_cmap = sns.color_palette(oa_status_colors)\n\ngraph_type_order = ["green", "gold", "hybrid", "immediate_bronze", "delayed_bronze", "closed"]\ngraph_type_colors = ["green", "gold", "orange", "brown", "salmon", "gray"]\ngraph_type_lookup = pd.DataFrame(data = {"name": graph_type_order, "color": graph_type_colors, "order": range(0, len(graph_type_order))})\nmy_cmap_graph_type = sns.color_palette(graph_type_colors)\n\ngraph_type_colors_plus_biorxiv = ["lawngreen"] + graph_type_colors\ngraph_type_order_plus_biorxiv = ["biorxiv"] + graph_type_order\nplus_biorxiv_labels = [\n "green (biorxiv)",\n "green (other)",\n "gold",\n "hybrid",\n "bronze (immediate)",\n "bronze (delayed)",\n "closed"\n]\ngraph_type_plus_biorxiv_lookup = pd.DataFrame(data = {"name": graph_type_order_plus_biorxiv, "color": graph_type_colors_plus_biorxiv, "order": range(0, len(graph_type_colors_plus_biorxiv))})\nmy_cmap_graph_type_plus_biorxiv = sns.color_palette(graph_type_colors_plus_biorxiv)')
# The data in this analysis comes from two sources: (1) the Unpaywall dataset and (2) the access logs of the Unpaywall web browser extension.
#
# <a id="section-oa-vocab"></a>
# <a id="section-2-1"></a>
# ### 2.1 OA type: the Unpaywall dataset of OA availability
#
# Predicting levels of open access publication in the future requires detailed, accurate, timely data. This study uses the [Unpaywall](https://unpaywall.org/) dataset to provide this data. Unpaywall is an open source application that links every research article that has been assigned a Crossref DOI (more than 100 million in total) to the OA URLs where the paper can be read for free. It is built and maintained by Our Research (formerly Impactstory), a US-based nonprofit organization. Unpaywall gathers data from over 50,000 journals and open-access repositories from all over the world. The full Unpaywall dataset is freely, publicly available (see details: <https://unpaywall.org/user-guides/research>).
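#
# For readers who want to spot-check individual records, here is a minimal sketch of fetching one article's OA record from the Unpaywall REST API (the `email` parameter is required by the API; the DOI below is the Piwowar et al. (2018) paper):
#
# ```python
# import requests
#
# # Fetch the Unpaywall record for a single DOI.
# # The API requires an email address as a query parameter.
# doi = "10.7717/peerj.4375"
# resp = requests.get(
#     "https://api.unpaywall.org/v2/{}".format(doi),
#     params={"email": "you@example.com"},
# )
# record = resp.json()
# print(record["oa_status"])         # "gold", "green", "hybrid", "bronze", or "closed"
# print(record["is_oa"])             # True if any free fulltext location is known
# print(record["best_oa_location"])  # metadata of the best OA copy, or None
# ```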
# Our definitions of OA type (gold, green, hybrid, bronze, closed) are described in Piwowar et al. (2018). To facilitate prediction, for the purpose of this analysis we subdivided bronze OA into immediate and delayed OA. In summary, these definitions are:
#
# - **<span style="color:gold; font-size:100%;">█</span> Gold:** published in a fully-OA journal
# - **<span style="color:orange; font-size:100%;">█</span> Hybrid:** published in a toll-access journal, available on the publisher site, with an OA license
# - **<span style="color:brown; font-size:100%;">█</span> Bronze:** published in a toll-access journal, available on the publisher site, without an OA license
# - **<span style="color:brown; font-size:100%;">█</span> Immediate Bronze:** available as Bronze OA immediately upon publication
# - **<span style="color:salmon; font-size:100%;">█</span> Delayed Bronze:** available as Bronze OA after an embargo period
# - **<span style="color:green; font-size:100%;">█</span> Green:** published in a toll-access journal and the only fulltext copy available is in an OA repository
# - **<span style="color:gray; font-size:100%;">█</span> Closed:** everything else
#
# This analysis uses all articles with a Crossref article type of "journal-article" published between 1950 and the date of the analysis (October 2019), a total of 71 million articles.
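#
# The subdivision above can be read as a decision rule. The sketch below expresses it in code purely for illustration; the field names are hypothetical and do not match Unpaywall's actual schema:
#
# ```python
# def classify_oa_type(journal_is_oa, on_publisher_site, has_oa_license,
#                      in_repository, embargo_months, months_since_publication):
#     # Illustrative only: field names are hypothetical, not Unpaywall's schema.
#     if journal_is_oa:
#         return "gold"
#     if on_publisher_site and has_oa_license:
#         return "hybrid"
#     if on_publisher_site:
#         # bronze: split by whether an embargo delayed availability
#         if embargo_months and months_since_publication >= embargo_months:
#             return "delayed_bronze"
#         return "immediate_bronze"
#     if in_repository:
#         return "green"
#     return "closed"
# ```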
# <a id="section-2-2"></a>
# ### 2.2 Article views: access logs of the Unpaywall web browser extension
#
#
# Predicting the open access pattern of usage requests requires knowing the relative usage demands of papers based on their age. This study extracts these pageview patterns from the usage logs of the [Unpaywall browser extension](https://unpaywall.org/products/extension) for Chrome and Firefox.
# In[2]:
register_new_figure("unpaywall_map");
# This extension is an open-source tool made by the same nonprofit that maintains the Unpaywall dataset described above, with the goal of helping people conveniently find free copies of research papers directly from their web browser. The extension has [more than 200,000 active users](http://blog.our-research.org/unpaywall-200k-users/), distributed globally, as shown in {{ print figure_link("unpaywall_map") }}.
#
# <img src="https://github.com/Impactstory/future-oa/blob/master/img/unpaywall%20extension%20users%20by%20location.jpg?raw=true" />
#
# **{{ print figure_link("unpaywall_map") }}: Map of Unpaywall users in February 2019.**
#
#
# The Unpaywall browser extension automatically detects when a user is on a scholarly article webpage -- we consider this an access request, or a view. The extension can be disabled, or can be configured to run only upon request, but very few users use these settings.
#
# The extension received more than 3 million article access requests in July 2019, which we use for most of our analysis. Because readership data is private and potentially sensitive, we are not releasing the Unpaywall usage logs along with the other datasets behind this paper, other than as aggregate counts by OA type and year.
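#
# The released aggregates have a simple shape; here is a sketch of that aggregation with hypothetical column names (the real logs are not public):
#
# ```python
# import pandas as pd
#
# # Hypothetical schema: one row per access request from the extension.
# access_logs = pd.DataFrame({
#     "view_date": pd.to_datetime(["2019-07-01", "2019-07-01", "2019-07-02"]),
#     "oa_status": ["gold", "closed", "green"],
# })
#
# # Release only aggregate counts by OA type and year, never row-level data.
# aggregate_counts = (
#     access_logs
#     .assign(view_year=access_logs["view_date"].dt.year)
#     .groupby(["view_year", "oa_status"])
#     .size()
#     .reset_index(name="num_views")
# )
# ```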
#
# <a id="section-3"></a>
# ## 3. Approach
#
# <a id="section-3-1"></a>
# ### 3.1 Overview
#
#
# The goal of this analysis is to predict two aspects of OA growth:
# 1. Growth in OA articles and their proportion of the literature over time
# 2. Growth in OA article views and their proportion of all literature views over time
# We examine the growth in OA articles *by date of observation*, rather than by date of publication. This requires us to calculate the OA lag between publication and availability for different types of OA, which is done in [Section 4.1](#section-4-1).
#
# Once we have the pattern of OA availability by year, we forecast the OA availability for future years by assuming that it will have the same overall pattern as previous years -- the papers that will be made available next year will have the same age distribution as papers that were made available last year. We allow the absolute number of papers to increase year-over-year: we estimate the future growth multiplier by extrapolating the height of past availability curves. This analysis is presented in [Section 4.2](#section-4-2).
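#
# A sketch of that projection step, with illustrative numbers (the full implementation is in the code cells below):
#
# ```python
# import numpy as np
#
# # Articles first made OA last year, binned by article age at the time
# # they became available (age 0 = made OA in their publication year).
# newly_available_by_age = np.array([120000, 40000, 15000, 5000])  # illustrative
#
# # Growth multiplier extrapolated from the heights of past availability curves.
# growth_multiplier = 1.10  # illustrative
#
# # Assumption: next year's newly available papers follow the same age
# # distribution as last year's, scaled up by the growth multiplier.
# projected_next_year = growth_multiplier * newly_available_by_age
# ```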
# Next, we turn to predicting the growth of OA article views -- what proportion of what is read is available OA, and how will this change in the future? The Unpaywall browser extension logs give us a relative baseline of what is read right now. By assuming that reading patterns remain relatively unchanged over time (specifically the probability that a reader wants to read a paper given its age and OA type), we use the publication estimates we made in previous sections to calculate the relative number of views by OA type in the past and the future. This is described in [Section 4.3](#section-4-3).
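#
# In outline, the views calculation weights each year's articles-by-age curve by a views-per-article-by-age curve and sums; a sketch with illustrative numbers:
#
# ```python
# import numpy as np
#
# # Views per article per year, by article age (from the extension logs).
# views_per_article_by_age = np.array([10.0, 6.0, 3.0, 1.5])  # illustrative
#
# # OA articles available in some observation year, binned by the same ages.
# articles_by_age = np.array([50000, 45000, 40000, 35000])  # illustrative
#
# # Assumption: the probability that a reader wants a paper, given its age and
# # OA type, is stable over time -- so estimated views are a weighted sum.
# predicted_views = np.sum(views_per_article_by_age * articles_by_age)
# ```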
#
# Finally, we look at the impact of extending the model to include a disruptive change, in this case the growth of bioRxiv, in [Section 4.4](#section-4-4).
#
# <a id="section-3-2"></a>
# ### 3.2 Glossary
#
# In addition to the OA types defined in [Section 2.1](#section-oa-vocab), we define additional terms as we use them in this paper, in the approximate order in which they are discussed:
#
# - **Date of publication**: the date an article is published in a journal
# - **Embargo**: the delay that some toll-access journals require between date of publication and when an article can be made Green or Delayed Bronze OA
# - **Self-archiving**: when an author posts their article in an OA repository
# - **OA type**: the OA classification of an article, as defined in [Section 2](#section-oa-vocab). The OA type of an article may change over time (from Closed to Delayed Bronze OA, or from Closed to Green OA) because of embargoes and other self-archiving delays
# - **Date first available OA**: the date an article first becomes an OA type other than "Closed"
# - **OA lag**: the length of time between an article's Date of Publication and its Date First Available OA
# <pre></pre>
# - **OA assessment**: the determination of the OA type of an article at a given point in time
# - **Date of observation**: the point in time for which we make an OA assessment for an article. Explained in [Section 3.3](#section-3-3).
# - **Observation age** of an article: the length of time between an article's date of publication and the date of observation of an OA assessment
# <pre></pre>
# - **View**: someone on the internet visited the publisher webpage of an article, presumably with the hope of reading the article
# - **Date of view**: the date of the view
# - **View age** of an article: the length of time between an article's date of publication and the date of a view
# <pre></pre>
# - **Articles by age curve**: for a given date of observation, the plot of observation age (in years) on the x-axis and the number of articles of that observation age on the y-axis
# - **Views by age curve**: the plot of view age (in years) on the x-axis and the number of views received by articles of that view age on the y-axis
# - **Views per article by age curve**: the plot of view or observation age (in years) on the x-axis and the number of views per article (views of that view age divided by articles of that observation age) on the y-axis; see the sketch after this list
# - **Views per year curve**: the plot of year on the x-axis and the number of views estimated to have been made that year on the y-axis
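#
# To make the last three definitions concrete, here is a toy computation of a views per article by age curve from the two curves above it (all numbers illustrative):
#
# ```python
# import pandas as pd
#
# # Illustrative "articles by age" and "views by age" curves for one OA type.
# curves = pd.DataFrame({
#     "age_years": [0, 1, 2, 3],
#     "num_articles": [50000, 45000, 40000, 35000],  # articles by age curve
#     "num_views": [500000, 270000, 120000, 52500],  # views by age curve
# })
#
# # Views per article by age curve: the pointwise ratio of the two curves.
# curves["views_per_article"] = curves["num_views"] / curves["num_articles"]
# ```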
# <a id="section-3-3"></a>
# ### 3.3 Date of Observation
# In[3]:
register_new_figure("date_of_observation");
# In this paper we approach the growth of OA from the Date of Observation of OA assessment, rather than the date of publication. We explain this with the use of {{ print figure_link("date_of_observation") }}.
#
# <img src="https://github.com/Impactstory/future-oa/blob/master/img/date_of_observation_prediction.jpg?raw=true" style="float:right;" />
#
# **{{ print figure_link("date_of_observation") }}: Date of observation.**
#
#
# Let’s imagine two observers, <span style="color:blue">Alice</span> (blue) and <span style="color:red">Bob</span> (red), shown by the two stick figures at the top of {{ print figure_link("date_of_observation") }}.
#
# Alice lives at the end of Year 1--that’s her "Date Of Observation." Looking down, she can see all 8 articles (represented by solid colored dots) published in Year 1, along with their access status: Gold OA, Green OA, or Closed. The Year of Publication for all eight of these articles is Year 1.
#
# Alice likes reading articles, so she decides to read all eight Year 1 articles, one by one.
#
# She starts with Article A. This article started its life early in the year as Closed. Later that year, though--after an OA Lag of about six months--Article A became Green OA as its author deposited a manuscript (the green circle) in their institutional repository. Now, at Alice’s Date of Observation, it’s open! Excellent. Since Alice is inclined toward organization, she puts Article A in a stack of Green articles she’s keeping below.
#
# Now let’s look at Bob. Bob lives in Alice’s future, in Year 3 (i.e., his “Date of Observation” is Year 3). Like Alice, he’s happy to discover that Article A is open. He puts it in his stack of Green OA articles, which he’s further organized by date of publication (it goes in the Year 1 stack).
#
# Next, Alice and Bob come to Article B, which is a tricky one. Alice is sad: she can’t read the article, and places it in her Closed stack. Unbeknownst to poor Alice, she is a victim of OA Lag, since Article B will become OA in Year 2. By contrast, Bob, from his comfortable perch in the future, is able to read the article. He places it in his Green Year 1 stack. He now has two articles in this stack, since he’s found two Green OA articles in Year 1.
#
# Finally, Alice and Bob both find Article C is closed, and place it in the closed stack for Year 1. We can model this behavior for a hypothetical reader at each year of observation, giving us their view on the world--and that’s exactly the approach we take in this paper.
#
# Now, let’s say that Bob has decided he’s going to figure out what OA will look like in Year 4. He starts with Gold. This is easy, since Gold articles are open immediately upon publication, and publication date is easy to find from article metadata. So, he figures out how many articles were Gold for Alice (1), how many in Year 2 (3), and how many in his own Year 3 (6). Then he computes percentages, and graphs them out using the stacked area chart at the bottom of {{ print figure_link("date_of_observation") }}. From there, it’s easy to extrapolate forward a year.
#
# For Green, he does the same thing--but he makes sure to account for OA Lag. Bob is trying to draw a picture of the world every year, as it appeared to the denizens of that world. He wants Alice’s world as it appeared to Alice, and the same for Year 2, and so on. So he includes OA Lag in his calculations for Green OA, in addition to publication year. Once he has a good picture from each Date Of Observation, and a good understanding of what the OA Lag looks like, he can once again extrapolate to find Year 4 numbers.
#
# Bob is using the same approach we will use in this paper--although in practice, we will find it to be rather more complex, due to varying lengths of OA Lag, additional colors of OA, and a lack of stick figures.
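#
# Bob’s Gold calculation, written out as a sketch (the counts 1, 3, and 6 come from the narrative above; the Year 4 value is a simple linear extrapolation, one of several reasonable choices):
#
# ```python
# import numpy as np
#
# # Gold OA counts Bob observes at each date of observation.
# observation_years = np.array([1, 2, 3])
# gold_articles = np.array([1, 3, 6])
#
# # Fit a line through the observed counts and extrapolate one year forward.
# slope, intercept = np.polyfit(observation_years, gold_articles, 1)
# year_4_estimate = slope * 4 + intercept
# ```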
#
# <a id="section-3-4"></a>
# ### 3.4 Statistical analysis
#
# The analysis was implemented as an executable Python Jupyter notebook using the pandas, scipy, matplotlib, and sqlalchemy libraries. See the [Data and code availability section](#Data-and-code-availability) below for links to the analysis code and raw data.
# *---- delete the text between these lines in the final paper ----*
# #### Code: Functions
# See notebook.
# In[4]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef get_data_extrapolated(graph_type, data_type=False, extrap="linear", now_delta_years=0, cumulative=True):\n \n calc_min_year = 1951\n display_min_year = 2010\n now_year = 2019 - now_delta_years\n max_year = 2024\n\n min_y = 0\n max_y = None\n color = graph_type\n if "bronze" in graph_type:\n color = "bronze"\n \n if isinstance(data_type, pd.DataFrame):\n df_this_color = data_type.loc[(data_type.graph_type==graph_type)]\n elif data_type == "basic":\n df_this_color = articles_by_color_by_year.loc[(articles_by_color_by_year.oa_status==color)]\n else:\n df_this_color = articles_by_graph_type_by_year.loc[(unpaywall_graph_type.oa_status==graph_type)]\n\n totals = pd.DataFrame()\n for i, prediction_year in enumerate(range(calc_min_year, now_year)):\n\n if "published_year" in df_this_color.columns:\n if cumulative:\n df_this_plot = df_this_color.loc[(df_this_color["published_year"] <= prediction_year)]\n else:\n df_this_plot = df_this_color.loc[(df_this_color["published_year"] == prediction_year)]\n else:\n df_this_plot = df_this_color\n y = [a for a in df_this_plot["num_articles"] if not np.isnan(a)]\n prediction_y = sum(y)\n\n totals = totals.append(pd.DataFrame(data={"prediction_year": [prediction_year], \n "num_articles": [prediction_y]}))\n\n \n x = totals["prediction_year"]\n y = totals["num_articles"]\n xnew = np.arange(now_year-1, max_year+1, 1)\n if extrap=="linear":\n f = scipy.interpolate.interp1d(x, y, fill_value="extrapolate", kind="linear")\n ynew = f(xnew)\n else:\n f = scipy.interpolate.interp1d(x, np.log10(y), fill_value="extrapolate", kind="linear")\n ynew = 10 ** f(xnew)\n \n new_data = pd.DataFrame({"color":color, "graph_type": graph_type, "x":np.append(x[:-1], xnew), "y":np.append(y[:-1], ynew)})\n\n return new_data\n\n\ndef graph_data_extrapolated(graph_type, data_type=False, extrap="linear", now_delta_years=0, ax=None, cumulative=True):\n calc_min_year = 1951\n display_min_year = 2000\n now_year = 2019 - now_delta_years\n max_year = 2024\n\n min_y = 0\n max_y = None\n color = graph_type\n if "bronze" in graph_type:\n color = "bronze"\n \n new_data = get_data_extrapolated(graph_type, data_type, extrap, now_delta_years, cumulative)\n\n year_range = range(display_min_year, now_year)\n \n if not isinstance(data_type, pd.DataFrame) and data_type == "simple":\n my_color_lookup = oa_color_lookup.loc[oa_color_lookup["name"]==color]\n else:\n my_color_lookup = graph_type_lookup.loc[graph_type_lookup["name"]==graph_type]\n \n if not ax:\n fig = plt.figure()\n ax = plt.subplot(111)\n\n if not max_y:\n max_y = 5 * max(new_data["y"])\n\n df_actual = new_data.loc[new_data["x"] < now_year]\n x = [int(a) for a in df_actual["x"]]\n y = [int(a) for a in df_actual["y"]]\n df_future = new_data.loc[new_data["x"] >= now_year]\n xnew = [int(a) for a in df_future["x"]]\n ynew = [int(a) for a in df_future["y"]]\n\n ax.plot(x, y, \'o\', color="black")\n ax.fill_between(x, y, color=my_color_lookup["color"])\n\n ax.plot(xnew, ynew, \'o\', color="black", alpha=0.3)\n ax.fill_between(xnew, ynew, color=my_color_lookup["color"], alpha=0.3)\n if cumulative:\n ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.0f}\'.format(y/(1000*1000.0))))\n ax.set_ylabel("articles (millions)")\n ax.set_xlabel("year")\n else:\n ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.1f}\'.format(y/(1000*1000.0))))\n ax.set_ylabel("articles (millions)")\n ax.set_xlabel("year of publication")\n ax.set_xlim(min(year_range), max_year)\n ax.set_title(graph_type);\n\n return new_data')
# In[5]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# graph! :)\n\ndef graph_available_papers_at_year_of_availability(graph_type, now_delta_years=0, ax=None):\n calc_min_year = 1951\n display_min_year = 2010\n now_year = 2018 - now_delta_years\n max_year = 2024\n\n color = graph_type\n if "bronze" in graph_type:\n color = "bronze"\n\n if graph_type == "biorxiv":\n my_color_lookup = {"color": "limegreen"}\n else:\n my_color_lookup = graph_type_lookup.loc[graph_type_lookup["name"]==graph_type] \n \n all_papers_per_year = get_papers_by_availability_year_including_future(graph_type, calc_min_year, max_year)\n\n most_recent_year = all_papers_per_year.loc[all_papers_per_year.article_years_from_availability == 0]\n \n x = [int(a) for a in most_recent_year.loc[most_recent_year.prediction_year <= now_year]["prediction_year"]]\n xnew = [int(a) for a in most_recent_year.loc[most_recent_year.prediction_year > now_year]["prediction_year"]]\n y = [int(a) for a in most_recent_year.loc[most_recent_year.prediction_year <= now_year]["num_articles"]]\n ynew = [int(a) for a in most_recent_year.loc[most_recent_year.prediction_year > now_year]["num_articles"]]\n\n year_range = range(display_min_year, now_year)\n if not ax:\n fig = plt.figure()\n ax = plt.subplot(111)\n\n max_y = 1.2 * max(ynew)\n\n ax.plot(x, y, \'o\', color="black")\n ax.fill_between(x, y, color=my_color_lookup["color"])\n\n ax.plot(xnew, ynew, \'o\', color="black", alpha=0.3)\n ax.fill_between(xnew, ynew, color=my_color_lookup["color"], alpha=0.3)\n ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.2f}\'.format(y/(1000*1000.0))))\n ax.set_ylabel("total papers (millions)")\n\n ax.set_xlim(min(year_range), max_year)\n# ax.set_ylim(0, max_y)\n ax.set_xlabel(\'year of observation\')\n title = plt.suptitle("OA status by observation year")\n title.set_position([.5, 1.05])\n all_papers_per_year.reset_index(inplace=True)\n return all_papers_per_year')
# In[6]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef graph_available_papers_in_observation_year_by_pubdate(graph_type, data, observation_year, ax=None):\n display_min_year = 2010\n max_year = 2025\n\n x = [int(a) for a in data["publication_date"]]\n y = [int(a) for a in data["num_articles"]]\n\n my_color_lookup = graph_type_lookup.loc[graph_type_lookup["name"]==graph_type]\n if not ax:\n fig = plt.figure()\n ax = plt.subplot(111)\n\n alpha = 1\n# if observation_year > 2018:\n# alpha = 0.3\n ax.bar(x, y, color=my_color_lookup["color"], alpha=alpha, width=1, edgecolor=my_color_lookup["color"])\n\n ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.1f}\'.format(y/(1000*1000.0))))\n ax.set_xlim(display_min_year, max_year+1)\n max_y = 1.2 * data.num_articles.max()\n try:\n ax.set_ylim(0, max_y)\n except:\n pass\n ax.set_xlabel("")\n ax.set_ylabel("")\n ax.spines[\'top\'].set_visible(False)\n ax.spines[\'right\'].set_visible(False)\n \n# ax.set_title("{}: {}".format(graph_type, observation_year)); \n# title = plt.suptitle("Availability in {}, by publication date".format(observation_year))\n# title.set_position([.5, 1.05])\n return \n')
# In[7]:
def get_papers_by_availability_year_including_future(graph_type, start_year, end_year):
    # build articles-by-age availability curves for each observation ("prediction") year;
    # beyond the last fully observed year, scale the newest year's curve by the
    # extrapolated totals in final_extraps
    start_calc_year = 2009
    last_year_before_extrap = 2017
    offset = 0
    global final_extraps
    my_return = pd.DataFrame()
    for prediction_year in range(min(start_year, start_calc_year), last_year_before_extrap+1):
        # print prediction_year
        papers_per_year = get_papers_by_availability_year(graph_type, prediction_year, just_this_year=False)
        papers_per_year["prediction_year"] = prediction_year
        my_return = my_return.append(papers_per_year)
    if end_year >= last_year_before_extrap:
        scale_df = final_extraps.copy()
        current_year_all = get_papers_by_availability_year(graph_type, last_year_before_extrap, just_this_year=False)
        now_year_new = get_papers_by_availability_year(graph_type, last_year_before_extrap, just_this_year=True)
        for i, prediction_year in enumerate(range(last_year_before_extrap+1, end_year+1)):
            current_year_all["article_years_from_availability"] += 1
            # print now_year_all.head()
            # print now_year_new.head()
            merged_df = current_year_all.merge(now_year_new, on="article_years_from_availability", suffixes=["_all", "_new"], how="outer")
            merged_df = merged_df.fillna(0)
            # print merged_df.head(10)
            scale = float(scale_df.loc[(scale_df.x==prediction_year)&(scale_df.graph_type==graph_type)].y) / int(scale_df.loc[(scale_df.x==last_year_before_extrap)&(scale_df.graph_type==graph_type)].y)
            merged_df["num_articles"] = merged_df["num_articles_all"] + [int(scale * a) for a in merged_df["num_articles_new"]]
            merged_df["prediction_year"] = prediction_year
            current_year_all = pd.DataFrame(merged_df, columns=["num_articles",
                                                                "article_years_from_availability",
                                                                "prediction_year"])
            my_return = my_return.append(current_year_all)
    my_return.reset_index(inplace=True)
    return my_return
# In[8]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# graph! :)\n\ndef graph_views(graph_type, data=None, now_delta_years=0, ax=None):\n calc_min_year = 1951\n display_min_year = 2010\n now_year = 2018 - now_delta_years\n max_year = 2025\n\n color = graph_type\n\n if isinstance(data, pd.DataFrame):\n df_views_by_year = data\n else:\n df_views_by_year = get_predicted_views(graph_type, display_min_year, max_year)\n\n year_range = range(display_min_year, now_year)\n if graph_type == "biorxiv":\n my_color_lookup = {"color": "limegreen"}\n else:\n my_color_lookup = graph_type_lookup.loc[graph_type_lookup["name"]==color]\n \n if not ax:\n fig = plt.figure()\n ax = plt.subplot(111)\n\n \n x = [int(a) for a in df_views_by_year["observation_year"]]\n y = [int(a) for a in df_views_by_year["views"]]\n max_y = 1.2 * max(y)\n\n ax.scatter(x, y, marker=\'x\', s=70, color=my_color_lookup["color"])\n\n ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.1f}\'.format(y/(1000*1000.0))))\n ax.set_ylabel("views (millions)")\n\n ax.set_xlim(min(year_range), max_year+1)\n# ax.set_ylim(0, max_y)\n ax.set_xlabel(\'view year\')\n# title = plt.suptitle("Estimated views by access year, by OA type")\n# title.set_position([.5, 1.05])\n return df_views_by_year')
# In[9]:
# do calculations
def get_papers_by_availability_year(graph_type="closed", availability_year=2000, just_this_year=False):
    # builds the articles-by-age curve for one OA type at a given observation year;
    # with just_this_year=True, returns only the articles newly in that category that year
    my_return = pd.DataFrame()
    if just_this_year:
        if graph_type == "closed":
            rows_published_this_year = articles_by_color_by_year.loc[articles_by_color_by_year["published_year"] == availability_year]
            total_this_year = rows_published_this_year.num_articles.sum()
            open_this_year = 0
            for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze"]:
                temp_papers = get_papers_by_availability_year(prep_graph_type, availability_year, just_this_year=False)
                temp_papers = temp_papers.loc[temp_papers.article_years_from_availability == 0]
                num_articles = temp_papers.num_articles.sum()
                # print prep_graph_type, num_articles
                open_this_year += num_articles
            num_closed = total_this_year - open_this_year
            my_return = pd.DataFrame({
                "article_years_from_availability": [0],
                "num_articles": [num_closed]
            })
        else:
            prev_year_history = get_papers_by_availability_year(graph_type, availability_year-1, just_this_year=False)
            prev_year_history["article_years_from_availability"] += 1
            this_year_history = get_papers_by_availability_year(graph_type, availability_year, just_this_year=False)
            df_merged = this_year_history.merge(prev_year_history, on="article_years_from_availability", how="left")
            df_merged = df_merged.fillna(0)
            df_merged["num_articles"] = df_merged["num_articles_x"] - df_merged["num_articles_y"]
            df_merged.loc[df_merged["num_articles"] < 25, "num_articles"] = 0
            df_merged = df_merged.loc[df_merged["article_years_from_availability"] <= 10]
            my_return = pd.DataFrame({
                "article_years_from_availability": df_merged["article_years_from_availability"],
                "num_articles": df_merged["num_articles"]
            })
    else:
        if graph_type == "delayed_bronze":
            temp_papers = delayed_bronze_after_embargos_age_years.loc[delayed_bronze_after_embargos_age_years["prediction_year"]==availability_year]
            my_return = pd.DataFrame({
                "article_years_from_availability": temp_papers["article_age_years"],
                "num_articles": temp_papers["num_articles"]
            })
        elif graph_type == "green":
            my_green_oa = green_oa_with_dates_by_availability
            my_green_oa = my_green_oa.loc[my_green_oa["months_old_at_first_deposit"] >= -24]
            my_green_oa = my_green_oa.loc[my_green_oa["months_old_at_first_deposit"] <= 12*25]
            my_green_oa = my_green_oa.loc[my_green_oa["year_of_first_availability"] <= availability_year]
            my_green_oa_pivot = my_green_oa.pivot_table(
                index='published_year', values=['num_articles'], aggfunc=np.sum)
            my_green_oa_pivot.reset_index(inplace=True)
            my_green_oa_pivot = my_green_oa_pivot.sort_values(by=["published_year"], ascending=False)
            my_green_oa_pivot["article_years_from_availability"] = [(availability_year - a) for a in my_green_oa_pivot["published_year"]]
            my_return = pd.DataFrame({
                "article_years_from_availability": my_green_oa_pivot["article_years_from_availability"],
                "num_articles": my_green_oa_pivot["num_articles"]
            })
        elif graph_type == "closed":
            my_return = pd.DataFrame()
            for i, year in enumerate(range(availability_year+1, 1990, -1)):
                closed_rows = get_papers_by_availability_year(graph_type, availability_year - i, just_this_year=True)
                closed_rows["article_years_from_availability"] = i
                my_return = my_return.append(closed_rows)
        elif graph_type == "immediate_bronze":
            temp_papers = articles_by_color_by_year_with_embargos.loc[(articles_by_color_by_year_with_embargos.oa_status=="bronze") &
                                                                      (articles_by_color_by_year_with_embargos["embargo"].isnull()) &
                                                                      (articles_by_color_by_year_with_embargos.published_year <= availability_year)]
            # temp_papers["article_years_from_availability"] = availability_year - temp_papers["published_year"]
            temp_pivot = temp_papers.pivot_table(
                index='published_year', values=['num_articles'], aggfunc=np.sum)
            temp_pivot.reset_index(inplace=True)
            my_return = pd.DataFrame({
                "article_years_from_availability": availability_year - temp_pivot.published_year,
                "num_articles": temp_pivot.num_articles
            })
        elif graph_type == "biorxiv":
            my_return = biorxiv_growth_otherwise_closed.copy()
            my_return = my_return.loc[my_return["published_year"] <= availability_year]
            my_return["article_years_from_availability"] = availability_year - my_return["published_year"]
        else:
            temp_papers = articles_by_color_by_year.loc[(articles_by_color_by_year.oa_status==graph_type) &
                                                        (articles_by_color_by_year.published_year <= availability_year)]
            my_return = pd.DataFrame({
                "article_years_from_availability": [availability_year - a for a in temp_papers["published_year"]],
                "num_articles": temp_papers["num_articles"]
            })
    if not my_return.empty:
        my_return = pd.DataFrame(my_return, columns=["article_years_from_availability", "num_articles"])
        my_return = my_return.sort_values(by="article_years_from_availability")
    return my_return
# In[10]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef get_predicted_views_by_pubdate(graph_type, observation_year):\n\n views_per_article = get_views_per_article(graph_type)\n \n df_views_by_year = pd.DataFrame()\n all_papers_per_year = get_papers_by_availability_year_including_future(graph_type, observation_year, observation_year+1)\n papers_per_year = all_papers_per_year.loc[all_papers_per_year["prediction_year"] == observation_year]\n \n try:\n data_merged_clean = papers_per_year.merge(views_per_article, left_on=["article_years_from_availability"], right_on=["article_age_years"])\n data_merged_clean = data_merged_clean.sort_values("article_age_years")\n# print data_merged_clean.head()\n data_merged_clean["views"] = data_merged_clean["views_per_article"] * data_merged_clean["num_articles"]\n data_merged_clean["observation_year"] = observation_year\n data_merged_clean["publication_year"] = observation_year - data_merged_clean["article_age_years"]\n new_data = pd.DataFrame(data_merged_clean, columns=["publication_year", "views", "article_age_years", "observation_year"])\n df_views_by_year = df_views_by_year.append(new_data)\n except (ValueError, KeyError): # happens when the year is blank\n pass\n \n return df_views_by_year')
# In[ ]:
# In[11]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef get_predicted_views(graph_type, now_delta_years=0):\n# calc_min_year = 1951\n calc_min_year = 1995\n display_min_year = 2010\n now_year = 2020 - now_delta_years\n max_year = 2024\n exponential = False\n\n if graph_type == "biorxiv":\n exponential = True\n \n views_per_article = get_views_per_article(graph_type)\n \n df_views_by_year = pd.DataFrame()\n \n all_papers_per_year = all_predicted_papers_future\n for prediction_year in range(calc_min_year, max_year+1): \n# for prediction_year in range(calc_min_year, 2019): \n# for prediction_year in range(2017, 2019): \n papers_per_year = all_papers_per_year.loc[all_papers_per_year["prediction_year"] == prediction_year]\n papers_per_year = papers_per_year.loc[papers_per_year["graph_type"] == graph_type]\n# print views_per_article.head()\n try:\n data_merged_clean = papers_per_year.merge(views_per_article, left_on=["article_years_from_availability"], right_on=["article_age_years"])\n data_merged_clean = data_merged_clean.sort_values("article_age_years")\n win = data_merged_clean["views_per_article"] \n sig = data_merged_clean["num_articles"]\n views_by_access_year = signal.convolve(win, sig, mode=\'same\', method="direct")\n y = max(views_by_access_year)\n df_views_by_year = df_views_by_year.append(pd.DataFrame({"observation_year":[prediction_year], "views": [y]}))\n except (ValueError, KeyError): # happens when the year is blank\n pass\n \n\n return df_views_by_year')
# In[12]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef get_papers_by_availability_year_total(availability_year):\n all_data = pd.DataFrame()\n for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze", "closed"]:\n# for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze"]:\n temp_papers = get_papers_by_availability_year_including_future(prep_graph_type, availability_year, availability_year+1)\n temp_papers["graph_type"] = prep_graph_type\n# print prep_graph_type\n# print "{:,.0f}".format(temp_papers.num_articles.max()), "{:,.0f}".format(temp_papers.num_articles.sum())\n# print "\\n"\n all_data = all_data.append(temp_papers)\n return all_data\n\ndef get_views_per_year_total():\n all_data = pd.DataFrame()\n for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze", "closed"]:\n temp_papers = get_views_per_year(prep_graph_type)\n temp_papers["graph_type"] = prep_graph_type\n# print prep_graph_type\n# print "{:,.0f}".format(temp_papers.num_views_per_year.max()), "{:,.0f}".format(temp_papers.num_views_per_year.sum())\n# print "\\n"\n all_data = all_data.append(temp_papers)\n return all_data\n\n\n\ndef get_views_per_article_total():\n all_data = pd.DataFrame()\n for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze", "closed"]:\n temp_papers = get_views_per_article(prep_graph_type)\n# print prep_graph_type\n# print "{:,.0f}".format(temp_papers.views_per_article.max()), "{:,.0f}".format(temp_papers.views_per_article.sum())\n# print "\\n"\n temp_papers["graph_type"] = prep_graph_type\n all_data = all_data.append(temp_papers)\n return all_data\n\n\ndef get_predicted_views_total(observation_year):\n all_data = pd.DataFrame()\n for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze", "closed"]:\n# for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze"]:\n temp_papers = get_predicted_views(prep_graph_type, observation_year)\n temp_papers["graph_type"] = prep_graph_type\n# print prep_graph_type \n all_data = all_data.append(temp_papers)\n return all_data\n\ndef get_predicted_views_by_pubdate_total(observation_year):\n all_data = pd.DataFrame()\n# for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze"]:\n for prep_graph_type in ["gold", "hybrid", "green", "immediate_bronze", "delayed_bronze", "closed"]:\n temp_papers = get_predicted_views_by_pubdate(prep_graph_type, observation_year)\n temp_papers["graph_type"] = prep_graph_type\n# print prep_graph_type\n all_data = all_data.append(temp_papers)\n return all_data')
# In[13]:
def get_views_per_year(graph_type):
    # views by article age for one OA type, scaled from one month of logs to a year
    if graph_type == "delayed_bronze":
        views_per_year = views_by_age_years.loc[(views_by_age_years.oa_status=="bronze") &
                                                (views_by_age_years.delayed_or_immediate=="delayed")]
    elif graph_type == "immediate_bronze":
        views_per_year = views_by_age_years.loc[(views_by_age_years.oa_status=="bronze") &
                                                (views_by_age_years["delayed_or_immediate"]=="immediate")]
    else:
        views_per_year = views_by_age_years.loc[(views_by_age_years.oa_status==graph_type)]
    views_per_year["num_views_one_month"] = views_per_year["num_views"]  # this is just for one month
    views_per_year["num_views_per_year"] = 12.0 * views_per_year["num_views_one_month"]
    del views_per_year["num_views"]
    del views_per_year["delayed_or_immediate"]
    views_per_year = views_per_year.sort_values(by="article_age_years")
    views_per_year = views_per_year.loc[views_per_year["article_age_years"] < 15]
    return views_per_year


def get_views_per_article(graph_type):
    # views-per-article-by-age curve: views by age divided by articles by age
    if graph_type == "biorxiv":
        graph_type = "green"
    views_per_year = get_views_per_year(graph_type)
    papers_per_year = get_papers_by_availability_year(graph_type, 2018, just_this_year=False)
    papers_per_year["article_age_years"] = papers_per_year["article_years_from_availability"]
    papers_per_year = papers_per_year.loc[(papers_per_year["article_age_years"] <= 15)]
    data_merged_clean = papers_per_year.merge(views_per_year, on=["article_age_years"])
    data_merged_clean["views_per_article"] = data_merged_clean["num_views_per_year"] / data_merged_clean["num_articles"]
    views_per_article = pd.DataFrame(data_merged_clean, columns=["article_age_years", "views_per_article"])
    views_per_article = views_per_article.sort_values(by="article_age_years")
    if graph_type=="delayed_bronze":
        # otherwise first one is too high because number articles too low in year 0 for delayed subset
        views_per_article.loc[views_per_article.article_age_years==0, ["views_per_article"]] = float(views_per_article.loc[views_per_article.article_age_years==1].views_per_article)
    return views_per_article
# In[14]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\ndef plot_area_and_proportion(df, color_type, start_year, end_year, divide_year, \n xlabel="year of publication",\n fancy=None):\n if color_type=="simple":\n my_colors = oa_status_colors\n my_color_order = oa_status_order\n color_column = "color"\n elif color_type=="standard":\n my_colors = graph_type_colors\n my_color_order = graph_type_order\n color_column = "graph_type"\n else:\n my_colors = graph_type_colors_plus_biorxiv\n my_color_order = graph_type_order_plus_biorxiv\n color_column = "graph_type"\n \n all_data_pivot = df.pivot_table(\n index=\'x\', columns=color_column, values=[\'y\'], aggfunc=np.sum)\\\n .sort_index(axis=1, level=1)\\\n .swaplevel(0, 1, axis=1)\n all_data_pivot.columns = all_data_pivot.columns.levels[0]\n\n fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3), sharex=True, sharey=False)\n plt.tight_layout(pad=0, w_pad=2, h_pad=1)\n plt.subplots_adjust(hspace=1)\n\n all_data_pivot_graph = all_data_pivot\n ylabel = "articles (millions)"\n if fancy=="cumulative":\n ylabel = "cumulative articles (millions)"\n all_data_pivot_graph = all_data_pivot_graph.cumsum(0)\n elif fancy=="diff":\n ylabel = "newly available articles (millions)"\n all_data_pivot_graph = all_data_pivot_graph.diff()\n all_data_pivot_graph = all_data_pivot_graph.loc[all_data_pivot_graph.index > 1950]\n all_data_pivot_graph = all_data_pivot_graph.loc[all_data_pivot_graph.index <= end_year]\n \n # print all_data_pivot_graph\n all_data_pivot_actual = all_data_pivot_graph.loc[all_data_pivot_graph.index <= divide_year+1]\n my_plot = all_data_pivot_actual[my_color_order].plot.area(stacked=True, color=my_colors, linewidth=.1, ax=ax1)\n if end_year > divide_year:\n all_data_pivot_projected = all_data_pivot_graph.loc[all_data_pivot_graph.index > divide_year]\n my_plot = all_data_pivot_projected[my_color_order].plot.area(stacked=True, color=my_colors, linewidth=.1, ax=ax1, alpha=0.6)\n ax1.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter(\'{x:.0f}\'))\n ax1.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: \'{0:,.0f}\'.format(y/(1000*1000.0))))\n ax1.set_xlabel(xlabel)\n ax1.set_ylabel(ylabel) \n ax1.set_xlim(start_year, end_year)\n ax1.set_ylim(0, 1.2*max(all_data_pivot_graph.sum(1)))\n# ax1.set_title("Number of papers");\n handles, labels = my_plot.get_legend_handles_labels(); my_plot.legend(reversed(handles[0:len(my_colors)]), reversed(labels[0:len(my_colors)]), loc=\'upper left\'); # reverse to keep order consistent\n\n df_diff_proportional = all_data_pivot_graph.div(all_data_pivot_graph.sum(1), axis=0)\n all_data_pivot_actual = df_diff_proportional.loc[all_data_pivot_graph.index <= divide_year+1]\n my_plot = all_data_pivot_actual[my_color_order].plot.area(stacked=True, color=my_colors, linewidth=.1, ax=ax2)\n if end_year > divide_year:\n all_data_pivot_projected = df_diff_proportional.loc[all_data_pivot_graph.index > divide_year]\n my_plot = all_data_pivot_projected[my_color_order].plot.area(stacked=True, color=my_colors, linewidth=.1, ax=ax2, alpha=0.6)\n my_plot.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(xmax=1))\n ax2.set_xlabel(xlabel)\n ax2.set_ylabel(\'proportion of articles\')\n# ax2.set_title("Proportion of papers");\n ax2.set_xlim(start_year, end_year)\n ax2.set_ylim(0, 1) \n handles, labels = my_plot.get_legend_handles_labels(); my_plot.legend(reversed(handles[0:len(my_colors)]), reversed(labels[0:len(my_colors)]), loc=\'upper left\'); # reverse to keep order consistent\n\n plt.tight_layout(pad=.5, w_pad=4, h_pad=2.0) \n return (all_data_pivot_graph, df_diff_proportional)')
# In[15]:
# plot graphs duplicate new
def get_long_data(graph_type):
    full_range = range(1990, 2020)
    totals_bronze = pd.DataFrame()
    for i, prediction_year in enumerate(full_range):
        new_frame = get_papers_by_availability_year(graph_type, prediction_year, just_this_year=True)
        new_frame["prediction_year"] = prediction_year
        new_frame["published_year"] = [int(prediction_year - a) for a in new_frame["article_years_from_availability"]]
        totals_bronze = totals_bronze.append(new_frame)
    long_data_for_plot = totals_bronze
    long_data_for_plot = long_data_for_plot.loc[long_data_for_plot["article_years_from_availability"] < 15]
    return long_data_for_plot


def first_detailed_plots(graph_type):
    my_color_lookup = graph_type_lookup.loc[graph_type_lookup["name"]==graph_type]
    long_data_for_plot = get_long_data(graph_type)
    pivot_data_for_plot = long_data_for_plot.pivot_table(
        index='published_year', columns='prediction_year', values=['num_articles'], aggfunc=np.sum)\
        .sort_index(axis=1, level=1)\
        .swaplevel(0, 1, axis=1)
    pivot_data_for_plot.columns = pivot_data_for_plot.columns.levels[0]
    pivot_data_for_plot[pivot_data_for_plot < 0] = 0
    pivot_data_for_plot["published_year"] = [int(a) for a in pivot_data_for_plot.index]
    years = range(2015, 2018+1)
    historical_graphs = False
    color_idx = np.linspace(0, 1, len(years))
    fig, axes = plt.subplots(1, len(years), figsize=(12, 3), sharex=True, sharey=True)
    axes_flatten = axes.flatten()
    axis_index = 0
    max_y_for_this_plot = max(pivot_data_for_plot.max(1))
    for i, prediction_year in enumerate(years):
        ax = axes_flatten[axis_index]
        axis_index += 1
        rows = pivot_data_for_plot.copy()
        rows = rows.loc[pd.notnull(rows[prediction_year])]
        x = [int(a) for a in rows.index]
        y = [int(a) for a in rows[prediction_year]]
        ax.bar(x, y, color=my_color_lookup["color"])
        ax.set_ylim(0, 1.2*max_y_for_this_plot)
        ax.set_xlim(2010, 2019)
        if ax.get_legend():
            ax.get_legend().remove()
        ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
        ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:.0f}'))
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['bottom'].set_visible(True)
        ax.spines['left'].set_visible(True)
        ax.set_xlabel("year of publication")
        ax.set_title("year first available OA\n{}".format(prediction_year))
    axes_flatten[0].set_ylabel("articles\nfirst made available")
    plt.tight_layout(pad=0, w_pad=0, h_pad=0)
    plt.subplots_adjust(hspace=0)
    plt.show()
# In[ ]:
# In[16]:
def make_detailed_plots(graph_type):
    num_subplots = 8
    long_data_for_plot = get_long_data(graph_type)
    pivot_data_for_plot = long_data_for_plot.pivot_table(
        index='published_year', columns='prediction_year', values=['num_articles'], aggfunc=np.sum)\
        .sort_index(axis=1, level=1)\
        .swaplevel(0, 1, axis=1)
    pivot_data_for_plot.columns = pivot_data_for_plot.columns.levels[0]
    pivot_data_for_plot[pivot_data_for_plot < 0] = 0
    # print pivot_data_for_plot
    years = [year for year in pivot_data_for_plot.columns if year > 1990]
    for historical_graphs in (False, True):
        color_idx = np.linspace(0, 1, len(years))
        fig, axes = plt.subplots(len(years[-num_subplots:]), 1, figsize=(7, 6), sharex=True, sharey=True)
        axes_flatten = axes.flatten()
        axis_index = 0
        max_y_for_this_plot = max(pivot_data_for_plot.max(1))
        for i, prediction_year in zip(color_idx[-num_subplots:], years[-num_subplots:]):
            ax = axes_flatten[axis_index]
            axis_index += 1
            if historical_graphs:
                pivot_data_for_plot[range(2000, prediction_year+1)].plot.area(stacked=True, alpha=0.4, ax=ax, color=[plt.cm.jet(i) for x in range(2000, prediction_year)])
                try:
                    pivot_data_for_plot[range(2000, prediction_year)].plot.area(stacked=True, ax=ax, alpha=.9, color="lightgray")
                    ax.set_ylim(0, 3*max_y_for_this_plot)
                except TypeError:
                    pass
            else:
                pivot_data_for_plot[prediction_year].plot.area(stacked=False, ax=ax, alpha=.4, color=plt.cm.jet(i))
                ax.set_ylim(0, 1.2*max_y_for_this_plot)
            ax.set_xlim(2009, 2018)
            if ax.get_legend():
                ax.get_legend().remove()
            ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
            ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:.0f}'))
            ax.spines['top'].set_visible(False)
            ax.spines['right'].set_visible(False)
            ax.spines['bottom'].set_visible(False)
            ax.spines['left'].set_visible(False)
            y_label = "{} made available during {}:".format(graph_type, prediction_year)
            ax.set_ylabel(y_label, rotation='horizontal', labelpad=150, verticalalignment="center")
            ax.set_yticks([])
        plt.tight_layout()
        plt.show()
    fig, ax1 = plt.subplots(1, 1, figsize=(10, 3))
    pivot_data_for_plot[years].plot.area(stacked=True, ax=ax1, alpha=.4, cmap=plt.cm.jet)
    ax1.set_xlim(2000, 2018)
    legend_handles, legend_labels = ax1.get_legend_handles_labels(); ax1.legend(reversed(legend_handles[-8:]), reversed(legend_labels[-8:]), loc='upper left');  # reverse to keep order consistent
    ax1.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
    ax1.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:.0f}'))
    ax1.axvline(x=2015, color='black')
    ax1.set_title("Total {} OA available in 2019, by year of availability and publication year".format(graph_type));
    ax1.set_ylabel("number of articles")
    ax1.set_xlabel("published year")
    plt.tight_layout()
    plt.show()
# In[17]:
def make_zoom_in_plot(graph_type):
    full_range = range(1990, 2020)
    long_data_for_plot = get_long_data(graph_type)
    color_idx = np.linspace(0, 1, len(full_range))
    fig, ax1 = plt.subplots(1, 1, figsize=(4, 4))
    data_for_this_plot = long_data_for_plot
    data_for_this_plot = data_for_this_plot.loc[data_for_this_plot["published_year"]==2015]
    total_sum = data_for_this_plot["num_articles"].sum()
    data_for_this_plot = data_for_this_plot.loc[data_for_this_plot["num_articles"]/total_sum>=0.01]
    # print data_for_this_plot
    # data_for_this_plot = data_for_this_plot.drop(columns=["article_age_months"])
    pivot_df = data_for_this_plot.pivot_table(index='published_year', columns='prediction_year', aggfunc=np.sum)
    pivot_df = pivot_df.div(pivot_df.sum(1), axis=0)
    pivot_df.plot.bar(stacked=True, alpha=.4, ax=ax1, color=[plt.cm.jet(a) for a in list(color_idx[-len(pivot_df.sum(0)):])])
    ax1.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(xmax=1))
    plt.ylabel('proportion of articles')
    plt.title("Proportion of {} articles published in 2015".format(graph_type));
    ax1.set_xlabel("")
    ax1.set_xticks([])
    legend_handles, legend_labels = ax1.get_legend_handles_labels();
    cleaned_legend_labels = [a[-5:-1] for a in legend_labels]
    legend_length = len(data_for_this_plot)  # just the nonzero ones
    ax1.legend(reversed(legend_handles[-legend_length:]), reversed(cleaned_legend_labels[-legend_length:]), loc='upper left');  # reverse to keep order consistent
# In[18]:
# Nonlinear curve fit with confidence interval
def curve_fit_with_ci(graph_type, papers_per_year_historical, curve_type, ax=None):
my_rows = papers_per_year_historical.loc[papers_per_year_historical.article_years_from_availability <= 5]
my_rows = my_rows.loc[my_rows.prediction_year >= 2000]
my_rows = my_rows.loc[my_rows.prediction_year < 2018]
x = my_rows.groupby("prediction_year", as_index=False).sum().prediction_year
y = my_rows.groupby("prediction_year", as_index=False).sum().num_articles
my_color_lookup = graph_type_plus_biorxiv_lookup.loc[graph_type_plus_biorxiv_lookup["name"]==graph_type]
my_color = my_color_lookup.iloc[0]["color"]
if not ax:
fig, ax = plt.subplots(1, 1, figsize=(3, 3), sharex=True, sharey=False)
ax.plot(x, y, 'o', color=my_color)
ax.set_xlim(2000, 2025)
ax.set_ylabel("articles (millions)")
ax.set_title("{}".format(graph_type))
if curve_type == "no_line":
ax.set_xlabel("year of observation")
ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: '{0:,.2f}'.format(y/(1000*1000.0))))
return
if curve_type == "linear":
initial_guess=None
def func(x, a, b):
return a * (x - 2000) + b
elif curve_type == "exp":
if graph_type == "biorxiv":
initial_guess=(5, 1, 1)
def func(x, a, b, d):
return b + a * np.exp((x - 2014)/d)
else:
initial_guess=(14287, 21932, 5)
def func(x, a, b, d):
return b + a * np.exp((x - 2000)/d)
elif curve_type == "negative_exp":
initial_guess=(1731700, 22962997, -7)
def func(x, a, b, d):
return b - a * np.exp((x - 2000)/d)
pars, pcov = curve_fit(func, x, y, initial_guess)
xfit_extrap = range(2000, 2040+1)
if curve_type == "linear":
yfit_extrap = [func(a, pars[0], pars[1]) for a in xfit_extrap]
yfit = [func(a, pars[0], pars[1]) for a in x]
else:
yfit_extrap = [func(a, pars[0], pars[1], pars[2]) for a in xfit_extrap]
yfit = [func(a, pars[0], pars[1], pars[2]) for a in x]
alpha = 0.05 # 95% confidence interval = 100*(1-alpha)
n = len(y) # number of data points
p = len(pars) # number of parameters
dof = max(0, n - p) # number of degrees of freedom
tval = t.ppf(1.0-alpha/2., dof) # student-t value
residuals = y - yfit
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)
fit_string = ""
# report each fitted parameter with its 95% confidence interval: par +/- t*sigma
# (loop variable renamed to "par" so it does not shadow p = len(pars) above)
for i, par, var in zip(range(len(pars)), pars, np.diag(pcov)):
sigma = var**0.5
fit_string += ' p{}: {} [{} {}] '.format(i,
round(par, 3),
round(par - sigma*tval, 3),
round(par + sigma*tval, 3))
fit_string += "{}".format(round(r_squared, 3))
# print("{} {} {}".format(graph_type, curve_type, fit_string))
ax.plot(xfit_extrap[0:25], yfit_extrap[0:25], '-', color=my_color)
ax.set_xlabel("r^2={}".format(round(r_squared, 3)))
if max(yfit_extrap) > 100000:
ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda y, pos: '{0:,.2f}'.format(y/(1000*1000.0))))
my_return = pd.DataFrame({
"x": xfit_extrap,
"y": yfit_extrap,
"r_squared": [r_squared for y in yfit_extrap]
})
return my_return
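# A self-contained toy illustration of the confidence-interval math used above (not part
# of the analysis): fit y = a*(x - 2000) + b to synthetic data, then report each fitted
# parameter as p +/- t*sigma at the 95% level, the same computation curve_fit_with_ci uses.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

np.random.seed(42)
toy_x = np.arange(2000, 2018)
toy_y = 3.0 * (toy_x - 2000) + 5.0 + np.random.normal(0, 1, size=toy_x.size)
toy_pars, toy_pcov = curve_fit(lambda x, a, b: a * (x - 2000) + b, toy_x, toy_y)
toy_dof = max(0, len(toy_y) - len(toy_pars))
toy_tval = t.ppf(1 - 0.05 / 2., toy_dof)
for toy_i, (toy_par, toy_var) in enumerate(zip(toy_pars, np.diag(toy_pcov))):
    toy_sigma = toy_var ** 0.5
    print('p{}: {} [{} {}]'.format(toy_i, round(toy_par, 3),
                                   round(toy_par - toy_sigma * toy_tval, 3),
                                   round(toy_par + toy_sigma * toy_tval, 3)))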
# #### Code: SQL
# See notebook.
# In[19]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# query for articles_by_color_by_year_with_embargos and articles_by_color_by_year\n\nq = """\nselect date_part(\'year\', fixed.published_date)::int as published_year, \nfixed.oa_status,\ndelayed.embargo,\ncount(*) as num_articles\nfrom unpaywall u\nleft join journal_delayed_oa_active delayed on u.journal_issn_l = delayed.issn_l\njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand published_year > \'1950-01-01\'::timestamp\ngroup by published_year, fixed.oa_status, embargo\norder by published_year asc\n"""\narticles_by_color_by_year_with_embargos = read_from_file_or_db("articles_by_color_by_year_with_embargos", q)\n\narticles_by_color_by_year = articles_by_color_by_year_with_embargos.drop(columns = ["embargo"])\narticles_by_color_by_year = articles_by_color_by_year.groupby([\'published_year\', \'oa_status\']).sum()\narticles_by_color_by_year.reset_index(inplace=True)')
# In[20]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# query for articles_by_graph_type_by_year\n\nq = """\nselect date_part(\'year\', fixed.published_date) as published_year, \nfixed.oa_status,\ncase when fixed.oa_status=\'bronze\' and delayed.embargo is not null then \'delayed_bronze\' \n when fixed.oa_status=\'bronze\' and delayed.embargo is null then \'immediate_bronze\' \n else fixed.oa_status end\n as graph_type,\ncount(*) as num_articles\nfrom unpaywall u\nleft join journal_delayed_oa_active delayed on u.journal_issn_l = delayed.issn_l\njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand published_year > \'1950-01-01\'::timestamp\nand published_year < \'2019-01-01\'::timestamp\ngroup by published_year, fixed.oa_status, graph_type\norder by published_year asc\n"""\narticles_by_graph_type_by_year = read_from_file_or_db("articles_by_graph_type_by_year", q)')
# In[21]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# query for views_by_age_months_no_color_full_year. maybe don\'t need this one in the final paper?\n\nq = """\nselect datediff(\'days\', fixed.published_date, received_at_raw::timestamp)/30 as article_age_months, \ncount(u.doi) as num_views \nfrom papertrail_unpaywall_extracted extracted \njoin unpaywall u on extracted.doi=u.doi \njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand fixed.published_date > \'1950-01-01\'::timestamp\nand extracted.doi not in (\'10.1038/nature21360\', \'10.1038/nature11723\')\ngroup by article_age_months\norder by article_age_months asc\n\n"""\nviews_by_age_months_no_color_full_year = read_from_file_or_db("views_by_age_months_no_color_full_year", q)')
# In[22]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# query for views_by_age_months\n# not used by analysis but here for data dump\n\nq = """\nselect datediff(\'days\', fixed.published_date, received_at_raw::timestamp)/30 as article_age_months, \nfixed.oa_status,\ncount(u.doi) as num_views \nfrom papertrail_unpaywall_extracted extracted\njoin unpaywall u on extracted.doi=u.doi \njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand fixed.published_date > \'1950-01-01\'::timestamp\nand fixed.published_date < current_date\nand received_at_raw > \'2019-07-01\'\nand received_at_raw <= \'2019-08-01\'\nand extracted.doi != \'10.1038/nature21360\'\ngroup by article_age_months, fixed.oa_status\norder by article_age_months asc\n"""\nviews_by_age_months = read_from_file_or_db("views_by_age_months", q)\n')
# In[23]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# query for views_by_age_years\n\nq = """\nselect datediff(\'days\', fixed.published_date, received_at_raw::timestamp)/(30*12) as article_age_years, \nfixed.oa_status,\ncase when fixed.oa_status=\'bronze\' and journal_issn_l in (select issn_l from journal_delayed_oa_active) then \'delayed\' when fixed.oa_status=\'bronze\' then \'immediate\' else null end as delayed_or_immediate,\ncount(u.doi) as num_views \nfrom papertrail_unpaywall_extracted extracted \njoin unpaywall u on extracted.doi=u.doi \njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand fixed.published_date > \'1950-01-01\'::timestamp\nand fixed.published_date < current_date\nand received_at_raw > \'2019-07-01\'\nand received_at_raw <= \'2019-08-01\'\nand extracted.doi != \'10.1038/nature21360\'\ngroup by article_age_years, fixed.oa_status, delayed_or_immediate\norder by article_age_years asc\n"""\nviews_by_age_years = read_from_file_or_db("views_by_age_years", q)')
# In[24]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\nq = """\nselect date_part(\'year\', min_record_timestamp) as year_of_first_availability, \ndatediff(\'days\', fixed.published_date, min_record_timestamp)/30 as months_old_at_first_deposit,\ndate_part(\'year\', fixed.published_date) as published_year,\ncount(*) as num_articles\nfrom unpaywall u\njoin unpaywall_pmh_record_min_timestamp pmh on u.doi=pmh.doi\njoin unpaywall_updates_view fixed on fixed.doi=u.doi\nwhere fixed.oa_status = \'green\'\nand genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\nand year_of_first_availability is not null\ngroup by year_of_first_availability, months_old_at_first_deposit, published_year\n"""\ngreen_oa_with_dates_by_availability = read_from_file_or_db("green_oa_with_dates_by_availability", q)')
# In[25]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# queries delayed_bronze_after_embargos_age_months\n# not used by analysis but here for data dump\n\nmin_prediction_year = 1949\nmax_prediction_year = 2019 + 1\nprediction_year_range = range(min_prediction_year, max_prediction_year)\ndelayed_bronze_after_embargos_age_months = pd.DataFrame()\n\nfor i, prediction_year in enumerate(range(min_prediction_year - 1, max_prediction_year)):\n \n q = """\n select \n datediff(\'days\', fixed.published_date, \'{prediction_year}-01-01\'::timestamp)/30 as article_age_months, \n --datediff(\'days\', fixed.published_date, current_date)/30 as article_age_months_from_now, \n {prediction_year} as prediction_year,\n count(*) as num_articles\n from unpaywall u\n left join journal_delayed_oa_active delayed on u.journal_issn_l = delayed.issn_l\n join unpaywall_updates_view fixed on fixed.doi=u.doi\n where genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\n and fixed.oa_status = \'bronze\'\n and delayed.embargo is not null\n and fixed.published_date > \'1950-01-01\'::timestamp\n and fixed.published_date <= ADD_MONTHS(\'{prediction_year}-01-01\'::timestamp, -embargo::integer)\n group by prediction_year, article_age_months\n order by prediction_year, article_age_months asc\n """.format(prediction_year=prediction_year)\n\n filename_root = "delayed_bronze_sql_parts/{varname}_{index}".format(varname="bronze_rows_by_month", index=i) \n bronze_rows = read_from_file_or_db(filename_root, q)\n \n delayed_bronze_after_embargos_age_months = delayed_bronze_after_embargos_age_months.append(bronze_rows)\ndelayed_bronze_after_embargos_age_months.to_csv("data/delayed_bronze_after_embargos_age_months.csv")\n')
# In[26]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\n# queries delayed_bronze_after_embargos_age_years\n\nmin_prediction_year = 1949\nmax_prediction_year = 2019 + 1\nprediction_year_range = range(min_prediction_year, max_prediction_year)\ndelayed_bronze_after_embargos_age_years = pd.DataFrame()\n\nfor i, prediction_year in enumerate(range(min_prediction_year - 1, max_prediction_year)):\n \n q = """ \n select \n datediff(\'days\', fixed.published_date, \'{prediction_year}-01-01\'::timestamp)/(30*12) as article_age_years, \n {prediction_year} as prediction_year,\n count(*) as num_articles\n from unpaywall u\n left join journal_delayed_oa_active delayed on u.journal_issn_l = delayed.issn_l\n join unpaywall_updates_view fixed on fixed.doi=u.doi\n where genre = \'journal-article\' and journal_issn_l not in (\'0849-6757\', \'0931-7597\')\n and fixed.oa_status = \'bronze\'\n and delayed.embargo is not null\n and fixed.published_date > \'1950-01-01\'::timestamp\n and fixed.published_date <= ADD_MONTHS(\'{prediction_year}-01-01\'::timestamp, -embargo::integer)\n \n group by prediction_year, article_age_years\n order by prediction_year, article_age_years asc\n """.format(prediction_year=prediction_year)\n\n filename_root = "delayed_bronze_sql_parts/{varname}_{index}".format(varname="bronze_rows_by_year", index=i)\n bronze_rows_by_year = read_from_file_or_db(filename_root, q)\n \n delayed_bronze_after_embargos_age_years = delayed_bronze_after_embargos_age_years.append(bronze_rows_by_year)\ndelayed_bronze_after_embargos_age_years.to_csv("data/delayed_bronze_after_embargos_age_years.csv")')
# In[27]:
get_ipython().run_cell_magic(u'capture', u'--no-stderr --no-stdout --no-display', u'\nq = """select u.year::numeric as published_year, count(distinct u.doi) as num_articles \nfrom unpaywall u\njoin unpaywall u_biorxiv_record on u_biorxiv_record.doi = replace(u.best_url, \'https://doi.org/\', \'\')\nwhere u.doi not like \'10.1101/%\' and u.best_url like \'%10.1101/%\'\nand datediff(\'days\', u_biorxiv_record.published_date::timestamp, u.published_date::timestamp)/(30.0) >= 0\nand u.year >= 2013 and u.year < 2019\ngroup by u.year\norder by u.year desc\n"""\nbiorxiv_growth_otherwise_closed = read_from_file_or_db("biorxiv_growth_otherwise_closed", q)')
# *---- delete the text to the line above in the final paper ----*
# <a id="section-4"></a>
# ## 4. Methods and Results
# <a id="section-4-1"></a>
# ### 4.1 Past OA Publication, by date of observation
# <a id="section-4-1-1"></a>
# #### 4.1.1 OA lag
#
# For Gold OA and Hybrid OA, understanding OA lag is easy -- there is no lag: papers become OA at the time of publication.
#
# For Green and Bronze OA the lag is more complicated. Authors often self-archive (upload their paper to a repository) months or years after the official publication date of the paper, typically because the journal has a policy that authors must wait a certain length of time (the "embargo period") before self-archiving. Funder policies that mandate Green OA often allow a delay between publication and availability (notably the National Institutes of Health in the USA allows a 12-month embargo, which is relevant for most of the content in the large PMC repository). Finally, some journals open up their back catalogs once articles reach a certain age, which has been called "delayed OA" (Laakso and Björk, 2013) and which we consider an important subset of Bronze.
#
# We explore and model these dynamics below.
# <a id="section-4-1-2"></a>
# #### 4.1.2. OA lag for Green OA
# In[28]:
register_new_figure("oa_lag_green");
# Calculating OA lag requires data on both when an article was first published in its journal and the date it was first made OA.
#
# The date an article becomes Green OA can be derived from the date it was made available in a repository, which we can get from its matched [OAI-PMH records](https://www.openarchives.org/pmh/) (as harvested by Unpaywall).
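# As a minimal sketch of this derivation (the column names "published_date" and
# "first_pmh_record_date" below are illustrative, not the exact Unpaywall schema):
import pandas as pd

green_example = pd.DataFrame({
    "published_date": pd.to_datetime(["2014-06-01", "2015-03-15", "2016-01-10"]),
    "first_pmh_record_date": pd.to_datetime(["2015-06-01", "2015-03-20", "2015-11-01"]),
})
# approximate OA lag in months, using the same days/30 convention as the SQL queries above
green_example["oa_lag_months"] = (green_example["first_pmh_record_date"] - green_example["published_date"]).dt.days // 30
# a negative lag means the repository copy (e.g. a preprint) predates formal publication
print(green_example)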
#
# {{ print figure_link("oa_lag_green") }} shows four plots: the leftmost plot shows Green OA articles that were first made OA in 2015, the second plot shows Green OA articles that were first made OA in 2016, and so on. Each plot is a histogram of number of articles by date of publication.
#
# In[29]:
first_detailed_plots("green")
# **{{print figure_link("oa_lag_green")}}: OA lag for Green OA.** Each plot shows articles that were first made available during the given year of observation, by year of their publication on the x-axis.
# By looking at the first plot in depth, we can see that a few articles are made available *before* they are actually published (articles published in 2016 or 2017) -- these were preprints, submitted before publication. Continuing with the first plot, we can see the bulk of the articles that became available in 2015 were published in 2015 (lag of zero years) or in 2014 (lag of 1 year). A few were published in 2013 (an OA lag of 2 years), and then a long tail represents the backfilling of older articles.
#
# Looking now at all plots in {{ print figure_link("oa_lag_green") }}, we can see that a similar OA lag pattern (a few preprints are available before publication, most articles become available within a 3 year OA lag, then a long tail) has held for the last four years of Green OA availability (the distribution of the bars is similar in all four graphs).
# We can also see that the relative amount of Green OA is growing slightly by year of OA-first-availability (the area under the whole histogram gets higher with each subsequent histogram). Green OA appears to be growing. We will explore this further in [Section 4.2](#section-4-2).
#
# More details on Green OA lag are included in Supplementary Information, [Section 11.1](#section-11-1).
# <a id="section-4-1-3"></a>
# #### 4.1.3 OA lag for Bronze Delayed OA
# There was no recent, complete, publicly-available list of Delayed OA journals, so we derived a list empirically based on the Unpaywall database. We have made our list publicly available: details are in [Section 7.2](#section-7-2).
#
# To create the list we started by looking at existing compilations of Delayed OA journals, including:
#
# - <https://www.elsevier.com/about/open-science/open-access/open-archive>
#
# - <http://highwire.stanford.edu/cgi/journalinfo#loc>
#
# - <https://www.ncbi.nlm.nih.gov/pmc/journals/?filter=t3&titles=current&search=journals#csvfile>
#
# - <https://en.wikipedia.org/wiki/Category:Delayed_open_access_journals>
#
# - [Delayed open access: An overlooked high‐impact category of openly available scientific literature](https://helda.helsinki.fi/bitstream/10138/157658/3/Laakso_Bj_rk_2013_Delayed_OA.pdf) by Laakso and Björk (2013).
#
# From those sources we determined that almost all embargoes for Delayed OA journals are at 6, 12, 18, 24, 36, 48, or 60 months.
# Next we used the Unpaywall data to calculate the OA rate of all journals, partitioned by age of their articles. We looked at Bronze OA rates before and after each of these common month cutoffs, highlighting cutoffs where OA was much less than 90% before the cutoff and 90% or higher afterwards. For each cutoff that looked like a Delayed OA candidate, we manually examined the full OA pattern for the journal and made a judgment call about whether it had an OA pattern consistent with a Delayed OA journal (low OA rates for articles until an embargo date, then high OA rates). We finally cross-referenced this empirically derived list with the sources again to see if it was roughly equivalent for journals on both lists -- it is, and the empirically derived list is more comprehensive.
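# A hedged sketch of the cutoff screen described above (one plausible reading of it; the
# function and column names are illustrative, and the real analysis ran over per-journal
# OA rates computed from the Unpaywall database):
import pandas as pd

CANDIDATE_EMBARGO_MONTHS = [6, 12, 18, 24, 36, 48, 60]

def candidate_embargoes(journal_rates, threshold=0.90):
    # journal_rates: one journal's Bronze OA rate by article age in months
    candidates = []
    for cutoff in CANDIDATE_EMBARGO_MONTHS:
        before = journal_rates.loc[journal_rates.article_age_months.between(cutoff - 6, cutoff - 1), "bronze_oa_rate"].mean()
        after = journal_rates.loc[journal_rates.article_age_months.between(cutoff, cutoff + 5), "bronze_oa_rate"].mean()
        # flag cutoffs where OA is well below the threshold before and at/above it after
        if pd.notnull(before) and pd.notnull(after) and before < threshold and after >= threshold:
            candidates.append(cutoff)
    return candidates

# toy journal: closed for its first year, openly available afterwards
toy_journal = pd.DataFrame({
    "article_age_months": range(0, 60),
    "bronze_oa_rate": [0.05] * 12 + [0.97] * 48,
})
print(candidate_embargoes(toy_journal))  # [12] -- flagged, then manually reviewed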
# Our resulting list includes 3.6 million articles (4.9% of all articles) published in 546 journals, with the following embargo lengths:
#
# embargo length (months)|number of journals|number of articles
# ---|---|---
# 6 |58 | 511,326
# 12 |175| 1,608,597
# 18 |137 | 689,820
# 24 |42 | 188,949
# 36 |71 | 269,186
# 48 |63 | 316,510
# **Total** |**546** | **3,584,388**
# In[30]:
register_new_figure("oa_lag_delayed_bronze");
# We used this list to split articles labelled "Bronze" by Unpaywall into two categories: "Delayed Bronze" for articles published in journals in our Delayed OA list, and "Immediate Bronze" for all others.
#
# Immediate Bronze articles have no OA lag: they become available on the publisher site immediately.
#
# We estimate the OA lag for a Delayed Bronze OA article as the Delayed OA embargo of the journal it is published in. From there we can also estimate the date it first became OA by adding the embargo period to the publication date of the article.
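# A minimal sketch of this estimate (the column names "published_date" and
# "embargo_months" are illustrative, not the exact schema used in the SQL):
import pandas as pd

bronze_example = pd.DataFrame({
    "published_date": pd.to_datetime(["2014-03-01", "2015-07-01"]),
    "embargo_months": [12, 24],
})
bronze_example["estimated_first_oa_date"] = [
    published + pd.DateOffset(months=int(months))
    for published, months in zip(bronze_example.published_date, bronze_example.embargo_months)
]
print(bronze_example)  # first row becomes OA 2015-03-01, second 2017-07-01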
#
# {{ print figure_link("oa_lag_delayed_bronze") }} shows four plots: the leftmost plot shows Delayed Bronze OA articles that were first made OA in 2015, the second plot shows Delayed Bronze OA articles that were first made OA in 2016, and so on. Each plot is a histogram of number of articles by date of publication.
#
# As the histograms in {{ print figure_link("oa_lag_delayed_bronze") }} show, most articles become available after a one-year lag. Bumps representing articles that become available after 24, 36, and 48 months are also clearly visible.
# In[31]:
first_detailed_plots("delayed_bronze")
# **{{print figure_link("oa_lag_delayed_bronze")}}: OA lag for Delayed Bronze OA.** Each plot shows articles that were first made available during the given year of observation, by year of their publication on the x-axis.
# By looking at the first plot of {{ print figure_link("oa_lag_delayed_bronze") }} in depth, we can see that most articles first made available in Delayed Bronze OA journals were made available with a one-year OA lag, in 2014. A few were made available with a lag of less than one year, two years, or four years.
#
# We can also see that the relative amount of Delayed Bronze OA is not growing very much by year of OA-first-availability (the area under the whole histogram is approximately the same for all histograms). Delayed Bronze OA is not growing quickly. We will explore this further in [Section 4.2](#section-4-2).
#
# More details on Delayed Bronze OA lag are included in Supplementary Information, [Section 11.2](#section-11-2).
# <a id="section-4-1-4"></a>
# #### 4.1.4 Closed access at date of observation
# We consider an article Closed if it has been published and is not considered OA at the time of observation.
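# The definition above, as a small hedged sketch (the real classification is done in the
# SQL queries, not with this helper; dates here are plain datetime.date objects):
import datetime

def is_closed_at(published_date, first_oa_date, observation_date):
    # Closed = already published, but not yet OA at the observation date
    published = published_date <= observation_date
    oa_by_then = first_oa_date is not None and first_oa_date <= observation_date
    return published and not oa_by_then

# published in 2016, embargo met mid-2018: still Closed when observed at the start of 2017
print(is_closed_at(datetime.date(2016, 1, 1), datetime.date(2018, 6, 1), datetime.date(2017, 1, 1)))  # True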
# <a id="section-4-1-5"></a>
# #### 4.1.5 Past OA by date of observation and date of publication
# In[32]:
register_new_figure('small-multiples-num-papers-past');
# We combine the OA lag data above to describe OA by date of observation for all OA types, in {{ print figure_link('small-multiples-num-papers-past')}}.
#
# Each column is a year of observation, from 2014 to 2018. Each row is a different OA type. Each mini plot is a histogram of all articles available by publication date, for the given observation year and OA type.
#
# This figure differs from {{ print figure_link("oa_lag_green") }} and {{ print figure_link("oa_lag_delayed_bronze") }} in that it is cumulative over date of first availability: it shows all papers published prior to the year of observation.
# In[33]:
# start here
now_year = 2018
papers_per_year_historical = pd.DataFrame()
for graph_type in graph_type_order:
for prediction_year in range(1990, now_year+1):
papers_per_year = get_papers_by_availability_year(graph_type, prediction_year, just_this_year=True)
papers_per_year["graph_type"] = graph_type
papers_per_year["prediction_year"] = prediction_year
papers_per_year_historical = papers_per_year_historical.append(papers_per_year)
papers_per_year_historical_cumulative = pd.DataFrame()
for graph_type in graph_type_order:
for prediction_year in range(1990, now_year+1):
papers_per_year = get_papers_by_availability_year(graph_type, prediction_year, just_this_year=False)
papers_per_year["graph_type"] = graph_type
papers_per_year["prediction_year"] = prediction_year
papers_per_year_historical_cumulative = papers_per_year_historical_cumulative.append(papers_per_year)
# In[34]:
my_range = range(2014, 2018+1)
fig, axes = plt.subplots(len(graph_type_order)+1, len(my_range), figsize=(12, 6), sharex=True, sharey=False)
axes_flatten = axes.flatten()
plt.tight_layout(pad=0, w_pad=2, h_pad=1)
plt.subplots_adjust(hspace=1)
i = 0
for observation_year in my_range:
ax = axes_flatten[i]
ax.set_axis_off()
column_label = "observation year\n{}".format(observation_year)
ax.text(.3, .2, column_label,
horizontalalignment='center',
verticalalignment='bottom',
fontsize=14,
transform=ax.transAxes)
i += 1
for graph_type in graph_type_order[::-1]:
for observation_year in my_range:
ax = axes_flatten[i]
this_data = papers_per_year_historical_cumulative.copy()
this_data = this_data.loc[this_data.graph_type == graph_type]
this_data = this_data.loc[this_data.prediction_year == observation_year]
this_data["publication_date"] = [int(observation_year - a) for a in this_data.article_years_from_availability]
new_data = graph_available_papers_in_observation_year_by_pubdate(graph_type, this_data, observation_year, ax=ax)
y_max = papers_per_year_historical_cumulative.loc[(papers_per_year_historical_cumulative.graph_type == graph_type) &
(papers_per_year_historical_cumulative.prediction_year <= max(my_range))]["num_articles"].max()
ax.set_ylim(0, 1.2*y_max)
axis_color = "silver"
ax.spines['bottom'].set_color(axis_color)
ax.spines['top'].set_color(axis_color)
ax.spines['right'].set_color(axis_color)
ax.spines['left'].set_color(axis_color)
ax.tick_params(axis='x', colors=axis_color)
ax.tick_params(axis='y', colors=axis_color)
i += 1
i_bottom_left_graph = len(graph_type_order) * len(my_range)
ax_bottom_left = axes_flatten[i_bottom_left_graph]
ax_bottom_left.set_ylabel("articles\n(millions)");
ax_bottom_left.set_xlabel("year of publication");
axis_color = "black"
ax_bottom_left.spines['bottom'].set_color(axis_color)
ax_bottom_left.spines['top'].set_color(axis_color)
ax_bottom_left.spines['right'].set_color(axis_color)
ax_bottom_left.spines['left'].set_color(axis_color)
ax_bottom_left.tick_params(axis='x', colors=axis_color)
ax_bottom_left.tick_params(axis='y', colors=axis_color)
# **{{print figure_link("small-multiples-num-papers-past")}}: Articles by year of observation, 2014-2018.** Each row is an OA Type, each column is a Year of Observation, the x-axis of each graph is the Year of Publication, and the y-axis is the total number of articles (millions) available at the year of observation.
# In[35]:
first_year_row = 2014
# We can see that Gold, Hybrid, and Immediate Bronze OA simply accumulate new articles each year, immediately. For example, the {{ print first_year_row+1}} Gold graph is identical to the {{ print first_year_row}} Gold graph beside it, other than the addition of a new, taller rightmost bar showing new papers published and made available in 2015.
#
# In contrast, Green OA (6th row) and Delayed Bronze OA (2nd row) graphs all have more complicated trends. The graphs for the {{ print first_year_row+1}} observation year differ from the {{ print first_year_row}} graphs beside them in that they have a few new publications in {{ print first_year_row+1}}, but they also boost the {{ print first_year_row}} publication year, and even older years. In fact we can see that when observed in {{ print first_year_row+4}} (the last column of the whole figure) Green OA is higher in all publication years than it was in the observation year {{ print first_year_row}} (the first column in the figure) because of met embargoes and backfilling. A similar trend is visible for Delayed Bronze OA.
# It is hard to see at the scale of {{print figure_link("small-multiples-num-papers-past")}}, but the Closed access graphs (top row) have the opposite trend -- when observed in 2018 (the last column), *fewer* papers in the early bars of the histogram were considered Closed compared to an observation made in {{ print first_year_row}} (first column). This is because some of what was "observed" as Closed in {{ print first_year_row}} has become Green or Bronze by the observation year of {{ print first_year_row+4}}, and therefore no longer appears in the Closed access histograms.
# <a id="section-4-1-6"></a>
# #### 4.1.6 Combined Past OA by date of observation
# In[36]:
register_new_figure('articles_by_oa_historical');