Building Towards a Future Where Reproducible, Open Science is the Norm
Karthik Ram and Ben Marwick
The traditional boundaries between domain researcher and scientific programmer have been blurring rapidly over the past decade. Pressing societal issues such as global climate change, disease outbreaks, endangered species conservation, and drug discovery cut across traditional scientific silos. Successfully answering such interdisciplinary problems will require researchers not only to access and process ever-increasing quantities of data but also to leverage them in the context of their domain expertise. The cost of collecting this data is also dropping, and new technologies in every aspect of our lives now enable cheap and easy collection of high volumes of highly diverse data. As a result, scientific endeavors have come to rely on massive amounts of data being analyzed with a disparate set of tools and technologies.
Another consequence of the high volumes of data and increasing diversity of software tools is that scientists now produce a vast array of research products, such as data, code, and algorithms, in addition to traditional publications (Piwowar & Vision, 2013). Yet, until recently, funding agencies such as the US National Science Foundation did not consider any outputs beyond traditional peer-reviewed publications as credit-worthy outcomes. While some fields, such as astronomy and high energy physics, have long recognized the importance of making the entire research pipeline publicly available, this is far from normal in most areas of science. In the last decade, many areas of science have had high-profile cases of non-reproducible research. Well-publicised retractions include Diederik Stapel in social psychology, Anil Potti in cancer research, Carmen Reinhart and Kenneth Rogoff in economics, and Marc Hauser in evolutionary biology. In addition, large-scale efforts to reproduce biomedical (Begley & Ellis, 2012) and psychological experiments (Open Science Collaboration, 2015) suggest that the prevalence of non-reproducible research has been underestimated, resulting in news headlines declaring a 'reproducibility crisis' in science. The issue of reproducibility is particularly timely given the recent rise in retractions from high-profile journals (Van Noorden, 2011). While some aspects of this crisis are due to bad actors, there are also broader systemic problems that result in the production of non-reproducible research. In this chapter we briefly survey some of the gaps, challenges, and opportunities for improving the reproducibility of research.
Gaps: Reproducibility is hard
For many scientists, generating reproducible research is difficult because of the diversity of hardware and software in their workflow. For example, consider an analytical instrument that outputs data in a particular format, which then needs to be transformed and rearranged in several ways before being input into a sequence of several different specialized computer programs for analysis. As the data is moved between each program (we can call this in-between space a 'gap'), additional manual inspection, readjustment and perhaps combination with other data is required. Gaps result from disconnected tools that have been combined to suit a specific research problem. The problems of handling the data in the gaps are typically solved by bespoke methods that are unique to each group or individual, using tools that were never intended for scientific research (e.g. Make), and are rarely produced with the intention of making them public. The custom and expedient nature of these gap-filling methods makes it difficult to capture the entire workflow to enable other researchers to reproduce the result. Because of the high diversity of research problems and tools across different areas of science, attempts to integrate these into a single platform have had limited uptake outside of bioinformatics, where many of these pipeline frameworks were first developed (Leipzig, 2016).
Outside of bioinformatics, some researchers are filling these gaps by using a literate programming style that allows programming code and narrative text to be interwoven within a single document. One example of this is the work of FitzJohn et al. (2014), who combined the R package knitr with Make, among other tools, to create a self-contained and self-documenting workflow for their ecological study. A similar example is the archaeological study by Clarkson et al. (2015), who also used knitr to combine narrative text and programming code to process data from diverse sources. Clarkson et al. also used Docker to provide a self-contained computational environment for their workflow, so that the key software dependencies could be bundled into their research repository with the data. This example is described in more detail in Marwick (2016).
We believe that the use of knitr in these two examples is part of a broader trend toward the adoption of executable notebooks in science. An executable notebook is a framework that allows narrative text (and its accessories, such as citations, figures, tables, etc.) and the programming code that generates the figures and tables to be interwoven in a single source document. Among R users, knitr (a descendant of Sweave) is currently the dominant tool for producing executable notebooks. For Python users there is Jupyter, which can also be used with other programming languages. Our hope is that executable notebooks will be the solution to the problem of gaps in research workflows.
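To make the idea of an executable notebook concrete, here is a minimal sketch in Python of the underlying mechanism: a single source document holds both narrative text and code, and a renderer runs the code and splices its output into the text. The `<<code>>` and `<<end>>` delimiters, the `render` function, and the sample measurements are all hypothetical and invented for this illustration; they are not the syntax or API of knitr or Jupyter, which are far more capable.

```python
import contextlib
import io
import re

# Hypothetical chunk delimiters chosen for this sketch only; they are
# not the syntax used by knitr or Jupyter.
CHUNK = re.compile(r"<<code>>\n(.*?)<<end>>\n?", re.DOTALL)


def render(source: str) -> str:
    """Run each code chunk in order and replace it with what it printed."""
    namespace = {}  # shared across chunks, so later chunks see earlier results

    def run_chunk(match: re.Match) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), namespace)
        return buffer.getvalue()

    return CHUNK.sub(run_chunk, source)


# Illustrative document: the numbers are made up for the example.
SOURCE = """Mean flake length was computed from the raw measurements.
<<code>>
lengths = [12.0, 15.5, 9.5]
mean_length = sum(lengths) / len(lengths)
<<end>>
The mean is reported below, recomputed on every render.
<<code>>
print(f"Mean length: {mean_length:.1f} mm")
<<end>>
"""

if __name__ == "__main__":
    print(render(SOURCE))
```

Because the reported number is regenerated from the data each time the document is rendered, the text and the analysis cannot silently drift apart, which is the property that makes this style attractive for closing gaps.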
Two other key elements of filling the gaps in the scientific workflow are training for scientists in efficient computer programming, and infrastructure for sharing and collaborating with code. Great progress has been made in these areas, with organizations such as Software Carpentry and Data Carpentry developing and delivering volunteer-led training workshops to researchers across the sciences. Their lesson materials are open source and online to enable self-study for researchers unable to attend workshops in person. The infrastructure for sharing and collaborating has been made available by services such as GitHub and BitBucket. These services, based on the Git version control system, allow researchers to share their code, organize contributions to scientific software projects, and discover code produced by other researchers (Ram, 2013). In our view, the increase in demand by researchers for training in programming, and the rising popularity of GitHub as a public repository for scientific code, reflect a trend toward increasing openness in the scientific process, and in the reproducibility of research.
Challenges: Changing the incentives
Traditional incentives in science prioritize highly cited publications of positive, novel, tidy results. The practice of enabling the reproducibility of those results to be assessed by making the data and code publicly available is not part of the traditional incentives of science. However, individual researchers can gain significant personal benefits from their open science efforts. While preparing and depositing data into an easily discoverable repository requires an upfront time investment, there are numerous benefits to doing so. The National Science Foundation (NSF), for example, requires a data management plan as part of the proposal (Donnelly & Jones, 2010) and also counts these endeavors under its merit guidelines (NSF, 2012). Further, authors who share data alongside publications are also likely to be cited more (Piwowar, Day, & Fridsma, 2007) and benefit from alternate metrics which are strongly correlated with citations (Piwowar & Vision, 2013). Citation benefits have also been demonstrated for code sharing in research publications (Vandewalle, 2012).
The citation advantage from sharing research data has been demonstrated in numerous disciplines. Henneken and Accomazzi (2011) analysed 3814 articles in four astronomy journals and found that articles with links to open datasets on average acquired 20% more citations than articles without links to data. Restricting the sample to papers published since 2009 in The Astrophysical Journal, Dorch (2012) found that papers with links to data received 50% more citations per paper per year than papers without links to data. In 1,331 articles published in Paleoceanography between 1993 and 2010, Sears (2011) found that publicly available data in articles was associated with a 35% increase in citations. Similar positive effects of data sharing have been described in the social sciences. In 430 articles in the Journal of Peace Research, articles that offered data in any form, either through appendices, URLs, or contact addresses, were on average cited twice as frequently as an article with no data but otherwise equivalent author credentials and article variables (Gleditsch & Strand, 2003).
It is clear that researchers in a number of different fields benefit from a citation advantage for their articles that include publicly available datasets. In addition to increased citations for data sharing, Pienta et al. (2010) found that data sharing is associated with higher publication productivity. They examined 7,040 NSF and NIH awards and concluded that a research grant award produces a median of five publications, but when data are archived a research grant award leads to a median of ten publications. These studies suggest the investment of effort in improving reproducibility by sharing data can have payoffs in the traditional incentive system. These efforts are also advantageous in the broader, but very slow, shift in incentives that favor reproducibility over novelty that we sense is occurring in some fields.
The incentivisation of novelty has led to widespread anxiety that sharing of data will result in getting one's own research scooped, and a lack of appropriate rewards for time spent documenting and sharing methods (Piwowar et al., 2007). Even when there is an appreciation for open science, technical challenges such as a lack of appropriate skills and knowledge of best practices can hinder this process. By addressing both the cultural and technical challenges we can create a community of practice that would ensure that data sharing is the norm rather than the exception (Birnholtz & Bietz, 2003).
An important step forward in establishing norms for sharing data and using shared data is Daniel Kahneman's (2014) 'reproducibility etiquette'. He proposes that researchers intending to use an open dataset or code repository contact the original authors. When working with code written by others, he especially recommends having a discussion with the authors of the code. The purpose of this is to give them a chance to fix bugs or respond to issues you have identified before you make any public statements (Eglen et al., 2016). He also recommends citing code and data in an appropriate fashion. In addition, researchers should pay close attention to the license agreements attached to specific pieces of code, software, and data products, as they unambiguously state the conditions under which such work can be used, adapted, and redistributed (Morin, 2012). Although this is a simple and non-technical detail, we expect that if these values become normalized then the common anxiety about sharing code and data will diminish, and more researchers will feel comfortable making their work more reproducible.
Making one's research meaningfully reproducible is a significantly more involved effort than merely sharing a handful of scripts and datasets via open repositories (FitzJohn et al., 2014; Mesnard & Barba, 2016). Such activities represent the first of a series of rigorous steps necessary to make a research product truly reproducible. Many of the challenges lie in the analysis phase, where the provenance of all inputs and dependencies needs to be carefully tracked using automated workflows. It would be naive to suggest that researchers can make their work fully reproducible by following a few simple steps. Even when experienced computational researchers such as FitzJohn et al. and Mesnard and Barba began their studies with full reproducibility in mind, challenges around inadequate tooling and workflow complexity made the task quite hard.
Despite such roadblocks, rapid improvements in tools and workflow technology will continue to lower barriers to reproducibility across various disciplines. In the meantime, any level of reproducibility brings us closer to overcoming the challenges.
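As one small illustration of the provenance tracking mentioned above, the following Python sketch records a checksum for every input file together with the interpreter version, and then reports which inputs changed between two runs. It is a sketch under stated assumptions rather than a method used by any of the studies cited here: the function names, the manifest layout, and the file `measurements.csv` are all hypothetical, and a real workflow would also need to record package versions and intermediate outputs.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(input_paths):
    """Record the interpreter version and a checksum for every input file."""
    return {
        "python_version": platform.python_version(),
        "inputs": {str(p): sha256_of(p) for p in input_paths},
    }


def changed_inputs(old, new):
    """List inputs whose checksum differs between two manifests."""
    return sorted(
        name
        for name, checksum in new["inputs"].items()
        if old["inputs"].get(name) != checksum
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        data = Path(workdir) / "measurements.csv"  # hypothetical input file
        data.write_text("site,length_mm\nA,12.0\n")
        before = build_manifest([data])
        # The manifest is plain JSON, so it can be committed next to the code.
        print(json.dumps(before, indent=2))

        data.write_text("site,length_mm\nA,12.5\n")  # simulate a silent edit
        after = build_manifest([data])
        print("Changed inputs:", changed_inputs(before, after))
```

Storing such a manifest alongside the analysis code gives later readers a cheap way to check whether they are running the analysis on the same inputs as the original authors.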
Opportunities: The promise of open science
Science is in the midst of a dramatic transformation that is being driven by increasing access to large amounts of heterogeneous data. The long-established model where sole researchers collect and analyze their own data will no longer be the dominant approach and will instead be replaced by one where disparate datasets from multiple sources are used. It is now widely accepted in many scientific disciplines that existing datasets can be used to solve novel problems not anticipated by the original investigator (Faniel & Zimmerman, 2011; Nielsen, 2012; Whitlock, McPeek, Rausher, Rieseberg, & Moore, 2010). Such open data can serve as a research accelerator, enabling scientists to rapidly collaborate on knowledge creation and synthesis efforts (Neylon, 2012). A similar pattern of collaboration and reuse is also emerging across the scientific software stack, as is evident in the case studies described in this book. A rich suite of open source tools is rapidly lowering barriers to collaborations across disparate domains and institutions and helping accelerate the rate of scientific discovery in ways previously unimagined.
This new era of open science is enabling a community of practice that allows collaborations to scale more easily while allowing various links in the chain of scientific reasoning to be used in different contexts. Part of the reason why scientific workflows are not properly curated or shared is an artifact of the way the credit system currently works in science. Due to insufficient incentives to share, original investigators spend very little time on activities other than publishing. As a result, valuable data, code and critical details on implementation are prone to disappearing or becoming less useful over time (Michener, Brunt, Helly, Kirchner, & Stafford, 1997). However, the scholarly landscape is changing to provide both the incentives and means for increased data sharing.
Until recently, researchers who put time and effort into documenting and sharing data and details of their analysis were considered outliers. Now the scholarly landscape is in the midst of a revolution, and among the emerging changes are new incentive mechanisms for reporting research impact. For example, altmetrics (Piwowar, 2013) track the influence of research outputs and data products outside of the traditional citation framework, providing more ways to measure success. Organizations and repositories including DataCite, figshare, Zenodo, Dryad, DataONE, and others provide the means for data to be cited independent of publications. Papers that share data are more likely to receive citations (Piwowar et al., 2007), and people who collect and deposit well-curated data can receive measurable recognition for their efforts. This is especially important as the scientific community is calling for data citation to be part of the tenure and promotion practice (Parsons, Duerr, & Minster, 2010).
Once a critical mass of scientists share their data and code, this will act as a multiplier, allowing disparate groups of researchers to rapidly solve problems such as climate change (Peterson et al., 2002). We see these collaborations resulting from sharing data and code as one of the great opportunities to come from reproducible research.
Our discussion so far has focused on the role of the researcher, and the gaps, challenges and opportunities they face. However, there are a few other key groups that are relevant to changing the norms to enhance the reproducibility of research.
Many funders, such as the National Science Foundation (NSF) and National Institutes of Health (NIH), have long maintained data sharing requirements, although they have rarely been enforced (Borgman, 2012). However, recent changes to funding policies have made these requirements more stringent and explicit. As of 2011, new NSF proposals require a data management plan (Donnelly & Jones, 2010). This plan requires details on how the data will be documented and where it will be deposited upon completion of the effort.
Many fields in science are in the midst of a data revolution and have adapted to the emerging challenges to varying degrees. At one extreme, disciplines such as astrophysics have fully embraced data driven science by developing and supporting the infrastructure, computational methods, and the culture to derive the most value from the data they generate (Venugopal, Buyya, & Ramamohanarao, 2006). At the other, many data-rich disciplines still lack the culture or the practice to leverage or benefit from past endeavors. Funding agencies can serve as sources of change for these disciplines where cultural change is slow.
A second group for whom reproducible research provides new opportunities is research libraries. Concerns about reproducibility now transcend individual disciplines, and there is a need for research institutes and university campuses to provide resources to support reproducible research. Researchers need information on what tools and services are available for reproducible research, and how they can get training for these. Libraries are becoming sensitive to this need, and some have started providing guides to data management planning, software tools for reproducible research, and training sessions. Two particularly good examples that we are aware of are the University of Utah Library Reproducibility of Research resource and the NYU Libraries' Guide to reproducibility.
Journal editors are a third group in the research community that have important opportunities to enact change in support of reproducibility. For example, journal editors could increase the importance of reproducibility by requiring (and enforcing) mandatory full data and code deposition, encouraging and even soliciting replication studies, and supporting reviewers who attempt to reproduce studies while reviewing the paper. Several journals have introduced new guidelines for authors and made specific proposals that attempt to address the problems of non-reproducible research (Begley & Ioannidis, 2015). We see this opportunity for editors to support reproducibility as part of a broader cultural change, one that is occurring at a generational scale but that will substantially change the way we share our research outputs.
In this chapter we've surveyed some of the gaps, challenges and opportunities relating to reproducible research. We believe that for the majority of researchers there are now mature software solutions for bridging the gaps in a complex workflow. We are starting to see convergence in several disciplines on executable notebooks as one type of software for tackling the challenges of reproducible research. Reproducible research can provide benefits in the traditional incentive system, but our view is that some of the most compelling opportunities are in how incentives, and the practice of science more generally, can be changed by groups such as funding agencies, journal editors and libraries. Finally, we see opportunities for researchers in the form of new and more diverse research collaborations, equipped with uniquely large datasets to take on problems of general interest and wide benefit to humanity. Our observations are that the pace of change toward more reproducible research is accelerating, but that these are changes of a generational scale and so training, persistence, and optimism are vital to support the technical and policy efforts.
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531–533. Retrieved from http://dx.doi.org/10.1038/483531a
Begley, C. G., & Ioannidis, J. P. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116(1), 116–126. http://doi.org/10.1161/CIRCRESAHA.114.303819
Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting sharing in science and engineering. In Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work (pp. 339–348). ACM.
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059–1078.
Clarkson, C., Smith, M., Marwick, B., Fullagar, R., Wallis, L. A., Faulkner, P., … others. (2015). The archaeology, chronology and stratigraphy of Madjedbebe (Malakunanja II): A site in northern Australia with early occupation. Journal of Human Evolution, 83, 46–64.
Donnelly, M., & Jones, S. (2010). Template for a data management plan. Digital Curation Centre. Retrieved July, 12, 2010.
Dorch, S. (2012). On the citation advantage of linking to data: Astrophysics. Retrieved from https://halshs.archives-ouvertes.fr/hprints-00714715/
Eglen, S., Marwick, B., Halchenko, Y., Hanke, M., Sufi, S., Gleeson, P., … Poline, J.-B. (2016). Towards standard practices for sharing computer code and programs in neuroscience. bioRxiv. http://doi.org/10.1101/045104
Faniel, I. M., & Zimmerman, A. (2011). Beyond the data deluge: A research agenda for large-scale data sharing and reuse. International Journal of Digital Curation, 6(1), 58–69.
FitzJohn, R. G., Pennell, M. W., Zanne, A. E., Stevens, P. F., Tank, D. C., & Cornwell, W. K. (2014). How much of the world is woody? Journal of Ecology, 102(5), 1266–1272. http://doi.org/10.1111/1365-2745.12260
Gleditsch, N. P., & Strand, H. (2003). Posting your data: Will you be scooped or will you be famous? International Studies Perspectives, 4(1), 72–107. http://doi.org/10.1111/1528-3577.04105
Henneken, E. A., & Accomazzi, A. (2011). Linking to data - effect on citation rates in astronomy. CoRR, abs/1111.3618. Retrieved from http://arxiv.org/abs/1111.3618
Kahneman, D. (2014). A new etiquette for replication. Social Psychology, 45(4), 310.
Leipzig, J. (2016). A review of bioinformatic pipeline frameworks. Briefings in Bioinformatics. http://doi.org/10.1093/bib/bbw020
Marwick, B. (2016). Computational reproducibility in archaeological research: Basic principles and a case study of their implementation. Journal of Archaeological Method and Theory, 1–27.
Mesnard, O., & Barba, L. A. (2016). Reproducible and replicable CFD: It’s harder than you think. arXiv, 1605.04339.
Michener, W. K., Brunt, J. W., Helly, J. J., Kirchner, T. B., & Stafford, S. G. (1997). Nongeospatial metadata for the ecological sciences. Ecological Applications, 7(1), 330–342.
Morin, A., Urban, J., & Sliz, P. (2012). A quick guide to software licensing for the scientist-programmer. PLoS Comput Biol, 8(7), 1–7. http://doi.org/10.1371/journal.pcbi.1002598
Neylon, C. (2012). Science publishing: Open access must enable open use. Nature, 492(7429), 348–349. Retrieved from http://dx.doi.org/10.1038/492348a
Nielsen, M. (2012). Reinventing discovery: The new era of networked science. Princeton University Press.
NSF. (2012). US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal & Award Policies and Procedures Guide (NSF13004). Retrieved from http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc_id=USNSF_109
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). http://doi.org/10.1126/science.aac4716
Parsons, M. A., Duerr, R., & Minster, J.-B. (2010). Data citation and peer review. Eos, Transactions American Geophysical Union, 91(34), 297–298.
Peterson, A. T., Ortega-Huerta, M. A., Bartley, J., Sánchez-Cordero, V., Soberón, J., Buddemeier, R. H., & Stockwell, D. R. (2002). Future projections for Mexican faunas under global climate change scenarios. Nature, 416(6881), 626–629.
Pienta, A. M., Alter, G. C., & Lyle, J. A. (2010). The enduring value of social science research: The use and reuse of primary research data. Retrieved from http://deepblue.lib.umich.edu/bitstream/handle/2027.42/78307/pienta_alter_lyle_100331.pdf
Piwowar, H. (2013). Altmetrics: Value all research products. Nature, 493(7431), 159–159. http://doi.org/10.1038/493159a
Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175.
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE, 2(3), e308. http://doi.org/10.1371/journal.pone.0000308
Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 7.
Sears, J. (2011). Data sharing effect on article citation rate in paleoceanography. In AGU fall meeting abstracts (Vol. 1, p. 1628).
Van Noorden, R. (2011). The trouble with retractions. Nature, 478(7367), 26–28. http://doi.org/10.1038/478026a
Vandewalle, P. (2012). Code sharing is associated with research impact in image processing. Computing in Science and Engineering, 14(4), 42–47.
Venugopal, S., Buyya, R., & Ramamohanarao, K. (2006). A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys (CSUR), 38(1), 3.
Whitlock, M. C., McPeek, M. A., Rausher, M. D., Rieseberg, L., & Moore, A. J. (2010). Data archiving. The American Naturalist, 175(2), 145–6. Retrieved from http://www.jstor.org/stable/10.1086/650340