[author's note: I published this essay in 2011. It's a bit out of date.]

We Need a GitHub of Science

Summary

Scientific publishing allocates prestige through pre-publication peer review in journals, a system built for the printing press. GitHub shows that the social web can distribute validated, modular knowledge and allocate prestige after publication, based on how much the community actually uses a contribution. Science should adopt analogous post-publication utility metrics, so that scientists are rewarded for the usefulness of their work rather than solely for placing papers in prestigious journals.

Publish or Perish

I am a postdoctoral fellow, and my academic department is currently running a junior faculty search. We are interviewing four candidates, each of whom will present a job talk attended by the entire department. Before each talk, I’ll receive the candidate’s application packet, and my eyes will go straight to the “publications” section of the resume. The presence of a first-author article in the ultra-prestigious academic journals Science or Nature would all but guarantee an offer. Multiple publications in top-tier journals would indicate a strong application. If those are missing, meaning the publication history is weak, I’ll wonder how that person got an interview in the first place. Cultural fit, letters of reference, and other credentials certainly matter, but next to publications, everything else is secondary.

To anyone involved in academia, this overwhelming focus on publications is a given. Publishing is so central to scientists that their academic value can be measured by summing the relative worth of their publications. Over many years, citations preferentially accumulate around significant publications and, by extension, their authors. Ranking importance by citations received is a powerful concept, and incidentally it is the basis of Google’s search algorithm. But at the beginning of an academic’s career, before citations have had time to accumulate, reputation rests largely on which journals they have published in.

Getting a paper accepted into an academic journal requires passage through the often opaque process of peer review. Scientists make a big deal of peer review, because it is supposed to be the filter that separates mere opinions from trusted, citable sources. However, the peer-review process in science has close analogs in any “old-media” field, such as TV or radio. Like academic journals, these are media of limited capacity, and there are always more submissions (or ideas for submissions) than there are openings. The selection of content worthy of distribution is made by the field’s establishment, which effectively silences whatever it doesn’t choose. This is especially true of peer review as practiced in prestigious journals, defined here as the ones that get their contributors faculty jobs.

Having editorial decisions made by established experts makes sense, since they draw on judgement born of years of experience. But this exposes the system to vulnerabilities common to any decision by committee – especially a semi-secret committee – such as lack of agility, an aversion to disruptive innovation, and the tendency of committee members (and their friends) to be more equal in their own eyes than anyone else. Because publishing affects scientists so deeply, the strengths and weaknesses of this system inevitably shape the makeup and character of science as a whole. Which makes one wonder: is there a better way?

GitHubbing

My training has spanned biology, engineering, and computer science. My latest project, Instant Q&A for Physician Communities, relies heavily on open source code and led me to GitHub and Git. The Linux community developed Git, a distributed version control system, to coordinate work on the Linux code repository among thousands of programmers. Git is itself open source and has become widely adopted for many software projects (open source and not). GitHub is a cloud service that hosts over 1 million Git repositories. Since its launch in 2008, GitHub has quickly become the de facto platform for publishing open source code, and its popularity is changing the world. If you’ve ever been astonished at how quickly the web world seems to move, the primary reasons are 1) it’s not dominated by Microsoft, so we have competition instead of a monopoly, and 2) open source code, once shared through a multitude of mailing lists and now centralized on GitHub [1]. How has GitHub become so successful?

GitHub is a social network of code, the first platform for sharing validated knowledge that is native to the social web [2]. This is a big deal. I believe it represents a demonstrably superior way of distributing validated knowledge than academic publishing offers. How are these even related? Software developers rarely write applications from scratch. Instead, they typically start from modular bundles of open source code. In Ruby (the programming language underlying the popular web application framework Ruby on Rails [3]), these bundles are called gems. My current project employs 34 gems. Each one is responsible for a specific task, such as logging in users, interfacing with cloud storage, or making fancy-looking buttons. Science operates in a similar way. Scientists never begin a research project in an intellectual vacuum. They stand on the shoulders of giants, building on the knowledge contained in previous publications to form a new, coherent finding. For example, the article in which I published the bulk of my PhD thesis cites 38 others.

Gems are typically developed, distributed, and promoted through GitHub, and therein lies the connection. GitHub has evolved to solve the same general problem that scientific publishing does: making modular, validated units of knowledge easily usable by a global community, with mechanisms that efficiently allocate prestige to proven contributors. GitHub has the advantage of doing this with 21st century technology, the social web, while academic publishing is based on the printing press. This suggests an opportunity for the scientific community to evolve its publishing practices by assimilating mechanisms proven to work for GitHub.

Published Versus Prestigious

The existing peer-review process arose from the limited carrying capacity of physical journals. Prioritization had to happen before publication, because journals were limited in size to what could be economically printed and shipped. If you were born before 1990, you may recall the prestige formerly associated with being “a published author”. Today, however, distributing media is essentially free. Anyone can start a blog and deliver content worldwide in minutes. Clay Shirky has made a career of deftly explaining how this has fundamentally changed the media equation, with such unexpected consequences as YouTube videos that get more views than Super Bowl commercials.

Individuals still have a limited capacity for consuming and evaluating content, so prioritization and authentication remain necessary, but they look different. These functions are now disconnected from publication. Google prioritizes web pages by analyzing utility after publication, tracking citations in the form of inbound links. Similarly, anyone can publish a gem to GitHub, and published gems are prioritized by the number of developers “watching” for updates or “forking” new development lines. This is the social web at work, where the audience decides what and whom to pay attention to all by itself, without requiring assistance from all-powerful editorial committees. One can complain that lowering barriers to publication leads to content that is, on average, of lower quality. But the abundance of insignificant projects on GitHub does not detract from its usability, because those projects are never brought to anyone’s attention [4].
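To make that watch/fork mechanism concrete, here is a minimal sketch in Python against GitHub’s public REST API (api.github.com, as it exists today rather than in 2011); the repository names are arbitrary examples, and the counts are simply whatever the API reports when you run it. The point is that the ranking happens after publication, by observing what the community actually follows.

    import json
    import urllib.request

    def repo_stats(full_name):
        """Fetch watcher and fork counts for a repository like 'jquery/jquery'."""
        url = "https://api.github.com/repos/" + full_name
        with urllib.request.urlopen(url) as response:
            data = json.load(response)
        return {"repo": data["full_name"],
                "watchers": data["watchers_count"],
                "forks": data["forks_count"]}

    if __name__ == "__main__":
        # Arbitrary example projects; any public repositories would do.
        repos = ["jquery/jquery", "rails/rails", "git/git"]
        ranked = sorted((repo_stats(r) for r in repos),
                        key=lambda s: s["watchers"], reverse=True)
        for s in ranked:
            print("{repo:20s} watchers={watchers:>8} forks={forks:>7}".format(**s))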

Prestige is really about having an engaged audience that follows and recognizes your activities. This formerly required publication through established venues, but that’s no longer needed since your audience can use the social web to recognize and engage with you directly.

The Market for Prestige

Gems on GitHub are not just code. They also have authors, whose relative contributions are automatically catalogued by Git, as shown in the impact graph for the popular open source jQuery project. If you’ve visited a web application recently, chances are you’ve benefited from jQuery, which makes it easy for a web engineer to turn static web sites into responsive web applications (think interactions with buttons instead of navigation through links). The impact graph lets you know precisely which developers are responsible for this awesomeness. In this way, GitHub acts as an efficient, incorruptible “central bank” of the prestige supply. Furthermore, unlike on the web, where Google’s ranking accrues to domain names, great contributions on GitHub bring prestige to their creators. If you wanted to hire a contractor to work on a web application, GitHub can tell you who has publicly demonstrated the skills you need. It is thus not surprising that GitHub profiles are supplanting traditional resume items, such as a CS degree, for discerning employers looking to hire top talent.
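For anyone curious what “catalogued by Git” means in practice, the sketch below asks Git itself for commits per author in a local clone (it assumes you have run something like git clone https://github.com/jquery/jquery.git first); raw commit counts are only a crude stand-in for the richer impact graphs GitHub draws, but the bookkeeping comes for free.

    import subprocess

    def contributions(repo_path):
        """Return (author, commit_count) pairs for a local clone, most prolific first."""
        out = subprocess.run(
            ["git", "shortlog", "-s", "-n", "HEAD"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        ).stdout
        pairs = []
        for line in out.splitlines():
            count, author = line.strip().split("\t", 1)
            pairs.append((author, int(count)))
        return pairs

    if __name__ == "__main__":
        # "jquery" is the directory created by the example clone above.
        for author, count in contributions("jquery")[:10]:
            print("{:5d}  {}".format(count, author))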

By contrast, current Open Science efforts that ask scientists to “share all your data” have not become mainstream, because they do not appropriately reward knowledge producers. They are all free distribution and no prestige, solving a different half of the problem than traditional journals but not the whole enchilada. Put another way, when anything can be published, there is no prestige associated with being published, so prestige must be introduced in other ways. Evangelists for Open Science should focus on promoting new, post-publication prestige metrics that properly incentivize scientists to focus on the utility of their work, allowing them to worry less about publishing in the right journals.

The biomedical world is increasingly permeated by code and data [5], which should be very amenable to GitHub-style metrics since they are by nature tied to networked computers. Scientists in fields like genomics and biomedical informatics are being held to the same publication expectations as their peers, but this makes little sense. An article describing a genomic database is nowhere near as useful as an open API for accessing it. We need trusted ways to quantify just how useful that API and its associated code are to the scientific community, so that such metrics can be listed on a scientist’s profile and used by committees making hiring and funding decisions [6].
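As one concrete illustration of “an open API”, the sketch below queries NCBI’s public E-utilities service, a real and long-standing programmatic interface to biomedical databases; the choice of the Gene database and the search term are just examples, not an endorsement of any particular metric.

    import json
    import urllib.parse
    import urllib.request

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    def search_gene_db(term, retmax=5):
        """Return (total hit count, first few record IDs) from NCBI's Gene database."""
        query = urllib.parse.urlencode(
            {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"})
        with urllib.request.urlopen(EUTILS + "?" + query) as response:
            result = json.load(response)["esearchresult"]
        return int(result["count"]), result["idlist"]

    if __name__ == "__main__":
        count, ids = search_gene_db("BRCA1")   # an arbitrary example query
        print(count, "matching records; first IDs:", ids)

Every script like this is a “citation” of the database that a post-publication metric could count, in the same way GitHub counts watchers and forks.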

Challenges and Current Efforts

Of course, there are fundamental differences between publishing software code and publishing science. Copying code results in an exact replica and does not affect the original. By contrast, duplicating a research finding may require significant expense just to recreate the experimental conditions. Code is structured by the strict syntax of programming languages, while most scientific research is not. For these reasons and others, academic articles and journals are not going to disappear, but they should not be the only way for a scientist to accumulate prestige.

Unfortunately, energy that could be spent developing these new solutions is instead tied up in the older struggle over open access. Universities still pay outrageous sums to journal publishers for access to the knowledge they themselves produced, reviewed, and edited on their own dime [7]. Broadly speaking, traditional journals are being reduced to rent-takers on brand names with reputational inertia. arXiv, which provides open access to preprints in many quantitative disciplines, is a notable and long-running example of the scientific community’s workaround to this problem. arXiv is amazing, but why remain dependent on a system it could be replacing [8]?

PLoS is at the cutting edge both of open access and of rethinking the functions of a journal. PLoS ONE comes closest to what I am describing, in that its peer-review process screens only for scientific rigour, not perceived impact, meaning it will publish content considered unsexy and let future citations determine importance. But it has not yet embraced the social web, as the lack of scientist profiles (with associated prestige metrics) on its website demonstrates. In programmer jargon, PLoS ONE needs to become a web application, not a website that hosts content. One problem might be that it still considers itself a journal first, and journals have editorial boards, while the social web is all about not having editors. There is no editorial board at GitHub.

When I discuss this with current faculty, a typical reaction is that I’m pining for a social network of scientists. That seems reasonable, and it is being tried, but it may not be bold enough. GitHub did not succeed by being a social network of programmers. It succeeded by being a social network of code. We need a social network of science, meaning that scientific bundles of knowledge must be structured and accessible by API, with the connections among those bundles, and appropriate utility metrics, serving to connect and prioritize the scientists behind them.
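To make “structured and accessible by API” slightly less abstract, here is a purely hypothetical sketch of what such a bundle and a post-publication utility metric might look like; every name and field in it is invented for illustration, not a description of any existing system.

    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeBundle:
        identifier: str                    # a DOI-like handle (hypothetical)
        authors: list                      # structured authorship, not a PDF byline
        depends_on: list = field(default_factory=list)  # bundles it builds on (the citation analogue)
        reuse_count: int = 0               # downstream analyses, forks, API calls

    def utility_score(bundle, registry):
        """Toy metric: direct reuse plus reuse of bundles that build on this one."""
        downstream = [b for b in registry if bundle.identifier in b.depends_on]
        return bundle.reuse_count + sum(b.reuse_count for b in downstream)

    if __name__ == "__main__":
        a = KnowledgeBundle("bundle/genome-db", ["A. Scientist"], reuse_count=40)
        b = KnowledgeBundle("bundle/variant-caller", ["B. Scientist"],
                            depends_on=["bundle/genome-db"], reuse_count=15)
        registry = [a, b]
        for bundle in registry:
            print(bundle.identifier, utility_score(bundle, registry))

The details of any real metric would have to be argued out by the community; the point is simply that once the bundles and their connections are machine-readable, such metrics become computable at all.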

APIs for science already exist, and some are incredibly useful, but they have ignored the implications for authorship and prestige, which has prevented them from achieving their potential. For example, biophysicists have the RCSB Protein Data Bank, which stores experimentally determined protein structures. This database is a tremendous asset to the field, but it could represent much more, as a story from my younger days illustrates. In 2004, as an undergrad, I spent a summer writing Python code to download and analyze all existing RCSB structures. That program built a database of “real” structures to train a scoring algorithm, which subsequently scored computationally generated structures to see how “real” they seemed [9]. Unfortunately, my results were not compelling enough to be published in a prestigious academic journal, and were therefore not interesting to my research adviser. Open-sourcing and publishing that code might have saved someone’s time, spurred new thinking, or at the very least marked a tangible reward for my work. But the incentives for my adviser weren’t there, so he did not suggest it. The idea did not even occur to me, because I was not a good enough programmer to know about SourceForge, a less-social precursor to GitHub, so the code went nowhere.
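For flavor, here is a minimal sketch of what the downloading half of such a project might look like today, using RCSB’s current public download endpoint (files.rcsb.org); the PDB IDs and the toy “analysis” (counting alpha-carbon atoms) are just illustrations, not the scoring algorithm I actually wrote.

    import urllib.request

    def fetch_pdb(pdb_id):
        """Download one structure in legacy PDB text format from RCSB."""
        url = "https://files.rcsb.org/download/{}.pdb".format(pdb_id)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def count_ca_atoms(pdb_text):
        """Count alpha-carbon ATOM records, a rough proxy for residue count."""
        return sum(1 for line in pdb_text.splitlines()
                   if line.startswith("ATOM") and line[12:16].strip() == "CA")

    if __name__ == "__main__":
        for pdb_id in ["1CRN", "4HHB"]:   # crambin and hemoglobin, as small examples
            text = fetch_pdb(pdb_id)
            print(pdb_id, count_ca_atoms(text), "alpha carbons")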

Hey Mr. Gates

It may be that the activation energy required to initiate these changes won’t arise from within the system. In that case, an outside push might do the job, and the best place for that push to come from may be a nimble funding agency. For example, a request for proposals could specify that phase II funding decisions will be based on the impact of online resources developed in phase I, as measured by specific metrics developed with community feedback. Nothing makes a scientist contemplate change faster than a new source of grant money, and the only thing better than a faculty applicant with a paper in Science may be one bringing in a multimillion-dollar grant.

Further reading:

Collective knowledge systems: Where the Social Web meets the Semantic Web (Tom Gruber)

Peer Review in Academic Promotion and Publishing: Its Meaning, Locus, and Future (Diane Harley and Sophia Krzys Acord)

Mechanisms for (Mis)Allocating Scientific Credit (Jon Kleinberg and Sigal Oren)

The Life Scientists room on FriendFeed

Notes

1 “The combination of the Internet and open source transformed the functionality in modern programming tools, increasing developer productivity 10 fold” - Ben Horowitz, formerly of Netscape.

2 “Native” in the sense eloquently explained by USV (the VCs who funded Twitter): “Native opportunities are the ones that make use of unique capabilities of [new] platforms”. The social web is the new platform.

3 For example, Twitter, Groupon, and GitHub itself run on Ruby on Rails.

4 I speculate that many gems are also discovered through technical blogs (found through Google) or the programmer Q&A site StackOverflow.

5 Biomedical research is also huge: funding has been squeezed lately, but it is still on the order of ~$100B annually, so the potential market is large enough to be worth building for.

6 The unique requirements of the scientific community probably mean GitHub itself can’t do the job.

7 Or, more accurately, on the dime of the federal government and the philanthropic organizations that fund them. Journals do not compensate their editors or peer reviewers.

8 The peer-review process has been hacked via arXiv before, by Grigori Perelman. But to appreciate how unusual Perelman’s motivations are, consider that he also refused to accept the Fields Medal and the $1 million Clay Millennium Prize.

9 Computing protein structure from amino acid sequence is known as “the protein folding problem” and is one of the holy grails of science.

Thanks to Sean Ahrens, Sean Carroll, Manuel Cebrian, Wendy Chapman, Lawrence David, Lucila Ohno-Machado, Carlos von Muhlen, Denise von Muhlen, and Ryan Weald for reading drafts and helpful discussions.