14 February 2009

Peter Norvig's "An Exercise in Species Barcoding"!

On Charles Darwin's 200th birthday, I was wondering what percentage of the high achievers working at Google USA were among the 37% of Americans who accept that natural selection provides the best explanation of the world's species.

Peter Norvig is one of those high-achieving Google employees, and this week he posted a link to "An Exercise in Species Barcoding" on his blog. Some excerpts are below.

Recently I've been looking at the International Barcode of Life project. The idea is to take DNA samples from animals and plants to help identify known species and discover new ones. While other projects strive to identify the complete genome for a few species, such as humans, dogs, red flour beetles and others, the barcoding project looks at a short 650-base sequence from a single gene. The idea is that this short sequence may not tell the whole story of an organism, but it should be enough to identify and distinguish between species. It will be successful as a barcode if (a) all (or most) members of a species have the same (or very similar) sequences and (b) members of different species have very different sequences. I was able to acquire a data set of 1248 barcode sequences, all of them Lepidoptera (butterflies and moths) from Australia. Each entry gives the name of the specimen (if known), the location where it was collected, and a 659 base (i.e. ACTG) barcode.
The Big Questions
  • Can I figure a way to cluster the barcodes into species?
  • How many species are there in this data set?
  • Will there be a clear answer, or will there be many possible solutions?
  • Is the notion of a species even well-defined? That is, do the individuals cluster into groups with large-margin boundaries between them, or do they overlap?
Really we can only hope to answer this question with respect to this particular data set, but perhaps it will give us some insight into other data sets, and into the nature of species in general.
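
To make the first question concrete, here is a minimal sketch of one way barcodes could be grouped: count the positions at which two sequences differ (Hamming distance) and put any two sequences within a chosen threshold into the same cluster. This is not Norvig's code. The function names, the threshold and the toy sequences are all assumptions made up for illustration, and real barcodes would be about 650 bases long.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster(barcodes, threshold):
    """Single-linkage grouping: two barcodes land in the same cluster if a
    chain of pairs, each within `threshold` differences, connects them."""
    parent = list(range(len(barcodes)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(barcodes)), 2):
        if hamming(barcodes[i], barcodes[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, code in enumerate(barcodes):
        groups.setdefault(find(i), []).append(code)
    return list(groups.values())


# Hypothetical 8-base toy "barcodes", invented for this example.
toy = ["ACTGACTG", "ACTGACTA", "TTGCAACC", "TTGCAACG", "GGGGCCCC"]
clusters = cluster(toy, threshold=1)
print(len(clusters))
```

The threshold is the whole game here: raise it and clusters merge, lower it and they split, which is one reason the number of species may not have a single clear answer.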

Answering the Big Questions

Now we can attempt to answer the questions.
  • Can I figure a way to cluster the barcodes into species?
    Yes. We can cluster barcodes together. We can get good agreement for about 96 or 97% of the individuals, but are uncertain of the remaining 3 or 4%.
  • How many species are there in this data set?
    I explored answers from 375 to 390, or equivalently 383±2%. There is some evidence (and some hunches) to support 384±1%, but I would hate to have to be more precise than that.
  • Will there be a clear answer, or will there be many possible solutions?
    The data does not seem to support a single answer. But asserting an answer within ±1% seems reasonable.
  • Is the notion of a species even well-defined?
    Inconclusive from this data. There are 1% to 4% or so of individuals that are on the border between two species and could go one way or another, according to this data. But you could also say the glass is 95% full -- most individuals are conclusively clustered together, in a way that makes sense to the person doing the collecting.
  • More generally, "species" is often defined as a "group of organisms capable of interbreeding and producing fertile offspring." That's a start, but it's not a perfect definition. First of all, the majority of organisms do not even reproduce sexually. Birds do it, bees do it, most macroscopic eukaryotes do it, but bacteria and archaea do not, nor do some plants and fungi. Second, what does "capable of" mean? Historically, the Capulets and Montagues did not interbreed (nor the Sharks and Jets), but most observers would say they would be capable. But what is an observer to say about two groups of frogs that disdain each other? How do we know if they are capable of interbreeding? Third, there is the problem of transitivity of species membership. Consider the Ensatina salamander. These exist in the mountains surrounding the Central Valley in California. The mountains are laid out in a horseshoe shape, and as you traverse the horseshoe, you notice variations in the salamanders. Each variation can interbreed with its near neighbors, but the ones at the extreme western end cannot interbreed with those at the far eastern end. They can't be all one species, because they don't all interbreed, but then neighboring pairs do interbreed, so there is no clear answer as to where to draw the barriers. Biologists describe this as a ring species, which is neither a single species nor a set of multiple discrete species. It seems we have to accept that species is a natural kind term which has clear prototypes -- paradigmatic cases where everyone can agree what is and isn't a species -- but does not have crisp boundaries.
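
The transitivity problem can be shown with a toy model. In the sketch below, five made-up "populations" are laid out along a horseshoe, and two populations count as able to interbreed if their sequences differ in at most one position. The sequences, the threshold and the `can_interbreed` function are all hypothetical stand-ins, not salamander data and not a biological test of interbreeding.

```python
def can_interbreed(a, b, threshold=1):
    """Toy stand-in for interbreeding: sequences differ in at most
    `threshold` positions. An assumption for illustration only."""
    return sum(x != y for x, y in zip(a, b)) <= threshold


# Hypothetical populations, ordered from one end of the horseshoe to the other.
ring = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT"]

# Every population can interbreed with its immediate neighbor...
neighbors_ok = all(
    can_interbreed(ring[i], ring[i + 1]) for i in range(len(ring) - 1)
)

# ...but the two ends of the horseshoe cannot interbreed with each other.
ends_ok = can_interbreed(ring[0], ring[-1])

print(neighbors_ok, ends_ok)
```

Each neighboring pair passes the test while the two ends fail it, so "can interbreed with" is not transitive in this model. A chain-following clustering rule would call all five one species, while a rule that demands every pair be compatible would have to cut the chain at some arbitrary point.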
