Why Care?, Continued
Why should we care whether or not a set of data follows a power law distribution, as opposed to a Normal distribution or any other form?
A major reason, of course, is to understand the mechanisms underlying the data. For example, suppose you are studying a particular network, such as a genetic regulatory network, in which nodes represent genes and a link from node A to node B means that gene B is regulated by gene A. A common thing to do with such a network is to plot the observed degree distribution. On the x-axis is k, the different possible degrees (degree = number of links coming into or going out of a node). On the y-axis is the probability that a node will have degree k. Here is an example of such a distribution, on a log-log plot:
(From http://aluru-sun.ece.iastate.edu/doku.php?id=arabidopsis_gene_network)
The large dots are the actual data, and the dashed line is the best straight-line fit. The degree distribution appears to follow that line pretty well for degrees between roughly 10 and 100.
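To make the plot above concrete, here is a minimal sketch of how one might compute a degree distribution and fit a straight line to part of it on log-log axes. The function names, the fitting range, and the tiny example degree list are all my own assumptions for illustration; they are not taken from the Arabidopsis data.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in counts.items()}


def loglog_slope(pk, k_min, k_max):
    """Least-squares slope and intercept of log10 P(k) vs. log10 k,
    using only the degrees k with k_min <= k <= k_max."""
    points = [(math.log10(k), math.log10(p))
              for k, p in pk.items()
              if k_min <= k <= k_max and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two points in the fitting range")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical degree list (one entry per node), made up for illustration.
example_degrees = [1, 1, 1, 1, 2, 2, 3, 1, 2, 5, 1, 1, 2, 3, 8, 1]
example_pk = degree_distribution(example_degrees)
example_slope, example_intercept = loglog_slope(example_pk, 1, 8)
print(f"fitted log-log slope: {example_slope:.2f}")
```

The slope of the fitted line plays the role of the (negative) power-law exponent over the chosen range. Note that a fit like this only describes the range it was fitted on; by itself it does not show that the data "is" a power law.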
It turns out that it's pretty hard, in general, to say whether or not a picture like this "actually" represents a power law. In fact, that very question has some inherent vagueness to it.
Power laws, defined as mathematical equations of the kind given in my previous post, are functions over a variable x, where x can take on any real value. Clearly in any finite system, such as a genetic regulatory network with a finite number of nodes and links, the variable k (the degree of a node) cannot be larger than the total number of links in the system, so by mathematical definition, the above graph is not a power law. Compounding this is the fact that only part of the data above follows a straight line on the log-log plot. Really, the best we can say is that parts of the data "are fit well by a power law". And that turns out to be a pretty weak statement for a few reasons, as I'll discuss later in this series.
In any case, there are several reasons we might care whether or not the straight-line part of the graph above is (at least approximately) a power law. First, we might be interested in whether or not the distribution has a heavy tail, as compared to a Normal distribution. If so, then extreme events (in this example, genes with a very large degree, i.e., genes that control a large number of other genes) are more likely to happen in the system under study. In the genetic network example, the existence of genes with very high degree might imply that a very small mutation to such a gene would have a huge impact over the entire system. This might have important implications for understanding the causes of certain diseases or for possible treatments via genetic engineering.
Note that while a power law distribution implies a heavy tail, power laws are not the only distribution with heavy tails. Thus if we only care about the qualitative shape of the distribution, it might not matter if the distribution is power law, or something else with similar heavy-tail properties.
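To get a feel for what "heavy tail" means numerically, here is a small sketch comparing the probability of exceeding a value x under a Normal distribution, a power law, and a log-Normal (one example of a heavy-tailed distribution that is not a power law). The parameter values, the exponent name alpha, and the cutoff x_min are arbitrary assumptions of mine, chosen only to show the qualitative difference.

```python
import math


def normal_tail(x, mu=0.0, sigma=1.0):
    """P(X > x) for a Normal distribution with mean mu and std. dev. sigma."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


def power_law_tail(x, x_min=1.0, alpha=2.5):
    """P(X > x) for a continuous power law p(x) ~ x**(-alpha) on x >= x_min.
    Requires alpha > 1 so that the distribution can be normalized."""
    if x < x_min:
        return 1.0
    return (x / x_min) ** (-(alpha - 1.0))


def lognormal_tail(x, mu=0.0, sigma=1.0):
    """P(X > x) when log(X) is Normal with mean mu and std. dev. sigma."""
    if x <= 0:
        return 1.0
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))


for x in (2, 5, 10, 20):
    print(f"x={x:>3}  normal={normal_tail(x):.3e}  "
          f"power_law={power_law_tail(x):.3e}  "
          f"lognormal={lognormal_tail(x):.3e}")
```

With these (assumed) parameters, the Normal tail collapses toward zero very quickly as x grows, while both the power law and the log-Normal keep a non-negligible probability of large values.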
A second reason we might care about the kind of distribution the observed data follows is that particular statistical analysis tools are relevant only for particular distributions. For example, consider "standard deviation". People often use the standard deviation of a data set to place error bars on curve fits. My son's school sends home standardized test scores with error bars on them; these tests, and their associated error bars, are used to determine if my son gets a place in the coveted magnet school for gifted kids. My mutual fund advisor sends me the projected results of various fund portfolio choices, with error bars, which are used to determine how risky these portfolios will be. I read in the news that the recent financial crisis was a "16 sigma event", meaning that its likelihood was considered to be 16 standard deviations away from what was expected, so no one could have reasonably predicted it [thanks to Jim Rutt for reminding me of this quote]. All these analyses, large and small, use the notion that standard deviation makes sense for analyzing error or risk. This is true only if the quantities being measured (test scores, portfolio returns, crisis probabilities) are Normally distributed. If these things are distributed according to, say, a power law, all these assumptions fly out the window.
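Here is a rough sketch of why standard deviation can mislead. It draws samples from a Normal distribution and from a power law, and prints the sample standard deviation as the sample size grows. The exponent alpha = 2.5, the cutoff x_min = 1, and the seed are assumptions I picked for illustration; for a continuous power law the variance is infinite whenever alpha is 3 or less, so the sample standard deviation has no fixed value to settle on. The sketch also computes how improbable a "16 sigma" event would be if the Normal assumption really held.

```python
import math
import random
import statistics


def sample_power_law(rng, n, x_min=1.0, alpha=2.5):
    """Draw n samples from p(x) ~ x**(-alpha), x >= x_min, by inverse transform.
    1 - rng.random() lies in (0, 1], so the power below is always defined."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def sigma_event_probability(n_sigma):
    """One-sided probability that a Normal variable lands more than
    n_sigma standard deviations above its mean."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
for n in (100, 1_000, 10_000, 100_000):
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    heavy = sample_power_law(rng, n)
    print(f"n={n:>7}  normal stdev={statistics.pstdev(normal):.3f}  "
          f"power-law stdev={statistics.pstdev(heavy):.3f}")

print(f"Normal probability of a 16-sigma event: "
      f"{sigma_event_probability(16):.1e}")
```

The Normal column should hover near 1 at every sample size. The power-law column is dominated by whichever rare, enormous values happen to be in the sample, so it is not a stable summary of "typical" spread.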
A third reason we might care is that we might be concerned with what physical (or biological, or social, or technological) mechanism actually gave rise to the observed distribution. Certain mechanisms are known to produce power laws with particular exponents. Other mechanisms are known to produce Normal distributions, or log-Normal, or Poisson, or what-have-you. (More on this later.) Details of the observed distribution can help narrow down what type of mechanism underlies the system. Two problems here are (1) it's quite hard, with a limited amount of data, to be confident about what mathematical distribution best fits the data, and (2) even if we are certain the distribution is a power law, there are many different possible mechanisms that could have produced it. Mark Newman's paper discusses in detail several different possible mechanisms, which I'll summarize later on.
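As a taste of what "a mechanism that produces a power law" can look like, here is a minimal sketch of one well-known example, preferential attachment, in which new nodes prefer to link to nodes that already have many links. This particular mechanism is my choice for illustration and is not singled out in the text above; the function name, the parameter m (links per new node), and the starting configuration are likewise assumptions. In large networks this growth rule is known to give a heavy-tailed degree distribution with an exponent close to 3.

```python
import random
from collections import Counter


def preferential_attachment_degrees(n_nodes, m=2, seed=0):
    """Grow a network in which each new node links to m distinct existing
    nodes, chosen with probability proportional to their current degree.
    Returns the list of node degrees."""
    rng = random.Random(seed)
    # Start from a fully connected core of m + 1 nodes (each has degree m).
    degrees = [m] * (m + 1)
    # 'stubs' lists each node once per link end, so a uniform draw from it
    # picks a node with probability proportional to its degree.
    stubs = [node for node in range(m + 1) for _ in range(m)]
    for new_node in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degrees.append(m)
        for t in targets:
            degrees[t] += 1
            stubs.extend((t, new_node))
    return degrees


degrees = preferential_attachment_degrees(5000)
counts = Counter(degrees)
mean_degree = sum(degrees) / len(degrees)
print(f"mean degree: {mean_degree:.2f}, max degree: {max(degrees)}")
print("most common degrees:", counts.most_common(3))
```

Most nodes end up with a degree near the minimum, while a few early nodes become hubs with a degree far above the mean. Keep problem (2) in mind, though: seeing such a distribution in real data does not tell you that this mechanism, rather than some other one, produced it.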