Tuesday, January 22, 2013

Blogging the MOOC

I'm obviously not able to keep up with my original goal on this blog, which was to write understandable reviews of topics related to complex systems.  Sorry I haven't been able to accomplish this!

I've taken on a new challenge, which is starting a program at the Santa Fe Institute to offer "Massive Open Online Courses" (or MOOCs) related to complexity.  These courses will be free, open to anyone, and taught at levels ranging from undergraduate-level introductions to graduate-level technical courses.  I'm also designing and teaching the first course in this series, "Introduction to Complexity", which has no prerequisites and should be accessible to anyone who has the interest and the motivation to watch the lectures and do the homework.    You can find information about this course and sign up for announcements at http://www.santafe.edu/mooc/subscribe.  We already have over 4,000 people signed up for announcements!  The course is currently scheduled to start on Jan 28, but we're probably going to have to push it back a week, to start on Feb. 4, due to additional testing needed on the web infrastructure, before we unleash it to the masses.  

Designing a course like this is completely different from anything I've ever done before, especially since this also involves overseeing the building of a web-based platform for this and follow-on MOOCs.    I'll be blogging here a bit about my experiences in building and offering this course. 

Just to give a few random details to start with:

The course will be part of our Complexity Explorer (CE) website, a repository for educational materials, related to complex systems, which I and others from SFI and Portland State have been developing over the past year or so.  This is funded by a grant to SFI from the Templeton Foundation.  The CE website will be launched (hopefully) soon after the course starts.  I'm working with a team of PSU students to develop these materials.   More on this later.

We've hired a great web-development team, based here in Portland, called Bot and Rose, to develop the site, including the MOOC platform.    I've been working closely with them on the design and development strategy.

I'm recording lectures using the Camtasia Studio screen capture software on my Mac.  I also use a Wacom Bamboo tablet, and Sketchbook Express, for the handwritten parts of the lectures.    An undergraduate PSU film major, Shelbi Roake, is doing the video editing.  The TAs for the course are two PSU students, Max Orhai and John Balwit.  They, along with several other students, have developed a series of NetLogo simulations that I'll be using as part of the course.    I'll also be teaching a bit of the NetLogo programming language, which is accessible even if you've never done programming before. 

This MOOC project has been taking up the lion's share of my time, and will keep doing so for a while.  Fortunately I'm not teaching anything else this term.

I'll write more later on what I'll be teaching in the course and how I'm creating the lectures. 

Friday, May 25, 2012

Power laws, continued

Taking up where I left off on the previous post.

Why Care?, Continued

Why should we care whether or not a set of data follows a power law distribution, as opposed to a Normal distribution or any other form?

A major reason, of course, is to understand the mechanisms underlying the data. For example, suppose you are studying a particular network, such as a genetic regulatory network, in which nodes represent genes and a link from node A to node B means that gene B is regulated by gene A.  A common thing to do with such a network is to  plot the observed degree distribution. On the x-axis is k, the different possible degrees (degree = number of links coming into or going out of a node).  On the y-axis is the probability that a node will have degree k.   Here is an example of such a distribution, on a log-log plot:

[Figure: degree_distribution.png]

 (From http://aluru-sun.ece.iastate.edu/doku.php?id=arabidopsis_gene_network)

The large dots are the actual data, and the dashed line is the best fit to a straight line.  It looks like this degree distribution fits a straight line pretty well from degrees around 10 to 100.   
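As a rough illustration of how a degree distribution like the one plotted above is tabulated, here is a minimal Python sketch. The edge list and gene names are made up for illustration and are not taken from the Arabidopsis network; the function name `degree_distribution` is likewise just a label chosen here.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)}, where degree = links coming into or going out of a node."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1   # link going out of `source`
        degree[target] += 1   # link coming into `target`
    counts = Counter(degree.values())   # number of nodes having each degree k
    n_nodes = len(degree)               # only nodes that appear in some edge
    return {k: counts[k] / n_nodes for k in sorted(counts)}

# Hypothetical toy network: an edge (A, B) means gene A regulates gene B.
toy_edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]
toy_distribution = degree_distribution(toy_edges)
print(toy_distribution)
```

Plotting the keys against the values on logarithmic axes gives a picture of the kind shown above. Note that this sketch ignores isolated nodes, since they never appear in an edge list.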

It turns out that it's pretty hard, in general, to say whether or not a picture like this "actually" represents a power law.  In fact, that very question has some inherent vagueness to it.  Power laws, defined as mathematical equations of the kind given in my previous post, are functions over a variable x, where x can take on any real value.  Clearly in any finite system, such as a genetic regulatory network with a finite number of nodes and links, the variable k (the degree of a node) cannot be larger than the total number of links in the system, so by mathematical definition, the above graph is not a power law.  Compounding this is the fact that only part of the data above follows a straight line on the log-log plot.  Really, the best we can say is that parts of the data "are fit well by a power law".  And that turns out to be a pretty weak statement for a few reasons, as I'll discuss later in this series.

In any case, there are several reasons we might care whether or not the straight-line part of the graph above is (at least approximately) a power law.  First, we might be interested in whether or not the distribution has a heavy tail, as compared to a Normal distribution.  If so, then extreme events (in this example, genes with a very large degree, i.e., genes that control a large number of other genes) are more likely to happen in the system under study.  In the genetic network example, the existence of genes with very high degree might imply that a very small mutation to such a gene would have a huge impact over the entire system.  This might have important implications for understanding the causes of certain diseases or for possible treatments via genetic engineering.

Note that while a power law distribution implies a heavy tail, power laws are not the only distribution with heavy tails.  Thus if we only care about the qualitative shape of the distribution, it might not matter if the distribution is power law, or something else with similar heavy-tail properties. 

A second reason we might care about the kind of distribution the observed data follows is that particular statistical analysis tools are relevant only for particular distributions.  For example, consider "standard deviation".  People often use the standard deviation of a data set to place error bars on curve fits.  My son's school sends home standardized test scores with error bars on them; these tests, and their associated error bars, are used to determine if my son gets a place in the coveted magnet school for gifted kids.  My mutual fund advisor sends me the projected results of various fund portfolio choices, with error bars, which are used to determine how risky these portfolios will be.  I read in the news that the recent financial crisis was a "16 sigma event", meaning that it was considered to be 16 standard deviations away from what was expected, so no one could have reasonably predicted it [thanks to Jim Rutt for reminding me of this quote].  All these analyses, large and small, rely on the notion that standard deviation makes sense for analyzing error or risk.  This is true only if the quantities being measured (test scores, portfolio returns, crisis probabilities) are Normally distributed.  If these things are distributed according to, say, a power law, all these assumptions fly out the window.
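To make the standard-deviation point concrete, here is a small simulation sketch. It assumes the continuous power-law form P(x) = C x^(-alpha) for x >= x_0 from the previous post, with alpha = 2 and x_0 = 1 chosen arbitrarily; for exponents of 3 or less the theoretical variance of such a distribution is infinite. The helper name `power_law_samples` and the sample sizes are my own choices for illustration.

```python
import random
import statistics

def power_law_samples(n, alpha, x0, rng):
    """Draw n samples from p(x) = C * x**(-alpha), x >= x0 (inverse-transform sampling)."""
    # P(X > x) = (x / x0) ** -(alpha - 1), so x = x0 * (1 - u) ** (-1 / (alpha - 1)).
    return [x0 * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
normal = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
heavy = power_law_samples(10_000, alpha=2.0, x0=1.0, rng=rng)

# Sample standard deviation computed on growing prefixes of each data set.
for n in (100, 1_000, 10_000):
    print(n, statistics.stdev(normal[:n]), statistics.stdev(heavy[:n]))
```

In runs like this, the Normal column typically settles near 1 as n grows, while the power-law column tends to jump around and is dominated by the few largest samples, so an "error bar" based on it tells you little.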

A third reason we might care is that we might be concerned with what physical (or biological, or social, or technological) mechanism actually gave rise to the observed distribution.  Certain mechanisms are known to produce power laws with particular exponents.  Other mechanisms are known to produce Normal distributions, or log-Normal, or Poisson, or what-have-you.  (More on this later.)  Details of the observed distribution can help narrow down what type of mechanism underlies the system.  Two problems here are (1) it's quite hard, with a limited amount of data, to be confident about what mathematical distribution best fits the data, and (2) even if we are certain the distribution is power law, there are many different possible mechanisms that could have produced it.  Mark Newman's paper discusses in detail several different possible mechanisms, which I'll summarize later on.

Tuesday, May 22, 2012

Keeping up with blogging

Just as a quick interlude:  In trying to keep up with this blog, I'm reminded of a novel I once read (back in the days when I had time to read novels) called "Passing Time" by the French author Michel Butor.  The novel (originally written in French) took the form of a diary written by a Frenchman who was temporarily living in England.  In my somewhat dim memory of this book, the central character started the diary several months after he arrived, and was trying to write entries for all the stuff that had happened in the previous several months, but stuff from his current life kept intruding, and he increasingly found it impossible to keep up with both the past and the present simultaneously.  This blog is suffering from the same problem.  In the Butor book, weird things happened to time, as befits the post-modernist style of the novel.  I recommend it, though it is not exactly a page-turning thriller (in contrast to this blog, of course).

Scaling: What's all the fuss about power laws?

The Exploring Complexity blog is back, after a long hiatus.  Life keeps getting in the way of blogging.  Hopefully this time I can write more regularly.

To this end, I'm putting the discussion of the rest of Steve Frank's paper temporarily on hold.

Today's post is the start of a discussion of the paper "Power laws, Pareto distributions, and Zipf's law", by Mark Newman (Contemporary Physics, 46, 323-351 (2005)).  I'm going to separate this into a few different posts, so I have some hope of actually posting something.

Mark's paper is a great, extremely clear review of some ideas related to power laws.  Mark is a terrific writer.  He's also Professor of Physics and Complex Systems at the University of Michigan, and External Professor at the Santa Fe Institute.  Here's his picture:
 

Mark is also the author of this recent textbook on networks (Oxford University Press, 2010):

http://www-personal.umich.edu/~mejn/networks-an-introduction/cover-s.jpg

An interesting (unrelated) fact about Mark, given that this year is the 100th anniversary of Alan Turing's birth: Mark's grandfather, Max Newman, while a mathematics lecturer at Cambridge, introduced Turing to the "Entscheidungsproblem" (or "decision problem"), which inspired the invention of the "Turing machine", which arguably gave rise to the invention of programmable computers.

Anyway, back to the paper at hand.  

Power Laws Versus Normal Distributions

Just for reference, here are pictures of a Gaussian or "Normal" distribution (left/top) and a power-law distribution (right/bottom):


Recall that a "distribution" plots some quantity (e.g., SAT scores) on the x-axis versus the observed frequency or probability of those scores on the y-axis.  Sometimes probabilities (rather than raw frequencies) are plotted on the y-axis, so the sum of all values is equal to 1.

There are some interesting differences to note between the Gaussian and the power law distributions.  First, the Gaussian is symmetrically peaked around a small range of "typical" values, the middle of which happens to be the mean of the distribution.  Also, the distribution falls off to (very close to) zero on either side.  The range of values on the x-axis for which the distribution is appreciably non-zero is called the "scale".  The power-law distribution is peaked at the lowest value on the x-axis, and decreases for higher values.  It falls off more slowly than the Gaussian distribution, resulting in a so-called "long tail" or "heavy tail".  It doesn't have an obvious small range of "typical" values in the way that the Gaussian does.


In terms of probabilities, it's clear that for the Gaussian, "extreme events" (e.g. very low or very high SAT scores) are quite low in probability compared to values near the average of the distribution.  But in the power law distribution, such extreme events are more probable than in the Gaussian distribution, due to the long tail.  This is one of the more important implications of power-law distributions in the real world.  As McKelvey and Andriani point out: "The lesson we can draw...is that extreme events, which in a Gaussian world could be safely ignored, are not only more common than expected but also of vastly larger magnitude and far more consequential." [1]
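To get a feel for how different the two tails are, here is a hedged numerical sketch comparing P(X > t) for a standard Normal distribution and for a continuous power law. The parameter choices (mean 0 and standard deviation 1 for the Normal; alpha = 2.5 and x_0 = 1 for the power law) are arbitrary illustrations, not values from any data set, and the two distributions are not matched to each other in any way.

```python
import math

def normal_tail(t, mean=0.0, sd=1.0):
    """P(X > t) for a Normal distribution, via the complementary error function."""
    return 0.5 * math.erfc((t - mean) / (sd * math.sqrt(2.0)))

def power_law_tail(t, alpha=2.5, x0=1.0):
    """P(X > t) for p(x) = C * x**(-alpha) on x >= x0 (requires alpha > 1)."""
    return 1.0 if t <= x0 else (t / x0) ** (-(alpha - 1.0))

for t in (2, 5, 10, 20):
    print(f"t={t:>2}  normal: {normal_tail(t):.3e}   power law: {power_law_tail(t):.3e}")
```

The Normal tail shrinks faster than exponentially as t grows, while the power-law tail only shrinks like a power of t, which is why "extreme events" remain plausible under the latter.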

Examples of Power Laws

In his paper, Mark Newman gives a long list of examples of (purported) power law distributions in natural and technological systems, including:
  • Word frequency in natural language (the most frequent words are vastly more frequent than the least frequent words)
  • Citations of scientific papers (There are a small number of  papers with a huge number of citations and a very large number of papers with no (or very few) citations)
  • Magnitudes of earthquakes (Very small earthquakes are common; very large earthquakes are rare)
  • Intensities of wars
  • Wealth of richest people
  • Populations of cities
and several others.  Mark Newman says: "one can, without stretching the interpretation of the data unreasonably, claim that power-law distributions have been observed in language, demography, commerce, information and computer sciences, geology, physics and astronomy, and this on its own is an extraordinary statement."

 
Power laws, or similar "heavy-tailed" distributions, are found so often in nature and technology that Willinger et al. have called them "more normal than Normal" [2].


Mathematics of Power Laws

Here is the mathematical form of a power law:

$$P(x) = C\,x^{-\alpha}, \qquad x \ge x_0$$

That is, the probability that some quantity (e.g., earthquake size) has value x is equal to a constant (C) times x raised to the power -alpha.  The constant C normalizes the distribution -- i.e., makes all the probabilities sum to 1.  The inequality on the right says that the power-law relationship holds only for x at or above some minimum value x_0.
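For readers who want to see where the constant C comes from, here is a short worked derivation. It assumes x is a continuous quantity and that alpha > 1 (otherwise the integral does not converge); for integer-valued x the integral becomes a sum and the constant takes a different form.

```latex
1 = \int_{x_0}^{\infty} C\,x^{-\alpha}\,dx
  = C \left[ \frac{x^{1-\alpha}}{1-\alpha} \right]_{x_0}^{\infty}
  = \frac{C\,x_0^{1-\alpha}}{\alpha-1}
\quad\Longrightarrow\quad
C = (\alpha-1)\,x_0^{\alpha-1}
```

So, under these assumptions, once alpha and x_0 are fixed there is no freedom left in C.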



Suppose, for example, alpha = 2.  Then we would have

$$P(x) = C\,x^{-2} = \frac{C}{x^{2}}$$

Suppose x_0 = 1.  Then P(x) would be maximum when x = 1, would be 1/4 that value when x=2, 1/9 that value when x=3, etc.  Imagine x represents earthquake size on the Richter scale.  As we would expect, small earthquakes would have the bulk of the probability, whereas any particular large earthquake (e.g., Richter scale 8) would be very unlikely.  The scary thing is that every large earthquake size has some, albeit low, probability, so the total probability that a "big one" will happen is non-negligible.  That is, the power-law distribution makes it almost certain that a big one *will* happen at some point.  If earthquake sizes were Normally distributed, it would be much less likely that a big one would take place.
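The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the idealized alpha = 2, x_0 = 1 case; the tail formula assumes x is continuous, and none of it is a claim about real earthquake statistics. The function names are mine.

```python
from fractions import Fraction

ALPHA, X0 = 2, 1  # the values used in the example above

def relative_probability(x):
    """P(x) / P(X0) for P(x) = C * x**(-ALPHA); the constant C cancels out."""
    return Fraction(X0, x) ** ALPHA

def probability_at_least(x):
    """P(X >= x) for continuous x >= X0, i.e. (x / X0) ** -(ALPHA - 1)."""
    return Fraction(X0, x) ** (ALPHA - 1)

print([relative_probability(x) for x in (1, 2, 3)])
print(probability_at_least(8))
```

In this idealized case, any single large value is unlikely, but the chance of seeing something at least as large as 8 is 1 in 8, which is hardly negligible.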

Replacing "earthquake size" with "extreme financial crises" in the above, we get the example of the great recession of 2008, which, if such things are power-law distributed, was bound to happen.  Evidently a lot of economists thought that such things were Normally distributed.  It seems that they were probably wrong.

People often show power-law graphs on "double logarithmic" or "log-log" plots -- that is, both the x and y axes are on a logarithmic scale rather than on a linear scale.  (E.g., Richter scale readings of 1, 2, 3, ... actually represent 1, 10, 100, etc. times the strength of earthquakes, so are on a base-10 logarithmic scale.)  Let's do a bit of simple algebra, taking the logarithm of both sides of P(x) = C x^(-alpha):

$$\log P(x) = \log\left(C\,x^{-\alpha}\right) = \log C - \alpha \log x$$

Assuming that the vertical axis plots log P(x) and the horizontal axis plots log x (i.e., a "log-log" plot), the right hand side of the above equation gives the expression for a straight line with slope -alpha and intercept log C.  Thus, if you plot a power law on a log-log plot, you will see a straight line.
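Here is a minimal sketch that checks the straight-line claim numerically: it evaluates an exact power law at a few points, takes base-10 logs, and fits a line by ordinary least squares. The values C = 1.5 and alpha = 2.5, the x values, and the helper `fit_line` are all illustrative choices of mine.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

C, ALPHA = 1.5, 2.5                       # illustrative values, not from any data set
x_values = [1, 2, 5, 10, 20, 50, 100]
log_x = [math.log10(x) for x in x_values]
log_p = [math.log10(C * x ** -ALPHA) for x in x_values]

slope, intercept = fit_line(log_x, log_p)
print(slope, intercept, math.log10(C))    # slope should match -ALPHA, intercept log C
```

This works cleanly because the input is an exact power law. On noisy real data, a least-squares fit to a log-log plot can give a misleading exponent, a point Newman's paper also raises, so treat this as a check of the algebra rather than as a recommended estimation method.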

The quantity that matters most for understanding a power-law distribution is the exponent alpha, which tells something about the underlying process creating the power law.

I should mention that power laws don't only describe distributions such as probabilities of earthquake sizes -- they can describe scaling laws as well -- e.g., metabolic rate of an organism scales as mass raised to the 3/4 power (Kleiber's law) -- more on that in future posts.
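As a quick worked example of what such a scaling law implies (writing B for metabolic rate and M for body mass, symbols I am introducing here), consider an organism 16 times heavier than another:

```latex
B \propto M^{3/4}
\quad\Longrightarrow\quad
\frac{B(16M)}{B(M)} = 16^{3/4} = \left(2^{4}\right)^{3/4} = 2^{3} = 8
```

So under this law, a 16-fold increase in mass goes with only an 8-fold increase in metabolic rate.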





Why Care?


Why should we care whether something is a power law (versus some other distribution)?  The form of the distribution can say a lot about the underlying process, which is usually what science is trying to get at. One problem though--many different underlying processes produce power laws.  Mark Newman's paper lists several different possible mechanisms, some of which will be discussed in my next post. 



Upcoming: 

Statistical properties of power-law distributions

What it means for a distribution to be "scale-free"

What do the exponents mean? 

What are Rank-Frequency plots, such as Zipf's law or the Pareto distribution?

What are the mechanisms that might give rise to power-law or other heavy-tail distributions? 

Are we really seeing power laws, or just approximations to power laws, or (in Cosma Shalizi's words) hallucinations of power laws?  Does it matter? 

Will I ever finish this post?

Stay tuned!

References


[1] B. McKelvey and P. Andriani, Why Gaussian statistics are mostly wrong for strategic organization. Strategic Organization, 3(2): 219-228, 2005.

[2] W. Willinger, D. Alderson, J. C. Doyle, and L. Li, More "normal" than normal: scaling distributions and complex systems. In Proceedings of the 2004 Winter Simulation Conference, IEEE Press, Piscataway, NJ, pp. 130-141, 2004. ISBN 0-7803-8786-4




Wednesday, February 22, 2012

Blog on hold till April

Hello all,

I'm still planning to survey various scaling papers here, but due to teaching and other obligations, I have to put this project on hold until April (I'm off from teaching starting end of March).  In the meantime, you might find the following very recent article interesting: Stumpf, M. P. H. and Porter, M. A., Critical truths about power laws, Science, 335, 2012, pp. 665-666.  Here's a link to it: http://www.sciencemag.org/content/335/6069/665.short

This link requires a subscription to Science -- sorry, I couldn't find any non-subscription links to this article yet. 

Friday, December 30, 2011

Top Ten Writing Errors (of My Students) in 2011


Hello all,
I’m back from vacation and preparing my next post to continue the discussion on Steve Frank’s paper. 
In the meantime, in honor of New Year’s top-ten lists, I thought you might be interested in the top ten writing errors made in student papers that I graded this year (and I did a lot of grading!).  Some of these errors are classics in any writing; some are more science-oriented.  I’m sure you-all never would make such elementary mistakes as those in my list (right?) but feel free to point these out to your own students, colleagues, and friends.  It’s a good exercise to do this without sounding too pedantic (an exercise I am about to fail at here, but oh well).
1. it’s  vs. its.   Incorrect use of these was the most common problem I saw.  I wrote a haiku about this:
            It’s existential. 
            ‘It’s’ has an apostrophe
            Only when it is.

2. each other vs. one another.   Use “each other” when there are two things.  Use “one another” when there are more than two things.   No haiku for this one, yet. 
3.  between vs. among.  This is similar to the previous error.   Use “between” only for two things; use “among” for more than two things.
4.  effect vs. affect.   “Effect” is usually a noun, meaning “impact”:  “Steve Jobs had a huge effect on the personal electronics industry”.  It is sometimes a verb, meaning to create something or  to make something happen, e.g., “Can Obama effect a Mideast peace treaty?”.  “Affect” is the verb form of the noun “effect”:  “Steve Jobs significantly affected the personal electronics industry”, meaning he had a large impact on it.  A bigger fan of Steve Jobs might say something like “Steve Jobs singlehandedly effected the personal electronics industry”, meaning, he was singlehandedly the person who made it happen. 
Other, less common meanings:  “affect” (stress on first syllable) means “emotion”.      
All this seems to confuse some students – it needs a haiku.   Any takers?
5.  Don’t use “they” or “their”, or “his” or “her” or “it” or any other pronoun far away from its original reference, or where the reference might be ambiguous.   This error is very common in my students’ writings. “I enjoyed reading about cellular automata and genetic algorithms.  It is an important topic in computer science.”   Or “The author of the paper did not talk about the meaning of the results.  They ranged from high to low.   It might have been due to an error.  They were unexpected and he didn’t say how he got them.”     Help!
6.  Don’t use passive voice.  E.g., “Experiments were performed to demonstrate that…” is worse than “We performed experiments to demonstrate that…”.    In the first, it’s not clear who did the experiments — the author, or someone else?
7. Be aware of the singular versus plural forms of Latinate words.   For example:  optimum is singular, optima is plural.  Automaton is singular, automata is plural.  It’s not correct to say “Only one cellular automata has been proved to be universal”. 
8.  Don’t use overly long, complex sentences, at least in non-fiction writing.  You are free to be another James Joyce or David Foster Wallace in your novel, but it’s not a good idea in a scientific paper. 
9. Avoid “identity-crisis sentences” — my term for sentences such as, “As a student of Shakespeare, Hamlet is my favorite play.”
10.  Use complete sentences.  Please!
11.  (Not exactly a writing error, but a plea to students.) If you include a figure or a table in your paper (or talk), include an informative caption.    If you include a plot, please, please, puhleez, I beg you:  label the axes.  
In the future I am going to include all these as a checklist with every writing assignment I give.  For future work, the list needs “the proper role of the semicolon” and “the difference between an en dash and an em dash”.  I’m sure you can hardly wait!
(Some of my students think I’m too picky about writing and other details.   I’m currently reading the new Steve Jobs biography, and was interested to learn that Jobs once had a huge fight with a graphic designer over where to put the period after his (Jobs') middle initial on his business cards—at the bottom of the “P” or right under the semi-circle at the top?  Jobs fought hard for the latter and finally won the battle.)  
Happy 2012 to all!