Monday, December 29, 2014

Multivariate Medians

I'll bet that in the very first "descriptive statistics" course you ever took, you learned about measures of "central tendency" for samples or populations, and these measures included the median. You no doubt learned that one useful feature of the median is that, unlike the (arithmetic, geometric, harmonic) mean, it is relatively "robust" to outliers in the data.
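A minimal sketch of that robustness, using two made-up samples (the numbers are invented purely for illustration and come from no real data set):

```python
from statistics import mean, median

# Hypothetical samples, invented for illustration only.
clean = [2.0, 3.0, 4.0, 5.0, 6.0]
contaminated = [2.0, 3.0, 4.0, 5.0, 600.0]  # largest value replaced by an outlier

for label, data in (("clean", clean), ("contaminated", contaminated)):
    # The mean is dragged towards the outlier; the median is not.
    print(f"{label:>12}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

Replacing a single observation with a wild value moves the arithmetic mean a long way, while the median stays where it was.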

(You probably weren't told that J. M. Keynes provided the first modern treatment of the relationship between the median and the minimization of the sum of absolute deviations. See Keynes (1911); this paper was based on his thesis work of 1907 and 1908. See this earlier post for more details.)

At some later stage you would have encountered the arithmetic mean again, in the context of multivariate data. Think of the mean vector, for instance.

However, unless you took a statistics course in Multivariate Analysis, most of you probably didn't get to meet the median in a multivariate setting. Did you ever wonder why not?

One reason may have been that while the concept of the mean generalizes very simply from the scalar case to the multivariate case, the same is not true for the humble median. Indeed, there isn't even a single, universally accepted definition of the median for a set of multivariate data!

Let's take a closer look at this.

The key point to note is that the univariate concept of the median relies on our ability to order (or rank) univariate data. In the case of multivariate data, there is no natural ordering of the data points. To develop the concept of the median in this case, we first have to agree on some convention for defining "order".
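To make the ordering problem concrete, here is a small sketch in Python. The five bivariate points and the two ordering conventions (sort by first coordinate, or sort by distance from the origin) are my own illustrative assumptions; neither is a standard definition of a multivariate median.

```python
import math

# Hypothetical bivariate sample, chosen only for illustration.
points = [(1, 9), (2, 1), (3, 8), (4, 2), (5, 0)]

def middle(ordered):
    """Return the middle element of an odd-length, already-ordered list."""
    return ordered[len(ordered) // 2]

# Convention A: order by the first coordinate, breaking ties on the second.
mid_lex = middle(sorted(points))

# Convention B: order by Euclidean distance from the origin.
mid_norm = middle(sorted(points, key=lambda p: math.hypot(*p)))

print("middle point under convention A:", mid_lex)
print("middle point under convention B:", mid_norm)
```

The two conventions pick out different "middle" observations from the same data, and neither has an obviously better claim to being the natural one.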

This gives rise to a host of different multivariate medians, including:
  • The L1 Median
  • The Geometric Median
  • The Vector of Marginal Medians (or coordinate-wise median)
  • The Spatial Median
  • The Oja Median
  • The Liu Median
  • The Tukey Median.
For most of these measures a variety of different numerical algorithms are available. This complicates matters even further: you have to decide on a median definition, and then you have to find an efficient algorithm to compute it. To get an idea of the issues involved, take a look at this interesting paper.
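As a rough illustration of how two of these definitions can disagree, the sketch below computes the vector of marginal medians and a spatial median for a tiny made-up sample. I am taking the spatial median to be the point that minimizes the sum of Euclidean distances to the data, and approximating it with a basic Weiszfeld-type iteration. The function names, the tolerance, the iteration cap and the data are all my own assumptions, and the handling of an iterate that lands on a data point is deliberately simplistic.

```python
import math
from statistics import median

def marginal_median(points):
    """Vector of marginal medians: the median of each coordinate, taken separately."""
    return tuple(median(coord) for coord in zip(*points))

def spatial_median(points, tol=1e-9, max_iter=1000):
    """Approximate the minimizer of summed Euclidean distances (Weiszfeld-type iteration)."""
    dim = len(points[0])
    # Start from the centroid (the mean vector).
    current = tuple(sum(p[k] for p in points) / len(points) for k in range(dim))
    for _ in range(max_iter):
        weights = []
        for p in points:
            d = math.dist(p, current)
            if d < 1e-12:
                # Simplification: the plain update is undefined at a data point, so stop.
                return current
            weights.append(1.0 / d)
        total = sum(weights)
        # Re-weighted average of the data, with weights inversely proportional to distance.
        updated = tuple(
            sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)
        )
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current

# Hypothetical sample: the three vertices of a right-angled triangle.
sample = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
mm = marginal_median(sample)
sm = spatial_median(sample)
print("marginal medians:", mm)
print("spatial median  :", tuple(round(v, 3) for v in sm))
```

For this sample the marginal medians sit at the vertex (0, 0), while the spatial median lies inside the triangle at roughly (2.11, 2.11). Same data, two defensible "medians".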

You can compute multivariate medians in R, using the "med" function. However, for the most part the associated algorithms are limited to two-dimensional data.

If this topic interests you, then a good starting point for further reading is the survey paper by Small (1990).

Finally, it's worth keeping in mind that the median is just one of the "order statistics" associated with a body of data. The issues associated with defining a median in the case of multivariate data apply equally to other order statistics, or functions of the order statistics (such as the "range" of the data).


References

Keynes, J. M., 1911. The principal averages and the laws of error which lead to them. Journal of the Royal Statistical Society, 74, 322–331.

Small, C. G., 1990. A survey of multidimensional medians. International Statistical Review, 58, 263–277.


© 2014, David E. Giles
