In one of our projects, we ran into a dilemma: we had to pin down the difference between the (probability of) co-occurrence of a pair of events and the correlation between that pair of events.
Here is my attempt at disambiguating between the two. Looking forward to any pokes at loopholes in my argument.
Consider two events e1 and e2 that have a temporal signature. For instance, they could be login events of two users on a computer system across time. Let us also assume that time is organized as discrete units of constant duration each (say one hour).
We now want to compare the login behaviour of e1 and e2 over time. We need to find out whether e1 and e2 take place independently or whether they are correlated. Do they tend to occur together (i.e. co-occur), or do they take place independently of one another?
This is where terminology is used loosely and things start getting a bit confusing. So to clear up the confusion, we need to define our terms more precisely.
Co-occurrence is simply the probability that the events e1 and e2 occur together. In other words, this is the joint probability of e1 and e2. Let's represent this as p(e1,e2). The joint probability of a pair of random processes A and B is defined as:
$$p(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
When we are talking about temporally distributed processes like e1 and e2, the intersection is simply the number of times e1 and e2 have occurred in the same time bucket and the union is the total number of times e1 or e2 have occurred (counting the co-occurrences only once).
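As a minimal sketch of the intersection-over-union computation described above, assume each event is represented by the list of time-bucket indices in which it occurred at least once. The function name `cooccurrence` and the sample bucket indices are made up for illustration.

```python
def cooccurrence(buckets_e1, buckets_e2):
    """Ratio of buckets containing both events to buckets containing either.

    Each argument is an iterable of time-bucket indices in which the
    event occurred at least once (a hypothetical representation).
    """
    a, b = set(buckets_e1), set(buckets_e2)
    union = a | b
    if not union:
        # Neither event ever occurred; report 0.0 rather than dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Illustrative data: hours (bucket indices) in which each user logged in.
e1_hours = [0, 1, 2, 5, 7]
e2_hours = [1, 2, 3, 7, 9]
print(cooccurrence(e1_hours, e2_hours))  # 3 shared buckets out of 7 distinct ones
```

Using sets means a bucket is counted once no matter how many times the event fired in it, which matches "counting the co-occurrences only once".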
Another form of measuring relatedness between the events e1 and e2 is to compute their correlation coefficient. Correlation measures the linear relationship between pairs of random variables.
Intuitively, suppose our time units were divided into discrete buckets t1 .. tn. In each time bucket, we have counted the number of times e1 and e2 have occurred. We now take a 2-dimensional plot and for each time bucket, we place a point on the plot, whose x and y values are the number of times e1 and e2 have occurred respectively.
(Here there is an implicit assumption that the resolution of our measurement is the width of the time bucket. That is, if the time bucket is 1 hour, it does not matter when exactly an event occurred in that hour.)
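The bucketing step could look roughly like this, assuming events arrive as non-negative timestamps in seconds and buckets are one hour wide. The function name `bucket_counts` and its parameters are hypothetical. Running it once per event gives the x and y values for the scatter plot.

```python
from collections import Counter


def bucket_counts(timestamps, bucket_seconds=3600, n_buckets=None):
    """Count events per fixed-width time bucket.

    timestamps: non-negative event times in seconds (assumed input format).
    Returns a list whose i-th entry is the number of events in bucket i.
    Where exactly an event falls inside its bucket is deliberately ignored.
    """
    counts = Counter(int(t // bucket_seconds) for t in timestamps)
    if n_buckets is None:
        # Cover every bucket up to the last one that saw an event.
        n_buckets = max(counts, default=-1) + 1
    # Events falling beyond n_buckets are silently left out.
    return [counts.get(i, 0) for i in range(n_buckets)]


# Illustrative login times for one user, in seconds since the start.
e1_times = [10, 20, 3700, 7300, 7400, 7500]
print(bucket_counts(e1_times, n_buckets=4))  # [2, 1, 3, 0]
```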
Given this scatter plot, if we can draw a straight line that passes close to all the points, so that the error between the points and the line is small, we say that the events are (linearly) correlated. The best-fitting line itself is usually found by least squares; a popular measure of how well the points follow such a line is the Pearson correlation coefficient, which ranges from -1 to +1.
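For concreteness, here is a small sketch of the Pearson coefficient computed directly from per-bucket counts, using the standard definition (covariance divided by the product of the standard deviations). The function name `pearson` and the sample counts are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-bucket counts for e1 and e2 over six time buckets.
e1_counts = [0, 2, 1, 4, 3, 0]
e2_counts = [1, 3, 1, 5, 4, 0]
print(pearson(e1_counts, e2_counts))
```

Note that the coefficient is undefined when one event has the same count in every bucket, which is why the sketch raises an error in that case instead of returning a number.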
The correlation coefficient takes into account how similar the occurrence counts are in each bucket. The scatter plot can also reveal specific patterns of correlation that cannot be discerned from the probability of co-occurrence. For instance, suppose that whenever the events e1 and e2 co-occur in a time bucket, they co-occur no more than 5 times, and otherwise they do not co-occur at all. This peculiarity is lost in the co-occurrence computation.
On the other hand, the co-occurrence probability gives an easy way of summarizing pairwise relationships in a large set of events. It is also easier to build generative models and synthetic data sets that reflect co-occurrence probabilities than those that reflect correlations (I think).
There are many other ways to compute relatedness. One other metric worth mentioning here is the mutual information between pairs of events. This is a small but significant addition over computing the joint probability. Intuitively, the mutual information between a pair of random variables A and B is the number of bits of information that observing one of them gives about the other, i.e. how much knowing A reduces the uncertainty about B. Formally:
$$I(A;B) = \sum_{a \in A} \sum_{b \in B} p(a,b) \log_2 \frac{p(a,b)}{p(a)\,p(b)}$$
Here p(a,b) is the joint probability of events in A and B and p(a) and p(b) are the marginal (unconditional) probabilities of events in A and B respectively.
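A rough sketch of the mutual information computation, estimating p(a,b), p(a) and p(b) by relative frequencies over the time buckets and summing p(a,b) * log2(p(a,b) / (p(a) p(b))) over the observed pairs. It assumes each event is reduced to a per-bucket symbol, here 1 if the event occurred in the bucket and 0 otherwise. The function name `mutual_information` and the sample data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length symbol sequences.

    Probabilities are estimated as relative frequencies over the buckets.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(xs, ys))   # counts of (a, b) pairs per bucket
    marg_x = Counter(xs)           # counts of a
    marg_y = Counter(ys)           # counts of b
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marg_x[a] / n
        p_b = marg_y[b] / n
        # Pairs that never occur contribute nothing, so only observed pairs are summed.
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Illustrative occurrence indicators (1 = event seen in that bucket).
e1_seen = [0, 1, 0, 1]
e2_seen = [0, 1, 0, 1]
print(mutual_information(e1_seen, e2_seen))  # 1.0 (e2 is fully determined by e1)
```

With frequency estimates like these, short sequences tend to overstate the mutual information, so the result is best read as a rough indicator unless the number of buckets is large.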