For illustration purpose here a graph showing the seismicity of Germany ( a fact I was largely unaware of)
I am writing to you because I saw your blogpost with a call for help/bleg. I am not sure if this is what you had in mind, but I might have something to contribute.
I am a seismologist working on ground motion models for probabilistic seismic hazard analysis (PSHA). In PSHA, we try to estimate the exceedance rate of a destructive ground motion level for a particular site (e.g. a nuclear power plant). It basically comes down to the following:
- What is the probability that an earthquake with magnitude M occurs in a distance R from the site in a particular time period.
- Given M and R, what is the probability that a certain ground motion level will be exceeded.
- Integrate over all magnitude and distance combinations.
I am working on part 2. Here, we want to estimate the conditional probability of the ground motion parameter Y given magnitude and distance. Y can be something like peak ground acceleration. Usually, a lognormal distribution is assumed, and we get something like this:
log Y = f(M,R)+epsilon
The parameters of f are estimated from large strong motion datasets, which consist of recordings of ground motion from large earthquakes at seismic stations.
Now comes the part where you are probably interested in. The estimation of f is not easy. There is physical knowledge about the relationships between Y and M and R, but there are still unexplained effects. For example, physical models make the assumption that an earthquake is a point source, but large earthquakes can have fault dimensions of up 10s or 100s of kilometers. This has to be taken into account. There are also other variables (site effects, rupture directivity effects and others) which influence ground motion. Not for all of them exist physical knowledge about the exact relation.
Another problem is missing data. Site effects are usually quantified by the shear wave profile under the station, but this is not always available. Similar, there is missing data for other variables as well.
There is also the problem that the data is not independent. One normally has many recordings from one earthquake at different stations, which are not independent. Similarly, you can have on station recording several earthquakes.
As I said, I am not sure if this is what you had in mind as problems, but if you are interested, I can provide you with more details....
back to Nicolas' problem which is not centered on just Germany, I then responded with the following:
Thanks for the e-mail. I like your problem. ... I am curious on how: - the lognormality of Y was assumed
- is there a predetermined shape for f
- how was this shape found how ?
- how do you account for missing data ?
Nicolas then responded with:
Any other detail is also interesting as well.
I'm glad that you like the problem. I am also interested to hear about problems from other fields, and how these fields cope. ... Below, I have provided some more details.
- Lognormality: There are two reasons why this is assumed. One is data. I have attached a normal probability plot of ground-motion data from a strong motion dataset (about 3000 datapoints). The plot is taken from some lecture, and you can see that the points follow a straight line, but there are some deviations in the tails.
The second one is more theoretical: The Fourier spectrum of seismic ground motion can be written as F(f) = E(f)xP(f)xS(f), where f is frequency and E(f) is the source component, P(f) is the path and S(f) is the site part (so F(f) is a convolution of these parts). This is a multiplication of random variables, which leads to a lognormal distribution.
There is one problem with the assumption of lognormality which is widely recognized, but no satisfying solution has been proposed: It assigns a nonzero probability for very high, physically impossible ground motions (the tails). This becomes especially important for very low exceedance rates, where these tails become important. Critical facilities such as nuclear power plants or waste deposits must be designed to withstand ground motions with very low exceedance rates.
* shape for f: This is also based on the Fourier spectrum. The source part is proportional to e^M, the path part is proportional to 1/R e^-R, so the most simple model becomes:
f(M,R) = a +bM-cLog(R)-dR
This is based on a model that treats an earthquake as a point. For extended faults, where each part of the fault can emit seismic waves, there is interference and so on. Two observations can be made:
1. a magnitude saturation, where the larger the magnitude, the less the difference in ground motion. This is modeled usually either by a m^2 term in f or by a piecewise linear function for the magnitude.
2. There is an interaction term between M and R (the larger the magnitude, the less the decrease of Y with distance). This is modeled either as (c+c1M)Log(R) or by cLog(R+de^(gM)).
Site effects are usually modeled as discrete variables (the near surface underground is classified as ROCK, STIFF SOIL, SOFT SOIL), each with an individual coefficient. There are different ways how people classify the site conditions, though. In newer studies, one finds also the use of Vs30, which is the average shear wave velocity in the upper 30m, as a predictor variable.
Then, there is the style-of-faulting, which measures how the fault ruptures (horizontally or vertically). It is a discrete, three valued variable.
This leads to this form, which forms the basis of most ground-motion models:
f = a_1+a_2M+(a_3+a_4M) Log (R)+a_5R+a_6 SS+a_7 SA+a_8 FN+a_9FR,
where SS is 1 if the site class is STIFF SOIL and 0 otherwise, SA is one of the site class is SOFT SOIL, and FN is 1 if the style-of-faulting is normal, FR is 1 if the style-of-faulting is reverse.
Newer models take into account more variables and effects (fault dip, nonlinear site amplification, whether the earthquake ruptures the surface or not).
* missing data: This is treated differently
Sometimes, it can be possible to estimate one variable from a related one. E.g., there exist different magnitude measures (moment magnitude, local magnitude, body wave magnitude), on there exist conversion rules between them. There exist also different distance measures, which can be converted.
If a station has no shear wave velocity profile, one can look at the geology. This all introduces uncertainty, though.
What is also sometimes done is first determining the coefficients of magnitude and distance (for which information is usually complete), and later determine the remaining coefficients using the data that is available.
I have tried to determine the coefficients of a model using Bayesian inference via OpenBUGS, where I treated the missing data as parameters for which a posterior distribution was determined.
I am leaving the problematic as is for the time being. In many respects, the type of issues mentioned in Nicolas' problem are very reminiscent of a whole slew of problems found in many different science and engineering fields and subfields. One more thing, Nicolas also has a presentation and a poster that talks about some part of his implementation in A Bayesian Ground Motion Model for Estimating the Covariance Structure of Ground Motion Intensity Parameters and A hierarchical Global Ground Motion Model to Take Into Account Regional Differences.