Hi Igor,
I'm glad that you like the problem. I am also interested to hear about problems from other fields, and how these fields cope. ... Below, I have provided some more details.
- Lognormality: There are two reasons why this is assumed. One is empirical: I have attached a normal probability plot of ground-motion data from a strong-motion dataset (about 3000 data points). The plot is taken from a lecture, and you can see that the points follow a straight line, but there are some deviations in the tails.
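(If you want to play with that kind of check, here is a minimal Python sketch; the lognormal samples are only a synthetic stand-in for the real data, and scipy's probplot produces the same kind of normal probability plot of the log-values.)

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    # synthetic stand-in for observed peak ground accelerations (in g);
    # a real check would use the recorded data or the model residuals instead
    rng = np.random.default_rng(0)
    pga = rng.lognormal(mean=np.log(0.05), sigma=0.7, size=3000)

    # normal probability plot of the log-values: points on a straight line
    # indicate that log(PGA) is approximately normal, i.e. PGA is lognormal
    stats.probplot(np.log(pga), dist="norm", plot=plt)
    plt.show()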
The second one is more theoretical: The Fourier spectrum of seismic ground motion can be written as F(f) = E(f) x P(f) x S(f), where f is frequency, E(f) is the source component, P(f) the path part and S(f) the site part (in the time domain, the ground motion is the convolution of these parts). This is a multiplication of random variables, which leads to a lognormal distribution.
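As a toy illustration of that mechanism (the gamma factors below are arbitrary stand-ins, not physical source/path/site models): the log of a product is a sum of logs, so the product tends towards a lognormal shape.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 100_000

    # arbitrary positive random factors standing in for E(f), P(f), S(f)
    E = rng.gamma(shape=5.0, scale=1.0, size=n)
    P = rng.gamma(shape=3.0, scale=1.0, size=n)
    S = rng.gamma(shape=4.0, scale=1.0, size=n)
    Y = E * P * S

    # Y itself is strongly right-skewed, but log(Y) is nearly symmetric,
    # which is the hallmark of an (approximately) lognormal variable
    print("skewness of Y:     ", stats.skew(Y))
    print("skewness of log(Y):", stats.skew(np.log(Y)))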
There is one problem with the assumption of lognormality which is widely recognized, but for which no satisfying solution has been proposed: it assigns a nonzero probability to very high, physically impossible ground motions (the tails). This matters especially at very low exceedance rates, where the tails dominate. Critical facilities such as nuclear power plants or waste deposits must be designed to withstand ground motions with very low exceedance rates.
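To put a (made-up) number on the tail issue: with a median of 0.1 g and a log-standard deviation of 0.6, the plain lognormal still assigns a small but nonzero probability to exceeding 3 g. Truncating the distribution at some maximum number of standard deviations is one common ad-hoc fix, but the truncation level is hard to justify.

    import numpy as np
    from scipy import stats

    # made-up example values: median PGA of 0.1 g, sigma of 0.6 in log-space
    mu, sigma = np.log(0.1), 0.6
    pga = 3.0                              # a very high ground-motion level
    z = (np.log(pga) - mu) / sigma         # number of standard deviations

    # untruncated lognormal: nonzero exceedance probability, however high pga is
    p_plain = stats.norm.sf(z)

    # lognormal truncated at mu + 3*sigma: exceedance probability is
    # exactly zero beyond the truncation point
    z_max = 3.0
    p_trunc = 0.0 if z >= z_max else (stats.norm.cdf(z_max) - stats.norm.cdf(z)) / stats.norm.cdf(z_max)

    print(p_plain, p_trunc)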
- Shape for f: This is also based on the Fourier spectrum. The source part is proportional to e^M and the path part to (1/R) e^(-R), so the simplest model becomes:
f(M, R) = a + b M - c Log(R) - d R
This is based on a model that treats an earthquake as a point. For extended faults, where each part of the fault can emit seismic waves, there is interference and so on. Two observations can be made:
1. Magnitude saturation: the larger the magnitude, the smaller the additional increase in ground motion. This is usually modeled either by an M^2 term in f or by a piecewise linear function of magnitude.
2. An interaction between M and R: the larger the magnitude, the slower the decrease of the ground motion Y with distance. This is modeled either as (c + c1 M) Log(R) or as c Log(R + d e^(g M)); a small sketch of both effects follows below.
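Here is that sketch: the simple point-source form next to a variant with magnitude saturation and an M-R interaction term. The coefficient values are invented, only so the code runs, and I use base-10 logs here (the base depends on the model).

    import numpy as np

    def f_point_source(M, R, a=1.0, b=1.2, c=1.5, d=0.005):
        # simple point-source form: f(M, R) = a + b*M - c*Log(R) - d*R
        return a + b * M - c * np.log10(R) - d * R

    def f_extended(M, R, a=1.0, b=1.2, b2=-0.06, c=2.5, c1=-0.2, d=0.005):
        # adds a quadratic magnitude term (saturation) and an interaction term
        # (c + c1*M)*Log(R), so the distance decay gets weaker for larger M
        return a + b * M + b2 * M**2 - (c + c1 * M) * np.log10(R) - d * R

    for M in (5.0, 6.0, 7.0):
        print(M, f_point_source(M, R=30.0), f_extended(M, R=30.0))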
Site effects are usually modeled as discrete variables (the near-surface geology is classified as ROCK, STIFF SOIL or SOFT SOIL), each class with an individual coefficient. There are different ways in which people classify the site conditions, though. In newer studies one also finds Vs30, the average shear-wave velocity in the upper 30 m, used as a predictor variable.
Then there is the style-of-faulting, which describes how the fault ruptures (horizontally or vertically). It is a discrete, three-valued variable.
This leads to the following form, which is the basis of most ground-motion models:
f = a_1 + a_2 M + (a_3 + a_4 M) Log(R) + a_5 R + a_6 SS + a_7 SA + a_8 FN + a_9 FR,
where SS is 1 if the site class is STIFF SOIL and 0 otherwise, SA is 1 if the site class is SOFT SOIL, FN is 1 if the style-of-faulting is normal, and FR is 1 if the style-of-faulting is reverse.
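A direct transcription of that form into Python, with the site class and style-of-faulting expanded into 0/1 dummy variables (the coefficient values are placeholders, not a fitted model):

    import numpy as np

    def ground_motion_model(M, R, site_class, faulting, a):
        # a is the coefficient vector (a_1, ..., a_9); placeholder values only
        SS = 1.0 if site_class == "STIFF SOIL" else 0.0
        SA = 1.0 if site_class == "SOFT SOIL" else 0.0   # ROCK is the baseline
        FN = 1.0 if faulting == "normal" else 0.0
        FR = 1.0 if faulting == "reverse" else 0.0       # the third style is the baseline
        a1, a2, a3, a4, a5, a6, a7, a8, a9 = a
        return (a1 + a2 * M + (a3 + a4 * M) * np.log10(R)
                + a5 * R + a6 * SS + a7 * SA + a8 * FN + a9 * FR)

    a = (1.0, 1.2, -2.5, 0.2, -0.005, 0.1, 0.2, -0.05, 0.08)   # invented numbers
    print(ground_motion_model(6.5, 20.0, "STIFF SOIL", "reverse", a))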
Newer models take into account more variables and effects (fault dip, nonlinear site amplification, whether the earthquake ruptures the surface or not).
- Missing data: This is treated in different ways.
Sometimes it is possible to estimate one variable from a related one. E.g., there exist different magnitude measures (moment magnitude, local magnitude, body-wave magnitude), and there are conversion rules between them. There are also different distance measures, which can be converted as well.
If a station has no shear-wave velocity profile, one can look at the geology. All of this introduces uncertainty, though.
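For the conversion idea, a minimal sketch (the linear coefficients and standard error below are invented; real conversion relations have their own published coefficients): the point is simply that the converted value comes with an extra error term that should be carried along.

    import numpy as np

    def local_to_moment_magnitude(ml, rng=None):
        # hypothetical linear conversion M_w = alpha + beta * M_l with a
        # standard error; all numbers are invented for illustration only
        alpha, beta, se = 0.5, 0.95, 0.2
        mw = alpha + beta * ml
        if rng is not None:
            # sample from the conversion uncertainty instead of using the mean,
            # e.g. inside a simulation or a Bayesian model
            mw = rng.normal(mw, se)
        return mw

    print(local_to_moment_magnitude(5.4))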
What is also sometimes done is first determining the coefficients of magnitude and distance (for which the information is usually complete), and then determining the remaining coefficients using the data that are available.
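A sketch of that two-stage idea with ordinary least squares (synthetic data, and the functional form is the simple one from above): stage 1 fits the magnitude/distance coefficients on all records, stage 2 fits the site dummy on the residuals of the records where the site class is known.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    M = rng.uniform(4.0, 7.5, n)
    R = rng.uniform(5.0, 200.0, n)
    SS = rng.integers(0, 2, n).astype(float)       # stiff-soil dummy
    known = rng.random(n) < 0.6                    # site class known for 60 % of records
    y = 1.0 + 1.2 * M - 1.5 * np.log10(R) - 0.005 * R + 0.3 * SS + rng.normal(0, 0.3, n)

    # stage 1: magnitude/distance coefficients from the complete information
    X1 = np.column_stack([np.ones(n), M, np.log10(R), R])
    beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)

    # stage 2: site coefficient from the residuals of the records with known site class
    resid = y - X1 @ beta1
    X2 = np.column_stack([np.ones(known.sum()), SS[known]])
    beta2, *_ = np.linalg.lstsq(X2, resid[known], rcond=None)

    print(beta1, beta2)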
I have tried to determine the coefficients of a model using Bayesian inference via
OpenBUGS, where I treated the missing data as parameters for which a posterior distribution was determined.
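In case it is useful, here is roughly the same idea written in Python with PyMC instead of OpenBUGS (synthetic data; PyMC treats masked entries of an observed array much like BUGS treats NA values, turning them into extra parameters with their own posterior).

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(3)
    n = 200
    M = rng.uniform(4.0, 7.5, n)
    R = rng.uniform(5.0, 200.0, n)
    lnV = rng.normal(6.0, 0.4, n)                      # synthetic log Vs30
    y = (1.0 + 1.2 * M - 1.5 * np.log(R) - 0.005 * R
         - 0.3 * (lnV - 6.0) + rng.normal(0, 0.5, n))
    lnV[rng.choice(n, 60, replace=False)] = np.nan     # knock out some Vs30 values

    with pm.Model():
        a = pm.Normal("a", 0.0, 10.0)
        b = pm.Normal("b", 0.0, 10.0)
        c = pm.Normal("c", 0.0, 10.0)
        d = pm.Normal("d", 0.0, 10.0)
        e = pm.Normal("e", 0.0, 10.0)
        sd = pm.HalfNormal("sd", 1.0)

        # masked entries become free parameters (like NA in OpenBUGS)
        lnV_full = pm.Normal("lnV", mu=6.0, sigma=0.5,
                             observed=np.ma.masked_invalid(lnV))

        mu_y = a + b * M - c * pm.math.log(R) - d * R + e * (lnV_full - 6.0)
        pm.Normal("y", mu=mu_y, sigma=sd, observed=y)

        idata = pm.sample()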
Cheers,
Nico