Declustering¶
We focus on the declustering technique widely used in geostatistics and we apply this method to a data set from environmental industry relying on synthetic arsenic. Declustering tool allows to better approximate the distribution of a spatial variable with irregular and biaised sampling.
In order to simplify the example, we consider a 3D data points and we set the $z$ coordinate to $0$. We apply both the Moving window method and the Cell declustering algorithm from C. Deutsch - 1989 Journal Article Link.
Dataset¶
import pandas as pd
import numpy as np
import geolime as geo
from matplotlib import pyplot as plt
%matplotlib inline
As we can see in the dataset below, in order to perform a 2D declustering, with consider a $z=0$ coordinate. The methods can be applied with a non-zero coordinate as well.
df = geo.datasets.load("synthetic_arsenic/synthetic_arsenic.csv")
var = 'arsenic'
df.head()
x | y | z | arsenic | |
---|---|---|---|---|
0 | 273.041475 | 779.883382 | 0 | 23.533643 |
1 | 351.382489 | 791.545189 | 0 | 22.402858 |
2 | 358.294931 | 706.997085 | 0 | 21.052205 |
3 | 197.004608 | 849.854227 | 0 | 20.723272 |
4 | 330.645161 | 794.460641 | 0 | 18.726256 |
Below we build a drillhole from the data.
collar = pd.DataFrame(columns=["HOLEID", "X_COLLAR", "Y_COLLAR", "Z_COLLAR"])
assay = pd.DataFrame(columns=["HOLEID", "FROM", "TO"])
collar['X_COLLAR'] = df['x'].values
collar['Y_COLLAR'] = df['y'].values
collar['Z_COLLAR'] = df['z'].values
collar['Z_COLLAR'] = collar['Z_COLLAR'].astype(np.float64)
assay[var] = df[var].values
assay['FROM'] = 0.
assay['TO'] = 1.
collar['HOLEID'] = np.arange(0,len(df))
assay['HOLEID'] = np.arange(0,len(df))
collar['HOLEID'] = collar['HOLEID'].astype(str)
assay['HOLEID'] = assay['HOLEID'].astype(str)
drillholes = geo.create_drillholes('ArsenicData', collar, assay)
2023-10-13 13:49:51,883 [INFO] GeoLime Project - |GEOLIME|.drillholes_api.py : Following mapping has been identified: { "HOLEID": "HOLEID", "X_COLLAR": "X_COLLAR", "Y_COLLAR": "Y_COLLAR", "Z_COLLAR": "Z_COLLAR", "FROM": "FROM", "TO": "TO" }
We visualize the data points in the figure below, we observe that the data are not well spatialy distributed. More precisely, there are two clusters of high arsenic concentration (red) in the north-west and south-east.
coords = drillholes.coords()
x = coords[:, 0]
y = coords[:, 1]
z = coords[:, 2]
val_var = drillholes.data("arsenic")
# Plot data points
fig = plt.figure(figsize=(20,10))
widths = [1, 2]
heights = [1]
spec = fig.add_gridspec(
ncols=2,
nrows=1,
width_ratios=widths,
height_ratios=heights
)
ax1 = fig.add_subplot(spec[0, 0], projection='3d')
ax1.scatter(x, y, z, s=100,marker='o')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('z')
ax1.set_title('Data points in 3D', fontsize=20)
ax2 = fig.add_subplot(spec[0, 1])
sc = ax2.scatter(x,y, s=200, c=val_var, marker='o', cmap=plt.cm.jet)
cbar = plt.colorbar(sc)
cbar.set_label('Arsenic concentration', fontsize=20)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Data points in 2D', fontsize=20)
plt.show()
We observe some clusters, so that the declustered weight would help on approximating the variables with more accuracy. In order to decluster these data, we will perform two differents declustering methods and compare our results.
Below we set the anisotropy coefficients to 1.
rad_x = 1.
rad_y = 1.
rad_z = 1.
Moving window¶
The Moving window method is based on a volume centered in each point $(x_i, y_i, z_i)$ with $i\in \{1,...,n\}$ where $n$ is the number of data points. The volume $V_i$ can be an ellipsoid as well as a paralleliped. The method consists in counting the number of point (also called neighboors) $n_i$ in the volume $V_i$ centered in point $(x_i, y_i, z_i)$.
The weight $w_i$ of point $(x_i, y_i, z_i)$ is given by \begin{equation} w_i = \dfrac{\overline{n_i}}{n_i} \end{equation} where $\overline{n_i}$ is the mean of the $n_i$. Then the weights are normalized so that the mean $\overline{w_i}$ is equal to $1$.
Declustered weights computation¶
The Moving window method is applied on our dataframe in the isotropic case. The size of the ellipsoid is set to $100$ since the well adapted cell size is around $100$ (see 3.2).
cell_size = 100
diam_x = rad_x * cell_size
diam_y = rad_y * cell_size
diam_z = rad_z * cell_size
declus_weight_mw = geo.moving_window_declus(
obj=drillholes,
obj_region=None,
obj_attribute='arsenic',
diam_x=diam_x,
diam_y=diam_y,
diam_z=diam_z,
geometry='BALL'
)
The Moving window method applies a normalization whichs provides weights with a mean equal to 1.
print("Weight mean: ", declus_weight_mw.mean())
Weight mean: 1.0
Visualization of declustered variable¶
In order to visualize the weights, we plot the declustered arsenic values and the naive arsenic values (with weights equal to 1). Since locations with higher concentrations of arsenic were oversampled, we expect the declustered statistics to reduce the number of high arsenic samples.
# weights plot: naive vs declustered values
naive_w = np.ones(len(drillholes.data(var)))
plt.figure(figsize=(16,6))
plt.hist(
[drillholes.data(var), drillholes.data(var)],
bins = 20,
density=True,
weights=[declus_weight_mw, naive_w],
label=['Declustered arsenic', 'Naive arsenic']
)
plt.legend(fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.xlabel('Arsenic concentration', fontsize=16)
plt.title('Declustered arsenic - Moving window', fontsize=18)
plt.show()
Indeed, we can check that lower arsenic concentration are much more weighted, whereas the higher arsenic concentration are less weighted than in the naive case. Moreover, the Moving window method provides weighted arsenic which amounts up to 55 .
Cell declustering¶
The Cell declustering is one of the most common declustering algorithm, designed by C. Deutsch in 1988. The method adopts a predefined grid covering all the data points to assign weights. The weights are computed based on the number of points in each cell of the grid. The methods consists in considering several cell size $h$ of the grid, and several origins $o$. For each grid, we count the number of points present in each cell.
The weight $w_i^{h,o}$ of point $(x_i, y_i, z_i)$ is given by $\dfrac{1}{n_i}$ where $n_i$ is the number of points present in the cell containing $(x_i, y_i, z_i)$.
In practice, we can define the constants $c_\min$ and $c_\max$ standing for the minimum and maximum cell size. The best adapted cell size is the one which minimizing or maximizing the declusterd mean of the variable of interest. The declustered mean $\bar{v}$ is computed with the weight $w^h$ - mean of all the weight on the origins offsets - given by \begin{equation} \bar{v} = \dfrac{\sum_{i=1}^n v_i w_i }{\sum_{i=1}^n w_i}, \end{equation} where $v_i$ is the value of the variable $v$ in point $(x_i, y_i, z_i)$.
In this example, with set $c_\min$ to $1$ and $c_\max$ to $750$.
The whole process, iterating on origins offsets, is repeated over all the cell sizes. The weight are averaged over origins and finally normalized on cell sizes. The normalization is so that the mean of the weight is 1.
Declustered weights computation¶
cell_size = 150
size_x = cell_size
size_y = cell_size
size_z = cell_size
# nboff = nb of origins
nb_off = 25
declus_weight_cd = geo.cell_declustering(
obj=drillholes,
obj_region=None,
obj_attribute=var,
size_x=size_x,
size_y=size_y,
size_z=size_z,
nb_off=nb_off
)
We check that the Moving window method applies a normalization provinding weights with a mean equals to 1.
print("Weights mean: ", declus_weight_cd.mean())
Weights mean: 1.0
Visualization of declustered variable¶
Below we represent the declustered arsenic compared to the naive arsenic (with weights equal to $1$).
plt.figure(figsize=(16,6))
plt.hist(
[drillholes.data(var), drillholes.data(var)],
bins = 20,
density=True,
weights=[declus_weight_cd, naive_w],
label=['Declustered arsenic', 'Naive arsenic']
)
plt.legend(fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.xlabel('Arsenic concentration', fontsize=16)
plt.title('Declustered arsenic - Cell Declustering', fontsize=18)
plt.show()
As for the Moving window method, we observe that the lower concentration of arsenic are more weighted than the higer concentration arsenic (more than $15$). Furthermore, we can see that weighted arsenic is up to $17.5$ whereas it reaches $55$ for the Moving window method. This difference can be explained by the difference in weights normalization.
Declustered mean for optimal cell size¶
We vary the cell size and compute the mean of the variable of interest in order to evaluate the cell size which provides the minimum declustered mean.
In the figure below we plot the declustered variable mean with respect to the cell size. Indeed, the optimal weights are those whose achieve the minimum cell size. We observe that the minimum declustered mean is reached for cell sizes included in $[120, 200]$.
cmin = 0
cmax = 750
inc = 4
nb_off = 25
x_cs_cd = np.arange(cmin, cmax, i