Declustering¶
We focus on the declustering technique widely used in geostatistics and we apply this method to a data set from environmental industry relying on synthetic arsenic. Declustering tool allows to better approximate the distribution of a spatial variable with irregular and biaised sampling.
In order to simplify the example, we consider a 3D data points and we set the $z$ coordinate to $0$. We apply both the Moving window method and the Cell declustering algorithm from C. Deutsch - 1989 Journal Article Link.
Dataset¶
import pandas as pd
import numpy as np
import geolime as geo
from matplotlib import pyplot as plt
%matplotlib inline
As we can see in the dataset below, in order to perform a 2D declustering, with consider a $z=0$ coordinate. The methods can be applied with a non-zero coordinate as well.
df = geo.datasets.load("synthetic_arsenic/synthetic_arsenic.csv")
var = 'arsenic'
df.head()
x | y | z | arsenic | |
---|---|---|---|---|
0 | 273.041475 | 779.883382 | 0 | 23.533643 |
1 | 351.382489 | 791.545189 | 0 | 22.402858 |
2 | 358.294931 | 706.997085 | 0 | 21.052205 |
3 | 197.004608 | 849.854227 | 0 | 20.723272 |
4 | 330.645161 | 794.460641 | 0 | 18.726256 |
Below we build a drillhole from the data.
collar = pd.DataFrame(columns=["HOLEID", "X_COLLAR", "Y_COLLAR", "Z_COLLAR"])
assay = pd.DataFrame(columns=["HOLEID", "FROM", "TO"])
collar['X_COLLAR'] = df['x'].values
collar['Y_COLLAR'] = df['y'].values
collar['Z_COLLAR'] = df['z'].values
collar['Z_COLLAR'] = collar['Z_COLLAR'].astype(np.float64)
assay[var] = df[var].values
assay['FROM'] = 0.
assay['TO'] = 1.
collar['HOLEID'] = np.arange(0,len(df))
assay['HOLEID'] = np.arange(0,len(df))
collar['HOLEID'] = collar['HOLEID'].astype(str)
assay['HOLEID'] = assay['HOLEID'].astype(str)
drillholes = geo.create_drillholes('ArsenicData', collar, assay)
2023-10-13 13:49:51,883 [INFO] GeoLime Project - |GEOLIME|.drillholes_api.py : Following mapping has been identified: { "HOLEID": "HOLEID", "X_COLLAR": "X_COLLAR", "Y_COLLAR": "Y_COLLAR", "Z_COLLAR": "Z_COLLAR", "FROM": "FROM", "TO": "TO" }
We visualize the data points in the figure below, we observe that the data are not well spatialy distributed. More precisely, there are two clusters of high arsenic concentration (red) in the north-west and south-east.
coords = drillholes.coords()
x = coords[:, 0]
y = coords[:, 1]
z = coords[:, 2]
val_var = drillholes.data("arsenic")
# Plot data points
fig = plt.figure(figsize=(20,10))
widths = [1, 2]
heights = [1]
spec = fig.add_gridspec(
ncols=2,
nrows=1,
width_ratios=widths,
height_ratios=heights
)
ax1 = fig.add_subplot(spec[0, 0], projection='3d')
ax1.scatter(x, y, z, s=100,marker='o')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('z')
ax1.set_title('Data points in 3D', fontsize=20)
ax2 = fig.add_subplot(spec[0, 1])
sc = ax2.scatter(x,y, s=200, c=val_var, marker='o', cmap=plt.cm.jet)
cbar = plt.colorbar(sc)
cbar.set_label('Arsenic concentration', fontsize=20)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Data points in 2D', fontsize=20)
plt.show()
We observe some clusters, so that the declustered weight would help on approximating the variables with more accuracy. In order to decluster these data, we will perform two differents declustering methods and compare our results.
Below we set the anisotropy coefficients to 1.
rad_x = 1.
rad_y = 1.
rad_z = 1.
Moving window¶
The Moving window method is based on a volume centered in each point $(x_i, y_i, z_i)$ with $i\in \{1,...,n\}$ where $n$ is the number of data points. The volume $V_i$ can be an ellipsoid as well as a paralleliped. The method consists in counting the number of point (also called neighboors) $n_i$ in the volume $V_i$ centered in point $(x_i, y_i, z_i)$.
The weight $w_i$ of point $(x_i, y_i, z_i)$ is given by \begin{equation} w_i = \dfrac{\overline{n_i}}{n_i} \end{equation} where $\overline{n_i}$ is the mean of the $n_i$. Then the weights are normalized so that the mean $\overline{w_i}$ is equal to $1$.
Declustered weights computation¶
The Moving window method is applied on our dataframe in the isotropic case. The size of the ellipsoid is set to $100$ since the well adapted cell size is around $100$ (see 3.2).
cell_size = 100
diam_x = rad_x * cell_size
diam_y = rad_y * cell_size
diam_z = rad_z * cell_size
declus_weight_mw = geo.moving_window_declus(
obj=drillholes,
obj_region=None,
obj_attribute='arsenic',
diam_x=diam_x,
diam_y=diam_y,
diam_z=diam_z,
geometry='BALL'
)
The Moving window method applies a normalization whichs provides weights with a mean equal to 1.
print("Weight mean: ", declus_weight_mw.mean())
Weight mean: 1.0
Visualization of declustered variable¶
In order to visualize the weights, we plot the declustered arsenic values and the naive arsenic values (with weights equal to 1). Since locations with higher concentrations of arsenic were oversampled, we expect the declustered statistics to reduce the number of high arsenic samples.
# weights plot: naive vs declustered values
naive_w = np.ones(len(drillholes.data(var)))
plt.figure(figsize=(16,6))
plt.hist(
[drillholes.data(var), drillholes.data(var)],
bins = 20,
density=True,
weights=[declus_weight_mw, naive_w],
label=['Declustered arsenic', 'Naive arsenic']
)
plt.legend(fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.xlabel('Arsenic concentration', fontsize=16)
plt.title('Declustered arsenic - Moving window', fontsize=18)
plt.show()
Indeed, we can check that lower arsenic concentration are much more weighted, whereas the higher arsenic concentration are less weighted than in the naive case. Moreover, the Moving window method provides weighted arsenic which amounts up to 55 .
Cell declustering¶
The Cell declustering is one of the most common declustering algorithm, designed by C. Deutsch in 1988. The method adopts a predefined grid covering all the data points to assign weights. The weights are computed based on the number of points in each cell of the grid. The methods consists in considering several cell size $h$ of the grid, and several origins $o$. For each grid, we count the number of points present in each cell.
The weight $w_i^{h,o}$ of point $(x_i, y_i, z_i)$ is given by $\dfrac{1}{n_i}$ where $n_i$ is the number of points present in the cell containing $(x_i, y_i, z_i)$.
In practice, we can define the constants $c_\min$ and $c_\max$ standing for the minimum and maximum cell size. The best adapted cell size is the one which minimizing or maximizing the declusterd mean of the variable of interest. The declustered mean $\bar{v}$ is computed with the weight $w^h$ - mean of all the weight on the origins offsets - given by \begin{equation} \bar{v} = \dfrac{\sum_{i=1}^n v_i w_i }{\sum_{i=1}^n w_i}, \end{equation} where $v_i$ is the value of the variable $v$ in point $(x_i, y_i, z_i)$.
In this example, with set $c_\min$ to $1$ and $c_\max$ to $750$.
The whole process, iterating on origins offsets, is repeated over all the cell sizes. The weight are averaged over origins and finally normalized on cell sizes. The normalization is so that the mean of the weight is 1.
Declustered weights computation¶
cell_size = 150
size_x = cell_size
size_y = cell_size
size_z = cell_size
# nboff = nb of origins
nb_off = 25
declus_weight_cd = geo.cell_declustering(
obj=drillholes,
obj_region=None,
obj_attribute=var,
size_x=size_x,
size_y=size_y,
size_z=size_z,
nb_off=nb_off
)
We check that the Moving window method applies a normalization provinding weights with a mean equals to 1.
print("Weights mean: ", declus_weight_cd.mean())
Weights mean: 1.0
Visualization of declustered variable¶
Below we represent the declustered arsenic compared to the naive arsenic (with weights equal to $1$).
plt.figure(figsize=(16,6))
plt.hist(
[drillholes.data(var), drillholes.data(var)],
bins = 20,
density=True,
weights=[declus_weight_cd, naive_w],
label=['Declustered arsenic', 'Naive arsenic']
)
plt.legend(fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.xlabel('Arsenic concentration', fontsize=16)
plt.title('Declustered arsenic - Cell Declustering', fontsize=18)
plt.show()
As for the Moving window method, we observe that the lower concentration of arsenic are more weighted than the higer concentration arsenic (more than $15$). Furthermore, we can see that weighted arsenic is up to $17.5$ whereas it reaches $55$ for the Moving window method. This difference can be explained by the difference in weights normalization.
Declustered mean for optimal cell size¶
We vary the cell size and compute the mean of the variable of interest in order to evaluate the cell size which provides the minimum declustered mean.
In the figure below we plot the declustered variable mean with respect to the cell size. Indeed, the optimal weights are those whose achieve the minimum cell size. We observe that the minimum declustered mean is reached for cell sizes included in $[120, 200]$.
cmin = 0
cmax = 750
inc = 4
nb_off = 25
x_cs_cd = np.arange(cmin, cmax, inc)
naive_mean = np.mean(drillholes.data(var))
var_mean_cd = [None] * len(x_cs_cd)
i = 0
for s in x_cs_cd[1:]:
declus_weight_cd = geo.cell_declustering(drillholes, None, var, s, s, s, nb_off)
var_mean_cd[i] = np.dot(declus_weight_cd, drillholes.data(var)) / np.sum(declus_weight_cd)
i += 1
var_mean_cd[0] = naive_mean
plt.figure()
plt.plot(x_cs_cd, var_mean_cd, linewidth=4, label="declustered mean")
plt.plot(x_cs_cd, np.full((len(x_cs_cd)),naive_mean), linewidth=4, label="naive mean")
plt.title("Declustered mean - Cell declustering", fontsize=18)
plt.xlabel("Cell size", fontsize=16)
plt.ylabel("{} mean".format(var), fontsize=16)
plt.legend(loc="upper right", fontsize=16)
plt.show()
We see in the figure above that the minimum arsenic mean is obtained for a cell size around 120.
Methods comparison¶
Declustered variable¶
In what follows, we perform a comparison of the Moving window method and the Cell declustering algorithm by plotting the declustered values of arsenic.
fig, ax = plt.subplots(figsize=(50,20))
ax.hist(
drillholes.data(var),
rwidth = 0.25,
weights = declus_weight_mw,
color = 'blue',
density=True,
bins = np.arange(0, drillholes.data(var).max(), 1),
label = f'Declustered {var} - Moving Window'
)
ax.hist(
drillholes.data(var) + 0.25 ,
rwidth = 0.25,
weights = declus_weight_cd,
color = 'orange',
density=True,
bins = np.arange(0 + 0.25, drillholes.data(var).max() + 0.25, 1),
label = f'Declustered {var} - Cell Declustering'
)
ax.hist(
drillholes.data(var) + 0.5,
rwidth = 0.25,
color = 'green',
density=True,
bins = np.arange(0 + 0.5, drillholes.data(var).max() + 0.5, 1),
label = f'Naive {var}'
)
ax.set_xlabel("Arsenic concentration", fontsize = 50.0)
ax.set_ylabel("Frequency", fontsize = 50.0)
plt.xticks(fontsize= 50 )
plt.yticks(fontsize= 50 )
plt.title('Weights comparison', fontsize = 50.0)
ax.legend(fontsize=50)
plt.show()
We observe that the Moving window method provides higher weights, while the Cell declustering yields much less high weights. This can be related to the different normalizations in each method.
Declustered means¶
In order to compare the declustered mean of the two methods, we plot the declustered mean for the Moving window method as in 3.3) for the Cell declustering. To do so, we consider the size of the moving window $V_i$ as the ellipsoid radii, and we vary the radii in each direction $x$, $y$, $z$ of same amount - as we consider the isotropic case.
Moreover, we perform computations with a Moving window defined by an ellipsoid, as well as a parallelepiped. The parallelepiped geometry allows to approach the Cell declustering procedure which is performed with cubes in 3D.
x_cs = np.arange(10,760,10)
var_mean_cd = [None]*len(x_cs)
naive_mean = np.mean(drillholes.data(var))
declus_weight_mw_ellipse = [None]*len(x_cs)
var_mean_ellipse = [None]*len(x_cs)
declus_weight_mw_parallelepiped = [None]*len(x_cs)
var_mean_parallelepiped = [None]*len(x_cs)
i = 0
for s in x_cs[1:]:
declus_weight_cd = geo.cell_declustering(drillholes, None, var, s, s, s, nb_off)
var_mean_cd[i] = np.dot(declus_weight_cd, drillholes.data(var)) / np.sum(declus_weight_cd)
i += 1
var_mean_cd[0] = naive_mean
i = 0
for s in x_cs:
declus_weight_mw_ellipse[i] = geo.moving_window_declus(drillholes, None, var, s, s, s, 'BALL')
var_mean_ellipse[i] = np.dot(declus_weight_mw_ellipse[i], drillholes.data(var))/np.sum(declus_weight_mw_ellipse[i])
declus_weight_mw_parallelepiped[i] = geo.moving_window_declus(drillholes, None, var, s, s, s, 'PARALLELEPIPED')
var_mean_parallelepiped[i] = np.dot(declus_weight_mw_parallelepiped[i], drillholes.data(var))/np.sum(declus_weight_mw_parallelepiped[i])
i += 1
var_mean_cd[0] = naive_mean
var_mean_ellipse[0] = naive_mean
var_mean_parallelepiped[0] = naive_mean
plt.figure(figsize=(12,8))
plt.plot(x_cs, var_mean_ellipse, linewidth=4, label="declustered mean (ellipse)", color="#2ca02c")
plt.plot(x_cs, var_mean_parallelepiped, linewidth=4, label="declustered mean (parallelepiped)", color="#e377c2")
plt.plot(x_cs, var_mean_cd, linewidth=4, label="declustered mean (Cell declustering)", color="#1f77b4")
plt.plot(x_cs, np.full((len(x_cs)), naive_mean), linewidth=4, label="naive mean", color="#ff7f0e")
plt.title("Declustered mean - Moving window", fontsize=18)
plt.xlabel("Cell size", fontsize=16)
plt.ylabel("Arsenic mean", fontsize=16)
plt.legend(loc="lower right", fontsize=16)
plt.show()
We observe that results from ellipsoid geometry an parallelepiped geometry are similar. The results of the Cell declustering method are quite different but we get an overall same trend. Furthermore, we observe that the minimum cell size is included in $[60,120]$ for the Moving window method and in $[120,200]$ for the Cell declustering method.
Declustered data points¶
We represent the declustering weights for the two methods below. We can check that the clusters are lower weighted and isolated points are more weighted.
coords = drillholes.coords()
x = coords[:, 0]
y = coords[:, 1]
z = coords[:, 2]
# Plot data points
fig = plt.figure(figsize=(20,10))
widths = [1, 1]
heights = [1]
spec = fig.add_gridspec(ncols=2, nrows=1, width_ratios=widths,
height_ratios=heights)
ax1 = fig.add_subplot(spec[0, 0])
sc = ax1.scatter(x,y, s=200, c=declus_weight_mw, marker='o', cmap=plt.cm.jet)
cbar = plt.colorbar(sc)
cbar.set_label('Declustered weight', fontsize=20)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Declustering Moving window', fontsize=20)
ax2 = fig.add_subplot(spec[0, 1])
sc = ax2.scatter(x,y, s=200, c=declus_weight_cd, marker='o', cmap=plt.cm.jet)
cbar = plt.colorbar(sc)
cbar.set_label('Declustered weight', fontsize=20)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Declustering Cell declustering', fontsize=20)
plt.show()
Conclusion¶
We have implemented the Cell declustering and the Moving window method for declustering data points with clusters. We have computed the weights for a wide range of cell size in order to determine which cell size yields a minimum declustered mean and obtain an optimal cell size around 100 for each method.
We have compared the Moving window results for both a ball geometry and a parallelepiped geometry and have obtained an overall same trend. Moreover, our results show that the Cell declustering provides weights lower than the Moving window. This can be explained by borders effects and overlapping cells.
The two methods can be applied similarly with an appropriate cell size.