Quick Soil Sampling Data Summary With Python
Geologists often use soil sampling in mineral exploration to gather information about element anomalies in the soil of a certain area. The output of this activity is a database containing coordinates and the various element concentrations measured in the sample taken at each point. Before diving deep into the data and doing any fancy analysis, it is always a good idea to familiarize ourselves with it by looking at the summary statistics and the general and spatial relationships in the data as a whole. With Python we can write a simple script that produces the summary statistics and some useful graphs summarizing the data in just a few minutes.
We will use the pandas, numpy, and matplotlib libraries for this, so let's begin by importing them:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We can use pandas to read our dataset and put it into a variable called data. We can then use the head() function from pandas to look at the first five rows of the data.
data = pd.read_csv('soil_data_example.csv')
data.head()
  | EASTING | NORTHING | Mg | Al | Si | P | S | Cl | Ca | Ti | ... | Cd | Sn | Sb | W | Hg | Pb | Bi | Th | U | LE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 803041 | 9880671 | 0.0 | 0.0 | 53.495 | 86.0 | 686.0 | 149 | 405.0 | 6.244 | ... | 0 | 35 | 0 | 0 | 0 | 31.0 | 0 | 39 | 0 | 844.453 |
1 | 803137 | 9880700 | 0.0 | 0.0 | 53.272 | 84.0 | 660.0 | 0 | 461.0 | 5.452 | ... | 0 | 0 | 0 | 0 | 0 | 66.0 | 0 | 41 | 4 | 813.117 |
2 | 803055 | 9880624 | 0.0 | 0.0 | 57.548 | 154.0 | 621.0 | 0 | 203.0 | 6.865 | ... | 0 | 95 | 0 | 0 | 0 | 41.0 | 0 | 37 | 0 | 831.336 |
3 | 803150 | 9880652 | 0.0 | 0.0 | 68.581 | 101.0 | 684.0 | 0 | 325.0 | 6.396 | ... | 0 | 45 | 0 | 0 | 0 | 58.0 | 0 | 38 | 0 | 834.693 |
4 | 803070 | 9880576 | 0.0 | 0.0 | 73.638 | 184.0 | 758.0 | 0 | 251.0 | 7.636 | ... | 26 | 126 | 0 | 0 | 0 | 68.0 | 0 | 31 | 0 | 819.252 |
5 rows × 37 columns
Note: for this example, we use data with dummy coordinates.
Summary statistics
By using pandas, getting summary statistics is as simple as this:
pd.set_option('display.max_columns', None)
data.iloc[:,2:].describe()
  | Mg | Al | Si | P | S | Cl | Ca | Ti | V | Cr | Mn | Fe | Co | Ni | Cu | Zn | As | Se | Rb | Sr | Y | Zr | Nb | Mo | Ag | Cd | Sn | Sb | W | Hg | Pb | Bi | Th | U | LE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 | 483.000000 |
mean | 0.361609 | 37.306886 | 79.901859 | 164.188741 | 635.832406 | 108.314700 | 447.012607 | 4.826319 | 35.511387 | 67.778468 | 75.228035 | 65.516644 | 256.832961 | 14.291925 | 22.339545 | 50.455487 | 43.819876 | 1.204969 | 90.469979 | 19.761905 | 30.440994 | 322.449275 | 25.619048 | 1.917184 | 90.842650 | 5.383023 | 14.844720 | 5.236025 | 0.409938 | 0.300207 | 49.049996 | 0.267081 | 31.828157 | 0.389234 | 799.849393 |
std | 2.524420 | 19.170464 | 21.018874 | 105.011094 | 116.399578 | 166.519336 | 364.993895 | 1.125096 | 27.704977 | 72.420758 | 132.400090 | 48.937698 | 266.687782 | 16.964439 | 21.153868 | 42.358326 | 47.032683 | 1.866225 | 53.418809 | 14.534807 | 7.541458 | 108.510507 | 6.765470 | 5.917603 | 78.306505 | 10.428192 | 67.141786 | 22.061828 | 2.077758 | 1.339993 | 89.163245 | 2.859267 | 13.579391 | 1.446719 | 40.073827 |
min | 0.000000 | 0.000000 | 38.589000 | 0.000000 | 1.004000 | 0.000000 | 1.000000 | 2.137000 | 0.000000 | 0.000000 | 0.000000 | 4.597000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 0.000000 | 0.000000 | 11.000000 | 7.000000 | 13.000000 | 104.000000 | 9.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 631.444000 |
25% | 0.000000 | 40.015500 | 63.242000 | 96.500000 | 565.500000 | 0.000000 | 1.376500 | 3.935000 | 19.000000 | 0.000000 | 0.000000 | 24.906500 | 0.000000 | 0.000000 | 8.000000 | 22.000000 | 16.000000 | 0.000000 | 50.500000 | 12.000000 | 26.000000 | 247.500000 | 21.000000 | 0.000000 | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21.000000 | 0.000000 | 22.000000 | 0.000000 | 776.857500 |
50% | 0.000000 | 44.624000 | 77.059000 | 154.000000 | 637.000000 | 0.000000 | 536.000000 | 4.613000 | 37.000000 | 42.000000 | 32.000000 | 47.867000 | 159.000000 | 12.000000 | 18.000000 | 37.000000 | 31.000000 | 0.000000 | 76.000000 | 16.000000 | 30.000000 | 313.000000 | 25.000000 | 0.000000 | 65.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 28.000000 | 0.000000 | 30.000000 | 0.000000 | 811.330000 |
75% | 0.000000 | 48.510000 | 95.729000 | 251.000000 | 702.000000 | 191.500000 | 774.000000 | 5.567500 | 50.000000 | 108.000000 | 75.500000 | 101.228000 | 483.000000 | 22.000000 | 30.000000 | 68.000000 | 54.000000 | 2.000000 | 118.000000 | 22.000000 | 34.000000 | 385.500000 | 29.000000 | 0.000000 | 147.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 39.000000 | 0.000000 | 42.000000 | 0.000000 | 829.934000 |
max | 22.825000 | 77.831000 | 150.081000 | 465.000000 | 970.000000 | 830.000000 | 999.000000 | 9.502000 | 337.000000 | 385.000000 | 896.000000 | 243.907000 | 989.000000 | 232.000000 | 145.000000 | 490.000000 | 455.000000 | 12.000000 | 309.000000 | 149.000000 | 59.000000 | 735.000000 | 45.000000 | 96.000000 | 351.000000 | 55.000000 | 827.000000 | 223.000000 | 22.000000 | 12.000000 | 867.000000 | 45.000000 | 66.000000 | 13.000000 | 865.822000 |
For practicality, pandas by default will not show the full range of rows and columns of our dataset. The first line of the code tells pandas to show the full range of columns. In the second line, we use iloc[:, 2:] to skip the first two columns of our dataset, because we don't need summary statistics of the x and y coordinates (unless for some reason you do).
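If you want to keep this table for reporting, you could also write it to a file. A minimal sketch, assuming you want a CSV next to the raw data (the soil_summary_stats.csv name is just an example):
# Save the same summary table to a CSV file for later use.
# The output file name is arbitrary; change it to suit your project.
summary = data.iloc[:, 2:].describe()
summary.to_csv('soil_summary_stats.csv')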
Histograms
Next, we will draw a histogram of each element's concentration in the area. Because element concentrations in soil tend to span orders of magnitude, we will plot the log-transformed concentration values in the histograms. Since we are dealing with a lot of 0's, we will add a very small value to all the concentration data to make the logarithmic transformation easier. There are various ways to deal with zero values in a dataset, but for this example, let's just add 0.0001 to all concentration data.
data.iloc[:,2:] = np.log(data.iloc[:,2:]+0.0001)
np.log is the natural logarithm (base e); if we want a logarithmic transformation with base 10 we can use np.log10. Now if we call data.head() we will see that the concentration of each element has already been transformed into its logarithmic value.
Note: np.log will produce -inf values (and warnings) for any zeros left in the dataset, which is why we add the small offset first. There are various ways to deal with 0 and NaN values in a dataset, and it is up to you which method to use for your case; one alternative is sketched below. You can also skip this line if you don't want any logarithmic transformation of your dataset.
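As a sketch of one such alternative (not the method used above), you could replace the zeros in each raw concentration column with half of that column's smallest non-zero value, a common stand-in for half the detection limit, before taking the log. The raw and grade_alt names here are just illustrative:
# A sketch of one alternative to the fixed 0.0001 offset, applied to the raw
# (untransformed) concentration columns: replace zeros in each column with
# half of that column's smallest non-zero value before taking the log.
raw = pd.read_csv('soil_data_example.csv')
grade_alt = raw.iloc[:, 2:].copy()
for col in grade_alt.columns:
    nonzero_min = grade_alt.loc[grade_alt[col] > 0, col].min()
    grade_alt[col] = grade_alt[col].replace(0, nonzero_min / 2)
grade_alt = np.log(grade_alt)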
data.head()
  | EASTING | NORTHING | Mg | Al | Si | P | S | Cl | Ca | Ti | V | Cr | Mn | Fe | Co | Ni | Cu | Zn | As | Se | Rb | Sr | Y | Zr | Nb | Mo | Ag | Cd | Sn | Sb | W | Hg | Pb | Bi | Th | U | LE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 803041 | 9880671 | -9.21034 | -9.21034 | 3.979590 | 4.454348 | 6.530878 | 5.003947 | 6.003887 | 1.831637 | 3.135499 | 3.761202 | -9.210340 | 3.880823 | 4.543296 | 2.302595 | 2.302595 | 4.219509 | 4.060445 | -9.21034 | 3.044527 | 2.484915 | 3.555351 | 6.146329 | 3.367299 | -9.21034 | 3.713575 | -9.21034 | 3.555351 | -9.21034 | -9.21034 | -9.21034 | 3.433990 | -9.21034 | 3.663564 | -9.210340 | 6.738689 |
1 | 803137 | 9880700 | -9.21034 | -9.21034 | 3.975413 | 4.430818 | 6.492240 | -9.210340 | 6.133398 | 1.696001 | 3.295841 | 4.812185 | 6.284134 | 4.346828 | 5.752573 | 2.772595 | -9.210340 | 4.700481 | 4.709531 | -9.21034 | 3.761202 | 2.833219 | 3.663564 | 5.993962 | 3.367299 | -9.21034 | 4.174389 | -9.21034 | -9.210340 | -9.21034 | -9.21034 | -9.21034 | 4.189656 | -9.21034 | 3.713575 | 1.386319 | 6.700875 |
2 | 803055 | 9880624 | -9.21034 | -9.21034 | 4.052621 | 5.036953 | 6.431331 | -9.210340 | 5.313206 | 1.926451 | 2.995737 | 3.496511 | 3.465739 | 3.970464 | 5.003947 | 3.091047 | 3.091047 | 4.234108 | 4.174389 | -9.21034 | 2.772595 | 2.564957 | 3.610921 | 6.208590 | 3.496511 | -9.21034 | 3.931828 | -9.21034 | 4.553878 | -9.21034 | -9.21034 | -9.21034 | 3.713575 | -9.21034 | 3.610921 | -9.210340 | 6.723034 |
3 | 803150 | 9880652 | -9.21034 | -9.21034 | 4.228017 | 4.615122 | 6.527958 | -9.210340 | 5.783825 | 1.855688 | 3.178058 | -9.210340 | 3.295841 | 3.667786 | -9.210340 | 2.890377 | -9.210340 | 4.043053 | 4.143136 | -9.21034 | 3.295841 | 2.833219 | 3.610921 | 6.122493 | 3.367299 | -9.21034 | 3.555351 | -9.21034 | 3.806665 | -9.21034 | -9.21034 | -9.21034 | 4.060445 | -9.21034 | 3.637589 | -9.210340 | 6.727064 |
4 | 803070 | 9880576 | -9.21034 | -9.21034 | 4.299163 | 5.214936 | 6.630684 | -9.210340 | 5.525453 | 2.032887 | 3.806665 | 4.248497 | -9.210340 | 3.742731 | 4.025353 | 2.890377 | 2.197236 | 4.189656 | 3.988986 | -9.21034 | 2.890377 | 2.708057 | 3.688882 | 6.272877 | 3.555351 | -9.21034 | 4.174389 | 3.25810 | 4.836283 | -9.21034 | -9.21034 | -9.21034 | 4.219509 | -9.21034 | 3.433990 | -9.210340 | 6.708392 |
Next, we will use matplotlib to define our figure and axes, and then loop over each element in our dataset to plot its histogram.
fig, ax = plt.subplots(nrows=7, ncols=5, figsize=(20,8), sharey=True)
plt.subplots_adjust(top=3)
row = 0
col = 0
for element in data.iloc[:, 2:].columns:
    x = list(data[element])
    ax[row, col].hist(x, bins=35)
    ax[row, col].set_title("log " + element)
    if col < 4:
        col += 1
    else:
        col = 0
        row += 1
plt.show()

The code above can be separated into two sections. The first section defines the figure and the grid of axes for our plot. Since we are dealing with 35 variables, we divide our subplots into 7 rows and 5 columns.
We can check how many element columns we have with:
'''
it is getting annoying to keep typing data.iloc[:,2:]
to get our elements so let's just put it into a
variable called grade
'''
grade = data.iloc[:,2:]
len(grade.columns)
35
The second section of the code defines the loop that plots each histogram into the predefined subplots.
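As a side note, the manual row and column counters can be avoided by flattening the axes array and pairing each axis with a column; a sketch that should produce the same grid of histograms:
# Equivalent sketch without manual row/col counters: flatten the 7x5 axes
# grid and pair each axis with one element column.
fig, axes = plt.subplots(nrows=7, ncols=5, figsize=(20, 8), sharey=True)
plt.subplots_adjust(top=3)
for axis, element in zip(axes.flatten(), grade.columns):
    axis.hist(grade[element], bins=35)
    axis.set_title("log " + element)
plt.show()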
Correlation Matrix
Our next figure will be a correlation matrix. This is a very useful and neat graph that tells us the correlation coefficient of each variable with every other variable in the dataset. The correlation coefficient ranges from -1 to 1, where:
- -1 = perfect negative correlation
- 0 = no correlation
- 1 = perfect positive correlation.
Pandas comes in handy with its corr() function to calculate the correlation matrix of our dataset. Combine it with matplotlib features such as matshow() to draw the matrix as a figure, the cmap argument to map the values to a colormap, and colorbar() to draw the legend, and we get this informative graph.
cor_matrix = grade.corr()
fig = plt.figure(figsize=(20, 18))
ax = fig.add_subplot(111)
cax = ax.matshow(cor_matrix,cmap='coolwarm', vmin=-1, vmax=1)
ticks = np.arange(0,len(grade.columns),1)
fig.colorbar(cax)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
plt.title('Correlation Matrix')
ax.set_yticks(ticks)
ax.set_xticklabels(grade.columns)
ax.set_yticklabels(grade.columns)
plt.show()
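Beyond the visual check, it can be handy to list the strongest element pairs numerically. A small sketch using the same cor_matrix (the key argument of sort_values needs a reasonably recent pandas version):
# List the most strongly correlated element pairs from cor_matrix,
# keeping only the upper triangle to skip the diagonal and mirrored pairs.
pairs = cor_matrix.where(np.triu(np.ones(cor_matrix.shape, dtype=bool), k=1))
pairs = pairs.stack().sort_values(key=abs, ascending=False)
print(pairs.head(10))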

Map plots
With the same concept as the histogram plot, we can make quick and dirty map plots of our sample points to get a glimpse of the spatial distribution of each element. For this purpose, we use the scatter() function with our UTM coordinates as its x and y values and map the concentration value of each element to a colormap via the cmap argument. We do this inside a for loop so it iterates over each element in our grade data.
fig, ax = plt.subplots(figsize=(20,50), nrows=12, ncols=3)
rows=0
cols=0
for element in grade:
    cax = ax[rows, cols].scatter(data['EASTING'], data['NORTHING'], c=grade[element], cmap='coolwarm', s=10, alpha=0.9)
    ax[rows, cols].set_xticklabels('')
    ax[rows, cols].set_yticklabels('')
    ax[rows, cols].set_title(element)
    ax[rows, cols].axis('equal')
    fig.colorbar(cax, ax=ax[rows, cols])
    if cols < 2:
        cols += 1
    else:
        cols = 0
        rows += 1
ax[11,2].set_axis_off()
plt.show()
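Because element concentrations are often heavily skewed, a few extreme samples can dominate a panel's colour scale. One optional tweak is to clip the colour limits to, say, the 5th and 95th percentiles; a sketch for a single element (Cu is just an example here):
# Optional sketch: clip the colour scale of one map to the 5th-95th percentile
# range so a few extreme samples don't wash out the spatial pattern.
vmin, vmax = np.percentile(grade['Cu'], [5, 95])
fig, axis = plt.subplots(figsize=(6, 6))
sc = axis.scatter(data['EASTING'], data['NORTHING'], c=grade['Cu'],
                  cmap='coolwarm', s=10, vmin=vmin, vmax=vmax)
axis.axis('equal')
fig.colorbar(sc, ax=axis)
axis.set_title('log Cu')
plt.show()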

And that's it: your soil sampling data summary in just a few minutes. You can download the full code from my GitHub page and reuse it for your next project. Hopefully it saves you some trouble so you can focus more on your real geology work. Have fun exploring!