Quick Soil Sampling Data Summary With Python

Geologists often use soil sampling methods in mineral exploration to gather information about element anomalies in the soil of a certain area. The output of this activity is a database containing coordinates and various element concentrations from the sample taken at that point. Before diving deep into the data and doing any fancy analysis, it is always a good idea to familiarize ourselves with the data by looking into the summary statistics and the general and spatial relationship of the data as a whole. By using Python we can write a simple code to get the summary statistics and some useful graphs to summarize the data in just a few minutes.

We will use pandas, numpy, and matplotib libraries for this code, so let’s begin with importing our library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We can use pandas to read our dataset and put it into a variable called data. We can then use head() function from pandas to look at the first 5 rows of the data.

data = pd.read_csv('soil_data_example.csv')
data.head()
EASTING NORTHING Mg Al Si P S Cl Ca Ti ... Cd Sn Sb W Hg Pb Bi Th U LE
0 803041 9880671 0.0 0.0 53.495 86.0 686.0 149 405.0 6.244 ... 0 35 0 0 0 31.0 0 39 0 844.453
1 803137 9880700 0.0 0.0 53.272 84.0 660.0 0 461.0 5.452 ... 0 0 0 0 0 66.0 0 41 4 813.117
2 803055 9880624 0.0 0.0 57.548 154.0 621.0 0 203.0 6.865 ... 0 95 0 0 0 41.0 0 37 0 831.336
3 803150 9880652 0.0 0.0 68.581 101.0 684.0 0 325.0 6.396 ... 0 45 0 0 0 58.0 0 38 0 834.693
4 803070 9880576 0.0 0.0 73.638 184.0 758.0 0 251.0 7.636 ... 26 126 0 0 0 68.0 0 31 0 819.252

5 rows × 37 columns

Note: for this example, we use data with dummy coordinates.

Summary statistics

By using pandas, getting summary statistics is as simple as this:

pd.set_option('display.max_columns', None)
data.iloc[:,2:].describe()
Mg Al Si P S Cl Ca Ti V Cr Mn Fe Co Ni Cu Zn As Se Rb Sr Y Zr Nb Mo Ag Cd Sn Sb W Hg Pb Bi Th U LE
count 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000 483.000000
mean 0.361609 37.306886 79.901859 164.188741 635.832406 108.314700 447.012607 4.826319 35.511387 67.778468 75.228035 65.516644 256.832961 14.291925 22.339545 50.455487 43.819876 1.204969 90.469979 19.761905 30.440994 322.449275 25.619048 1.917184 90.842650 5.383023 14.844720 5.236025 0.409938 0.300207 49.049996 0.267081 31.828157 0.389234 799.849393
std 2.524420 19.170464 21.018874 105.011094 116.399578 166.519336 364.993895 1.125096 27.704977 72.420758 132.400090 48.937698 266.687782 16.964439 21.153868 42.358326 47.032683 1.866225 53.418809 14.534807 7.541458 108.510507 6.765470 5.917603 78.306505 10.428192 67.141786 22.061828 2.077758 1.339993 89.163245 2.859267 13.579391 1.446719 40.073827
min 0.000000 0.000000 38.589000 0.000000 1.004000 0.000000 1.000000 2.137000 0.000000 0.000000 0.000000 4.597000 0.000000 0.000000 0.000000 7.000000 0.000000 0.000000 11.000000 7.000000 13.000000 104.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 631.444000
25% 0.000000 40.015500 63.242000 96.500000 565.500000 0.000000 1.376500 3.935000 19.000000 0.000000 0.000000 24.906500 0.000000 0.000000 8.000000 22.000000 16.000000 0.000000 50.500000 12.000000 26.000000 247.500000 21.000000 0.000000 27.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21.000000 0.000000 22.000000 0.000000 776.857500
50% 0.000000 44.624000 77.059000 154.000000 637.000000 0.000000 536.000000 4.613000 37.000000 42.000000 32.000000 47.867000 159.000000 12.000000 18.000000 37.000000 31.000000 0.000000 76.000000 16.000000 30.000000 313.000000 25.000000 0.000000 65.000000 0.000000 0.000000 0.000000 0.000000 0.000000 28.000000 0.000000 30.000000 0.000000 811.330000
75% 0.000000 48.510000 95.729000 251.000000 702.000000 191.500000 774.000000 5.567500 50.000000 108.000000 75.500000 101.228000 483.000000 22.000000 30.000000 68.000000 54.000000 2.000000 118.000000 22.000000 34.000000 385.500000 29.000000 0.000000 147.500000 0.000000 0.000000 0.000000 0.000000 0.000000 39.000000 0.000000 42.000000 0.000000 829.934000
max 22.825000 77.831000 150.081000 465.000000 970.000000 830.000000 999.000000 9.502000 337.000000 385.000000 896.000000 243.907000 989.000000 232.000000 145.000000 490.000000 455.000000 12.000000 309.000000 149.000000 59.000000 735.000000 45.000000 96.000000 351.000000 55.000000 827.000000 223.000000 22.000000 12.000000 867.000000 45.000000 66.000000 13.000000 865.822000

For practicality, pandas by default will not show the full range of rows and columns of our dataset. The first line of the code tells pandas to show the full range of the columns. In the second line, we use iloc[:, 2:] attribute to ignore the first two columns of our dataset because we don’t need summary statistics of the x and y coordinate (unless for some reason you do).

Histograms

Next, we will draw the histogram of each element’s concentration in the area. Because of the logarithmic nature of element concentration in the soil, we will plot the log concentration value into the histogram. Since we are dealing with a lot of 0’s, we will add a very small value to all the concentration data to make it easier with the logarithmic transformation. There are various ways to deal with zero values in the dataset, but for this example, let’s just add 0.0001 to all concentration data.

data.iloc[:,2:] = np.log(data.iloc[:,2:]+0.0001)

np.log is a natural logarithm with base e if we want logarithm transformation with base 10 we can use np.log10. Now if we call data.head() we will see that that the concentration of each element already transformed into its logarithmic value.

Note: code above will return an error if there are still zero values in the datasets. There are various ways to deals with 0 and NaN value in our dataset and it was up to you which method you will use for your case. You can also ignore this line if you don’t want to do any logarithmic transformation for your dataset

data.head()
EASTING NORTHING Mg Al Si P S Cl Ca Ti V Cr Mn Fe Co Ni Cu Zn As Se Rb Sr Y Zr Nb Mo Ag Cd Sn Sb W Hg Pb Bi Th U LE
0 803041 9880671 -9.21034 -9.21034 3.979590 4.454348 6.530878 5.003947 6.003887 1.831637 3.135499 3.761202 -9.210340 3.880823 4.543296 2.302595 2.302595 4.219509 4.060445 -9.21034 3.044527 2.484915 3.555351 6.146329 3.367299 -9.21034 3.713575 -9.21034 3.555351 -9.21034 -9.21034 -9.21034 3.433990 -9.21034 3.663564 -9.210340 6.738689
1 803137 9880700 -9.21034 -9.21034 3.975413 4.430818 6.492240 -9.210340 6.133398 1.696001 3.295841 4.812185 6.284134 4.346828 5.752573 2.772595 -9.210340 4.700481 4.709531 -9.21034 3.761202 2.833219 3.663564 5.993962 3.367299 -9.21034 4.174389 -9.21034 -9.210340 -9.21034 -9.21034 -9.21034 4.189656 -9.21034 3.713575 1.386319 6.700875
2 803055 9880624 -9.21034 -9.21034 4.052621 5.036953 6.431331 -9.210340 5.313206 1.926451 2.995737 3.496511 3.465739 3.970464 5.003947 3.091047 3.091047 4.234108 4.174389 -9.21034 2.772595 2.564957 3.610921 6.208590 3.496511 -9.21034 3.931828 -9.21034 4.553878 -9.21034 -9.21034 -9.21034 3.713575 -9.21034 3.610921 -9.210340 6.723034
3 803150 9880652 -9.21034 -9.21034 4.228017 4.615122 6.527958 -9.210340 5.783825 1.855688 3.178058 -9.210340 3.295841 3.667786 -9.210340 2.890377 -9.210340 4.043053 4.143136 -9.21034 3.295841 2.833219 3.610921 6.122493 3.367299 -9.21034 3.555351 -9.21034 3.806665 -9.21034 -9.21034 -9.21034 4.060445 -9.21034 3.637589 -9.210340 6.727064
4 803070 9880576 -9.21034 -9.21034 4.299163 5.214936 6.630684 -9.210340 5.525453 2.032887 3.806665 4.248497 -9.210340 3.742731 4.025353 2.890377 2.197236 4.189656 3.988986 -9.21034 2.890377 2.708057 3.688882 6.272877 3.555351 -9.21034 4.174389 3.25810 4.836283 -9.21034 -9.21034 -9.21034 4.219509 -9.21034 3.433990 -9.210340 6.708392

Next, we will use matplotlib to define our canvas and axes and then loop into each element in our dataset to plot each histogram.

fig, ax = plt.subplots(nrows=7, ncols=5, figsize=(20,8), sharey=True)
plt.subplots_adjust(top=3)


row = 0
col = 0
for element in data.iloc[:,2:].columns:
    x=list(data[element])
    ax[row,col].hist(x, bins=35)
    ax[row,col].set_title("log " + element)
    if col<4:
        col+=1
    else:
        col=0
        row+=1
        
plt.show()
Image 1. Elements concentration (log ppm)

The code above can be separated into two sections. The first section of the code will define the figure and axis canvas for our plot. Since we are dealing with 35 variables we will divide our subplots into 7 rows and 5 columns.

We can check the length of our variable with:

'''
it is getting annoying to keep typing data.iloc[:,2:] 
to get our elements so let's just put it into a
variable called grade
'''
grade = data.iloc[:,2:]
    
len(grade.columns)
35

The second section of the code defines the loop behavior that will plot each histogram into predefined subplots.

Correlation Matrix

Our next figure will be a correlation matrix. This is a very useful and neat graph that will tell us the correlation coefficient of one variable with another variable within the dataset. The correlation matrix value ranges from -1 to 1, where:

  • -1 = perfect negative correlation
  • 0 = no correlation
  • 1 = perfect positive correlation.

Pandas come in handy with its corr() function to calculate a correlation matrix of our dataset. Combine it with matplotlib functions such as: matshow() to draw the matrix into a figure; cmap() to denote it with colormap; and colorbar() to draw the legend, and we can get this informative graph.

cor_matrix = grade.corr()
fig = plt.figure(figsize=(20, 18))
ax = fig.add_subplot(111)
cax = ax.matshow(cor_matrix,cmap='coolwarm', vmin=-1, vmax=1)
ticks = np.arange(0,len(grade.columns),1)

fig.colorbar(cax)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
plt.title('Correlation Matrix')
ax.set_yticks(ticks)
ax.set_xticklabels(grade.columns)
ax.set_yticklabels(grade.columns)

plt.show()
Image 2. Elements concentration correlation matrix

Map plots

With the same concept as the histogram plot, we can make quick and dirty map plots of our points to get a glimpse of the spatial correlation of each element. For this purpose, we use scatter() function with our UTM coordinates as its x and y value and denote the concentration value of each element to a color map using cmap(). We do this using for loop so it can loop to each element in our grade data.

fig, ax = plt.subplots(figsize=(20,50), nrows=12, ncols=3)

rows=0
cols=0 
for element in grade:
    cax = ax[rows,cols].scatter(data['EASTING'], data['NORTHING'], c=grade[element], cmap='coolwarm', s=10, alpha=0.9)
    ax[rows,cols].set_xticklabels('')
    ax[rows,cols].set_yticklabels('')
    ax[rows,cols].set_title(element)
    ax[rows,cols].axis('equal')
    fig.colorbar(cax, ax=ax[rows,cols])
    if cols<2:
        cols+=1
    else:
        cols = 0
        rows += 1
ax[11,2].set_axis_off()

plt.show()
Image 3. Elements concentration map

And that’s it. Your soil sampling data summary in just about a minute. You can download the full code on my GitHub page and reuse it for your next project. Hope it can save you the trouble and you can be more focused on your real geologist work. Have fun exploring!




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Earthquake Playthrough 01
  • Using Raster Calculator in QGIS GUI and Python Console to Make SR Blocks
  • Exploratory Data Analysis (EDA) for Hyperspectral Imagery
  • Different Distance Metrics Example in Spectral Unmixing
  • Bayesian Spectral Unmixing