Accelerate NetCDF-4 read access

For simulating the (global) terrestrial biosphere I need time series of climate data, which is mostly provided as a 3-dimensional NetCDF file with the dimensions longitude, latitude and time. If everything is in a single file, it is often several GB large, and extracting a time series at one position can take quite some time, since by default NetCDF files are optimized for extracting time slices. In the following I show how I speed up reading from a large NetCDF file along the time axis.
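A common way to do this is to rewrite the file with chunks that are long along the time dimension and small in space, e.g. with the netCDF utility nccopy. This is only a sketch with example file names and chunk sizes, not necessarily the exact procedure of the full post:

# rewrite as netCDF-4 classic model (-k 4) with light compression (-d 1) and
# chunks that favor reading complete time series at single grid points;
# dimension names and chunk sizes depend on the file at hand
nccopy -k 4 -d 1 -c time/1024,latitude/16,longitude/16 climate_in.nc climate_rechunked.nc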


Global top soil sand content

Global soil data for vegetation modelling?

Quite some time ago I uploaded some scripts to GitHub which import global soil data into Grass-GIS. In the following I demonstrate the usage of the script with the Harmonized World Soil Database (HWSD) as an example. At the end I briefly discuss some of the data.

The soil data is stored in an MS Access database linked to a raster file. These datasets are freely available on the internet: DSMW and HWSD through the geonetwork of the FAO, and ISRIC-WISE on their webpage.

Grass-GIS is a very powerful tool. I use it mainly to convert geographical data, for calculations and for maps in scientific journals; e.g. the nitrogen and carbon fixation maps in Elbert et al. (2012) were processed and created in Grass-GIS.

My script stores the data in a SQLite database file linked to a Grass-GIS vector map. The individual layers can finally be exported to various other formats, for example as input to dynamic global vegetation models.

To use the script you need Grass-GIS with a lat-lon location and 30 arc seconds resolution. Here is the output of the Grass-GIS command “g.region -p” once the location is created:

projection: 3 (Latitude-Longitude)
zone: 0
datum: wgs84
ellipsoid: wgs84
north: 90N
south: 90S
west: 180W
east: 180E
nsres: 0:00:30
ewres: 0:00:30
rows: 21600
cols: 43200
cells: 933120000
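
If a lat-lon location already exists but its region does not match, it can be set to the extent and resolution above, e.g. like this (a sketch):

# set the region to the global extent at 30 arc seconds and print it
g.region n=90N s=90S w=180W e=180E res=0:00:30 -p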

Another prerequisite is the Linux package “mdb-tools”, with the commands “mdb-export” and “mdb-schema” to handle the Access database, and the Linux package “sqlite3”. The following commands are executed in the above Grass-GIS environment. First some environment variables are set. The variable HWSDDIR must point to the directory where the HWSD data was downloaded to. A temporary directory, defined by the variable TMPDIR, is created for temporary files.

rastBasemap="HWSD_basemap"
vectBasemap="HWSD_basemap"
GISDB=`g.gisenv GISDBASE`/`g.gisenv LOCATION_NAME`/`g.gisenv MAPSET`/db.sqlite
HWSDDIR=/data/external/global/Soil/HWSD
TMPDIR=`g.gisenv GISDBASE`/`g.gisenv LOCATION_NAME`/tmp.$$

if [ -d $TMPDIR ]; then
  echo "Temporary \"$TMPDIR\"folder exists!"
  echo "Delete it first"
  exit 1
else
  mkdir $TMPDIR
fi

db.connect driver=sqlite database=${GISDB}

Then the underlying raster map is imported and converted to a vector map and connected to the database. The “v.clean” command cleans up the topology and speeds up further processing substantially.

r.in.gdal -o --overwrite input=${HWSDDIR}/hwsd.bil output=${rastBasemap}
r.to.vect --overwrite input=${rastBasemap} output=${vectBasemap}_tmp feature=area
sqlite3 ${GISDB} "ALTER TABLE ${vectBasemap}_tmp ADD area REAL"
v.to.db option=area type=centroid units=k map=${vectBasemap}_tmp column=area

v.clean --overwrite input=${vectBasemap}_tmp output=${vectBasemap} tool=bpol,rmarea thresh=0.4,0.4
g.remove vect=${vectBasemap}_tmp

The following commands create the same tables in the SQLite database as are present in the original soil database.

mdb-schema ${HWSDDIR}/HWSD.mdb mysql > ${TMPDIR}/HWSD_CREATE.sql
sqlite3 ${GISDB} < ${TMPDIR}/HWSD_CREATE.sql

The following loop imports all data into the just-created tables:

for table in `mdb-tables ${HWSDDIR}/HWSD.mdb`; do
  mdb-export -d '|' ${HWSDDIR}/HWSD.mdb $table \
    | sed -e '1d' \
    | sed -e 's/\"//g' > ${TMPDIR}/${table}.csv
  cat << EOF | sqlite3 ${GISDB}
.separator "|"
.import ${TMPDIR}/${table}.csv $table
.quit
EOF
  rm -f ${TMPDIR}/${table}.csv
done

The following loop joins the data of the soil mapping unit with the vector database:

newColNames=(`sqlite3 ${GISDB} '.schema HWSD_SMU' | sed -e '1,2d' | sed -e '$d' | sed -e 's/^\s*//' | sed -e 's/\s.*//' | sed -e 's/\`//g'`)
newColTypes=(`sqlite3 ${GISDB} '.schema HWSD_SMU' | sed -e '1,2d' | sed -e '$d' | sed -e 's/,\s//' | sed -e 's/.*\s\s//' | sed -e 's/\s//g'`)
NnewCols=${#newColNames[*]}

cat /dev/null > ${TMPDIR}/join.sql
for ((i=0; i > ${TMPDIR}/join.sql
  echo "ALTER TABLE ${vectBasemap} ADD ${newColNames[$i]} ${newColTypes[$i]};" >> ${TMPDIR}/join.sql
  echo "UPDATE ${vectBasemap} SET ${newColNames[$i]}=(SELECT ${newColNames[$i]} FROM HWSD_SMU WHERE HWSD_SMU.MU_GLOBAL=${vectBasemap}.value);" >> ${TMPDIR}/join.sql
done
sqlite3 ${GISDB} < ${TMPDIR}/join.sql

Now the general data is in the Grass-GIS database and can be exported or displayed. For that I add a temporary column of type integer to the database. The map can either be used as a raster or a vector map. You can define the variable as a single value, as done here, or wrap the commands in a loop over several values to convert several soil types at once (a sketch follows the commands below).

soil="AR"
echo "ALTER TABLE ${vectBasemap} ADD isubset INTEGER;" | sqlite3 ${GISDB}
echo "UPDATE ${vectBasemap} SET isubset=0;" > new.sql
echo "UPDATE ${vectBasemap} SET isubset=1 WHERE SU_SYMBOL='${soil}';" >> new.sql
echo "UPDATE ${vectBasemap} SET isubset=0 WHERE isubset IS NULL;" >> new.sql
sqlite3 ${GISDB} < new.sql
v.to.rast --quiet --overwrite input=${vectBasemap} output=SOIL_${soil} use=attr column=isubset
r.mapcalc "SOIL_tmp=if(SOIL_${soil}==1, 1, null())"
r.to.vect --quiet --overwrite input=SOIL_tmp output=SOIL_${soil} feature=area
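
To convert several soil types in one go, the same commands can be wrapped in a loop. This is only a sketch; besides AR, the FAO-90 symbols used here are just example values:

# the ALTER TABLE from above only has to be run once and stays outside the loop
for soil in AR SC SN; do
  echo "UPDATE ${vectBasemap} SET isubset=0;" > new.sql
  echo "UPDATE ${vectBasemap} SET isubset=1 WHERE SU_SYMBOL='${soil}';" >> new.sql
  echo "UPDATE ${vectBasemap} SET isubset=0 WHERE isubset IS NULL;" >> new.sql
  sqlite3 ${GISDB} < new.sql
  v.to.rast --quiet --overwrite input=${vectBasemap} output=SOIL_${soil} use=attr column=isubset
  r.mapcalc "SOIL_tmp=if(SOIL_${soil}==1, 1, null())"
  r.to.vect --quiet --overwrite input=SOIL_tmp output=SOIL_${soil} feature=area
done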

There is additional information in the table HWSD_DATA, namely the raw material the SU_SYMBOL was made of, i.e. the data derived from other sources such as the Digital Soil Map of the World (DSMW), Chinese and European soil data, and SOTER. However, these soil parameters, including physical and chemical properties, need to be treated with caution, especially across the borders of the mentioned data sources, as I will demonstrate.

Although the HWSD already has a very high resolution of 30 arc seconds, which is slightly less than a kilometer at the equator, a single grid cell can consist of up to 10 different soil classes, given by the columns SEQ (counter) and SHARE (percent coverage) in the table HWSD_DATA. The SHARE even goes down to 1%, meaning a soil definition of less than one hectare on a global scale.

echo "ALTER TABLE ${vectBasemap} ADD rsubset REAL;" | sqlite3 ${GISDB}
echo "UPDATE ${vectBasemap} SET rsubset=0;" > new.sql
echo "UPDATE ${vectBasemap} SET rsubset=(SELECT SHARE FROM HWSD_DATA WHERE HWSD_DATA.MU_GLOBAL=${vectBasemap}.MU_GLOBAL AND (HWSD_DATA.SU_SYM74 LIKE '${soil}%' OR HWSD_DATA.SU_SYM85 LIKE '${soil}%' OR HWSD_DATA.SU_SYM90 LIKE '${soil}%'));" >> new.sql
echo "UPDATE ${vectBasemap} SET rsubset=0 WHERE rsubset IS NULL;" >> new.sql
sqlite3 ${GISDB} < new.sql
v.to.rast --quiet --overwrite input=${vectBasemap} output=SHARE_${soil} type=area use=attr column=rsubset
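
The resulting maps can then be exported to other formats, as mentioned in the introduction, e.g. as GeoTIFF and shapefile. A sketch in the same command syntax as above:

r.out.gdal input=SHARE_${soil} output=SHARE_${soil}.tif format=GTiff
v.out.ogr input=SOIL_${soil} dsn=SOIL_${soil}.shp type=area format=ESRI_Shapefile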

I calculated the total area of three soil types (Arenosols, Solonetz and Solonchaks) and dune sand, based on the fractional coverage as well as on the dominant soil type (a sketch of corresponding queries follows the table):

Soil type    dominant   fractional
Arenosol        957        442
Dune sand       357        280
Solonchak       128         88
Solonetz        204        140
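
One way to obtain such totals from the attribute table, using the area column (km²) computed above and the rsubset column filled in the previous block, could look like this. This is a sketch, not necessarily the exact commands used:

# total area where the soil type is the dominant one
sqlite3 ${GISDB} "SELECT SUM(area) FROM ${vectBasemap} WHERE SU_SYMBOL='${soil}';"
# total area weighted with the fractional coverage (rsubset holds SHARE in percent)
sqlite3 ${GISDB} "SELECT SUM(area*rsubset/100.0) FROM ${vectBasemap};"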

And another drawback comes up once you look at the distribution of the soils. The color code in the following map is the fractional coverage, and the black lines surround regions where the soil type is defined as dominant. The Arenosol (upper panel) stops at the Egyptian border and turns into dune sand (lower panel) in Libya. That is because Arenosols were not defined in the DSMW, which is the source of the soil definition in most of the Sahara region.

[Figure AR_DS: fractional coverage of Arenosol (upper panel) and dune sand (lower panel)]

And this brings me to my next criticism. If you properly extract the high-resolution texture (in the following I will concentrate on sand only) and calculate a weighted mean over all soil types per pixel, you also see the artefacts of the different source databases:
[Figure T_SAND_mn: weighted mean topsoil sand content]
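
The weighted mean can be computed directly in SQL. The following is a sketch, assuming the SHARE and T_SAND (topsoil sand content) columns of HWSD_DATA, and not necessarily the exact commands used for the map above:

echo "ALTER TABLE ${vectBasemap} ADD t_sand_mn REAL;" | sqlite3 ${GISDB}
# share-weighted mean per mapping unit (components without T_SAND count as zero sand here)
echo "UPDATE ${vectBasemap} SET t_sand_mn=(SELECT SUM(SHARE*T_SAND)/SUM(SHARE) FROM HWSD_DATA WHERE HWSD_DATA.MU_GLOBAL=${vectBasemap}.MU_GLOBAL);" | sqlite3 ${GISDB}
v.to.rast --quiet --overwrite input=${vectBasemap} output=T_SAND_mn use=attr column=t_sand_mn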

Those pixel fractions defined as dune sand in the DSMW source regions did not even get any texture information in the HWSD_DATA table. However, the mean sand content of all Arenosols is 90%, and if I give the dune sand this texture property and recalculate the sand content, my map looks much better:
[Figure T_SAND: recalculated topsoil sand content with dune sand set to 90% sand]

My overall conclusion is that using the HWSD as input for global applications is much too complicated and not straightforward. If you want a consistent dataset of global soil properties, you should stick to the older DSMW or ISRIC-WISE data. Nevertheless, I am also looking forward to SoilGrids, a new dataset of global soil properties.