The ClimateSOS will use offerings to give the data some structure, and a regional subdivision seems like the natural choice. But as the subset format of ds570.0 does not contain this kind of information, I had to look elsewhere. Luckily, the dataset uses WMO (World Meteorological Organization) identifiers for the included stations, and the WMO’s Volume A dataset contains exactly the fields that I needed.

But it wasn’t that simple.

First of all, it took me a little while to find a good resource about the identifiers (see http://www.weathergraphics.com/identifiers/), which led me to the “WMO Pub 9 A” dataset. Excerpts from that website:

World Meteorological Organization (WMO) identifier. The WMO identifier relies on a 5-digit numeric code to identify a weather station. It is widely used in synoptic and upper air reports. The entire identifier is often called the “index number”. The first two digits are sometimes referred to as the “block number” and refer to the geographic area (00-29 Europe, 30-59 Asia, 60-68 Africa, 69 special use, 70-79 North America, 80-89 South America, 90-99 Oceania). The last three digits are loosely referred to as the “station number” in the context of “block numbers”. The WMO provides free access to its WMO identifier assignments.

WMO Pub 9 A This is the sole, authoritative source of synoptic identifier numbers. Thankfully the WMO does a great job of putting its publications online, and this up-to-date resource can be consulted for all the synoptic identifiers that might be encountered. A flatfile (HUGE) may be obtained here (pick the latest Pub9volA).
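To make the block-number ranges from the first excerpt concrete, here is a minimal sketch that maps an index number to its geographic area. This is only an illustration of the quoted ranges, not part of my parser (which uses the Volume A fields instead); the names `BLOCK_REGIONS` and `region_for` are made up for this example, and it assumes the index number is given as a zero-padded 5-digit string.

```python
# Block-number ranges as quoted in the excerpt above (both bounds inclusive).
BLOCK_REGIONS = [
    (0, 29, "Europe"),
    (30, 59, "Asia"),
    (60, 68, "Africa"),
    (69, 69, "special use"),
    (70, 79, "North America"),
    (80, 89, "South America"),
    (90, 99, "Oceania"),
]


def region_for(index_nbr: str) -> str:
    """Return the geographic area for a zero-padded 5-digit WMO index number.

    Hypothetical helper: it only looks at the first two digits (the block number).
    """
    if len(index_nbr) != 5 or not index_nbr.isdigit():
        raise ValueError(f"expected a 5-digit index number, got {index_nbr!r}")
    block = int(index_nbr[:2])
    for low, high, name in BLOCK_REGIONS:
        if low <= block <= high:
            return name
    # Unreachable for valid input, since the ranges cover 00-99.
    raise ValueError(f"no region for block {block}")
```

Note that the leading zeros matter here: without them the first two characters are no longer the block number.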

Parsing the Pub 9 A dataset was quite simple: tab-separated values and good documentation. But while developing the parser I realized that the matching wasn’t that simple. The “wmo number” of a station in the ds570.0 dataset consists of the fields “IndexNbr” (WMO Station Index Number) and “Index SubNbr” (sub-index number) without (!) leading zeros. So I had to implement a little conversion function (yes, it is simple once you know how the number is put together) and added a fast hash map to my library so that I don’t have to care about performance when querying with one form or the other.
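A minimal sketch of what such a conversion and lookup could look like. It is not the actual parser code: it assumes the ds570.0 number is simply the index number followed by the sub-index number, read as an integer (which is what drops the leading zeros), and the function names and the dictionary keys `"IndexNbr"` and `"IndexSubNbr"` are placeholders for this example.

```python
def ds570_key(index_nbr: str, index_sub_nbr: str) -> int:
    """Combine the two Volume A fields into the integer form assumed for ds570.0.

    Assumption: the fields are concatenated and parsed as one number,
    so "01001" + "0" becomes 10010 (leading zero lost).
    """
    return int(index_nbr.strip() + index_sub_nbr.strip())


def build_lookup(rows):
    """Build a hash map from the converted number to the Volume A row.

    `rows` is an iterable of dicts, one per Volume A record (hypothetical layout).
    """
    lookup = {}
    for row in rows:
        lookup[ds570_key(row["IndexNbr"], row["IndexSubNbr"])] = row
    return lookup
```

With a dictionary like this, checking a ds570.0 station against Volume A is a single `lookup.get(wmo_number)` call instead of a scan over the whole file.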

But there were still too many stations that could not be matched… so, debug the code! I found out that some stations had different sub-index numbers and also varied a little in name and/or position, but not by enough to matter for my coarse regional assignment. So I implemented a check for that as well (essentially searching linearly and ignoring the last digit). You’ll see all that in the code of the parser once it is released.
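Roughly, that fallback could look like the sketch below. Again, this is an illustration and not the released parser: `find_station` is a made-up name, and it assumes the stations are held in a dictionary keyed by the integer WMO number whose last digit is the sub-index.

```python
def find_station(wmo_number: int, stations: dict):
    """Look up a station, tolerating a different sub-index number.

    First try an exact match; if that fails, scan linearly and compare
    the numbers with the last digit (the assumed sub-index) removed.
    Returns None when nothing matches.
    """
    exact = stations.get(wmo_number)
    if exact is not None:
        return exact
    base = wmo_number // 10  # drop the last digit
    for key, station in stations.items():
        if key // 10 == base:
            return station
    return None
```

The linear scan is slow compared to the hash lookup, but it only runs for the stations that fail the exact match.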

Sadly, the ds570.0 dataset contains some “made up” WMO numbers for stations not managed by the WMO, so I will lose some stations in the matching process. More on that soon, when the parsing and loading of the data are finished!