
Geographic Search Methods by Data Type


This page was turned into a poster for the HDF and HDF-EOS Workshop VIII, which took place in October 2004.
It was also turned into a presentation for the ESIP Federation Winter Meeting, which took place in January 2005.




Abstract:

It is not enough to collect the data and produce data products. In order to be useful the data has to be used. To facilitate data use, a search interface eventually has to be developed, and probably many of them. Those interfaces can only be as good as the metadata they have to work with. HDF is not so much a data format as a file format that packages the data with the metadata, so facilitating data access by providing adequate and appropriate metadata starts with data production.

In many areas this is not much of a challenge. The temporal coverage of the granules is generally well known, channel and derived parameter names are generally just a matter of convention, etc. But spatial coverage can vary quite a bit, especially for remotely sensed data, and can often be problematic. This paper goes through the five most common spatial types (point, grid, tile, scene, and swath), discusses the problems associated with each, and makes some recommendations for the metadata that needs to be included with the data to facilitate fast, efficient, accurate search when the time comes.

Introduction:

Historically Relational Database Management Systems (RDBMSs) have not handled spatial information terribly well. A lat/lon bounding box on a flat Earth was the best they could do, and good enough for most commercial needs. The Earth Science community has a different set of needs but, not being a major customer, had to make do with the flat Earth paradigm.

That has changed in the last five years or so. These days most RDBMSs can handle a variety of spatial types on a spherical Earth, either natively or via a spatial extension, in fairly efficient ways. The Earth Science community now has an opportunity to improve the accuracy of the system by rethinking how spatial coverage is defined for different data types. And in order to take advantage of these new capabilities we need to make sure the required metadata are available.


Point Data:

Point data is data from a single point, generally in-situ measurements such as weather station data and ice cores. Point data is exactly the kind of data most RDBMSs are designed to handle and offers no challenges for spatial search.

Often, however, point data is aggregated. If the aggregate is sparsely populated, as with weather station data in Northern Russia where the stations may be 100s or even 1000s of kilometers apart, most RDBMSs have a “multi-point” type that may prove more useful than simply defining the entire area of coverage.

For denser aggregates preserving the pointiness of the data may be overkill. For example if multiple shallow cores are drilled during a field experiment it may be reasonable to define the coverage as the small area containing all the cores. Species sightings are another example where defining the coverage as an area may be reasonable. In those cases the spatial coverage is at the data set level and more appropriately handled at that level, as with gridded data. Indeed, if the points are dense enough that interpolating between them is a reasonable thing to do the resulting product just is a gridded data set.


Gridded Data:

Gridded data are data that have been processed into a predefined grid where a "grid" is defined as a lattice of a set resolution overlaid onto a projection. So a typical grid might be "Northern Hemisphere Azimuthal 25 km." indicating that a nominal 25 km resolution lattice has been overlaid on an azimuthal projection of the Northern Hemisphere.

For remotely sensed data coverage is generally global and most gridded data products will have global, or hemispheric, coverage. For global products there is no point even doing a spatial search and simply using the keyword “global” should suffice to define the coverage area. For hemispheric products a single lat/lon bounding box at the dataset level will suffice. But things get trickier when the grid is regional.

Even some regional grids are compatible with using a lat/lon bounding box, and that method is still more efficient than spherical methods so it should be used when appropriate. For example the Tropical Rainfall Measuring Mission (TRMM) produces gridded data in the lat/lon projection but only between (-40, 40) degrees latitude. Although that coverage is neither global nor hemispheric, a lat/lon bounding box will still perfectly describe it because the data are in the lat/lon projection.

(pictures: MODIS CMG, global coverage; SSMI Northern EASE-Grid, hemispheric coverage; TRMM coverage, latitudes (-40, 40))

Other regional grids are not so compatible, as the choice of projection often has more to do with the data, the region, or the users' needs. Interpolated point data, for example, is often gridded to whatever projection best represents the region the points are in. And specific data products, such as the sea ice product pictured below, may cater to the needs of the user community.

The key point for gridded data is that all granules in the data set are in the same grid, and consequently have the same coverage. So there is no need to do spatial search at the granule level. If the area of interest overlaps the grid then every granule overlaps the area of interest. Conversely if the area of interest does not overlap the grid there's no point doing a granule level search at all, since none of the granules overlap the area of interest.
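
As a minimal sketch of that dataset-level logic, the Python below compares the area of interest with the grid coverage once and then returns either every granule or none. The `BBox` type, the function names, the granule names, and the two areas of interest are all invented for illustration, and the overlap test assumes neither box crosses the date line. The coverage values mimic the TRMM case described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Lat/lon bounding box in degrees (hypothetical helper type)."""
    south: float
    north: float
    west: float
    east: float


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share any area or edge.

    Assumes neither box crosses the date line (west <= east).
    """
    return (a.south <= b.north and b.south <= a.north
            and a.west <= b.east and b.west <= a.east)


def search_gridded(coverage: BBox, granules: list[str], aoi: BBox) -> list[str]:
    """One dataset-level comparison decides the fate of every granule."""
    return list(granules) if overlaps(coverage, aoi) else []


# TRMM-like coverage: all longitudes, latitudes between -40 and 40.
trmm_coverage = BBox(south=-40.0, north=40.0, west=-180.0, east=180.0)
granules = ["granule_a", "granule_b", "granule_c"]  # made-up names

amazon = BBox(south=-10.0, north=0.0, west=-70.0, east=-50.0)
greenland = BBox(south=60.0, north=83.0, west=-73.0, east=-12.0)

print(search_gridded(trmm_coverage, granules, amazon))     # every granule
print(search_gridded(trmm_coverage, granules, greenland))  # no granules
```

For a regional grid in some other projection the same structure applies, with the box comparison swapped for whatever spherical polygon test the database provides.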

(pictures, left to right: Polar Stereographic Sea Ice Product; Sea Ice Grid on a flat Earth; Sea Ice coverage and a lat/lon bounding box on a Polar Stereographic projection)

For the sea ice product (above, left) the polar stereographic projection was chosen because that’s what users are used to, and this specific region was chosen because there is no sea ice outside this region. But a lat/lon bounding box is a bad fit to the data coverage area as shown on a flat Earth (above, center) and the data’s native projection (above, right). A spherical quadrilateral using only the corner points of the data (below, left) is a much better fit, but still has room for improvement.

(pictures, left to right: Sea Ice and a Spherical Quadrilateral; Sea Ice and a Spherical 20-gon)

Because the coverage is at the data set level the consequences of being wrong are fairly severe. If a user happens to pick an area of interest that intersects the defined coverage area, but is outside the actual coverage area, every single granule in this dataset will be returned and all of them will be useless. The end result will be a greatly inconvenienced, and probably upset, user.

Adding more points to the polygon slows performance, but the area comparison is done only once for the entire data set, so performance isn’t much of an issue. Even if the comparison takes a whole hundredth of a second, that’s only one hundredth of a second. In this case it’s better to be accurate than fast and a spherical 20-gon (above, right) is much more accurate. Indeed, to really lock it down you might consider using 50 or 100 points, or even every single edge point, which in this case would be a spherical 1504-gon.


Tiled data:

Tiled data is a special form of regional grid. A tile is a piece of a grid. For some sensors the data are of such high resolution that the gridded product is much too large to include the entire grid in a single granule. When that is the case the gridded product may be chopped into sub-grids, typically by applying a much coarser lattice to the grid. For example the MODIS 500 meter resolution Northern and Southern Hemisphere Azimuthal product is chopped into (18x18 = 324) tiles, and the Global Sinusoidal product is chopped into (18x36 = 648) tiles, just to keep the file sizes reasonable.

(pictures: MODIS Northern EASE-Grid tiles; MODIS Southern EASE-Grid tiles; MODIS Sinusoidal tiles)

So even though tiled data is a kind of gridded data it is not the case that every granule in the data set has the same spatial coverage. Still there is some regularity: every granule with the same tile coordinates does have the same spatial coverage, so there is no point doing a spatial search on each and every granule. The best way to do spatial search on tiled data is to run a search on the spatial coverage of the tiles, then search the inventory by tile coordinates.

This is best accomplished via a look-up table. For the MODIS Sinusoidal grid it would be a look-up table with only about 470 rows, since many of the tiles have no coverage. The spatial part of the search gets run on those 470 tiles and then a second search looks for granules with those tile coordinates. Because the expensive spatial part of the search is limited to the 470 tiles, instead of the thousands of granules, the spatial coverage of the tiles can be defined more accurately without significantly degrading performance.
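
The two-step search can be sketched as follows. Everything here is hypothetical: the table contents, the granule ids, and the use of plain lat/lon boxes as tile coverage (chosen only to keep the example short; the numbers are not real MODIS tile bounds). The point is the shape of the search, where the spatial comparison touches only the tile table and the inventory is filtered by tile coordinates afterwards.

```python
# Hypothetical look-up table: tile coordinates (v, h) -> coverage,
# stored as (south, north, west, east) in degrees. Invented values.
TILE_COVERAGE = {
    (5, 8): (30.0, 40.0, -120.0, -100.0),
    (5, 9): (30.0, 40.0, -105.0, -90.0),
    (6, 8): (20.0, 30.0, -110.0, -95.0),
}

# Hypothetical inventory: (granule id, tile coordinates).
INVENTORY = [
    ("g001", (5, 8)),
    ("g002", (5, 9)),
    ("g003", (6, 8)),
    ("g004", (5, 8)),
]


def box_overlaps(a, b):
    """Overlap test for (south, north, west, east) boxes, no date line crossing."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def tiles_for(aoi):
    """Step 1: the (expensive) spatial search, run only on the tile table."""
    return {tile for tile, cov in TILE_COVERAGE.items() if box_overlaps(cov, aoi)}


def granules_for(aoi):
    """Step 2: a cheap, non-spatial search of the inventory by tile coordinates."""
    wanted = tiles_for(aoi)
    return [gid for gid, tile in INVENTORY if tile in wanted]


area_of_interest = (32.0, 35.0, -118.0, -112.0)
print(tiles_for(area_of_interest))
print(granules_for(area_of_interest))
```

Because step 1 never grows with the inventory, the tile coverage can be swapped for a many-pointed spherical polygon without changing step 2 at all.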

(pictures, left to right: MODIS Sinusoidal tile 05v08h; MODIS Sinusoidal tile 05v08h on a flat Earth; MODIS Sinusoidal tile 05v08h on a spherical Earth)

Tile 05v08h (above, left) from the sinusoidal grid is not a good fit to either a lat/lon bounding box (above, center) or a spherical quadrilateral (above, right), so a spherical polygon of more than four points is probably warranted. As with regional grids, accuracy is more important than speed, but there are a few hundred tiles to search on, so performance is more of an issue and one may want to limit the number of points in the polygon somewhat.


Scene data:

A "scene" is a small partial swath. Most environmental satellites have about a 100 minute period and often the data are divided into 5 minute scenes covering approximately 1/20th of an orbit or 18 degrees. Larger scenes are obviously possible and the difference between a scene and a partial swath is highly subjective.

(picture of a browse image of a scene, picture of the coverage area on an orthographic)

How the spatial coverage of a scene is defined is highly variable. For satellites like Landsat, which are in tightly controlled orbits, the coverage can be defined by indexing the limited number of possible orbits the satellite can have and the limited number of scans made by the sensor during those orbits. For Landsat these indexes are referred to as "path" and "row" respectively and the World Reference Systems I and II were designed for exactly that purpose.

Using WRSI or WRSII the spatial coverage of a scene can be defined as something like (Path: 35, Rows: 123-158), but one still has to know what those indexes mean. The Landsat user community has grown accustomed to the (path, row) scheme over the years and for many Landsat users it is second nature to refer to their area of interest by the (path, row) coordinates.

For satellites in different orbits, or less tightly controlled orbits, WRSI and WRSII are not usable so other methods have to be employed. Historically RDBMSs have been unable to work with spherical coordinates so defining the coverage of a scene using a lat/lon bounding box was the best available option. Lat/lon bounding boxes are reasonably accurate representations of the coverage area of a scene at lower and mid latitudes, where the satellite track is more or less parallel to lines of longitude. But lat/lon bounding boxes are less effective in the extreme latitudes because the angle of the ground track relative to lines of longitude changes rapidly as the satellite crosses the polar regions.

These days most commonly used RDBMSs can handle spherical coordinates and scene data is "almost" a spherical rectangle. Indeed defining the coverage of a 5-minute scene as a spherical rectangle using the corner points of the scene will always include the entire scene, and always be a more accurate representation of the coverage area than the lat/lon bounding box. If greater accuracy is desired the coverage can be better defined by a spherical polygon using additional points from the edges of the scene.
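
To give a feel for what a spherical comparison involves, here is a small Python sketch of a point-in-polygon test for a convex spherical polygon whose edges are great-circle arcs. It is an illustration under stated assumptions, not what any particular RDBMS does internally: the corners must be listed counterclockwise as seen from outside the Earth, the polygon must be convex and smaller than a hemisphere, and the corner coordinates are invented.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert a lat/lon pair in degrees to a 3-D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def inside_convex_spherical_polygon(point, corners):
    """True if point lies inside the polygon with the given corners.

    point and corners are (lat, lon) pairs in degrees. Corners must be in
    counterclockwise order seen from outside the sphere. Each edge defines
    a great-circle plane; the point must be on the inner side of all of them.
    """
    p = to_unit_vector(*point)
    verts = [to_unit_vector(lat, lon) for lat, lon in corners]
    for i, a in enumerate(verts):
        b = verts[(i + 1) % len(verts)]
        if dot(cross(a, b), p) < 0.0:
            return False
    return True


# Invented "scene" corners straddling the date line at high latitude.
scene = [(60.0, 170.0), (60.0, -170.0), (70.0, -170.0), (70.0, 170.0)]

print(inside_convex_spherical_polygon((65.0, 180.0), scene))
print(inside_convex_spherical_polygon((65.0, 150.0), scene))
```

Note that the date line needs no special handling here, because the test works on 3-D vectors rather than on longitude values. Adding edge points to the corner list tightens the fit as long as the polygon stays convex.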

But running spatial searches on spherical polygons is CPU intensive so performance can become poor. Most RDBMSs use a number of heuristics to improve performance but these heuristics become less effective as the size of the scene increases. As mentioned above the point at which it is better to start using swath search methods is highly subjective but it is generally when the "scene" is larger than a quarter or third of an orbit.


Swath data:

Swath data is by far the most difficult data to work with when doing a spatial search. As the satellite circles the rotating Earth a single orbit will cross all lines of longitude and most lines of latitude. If the swath width of the sensor is wide enough the data will cover all longitudes and all latitudes, but only a small portion of the Earth's surface.

The "unusual shape" of swath data is often cited as the reason it is so difficult to work with. But the shape of a swath is only unusual if you look at it projected on a flat Earth. On a spherical Earth the shape of the swath is quite ordinary and much easier to work with.

(pictures: Weird orbital shape on a flat Earth; Boring orbital shape on a round Earth)

Many methods for running a spatial search on swath data exist and each method requires defining the coverage of the swath using different parameters, so the two are intimately linked. This makes it important to decide what search method you are going to use prior to data production - so the data coverage can be appropriately described in the metadata that is ingested into the database. Two methods that don't work well, lat/lon bounding boxes and spherical polygons, are discussed above and we won't repeat that discussion here. Three other popular methods are: Nominal Orbit Spatial Extent (NOSE), predict orbit search, and backtrack orbit search.

NOSE: The NOSE scheme is a hybrid of many of the schemes described above. Like the Landsat WRS the NOSE scheme relies on creating its own custom coordinate system for each data set, but for NOSE those coordinates are called (track, block). A "track" is analogous to a WRS path, but a "block" is more like a predefined scene. In practice there are usually 36 blocks per track which makes each block approximately a 2.5 minute scene. Much like the tile scheme the spatial coverage of each block is put in a lookup table which is used to determine what (track, block) coordinates cover the area of interest so a second search can search the inventory for granules that include those coordinates. It's a clever idea but it's incredibly inefficient.

(picture of a typical track/block scheme)

The primary problem with NOSE is it is expensive. Because it is sensor specific a new set of NOSE parameters must be developed for each sensor. The NOSE scheme also loses out on the advantages of using a look-up table because the look-up table ends up being huge. Instead of the 324 rows needed in the lookup table for tiled data NOSE requires a row to define the coverage of each block in the set. If tracks are conservatively spaced at one degree intervals, and there are 36 blocks per track, that's (36x360 = 12960) rows to achieve one degree accuracy for one sensor. Multiply that by the number of sensors using the NOSE scheme and just the size of the database footprint becomes an issue.
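
The table sizes above are simple arithmetic, worked through here in Python. The block count and track spacing come from the text; the `nose_rows` helper and the choice of five sensors in the last line are just for illustration.

```python
blocks_per_track = 36      # blocks per track, as stated in the text
track_spacing_deg = 1.0    # tracks "conservatively spaced at one degree intervals"

tracks = int(360 / track_spacing_deg)
nose_rows_per_sensor = blocks_per_track * tracks

tile_rows = 18 * 18        # look-up table rows for the 18x18 tiled product


def nose_rows(n_sensors):
    """Total look-up rows if every sensor needs its own NOSE table."""
    return nose_rows_per_sensor * n_sensors


print(nose_rows_per_sensor)               # rows for one sensor
print(nose_rows_per_sensor // tile_rows)  # how many times larger than the tile table
print(nose_rows(5))                       # rows for a hypothetical five sensors
```

Halving the track spacing to improve accuracy doubles every one of these numbers.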

Predict: Predict orbit search methods rely on using an orbit model to predict when the satellite will be over the area of interest, and consequently turn the spatial search into a temporal search. This method uses an orbit propagator initialized with the two line elements of the satellite, but because the propagator spins the orbit forward to predict the satellite position, error accumulates. Consequently the propagator must be reinitialized with daily or weekly two line elements as it runs.

Predict methods are fairly fast and quite accurate but they have a high set-up and maintenance cost. Initial set-up requires linking the propagator into the system and then maintenance requires periodic updates of the two line element sets. Once everything is working new satellites can be added quite easily so predict ends up being less expensive than NOSE if the system is meant to work with multiple satellites. Indeed one application predict methods are well suited for is coincident search - where the user is looking for data from multiple satellites covering their area of interest at about the same time.

Predict methods are not well suited to long time series searches. Predict turns the spatial search into a temporal search on the actual time the satellite is predicted to have been over the area of interest and for long time series this clause of the search can become quite large. For example polar orbiting satellites pass over or near Thule, Greenland 9 or 10 times per day so a search for just a year's worth of data over Thule would result in about 3500 specific times to search on. An example query for a single day's worth of data is shown below.

   where ( ('2003/02/18 01:23:03' between startDateTime and endDateTime) or 
           ('2003/02/18 03:03:57' between startDateTime and endDateTime) or 
           ('2003/02/18 04:44:34' between startDateTime and endDateTime) or 
           ('2003/02/18 06:05:21' between startDateTime and endDateTime) or 
           ('2003/02/18 07:46:32' between startDateTime and endDateTime) or 
           ('2003/02/18 12:47:46' between startDateTime and endDateTime) or
           ('2003/02/18 14:10:32' between startDateTime and endDateTime) or 
           ('2003/02/18 15:51:02' between startDateTime and endDateTime) or 
           ('2003/02/18 17:11:31' between startDateTime and endDateTime) or 
           ('2003/02/18 18:52:07' between startDateTime and endDateTime) )
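
A clause like the one above would normally be generated rather than typed. The Python sketch below shows one way that might look, assuming the orbit propagator has already produced a list of overpass times; the function name is invented, and the column names `startDateTime` and `endDateTime` are taken from the query above. A real system would more likely pass the times as bound parameters than paste them into the SQL text.

```python
from datetime import datetime


def overpass_where_clause(overpass_times):
    """Build one 'between' term per predicted overpass, joined with 'or'."""
    terms = [
        f"('{t.strftime('%Y/%m/%d %H:%M:%S')}' between startDateTime and endDateTime)"
        for t in overpass_times
    ]
    return "where ( " + " or\n        ".join(terms) + " )"


# Two of the predicted times from the example query.
times = [
    datetime(2003, 2, 18, 1, 23, 3),
    datetime(2003, 2, 18, 3, 3, 57),
]
clause = overpass_where_clause(times)
print(clause)

# The clause grows by one term per overpass, so at 9 or 10 passes per day
# a full year produces a few thousand terms.
print(9 * 365, 10 * 365)
```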
                                            

Backtrack: Backtrack is a generalization of the NOSE scheme in several ways which results in it being both cheaper to implement and more accurate. Instead of creating a custom coordinate system for each data set backtrack relies on the natural coordinate system of the data. Instead of predefined tracks backtrack indexes swaths by the equator crossing longitude so there are (theoretically) an infinite number of tracks. That would require a look-up table of infinite size if backtrack used look-up tables, but backtrack gets around that problem by using math instead.

There is also a sense in which Backtrack is a specialization of the predict method but, as the name implies, backtrack spins the orbit model in the opposite direction. The orbit model used by predict methods is quite flexible and can be used with any satellite, but when working with Earth Science data that flexibility is unnecessary. Environmental satellites collecting Earth Science data are always in circular orbits so the sensor's field of view will remain constant.  Consequently the orbits can be more easily modeled as great circles. Limiting backtrack to only circular orbits greatly simplifies the method and greatly reduces the implementation costs.

Backtrack starts with the area of interest and answers the question "If the sensor saw this area where must the satellite have crossed the equator?" Because backtrack spins back by at most one orbit there is no cumulative error. A small amount of error is introduced because backtrack assumes the satellite's orbit is stable for the lifetime of the sensor, but in general the satellite is in a stable orbit for the lifetime of the sensor and small variations in that orbit are generally less than the resolution of the sensor. So while backtrack isn't quite as accurate as predict methods it is more than sufficient for purposes of spatial search, without the added costs.
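
The flavor of that question can be shown with a little spherical trigonometry. The Python below is only a sketch of the underlying great-circle relation, not the backtrack algorithm itself: it handles a single point directly under the satellite (no swath width), treats the orbit as a circle at a fixed inclination, and ignores orbital precession. The function name, its parameters, and the example orbit are assumptions made for illustration.

```python
import math

SIDEREAL_DAY_S = 86164.1  # one rotation of the Earth, in seconds


def equator_crossing_longitude(lat_deg, lon_deg, inclination_deg,
                               period_s, ascending=True):
    """Longitude where the satellite last crossed the equator northbound,
    given that its nadir point later passed over (lat_deg, lon_deg).

    Set ascending=False if the point was seen on the southbound half
    of the orbit.
    """
    lat = math.radians(lat_deg)
    inc = math.radians(inclination_deg)
    ratio = math.sin(lat) / math.sin(inc)
    if abs(ratio) > 1.0:
        raise ValueError("the nadir track never reaches this latitude")

    # Angle travelled along the orbit since the northbound equator crossing.
    u = math.asin(ratio)
    if not ascending:
        u = math.pi - u
    u %= 2.0 * math.pi  # spin back by at most one orbit

    # Longitude change along the great circle on a non-rotating Earth.
    dlon_orbit = math.atan2(math.cos(inc) * math.sin(u), math.cos(u))

    # Meanwhile the Earth has rotated eastward under the orbit.
    elapsed_s = u / (2.0 * math.pi) * period_s
    dlon_earth = 2.0 * math.pi * elapsed_s / SIDEREAL_DAY_S

    crossing = lon_deg - math.degrees(dlon_orbit - dlon_earth)
    return (crossing + 180.0) % 360.0 - 180.0


# A point on the equator is its own crossing longitude.
print(equator_crossing_longitude(0.0, 25.0, 98.0, 6000.0))

# Hypothetical exactly-polar orbit with a 100 minute period: by the time the
# satellite reaches 45N the Earth has turned a few degrees underneath it.
print(equator_crossing_longitude(45.0, 0.0, 90.0, 6000.0))
```

Running this for the edges of an area of interest, widened by the sensor's swath width, would give a range of equator crossing longitudes to compare against the crossing longitude stored with each granule.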


Track data:

Track data is data that follows an arbitrary path as it moves. Examples include overflight data, cruise data, and data from drifting stations or buoys. Track data is essentially a moving point so it is best represented as a line or multi-line and most RDBMSs can handle that natively. And even track data from a sensor with some width associated with its field of view can be represented as a line if the search is then tuned to be "within X meters" of the line.
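
In a database the "within X meters" test would be handled by the spatial engine. The Python below is a rough stand-in that shows the idea using only the recorded positions along the track, so it assumes the track is sampled densely compared with the search distance; it does not measure the distance to the segments between positions. The function names and the sample track are invented.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def within_distance_of_track(point, track, max_m):
    """True if any recorded track position lies within max_m of the point."""
    return any(haversine_m(point, position) <= max_m for position in track)


# Invented track running east along the equator.
track = [(0.0, 0.0), (0.0, 0.01), (0.0, 0.02)]
site = (0.001, 0.01)  # roughly 111 meters north of the middle position

print(within_distance_of_track(site, track, 200.0))
print(within_distance_of_track(site, track, 50.0))
```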


Various data:

One method that isn't specific to a data type is binning. For any given data set binning won't be as accurate as a method appropriate to that data type, but binning is a good general method if you have a wide variety of data you wish to search using a single method. The main idea behind binning is that you get the expensive, time-consuming part of the search over with at ingest, when there isn't a user waiting for their results.

As the name implies this method relies on sorting the data into bins covering the Earth. Bins can be of any size or shape but in practice it is usually expedient to use a simple lat/lon grid of some appropriate resolution. The coverage of each data granule is then compared to the coverage of the bins at ingest and a list is compiled of bins that overlap the granule. That list is then stored in the database as the coverage information for that granule. At search time the user's area of interest is converted to a similar list and the spatial portion of the search becomes a simple list comparison.
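
A minimal Python sketch of the idea follows, using a plain one degree lat/lon grid with (row, column) bin labels. This is not the c-squares notation, just a generic illustration; the function name, the granule ids, and the coverages are invented, coverage is simplified to a lat/lon box, and boxes crossing the date line are not handled.

```python
import math


def bins_for_box(south, north, west, east, size_deg=1.0):
    """Labels (row, col) of every lat/lon bin that a box touches."""
    r0 = math.floor((south + 90.0) / size_deg)
    r1 = max(r0 + 1, math.ceil((north + 90.0) / size_deg))
    c0 = math.floor((west + 180.0) / size_deg)
    c1 = max(c0 + 1, math.ceil((east + 180.0) / size_deg))
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# At ingest: do the expensive comparison once and store the bin list.
granule_coverage = {
    "g1": (10.2, 12.7, 20.1, 21.5),
    "g2": (-5.0, -3.0, 100.0, 102.0),
}
granule_bins = {gid: bins_for_box(*cov) for gid, cov in granule_coverage.items()}

# At search time: convert the area of interest and compare lists.
aoi_bins = bins_for_box(12.1, 12.9, 21.2, 21.8)
matches = [gid for gid, bins in granule_bins.items() if bins & aoi_bins]

print(len(granule_bins["g1"]))
print(matches)
```

A match only means the granule and the area of interest share a bin, so the bin size sets the accuracy: smaller bins mean fewer false hits but longer lists to store and compare.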

Tony Rees at CSIRO Marine Research in Hobart, Tasmania has developed the binning method into a software package called c-squares which offers a number of advantages if you decide the binning method is the way you want to go. Tony's binning scheme is based on existing standards, it's extensible to whatever degree of accuracy you desire, and the bin labels are meaningful so the translation between spatial coordinates and bin labels is systematic. Tony has also built quite a bit of infrastructure around the c-squares system so much of the functionality one might want to implement has already been developed. Tony was kind enough to send the following image.

(picture: c-squares satellite coverage)

This example is binned using 0.5 x 0.5 degree squares, and a quadtree-like syntax is used which means that not every square needs to be encoded individually. Among the approach's strengths is that data crossing the date line (as in this example) or polar regions creates no problem for this method (as opposed to bounding boxes, for example) and it also copes well with datasets which are patchy (i.e., collections of disparate objects) as well as datasets with significant holes (e.g. marine data around a continent but not across the middle). Also, the approach needs no spatial engine supporting it at search time, so can be implemented in any system which supports text-based matching. On the down side the requirement to potentially store multiple codes per object can be a storage and search overhead for dataset coverages of continental scale and above (i.e., the system works "best" at small-to-medium scales, and for large scale datasets e.g. hemispheric or global coverages, other approaches may be preferable).



Please send comments, questions and queries to swick@geospatialmethods.org
