NewDataFormat
18 Mar 2010 15:28 UTC 2010077+1528 UTC

New Data Format Discussion

Current database format is specifically built for associating data points with the nearest six minute interval of the hour (:00, :06, :12, :18, etc.). See NewSixMinFormat for more information.

James and Scott (now with Thunderdata in Austin) designed a new database structure before Scott left. All issues have not been resolved, so this page was created to facilitate discussion regarding the table structure and concepts.

Note: In the past, source has meant the method by which the data was received/transmitted to DNR. When used below, it means the origin of the data (e.g., a specific data collection platform: Tide Station at White Point, or perhaps a NCEP model: GFS40).

Modifications to Locations Table

At the June 6, 2008 meeting James, Ryan and I discussed modifications to enforce unique names for the Locations table.

  • Each location has a unique internal name. These internal names can be assigned, matched, and renamed (if conflicting) automatically.
    • E.g., if "MorgansPoint" already exists and has a different latitude and longitude, system could rename location "MorgansPoint (2)".
    • A possible solution to ADCP/ADV data is to name bins using their rank and the total number of bins in the set. E.g., "binX/Y", where X is the bin's rank and Y is the total number of bins in that data set. If the number of bins ever changes, the internal names (automatically generated) remain unique. We could change the bottom, mid, and top aliases to the appropriate bins.
  • Each location has one or more human-friendly name(s).
  • Two different locations (for a given data source) could have similarly-named aliases. In this case, we need a means of determining which is the "correct" location for a given alias. We could include a startgmt field with each alias (see below) which contains a date/time for which that alias-location mapping is valid (think stnattr).

PRELIMINARY Locations table

create table if not exists locations (
    iname    char(80) not null,          /* internal name--static */
    srcid    char(3) not null,           /* data source id */
    lat      int not null,               /* latitude */
    lon      int not null,               /* longitude */
    startgmt int not null,               /* time of earliest data for this location */
    method   varchar(20),                /* method used to det lat,lon */
    note     varchar(255),               /* user-added note(s) */

    primary key (iname),
    /* TODO: redefine this index in some way that makes sense. */
    unique index (iname)
);

PRELIMINARY Alias table

create table if not exists alias (
    iname    char(80) not null,          /* internal name--static */
    alias    varchar(255),               /* alias for this location */
    aliasgmt int not null,               /* starting time for which alias is valid */

    primary key (iname,alias),
    /* TODO: redefine this index in some way that makes sense. */
    unique index (iname,alias)
);

Proposed Table Structure

Data table

create table if not exists data (
    locid   int unsigned not null,      /* location id (locations table) */
    ser     char(4) not null,           /* series identifier */
    gmt     int not null,               /* seconds since epoch (signed) */
    toff    int not null,               /* time offset (signed) */
    val     int not null,               /* measured value */
    route   char(4) not null,           /* how data was transported */

    primary key (locid,gmt,ser,toff,route)
);

  • locid - Location ID from the Locations table (unqiue).
  • gmt - Time for which the data value is observed/predicted in seconds since the epoch. Signed integer should give us sufficient flexibility (1901-2038).
  • toff - Time offset. Difference (in seconds) between the reported time of the value and the time at which it is predicted (forecast: toff > 0).
  • val - Fixed-point integer representing "sampled" value.
  • route - Transmission method; how we retrieved the data. E.g., nesdis, freewave.

You can subtract toff from gmt to determine when the data point was predicted (or recorded, if toff < 0).

Locations table

create table if not exists locations (
    locid   int unsigned auto_increment not null, /* unique location id */
    srcid   char(3) not null,           /* data source id */
    lat     int not null,               /* latitude */
    lon     int not null,               /* longitude */
    locgmt  int not null,               /* time of loc. "goodness" */
    name    varchar(80),                /* location name */
    method  varchar(20),                /* method used to det lat,lon */
    note    varchar(255),               /* user-added note(s) */

    primary key (locid),
    /* TODO: redefine this index in some way that makes sense. */
    unique index (srcid,lat,lon,method)
);

  • srcid - Each location should be associated with single source. A single "source", however, may have multiple locations. For the present data collection platforms, the srcid would simply be the present CBI ID (3 digit number).
  • locid - Unique location identifier.
  • name - Text description of location (e.g., CC Bay Platform, Texas State Aquarium).
  • method - Method used to determine lat and lon (e.g., gps, survey).
  • locgmt - Time (seconds since the epoch) when lat, lon was first determined. For predicted data, this field holds the gmt value (from the data table) of the earliest data point in the database.
  • note - A field where users/QC folks/PIs can add information regarding particular locations.

Questions/Answers:

Some redundancy below.

1. A "station" may have multiple locations. May a location exist without association to a particular station? May a single location be associated with more than one station?

A: Yes, a station may have multiple locations. We will assume that each location is tied to a particular station, thus the answer to the second question is "no".

A: No, a station is a source with only a single location.

2. Depending on the answers to the previous question, do we want to include a stnid column in the Location table? With the current design the only way to determine what location(s) a station is associated with is to search the entire Data table which may be expensive. I'm optimizing prematurely here, I believe.

A: Yup, we included a stnid column in the Location table.

3. Currently the value storage in the database is a smallint, which in MySQL 5.0 defaults to a signed 16-bit integer [-32768,32767]. For some instruments we may be able to measure things more accurately than 4-5 significant digits. Do we want to use a larger integer? 32 bits?

A: We're using a 32-bit integer now.

4. What units are we using to measure gmt? Seconds since the epoch? If so, shouldn't it be unsigned?

A: Seconds since the epoch. We'll use signed since the range should be sufficient for our purposes (1901, 2038), and this allows us pharosdb to handle historical data as well, at least back to 1902. Looking forward, 64-bit processors and int values should be commonplace by the end of this decade (they're already fairly common in PCs now).

5. What are the units for the time offset (toff)? Seconds? Doesn't it make sense that we would only have positive values in this field? Predictions are useful, but if we already know the real data values then our offset would be 0. Or do I have this backwards?

A: Seconds. We selected signed. We can use this offset to represent forecasts--the toff field will hold a negative value, which when added to the gmt gives us the time at which the prediction was made. A positive value represents the time at which a data value makes it into the database, the time delta from when it was originally recorded/polled to when it becomes available. We can use this "lag" to determine how fast, on average, data for a particular station takes to get into the database.

6. What units should z (elevation) be in? Should the default value be part of the table definition or a part of the code?

A: z is in mm. We may not have any reasonable value for elevation so it can legally be null. The point at which an elevation is in reference to will be stored in a station datum file or something.

7. Can locations ever be removed? Can they ever be reassigned to another location? I.e., some form of equivalence? Say "PortAransas" and "Port Aransas" are the same location even though they occupy different entries in the Locations table because their string names are not equal?

A: This hasn't been resolved. Notes on locations are only stored in the database and are not preserved outside of it. Because locations are assigned as each incoming data file is parsed (NCEP data), locids are not consistent from database to database. I.e., if we drop the database and reimport everything, old locids will not match the new ones [unless we reimport everything in the same order]. See Location Location Location and Forecast Errors for more about location issues.

Notes:

  1. James and I plant to have the scanning program (Clerk; e.g., cdlscan2) import the data directly into the database instead of using SmAdd and SmWrite to produce spool files which are then parsed by sximp and placed into the database.
  2. Increased val from smallint (2 bytes) to signed integer (4 bytes). There is little apparent benefit with using a 2-byte smallint here (4 significant digits) vs. a 4-byte int (9 significant digits) since the size difference is only 2 bytes out of 25/27 per row.

Listserv Discussion

Follow the discussion in the dnr-is archive (1/29/2007 - ?).

Page last modified on June 06, 2008, at 02:36 PM