=======================================
GAVO DaCHS: DirectGrammars and Boosters
=======================================

:Author: Markus Demleitner
:Email: gavo@ari.uni-heidelberg.de

A DirectGrammar is a program that is called by the ingestor and writes
binary COPY material to stdout; that material is then dumped into the
table.  Because of this, direct grammars can only operate on single
tables; data descriptors containing more than one ``make`` cannot have
direct grammars.  The record definition of the table to be filled in this
way must of course have ``onDisk="true"``.

Such a program can in principle be anything you like, but usually it is
written in C using the boosterskel infrastructure
(``resources/src/boosterskel.c``).  This infrastructure yields programs
that expect to be called with a single argument, the source file to
operate on.

To use a booster, put a ``DirectGrammar`` element with a ``cbooster``
attribute into a ``data`` element.  The path in the ``cbooster`` attribute
is interpreted relative to the RD's resdir.


Input Formats
-------------

By default, a booster expects to read from a text file.  Currently, the
maximum line length is set to 4000 (``INPUT_LINE_MAX`` in
``boosterskel.c``).  It is up to the parsing function to split and digest
this text line.

When you get binary data of fixed record length, set the ``recordSize``
attribute on the ``DirectGrammar`` element.  Note that a ``recordSize``
larger than ``INPUT_LINE_MAX`` will cause a buffer overflow.  For binary
input like this, you will want to generate the booster using the ``-b``
flag.

Another type of booster reads from text files in which the fields are
delimited by separator characters rather than by column position.  For
the booster environment, this makes no difference, so you don't need to
declare it.  The booster itself should be generated using the ``-s`` flag
to ``gavo mkboost``.


Writing a booster
-----------------

To write a booster, first define the table.  You can already specify the
``DirectGrammar`` as noted above.
Then call ``gavo mkboost`` with the id of or the path to the resource
descriptor and the id of the table, e.g.::

  cd res
  gavo mkboost ../q.rd main > boosterfunc.c

This will write template code for a column-based source to stdout.  See
above on what to do when your source has a different format.  You can
place the template code anywhere, but the convention is to put it in
``res/boosterfunc.c`` unless you have a good reason to do otherwise.

The template code starts somewhat like this::

  #include
  #include
  #include "boosterskel.h"

  #define QUERY_N_PARS 33

  enum outputFields {
    fi_localid,  /* Identifier, text */
    fi_pmra,     /* PM (alpha), real */
    fi_pmde,     /* PM (delta), real */

You can add includes as you need them.  The definition of
``QUERY_N_PARS`` (which is the number of columns in the table) is
essential and must not be edited or removed, since the function building
the booster greps it out of the source code to communicate this value to
the booster boilerplate.

The code continues with an enumeration of the field indices; the names
are simply ``fi_`` followed by the lowercased field destination.  If you
only use these names to access fields, the cut'n'paste effort when the
table later changes should be manageable, and you can add your code
directly in the mkboost-generated source.

While you shouldn't need to change any of this, you have to change the
function that follows::

  Field *getTuple(char *inputLine)
  {
    static Field vals[QUERY_N_PARS];

    parseWhatever(inputLine, F(fi_localid), start, len);
    parseFloat(inputLine, F(fi_pmra), start, len);
    parseFloat(inputLine, F(fi_pmde), start, len);
    parseFloat(inputLine, F(fi_raerr), start, len);

Here, it's your job to assign the proper values to the ``Field`` elements
of ``vals``.  The template adds parseXXX function calls that will do in
the simplest cases, but you need to fill in ``start`` (the character
index of the start of the value in C convention, i.e., the first
character has index 0) and ``len``, the length of the field in
characters.
``Field`` is defined as follows::

  typedef struct Field_s {
    valType type;
    int length; /* ignored for anything but VAL_TEXT */
    union {
      char *c_ptr;
      double c_double;
      float c_float;
      int32_t c_int32;
      int8_t c_int8;
    } val;
  } Field;

where ``type`` is one of::

  typedef enum valType_e {
    VAL_NULL,
    VAL_BOOL,
    VAL_CHAR,
    VAL_INT,
    VAL_FLOAT,
    VAL_DOUBLE,
    VAL_TEXT,
    VAL_JDATE,
  } valType;

``VAL_JDATE`` is a Julian day number to be dumped as a date (rather than
a datetime).  For other ways to represent dates and datetimes, see below.

You can, and frequently will, fill the fields by hand.  There are,
however, a couple of functions that take care of some standard
situations, in particular when parsing column-structured text files:

* ``void linearTransform(Field *field, double offset, double factor)`` --
  changes field in place to ``offset+factor*oldValue``.  Handles NULL
  correctly and silently does nothing for anything non-numeric.
* ``void parseFloatWithMagicNULL(char *src, Field *field, int start,
  int len, char *magicVal)`` -- parses a float from src[start:start+len]
  into field, writing NULL when magicVal is found in the field.
* ``void parseDouble(char *src, Field *field, int start, int len)`` --
  parses a double from src[start:start+len] into field, writing NULL if
  it's empty.
* ``void parseInt(char *src, Field *field, int start, int len)`` --
  parses a 32-bit int into field.
* ``void parseShort(char *src, Field *field, int start, int len)`` --
  parses a 16-bit int into field.
* ``void parseBlankBoolean(char *src, Field *field, int srcInd)`` --
  parses a boolean such that field becomes true when src[srcInd] is
  nonempty.
* ``void parseString(char *src, Field *field, int start, int len,
  char *space)`` -- copies len bytes starting at start from src into
  space and stuffs the whole thing into field.  You are responsible for
  allocating space; usually, a static buffer should do, since the
  postgres input is generated before the next input line is parsed.
* ``void parseChar(char *src, Field *field, int srcInd)`` -- guess.

Of course, you can also manually copy or delimit data and use
``fieldscanf`` as documented in `Boosters reading from separated data`_.


Boosters reading from binary data
---------------------------------

Your booster can also read from binary data.  You are mainly on your own
in terms of segmentation, but for entering values, you can use the
following macros:

* ``MAKE_NULL(fi)`` -- makes fi NULL
* ``MAKE_DOUBLE(fi, value)`` -- makes fi a double with value
* ``MAKE_FLOAT(fi, value)`` -- likewise, for a float
* ``MAKE_SHORT(fi, value)`` -- likewise, for a short
* ``MAKE_CHAR(fi, value)`` -- likewise, for a char
* ``MAKE_JDATE(fi, value)`` -- likewise, for a Julian date
* ``MAKE_TEXT(fi, value)`` -- note that you must manage the memory of
  value yourself.  In particular, it must not be automatic memory of
  ``getTuple``, since that will not be valid when the tuple actually is
  built.  Most commonly, you'll be using a static buffer here.
* ``MAKE_CHAR_NULL(fi, value, nullvalue)`` -- makes fi a char with value
  unless value==nullvalue; in that case, fi becomes a NULL

With these in particular, use the portable type specifiers for integral
types, viz., ``int8_t``, ``int16_t``, ``int32_t``, and ``int64_t``, and
these names with a ``u`` in front.


Boosters reading from separated data
------------------------------------

When the input data comes as xSV (e.g., values separated by vertical bars
or tabs), pass the separator to ``gavo mkboost -s``, e.g.::

  gavo mkboost -s '\t' q.rd sources

for tab-separated values.  This creates a source like::

  char *curCont = strtok(inputLine, "\t");
  fieldscanf(curCont, fi_objid, VAL_INT_64);
  curCont = strtok(NULL, "\t");
  fieldscanf(curCont, fi_run, VAL_SHORT);

etc.  Thus, the input line is split using ``strtok``, and each value is
parsed using the ``fieldscanf`` function.  This function takes the string
containing the literal as its first argument, the field index as its
second, and the type specifier as its third.

If the data comes in the sequence of the table columns, the generated
source *might* just work.


Filling in data manually
------------------------

The ``F(index)`` macro lets you access the field info directly.  So, you
could enter a fixed-length piece of memory into ``fi_magic`` like this::

  static char bufForMagic[8];

  memcpy(bufForMagic, inputLine+20, 8);
  F(fi_magic)->type = VAL_TEXT;
  F(fi_magic)->val.c_ptr = bufForMagic;
  F(fi_magic)->length = 8;

Having static buffers in ``getTuple`` is usually ok since the COPY input
is generated before ``getTuple`` is called again.

It is quite common to have to handle null values.  In the example above,
this could look like this if a NULL for magic were signified by an F in
``inputLine[19]``::

  static char bufForMagic[8];

  if (inputLine[19]=='F') {
    F(fi_magic)->type = VAL_NULL;
  } else {
    memcpy(bufForMagic, inputLine+20, 8);
    ...

Casts
'''''

Make sure you always properly cast what you read, e.g., ::

  MAKE_DOUBLE(fi_dej2000, -90+*(int32_t*)(line+4)); /* SPD */

You can use the fixed-width ``intN_t`` and ``uintN_t`` types defined by
your compiler's (or library's) headers.

Utility Functions
'''''''''''''''''

While you can, of course, manipulate ``F(fi_X)->val`` as you see fit, it
may be convenient to use boosterskel utility functions:

* ``linearTransform(fi, offset, factor)`` -- computes factor*value+offset
  for floats, doubles, and ints.
* ``mjdToJYear(mjd)`` -- returns a Julian year for mjd (for when you
  don't want actual timestamps in your tables, which you shouldn't).
* ``AS2DEG(field)`` -- turns a field value in arcsecs to degrees
* ``MAS2DEG(field)`` -- turns a field value in milli-arcsecs to degrees

Dates and times
'''''''''''''''

The boosters treat "normal" dates and datetimes as ``struct tm`` values.
If you need a larger range, use ``VAL_JDATE``, which lets you store
Julian dates in floats.  Julian dates are serialized to dates rather than
datetimes.

To parse ``VAL_DATE`` or ``VAL_DATETIME``, you will write something
like::

  fieldscanf(curCont, fi_date, VAL_DATE, "%Y-%m-%d");

if parsing from date strings.

If your input is something weird, figure out a way to generate a
``struct tm`` as defined in ``time.h``.  Then write::

  struct tm timeParts;

  timeParts.tm_sec = 12;
  ...
  timeParts.tm_year = 1920;
  F(fi_dt)->val.time = timeParts;
  F(fi_dt)->type = VAL_DATETIME;

(or ``VAL_DATE``, as the case may be).