Markus Demleitner
It’s allowed to cut-and-paste the queries…
Also, there’s a PDF version that you can print and keep as a makeshift reference.
The queries and problems in the text are supposed to be run against the server ivo://uni-heidelberg.de/gaia/tap with the access URL http://gaia.ari.uni-heidelberg.de/tap (unless something else is specified).
Gaia Data Queries with ADQL
T(able) A(ccess) P(rotocol)
A(stronomical) D(ata) Q(uery) L(anguage)
Open a browser on http://docs.g-vo.org/adql-gaia/html
Data Intensive Science
Using many data collections
Point (1) requires standard formats and access protocols to the data, point (2) means moving the data to your box and operating on it with FORTRAN and grep becomes infeasible.
The Virtual Observatory (VO) in general is about solving problem (1), TAP/ADQL in particular about (2).
A First Query
In TAP URL: at the bottom of the window, enter http://gaia.ari.uni-heidelberg.de/tap and click ”Use Service”.
At the bottom of the form, at Mode: check “Synchronous” and enter
SELECT TOP 1 1+1 AS result FROM gaiadr1.tgas_source
in the text box, then click “Ok”. This should give you a table with a single 2 in it. If that hasn’t worked complain.Copying and Pasting from http://docs.g-vo.org/adql-gaia is legal.
Note that in the top part of the dialog there’s metadata on the tables exposed by the service (in particular, the names of the tables and the descriptions, units, etc., of the columns). Use that when you construct queries later.
Why SQL?
SQL has been chosen as a base because
Solid theory behind it (relational algebra)
Relational Algebra
Good News: You don’t need to know any of this. But it’s reassuring to know that there’s a solid theory behind all of this.
SELECT for real
SELECT [TOP setLimit] selectList FROM fromClause [WHERE conditions] [GROUP BY columns] [ORDER BY columns]
In reality, there are yet a few more things you can write, but what’s shown covers most things you’ll want to do. The real magic is in selectList, fromClause (in particular), and conditions.
TOP
setLimit: just an integer giving how many rows you want returned.
SELECT TOP 5 * FROM gaiadr1.tgas_source
SELECT TOP 10 * FROM gaiadr1.tgas_source
SELECT: ORDER BY
ORDER BY takes columns: a list of column names (or expressions), and you can add ASC (the default) or DESC (descending order):
SELECT TOP 5 source_id, parallaxFROM gaiadr1.tgas_source
ORDER BY parallax
SELECT TOP 5 source_id, parallax
FROM gaiadr1.tgas_source
ORDER BY parallax DESC
SELECT TOP 5 source_id, phot_g_mean_mag , parallax
FROM gaiadr1.tgas_source
ORDER BY phot_g_mean_mag, parallax
Note that ordering is outside of the relational model. That sometimes matters because it may mess up query planning (a rearrangement of relational expressions done by the database engine to make them run faster)
1
SELECT TOP 20source_id, phot_g_mean_mag
FROM gaiadr1.tgas_source
ORDER BY phot_g_mean_mag
SELECT: what?
SQL expressions are not very different from those of other programming languages.
SELECT TOP 10
source_id,
SQRT(POWER(pmdec_error,2)+POWER(pmra_error,2)) AS pm_errTot
FROM gaiadr1.tgas_source
The value literals are as usual:
The usual arithmetic, comparison, and logical operators work as expected:
+, -, *, /; as in C, there is no power operator in ADQL. Use the POWER function instead.
Here’s a list of ADQL functions:
Trigonometric functions, arguments/results in rad: ACOS, ASIN, ATAN, ATAN2, COS, SIN, TAN; atan2(y,x) returns the inverse tangent in the right quadrant and thus avoids the degeneracy of atan(y∕x).
Note that all names in SQL (column names, table names, etc) are case-insensitive (i.e., VAR and var denote the same thing). You can force case-sensitivity (and use SQL reserved words as identifiers) by putting the identifiers in double quotes (that’s called delimited identifiers). Don’t do that if you can help it, since the full rules for how delimited identifiers interact with normal ones are difficult and confusing.
Also note how I used AS to rename a column. You can use the names assigned in this way in, e.g., ORDER BY:
SELECT TOP 10source_id,
SQRT(POWER(pmdec_error,2)+POWER(pmra_error,2)) AS pm_errTot
FROM gaiadr1.tgas_source
ORDER BY pm_errTot
To select all columns, use *
SELECT TOP 10 * FROM gaiadr1.tgas_source
Use COUNT(*) to figure out how many items there are.
SELECT count(*) AS numEntries
FROM gaiadr1.tgas_source
COUNT is what’s called an aggregate function in SQL: A function taking a set of values and returning a single value. The other aggregate functions in ADQL are (all these take an expression as argument; count is special with its asterisk):
MAX, MIN
1
SELECT TOP 20source_id, ra, dec,
SQRT(POWER(pmra,2)+POWER(pmdec,2))/1000 AS pmtot,
phot_g_mean_mag
FROM gaiadr1.tgas_source
ORDER BY pmtot DESC
SELECT: WHERE clause
Behind the WHERE is a logical expression; these are similar to other languages as well, with operators AND, OR, and NOT.
SELECT source_id, ra, decFROM gaiadr1.tgas_source
WHERE
phot_g_mean_flux > 13
AND parallax < 0.2
SELECT TOP 20 source_id, ra, dec, phot_g_mean_mag,Notice, that the WHERE clause is necessary because the log of a negativ number must fail.5+5*log10(parallax/1000)+phot_g_mean_mag as absmag
FROM gaiadr1.tgas_source
WHERE parallax > 1
ORDER BY phot_g_mean_mag DESC
SELECT: Grouping
SELECT COUNT(*) AS n,ROUND(phot_g_mean_mag) AS bin,
AVG(parallax) AS parallax_mean
FROM gaiadr1.tgas_source
GROUP BY bin
ORDER BY bin
Note how the aggregate functions interact with grouping (they compute values for each group).
Also note the renaming using AS. You can do that for columns (so your expressions are more compact) as well as for tables (this becomes handy with joins).
For simple GROUP applications, you can shortcut using DISTINCT (which basically computes the “domain”).
SELECT DISTINCTROUND(phot_g_mean_mag), ROUND(parallax)
FROM gaiadr1.tgas_source
1
SELECTROUND(phot_g_mean_mag) AS bin,
COUNT(*) AS n,
AVG(SQRT(POWER(pmra,2)+POWER(pmdec,2))) AS pmavg
FROM gaiadr1.tgas_source
GROUP BY bin
ORDER BY bin
SELECT: JOIN USING
SELECT TOP 10 h1.ra, h1.dec, h1.hip, t1.hipCheck the tables in the Table Metadata shown by TOPCAT: astroref is from hipparcos, hp_mag from hipparcos_newreduction; both tables have a hip column.FROM hipparcos AS h1
JOIN tycho2 AS t1
USING (hip)
JOIN is a combination of cartesian product and a select.
FROM hipparcos AS h1
JOIN tycho2 AS t1
USING (hip)
yields the cartesian product of the hipparcos and tycho2 tables but only retains the rows in which the hip columns in both tables agree.
Note that while the hip column we’re joining on is in both tables but only occurs once in the joined table.
SELECT: JOIN ON
SELECT TOP 20 source_id, h.hip
FROM gaiadr1.tgas_source AS tgas
LEFT OUTER JOIN hipparcos as h ON (tgas.phot_g_mean_mag BETWEEN
h.hpmag -0.05 AND h.hpmag+0.05)
This particular query gives, for each source_id in tgas_source, all ids from ucas4 belonging to stars having about the same aperture magnitude as the mean g magnitude given in tgas_source. This doesn’t make any sense, but you may get the idea.
There are various kinds of joins, depending on what elements of the cartesian product are being retained. First note that in a normal join, rows from either table that have no “match” in the other table get dropped. Since that’s not always what you want, there are join variants that let you keep certain rows. In short (you’ll probably have to read up on this):
t1 INNER JOIN t2 (INNER is the default and is usually omitted): Keep all elements in the cartesian product that satisfy the join condition.
Geometries
Keep the crossmatch pattern somewhere handy (everything is in degrees):
SELECT TOP 5
source_id, tgas.ra, tgas.dec, tm.raj2000,
tm.dej2000, hmag, e_hmag
FROM gaiadr1.tgas_source as tgas
JOIN twomass AS tm
ON 1=CONTAINS (
POINT(’ICRS’, tm.raj2000, tm.dej2000),
CIRCLE(’ICRS’, tgas.ra, tgas.dec, 1.5/3600))
In theory, you could use reference systems other than ICRS (e.g., GALACTIC, FK4) and hope the server converts the positions, but I’d avoid constructions with multiple systems – even if the server implements the stuff correctly, it’s most likely going to be slow.
When catalogs are on different epochs, you may need to account for proper motions to match faster stars. You should, however, not apply the proper motions in the primary selection. If you do that, the index cannot be used, and your query will waste a lot of CPU and disk bandwidth. Instead, decide about the maximum proper motion your objects might have (to get an idea, of the statistics, try selecting the fastest stars from ppmx.data – apart from the fact that the catalog got the fastest stars pretty wrong with two copies of some fast stars, there’s only a handful stars moving faster than four arcsecs per year).
Then multiply this with your epoch difference and make that your initial crossmatch radius. Then filter out the spurious matches with an extra where clause taking into account the proper motions. For moderate epoch differences, don’t worry about going into the tangential plane to apply proper motions and, for now, say something like
SELECT
TOP 30
*
FROM ppmxl.main AS m
JOIN gaia.dr1 AS g
ON 1=CONTAINS(POINT(’ICRS’, m.raj2000, m.dej2000),
CIRCLE(’ICRS’, g.raj2000, g.dej2000, 30./3600.))
WHERE 1=CONTAINS(POINT(’ICRS’,
m.raj2000+m.pmra*COS(RADIANS(m.dej2000))*15,
m.dej2000+m.pmde*15),
CIRCLE(’ICRS’, g.raj2000, g.dej2000, 0.5/3600.))
The 15 is because Gaia DR1 is on J2015, whereas PPMXL is on J2000. Also, be careful with the units – in many catalogs, positions and proper motions are given in different units.
Also note how the outer PM-based filter is just a WHERE-clause. Since JOIN is a combination of operators of the relational algebra, the result of a join is a relation again and thus can be treated like any other table.
1
SELECT TOP 5000 hipp.hip, id,SQRT(POWER(hipp.pmra,2)+POWER(hipp.pmdec,2))/1000 AS pmtothip,
SQRT(POWER(t2.pmra,2)+POWER(t2.pmde,2))/1000 AS pmtott2
FROM hipparcos as hipp
JOIN tycho2 AS t2 ON (
1=CONTAINS(
POINT(’ICRS’, hipp.ra, hipp.dec),
CIRCLE(’ICRS’, t2.ra, t2.dec, 0.001)))
Subqueries
SELECT count(*) as n, round((hmag-jmag)*2) as bin
FROM (
SELECT TOP 4000 * FROM twomass) AS q
GROUP BY bin
ORDER BY bin
TAP: Uploads
Note that not all servers already support uploads. If one doesn’t, politely ask the operators for it.
Example: Take a subset of tgas_source with positions and proper motions and crossmatch it with sdss to get colors. First we make the subset with:
SELECT TOP 200source_id, ra, dec, pmra, pmdec
FROM gaiadr1.tgas_source
WHERE 1=CONTAINS(POINT(’ICRS’, raj2000, dej2000),
CIRCLE(’ICRS’, 18.02, 9.281, 4.0 ))
TAP: Uploads 2
Then we change the TAP Service to http://dc.zah.uni-heidelberg.de/tap and perform the following query:
SELECT TOP 100
tgas.*, sdss.u, sdss.i, sdss.r, sdss.g
FROM sdssdr7.sources AS sdss
JOIN TAP_UPLOAD.t1 AS tgas
ON 1=CONTAINS(
POINT(’ICRS’, sdss.ra, sdss.dec),
CIRCLE(’ICRS’, tgas.ra, tgas.dec, 3./3600.))
You must replace the 1 in tap_upload.t1 with the index of the table you want to match.
You may also need to adjust the column names of RA and Dec for your table, and the match radius.