Posted on April 15, 2013 10:00 pm

7 handy SQL features for data scientists

No matter where you work odds are your company uses some sort of relational database. And nearly all of these use SQL for data retrieval. So it goes without saying that ultimatley if you want to play with data, you need to learn a little bit of SQL. You might argue that SQL does a lot of the same things that you can do in R, Python, or (heaven forbid) Excel. This is true to a certain extent, but by writing the SQL code yourself there a few advantages:

  • Performing operations in the database is way more effecient
  • Easily reproducible scripts
  • Closer to the data

Demand for data scientists is booming, but even highly quantitative analysts and PhDs may have limited experience with databases. In making the transition from academia to industry, a little SQL can go a long way. Sometimes just knowing what features are out there is all it takes.

This is a post about awesome things you can do with SQL.

Why Postgres?

There are lots of database flavors: MySQL, PostgreSQL, Oracle, Microsoft SQL Server. Even with all of the varients, the SQL syntax remians relatively constant across all of them. We’re using PostgreSQL b/c it’s widely used, has handy analytical features, and it’s popular in both data warehousing and production.

  1. Generating queries from a query

    I often find the need to use data in Postgres to fetch related data found in another system such as a CRM, MongoDB, or a website. Doing basic string concatenation makes it really easy to generate these queries in mass. I’ve found it to be really useful for spidering and web scraping.

    1 2 3 4 5 6 7 8 9 10 11
    –MongoDB or other databases
    select
    ‘db.foo.findOne({_id: ObjectId(“‘ || _id || ‘”)})’
    from foo;
    –db.foo.find({_id: ObjectId(“5066fa4abd8ad53408a4869c”)})
    –web based (great for spidering)
    select
    ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as url
    from
    restaurants;

    You can use psycopg2 to query Postgres from Python and then use requests to fetch data from Yelp.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
    import psycopg2
    import requests
    conn = psycopg2.connect(“{YOUR CONNECTION”})
    cur = conn.cursor()
    q = “””
    select
    ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as url
    from
    restaurants
    limit 3;”””
    cur.execute(q)
    urls = cur.fetchall()
    print urls
    #[“http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza”,
    # ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack’,
    # ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa’]
    # find the number of times pizza is mentioned in each yelp search results
    for url in urls:
    html = requests.get(url).text
    print url, html.count(“pizza”)
    #http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza 212
    #http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack 0
    #http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa 63
  2. Basic date operations

    Dates are always tricky. Despite best efforts, dates never seem to be in the format that you need them in.

    iso_8601Fear not, Postgres has fantasic date functions to deal with this stuff.

    date_trunc, to_char, and to_timestamp can take you a pretty long way in terms for formatting and type conversion. For extracting certain values from a date (say the day of the week) you can use date_part.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    select
    date_trunc(‘year’, ‘2013-04-05’::date) as year
    , date_trunc(‘month’, ‘2013-04-05’::date) as month
    , date_trunc(‘day’, ‘2013-04-05’::date) as day;
    — year | month | day
    ————————–+————————+————————
    — 2013-01-01 00:00:00-05 | 2013-04-01 00:00:00-04 | 2013-04-05 00:00:00-04
    select
    to_char(‘2013-04-05’::date, ‘YYYY’) as year
    , to_char(‘2013-04-05’::date, ‘MM-YYYY’) as month
    , to_char(‘2013-04-05’::date, ‘YYYYMMDD’) as day;
    — year | month | day
    ——–+———+———-
    — 2013 | 04-2013 | 20130405
    select
    to_timestamp(1365560673);
    — to_timestamp
    ————————
    — 2013-04-09 22:24:33-04
    select date_part(‘day’, ‘2013-04-05’::date);
    — date_part
    ————-
    — 5
    select date_part(‘second’, to_timestamp(1365560673));
    –date_part
    ————-
    — 33
  3. Text mining

    When I’m doing text mining I’m often tempted to immediately reach for a scripting language. While a language like Python, Ruby, or R definitely should make a showing in any text mining project, I advocate going as far as you can with SQL. Postgres has great built-in string functions that’ll run much faster than any scripting language.

    My favorite Postgres string function is regexp_split_to_table which takes a piece of text, splits it by a pattern, and returns tokens as rows.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
    –use SQL for parsing text
    select
    speaker
    , regexp_split_to_table(lower(line), ‘\s+’) as word
    from
    script_ferris_bueller
    where
    speaker = ‘FERRIS’
    LIMIT 10;
    –word counts by character
    select
    speaker
    , regexp_split_to_table(lower(line), ‘\s+’) as word
    , count(1) as wc
    from
    script_ferris_bueller
    group by
    speaker
    , word
    order by
    3 desc
    LIMIT 10;
    — speaker | word | wc
    ———–+——+—–
    — FERRIS | a | 136
    — FERRIS | i | 120
    — FERRIS | to | 119
    — FERRIS | the | 112
    — FERRIS | you | 104
    — FERRIS | and | 101
    — FERRIS | of | 68
    — FERRIS | i’m | 52
    — FERRIS | in | 49
    — CAMERON | i | 49
    –(10 rows)

    See #5 below for loading the Ferris Bueller dataset.

    More string functions
  4. Median aggregate function

    Median is sort of the lost aggregate function in Postgres. It’s one of those that everyone assumes comes standard but it actually doesn’t for reasons that are beyond the scope of this post. For many DBAs and DW Engineers, it’s the first custom function they write for a new Postgres instance. Here you’ll find the page from the Postgres wiki with code snippets for a median function. In the gist below I’m using the pure SQL implementation.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
    –median (http://wiki.postgresql.org/wiki/Aggregate_Median)
    CREATE OR REPLACE FUNCTION _final_median(anyarray)
    RETURNS float8 AS
    $$
    WITH q AS
    (
    SELECT val
    FROM unnest($1) val
    WHERE VAL IS NOT NULL
    ORDER BY 1
    ),
    cnt AS
    (
    SELECT COUNT(*) AS c FROM q
    )
    SELECT AVG(val)::float8
    FROM
    (
    SELECT val FROM q
    LIMIT 2 – MOD((SELECT c FROM cnt), 2)
    OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) – 1,0)
    ) q2;
    $$
    LANGUAGE sql IMMUTABLE;
    CREATE AGGREGATE median(anyelement) (
    SFUNC=array_append,
    STYPE=anyarray,
    FINALFUNC=_final_median,
    INITCOND='{}’
    );
    –median and avg scores of NCAA football games
    SELECT
    median(score)
    ,avg(score)
    FROM
    game_results;
    — median | avg
    ———-+———————
    — 27 | 28.6536160957041871
  5. \COPY to load data into your database

    Sooner or later you’re going to want to load some of that great data you’ve scraped from Yelp or some Census CSV file into your database. Once you have your flat file ready, cd to the directory and connect to your Postgres instance.

    We’re going to load some dialogue from Ferris Bueller’s Day Off into our database. You can get the file here. Open up Postgres in the terminal:

    $ psql -d {your database} -U {your username}

    Now execute the \COPY command, specifying the table name, column names (optional), filename, and if you’re handling a CSV, then include the CSV argument. This file also has a header row so add an additional argument HEADER.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
    –create a table to store the data
    create table script_ferris_bueller
    (
    tag_type varchar,
    speaker varchar,
    line text,
    line_begin int,
    line_end int,
    movie_name varchar
    );
    –upload the data
    \COPY script_ferris_bueller FROM ‘ferris_buellers_day_off.txt’ CSV HEADER
    –more explicit but useful if your columns aren’t in order
    \COPY script_ferris_bueller(tag_type,speaker,line,line_begin,line_end,movie_name)
    FROM ‘ferris_buellers_day_off.txt’ CSV HEADER
    –turn on extended display
    \x
    –take a peak at the data
    select * from script_ferris_bueller order by random() limit 1;
    —[ RECORD 1 ]———————————–
    –tag_type | dialogue
    –speaker | FERRIS
    –line | This is optomism. It’s a common
    — | trait with my age group. Adults
    — | think it’s cute, it’s like a charming
    — | quick that infests youth. But it’s
    — | a cool thing and I think, deep down,
    — | crusty old shits wish they had some.
    — | They wish they had her, too.
    — | (points to Sloane)
    — | Sorry. She’s taken.
    –line_begin | 2188
    –line_end | 2197
    –movie_name | ferris_bueller’s_day_off

    And there you have it, you can now query Ferris Bueller’s Day Off!

    Sometimes you’ll get just one row that has some strange comma or hyphon that messes up your \COPY statement. To skirt around this, create a staging table where every column is varchar. Once the data is in your database, create a query that casts columns to the appropriate data type and deal w/ isolated issues at that point.

  6. Generating sequences

    generate_series is a great Postgres built in function for generating sequences of numbers. You can use it for creating ranges of dates and times, handling time series, funnels, etc. It also provides an easy to way enumerate over tables and prevents you from having to write for loops in your SQL code (not fun).

    MySQL version

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
    –generate_series
    select
    *
    from
    generate_series(0,100,10);
    — generate_series
    —————–
    — 0
    — 10
    — 20
    — 30
    — 40
    — 50
    — 60
    — 70
    — 80
    — 90
    — 100
    –(11 rows)
    –using our median function
    select
    median(val)
    from
    generate_series(0,100,5) as val;
    — median
    ——–
    — 50
    –(1 row)
    select
    current_date + step.i as date_series
    from
    generate_series(0,14,2) as step(i);
    — date_series
    ————-
    — 2013-04-09
    — 2013-04-11
    — 2013-04-13
    — 2013-04-15
    — 2013-04-17
    — 2013-04-19
    — 2013-04-21
    — 2013-04-23
    –(8 rows)
  7. Assorted things you should know

    Some of these are pretty standard, but I couldn’t chance leaving any of these out. I use these features every day.

    order by random() is extremely helpful when you’re just trying to get a feel for a table or database. It does pretty much what it sounds like it does–it returns things in random order! Double colon :: is shorthand for casting columns to different data types. It works well with dates (and does a pretty good job of guessing what format you’re using), and does just as good with floats, ints, and other numerics. Select into syntax is by far the easiest way I’ve found for creating tables on the fly. It’s ruined me for pre-defining tables.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
    –returns rows in random order
    select * from sometable order by random();
    –use ‘::’ for casting
    select ‘2013-01-01’::date;
    — date
    ————–
    — 2013-01-01
    select 1 / 10 as non_casted, 1::float8 / 10 as casted;
    — non_casted | casted
    ————–+——–
    — 0 | 0.1
    –creating a new table on the fly
    select
    *
    into
    newtable
    from
    oldtable;
    –or with a temp table
    select
    *
    into
    temp new_temptable
    from
    oldtable;