Skip to content

SQL for machine learning

Yi Wang edited this page Oct 19, 2018 · 3 revisions

Fields, Columns, and the Label

Let us start with a very simple case.

Suppose that we want to regress the salary with respect the age and the gender, we'd train a model using the following SQL statement:

SELECT age, gender, salary
FROM   engineer_info, engineer_payment
WHERE  engineer_info.id = engieer_payment.id
TRAIN  DNNRegressor
WITH   hidden_units = [10, 30]
COLUMN clip(age, 18, 65), gender, cross(clip(age, 18, 65), gender)
LABEL  salary
INTO   my_first_model
;

This generates a table my_first_model, which encode

  1. inputs: age, gender
  2. columns: clipped age, gender, and the cross of clipped age and gender
  3. the label: salary

We see that we need both SELECT to specify the fields to retrieve and COLUMN for fields-to-feature mapping.


Given this model, we can infer the salary for any other group of people. For example, the execution of the following statement

SELECT id, age, sex
FROM   another_company_employee_info
INFER  my_first_model
COLUMN age, vocab(sex, ["Female", "Male"])
LABEL  expected_salary
INTO   a_new_table

should generate a new table a_new_table with fields:

  1. id, age, sex from SELECT, and
  2. expected_salary from LABEL

Again, we need COLUMN in addition to SELECT to map different field names, and even field values, to the features acceptable by the model.