Skip to content

Latest commit

 

History

History
465 lines (389 loc) · 17 KB

TUTORIAL.adoc

File metadata and controls

465 lines (389 loc) · 17 KB

Unquery tutorial

1. Introduction

Unquery is a language for querying and transforming structured documents. It was originally developed for XCiteDB, a fast and reliable document-oriented noSQL database for storing structured documents, suppoting XML and JSON, with temporal versionning and branches.

unq is a command line tool for running Unquery queries directly on json files, without loading it to a database. It can read one or multiple json files and an Unquery query, and produce a new json file with the result of the query. This tutorial shows how to write Unquery queries and use them with unq. It will not show parts of the language that are supported only by XCiteDB, such as reading XML, or querying across multiple versions and branches.

1.1. Unquery Language Overview

Unquery is a powerful and easy to use querying language. For JSON, it can be used to:

  • Extract information from a JSON file.

  • Collect information from multiple JSON files.

  • Transform a JSON file, change the way it is organized, add or remove information.

  • Perform calculations on json data, including aggregation.

One of the key properties of Unquery is that each query is formatted as json, and the structure of the result always corresponds to the structure of the query. You can think about Unquery (or at least simple Unquery queries) as a template for the result, and structure the query according to the way you want the result to look. As we would show you in this tutorial, this is one of Unquery’s key strengths.

2. Simple example: employee data

2.1. Data structure

As an introdction to Unquery, we’ll start with a simple dataset. We have multiple files with information about employees in some company. Those files are stored in the directory tutorial-samples/employees. Each file contains information about one employee in this fictituous company, and looks like:

{
    "FirstName":"John",
    "LastName":"Smith",
    "Title":"CTO",
    "Salary":320000,
    "Address": {
	"Street":"123 Oak Street",
	"City":"Austin",
	"State":"TX",
	"Zip":78705
    },
    "Dependants": [
	{"FirstName":"Jane",
	 "LastName":"Smith",
	 "Relation":"Wife"
	},
	{"FirstName":"Jake",
	 "LastName":"Smith",
	 "Relation":"Child"
	},
	{"FirstName":"Jenna",
	 "LastName":"Smith",
	 "Relation":"Child"
	}
    ]
}

2.2. Simple queries

Suppose we want to extract the field FirstName from employee1.json. All we need to do is to write a query file query1.unq (the extension is just a convention, and you can choose any name) with the following text:

"FirstNames"

Assuming unq is in your search path, you can write:

unq -f query1.unq employee1.json

And you’ll get the result:

"John"

For such a simple query, you may want to write the query directly in the command line, instead of storing the query in a file. You can do this with the -c option:

unq -c '"FirstName"' employee1.json

Note that we use single quotes around the query, so that all the special characters (including the double quotes in this case) would be passed to the program.

But wait! Didn’t I just say earlier that each query in Unquery is a json file? This doesn’t look very much like json. But in fact, a string literal (in double quotes) is a legal json value. In the simplest cases, where we just want to extract a single value, this is all we need. When we want to extract a specific field, all we need to do is to write the field name in quotes. If the field is in a nested object, we can use the syntax field1.field2. For example, try query2.unq:

"Address.City"

And you’ll get the response: "Austin".

Similarly, you can extract information from a specific element inside an array. For example, to get the first name of the first dependant, you can write:

"Dependants[0].FirstName"

And get the result: "Jane".

Note that in case we have any characters other than alphanumeric and underscore, we need to enclose it in backticks. For example, if the field name was "First Name", we should write:

"`First Name`"

Instead of just extracting a single field, we can specify multiple fields we want to extract, as an object:

{
  "fullname": "FirstName+' '+LastName",
  "title":"Title",
  "city": "Address.City"
 }

This looks much more like what we think of as json. And it shows how Unquery can express queries as templates. The result would be:

 {
    "fullname": "John Smith",
    "title": "CTO",
    "city": "Austin"
}

Two things to note here. First, the keys in this object don’t have to match the keys in the original json. We renamed Title as title, etc. Also, notice that we used an expression that connects first and last name into a single string, with the concatenation operator, +.

We can also create new nested objects. For example, if we want to group title and salary together inside an "employment" object, we can write the query:

{
  "name": "FirstName+' '+LastName",
  "employment": {
     "title":"Title",
     "salary":"Salary"
  }
}

2.3. Collecting information

Suppose we want to collect information from multiple json files. We can try, for example, to run the query "FirstName" on all the employees:

unq -c '"FirstName"' *.json

Oh no! This is not what we expected. The result is only "John" again. What happened to all the other names? Recall that the output structure should correspond to the query structure. In this case, we expect to get an array of names. So we need to specify an array in the query with […​]:

["FirstName"]

Now we get an array with all the first names. In general, an array in Unquery contains a single element, which could be either a string literal or a more complex value (an object, or another array etc.). Once the query is evluated, the array is expended to include all the values from all the files we process. Without square brackets, all we can expect is a single value (or a single object).

The array we get with the above query is unsorted. We can sort the results by adding a sorting directive. For example, to sort the first names in ascending order, we can write:

["FirstName@ascending"]

Similarly, we can use @descending and also @unique_ascending and @unique_descending to sort and remove duplicates.

We can also use conditions to filter the results (similar to WHERE clauses in SQL. There are multiple ways to specify a condition in Unquery. One of them is a predicate at the end of the value expression, with ? followed by a condition. For example, to get the last name of all employees earning over 200,000, we can write:

["LastName?Salary>200000"]

Now, suppose we want to get first and last name and title for all developers. We can write:

[{
	"FirstName":"FirstName",
	"LastName":"LastName",
	"Title":"Title contains 'Developer'"
}]

Note that in this case, we didn’t use the ?. When we have a constraint on a value we display, we can write the costraint directly on that value.

Another way to filter the results is using the #if directive. For example, suppose we want to list employees with three dependants or more. We can use the query:

[{
	"#if":"$size(Dependants)>=3",
	"FirstName":"FirstName",
	"LastName":"LastName"
}]

The function $size returns the size of an array. At this point, you might ask: what if we want to filter by the number of children, and not all dependants? Can we do this? Of course we can. We’ll get back to this later.

Suppose we want to collect all the information we have on employees, and not just specific fields. We can do this using the dot operator. "." stands for the current value we handle. If we didn’t use any context modifiers (more on this later), it’s the entire file. So the query:

["."]

Would create one big array, containing all employee data.

We can also use aggregation functions to calculate all sort of values such as min, max, sum, average etc. For example:

"$avg(Salary)"

Would return the average salary in the company.

"$count"

Would return the total number of employees. We can also combine aggregation functions with predicates. This would make the function do the aggregation only for elements where the condition is true. For example:

$count?Salary>200000

Would return the total number of employees with salary greater than 200,000.

2.4. Context modifiers

Context modifiers are one of the most powerful and versatile mechanisms in Unquery. The context is the path we use in different places in the query. By default, the context is an empty path, and all fields are relative to the top node, but we can change it with a context modifier. For example, suppose we want to get the full address in one string. We can do it with the expression:

[{
	"FullAddress":"Address.Street+' '+Address.City+' '+Address.State+', '+Address.Zip"
}]

But if we change the context to Address, we no longer have to write it over and over in the expression. We change the context using a context modifier, which is an expression following a colon in the key name:

[{
	"FullAddress:Address":"Street+' '+City+' '+State+', '+Zip"
}]

Context modifiers can do more than just changing the path. It can be used to iterate over multiple paths. We do that using Two square brackets without any index inside it. If, for example, we want to get all employees childrens' names, we can run the query:

{
	"Childen:Dependants[]?Relation='Child'" :
	[
		"FirstName+' '+LastName"
	]
}

In this case, we also used a predicate following the context modifier. In this query, it would have the same effect as adding a predicate after the value expression. Also note that the result here is a single array containing all children for all employees. If we added outer square brachets, this would change, and we’ll have a seperate array of chilren for each employee:

[{
	"Childen:Dependants[]?Relation='Child'" :
	[
		"FirstName+' '+LastName"
	]
}]

Predicates can be used to control which elements to traverse in other ways. For example, we can use the $index function to traverse only certain elements in the array:

[{
	"Dependants:Dependants[]?$index<2":
	[
		"FirstName+' '+LastName"
	]
}]

We can also combine multiple context modifiers with the || (context-or) operator. For example, if we want to collect both employee names and dependant names into a single array, we can run the query:

{
        "names:.||Dependants[]":
        [
                "FirstName+' '+LastName"
        ]
}

Note that the context here is either . (the current path) or Dependants[].

When aggregation functions are within a context modifier, those functions can be used to aggregate inside an array. For example, if we want the number of children for each employee, we can do:

[{
        "FirstName:" : ".",
        "LastName:" : ".",
        "NumOfChildren:Dependants[]":"$count?Relation='Child'"
}]

In the above example, note the use of : without any context modifier following the colon. In this case, the context modifier is the same as the field names, so "FirstName:" is the same as "FirstName:FirstName".

This brings us to the question we had earlier in this tutorial: how do we filter employees based on their number of children? We can do it using context modifiers and aggregation function, as in the following query:

[{
	"#if:Dependants[]?Relation='Child'": "$count>1",
	"FirstName:" : ".",
	"LastName:" : "."
}]

Though Unquery allows using aggregation functions in conditions, as shown in the above example, note that this should be done with caution, and it is not always possible. Unquery allows comparing aggregation functions with a constant literal (e.g. a number), but does not allow comparing aggregation function with some field value in the json file.

2.5. Dynamic key names

So far, the key names we used were constants. But Unquery allows us to use expressions for key names as well. The function $(expression) allows us to evaluate the expression and use it as a value. For example:

{
        "$(LastName)":"."
}

This query, for each json file, would add a field with the LastName as a key name, and would store the content of that json (i.e. the employee record) in that field. This would generate a "directionay style" object with all the last names, and for each last name, we would have the matching employee record. This would work if each employee have a unique last name. But what happens if there are two employees with the same last name? We need an array of employee records after each last name, but since we don’t have an array in the query, we would get only one employee record for each last name. We could also combine the last and first names for the key:

{
        "$(LastName+' '+FirstName)":"."
}

But we would still have a problem if there are two or more employees with the same first and last name. Or, we can simply add square brackets after the key, so that we would have an array of employee records for each name:

{
        "$(LastName)":["."]
}

This would add each employee record to the array of records with the same last name. The effect of such query is similar to GROUP BY in SQL. We can use this pattern to group elements by a field name or any other expression. We can use a similar query to group dependants by relation type (child, husband, wife, etc.):

{
        "dependants:Dependants[]": {
                "$(Relation)":["."]
        }
}

Similarly, we can perform aggregation on each group. For example, if we want to count dependants by each relation type:

{
	"dependants:Dependants[]": {
		"$(Relation)":"$count"
	}
}

2.6. Field iterators

We showed earlier we can use context modifiers to iterate over array elements. But what if we want to iterate over fields of an object, instead of an array? Fortunately, Unquery allows use to do it in the same way. We just need to use {} in the context modifier instead of [].

For example, we can create an dictionary-style object containing all employee records, with the ID as the key:

unq -c '{"$(ID)":"."}' employee*.json > dict-employees.json

If we want to go back from the dictionary-style object to an array of employee records, we can run:

unq -c '{"employees:{}":["."]}' dict-employees.json

We could use field iterators to do many other things. For example, to add a field to an object. Suppose we want to collect all employees into a dictionary-style object with ID as key, and add the filename for each employee record. We can run the query:

{
	"$(ID):{}": {
		    "Filename":"$filename",
		    "$key":"."
	}
}

Here, $key is the name of the last key in the pass. And in this case, since we iterate over all the fields, each time $key is the field selected in that iteration. Recall that Unquery doesn’t create duplicate keys with the same name. So we add Filename only once.

Similarly, we can also remove fields. Suppose, again, we create a dictionary, with the ID as the key. Since we already use ID as key, we want to avoid redundancy, and remove it from the employee record. To do this, we use field iterator, with a predicate that filters out the ID field:

{
        "$(ID):{}?$key!='ID'": {
                    "$key":"."
        }
}

There are cases where using context modifiers can be cumbersome. For example, suppose we want to collect all employee records into an array, and add the filename. We can do:

[{
	"employee:{}": {
		"Filename":"$filename",
		"$key":"."
	}
}]

If we want to use context modifiers, it is necessary to add another wrapping object with the employee key so that we would have a place for the field iterator as a context modifier. But this is not exactly the json structure we want. Fortunately, we have another way to iterate over fields, that doesn’t involve context modifiers. We can use {} as the field name, and this would be replaced with all the fields in the object:

[{
	"Filename":"$filename",
	"{}:": "."
}]

Similarly, we can also remove fields:

[{
	"{}:": ".?$key!='Dependants'"
}]

2.7. Using variables

For more advanced queries, Unquery allows the user to define variables and use them. The simplest way to use variable is to store some value and use it later (possibly multiple times). For example, we want to create an array of all dependants, and for each dependant object, add the employee name. We can use a variable to remember the employee name before we change the context to Dependants[]:

{
	"#var employee":"FirstName+' '+LastName",
	"all_dependants:Dependants[]": [{
		"{}:":".",
		"Employee":"$var(employee)"
	}]
}

There are many other uses for variables. For example, suppose we want to change state names to full name instead of abbrevations. We can define a variable with a dictionary that translates each abbrevaion with the state’s full name, and use it:

{
        "#var states": {
                "CA": "'California'",
                "NJ": "'New Jersey'",
                "NY": "'New York'",
                "TX": "'Texas'"
        },
        "employees": [{
                     "FirstName:":".",
                     "LastName:":".",
                     "State":"$var(states).$(Address.State)"
        }]
}