Joining your data
Examples in this section are based on three indices (company, person and article)
that can be created and loaded using the following request:
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_bulk?pretty' -d '
{ "index": { "_index": "company", "_id": "1" } }
{ "id": 1, "name": "Acme", "ceo": "peo1", "ceo_full_name": "Alice Wonder" }
{ "index": { "_index": "company", "_id": "2" } }
{ "id": 2, "name": "Bueno"}
{ "index": { "_index": "company", "_id": "3" } }
{ "id": 3, "name": "Ark" }
{ "index": { "_index": "person", "_id": "peo1" } }
{ "id": 1, "first_name": "Alice", "last_name": "Wonder", "employed_by": "1", "age": 50 }
{ "index": { "_index": "person", "_id": "peo2" } }
{ "id": 2, "first_name": "Bob", "last_name": "Dylan", "employed_by": "2", "age": 20 }
{ "index": { "_index": "person", "_id": "peo3" } }
{ "id": 3, "first_name": "Carol", "last_name": "Aird", "employed_by": "1", "age": 25 }
{ "index": { "_index": "article", "_id": "art1" } }
{ "id": 1, "title": "ACME Raises $1 Billion in New Funding", "mentions_company": 1 }
{ "index": { "_index": "article", "_id": "art2" } }
{ "id": 2, "title": "Best Companies of 2014", "mentions_company": [1, 2] }
{ "index": { "_index": "article", "_id": "art3" } }
{ "id": 3, "title": "What Type of Pizza Are You ?", "mentions_company": 3 }
'
These documents are indexed using the dynamic field mapping feature provided by Elasticsearch.
Join based on the metadata field _id
The metadata field _id is supported as a join key field in semi-join queries.
Example
Having indexed the company and person indices and indexed documents, you can consider joining the two indices to fetch a list of companies whose employees are in the person index by using the following request:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"], (1)
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}'
| 1 | The metadata field _id of the index company is used as the left join key field |
The response should contain two hits, as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_full_name": "Alice Wonder"
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
}
}
]
}
Suppose that the two indices are joined in order to retrieve a list of companies using the following request:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["ceo.keyword", "_id"], (1)
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}'
| 1 | The metadata field _id of the index person is used as the right join key field |
The response should contain one hit, as follows:
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_full_name": "Alice Wonder"
}
}
]
}
}
Join based on a runtime field
Runtime fields can also be used as part of a join. They can be as part of the parent index or the child index, as shown in the following example:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["rt_field_company", "rt_field_person"], (1)
"request" : {
"runtime_mappings": {
"rt_field_person": {
"type": "long",
"script": {
"source": "<put your script here>" (2)
}
}
},
"query" : {
"match_all" : {}
}
}
}
},
"runtime_mappings": {
"rt_field_company": {
"type": "long",
"script": {
"source": "<put your script here>" (3)
}
}
}
}'
| 1 | Join on two runtime fields defined in the request |
| 2 | Runtime field for person index |
| 3 | Runtime field for company index |
In this example, the join operation is performed on two runtime fields: rt_field_company and rt_field_person. The runtime mappings are defined within the request for both the parent (company) and child (person) indices. Replace <put your script here> with the appropriate Painless script to compute the desired values for each runtime field.
Join based on a composite key
Runtime fields can be used to create composite keys, that can be used to join on multiple conditions. For example, if you have first_name and last_name fields, you often want to join on both fields at the same time. You can use runtime fields to concatenate multiple fields, and use this composite key as a join key.
Another use case example is if you have fields for IP address and network port, you can create a composite key <ip_address>:<port>.
Like in the previous section, they can be defined for the parent index, the child index, or both. An example involving the company and person indices could look like this:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["ceo_full_name.keyword", "rt_full_name"], (1)
"request" : {
"runtime_mappings": {
"rt_full_name": { (2)
"type": "keyword",
"script": {
"source": "emit(doc[\"first_name.keyword\"].value + \" \" + doc[\"last_name.keyword\"].value)"
}
}
},
"query" : {
"match_all" : {}
}
}
}
}
}'
| 1 | Join between an indexed field and a runtime field |
| 2 | Runtime field for the person index, containing the concatenation of multiple fields |
In this example, a join operation is performed between the ceo_full_name field of the company index and a runtime field rt_full_name. The runtime field is a composite key: a concatenation of the first_name and last_name of employees in the person index.
This allows to join indices even when they don’t have directly corresponding keys.
Retrieving a projected field
You can use the fields parameter to retrieve the values of a projected field for each hit.
For example, the two indices can be joined to retrieve a list of companies along with the ages of all their employees using the following request:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employees_age" } } (1)
],
"query" : {
"match_all" : {}
}
}
}
},
"fields" : ["employees_age"] (2)
}
'
| 1 | Projects the field age from index person as employees_age. |
| 2 | Returns the projected field employees_age using the fields parameter. |
The response contains two hits, one for each company, with the field employees_age as follows:
{
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
},
"fields": {
"employees_age": [
25,
50
]
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
},
"fields": {
"employees_age": [
20
]
}
}
]
}
}
Retrieving a runtime field
The runtime field mapping defined in the search request allows creating a field that exists only as part of the query.
This new field can be projected like any other field by using the fields parameter.
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"runtime_mappings": {
"rt_field": {
"type": "long",
"script": {
"source": "<put your script here>" (1)
}
}
},
"project" : [
{ "field" : { "name" : "rt_field" } } (2)
],
"query" : {
"match_all" : {}
}
}
}
},
"fields" : ["rt_field"] (3)
}'
| 1 | The runtime field defined in the request |
| 2 | To project the new field |
| 3 | To fetch the new field into the response if needed |
In this example, a runtime field named rt_field is created with a Painless script.
Replace <put your script here> with the appropriate script to compute the desired value for the runtime field.
The new runtime field is projected and retrieved into the response using the fields parameter.
Retrieving a federate_object runtime field
This federate_object runtime field can be retrieved like any other runtime field by using the
fields parameter like described in the
query DSL
section.
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"runtime_mappings": {
"firstname": {
"type" : "keyword",
"script" : {
"source":"emit(doc[\"first_name.keyword\"].value)",
"lang":"painless"
}
},
"employee": { (1)
"type": "federate_object",
"fields": ["firstname", "last_name.keyword", "age"]
}
},
"project" : [
{ "field" : { "name" : "employee" } } (2)
],
"query" : {
"match_all" : {}
}
}
}
},
"fields" : ["employee"] (3)
}'
| 1 | The federate_object runtime field defined in the request |
| 2 | To project the new runtime field |
| 3 | To fetch the new field into the response if needed |
The response should contain two hits, as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 1,
"fields": {
"employee": [
{
"first_name": "Alice",
"last_name.keyword": "Wonder",
"age": 50
},
{
"first_name": "Carol",
"last_name.keyword": "Aird",
"age": 25
}
]
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"fields": {
"employee": [
{
"first_name": "Bob",
"last_name.keyword": "Dylan",
"age": 20
}
]
}
}
]
}
Projecting and retrieving nested federate_object fields
Runtime fields of type federate_object are designed to support nesting.
As an example, if you wanted to retrieve the following chain of nested `federate_object`s:
-
article→company→person
It can be done in the same way as using ordinary runtime fields as follows:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/article/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["company"],
"on" : ["mentions_company", "id"],
"request" : {
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"query" : {
"match_all" : {}
},
"runtime_mappings": {
"employees": { (1)
"type" : "federate_object",
"fields" : ["first_name.keyword", "last_name.keyword"]
}
},
"project" : [
{ "field" : { "name" : "employees" } } (2)
]
}
}
},
"runtime_mappings": {
"company" : { (1)
"type" : "federate_object",
"fields" : ["name.keyword", "employees"]
}
},
"project" : [
{ "field" : { "name" : "company", "alias": "companies_mentioned" } } (2)
]
}
}
},
"fields" : ["companies_mentioned"] (3)
}
'
| 1 | Defines a federate_object runtime field in the search request |
| 2 | Projects the federate_object runtime field from the child index |
| 3 | Returns in the response the federate_object runtime field projected from the child index |
The response should contain the following hit:
{
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"hits": [
{
"_index": "article",
"_id": "art1",
"_source": {
"id": 1,
"title": "ACME Raises $1 Billion in New Funding",
"mentions_company": 1
},
"fields": {
"companies_mentioned": [
{
"name.keyword": "Acme",
"employees": [
{
"last_name.keyword": "Wonder",
"first_name.keyword": "Alice"
},
{
"last_name.keyword": "Aird",
"first_name.keyword": "Carol"
}
]
}
]
}
},
{
"_index": "article",
"_id": "art2",
"_source": {
"id": 2,
"title": "Best Companies of 2014",
"mentions_company": [1, 2]
},
"fields": {
"companies_mentioned": [
{
"name.keyword": "Acme",
"employees": [
{
"first_name.keyword": "Alice",
"last_name.keyword": "Wonder"
},
{
"first_name.keyword": "Carol",
"last_name.keyword": "Aird"
}
]
},
{
"name.keyword": "Bueno",
"employees": {
"first_name.keyword": "Bob",
"last_name.keyword": "Dylan"
}
}
]
}
}
]
}
}
Projecting and retrieving fields of a nested data type
Runtime fields can be used to project fields of a nested data type. For instance, consider the following indices
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/teachers' -d '
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"classes": {"type": "keyword"}
}
}
}
'
curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/classes' -d '
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"students": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"surname": {
"type": "keyword"
}
}
}
}
}
}
'
As an example, we want to retrieve the full names of the teachers' students. These can be projected using runtime fields, and retrieved using the fields parameter as follows
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/teachers/_search?pretty' -d '{
"query": {
"join": {
"indices": [
"classes"
],
"on": [
"classes",
"id"
],
"request": {
"runtime_mappings": {
"fullname": {
"type": "keyword",
"script": {
"source": "for (s in params._source.students) { emit(s.name + ' ' + s.surname) }" (1)
}
}
},
"project": {
"field": {
"name": "fullname" (2)
}
},
"query": {
"match_all": {}
}
}
}
},
"fields": ["fullname"] (3)
}'
| 1 | The runtime field defined in the request |
| 2 | To project the new field |
| 3 | To fetch the new field into the response if needed |
Filter using a projected field
Runtime fields can also be used to filter on projected data.
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"runtime_mappings": {
"employee_count": {
"type": "long",
"script": "emit(doc.employees.length)" (1)
}
},
"query": {
"bool": {
"must": [
{
"join": {
"indices": ["person"],
"on": ["_id", "employed_by.keyword"],
"request": {
"project": [
{
"field": {"name": "id", "alias": "employees"} (2)
}
],
"query": {
"match_all": {}
}
}
}
}
],
"filter": [
{
"range" : {
"employee_count" : { "gt" : 1 } (3)
}
}
]
}
}
}'
| 1 | The runtime field used to collect the projected fields |
| 2 | To project the id field for all matching documents in the person index |
| 3 | To filter the documents in the company index, using the value from the runtime field |
In this example (using the company and person indices), the runtime field employee_count collects all the projected values from the person index, and returns the number of values.
This runtime field is then used in the filter clause to return only the companies that have more than 1 employee.
Sorting based on a projected field
A script-based sorting method can be used to sort the hits based on the values of a projected field, for example (using the company and person indices):
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employee_age" } }
],
"query" : {
"match_all" : {}
}
}
}
},
"runtime_mappings": {
"employee_ages": {
"type": "long",
"script": "int sum = 0; for (value in doc.employee_age) { sum += value } emit(sum);"
}
},
"sort": [
{
"employee_ages": "desc"
}
]
}'
The response should contain two hits, one for each company, sorted by the sum of their employees age as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": null,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
},
"sort": [
75
]
},
{
"_index": "company",
"_id": "2",
"_score": null,
"_source": {
"id": 2,
"name": "Bueno"
},
"sort": [
20
]
}
]
}
In this example, the join operation is performed between the company and person indices, and the projected field employee_age is created. The script-based sorting method is used to sort the hits based on the sum of employees' ages in the employee_ages runtime field. The search results are sorted in descending order, with the company having the highest sum of employees' ages appearing first in the search results.
Spatial join using range fields
Federate enables spatial joins through the application of one or more spatial conditions on range fields. For example, documents with a date field in the parent index can be joined with the child index when the date is within the retention_dates field of the child index; where retention_dates is of type date_range.
This join can be expressed as follows:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"within": {"fields": ["date", "retention_dates"] }} (2)
],
}
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
| 1 | The join using range field data types. |
| 2 | Match documents where the date is within the ranges defined by retention_dates. |
In this example, a spatial join is performed between the parent_index and child_index based on the date and retention_dates fields. The join condition requires the parent index’s date field to be within the range defined by the retention_dates field.
If the range data type field is located on the parent_index, use the contains operator instead.
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"contains": {"fields": ["retention_dates", "date"] }} (2)
],
}
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
| 1 | The join using range field data types. |
| 2 | Match documents where the range in retention_dates contains the date value. |
|
For information about accepted data types in a spatial join, see Requirements for a join and the known limitations section for spatial joins. |
Spatial join using simple fields
In addition to range fields, Federate enables spatial joins through the application of one or more spatial conditions on fields with basic data types such as numeric or date. For example, documents with a date field in the parent index can be joined with the child index where the date is within a start_date field and a end_date field of the child index. This translates to the Boolean expression date > start_date && date < end_date.
This spatial join can be expressed as follows:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [ (1)
{"gt": {"fields": ["date", "start_date"] }}, (2)
{"lt": {"fields": ["date", "end_date"] }} (3)
]
},
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
| 1 | The join is the conjunction of 2 conditions. |
| 2 | Match documents where date is greater than start_date. |
| 3 | In addition, this matches documents where date is less than end_date. |
In this example, a spatial join is performed between the parent_index and child_index based on the date, start_date, and end_date fields. The join condition requires the parent index’s date field to be greater than the child index’s start_date field and less than the child index’s end_date field.
|
Instead of combining GT(E) and LT(E) conditions to join on ranges, you can use range data types and their specialized operators: |
Handling missing fields
Spatial join with range fields
natively support open-ended ranges. However, in scenarios where using range fields is not feasible, you can use the allow_missing_bound parameter to interpret a missing value for a field (if the field is not set, or set to null) as an open bound for the range.
The join operator in the previous example can be altered to also match documents missing a value for the end_date field:
curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/siren/parent_index/_search' -d '
{
"query" : {
"join" : {
"indices" : ["child_index"],
"on" :
{
"must": [
{
"gt": {
"fields": ["date", "start_date"] (1)
}
},
{
"lt": {
"fields": ["date", "end_date"], (2)
"allow_missing_bound": "right" (3)
}
}
]
},
"request" : {
"query" : {
"match_all" : {}
}
}
}
}
}
'
| 1 | Match documents where date is greater than start_date. |
| 2 | Also match documents where date is less than end_date. |
| 3 | Also match documents where the right field (end_date) is missing a value (null or missing). |
Compatible operators and accepted syntax are detailed in the
definition of allow_missing_bounds.
|
Note that when performing spatial join operations on simple fields using the |
Scoring Capabilities
By default, the join filter returns a constant score. Therefore, the scores of the matching documents from the child set
do not affect the scores of the matching documents from the parent set. However, you can
project the document’s score from the child set
and customize the scoring of the documents from the parent set with a
script score function.
Document Score
The score of a matching document from a set may be projected using a standard field object using
the special field name _score.
{
"field" : {
"name" : "_score",
"alias" : "employee_score"
}
}
Scoring based on a projected field
A script-based scoring method can be used to customize the scoring based on the values of a projected field. For example, you can project the score of the matching documents from the child set and aggregate them into the parent document as follows:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '{
"query": {
"function_score": {
"query": {
"join": {
"indices": [ "person" ],
"on": [ "_id", "employed_by.keyword" ],
"request": {
"project" : [
{ "field" : { "name" : "_score", "alias" : "child_score" } }
],
"query": {
"match_all": {}
}
}
}
},
"functions": [
{
"script_score": {
"script": {
"lang": "painless",
"source": "float sum = 0; for (value in doc.child_score) { sum += value } return sum;"
}
}
}
],
"score_mode": "multiply",
"boost_mode": "replace"
}
}
}'
The response should contain two hits, one for each company, sorted by the sum of their child scores as follows:
{
"hits": [
{
"_index": "company",
"_id": "1",
"_score": 2,
"_source": {
"id": 1,
"name": "Acme",
"ceo": "peo1",
"ceo_fullname": "Alice Wonder"
}
},
{
"_index": "company",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Bueno"
}
}
]
}
Aggregating based on a projected field
A projected field can be used as part of the aggregation. For example, when using the company and person indices, you can project the values of
the field age of the matching documents from the person index and aggregate the documents from the company
index by using these values as follows:
curl -H 'Content-Type: application/json' 'http://localhost:9200/siren/company/_search?pretty' -d '
{
"query" : {
"join" : {
"indices" : ["person"],
"on" : ["_id", "employed_by.keyword"],
"request" : {
"project" : [
{ "field" : { "name" : "age", "alias" : "employee_ages" } }
],
"query" : {
"match_all" : {}
}
}
}
},
"aggs": {
"by_company": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"average_employee_age": {
"avg": {
"field": "employee_ages"
}
}
}
}
}
}'
The response contains an aggregation result with a bucket for each company, and a sub-aggregation average_employee_age corresponding to the average age of employees in this company.
{
"aggregations": {
"by_company": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Acme",
"doc_count": 1,
"average_employee_age": {
"value": 37.5
}
},
{
"key": "Bueno",
"doc_count": 1,
"average_employee_age": {
"value": 20
}
}
]
}
}
}
In this example, the query joins the company index with the person index using the id and employed_by fields. The query projects the age field from the person index using an alias employee_age. The aggregation is performed using a runtime field employee_ages, which iterates through the projected employee_age field. The documents are first aggregated by company name, then a sub-aggregation average_employee_age computes the average age of employees in each bucket.