Configuring NLP for Siren Investigate
Using the siren-nlp ingest processor in an Elasticsearch pipeline
The Siren NLP plugin provides an Elasticsearch ingest processor named siren-nlp.
You can create a pipeline that contains the siren-nlp ingest processor, index a document by using the pipeline, and view the enriched document by using the Dev Tools page in Siren Investigate.
Data can also be enriched by using the siren-nlp processor during Excel/CSV import or when using data reflection in the Elasticsearch pipeline definition step.
The resulting document contains a new field called siren.nlp, which contains data that represents the annotations that are added by Siren NLP.
Complete the following steps:
- Create a pipeline in Elasticsearch to define the NLP processing.
PUT _ingest/pipeline/nlp-pipeline
{
  "processors": [
    {
      "siren-nlp": {
        "fields": ["title", "snippet"]
      }
    }
  ]
}
- Index a document by using the nlp-pipeline.
PUT testnlp/_doc/1?pipeline=nlp-pipeline
{
  "title": "Bill Gates",
  "snippet": "Bill Gates is best known as the founder of the multi-national technology company Microsoft"
}
- View the enriched document.
GET testnlp/_source/1
Response:
{
  "snippet" : "Bill Gates is best known as the founder of the multi-national technology company Microsoft",
  "siren" : {
    "nlp" : {
      "instances" : {
        "snippet" : {
          "entity/person" : [
            ...
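The three steps above can also be scripted. The following is a minimal Python sketch, not an official client: the base URL `http://localhost:9200` and the helper names (`pipeline_request`, `index_request`, `source_request`, `send`) are assumptions made for illustration. It builds the same three requests and prints them; `send` is defined but not called, so the script runs without a cluster.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here without authentication.
ES_URL = "http://localhost:9200"


def pipeline_request(name, fields):
    """Return (method, path, body) for creating a siren-nlp ingest pipeline."""
    body = {"processors": [{"siren-nlp": {"fields": list(fields)}}]}
    return "PUT", f"/_ingest/pipeline/{name}", body


def index_request(index, doc_id, pipeline, document):
    """Return (method, path, body) for indexing a document through a pipeline."""
    return "PUT", f"/{index}/_doc/{doc_id}?pipeline={pipeline}", document


def source_request(index, doc_id):
    """Return (method, path, body) for fetching the enriched document source."""
    return "GET", f"/{index}/_source/{doc_id}", None


def send(method, path, body=None):
    """Send one request to Elasticsearch and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


steps = [
    pipeline_request("nlp-pipeline", ["title", "snippet"]),
    index_request(
        "testnlp",
        1,
        "nlp-pipeline",
        {
            "title": "Bill Gates",
            "snippet": "Bill Gates is best known as the founder of the "
                       "multi-national technology company Microsoft",
        },
    ),
    source_request("testnlp", 1),
]

# Print the requests; replace the prints with send(method, path, body)
# to run them against a live cluster.
for method, path, body in steps:
    print(method, ES_URL + path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Keeping request construction separate from sending makes it easy to inspect the payloads before they reach the cluster.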
Configuring the siren-nlp ingest processor
The siren-nlp ingest processor definition has only one compulsory field, fields, which defines the source fields that you want to process. The optional fields and their default values are described in the following table.
Name | Required | Default | Description |
---|---|---|---|
fields | yes | - | The list of fields that will be processed. |
target_field | no | siren.nlp | The new field that will be created to store the annotation. |
start_offset_field | no | start | The name of the field that will hold the start offset for each annotation. |
end_offset_field | no | end | The name of the field that will hold the end offset for each annotation. |
processors | no | Siren NLP Processors | A list of NLP processors and their configurations. For more information, see Siren NLP processors. |
include_matches | no | true | If set to |
include_ids | no | true | If set to |
include_taxonomy_annotated | no | true | If set to |
For example, you can configure the ingest processor as follows:
{
  "processors" : [
    {
      "siren-nlp" : {
        "fields": ["title", "snippet"],
        "start_offset_field": "start",
        "end_offset_field": "end",
        "include_matches": true,
        "include_ids": true,
        "include_taxonomy_annotated": true,
        "processors": [
          { "class" : "Telephone" }
        ]
      }
    }
  ]
}
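If you generate pipeline definitions programmatically, a small helper can encode the table above. The following Python sketch is an illustration only: the function name `siren_nlp_processor` and its validation rules (rejecting an empty `fields` list and unknown option names) are assumptions of this example, not behavior of the plugin. It omits any option that equals its documented default.

```python
import json

# Default values copied from the configuration table above.
DEFAULTS = {
    "target_field": "siren.nlp",
    "start_offset_field": "start",
    "end_offset_field": "end",
    "include_matches": True,
    "include_ids": True,
    "include_taxonomy_annotated": True,
}


def siren_nlp_processor(fields, **options):
    """Build a siren-nlp processor entry for an ingest pipeline body."""
    if not fields:
        # 'fields' is the only compulsory setting; rejecting an empty list
        # is this helper's own check.
        raise ValueError("'fields' must list at least one source field")
    unknown = set(options) - set(DEFAULTS) - {"processors"}
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    config = {"fields": list(fields)}
    for key, value in options.items():
        # Keep 'processors' as given; keep other options only when they
        # differ from the documented default.
        if key == "processors" or DEFAULTS[key] != value:
            config[key] = value
    return {"siren-nlp": config}


processor = siren_nlp_processor(
    ["title", "snippet"],
    include_ids=False,            # differs from the default, so it is kept
    start_offset_field="start",   # equals the default, so it is dropped
    processors=[{"class": "Telephone"}],
)
pipeline_body = {"processors": [processor]}
print(json.dumps(pipeline_body, indent=2))
```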
Output
The new target_field that is created during NLP enrichment contains the following subfields:
instances: Holds instance objects for each NLP match that is extracted, categorized first by the source field, then by the annotation type.
Each instance object contains the following fields:
- match: The exact text of the annotated span.
- start: The position (zero-indexed) of the first character of the match within the source field. The name of this field is specified by the start_offset_field setting in the pipeline and defaults to "start".
- end: The position (zero-indexed) after the last character of the match within the source field. The name of this field is specified by the end_offset_field setting in the pipeline and defaults to "end".
- type: The type value of the entity.
- fromfield: The source field that the match was found in.
- id: An identifier for the annotation, which is specific to each processor.
- Additional fields may be included, which are specific to each processor.
Example of the instances subfield and its contents:
"instances" : {
  "snippet" : {
    "entity/person" : [
      {
        "nerType" : "Person",
        "probability" : 0.9501721598726716,
        "start" : 0,
        "match" : "Bill Gates",
        "end" : 10,
        "id" : "entity/person:bill gates",
        "type" : "entity/person",
        "fromfield" : "snippet"
      },
      ...
matches: Specifies exact matches from all fields that are analyzed, categorized by entity type:
"matches" : { "entity/person" : [ "Bill Gates" ], "entity/organization" : [ "Microsoft" ] }
ids: Specifies all id values from all fields that are analyzed, categorized by entity type:
"ids" : { "entity/person" : [ "entity/person:bill gates" ], "entity/organization" : [ "entity/organization:microsoft" ] }
taxonomy_annotated: A copy of each source field with annotated text. For more information, see Using Taxonomies.
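To make the offset semantics concrete, the following Python sketch walks the instances subfield of an enriched document and checks that slicing the source field from start to end reproduces match. The helper names (`iter_instances`, `span_text`) are hypothetical, and the sketch assumes that offsets count the same units as Python string indices, which holds for the ASCII sample but is not confirmed here for all text.

```python
def iter_instances(doc, target_field="siren.nlp"):
    """Yield (source_field, annotation_type, instance) from an enriched document."""
    node = doc
    for part in target_field.split("."):
        node = node[part]
    for source_field, by_type in node.get("instances", {}).items():
        for annotation_type, instances in by_type.items():
            for instance in instances:
                yield source_field, annotation_type, instance


def span_text(doc, instance, start_key="start", end_key="end"):
    """Slice the annotated span out of the source field named by 'fromfield'."""
    text = doc[instance["fromfield"]]
    # 'end' points one past the last character, so a plain slice works.
    return text[instance[start_key]:instance[end_key]]


# Sample document based on the example output above.
doc = {
    "title": "Bill Gates",
    "snippet": "Bill Gates is best known as the founder of the "
               "multi-national technology company Microsoft",
    "siren": {"nlp": {"instances": {"snippet": {"entity/person": [
        {
            "nerType": "Person",
            "probability": 0.9501721598726716,
            "start": 0,
            "match": "Bill Gates",
            "end": 10,
            "id": "entity/person:bill gates",
            "type": "entity/person",
            "fromfield": "snippet",
        }
    ]}}}},
}

# Group the matched text by annotation type, verifying each span on the way.
matches = {}
for source_field, annotation_type, instance in iter_instances(doc):
    assert span_text(doc, instance) == instance["match"]
    matches.setdefault(annotation_type, []).append(instance["match"])

print(matches)
```

The grouping at the end resembles the matches subfield, but it is computed here only from the single sample instance.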
Siren NLP processors
The processors directive in the siren-nlp configuration is a list of processor object configurations, each with a class and, for some processors, a settings object:
{ "class" : "Url", "settings":{ "lenient": "true" } }
The following processor classes are available:
Class | Default* | Settings |
---|---|---|
Telephone | yes | - |
USTelephone | yes | - |
 | yes | - |
IPv4 | yes | - |
IPv6 | yes | - |
MacAddress | yes | - |
Url | yes | |
SortCode | yes | - |
HashTag | yes | - |
NER | yes (one for each type) | |
CustomRegex | no | |
Taxonomy | no | For more information, see Using Taxonomies. |
* If you do not include a list of processors in the siren-nlp configuration, the default processors will be included.
The following table lists the output of each processor within each instance object:
Class | Output type | Output id | Example Output Instance |
---|---|---|---|
Telephone | entity/phonenumber | entity/phonenumber:[match lowercased] | {"start": 0, "match": "tel. 01229368123", "end": 16, "id": "entity/phonenumber:tel. 01229368123", "type": "entity/phonenumber", "fromfield": "title"} |
USTelephone | entity/phonenumber | entity/phonenumber:[match lowercased] | {"start" : 0, "match" : "301-496-4000", "end" : 12, "id" : "entity/phonenumber:301-496-4000", "type" : "entity/phonenumber", "fromfield" : "title"} |
 | entity/email | entity/email:[match lowercased] | {"start": 0, "match": "email@example.com", "end": 17, "id": "entity/email:email@example.com", "type": "entity/email", "fromfield": "title"} |
IPv4 | entity/ipaddress | entity/ipaddress:[match lowercased] | {"start":0, "match": "172.16.254.1", "end": 12, "id": "entity/ipaddress:172.16.254.1", "type": "entity/ipaddress", "fromfield": "title"} |
IPv6 | entity/ipaddress | entity/ipaddress:[match lowercased] | {"start" : 0, "match": "0123:4567:89ab:cdef:0123:4567:89ab:cdef", "end": 39, "id": "entity/ipaddress:0123:4567:89ab:cdef:0123:4567:89ab:cdef", "type" : "entity/ipaddress", "fromfield" : "title"} |
MacAddress | entity/macAddress | entity/macAddress:[match lowercased] | {"start": 0, "match": "00-D0-56-F2-B5-12", "end": 17, "id": "entity/macAddress:00-d0-56-f2-b5-12", "type": "entity/macAddress", "fromfield": "snippet"} |
Url | entity/url | entity/url:[match lowercased] | {"start": 0, "match": "www.google.com", "end": 14, "id": "entity/url:www.google.com", "type": "entity/url", "fromfield": "title"} |
SortCode | entity/financialAccount | entity/financialAccount:[match lowercased] | {"start": 0, "match": "11-24-76", "end": 8, "id": "entity/financialAccount:11-24-76", "type": "entity/financialAccount", "fromfield": "title"} |
HashTag | entity/hashtag | entity/hashtag:[match lowercased] | {"start": 0, "match": "#photooftheday", "end": 14, "id": "entity/hashtag:#photooftheday", "type": "entity/hashtag", "fromfield": "title"} |
NER | entity/person, entity/organization, entity/location | entity/organization:[match lowercased], etc. | {"nerType": "Organization", "probability": 0.8328421100140456, "start": 0, "match": "IBM", "end": 3, "id": "entity/organization:ibm", "type" : "entity/organization", "fromfield" : "title"} |
CustomRegex | value given in type setting | [value in type setting]:[match lowercased] | {"start": 0, "match": "1984", "end": 4, "id": "year:1984", "type": "year", "fromfield": "title"} |
Taxonomy | value given in type setting | ID of the matched taxonomy node | For more information, see Using Taxonomies. |
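The id pattern used by the regex-based and NER processors, the type followed by a colon and the lowercased match, can be reproduced in a few lines. This Python sketch is an illustration of that pattern only; the function name `annotation_id` is hypothetical, Python's `str.lower()` may not agree with the plugin's lowercasing for every non-ASCII character, and the pattern does not apply to Taxonomy instances, whose id in the example later in this document is a taxonomy node ID.

```python
def annotation_id(annotation_type, match):
    """Compose an id as '<type>:<match lowercased>', following the table above."""
    return f"{annotation_type}:{match.lower()}"


# (type, match) pairs taken from the example instances in the table.
examples = [
    ("entity/macAddress", "00-D0-56-F2-B5-12"),
    ("entity/organization", "IBM"),
    ("entity/hashtag", "#photooftheday"),
    ("year", "1984"),
]

for annotation_type, match in examples:
    print(annotation_id(annotation_type, match))
```

Because the id is derived from the lowercased match, differently cased mentions of the same text share one id, which makes the id convenient for grouping.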
Using Taxonomies
Taxonomy indices
The Siren NLP Taxonomy processor matches text in the source field against concepts and their synonyms, which are stored as a hierarchical classification (a taxonomy). The Siren NLP plugin can read a taxonomy from an index before it is used in an indexing pipeline.
To use an index as a taxonomy in the Taxonomy processor, the index must have:
- A field with a unique ID for each record;
- A field listing the parent IDs, so that a hierarchy can be constructed connecting all of the records in the index; and
- At least one field that contains a string or a list of strings (synonyms) to match against the source field.
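Before pointing the Taxonomy processor at an index, it can help to check the records against the three requirements above. The following Python sketch is one possible check, under stated assumptions: the function name `check_taxonomy` is hypothetical, the `Car`, `Renault`, and `Sports_car` records are invented for illustration, and treating a dangling parent ID or a record without any synonym as a problem is a choice of this sketch rather than documented plugin behavior.

```python
def check_taxonomy(records, id_field, parents_field, synonym_fields):
    """Return a list of human-readable problems found in taxonomy records."""
    problems = []
    seen = set()
    for record in records:
        record_id = record.get(id_field)
        if record_id is None:
            problems.append("record without an ID")
        elif record_id in seen:
            problems.append(f"duplicate ID: {record_id}")
        else:
            seen.add(record_id)
    for record in records:
        record_id = record.get(id_field)
        for parent in record.get(parents_field, []):
            if parent not in seen:
                problems.append(f"{record_id}: unknown parent {parent}")
        has_synonym = False
        for field in synonym_fields:
            value = record.get(field)
            if isinstance(value, str):
                value = [value]  # a single string counts as one synonym
            if value:
                has_synonym = True
        if not has_synonym:
            problems.append(f"{record_id}: no synonym to match")
    return problems


records = [
    # Hypothetical parent records, added so that the hierarchy is complete.
    {"id": "Car", "synonyms": ["Car"], "parents": []},
    {"id": "Renault", "synonyms": ["Renault"], "parents": ["Car"]},
    {"id": "Sports_car", "synonyms": ["Sports car"], "parents": ["Car"]},
    # Record taken from the cars example in this document.
    {
        "id": "Renault_Alpine_GTA/A610",
        "synonyms": ["Renault Alpine GTA/A610"],
        "parents": ["Renault", "Sports_car"],
    },
]

print(check_taxonomy(records, "id", "parents", ["synonyms"]))
```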
Configuring the Taxonomy processor
The Taxonomy processor supports the following settings:
Taxonomy Setting | Required | Default | Description |
---|---|---|---|
index | yes | - | The name of the index that contains the taxonomy data. |
idField | yes | - | A field name in the taxonomy index whose value is a unique identifier for the taxonomy node. |
preferredTermField | yes | - | A field name in the taxonomy index whose value is a preferred term for the taxonomy node. |
synonymFields | yes | - | A list of field names in the taxonomy index from which to collect synonyms for matching to the ingested document. |
parentsField | yes | - | A field name in the taxonomy index whose value is a list of document IDs of the parent nodes of this one. This will be used to calculate the paths to each node by comparing with the values in the |
caseSensitive | no | false | If set to |
exactWhitespace | no | false | If set to |
plurals | no | true | If set to |
additionalData | no | false | If set to |
The following example shows a typical taxonomy record, stored in the index cars:
{ "id" : "Renault_Alpine_GTA/A610", "preferred_term" : "Renault Alpine GTA/A610", "synonyms" : [ "Renault Alpine GTA/A610" ], "parents" : [ "Renault", "Sports_car" ] }
The corresponding configuration for the Taxonomy processor might be as follows:
{
  "class": "Taxonomy",
  "settings": {
    "index": "cars",
    "idField": "id",
    "preferredTermField": "preferred_term",
    "synonymFields": ["synonyms"],
    "parentsField": "parents",
    "caseSensitive": false,
    "exactWhitespace": false,
    "plurals": true,
    "type": "taxonomy-cars",
    "additionalData": true
  }
}
When documents are indexed by using the Taxonomy processor, instance objects will be created in the target_field for each match to a synonym in each source_field.
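A Taxonomy processor entry can be assembled and checked before it is placed in the processors list. The Python sketch below is an assumption-laden illustration: the function name `taxonomy_processor` is hypothetical, the required names and defaults are copied from the settings table above, and any extra setting (such as type, which appears in the example configuration but not in the table) is passed through unchanged.

```python
import json

# Required settings and optional defaults, copied from the settings table.
REQUIRED_SETTINGS = (
    "index",
    "idField",
    "preferredTermField",
    "synonymFields",
    "parentsField",
)
OPTIONAL_DEFAULTS = {
    "caseSensitive": False,
    "exactWhitespace": False,
    "plurals": True,
    "additionalData": False,
}


def taxonomy_processor(**settings):
    """Build a Taxonomy processor entry, spelling out the optional defaults."""
    missing = [name for name in REQUIRED_SETTINGS if name not in settings]
    if missing:
        raise ValueError(f"missing required Taxonomy settings: {missing}")
    # Caller-supplied values override the documented defaults.
    return {"class": "Taxonomy", "settings": {**OPTIONAL_DEFAULTS, **settings}}


cars = taxonomy_processor(
    index="cars",
    idField="id",
    preferredTermField="preferred_term",
    synonymFields=["synonyms"],
    parentsField="parents",
    type="taxonomy-cars",
    additionalData=True,
)
print(json.dumps(cars, indent=2))
```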
Output
If you have set the additionalData parameter to true, the following fields are included in the Taxonomy instance objects:
- preferredTerm: The value in the preferredTermField.
- synonyms: All synonyms that are collected from synonymFields.
- parents: The value in the parentsField.
- id_paths: A list of strings that represent paths to the matched taxonomy node from the root of the taxonomy. They are composed of node IDs, for example, ["|Car|Cars_by_Manufacturer|Volkswagen|Volkswagen_Golf"].
- pt_paths: The same as id_paths, but each path is composed of preferred_terms instead of IDs. This is useful for display if node IDs are obscure.
- ancestors: All ancestor node IDs of this node, up to and including the root node.
The following is an example of the corresponding output:
{
  "id_paths" : [
    "|Car|Cars_by_Manufacturer|Ford|Ford_Focus",
    "|Car|Cars_by_Class|Compact_car|Ford_Focus"
  ],
  "synonyms" : [ "Ford Focus" ],
  "preferredTerm" : "Ford Focus",
  "start" : 0,
  "match" : "Ford Focus",
  "pt_paths" : [
    "|Car|Cars by Manufacturer|Ford|Ford Focus",
    "|Car|Cars by Class|Compact car|Ford Focus"
  ],
  "end" : 10,
  "id" : "Ford_Focus",
  "type" : "taxonomy-cars",
  "fromfield" : "title",
  "ancestors" : [ "Car", "Cars_by_Manufacturer", "Cars_by_Class", "Ford", "Compact_car" ],
  "parents" : [ "Ford", "Compact_car" ]
}
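The relationship between parents, id_paths, and ancestors can be illustrated with a short Python sketch. It is not the plugin's implementation: the function names are hypothetical, the parent map is inferred from the paths in the example output above, and the sketch assumes that the taxonomy has no cycles. The order of the ancestors list may differ from the order that the plugin produces.

```python
def root_paths(node_id, parents, delimiter="|"):
    """Return every path from a root node down to node_id as a string."""
    node_parents = parents.get(node_id, [])
    if not node_parents:
        # A node without parents is a root: the path is just the node itself.
        return [delimiter + node_id]
    paths = []
    for parent in node_parents:
        for path in root_paths(parent, parents, delimiter):
            paths.append(path + delimiter + node_id)
    return paths


def ancestors(node_id, parents):
    """Return all ancestor IDs of node_id, nearest first, without duplicates."""
    found = []
    queue = list(parents.get(node_id, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(parents.get(current, []))
    return found


# Parent map inferred from the id_paths in the example output.
parents = {
    "Car": [],
    "Cars_by_Manufacturer": ["Car"],
    "Cars_by_Class": ["Car"],
    "Ford": ["Cars_by_Manufacturer"],
    "Compact_car": ["Cars_by_Class"],
    "Ford_Focus": ["Ford", "Compact_car"],
}

for path in root_paths("Ford_Focus", parents):
    print(path)
print(ancestors("Ford_Focus", parents))
```

A node with several parents, such as Ford_Focus here, yields one path per route to the root, which is why id_paths is a list.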
Search features using Taxonomies
Creating an index capable of taxonomy path hierarchy search and annotated text search
You can run the following command on the Dev Tools page in Siren Investigate to create an index that supports the search features on taxonomy annotations, once data is ingested into it by using the siren-nlp Taxonomy processor:
PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "siren_taxonomy_analyzer": {
          "tokenizer": "siren_taxonomy_tokenizer"
        }
      },
      "tokenizer": {
        "siren_taxonomy_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "|"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "hierarchy": {
          "path_match": "*.instances.*.id_paths",
          "mapping": {
            "type": "text",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "siren_taxonomy_analyzer",
                "search_analyzer": "keyword",
                "fielddata": true
              }
            }
          }
        }
      },
      {
        "annotated_text": {
          "path_match": "*.taxonomy_annotated.*",
          "mapping": {
            "type": "annotated_text"
          }
        }
      }
    ]
  }
}
Path Hierarchy Search
Indices created with the index creation command above map the id_paths field of each taxonomy instance with a multi-field, tree, which is tokenized by using the path_hierarchy tokenizer. As a result, a record whose siren.nlp.instances.snippet.taxonomy-cars.id_paths field contains "|Car|Cars_by_Class|Sports_car|Alpine_A110" is returned by a search for "|Car|Cars_by_Class|Sports_car", which effectively searches for a mention of any sports car.
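The reason a prefix path matches is that the path_hierarchy tokenizer emits one token per level of the path. The Python sketch below emulates that tokenization for illustration only and then builds a query body; the function name `path_hierarchy_tokens` is hypothetical, and the use of a term query on the tree sub-field is an assumption about how such a search could be written, not a query taken from the Siren documentation.

```python
import json


def path_hierarchy_tokens(path, delimiter="|"):
    """Emulate path_hierarchy tokenization: one token per prefix of the path."""
    tokens = []
    current = ""
    for index, part in enumerate(path.split(delimiter)):
        current = part if index == 0 else current + delimiter + part
        if current:  # skip the empty prefix produced by a leading delimiter
            tokens.append(current)
    return tokens


tokens = path_hierarchy_tokens("|Car|Cars_by_Class|Sports_car|Alpine_A110")
for token in tokens:
    print(token)

# With the keyword search analyzer, the query string stays a single term,
# so it matches any document whose path has that prefix as a token.
search_path = "|Car|Cars_by_Class|Sports_car"
query = {
    "query": {
        "term": {
            "siren.nlp.instances.snippet.taxonomy-cars.id_paths.tree": search_path
        }
    }
}
print(search_path in tokens)
print(json.dumps(query, indent=2))
```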
Annotated Text Search
If the siren-nlp ingest processor has been used with include_taxonomy_annotated set to true, a new field, taxonomy_annotated, is created in target_field. It contains a subfield for each source_field, with Elastic annotated-text annotations for each ancestor term of its matching taxonomy node.
For example, if we have used a taxonomy with the structure:
Car
  > Cars_by_Manufacturer
    > Alpine
      > Alpine_A110
  > Cars_by_Class
    > Sports_car
      > Alpine_A110
the resulting taxonomy_annotated field might be:
"siren" : { "nlp" : { "taxonomy_annotated" : { "snippet" : "2017 [Alpine A110](Alpine_A110&Cars_by_Manufacturer&Car&Sports_car&Cars_by_Class&Alpine) for sale, 46000 miles, $44,000.", "title" : "[Alpine](Alpine&Cars_by_Manufacturer&Car) for sale" }, ...
In the siren.nlp.taxonomy_annotated.snippet field, the text span Alpine A110 has been annotated with all of the ancestor terms of the Alpine_A110 node in the cars taxonomy, and the span Alpine in the siren.nlp.taxonomy_annotated.title field has been annotated with the ancestor terms of the Alpine node. Note that where multiple matches partially overlap, only the first match is annotated.
If the index creation command above was used to create the index, these subfields will have an annotated_text mapping, and the text will therefore be searchable by using the taxonomy node ID as well as plain text. Note that the annotated_text mapping requires the mapper-annotated-text plugin to be installed; see Installing the mapper-annotated-text plugin.
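To inspect what a taxonomy_annotated value contains, the `[text](value&value)` markup can be split into plain text and annotations. The following Python sketch is a simplified reader written for this purpose, not the parser used by Elasticsearch: the function name `parse_annotated` is hypothetical, and the sketch ignores URL-encoding of annotation values and brackets nested inside the annotated text.

```python
import re

# Matches [annotated text](value1&value2&...).
ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def parse_annotated(text):
    """Return (plain_text, annotations) for an annotated-text string.

    Each annotation is (start, end, span, values), with offsets into plain_text.
    """
    plain_parts = []
    annotations = []
    position = 0  # current offset in the plain text built so far
    last = 0      # end of the previous match in the annotated text
    for found in ANNOTATION.finditer(text):
        before = text[last:found.start()]
        plain_parts.append(before)
        position += len(before)
        span = found.group(1)
        values = found.group(2).split("&")
        annotations.append((position, position + len(span), span, values))
        plain_parts.append(span)
        position += len(span)
        last = found.end()
    plain_parts.append(text[last:])
    return "".join(plain_parts), annotations


snippet = (
    "2017 [Alpine A110](Alpine_A110&Cars_by_Manufacturer&Car&Sports_car"
    "&Cars_by_Class&Alpine) for sale, 46000 miles, $44,000."
)
plain, annotations = parse_annotated(snippet)
print(plain)
for start, end, span, values in annotations:
    print(start, end, span, values)
```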
The following proximity query illustrates the power of the annotated text search. It aims to find all sports cars for sale by searching the snippet_nlp.taxonomy_annotated.snippet field for a mention of the word sale within 6 words of a span of text annotated with Sports_car. The field name in this example implies a target_field of snippet_nlp; with the default target_field, the field is siren.nlp.taxonomy_annotated.snippet.
{
  "query": {
    "span_near": {
      "slop": 6,
      "in_order": false,
      "clauses": [
        { "span_term": { "snippet_nlp.taxonomy_annotated.snippet": "sale" } },
        { "span_term": { "snippet_nlp.taxonomy_annotated.snippet": "Sports_car" } }
      ]
    }
  }
}
Installing the mapper-annotated-text plugin
To install the mapper-annotated-text plugin, run the following command:
$ ./elasticsearch/bin/elasticsearch-plugin install mapper-annotated-text
-> Downloading mapper-annotated-text from elastic
[=================================================] 100%
-> Installed mapper-annotated-text