Configuration

The Elasticsearch client in Invenio is configured using the two configuration variables SEARCH_CLIENT_CONFIG and SEARCH_ELASTIC_HOSTS.

Invenio-Search relies on two Python packages to integrate with Elasticsearch: elasticsearch and elasticsearch-dsl.

Hosts

The hosts which the Elasticsearch client in Invenio should use are configured using the configuration variable:

invenio_search.config.SEARCH_ELASTIC_HOSTS = None

Elasticsearch hosts.

By default, Invenio connects to localhost:9200.

The value of this variable is a list of dictionaries, where each dictionary represents a host. The available keys in each dictionary are determined by the connection class, which is urllib3-based by default.

You can change the connection class via SEARCH_CLIENT_CONFIG. If you specify the hosts key in SEARCH_CLIENT_CONFIG, this configuration variable has no effect.

Clusters

In a production environment, you will normally run an Elasticsearch cluster on one or more dedicated nodes. The following example configures Invenio to use such a cluster:

SEARCH_ELASTIC_HOSTS = [
    dict(host='es1.example.org'),
    dict(host='es2.example.org'),
    dict(host='es3.example.org'),
]

The Elasticsearch client will manage a connection pool to all of these hosts, and will automatically take nodes out of the pool if they fail.

Basic authentication and SSL

By default, all traffic to Elasticsearch goes over unencrypted HTTP, because Elasticsearch does not come with built-in support for SSL unless you pay for the enterprise X-Pack addition. A cheaper alternative to X-Pack is to set up a proxy (e.g. nginx) with SSL and HTTP basic authentication support on each node.

The following example configures Invenio to use SSL and basic authentication when connecting to Elasticsearch:

params = dict(
    port=443,
    http_auth=('myuser', 'mypassword'),
    use_ssl=True,
)
SEARCH_ELASTIC_HOSTS = [
    dict(host='node1', **params),
    dict(host='node2', **params),
    dict(host='node3', **params),
]
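If you prefer not to keep credentials in your configuration file, one option is to read them from the environment. The following is a minimal sketch of the same host configuration; the environment variable names INVENIO_ES_USER and INVENIO_ES_PASSWORD are hypothetical and not defined by Invenio-Search:

```python
import os

# Hypothetical variable names (not defined by Invenio-Search); the fallback
# values are the placeholders used in the example above.
es_user = os.environ.get('INVENIO_ES_USER', 'myuser')
es_password = os.environ.get('INVENIO_ES_PASSWORD', 'mypassword')

params = dict(
    port=443,
    http_auth=(es_user, es_password),
    use_ssl=True,
)
# Build one host dictionary per node, sharing the same connection options.
SEARCH_ELASTIC_HOSTS = [
    dict(host=name, **params) for name in ('node1', 'node2', 'node3')
]
```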

Self-signed certificates

In case you are using self-signed SSL certificates on proxies in front of Elasticsearch, you will need to provide the ca_certs option:

params = dict(
    port=443,
    http_auth=('myuser', 'mypassword'),
    use_ssl=True,
    ca_certs='/etc/pki/tls/mycert.pem',
)
SEARCH_ELASTIC_HOSTS = [
    dict(host='node1', **params),
    # ...
]

Disabling SSL certificate verification

Warning

We strongly discourage you from using this method. Instead, use the method with the ca_certs option documented above.

Disabling verification of SSL certificates allows, for example, man-in-the-middle attacks and gives you a false sense of security (you could just as well use plain unencrypted HTTP).

If you are using a self-signed certificate, you may also disable verification of the SSL certificate, using the verify_certs option:

import urllib3
urllib3.disable_warnings(
    urllib3.exceptions.InsecureRequestWarning
)

params = dict(
    port=443,
    http_auth=('myuser', 'mypassword'),
    use_ssl=True,
    verify_certs=False,
    ssl_show_warn=False,  # only from 7.x+
)
SEARCH_ELASTIC_HOSTS = [
    dict(host='node1', **params),
    # ...
]

The above example also disables two warnings: the InsecureRequestWarning, via urllib3.disable_warnings, and a UserWarning, via the ssl_show_warn option. Again, we strongly discourage you from using this method. The warnings are there for a reason!

Other host options

For a full list of options for configuring the hosts, see the documentation of the connection classes.

Other options include e.g.:

  • url_prefix
  • client_cert
  • client_key

Client options

More advanced options for the Elasticsearch client are configured via the configuration variable:

invenio_search.config.SEARCH_CLIENT_CONFIG = None

Dictionary of options for the Elasticsearch client.

The value of this variable is passed to elasticsearch.Elasticsearch as keyword arguments and is used to configure the client. For the available keyword arguments, see the documentation of elasticsearch.Elasticsearch and of the transport class.

If you specify the key hosts in this dictionary, the configuration variable SEARCH_ELASTIC_HOSTS will have no effect.
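The precedence between the two variables can be sketched as follows. This is an illustration only, not Invenio-Search's actual implementation; the function build_client_kwargs is a hypothetical helper introduced for this example:

```python
def build_client_kwargs(config):
    """Return keyword arguments for the Elasticsearch client (illustrative)."""
    # Copy, so that the application's configuration is not mutated.
    kwargs = dict(config.get('SEARCH_CLIENT_CONFIG') or {})
    # An explicit 'hosts' key in SEARCH_CLIENT_CONFIG takes precedence;
    # SEARCH_ELASTIC_HOSTS is only used as a fallback.
    kwargs.setdefault('hosts', config.get('SEARCH_ELASTIC_HOSTS'))
    return kwargs


config = {
    'SEARCH_ELASTIC_HOSTS': [dict(host='es1.example.org')],
    'SEARCH_CLIENT_CONFIG': dict(timeout=30),
}
kwargs = build_client_kwargs(config)
# The result could then be passed on as elasticsearch.Elasticsearch(**kwargs).
```

Here SEARCH_CLIENT_CONFIG has no hosts key, so the hosts from SEARCH_ELASTIC_HOSTS are used alongside the timeout option.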

Timeouts

If you are running Elasticsearch on a smaller or slower machine (e.g. for development or CI), you might want to relax the timeouts and allow more retries on failure:

SEARCH_CLIENT_CONFIG = dict(
    timeout=30,
    max_retries=5,
)
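Depending on your client version, requests that time out may not be retried unless you also ask for it. The sketch below assumes that the transport class of your elasticsearch package accepts the retry_on_timeout keyword argument; check the transport class documentation for your version before relying on it:

```python
SEARCH_CLIENT_CONFIG = dict(
    # seconds to wait for a response before giving up on a request
    timeout=30,
    # number of retries before the error is propagated
    max_retries=5,
    # assumption: supported by your client version's transport class
    retry_on_timeout=True,
)
```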

Connection class

You can change the default connection class by setting the connection_class key, for example to use the requests library instead of urllib3:

from elasticsearch.connection import RequestsHttpConnection

SEARCH_CLIENT_CONFIG = dict(
    connection_class=RequestsHttpConnection
)

Note that the default urllib3 connection class is more lightweight and performant than the requests-based one. Only use the requests library if you need advanced features such as custom authentication plugins.

Connection pooling

By default urllib3 will open up to 10 connections to each node. If your application calls for more parallelism, use the maxsize parameter to raise the limit:

SEARCH_CLIENT_CONFIG = dict(
    # allow up to 25 connections to each node
    maxsize=25,
)

Hosts via client config

You may also use SEARCH_CLIENT_CONFIG instead of SEARCH_ELASTIC_HOSTS to configure the Elasticsearch hosts:

SEARCH_CLIENT_CONFIG = dict(
    hosts=[
        dict(host='es1.example.org'),
        dict(host='es2.example.org'),
        dict(host='es3.example.org'),
    ]
)

Other client options

For a full list of options for configuring the client, see the transport class documentation.

Other options include e.g.:

  • url_prefix
  • client_cert
  • client_key

Index prefixing

Elasticsearch does not provide the concept of virtual hosts, and thus the only way to use a single Elasticsearch cluster with multiple Invenio instances is via prefixing index, alias and template names. This is defined via the configuration variable:

Warning

Note that index prefixing only namespaces the names; it does not provide any access control. Multiple Invenio instances sharing the same Elasticsearch cluster all have access to each other’s indexes unless you use something like https://readonlyrest.com or the commercial X-Pack from Elasticsearch.

invenio_search.config.SEARCH_INDEX_PREFIX = ''

All indexes, aliases and templates will be prefixed with this string.

This is useful for hosting multiple instances of the app on the same Elasticsearch cluster. For example, you can set it to dev- on one app and to prod- on the other, and each will create non-colliding indices prefixed with the corresponding string.

Usage example:

# in your config.py
SEARCH_INDEX_PREFIX = 'prod-'

For templates, ensure that the placeholder __SEARCH_INDEX_PREFIX__ is added in front of your index names. This placeholder will be replaced by the value of SEARCH_INDEX_PREFIX.

Usage example in your template.json:

{
    "index_patterns": ["__SEARCH_INDEX_PREFIX__myindex-name-*"]
}
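To make the substitution concrete, the following sketch mimics the replacement on a template string. It is an illustration only, not Invenio-Search's own code; the helpers prefix_name and render_template are hypothetical names introduced for this example:

```python
import json

PLACEHOLDER = '__SEARCH_INDEX_PREFIX__'


def prefix_name(name, prefix):
    """Prepend the configured prefix to an index, alias or template name."""
    return prefix + name


def render_template(template_text, prefix):
    """Replace the placeholder in a JSON template and parse the result."""
    return json.loads(template_text.replace(PLACEHOLDER, prefix))


template_text = '{"index_patterns": ["__SEARCH_INDEX_PREFIX__myindex-name-*"]}'
rendered = render_template(template_text, 'prod-')
```

With the prefix set to prod-, the rendered index pattern only matches indexes whose names start with prod-myindex-name-.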

Index creation

By default, Invenio will create all aliases and indexes registered in the invenio_search.mappings entry point. If this is not desirable for some reason, you can control which indexes are created via the configuration variable:

invenio_search.config.SEARCH_MAPPINGS = None

List of aliases whose search mappings should be created.

  • If None all aliases (and their search mappings) defined through the invenio_search.mappings entry point in setup.py will be created.
  • Provide an empty list [] if no aliases (or their search mappings) should be created.

For example, if you don’t want to create the alias and mappings for authors:

# in your `setup.py` you would specify:
entry_points={
    'invenio_search.mappings': [
        'records = invenio_foo_bar.mappings',
        'authors = invenio_foo_bar.mappings',
    ],
}

# and in your config.py
SEARCH_MAPPINGS = ['records']
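The selection rules above can be summarised in a few lines. This is a sketch of the documented behaviour under simplified assumptions, not the library's implementation; select_aliases is a hypothetical helper, and the registered aliases are given as a plain list rather than loaded from entry points:

```python
def select_aliases(registered, search_mappings):
    """Return the aliases whose search mappings should be created."""
    # None means "no restriction": every registered alias is created.
    if search_mappings is None:
        return list(registered)
    # Otherwise keep only the registered aliases that are listed;
    # an empty list therefore selects nothing.
    return [alias for alias in registered if alias in search_mappings]


registered = ['records', 'authors']
selected = select_aliases(registered, ['records'])
```

With SEARCH_MAPPINGS = ['records'], only the records alias is selected and authors is skipped.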