ElasticSearch: advanced features
Wednesday, April 01, 2020

Let's see how to use ElasticSearch's advanced features in our application.

In my last article, we talked about using ElasticSearch as a simple full-text search engine, how to install and configure it quickly, and how to integrate it into our .NET Web application.

Today, still in the context of an e-commerce website, we are going to show how we used many ElasticSearch features to improve our searches.

Previously we used a flat Product class without nested classes to keep the search simple, but this approach has many limitations. So we introduced a new data model in which every object is modeled as an entity. A document can contain any number of related fields and values (arrays, simple and complex types), and it is stored as a JSON document.

Our Product model class has become:

public class Product
{
    public int Id { get; set; }
    public string Ean { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public Brand Brand { get; set; }
    public Category Category { get; set; }
    public Store Store { get; set; }
    public decimal Price { get; set; }
    public string Currency { get; set; }
    public int Quantity { get; set; }
    public float Rating { get; set; }
    public DateTime ReleaseDate { get; set; }
    public string Image { get; set; }
    public List<Review> Reviews { get; set; }
}

where the Brand, Category, Store, Review, and User classes are, respectively:

public class Brand
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}
 
public class Category
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}
 
public class Store
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}
 
public class Review
{
    public int Id { get; set; }
    public short Rating { get; set; }
    public string Description { get; set; }
    public User User { get; set; }
}
 
public class User
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string IPAddress { get; set; }
    public GeoIp GeoIp { get; set; }
}

GeoIp is a class from the NEST library used for geographic data.
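
If the NEST version in use does not expose a ready-made GeoIp type, a minimal equivalent can be defined by hand. The following class is a sketch, not taken from the original project: it models the fields written by the geoip processor, and the PropertyName attributes (an assumption about how you may want to map them) match the snake_case keys that the processor produces.

public class GeoIp
{
    // Hypothetical POCO for the data written by the geoip ingest processor
    [PropertyName("continent_name")] public string ContinentName { get; set; }
    [PropertyName("country_iso_code")] public string CountryIsoCode { get; set; }
    [PropertyName("region_name")] public string RegionName { get; set; }
    [PropertyName("city_name")] public string CityName { get; set; }
    [PropertyName("location")] public GeoLocation Location { get; set; }
}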

The index for products has simply been named products. We created and configured it in this way:

client.Indices.Create("products", index => index
    .Map<Product>(x => x.AutoMap())
    .Map<Brand>(x => x.AutoMap())
    .Map<Category>(x => x.AutoMap())
    .Map<Store>(x => x.AutoMap())
    .Map<Review>(x => x.AutoMap())
    .Map<User>(x =>
        x.AutoMap()
        .Properties(props => props
            .Keyword(t => t.Name("fullname"))
            .Ip(t => t.Name(dv => dv.IPAddress))
            .Object<GeoIp>(t => t.Name(dv => dv.GeoIp))
        )
    )
);

Specifically for the ElasticSearch index, we created a new property named fullname on the User class, and we defined which fields hold the geographic info that will be processed.

A useful way to make it possible to process our products before indexing is the ingest node, that is, a node where document pre-processing takes place. The ingest node intercepts all indexing requests, including bulk ones, applies all the defined transformations to their content, and then hands the documents back to the indexing API.

Ingest has to be enabled in the elasticsearch.yml configuration file with this parameter:

node.ingest: true

In our example, we used the same node both for searching and for ingesting, so we don't need to write code to manage the ingest node; however, if we want a set of dedicated ingest nodes, we have to configure the ElasticSearch client as follows:

var pool = new StaticConnectionPool(new [] 
{
    new Uri("http://ingestnode1:9200"),
    new Uri("http://ingestnode2:9200"),
    new Uri("http://ingestnode3:9200")
});
var settings = new ConnectionSettings(pool);
var client = new ElasticClient(settings);

To pre-process a document before indexing it, you need to define a pipeline that specifies a set of processors able to transform that document. There are many ready-to-use default processors. Some examples: GeoIP gets geographic info from an IP address, JSON converts a string into a JSON object, Lowercase and Uppercase change the case of a field, and Drop deletes documents matching some condition. You can also create a custom processor.

The pipeline we used in our project is:

client.Ingest.PutPipeline("product-pipeline", p => p
    .Processors(ps => ps
        .Uppercase<Brand>(s => s
            .Field(t => t.Name)
        )
        .Uppercase<Category>(s => s
            .Field(t => t.Name)
        )
        .Set<User>(s => s
            .Field("fullname")
            // Mustache template that concatenates the two indexed fields
            .Value("{{firstName}} {{lastName}}")
        )
        .GeoIp<User>(s => s
            .Field(i => i.IPAddress)
            .TargetField(i => i.GeoIp)
        )
    )
);

That pipeline processes documents so that:

  • Brand.Name and Category.Name are indexed in uppercase (Uppercase processor);
  • User.fullname contains FirstName and LastName (Set processor);
  • User.IPAddress is resolved into geographic location data (GeoIp processor).

Pipelines are saved in the ElasticSearch cluster state and, to use them, you have to specify the pipeline parameter in the indexing request, so that the ingest node knows which pipeline to use:

client.Bulk(b => b
    .Index("products")
    .Pipeline("product-pipeline")
    .Timeout("5m") 
    .Index<Product>(/*snip*/)
    .Index<Product>(/*snip*/)
    .Index<Product>(/*snip*/)
    .RequestConfiguration(rc => rc
        .RequestTimeout(TimeSpan.FromMinutes(5)) 
    )
);
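
The same pipeline can also be applied when indexing a single document. A minimal sketch (assuming product is a Product instance to index):

var indexResponse = client.Index(product, i => i
    .Index("products")
    .Pipeline("product-pipeline")
);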

In this way, we defined the indexing process so that our documents end up indexed exactly as we want.

After indexing documents with the created pipeline, we can check them by visiting http://localhost:9200/products/_search in a browser: the JSON response shows the indexed documents, including the fields added by the pipeline.
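
The same check can be done from code. A quick sketch, using the client instance configured earlier:

var checkResponse = client.Search<Product>(s => s
    .Index("products")
    .Query(q => q.MatchAll())
    .Size(10)
);
Console.WriteLine($"Indexed products found: {checkResponse.Total}");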

The searching process, as described in the last article, is based on document analysis: a process with a first phase of tokenization (splitting text into small chunks, called tokens) and a second one of normalization (which lets you match tokens that are not identical to the searched words, but similar enough to be relevant) of the text indexed for search. An analyzer performs this process.

An analyzer is built up of three main components:

  1. 0 or more character filters
  2. 1 tokenizer
  3. 0 or more token filters

There are some default analyzers ready to use but, to improve the accuracy of our searches based on our requirements, we created a custom analyzer.
A custom analyzer gives us control, during the analysis process, over any changes made to the document before tokenization, over how it is converted into tokens, and over how those tokens are normalized.
Here is our custom analyzer:

// Custom analyzer: strip HTML, tokenize into edge n-grams, then lowercase and remove stop words
var an = new CustomAnalyzer
{
    CharFilter = new List<string> { "html_strip" },
    Tokenizer = "edgeNGram",
    // Note: the "standard" token filter is a no-op and was removed in ElasticSearch 7;
    // drop it if you are targeting 7.x
    Filter = new List<string> { "standard", "lowercase", "stop" }
};

// Index settings that will hold the custom tokenizer and analyzer
var settings = new IndexSettings
{
    Analysis = new Analysis
    {
        Tokenizers = new Tokenizers(),
        Analyzers = new Analyzers()
    }
};

settings.Analysis.Tokenizers.Add("edgeNGram", new Nest.EdgeNGramTokenizer
{
    MaxGram = 15,
    MinGram = 3
});

settings.Analysis.Analyzers.Add("product-analyzer", an);

Our analyzer strips HTML, splits the text into edge n-gram tokens of 3 to 15 characters, lowercases them, and removes stop words. We can add our analyzer to the index either for one or more specific fields or as the default analyzer.

client.Indices.Create("products", c => c
    // Settings with the custom tokenizer and analyzer defined above
    .InitializeUsing(new IndexState { Settings = settings })
    // Analyzer added only for the Description property of Product
    .Map<Product>(m => m
        .AutoMap()
        .Properties(p => p
            .Text(t => t
                .Name(f => f.Description)
                .Analyzer("product-analyzer")
            )
        )
    )
);

// Analyzer added as default for the whole index: register it under the name "default"
// settings.Analysis.Analyzers.Add("default", an);

When we create a custom analyzer, we can test it by using the analyze API. These tests can also be performed on the default analyzers.

var analyzeResponse = client.Indices.Analyze(a => a
    .Tokenizer("standard")
    .Filter("lowercase", "stop")
    .Text("Lorem ipsum dolor sit amet, consectetur...")
);
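
To test the custom analyzer registered on our index, we can also point the analyze API at the index itself. A short sketch, assuming the products index created above:

var customAnalyzeResponse = client.Indices.Analyze(a => a
    .Index("products")
    .Analyzer("product-analyzer")
    .Text("Lorem ipsum dolor sit amet, consectetur...")
);
// customAnalyzeResponse.Tokens contains the tokens produced by the analyzer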

We can also use data aggregation, which provides us with data aggregated by a search query. It is based on simple blocks that can be composed to build complex aggregations. There are different types of aggregation, each one with a defined scope and output.
They can be categorized into:

  • Bucketing: buckets of documents, each defined by a key and a criterion;
  • Metric: metrics calculated over a set of documents;
  • Matrix: a series of operations on different document fields, producing results in a matrix style;
  • Pipeline: aggregations of other aggregations.

In our case, we used aggregations to get product counts by brand, category, and price range. In the following example, we define an aggregation over the price of our products:

s => s
    .Query(...)
    .Aggregations(aggs => aggs
        .Average("average_price", avg => avg.Field(p => p.Price))
        .Max("max_price", avg => avg.Field(p => p.Price))
        .Min("min_price", avg => avg.Field(p => p.Price))
    )
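
Reading the aggregated values from the response could then look like this (a sketch, assuming response is the result of the client.Search<Product> call containing the aggregations above):

var averagePrice = response.Aggregations.Average("average_price").Value;
var maxPrice = response.Aggregations.Max("max_price").Value;
var minPrice = response.Aggregations.Min("min_price").Value;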

Another useful aggregation is grouping by brand, store, or category:

s => s
     .Query(...)
     .Aggregations(aggs => aggs
         .ValueCount("products_for_category", avg => avg.Field(p => p.Category.Name))
         .ValueCount("products_for_brand", avg => avg.Field(p => p.Brand.Name))
         .ValueCount("products_for_store", avg => avg.Field(p => p.Store.Name))
     )

In this way, we can get, in real time, how many products of our search there are per category, brand, and store. Aggregated data is also useful to build dashboards, to drive searches with dynamic filters (as in e-commerce sites) and, obviously, for statistics.
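
For dynamic filters you usually also need the distinct values themselves, which a Terms aggregation provides. A minimal sketch follows; the brand.name.keyword and category.name.keyword field names are an assumption (they require a keyword sub-field on those properties), so adjust them to your actual mapping:

var filterResponse = client.Search<Product>(s => s
    .Index("products")
    .Size(0) // we only need the aggregations, not the hits
    .Aggregations(aggs => aggs
        .Terms("brands", t => t.Field("brand.name.keyword").Size(20))
        .Terms("categories", t => t.Field("category.name.keyword").Size(20))
    )
);
// Each bucket holds a distinct value (Key) and its document count (DocCount)
var brandBuckets = filterResponse.Aggregations.Terms("brands").Buckets;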

Improving your searches

As you already know, every search result has a score: a number that indicates how closely that result matches the search parameters (the higher the score, the more relevant the result). The score mainly depends on three factors: term frequency, inverse document frequency, and field length.

To exclude from the search results those with a score that is too low, we can use MinScore:

s => s
     .MinScore(0.5)
     .Query(...)

In this way, we can exclude all results with a score lower than 0.5.

Suggesters allow you to search the ElasticSearch index using terms similar to the search text. The completion suggester, for instance, is useful for autocomplete: it guides you toward the best and most relevant results while you type. The completion suggester is optimized to give back results as fast as possible, but it relies on data structures built for fast lookup, which costs resources.
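
One prerequisite, shown here only as a sketch since it belongs to the index mapping rather than to the query: the field used by a completion suggester has to be mapped as a completion field when the index is created, for example:

client.Indices.Create("products", c => c
    .Map<Product>(m => m
        .AutoMap()
        .Properties(p => p
            // Map the Name property as a completion field so suggesters can use it
            .Completion(cp => cp
                .Name(f => f.Name)
            )
        )
    )
);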

In our case, we implemented an autocomplete method based on the product name, which is invoked while typing in the search box:

s => s
    .Query(...)
    .Suggest(su => su
        .Completion("name", cs => cs
            .Field(f => f.Name)
            .Prefix(searchText) // the text typed so far in the search box
            .Fuzzy(f => f
                .Fuzziness(Fuzziness.Auto)
            )
            .Size(5)
        )
    )
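
Reading the suggested products from the response could look like this (a sketch, assuming response comes from a client.Search<Product> call using the suggester above):

var suggestedNames = response.Suggest["name"]
    .SelectMany(suggestion => suggestion.Options)
    .Select(option => option.Source.Name)
    .ToList();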

Another useful method for better searches is indices boost. When you are searching across multiple indexes, you can assign a multiplier to each of them, so that results from one index rank higher than results from another. You can use it for commercial purposes, for agreements with suppliers, or simply to make certain products stand out.
An example of indices boost is:

s => s
    .Query(...)
    .IndicesBoost(b => b
        .Add("products-1", 1.5)
        .Add("products-2", 1)
    )

In this example, we assigned a multiplier of 1.5 to the results of the products-1 index and 1 to the products-2 ones, so that products-1 results will be shown higher in the list.

Another way to improve our searches is to order the results by some parameters. In our case, we can have:

s => s
    .Query(...)
    .Sort(ss => ss
        .Descending(SortSpecialField.Score)
        .Descending(p => p.Price)
        .Descending(p => p.ReleaseDate)
        .Ascending(SortSpecialField.DocumentIndexOrder)
    )

We gave the highest priority to the score, then to the price, then to the release date and, finally, to the indexing order.

Running the project

Our sample project is a .NET Core MVC WebAPI application that provides a search box and a dashboard whose data is automatically refreshed according to the typed text. When running the project for the first time, we can load n Product objects created with the Bogus library, together with other faker classes that build random Brand, Category, Store, Review, and User objects. This gives us a populated database to run our searches against.

var productFaker = new Faker<Product>()
    .CustomInstantiator(f => new Product())
    .RuleFor(p => p.Id, f => f.IndexFaker)
    .RuleFor(p => p.Ean, f => f.Commerce.Ean13())
    .RuleFor(p => p.Name, f => f.Commerce.ProductName())
    .RuleFor(p => p.Description, f => f.Lorem.Sentence(f.Random.Int(5, 20)))
    .RuleFor(p => p.Brand, f => f.PickRandom(brands))
    .RuleFor(p => p.Category, f => f.PickRandom(categories))
    .RuleFor(p => p.Store, f => f.PickRandom(stores))
    .RuleFor(p => p.Price, f => f.Finance.Amount(1, 1000, 2))
    .RuleFor(p => p.Currency, "€")
    .RuleFor(p => p.Quantity, f => f.Random.Int(0, 1000))
    .RuleFor(p => p.Rating, f => f.Random.Float(0, 1))
    .RuleFor(p => p.ReleaseDate, f => f.Date.Past(2))
    .RuleFor(p => p.Image, f => f.Image.PicsumUrl())
    .RuleFor(p => p.Reviews, f => reviewFaker.Generate(f.Random.Int(0, 1000)));
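
Generating the fake products and indexing them through the pipeline defined earlier could then look like this (a sketch; the count of 1000 is arbitrary):

var products = productFaker.Generate(1000);
var bulkResponse = client.Bulk(b => b
    .Index("products")
    .Pipeline("product-pipeline")
    .IndexMany(products)
);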

In the middle of the page there is the dashboard, where we used the filters, analyzers, and methods introduced in this article. While typing text into the search box at the top, relevant products are suggested and the dashboard content is updated according to the search text.

Conclusions

In this article, I showed you how to use ElasticSearch to process, analyze, and search data effectively in complex real-world scenarios. I hope I have raised your interest in the topic.

The sample project with the code used in this post is available here.

See you in the next article!