Improved? ElasticSearch query building

I’ve been wrestling with SuiteCRM global search for years and have never quite conquered it, until now.

I’ve created a couple of pull requests to fix a couple of bugs I ran into:

I also had issues with a custom module to which I had added an ‘address’ field (which actually adds a few fields with prefab naming). The ‘address’ field (street address) was throwing some ‘expected an object but got a value’ (or something like that) errors when trying to index. Apparently the indexing script makes some incorrect assumptions about fields named ‘address’. Instead of tracking down the bug, I just created a new ‘address_line1’ column and manually copied the data over from the ‘address’ column before removing the ‘address’ field.

In addition, I’ve modified lib/Search/ElasticSearch/ElasticSearchEngine.php as follows:

    private function createSearchParams(SearchQuery $query): array
    {
        $searchStr = $query->getSearchString();
        $searchModules = SearchWrapper::getModules();
        $indexes = implode(',', array_map('strtolower', $searchModules));

        // Wildcard character required for Elasticsearch
        $wildcardBe = "*";

        // Override frontend wildcard character
        if (isset($GLOBALS['sugar_config']['search_wildcard_char'])) {
            $wildcardFe = $GLOBALS['sugar_config']['search_wildcard_char'];
            if ($wildcardFe !== $wildcardBe && strlen($wildcardFe) === 1) {
                $searchStr = str_replace($wildcardFe, $wildcardBe, $searchStr);
            }
        }

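        // Start the wildcard variant from the (possibly translated) search string,
        // so it is always defined before it is inspected or appended to below
        $searchStrPt1 = $searchStr;

        // Add wildcard at the beginning of the search string
        if (isset($GLOBALS['sugar_config']['search_wildcard_infront']) &&
            $GLOBALS['sugar_config']['search_wildcard_infront'] === true &&
            strpos($searchStrPt1, $wildcardBe) !== 0) {
            $searchStrPt1 = $wildcardBe . $searchStrPt1;
        }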

        // Add wildcard at the end of search string
        if ((substr_compare($searchStrPt1, $wildcardBe, -strlen($wildcardBe))) !== 0) {
            $searchStrPt1 .= $wildcardBe;
        }

        // 2xaronl Modifications 20221208
        $searchStrPt1 = str_replace(" ", $wildcardBe, $searchStrPt1);
        $searchStrPt2 = str_replace(" ", "~1 AND ", $searchStr);
        if ((substr_compare($searchStrPt2, "~1", -strlen("~1"))) !== 0) {
            $searchStrPt2 .= "~1";
        }
        $searchStrPt3 = str_replace(" ", "", $searchStr);
        $searchStrFinal = "(" . $searchStrPt1 . ")^5 OR (" . $searchStrPt2 . ") OR (" . $searchStrPt3 . ")^3";

        return [
            'index' => $indexes,
            'body' => [
                'stored_fields' => [],
                'from' => $query->getFrom(),
                'size' => $query->getSize(),
                'query' => [
                    'query_string' => [
                        'query' => $searchStrFinal, // 2xaronl Modification 20221208
                        'fields' => ['name.*^3', 'email.*^4', 'phone.*^7', 'address.*^10', '*'],
                        'analyzer' => 'standard',
                        'default_operator' => 'OR',
                        'minimum_should_match' => '66%',
                    ],
                ],
            ],
        ];
    }

I applied the changes above because the out-of-the-box ElasticSearch query just wasn’t cutting it. My company is a service-based tech company, so addresses, phone numbers, and email addresses needed to score higher; I added boosts for those fields in the ‘fields’ array. I also found that searching for ‘Some Company’ might turn up the Account but not much else. With the modified query terms above, searching for ‘Some Company’ generates the following query:

    [body] => Array
        (
            [stored_fields] => Array
                (
                )

            [from] => 0
            [size] => 15
            [query] => Array
                (
                    [query_string] => Array
                        (
                            [query] => (*Some*Company*)^5 OR (Some~1 AND Company~1) OR (SomeCompany)^3
                            [fields] => Array
                                (
                                    [0] => name.*^3
                                    [1] => email.*^4
                                    [2] => phone.*^7
                                    [3] => address.*^10
                                    [4] => *
                                )

                            [analyzer] => standard
                            [default_operator] => OR
                            [minimum_should_match] => 66%
                        )

                )

        )

The three clauses (a wildcard match, a fuzzy match on each word, and the words joined together with no spaces) help find not only exact matches for the search but also email addresses, common misspellings, and common data entry issues.
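For anyone who wants to experiment with the string building outside of SuiteCRM, here is a minimal standalone sketch of the same three-clause logic. The function name `buildQueryString` and its `$leadingWildcard` parameter are made up for illustration (in the real method that flag comes from `search_wildcard_infront` in the config), so treat it as a sketch rather than the actual engine code.

```php
<?php
// Hypothetical standalone helper; not part of SuiteCRM.
// Mirrors the three clauses built in createSearchParams() above.
function buildQueryString(string $input, bool $leadingWildcard = true): string
{
    $wildcard = '*';

    // Clause 1: wildcard match, spaces replaced by wildcards.
    $prefix = $input;
    if ($leadingWildcard && strpos($prefix, $wildcard) !== 0) {
        $prefix = $wildcard . $prefix;
    }
    if (substr($prefix, -1) !== $wildcard) {
        $prefix .= $wildcard;
    }
    $prefix = str_replace(' ', $wildcard, $prefix);

    // Clause 2: every word must match within an edit distance of 1.
    $fuzzy = str_replace(' ', '~1 AND ', $input);
    if (substr($fuzzy, -2) !== '~1') {
        $fuzzy .= '~1';
    }

    // Clause 3: the words joined together with no spaces.
    $joined = str_replace(' ', '', $input);

    return '(' . $prefix . ')^5 OR (' . $fuzzy . ') OR (' . $joined . ')^3';
}

echo buildQueryString('Some Company'), PHP_EOL;
// prints: (*Some*Company*)^5 OR (Some~1 AND Company~1) OR (SomeCompany)^3
```

Running it with other inputs is a quick way to see what will actually be sent to ElasticSearch before touching the live file.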

I’ve only been using this search strategy for about 6 business hours so far but users seem pretty happy with the results and the tests I’ve run against it have been great. I really like how I can apply different weighting to the search terms and fields but it would be nice if there was a way to do this within the config files (because what I’ve currently got is not upgrade-safe). Perhaps someone can find a way to implement ‘elasticsearch query pattern’ and ‘elasticsearch query fields’ configuration parameters that are flexible enough (or adopt something similar to the code above so I don’t need to re-apply my ‘fix’ each time I upgrade).
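To make the config idea a bit more concrete, here is a rough sketch of what such parameters could look like. The keys `search_es_query_fields` and `search_es_query_pattern`, the `{wildcard}`/`{fuzzy}`/`{joined}` placeholders, and both function names are invented for this example; as far as I know SuiteCRM has nothing like them today.

```php
<?php
// Hypothetical config keys; SuiteCRM does not define these.
const FIELDS_KEY = 'search_es_query_fields';
const PATTERN_KEY = 'search_es_query_pattern';

// Return the configured field boosts, or the ones from the snippet above.
function resolveQueryFields(array $config): array
{
    $default = ['name.*^3', 'email.*^4', 'phone.*^7', 'address.*^10', '*'];
    $fields = $config[FIELDS_KEY] ?? null;

    return (is_array($fields) && $fields !== []) ? array_values($fields) : $default;
}

// Fill a configurable pattern with the three pre-built clauses.
function resolveQueryPattern(array $config, array $clauses): string
{
    $default = '({wildcard})^5 OR ({fuzzy}) OR ({joined})^3';
    $pattern = $config[PATTERN_KEY] ?? $default;
    if (!is_string($pattern) || $pattern === '') {
        $pattern = $default;
    }

    return strtr($pattern, [
        '{wildcard}' => $clauses['wildcard'],
        '{fuzzy}' => $clauses['fuzzy'],
        '{joined}' => $clauses['joined'],
    ]);
}

// Example: no overrides configured, so the defaults apply.
$clauses = [
    'wildcard' => '*Some*Company*',
    'fuzzy' => 'Some~1 AND Company~1',
    'joined' => 'SomeCompany',
];
echo resolveQueryPattern([], $clauses), PHP_EOL;
```

With something along these lines, the boosts and the clause pattern could live in an override config file instead of the core PHP file, which is what would make the change upgrade-safe.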

I hope someone else finds this useful!

Note: I’m currently on SuiteCRM 7.12.6, but I’ve diff’d the above code against the current code on GitHub and it’s still applicable.

This is really good work and I’m sure others will find it useful.

Thanks for sharing!

Will this get incorporated into SuiteCRM in the future?

Well, there are 351 PRs waiting to be merged… :worried: so it’s anyone’s guess when (or if) this one will go in…

The two pull requests I created MAY be incorporated if/when they get reviewed and approved. They’re the first PRs I’ve submitted on GitHub, but I believe I did everything correctly to get them in line for review.

As far as the code snippet I posted above, I think it MIGHT be general enough that most people could use it with good results. I think it’s better than the default. I know for me it made ElasticSearch go from ‘meh’ search results to (so far) excellent results. With the ‘fuzziness’ added in it does tend to give more results than you’d need but the results I want seem to score significantly higher and into the top ~15 search result records. I prefer it that way because I have many people inputting data into my instance and spelling variations are common.

Users may want to customize the query building to their own needs though and I don’t know the best way to do that in a way that is easily accessible to non-programmers.

The default, if you have leading wildcards enabled, is '*foo bar*' (for a user input of ‘foo bar’) with an implied OR operator between terms. The query that gets constructed above is a bit more flexible and I think it does a better job of making the search more forgiving for humans.

To apply this snippet to try it out you can just edit the php file (save a copy of the original as a .php.bak file or something like that so you can revert the changes if you wish) and overwrite the applicable lines of code with the snippet. Once you save that php file the changes should take effect immediately. It’s not upgrade-safe so you’d need to reapply that snippet to the php file after any upgrades of SuiteCRM (or you’ll end up with the default search strategy again).

Hi @alawrence

PR 9845 may be related to the issue below:

Hopefully we can investigate this further to understand the root cause. Your contribution is very much appreciated but if it turns out to be the same issue, some further work may be necessary to fix the issue at source.

I too have been looking into the Elasticsearch functionality in Suite and have achieved similar improvements as you, in much the same way :). The result will hopefully be a lot of options in the config to allow users to fine tune the results to suit their needs. For example you can choose from standard and ngram analysers to speed up query times, boost specific fields at query time and much more.

I’ve been working on this in my personal time so it’s not quite ready for a PR yet.

Most of the recent work has been around integrating Elasticsearch into the ListView as the performance improvements are substantial. There is still a lot to go on this so hopefully I can find the time to keep it moving forward.

definitely not perfect, try searching

 *some company    *
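Input like that, with stray spaces and user-typed wildcards, gets turned into empty fuzzy terms and runs of wildcards by the string replacements in the snippet above. One possible way to guard against it, sketched here with a made-up helper name (`normalizeSearchTerms`) and not taken from SuiteCRM, is to split the input into clean words before building the clauses:

```php
<?php
// Hypothetical helper; not part of SuiteCRM.
// Splits user input into words, dropping extra whitespace and
// any leading/trailing wildcard characters the user typed.
function normalizeSearchTerms(string $input, string $wildcard = '*'): array
{
    $tokens = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return [];
    }

    $terms = [];
    foreach ($tokens as $token) {
        $clean = trim($token, $wildcard);
        if ($clean !== '') {
            $terms[] = $clean;
        }
    }

    return $terms;
}

// The problem input from this post reduces to two plain words.
echo implode(' ', normalizeSearchTerms(' *some company    *')), PHP_EOL;
// prints: some company
```

The cleaned words could then be joined with a single space and passed through the existing clause building. Whether silently dropping the user's own wildcards is the right behaviour is a judgment call.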