![]() ![]() We could consider basing our search on zoekt somehow as well.Figure 6. Sourcegraph actually supports this and has become quite popular recently and some people are asking us why our search doesn't do this.They eventually took some time off and realised they needed to re-write everything from scratch It seems like at some point GitHub actually started building a specific "exact match" feature but it was sunset after a beta. It's also a very popular question in Stack Overflow because many people expect code search to behave this way. GitHub originally did a similar thing to GitLab as they are using Elasticsearch and they list a similar list of limitations at.I didn't find any features in Elasticsearch to do this style of searching but we might be able to implement something like this using a plugin. Using the trigram index approach above as well as the positions index_options we will probably have the necessary data in our index to do this. The zoekt architecture docs show a clever optimization that looks for trigrams at the correct distance apart. It should be possible to just make a regex that captures every 3 characters to make the trigram index. This is similar to the approach taken by Google's internal code search tool and will give all matches for the regex with fairly good performance and fairly low storage usage. In order to make the inverted index contain all the characters necessary for exact matching we will need to tokenize every 3 characters to create an "trigram index". ![]() The same technique will apply for substring matching (which is just a subset of regexes that don't use any regex features).Įfficient searching in Elasticsearch requires searching tokens in the inverted index. I described in #4175 (closed) a solution for regexes. Technical details in implementation on Elasticsearch We cannot use any out of the box features in Elasticsearch to do this so we'll need to build the indexing/searching logic ourselves on top of Elasticsearch OR we'll need to use a different technology for the code index. We need to populate a search index that can efficiently do partial substring matches. We've attempted to do this a lot with the iterations we took in &3621 (closed) but it's not perfect and we'll forever be chasing the long tail of patterns that a user may wish to search for and even then it will only solve problem (1) above so we need to take a drastically different approach. ![]() This design choice in Elasticsearch effectively means we need to know ahead of time what kinds of searches users will be performing before they perform them. The way it works is that, during indexing time, we are extracting a set of "tokens" from the text of the file being indexed using a series of regexes and then later when you search Elasticsearch turns your search term into tokens and tries to match those tokens. Today our advanced code search is powered by Elasticsearch. These almost all behave, by default, as an "exact substring match". Users are actually expecting code search to be more like a "grep" experience or the "find" function in their IDE. ![]() For the same reason we can't even reliably highlight the matching line and our line matching today is just a guess and often unhelpfulĪll these problems generally stem from the fact that Elasticsearch is designed for normal english text and normal english text search engines and not for code. We cannot match multiple lines (or text that appears only in a single line) in a file because Elasticsearch doesn't have a concept of "lines" and is actually matching multiple fields so your terms may span multiple lines (or appear in the filename for example).We do support wildcards at the end of a word but it would not be possible to search for a word from the middle You cannot search part of a word, but only whole words.Searching for samples of code including special characters (some special characters work some of the time but it's mostly only alphanumeric characters that are searchable).The following things are not possible today with code search and they are likely things that users want to do: You can follow the progress in that epic. UPDATE : We are working on replacing our code search engine with Zoekt in &9404 as a solution to this problem. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |