fzy

terminal fuzzy finder picker
ALGORITHM.md

(6787B)
      1 
      2 This document describes the scoring algorithm of fzy as well as the algorithm
      3 of other similar projects.
      4 
      5 # Matching vs Scoring
      6 
      7 I like to split the problem a fuzzy matchers into two subproblems: matching and scoring.
      8 
      9 Matching determines which results are eligible for the list.
     10 All the projects here consider this to be the same problem, matching the
     11 candidate strings against the search string with any number of gaps.
     12 
     13 Scoring determines the order in which the results are sorted.
     14 Since scoring is tasked with finding what the human user intended, there is no
     15 correct solution. As a result there are large variety in scoring strategies.
     16 
     17 # fzy's matching
     18 
     19 Generally, more time is taken in matching rather than scoring, so it is
     20 important that matching be as fast as possible. If this were case sensitive it
     21 would be a simple loop calling strchr, but since it needs to be case
     22 insensitive.
     23 
     24 # fzy's scoring
     25 
     26 fzy treats scoring as a modified [edit
     27 distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating
     28 the
     29 [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
     30 Edit distance is the measure of how different two strings are in terms of
     31 insertions, deletions, and substitutions. This is the same problems as [DNA
     32 sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy
     33 matching is a simpler problem which only accepts insertions, not deletions or
     34 substitutions.
     35 
     36 fzy's scoring is a dynamic programming algorithm similar to
     37 [Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)
     38 and
     39 [Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm).
     40 
     41 Dynamic programming requires the observation that the result is based on the
     42 result of subproblems.
     43 
     44 Fzy borrows heavily from concepts in bioinformatics to performs scoring.
     45 
     46 Fzy builds a `n`-by-`m` matrix, where `n` is the length of the search string
     47 and `m` the length of the candidate string. Each position `(i,j)` in the matrix
     48 stores the score for matching the first `i` characters of the search with the
     49 first `j` characters of the candidate.
     50 
     51 Fzy calculates an affine gap penalty, this means simply that we assign a
     52 constant penalty for having a gap and a linear penalty for the length of the
     53 gap.
     54 Inspired by the [Gotoh algorithm
     55 (pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy
     56 computes a second `D` (for diagonal) matrix in parallel with the score matrix.
     57 The `D` matrix computes the best score which *ends* in a match. This allows
     58 both computation of the penalty for starting a gap and the score for a
     59 consecutive match.
     60 
     61 Using [this 
     62 algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L105) fzy 
     63 is able to score based on the optimal match.
     64 
     65 * Gaps (negative score)
     66   * at the start of the match
     67   * at the end of the match
     68   * within the match
     69 * Matches (positive score)
     70   * consecutive
     71   * following a slash
     72   * following a space, underscore, or dash (the start of a word)
     73   * capital letter (the start of a CamelCase word)
     74   * following a dot (often a file extension)
     75 
     76 
     77 
     78 # Other fuzzy finders
     79 
     80 ## TextMate
     81 
     82 TextMate deserves immense credit for popularizing fuzzy finding from inside
     83 text editors. It's influence can be found in the command-t project, various
     84 other editors use command-t for file finding, and the 't' command in the github
     85 web interface.
     86 
     87 * https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc
     88 
     89 ## command-t, ctrlp-cmatcher
     90 
     91 Command-t is a plugin first released in 2010 intending to bring TextMate's
     92 "Go to File" feature to vim.
     93 
     94 Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to 
     95 
     96 The wy `last_idx` is suspicious.
     97 
     98 * https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c
     99 * https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c
    100 
    101 ## Length of shortest first match: fzf
    102 https://github.com/junegunn/fzf/blob/master/src/algo/algo.go
    103 
    104 Fzy scores based on the size of the greedy shortest match. fzf finds its match
    105 by the first match appearing in the candidate string. It has some cleverness to
    106 find if there is a shorter match contained in that search, but it isn't
    107 guaranteed to find the shortest match in the string.
    108 
    109 Example results for the search "abc"
    110 
    111 * <tt>**AXXBXXC**xxabc</tt>
    112 * <tt>xxxxxxx**AXBXC**</tt>
    113 * <tt>xxxxxxxxx**ABC**</tt>
    114 
    115 ## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`)
    116 
    117 These score based on the length of the first match in the candidate. This is
    118 probably the simplest useful algorithm. This has the advantage that the heavy
    119 lifting can be performed by the regex engine, which is faster than implementing
    120 anything natively in ruby or Vim script.
    121 
    122 ## Length of shortest match: pick
    123 
    124 Pick has a method, `min_match`, to find the absolute shortest match in a string.
    125 This will find better results than the finders, at the expense of speed, as backtracking is required.
    126 
    127 ## selecta (latest master)
    128 https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc
    129 https://github.com/garybernhardt/selecta/issues/80
    130 
    131 Selecta doesn't compare all possible matches, but only the shortest match from the same start location.
    132 This can lead to inconsistent results.
    133 
    134 Example results for the search "abc"
    135 
    136 * <tt>x**AXXXXBC**</tt>
    137 * <tt>x**ABXC**x</tt>
    138 * <tt>x**ABXC**xbc</tt>
    139 
    140 The third result here should have been scored the same as the first, but the
    141 lower scoring but shorter match is what is measured.
    142 
    143 
    144 ## others
    145 
    146 * https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics)
    147 * https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score)
    148 * https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman)
    149 
    150 
    151 # Possible fzy Algorithm Improvements
    152 
    153 ## Case sensitivity
    154 
    155 fzy currently treats all searches as case-insensitive. However, scoring prefers
    156 matches on uppercase letters to help find CamelCase candidates. It might be
    157 desirable to support a case sensitive flag or "smart case" searching.
    158 
    159 ## Faster matching
    160 
    161 Matching is currently performed using the standard lib's `strpbrk`, which has a
    162 very simple implementation (at least in glibc).
    163 
    164 Glibc has an extremely clever `strchr` implementation which searches the haystack
    165 string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a
    166 4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to
    167 contain either the search char or the null terminator using bit twiddling.
    168 
    169 A similar method could probably be written to perform to find a character in a
    170 string case-insensitively.
    171 
    172 * https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD
    173