fzy
terminal fuzzy finder picker
git clone https://9o.is/git/fzy.git
ALGORITHM.md
(6787B)
1
2 This document describes the scoring algorithm of fzy as well as the algorithm
3 of other similar projects.
4
5 # Matching vs Scoring
6
7 I like to split the problem a fuzzy matchers into two subproblems: matching and scoring.
8
9 Matching determines which results are eligible for the list.
10 All the projects here consider this to be the same problem, matching the
11 candidate strings against the search string with any number of gaps.
12
13 Scoring determines the order in which the results are sorted.
14 Since scoring is tasked with finding what the human user intended, there is no
15 correct solution. As a result there are large variety in scoring strategies.
16
17 # fzy's matching
18
19 Generally, more time is taken in matching rather than scoring, so it is
20 important that matching be as fast as possible. If this were case sensitive it
21 would be a simple loop calling strchr, but since it needs to be case
22 insensitive.
23
24 # fzy's scoring
25
26 fzy treats scoring as a modified [edit
27 distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating
28 the
29 [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
30 Edit distance is the measure of how different two strings are in terms of
31 insertions, deletions, and substitutions. This is the same problems as [DNA
32 sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy
33 matching is a simpler problem which only accepts insertions, not deletions or
34 substitutions.
35
36 fzy's scoring is a dynamic programming algorithm similar to
37 [Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)
38 and
39 [Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm).
40
41 Dynamic programming requires the observation that the result is based on the
42 result of subproblems.
43
44 Fzy borrows heavily from concepts in bioinformatics to performs scoring.
45
46 Fzy builds a `n`-by-`m` matrix, where `n` is the length of the search string
47 and `m` the length of the candidate string. Each position `(i,j)` in the matrix
48 stores the score for matching the first `i` characters of the search with the
49 first `j` characters of the candidate.
50
51 Fzy calculates an affine gap penalty, this means simply that we assign a
52 constant penalty for having a gap and a linear penalty for the length of the
53 gap.
54 Inspired by the [Gotoh algorithm
55 (pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy
56 computes a second `D` (for diagonal) matrix in parallel with the score matrix.
57 The `D` matrix computes the best score which *ends* in a match. This allows
58 both computation of the penalty for starting a gap and the score for a
59 consecutive match.
60
61 Using [this
62 algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L105) fzy
63 is able to score based on the optimal match.
64
65 * Gaps (negative score)
66 * at the start of the match
67 * at the end of the match
68 * within the match
69 * Matches (positive score)
70 * consecutive
71 * following a slash
72 * following a space, underscore, or dash (the start of a word)
73 * capital letter (the start of a CamelCase word)
74 * following a dot (often a file extension)
75
76
77
78 # Other fuzzy finders
79
80 ## TextMate
81
82 TextMate deserves immense credit for popularizing fuzzy finding from inside
83 text editors. It's influence can be found in the command-t project, various
84 other editors use command-t for file finding, and the 't' command in the github
85 web interface.
86
87 * https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc
88
89 ## command-t, ctrlp-cmatcher
90
91 Command-t is a plugin first released in 2010 intending to bring TextMate's
92 "Go to File" feature to vim.
93
94 Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to
95
96 The wy `last_idx` is suspicious.
97
98 * https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c
99 * https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c
100
101 ## Length of shortest first match: fzf
102 https://github.com/junegunn/fzf/blob/master/src/algo/algo.go
103
104 Fzy scores based on the size of the greedy shortest match. fzf finds its match
105 by the first match appearing in the candidate string. It has some cleverness to
106 find if there is a shorter match contained in that search, but it isn't
107 guaranteed to find the shortest match in the string.
108
109 Example results for the search "abc"
110
111 * <tt>**AXXBXXC**xxabc</tt>
112 * <tt>xxxxxxx**AXBXC**</tt>
113 * <tt>xxxxxxxxx**ABC**</tt>
114
115 ## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`)
116
117 These score based on the length of the first match in the candidate. This is
118 probably the simplest useful algorithm. This has the advantage that the heavy
119 lifting can be performed by the regex engine, which is faster than implementing
120 anything natively in ruby or Vim script.
121
122 ## Length of shortest match: pick
123
124 Pick has a method, `min_match`, to find the absolute shortest match in a string.
125 This will find better results than the finders, at the expense of speed, as backtracking is required.
126
127 ## selecta (latest master)
128 https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc
129 https://github.com/garybernhardt/selecta/issues/80
130
131 Selecta doesn't compare all possible matches, but only the shortest match from the same start location.
132 This can lead to inconsistent results.
133
134 Example results for the search "abc"
135
136 * <tt>x**AXXXXBC**</tt>
137 * <tt>x**ABXC**x</tt>
138 * <tt>x**ABXC**xbc</tt>
139
140 The third result here should have been scored the same as the first, but the
141 lower scoring but shorter match is what is measured.
142
143
144 ## others
145
146 * https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics)
147 * https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score)
148 * https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman)
149
150
151 # Possible fzy Algorithm Improvements
152
153 ## Case sensitivity
154
155 fzy currently treats all searches as case-insensitive. However, scoring prefers
156 matches on uppercase letters to help find CamelCase candidates. It might be
157 desirable to support a case sensitive flag or "smart case" searching.
158
159 ## Faster matching
160
161 Matching is currently performed using the standard lib's `strpbrk`, which has a
162 very simple implementation (at least in glibc).
163
164 Glibc has an extremely clever `strchr` implementation which searches the haystack
165 string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a
166 4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to
167 contain either the search char or the null terminator using bit twiddling.
168
169 A similar method could probably be written to perform to find a character in a
170 string case-insensitively.
171
172 * https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD
173