vis
a vi-like editor based on Plan 9's structural regular expressions
git clone https://9o.is/git/vis.git
lexer.lua
(91147B)
1 -- Copyright 2006-2025 Mitchell. See LICENSE.
2
3 --- Lexes Scintilla documents and source code with Lua and LPeg.
4 --
5 -- ### Contents
6 --
7 -- 1. [Writing Lua Lexers](#writing-lua-lexers)
8 -- 2. [Lexer Basics](#lexer-basics)
9 -- - [New Lexer Template](#new-lexer-template)
10 -- - [Tags](#tags)
11 -- - [Rules](#rules)
12 -- - [Summary](#summary)
13 -- 3. [Advanced Techniques](#advanced-techniques)
14 -- - [Line Lexers](#line-lexers)
15 -- - [Embedded Lexers](#embedded-lexers)
16 -- - [Lexers with Complex State](#lexers-with-complex-state)
17 -- 4. [Code Folding](#code-folding)
18 -- 5. [Using Lexers](#using-lexers)
19 -- 6. [Migrating Legacy Lexers](#migrating-legacy-lexers)
20 -- 7. [Considerations](#considerations)
21 -- 8. [API Documentation](#lexer.add_fold_point)
22 --
23 -- ### Writing Lua Lexers
24 --
25 -- Lexers recognize and tag elements of source code for syntax highlighting. Scintilla (the
26 -- editing component behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
27 -- lexers which are difficult to create and/or extend. On the other hand, Lua makes it easy to
28 -- to rapidly create new lexers, extend existing ones, and embed lexers within one another. Lua
29 -- lexers tend to be more readable than C++ lexers too.
30 --
31 -- While lexers can be written in plain Lua, Scintillua prefers using Parsing Expression
32 -- Grammars, or PEGs, composed with the Lua [LPeg library][]. As a result, this document is
33 -- devoted to writing LPeg lexers. The following table comes from the LPeg documentation and
34 -- summarizes all you need to know about constructing basic LPeg patterns. This module provides
35 -- convenience functions for creating and working with other more advanced patterns and concepts.
36 --
37 -- Operator | Description
38 -- -|-
39 -- `lpeg.P`(*string*) | Matches string *string* literally.
40 -- `lpeg.P`(*n*) | Matches exactly *n* number of characters.
41 -- `lpeg.S`(*string*) | Matches any character in string set *string*.
42 -- `lpeg.R`("*xy*") | Matches any character between range *x* and *y*.
43 -- *patt*`^`*n* | Matches at least *n* repetitions of *patt*.
44 -- *patt*`^`-*n* | Matches at most *n* repetitions of *patt*.
45 -- *patt1* `*` *patt2* | Matches *patt1* followed by *patt2*.
46 -- *patt1* `+` *patt2* | Matches *patt1* or *patt2* (ordered choice).
47 -- *patt1* `-` *patt2* | Matches *patt1* if *patt2* does not also match.
48 -- `-`*patt* | Matches if *patt* does not match, consuming no input.
49 -- `#`*patt* | Matches *patt* but consumes no input.
50 --
51 -- The first part of this document deals with rapidly constructing a simple lexer. The next part
52 -- deals with more advanced techniques, such as embedding lexers within one another. Following
53 -- that is a discussion about code folding, or being able to tell Scintilla which code blocks
54 -- are "foldable" (temporarily hideable from view). After that are instructions on how to use
55 -- Lua lexers with the aforementioned Textadept and SciTE editors. Finally there are comments
56 -- on lexer performance and limitations.
57 --
58 -- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
59 -- [Textadept]: https://orbitalquark.github.io/textadept
60 -- [SciTE]: https://scintilla.org/SciTE.html
61 --
62 -- ### Lexer Basics
63 --
64 -- The *lexers/* directory contains all of Scintillua's Lua lexers, including any new ones you
65 -- write. Before attempting to write one from scratch though, first determine if your programming
66 -- language is similar to any of the 100+ languages supported. If so, you may be able to copy
67 -- and modify, or inherit from that lexer, saving some time and effort. The filename of your
68 -- lexer should be the name of your programming language in lower case followed by a *.lua*
69 -- extension. For example, a new Lua lexer has the name *lua.lua*.
70 --
71 -- #### New Lexer Template
72 --
73 -- There is a *lexers/template.txt* file that contains a simple template for a new lexer. Feel
74 -- free to use it, replacing the '?' with the name of your lexer. Consider this snippet from
75 -- the template:
76 --
77 -- ```lua
78 -- -- ? LPeg lexer.
79 --
80 -- local lexer = lexer
81 -- local P, S = lpeg.P, lpeg.S
82 --
83 -- local lex = lexer.new(...)
84 --
85 -- --[[... lexer rules ...]]
86 --
87 -- -- Identifier.
88 -- local identifier = lex:tag(lexer.IDENTIFIER, lexer.word)
89 -- lex:add_rule('identifier', identifier)
90 --
91 -- --[[... more lexer rules ...]]
92 --
93 -- return lex
94 -- ```
95 --
96 -- The first line of code is a Lua convention to store a global variable into a local variable
97 -- for quick access. The second line simply defines often used convenience variables. The third
98 -- and last lines [define](#lexer.new) and return the lexer object Scintillua uses; they are
99 -- very important and must be part of every lexer. Note the `...` passed to `lexer.new()` is
100 -- literal: the lexer will assume the name of its filename or an alternative name specified
101 -- by `lexer.load()` in embedded lexer applications. The fourth line uses something called a
102 -- "tag", an essential component of lexers. You will learn about tags shortly. The fifth line
103 -- defines a lexer grammar rule, which you will learn about later. (Be aware that it is common
104 -- practice to combine these two lines for short rules.) Note, however, the `local` prefix in
105 -- front of variables, which is needed so-as not to affect Lua's global environment. All in all,
106 -- this is a minimal, working lexer that you can build on.
107 --
108 -- #### Tags
109 --
110 -- Take a moment to think about your programming language's structure. What kind of key elements
111 -- does it have? Most languages have elements like keywords, strings, and comments. The
112 -- lexer's job is to break down source code into these elements and "tag" them for syntax
113 -- highlighting. Therefore, tags are an essential component of lexers. It is up to you how
114 -- specific your lexer is when it comes to tagging elements. Perhaps only distinguishing between
115 -- keywords and identifiers is necessary, or maybe recognizing constants and built-in functions,
116 -- methods, or libraries is desirable. The Lua lexer, for example, tags the following elements:
117 -- keywords, functions, constants, identifiers, strings, comments, numbers, labels, attributes,
118 -- and operators. Even though functions and constants are subsets of identifiers, Lua programmers
119 -- find it helpful for the lexer to distinguish between them all. It is perfectly acceptable
120 -- to just recognize keywords and identifiers.
121 --
122 -- In a lexer, LPeg patterns that match particular sequences of characters are tagged with a
123 -- tag name using the the `lexer.tag()` function. Let us examine the "identifier" tag used in
124 -- the template shown earlier:
125 --
126 -- ```lua
127 -- local identifier = lex:tag(lexer.IDENTIFIER, lexer.word)
128 -- ```
129 --
130 -- At first glance, the first argument does not appear to be a string name and the second
131 -- argument does not appear to be an LPeg pattern. Perhaps you expected something like:
132 --
133 -- ```lua
134 -- lex:tag('identifier', (lpeg.R('AZ', 'az') + '_') * (lpeg.R('AZ', 'az', '09') + '_')^0)
135 -- ```
136 --
137 -- The `lexer` module actually provides a convenient list of common tag names and common LPeg
138 -- patterns for you to use. Tag names for programming languages include (but are not limited
139 -- to) `lexer.DEFAULT`, `lexer.COMMENT`, `lexer.STRING`, `lexer.NUMBER`, `lexer.KEYWORD`,
140 -- `lexer.IDENTIFIER`, `lexer.OPERATOR`, `lexer.ERROR`, `lexer.PREPROCESSOR`, `lexer.CONSTANT`,
141 -- `lexer.CONSTANT_BUILTIN`, `lexer.VARIABLE`, `lexer.VARIABLE_BUILTIN`, `lexer.FUNCTION`,
142 -- `lexer.FUNCTION_BUILTIN`, `lexer.FUNCTION_METHOD`, `lexer.CLASS`, `lexer.TYPE`, `lexer.LABEL`,
143 -- `lexer.REGEX`, `lexer.EMBEDDED`, and `lexer.ANNOTATION`. Tag names for markup languages include
144 -- (but are not limited to) `lexer.TAG`, `lexer.ATTRIBUTE`, `lexer.HEADING`, `lexer.BOLD`,
145 -- `lexer.ITALIC`, `lexer.UNDERLINE`, `lexer.CODE`, `lexer.LINK`, `lexer.REFERENCE`, and
146 -- `lexer.LIST`. Patterns include `lexer.any`, `lexer.alpha`, `lexer.digit`, `lexer.alnum`,
147 -- `lexer.lower`, `lexer.upper`, `lexer.xdigit`, `lexer.graph`, `lexer.punct`, `lexer.space`,
148 -- `lexer.newline`, `lexer.nonnewline`, `lexer.dec_num`, `lexer.hex_num`, `lexer.oct_num`,
149 -- `lexer.bin_num`, `lexer.integer`, `lexer.float`, `lexer.number`, and `lexer.word`. You may use
150 -- your own tag names if none of the above fit your language, but an advantage to using predefined
151 -- tag names is that the language elements your lexer recognizes will inherit any universal syntax
152 -- highlighting color theme that your editor uses. You can also "subclass" existing tag names by
153 -- appending a '.*subclass*' string to them. For example, the HTML lexer tags unknown tags as
154 -- `lexer.TAG .. '.unknown'`. This gives editors the opportunity to highlight those subclassed
155 -- tags in a different way than normal tags, or fall back to highlighting them as normal tags.
156 --
157 -- ##### Example Tags
158 --
159 -- So, how might you recognize and tag elements like keywords, comments, and strings? Here are
160 -- some examples.
161 --
162 -- **Keywords**
163 --
164 -- Instead of matching *n* keywords with *n* `P('keyword_n')` ordered choices, use one
165 -- of of the following methods:
166 --
167 -- 1. Use the convenience function `lexer.word_match()` optionally coupled with
168 -- `lexer.set_word_list()`. It is much easier and more efficient to write word matches like:
169 --
170 -- ```lua
171 -- local keyword = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD))
172 -- --[[...]]
173 -- lex:set_word_list(lexer.KEYWORD, {
174 -- 'keyword_1', 'keyword_2', ..., 'keyword_n'
175 -- })
176 --
177 -- local case_insensitive_word = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD, true))
178 -- --[[...]]
179 -- lex:set_word_list(lexer.KEYWORD, {
180 -- 'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
181 -- })
182 --
183 -- local hyphenated_keyword = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD))
184 -- --[[...]]
185 -- lex:set_word_list(lexer.KEYWORD, {
186 -- 'keyword-1', 'keyword-2', ..., 'keyword-n'
187 -- })
188 -- ```
189 --
190 -- The benefit of using this method is that other lexers that inherit from, embed, or embed
191 -- themselves into your lexer can set, replace, or extend these word lists. For example,
192 -- the TypeScript lexer inherits from JavaScript, but extends JavaScript's keyword and type
193 -- lists with more options.
194 --
195 -- This method also allows applications that use your lexer to extend or replace your word
196 -- lists. For example, the Lua lexer includes keywords and functions for the latest version
197 -- of Lua (5.4 at the time of writing). However, editors using that lexer might want to use
198 -- keywords from Lua version 5.1, which is still quite popular.
199 --
200 -- Note that calling `lex:set_word_list()` is completely optional. Your lexer is allowed to
201 -- expect the editor using it to supply word lists. Scintilla-based editors can do so via
202 -- Scintilla's `ILexer5` interface.
203 --
204 -- 2. Use the lexer-agnostic form of `lexer.word_match()`:
205 --
206 -- ```lua
207 -- local keyword = lex:tag(lexer.KEYWORD, lexer.word_match{
208 -- 'keyword_1', 'keyword_2', ..., 'keyword_n'
209 -- })
210 --
211 -- local case_insensitive_keyword = lex:tag(lexer.KEYWORD, lexer.word_match({
212 -- 'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
213 -- }, true))
214 --
215 -- local hyphened_keyword = lex:tag(lexer.KEYWORD, lexer.word_match{
216 -- 'keyword-1', 'keyword-2', ..., 'keyword-n'
217 -- })
218 -- ```
219 --
220 -- For short keyword lists, you can use a single string of words. For example:
221 --
222 -- ```lua
223 -- local keyword = lex:tag(lexer.KEYWORD, lexer.word_match('key_1 key_2 ... key_n'))
224 -- ```
225 --
226 -- You can use this method for static word lists that do not change, or where it does not
227 -- make sense to allow applications or other lexers to extend or replace a word list.
228 --
229 -- **Comments**
230 --
231 -- Line-style comments with a prefix character(s) are easy to express:
232 --
233 -- ```lua
234 -- local shell_comment = lex:tag(lexer.COMMENT, lexer.to_eol('#'))
235 -- local c_line_comment = lex:tag(lexer.COMMENT, lexer.to_eol('//', true))
236 -- ```
237 --
238 -- The comments above start with a '#' or "//" and go to the end of the line (EOL). The second
239 -- comment recognizes the next line also as a comment if the current line ends with a '\\'
240 -- escape character.
241 --
242 -- C-style "block" comments with a start and end delimiter are also easy to express:
243 --
244 -- ```lua
245 -- local c_comment = lex:tag(lexer.COMMENT, lexer.range('/*', '*/'))
246 -- ```
247 --
248 -- This comment starts with a "/\*" sequence and contains anything up to and including an ending
249 -- "\*/" sequence. The ending "\*/" is optional so the lexer can recognize unfinished comments
250 -- as comments and highlight them properly.
251 --
252 -- **Strings**
253 --
254 -- Most programming languages allow escape sequences in strings such that a sequence like
255 -- "\\"" in a double-quoted string indicates that the '"' is not the end of the
256 -- string. `lexer.range()` handles escapes inherently.
257 --
258 -- ```lua
259 -- local dq_str = lexer.range('"')
260 -- local sq_str = lexer.range("'")
261 -- local string = lex:tag(lexer.STRING, dq_str + sq_str)
262 -- ```
263 --
264 -- In this case, the lexer treats '\\' as an escape character in a string sequence.
265 --
266 -- **Numbers**
267 --
268 -- Most programming languages have the same format for integers and floats, so it might be as
269 -- simple as using a predefined LPeg pattern:
270 --
271 -- ```lua
272 -- local number = lex:tag(lexer.NUMBER, lexer.number)
273 -- ```
274 --
275 -- However, some languages allow postfix characters on integers:
276 --
277 -- ```lua
278 -- local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1)
279 -- local number = lex:tag(lexer.NUMBER, lexer.float + lexer.hex_num + integer)
280 -- ```
281 --
282 -- Other languages allow separaters within numbers for better readability:
283 --
284 -- ```lua
285 -- local number = lex:tag(lexer.NUMBER, lexer.number_('_')) -- recognize 1_000_000
286 -- ```
287 --
288 -- Your language may need other tweaks, but it is up to you how fine-grained you want your
289 -- highlighting to be. After all, you are not writing a compiler or interpreter!
290 --
291 -- #### Rules
292 --
293 -- Programming languages have grammars, which specify valid syntactic structure. For example,
294 -- comments usually cannot appear within a string, and valid identifiers (like variable names)
295 -- cannot be keywords. In Lua lexers, grammars consist of LPeg pattern rules, many of which
296 -- are tagged. Recall from the lexer template the `lexer.add_rule()` call, which adds a rule
297 -- to the lexer's grammar:
298 --
299 -- ```lua
300 -- lex:add_rule('identifier', identifier)
301 -- ```
302 --
303 -- Each rule has an associated name, but rule names are completely arbitrary and serve only to
304 -- identify and distinguish between different rules. Rule order is important: if text does not
305 -- match the first rule added to the grammar, the lexer tries to match the second rule added, and
306 -- so on. Right now this lexer simply matches identifiers under a rule named "identifier".
307 --
308 -- To illustrate the importance of rule order, here is an example of a simplified Lua lexer:
309 --
310 -- ```lua
311 -- lex:add_rule('keyword', lex:tag(lexer.KEYWORD, ...))
312 -- lex:add_rule('identifier', lex:tag(lexer.IDENTIFIER, ...))
313 -- lex:add_rule('string', lex:tag(lexer.STRING, ...))
314 -- lex:add_rule('comment', lex:tag(lexer.COMMENT, ...))
315 -- lex:add_rule('number', lex:tag(lexer.NUMBER, ...))
316 -- lex:add_rule('label', lex:tag(lexer.LABEL, ...))
317 -- lex:add_rule('operator', lex:tag(lexer.OPERATOR, ...))
318 -- ```
319 --
320 -- Notice how identifiers come _after_ keywords. In Lua, as with most programming languages,
321 -- the characters allowed in keywords and identifiers are in the same set (alphanumerics plus
322 -- underscores). If the lexer added the "identifier" rule before the "keyword" rule, all keywords
323 -- would match identifiers and thus would be incorrectly tagged (and likewise incorrectly
324 -- highlighted) as identifiers instead of keywords. The same idea applies to function names,
325 -- constants, etc. that you may want to distinguish between: their rules should come before
326 -- identifiers.
327 --
328 -- So what about text that does not match any rules? For example in Lua, the '!' character is
329 -- meaningless outside a string or comment. Normally the lexer skips over such text. If instead
330 -- you want to highlight these "syntax errors", add a final rule:
331 --
332 -- ```lua
333 -- lex:add_rule('keyword', keyword)
334 -- --[[...]]
335 -- lex:add_rule('error', lex:tag(lexer.ERROR, lexer.any))
336 -- ```
337 --
338 -- This identifies and tags any character not matched by an existing rule as a `lexer.ERROR`.
339 --
340 -- Even though the rules defined in the examples above contain a single tagged pattern, rules may
341 -- consist of multiple tagged patterns. For example, the rule for an HTML tag could consist of a
342 -- tagged tag followed by an arbitrary number of tagged attributes, separated by whitespace. This
343 -- allows the lexer to produce all tags separately, but in a single, convenient rule. That rule
344 -- might look something like this:
345 --
346 -- ```lua
347 -- local ws = lex:get_rule('whitespace') -- predefined rule for all lexers
348 -- lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)
349 -- ```
350 --
351 -- Note however that lexers with complex rules like these are more prone to lose track of their
352 -- state, especially if they span multiple lines.
353 --
354 -- #### Summary
355 --
356 -- Lexers primarily consist of tagged patterns and grammar rules. These patterns match language
357 -- elements like keywords, comments, and strings, and rules dictate the order in which patterns
358 -- are matched. At your disposal are a number of convenience patterns and functions for rapidly
359 -- creating a lexer. If you choose to use predefined tag names (or perhaps even subclassed
360 -- names) for your patterns, you do not have to update your editor's theme to specify how to
361 -- syntax-highlight those patterns. Your language's elements will inherit the default syntax
362 -- highlighting color theme your editor uses.
363 --
364 -- ### Advanced Techniques
365 --
366 -- #### Line Lexers
367 --
368 -- By default, lexers match the arbitrary chunks of text passed to them by Scintilla. These
369 -- chunks may be a full document, only the visible part of a document, or even just portions of
370 -- lines. Some lexers need to match whole lines. For example, a lexer for the output of a file
371 -- "diff" needs to know if the line started with a '+' or '-' and then highlight the entire
372 -- line accordingly. To indicate that your lexer matches by line, create the lexer with an
373 -- extra parameter:
374 --
375 -- ```lua
376 -- local lex = lexer.new(..., {lex_by_line = true})
377 -- ```
378 --
379 -- Now the input text for the lexer is a single line at a time. Keep in mind that line lexers
380 -- do not have the ability to look ahead to subsequent lines.
381 --
382 -- #### Embedded Lexers
383 --
384 -- Scintillua lexers embed within one another very easily, requiring minimal effort. In the
385 -- following sections, the lexer being embedded is called the "child" lexer and the lexer a child
386 -- is being embedded in is called the "parent". For example, consider an HTML lexer and a CSS
387 -- lexer. Either lexer stands alone for tagging their respective HTML and CSS files. However, CSS
388 -- can be embedded inside HTML. In this specific case, the CSS lexer is the "child" lexer with
389 -- the HTML lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This sounds
390 -- a lot like the case with CSS, but there is a subtle difference: PHP _embeds itself into_
391 -- HTML while CSS is _embedded in_ HTML. This fundamental difference results in two types of
392 -- embedded lexers: a parent lexer that embeds other child lexers in it (like HTML embedding CSS),
393 -- and a child lexer that embeds itself into a parent lexer (like PHP embedding itself in HTML).
394 --
395 -- ##### Parent Lexer
396 --
397 -- Before embedding a child lexer into a parent lexer, the parent lexer needs to load the child
398 -- lexer. This is done with the `lexer.load()` function. For example, loading the CSS lexer
399 -- within the HTML lexer looks like:
400 --
401 -- ```lua
402 -- local css = lexer.load('css')
403 -- ```
404 --
405 -- The next part of the embedding process is telling the parent lexer when to switch over
406 -- to the child lexer and when to switch back. The lexer refers to these indications as the
407 -- "start rule" and "end rule", respectively, and are just LPeg patterns. Continuing with the
408 -- HTML/CSS example, the transition from HTML to CSS is when the lexer encounters a "style"
409 -- tag with a "type" attribute whose value is "text/css":
410 --
411 -- ```lua
412 -- local css_tag = P('<style') * P(function(input, index)
413 -- if input:find('^[^>]+type="text/css"', index) then return true end
414 -- end)
415 -- ```
416 --
417 -- This pattern looks for the beginning of a "style" tag and searches its attribute list for
418 -- the text "`type="text/css"`". (In this simplified example, the Lua pattern does not consider
419 -- whitespace between the '=' nor does it consider that using single quotes is valid.) If there
420 -- is a match, the functional pattern returns `true`. However, we ultimately want to tag the
421 -- "style" tag as an HTML tag, so the actual start rule looks like this:
422 --
423 -- ```lua
424 -- local css_start_rule = #css_tag * tag
425 -- ```
426 --
427 -- Now that the parent knows when to switch to the child, it needs to know when to switch
428 -- back. In the case of HTML/CSS, the switch back occurs when the lexer encounters an ending
429 -- "style" tag, though the lexer should still tag that tag as an HTML tag:
430 --
431 -- ```lua
432 -- local css_end_rule = #P('</style>') * tag
433 -- ```
434 --
435 -- Once the parent loads the child lexer and defines the child's start and end rules, it embeds
436 -- the child with the `lexer.embed()` function:
437 --
438 -- ```lua
439 -- lex:embed(css, css_start_rule, css_end_rule)
440 -- ```
441 --
442 -- ##### Child Lexer
443 --
444 -- The process for instructing a child lexer to embed itself into a parent is very similar to
445 -- embedding a child into a parent: first, load the parent lexer into the child lexer with the
446 -- `lexer.load()` function and then create start and end rules for the child lexer. However,
447 -- in this case, call `lexer.embed()` with switched arguments. For example, in the PHP lexer:
448 --
449 -- ```lua
450 -- local html = lexer.load('html')
451 -- local php_start_rule = lex:tag('php_tag', '<?php' * lexer.space)
452 -- local php_end_rule = lex:tag('php_tag', '?>')
453 -- html:embed(lex, php_start_rule, php_end_rule)
454 -- ```
455 --
456 -- Note that the use of a 'php_tag' tag will require the editor using the lexer to specify how
457 -- to highlight text with that tag. In order to avoid this, you could use the `lexer.PREPROCESSOR`
458 -- tag instead.
459 --
460 -- #### Lexers with Complex State
461 --
462 -- A vast majority of lexers are not stateful and can operate on any chunk of text in a
463 -- document. However, there may be rare cases where a lexer does need to keep track of some
464 -- sort of persistent state. Rather than using `lpeg.P` function patterns that set state
465 -- variables, it is recommended to make use of Scintilla's built-in, per-line state integers via
466 -- `lexer.line_state`. It was designed to accommodate up to 32 bit-flags for tracking state.
467 -- `lexer.line_from_position()` will return the line for any position given to an `lpeg.P`
468 -- function pattern. (Any positions derived from that position argument will also work.)
469 --
470 -- Writing stateful lexers is beyond the scope of this document.
471 --
472 -- ### Code Folding
473 --
474 -- When reading source code, it is occasionally helpful to temporarily hide blocks of code like
475 -- functions, classes, comments, etc. This is the concept of "folding". In the Textadept and
476 -- SciTE editors for example, little markers in the editor margins appear next to code that
477 -- can be folded at places called "fold points". When the user clicks on one of those markers,
478 -- the editor hides the code associated with the marker until the user clicks on the marker
479 -- again. The lexer specifies these fold points and what code exactly to fold.
480 --
481 -- The fold points for most languages occur on keywords or character sequences. Examples of
482 -- fold keywords are "if" and "end" in Lua and examples of fold character sequences are '{',
483 -- '}', "/\*", and "\*/" in C for code block and comment delimiters, respectively. However,
484 -- these fold points cannot occur just anywhere. For example, lexers should not recognize fold
485 -- keywords that appear within strings or comments. The `lexer.add_fold_point()` function allows
486 -- you to conveniently define fold points with such granularity. For example, consider C:
487 --
488 -- ```lua
489 -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
490 -- lex:add_fold_point(lexer.COMMENT, '/*', '*/')
491 -- ```
492 --
493 -- The first assignment states that any '{' or '}' that the lexer tagged as an `lexer.OPERATOR`
494 -- is a fold point. Likewise, the second assignment states that any "/\*" or "\*/" that the
495 -- lexer tagged as part of a `lexer.COMMENT` is a fold point. The lexer does not consider any
496 -- occurrences of these characters outside their tagged elements (such as in a string) as fold
497 -- points. How do you specify fold keywords? Here is an example for Lua:
498 --
499 -- ```lua
500 -- lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
501 -- lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
502 -- lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
503 -- lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')
504 -- ```
505 --
506 -- If your lexer has case-insensitive keywords as fold points, simply add a
507 -- `case_insensitive_fold_points = true` option to `lexer.new()`, and specify keywords in
508 -- lower case.
509 --
510 -- If your lexer needs to do some additional processing in order to determine if a tagged element
511 -- is a fold point, pass a function to `lex:add_fold_point()` that returns an integer. A return
512 -- value of `1` indicates the element is a beginning fold point and a return value of `-1`
513 -- indicates the element is an ending fold point. A return value of `0` indicates the element
514 -- is not a fold point. For example:
515 --
516 -- ```lua
517 -- local function fold_strange_element(text, pos, line, s, symbol)
518 -- if ... then
519 -- return 1 -- beginning fold point
520 -- elseif ... then
521 -- return -1 -- ending fold point
522 -- end
523 -- return 0
524 -- end
525 --
526 -- lex:add_fold_point('strange_element', '|', fold_strange_element)
527 -- ```
528 --
529 -- Any time the lexer encounters a '|' that is tagged as a "strange_element", it calls the
530 -- `fold_strange_element` function to determine if '|' is a fold point. The lexer calls these
531 -- functions with the following arguments: the text to identify fold points in, the beginning
532 -- position of the current line in the text to fold, the current line's text, the position in
533 -- the current line the fold point text starts at, and the fold point text itself.
534 --
535 -- #### Fold by Indentation
536 --
537 -- Some languages have significant whitespace and/or no delimiters that indicate fold points. If
538 -- your lexer falls into this category and you would like to mark fold points based on changes
539 -- in indentation, create the lexer with a `fold_by_indentation = true` option:
540 --
541 -- ```lua
542 -- local lex = lexer.new(..., {fold_by_indentation = true})
543 -- ```
544 --
545 -- #### Custom Folding
546 --
547 -- Lexers with complex folding needs can implement their own folders by defining their own
548 -- [`lex:fold()`](#lexer.fold) method. Writing custom folders is beyond the scope of this document.
549 --
550 -- ### Using Lexers
551 --
552 -- **Textadept**
553 --
554 -- Place your lexer in your *~/.textadept/lexers/* directory so you do not overwrite it when
555 -- upgrading Textadept. Also, lexers in this directory override default lexers. Thus, Textadept
556 -- loads a user *lua* lexer instead of the default *lua* lexer. This is convenient for tweaking
557 -- a default lexer to your liking. Then add a [file extension](#lexer.detect_extensions) for
558 -- your lexer if necessary.
559 --
560 -- **SciTE**
561 --
562 -- Create a *.properties* file for your lexer and `import` it in either your *SciTEUser.properties*
563 -- or *SciTEGlobal.properties*. The contents of the *.properties* file should contain:
564 --
565 -- file.patterns.[lexer_name]=[file_patterns]
566 -- lexer.$(file.patterns.[lexer_name])=scintillua.[lexer_name]
567 -- keywords.$(file.patterns.[lexer_name])=scintillua
568 -- keywords2.$(file.patterns.[lexer_name])=scintillua
569 -- ...
570 -- keywords9.$(file.patterns.[lexer_name])=scintillua
571 --
572 -- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension) and
573 -- `[file_patterns]` is a set of file extensions to use your lexer for. The `keyword` settings are
574 -- only needed if another SciTE properties file has defined keyword sets for `[file_patterns]`.
575 -- The `scintillua` keyword setting instructs Scintillua to use the keyword sets defined within
576 -- the lexer. You can override a lexer's keyword set(s) by specifying your own in the same order
577 -- that the lexer calls `lex:set_word_list()`. For example, the Lua lexer's first set of keywords
578 -- is for reserved words, the second is for built-in global functions, the third is for library
579 -- functions, the fourth is for built-in global constants, and the fifth is for library constants.
580 --
581 -- SciTE assigns styles to tag names in order to perform syntax highlighting. Since the set of
582 -- tag names used for a given language changes, your *.properties* file should specify styles
583 -- for tag names instead of style numbers. For example:
584 --
585 -- scintillua.styles.my_tag=$(scintillua.styles.keyword),bold
586 --
587 -- ### Migrating Legacy Lexers
588 --
589 -- Legacy lexers are of the form:
590 --
591 -- ```lua
592 -- local lexer = require('lexer')
593 -- local token, word_match = lexer.token, lexer.word_match
594 -- local P, S = lpeg.P, lpeg.S
595 --
596 -- local lex = lexer.new('?')
597 --
598 -- -- Whitespace.
599 -- lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
600 --
601 -- -- Keywords.
602 -- lex:add_rule('keyword', token(lexer.KEYWORD, word_match{
603 -- --[[...]]
604 -- }))
605 --
606 -- --[[... other rule definitions ...]]
607 --
608 -- -- Custom.
609 -- lex:add_rule('custom_rule', token('custom_token', ...))
610 -- lex:add_style('custom_token', lexer.styles.keyword .. {bold = true})
611 --
612 -- -- Fold points.
613 -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
614 --
615 -- return lex
616 -- ```
617 --
618 -- While Scintillua will mostly handle such legacy lexers just fine without any changes, it is
619 -- recommended that you migrate yours. The migration process is fairly straightforward:
620 --
621 -- 1. `lexer` exists in the default lexer environment, so `require('lexer')` should be replaced
622 -- by simply `lexer`. (Keep in mind `local lexer = lexer` is a Lua idiom.)
623 -- 2. Every lexer created using `lexer.new()` should no longer specify a lexer name by string,
624 -- but should instead use `...` (three dots), which evaluates to the lexer's filename or
625 -- alternative name in embedded lexer applications.
626 -- 3. Every lexer created using `lexer.new()` now includes a rule to match whitespace. Unless
627 -- your lexer has significant whitespace, you can remove your legacy lexer's whitespace
628 -- token and rule. Otherwise, your defined whitespace rule will replace the default one.
629 -- 4. The concept of tokens has been replaced with tags. Instead of calling a `token()` function,
630 -- call [`lex:tag()`](#lexer.tag) instead.
631 -- 5. Lexers now support replaceable word lists. Instead of calling `lexer.word_match()` with
632 -- large word lists, call it as an instance method with an identifier string (typically
633 -- something like `lexer.KEYWORD`). Then at the end of the lexer (before `return lex`), call
634 -- [`lex:set_word_list()`](#lexer.set_word_list) with the same identifier and the usual
635 -- list of words to match. This allows users of your lexer to call `lex:set_word_list()`
636 -- with their own set of words should they wish to.
637 -- 6. Lexers no longer specify styling information. Remove any calls to `lex:add_style()`. You
638 -- may need to add styling information for custom tags to your editor's theme.
639 -- 7. `lexer.last_char_includes()` has been deprecated in favor of the new `lexer.after_set()`.
640 -- Use the character set and pattern as arguments to that new function.
641 --
642 -- As an example, consider the following sample legacy lexer:
643 --
644 -- ```lua
645 -- local lexer = require('lexer')
646 -- local token, word_match = lexer.token, lexer.word_match
647 -- local P, S = lpeg.P, lpeg.S
648 --
649 -- local lex = lexer.new('legacy')
650 --
651 -- lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
652 -- lex:add_rule('keyword', token(lexer.KEYWORD, word_match('foo bar baz')))
653 -- lex:add_rule('custom', token('custom', 'quux'))
654 -- lex:add_style('custom', lexer.styles.keyword .. {bold = true})
655 -- lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
656 -- lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
657 -- lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))
658 -- lex:add_rule('number', token(lexer.NUMBER, lexer.number))
659 -- lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
660 --
661 -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
662 --
663 -- return lex
664 -- ```
665 --
666 -- Following the migration steps would yield:
667 --
668 -- ```lua
669 -- local lexer = lexer
670 -- local P, S = lpeg.P, lpeg.S
671 --
672 -- local lex = lexer.new(...)
673 --
674 -- lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD)))
675 -- lex:add_rule('custom', lex:tag('custom', 'quux'))
676 -- lex:add_rule('identifier', lex:tag(lexer.IDENTIFIER, lexer.word))
677 -- lex:add_rule('string', lex:tag(lexer.STRING, lexer.range('"')))
678 -- lex:add_rule('comment', lex:tag(lexer.COMMENT, lexer.to_eol('#')))
679 -- lex:add_rule('number', lex:tag(lexer.NUMBER, lexer.number))
680 -- lex:add_rule('operator', lex:tag(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
681 --
682 -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
683 --
684 -- lex:set_word_list(lexer.KEYWORD, {'foo', 'bar', 'baz'})
685 --
686 -- return lex
687 -- ```
688 --
689 -- Any editors using this lexer would have to add a style for the 'custom' tag.
690 --
691 -- ### Considerations
692 --
693 -- #### Performance
694 --
695 -- There might be some slight overhead when initializing a lexer, but loading a file from disk
696 -- into Scintilla is usually more expensive. Actually painting the syntax highlighted text to
697 -- the screen is often more expensive than the lexing operation. On modern computer systems,
698 -- I see no difference in speed between Lua lexers and Scintilla's C++ ones. Optimize lexers for
699 -- speed by re-arranging `lexer.add_rule()` calls so that the most common rules match first. Do
700 -- keep in mind that order matters for similar rules.
701 --
702 -- In some cases, folding may be far more expensive than lexing, particularly in lexers with a
703 -- lot of potential fold points. If your lexer is exhibiting signs of slowness, try disabling
704 -- folding in your text editor first. If that speeds things up, you can try reducing the number
705 -- of fold points you added, overriding `lexer.fold()` with your own implementation, or simply
706 -- eliminating folding support from your lexer.
707 --
708 -- #### Limitations
709 --
710 -- Embedded preprocessor languages like PHP cannot completely embed themselves into their parent
711 -- languages because the parent's tagged patterns do not support start and end rules. This
712 -- mostly goes unnoticed, but code like
713 --
714 -- ```php
715 -- <div id="<?php echo $id; ?>">
716 -- ```
717 --
718 -- will not be tagged correctly. Also, these types of languages cannot currently embed themselves
719 -- into their parent's child languages either.
720 --
721 -- A language cannot embed itself into something like an interpolated string because it is
722 -- possible that if lexing starts within the embedded entity, it will not be detected as such,
723 -- so a child to parent transition cannot happen. For example, the following Ruby code will
724 -- not be tagged correctly:
725 --
726 -- ```ruby
727 -- sum = "1 + 2 = #{1 + 2}"
728 -- ```
729 --
730 -- Also, there is the potential for recursion for languages embedding themselves within themselves.
731 --
732 -- #### Troubleshooting
733 --
734 -- Errors in lexers can be tricky to debug. Lexers print Lua errors to `io.stderr` and `_G.print()`
735 -- statements to `io.stdout`. Running your editor from a terminal is the easiest way to see
736 -- errors as they occur.
737 --
738 -- #### Risks
739 --
740 -- Poorly written lexers have the ability to crash Scintilla (and thus its containing application),
741 -- so unsaved data might be lost. However, I have only observed these crashes in early lexer
742 -- development, when syntax errors or pattern errors are present. Once the lexer actually
743 -- starts processing and tagging text (either correctly or incorrectly, it does not matter),
744 -- I have not observed any crashes.
745 --
746 -- #### Acknowledgements
747 --
748 -- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list that provided inspiration,
749 -- and thanks to Roberto Ierusalimschy for LPeg.
750 --
751 -- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
752 -- @module lexer
753 local M = {}
754
755 --- The tag name for default elements.
756 -- @field DEFAULT
757
758 --- The tag name for comment elements.
759 -- @field COMMENT
760
761 --- The tag name for string elements.
762 -- @field STRING
763
764 --- The tag name for number elements.
765 -- @field NUMBER
766
767 --- The tag name for keyword elements.
768 -- @field KEYWORD
769
770 --- The tag name for identifier elements.
771 -- @field IDENTIFIER
772
773 --- The tag name for operator elements.
774 -- @field OPERATOR
775
776 --- The tag name for error elements.
777 -- @field ERROR
778
779 --- The tag name for preprocessor elements.
780 -- @field PREPROCESSOR
781
782 --- The tag name for constant elements.
783 -- @field CONSTANT
784
785 --- The tag name for variable elements.
786 -- @field VARIABLE
787
788 --- The tag name for function elements.
789 -- @field FUNCTION
790
791 --- The tag name for class elements.
792 -- @field CLASS
793
794 --- The tag name for type elements.
795 -- @field TYPE
796
797 --- The tag name for label elements.
798 -- @field LABEL
799
800 --- The tag name for regex elements.
801 -- @field REGEX
802
803 --- The tag name for embedded elements.
804 -- @field EMBEDDED
805
806 --- The tag name for builtin function elements.
807 -- @field FUNCTION_BUILTIN
808
809 --- The tag name for builtin constant elements.
810 -- @field CONSTANT_BUILTIN
811
812 --- The tag name for function method elements.
813 -- @field FUNCTION_METHOD
814
815 --- The tag name for function tag elements, typically in markup.
816 -- @field TAG
817
818 --- The tag name for function attribute elements, typically in markup.
819 -- @field ATTRIBUTE
820
821 --- The tag name for builtin variable elements.
822 -- @field VARIABLE_BUILTIN
823
824 --- The tag name for heading elements, typically in markup.
825 -- @field HEADING
826
827 --- The tag name for bold elements, typically in markup.
828 -- @field BOLD
829
830 --- The tag name for builtin italic elements, typically in markup.
831 -- @field ITALIC
832
833 --- The tag name for underlined elements, typically in markup.
834 -- @field UNDERLINE
835
836 --- The tag name for code elements, typically in markup.
837 -- @field CODE
838
839 --- The tag name for link elements, typically in markup.
840 -- @field LINK
841
842 --- The tag name for reference elements, typically in markup.
843 -- @field REFERENCE
844
845 --- The tag name for annotation elements.
846 -- @field ANNOTATION
847
848 --- The tag name for list item elements, typically in markup.
849 -- @field LIST
850
851 --- The initial (root) fold level.
852 -- @field FOLD_BASE
853
854 --- Bit-flag indicating that the line is blank.
855 -- @field FOLD_BLANK
856
857 --- Bit-flag indicating the line is fold point.
858 -- @field FOLD_HEADER
859
860 -- This comment is needed for LDoc to process the previous field.
861
862 if not lpeg then lpeg = require('lpeg') end -- Scintillua's Lua environment defines _G.lpeg
863 local lpeg = lpeg
864 local P, R, S, V, B = lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.B
865 local Ct, Cc, Cp, Cmt, C = lpeg.Ct, lpeg.Cc, lpeg.Cp, lpeg.Cmt, lpeg.C
866
867 lpeg.setmaxstack(2048) -- the default of 400 is too low for complex grammars
868
869 --- Default tags.
870 local default = {
871 'whitespace', 'comment', 'string', 'number', 'keyword', 'identifier', 'operator', 'error',
872 'preprocessor', 'constant', 'variable', 'function', 'class', 'type', 'label', 'regex', 'embedded',
873 'function.builtin', 'constant.builtin', 'function.method', 'tag', 'attribute', 'variable.builtin',
874 'heading', 'bold', 'italic', 'underline', 'code', 'link', 'reference', 'annotation', 'list'
875 }
876 for _, name in ipairs(default) do M[name:upper():gsub('%.', '_')] = name end
877 --- Names for predefined Scintilla styles.
878 -- Having these here simplifies style number handling between Scintillua and Scintilla.
879 local predefined = {
880 'default', 'line.number', 'brace.light', 'brace.bad', 'control.char', 'indent.guide', 'call.tip',
881 'fold.display.text'
882 }
883 for _, name in ipairs(predefined) do M[name:upper():gsub('%.', '_')] = name end
884
885 --- Returns a tagged pattern.
886 -- @param lexer Lexer to tag the pattern in.
887 -- @param name String name to use for the tag. If it is not a predefined tag name
888 -- (`lexer.[A-Z_]+`), its Scintilla style will likely need to be defined by the editor or
889 -- theme using this lexer.
890 -- @param patt LPeg pattern to tag.
891 -- @usage local number = lex:tag(lexer.NUMBER, lexer.number)
892 -- @usage local addition = lex:tag('addition', '+' * lexer.word)
893 function M.tag(lexer, name, patt)
894 if not lexer._TAGS then
895 -- Create the initial maps for tag names to style numbers and styles.
896 local tags = {}
897 for i, name in ipairs(default) do tags[name], tags[i] = i, name end
898 for i, name in ipairs(predefined) do tags[name], tags[i + 32] = i + 32, name end
899 lexer._TAGS, lexer._num_styles = tags, #default + 1
900 lexer._extra_tags = {}
901 end
902 if not assert(lexer._TAGS, 'not a lexer instance')[name] then
903 local num_styles = lexer._num_styles
904 if num_styles == 33 then num_styles = num_styles + 8 end -- skip predefined
905 assert(num_styles <= 256, 'too many styles defined (256 MAX)')
906 lexer._TAGS[name], lexer._TAGS[num_styles], lexer._num_styles = num_styles, name, num_styles + 1
907 lexer._extra_tags[name] = true
908 -- If the lexer is a proxy or a child that embedded itself, make this tag name known to
909 -- the parent lexer.
910 if lexer._lexer then lexer._lexer:tag(name, false) end
911 end
912 return Cc(name) * (P(patt) / 0) * Cp()
913 end
914
915 --- Returns a unique grammar rule name for one of the word lists in a lexer.
916 -- @param lexer Lexer to use.
917 -- @param i *i*th word list to get.
918 local function word_list_id(lexer, i) return lexer._name .. '_wordlist' .. i end
919
920 --- Returns a pattern that matches a word in a word list.
921 -- This is a convenience function for simplifying a set of ordered choice word patterns and
922 -- potentially allowing downstream users to configure word lists.
923 -- @param[opt] lexer Lexer to match a word in a word list for. This parameter may be omitted
924 -- for lexer-agnostic matching.
925 -- @param word_list Either a string name of the word list to match from if *lexer* is given, or,
926 -- if *lexer* is omitted, a table of words or a string list of words separated by spaces. If a
927 -- word list name was given and there is ultimately no word list set via `lex:set_word_list()`,
928 -- no error will be raised, but the returned pattern will not match anything.
929 -- @param[opt=false] case_insensitive Match the word case-insensitively.
930 -- @usage lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD)))
931 -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match{'foo', 'bar', 'baz'})
932 -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match({'foo-bar', 'foo-baz',
933 -- 'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, true))
934 -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match('foo bar baz'))
935 function M.word_match(lexer, word_list, case_insensitive)
936 if type(lexer) == 'table' and getmetatable(lexer) then
937 if lexer._lexer then
938 -- If this lexer is a proxy (e.g. rails), get the true parent (ruby) in order to get the
939 -- parent's word list. If this lexer is a child embedding itself (e.g. php), continue
940 -- getting its word list, not the parent's (html).
941 local parent = lexer._lexer
942 if not parent._CHILDREN or not parent._CHILDREN[lexer] then lexer = parent end
943 end
944
945 if not lexer._WORDLISTS then lexer._WORDLISTS = {case_insensitive = {}} end
946 local i = lexer._WORDLISTS[word_list] or #lexer._WORDLISTS + 1
947 lexer._WORDLISTS[word_list], lexer._WORDLISTS[i] = i, '' -- empty placeholder word list
948 lexer._WORDLISTS.case_insensitive[i] = case_insensitive
949 return V(word_list_id(lexer, i))
950 end
951
952 -- Lexer-agnostic word match.
953 word_list, case_insensitive = lexer, word_list
954
955 if type(word_list) == 'string' then
956 local words = word_list -- space-separated list of words
957 word_list = {}
958 for word in words:gmatch('%S+') do word_list[#word_list + 1] = word end
959 end
960
961 local word_chars = M.alnum + '_'
962 local extra_chars = ''
963 for _, word in ipairs(word_list) do
964 word_list[case_insensitive and word:lower() or word] = true
965 for char in word:gmatch('[^%w_%s]') do
966 if not extra_chars:find(char, 1, true) then extra_chars = extra_chars .. char end
967 end
968 end
969 if extra_chars ~= '' then word_chars = word_chars + S(extra_chars) end
970
971 -- Optimize small word sets as ordered choice. "Small" is arbitrary.
972 if #word_list <= 6 and not case_insensitive then
973 local choice = P(false)
974 for _, word in ipairs(word_list) do choice = choice + word:match('%S+') end
975 return choice * -word_chars
976 end
977
978 return Cmt(word_chars^1, function(input, index, word)
979 if case_insensitive then word = word:lower() end
980 return word_list[word]
981 end)
982 end
983
984 --- Sets the words in a lexer's word list.
985 -- This only has an effect if the lexer uses `lexer.word_match()` to reference the given list.
986 -- @param lexer Lexer to add a word list to.
987 -- @param name String name or number of the word list to set.
988 -- @param word_list Table of words or a string list of words separated by
989 -- spaces. Case-insensitivity is specified by a `lexer.word_match()` reference to this list.
990 -- @param[opt=false] append Append *word_list* to an existing word list (if any).
991 function M.set_word_list(lexer, name, word_list, append)
992 if word_list == 'scintillua' then return end -- for SciTE
993 if lexer._lexer then
994 -- If this lexer is a proxy (e.g. rails), get the true parent (ruby) in order to set the
995 -- parent's word list. If this lexer is a child embedding itself (e.g. php), continue
996 -- setting its word list, not the parent's (html).
997 local parent = lexer._lexer
998 if not parent._CHILDREN or not parent._CHILDREN[lexer] then lexer = parent end
999 end
1000
1001 assert(lexer._WORDLISTS, 'lexer has no word lists')
1002 local i = tonumber(lexer._WORDLISTS[name]) or name -- lexer._WORDLISTS[name] --> i
1003 if type(i) ~= 'number' or i > #lexer._WORDLISTS then return end -- silently return
1004
1005 if type(word_list) == 'string' then
1006 local list = {}
1007 for word in word_list:gmatch('%S+') do list[#list + 1] = word end
1008 word_list = list
1009 end
1010
1011 if not append or lexer._WORDLISTS[i] == '' then
1012 lexer._WORDLISTS[i] = word_list
1013 else
1014 local list = lexer._WORDLISTS[i]
1015 for _, word in ipairs(word_list) do list[#list + 1] = word end
1016 end
1017
1018 lexer._grammar_table = nil -- invalidate
1019 end
1020
1021 --- Adds a rule to a lexer.
1022 -- @param lexer Lexer to add *rule* to.
1023 -- @param id String id associated with this rule. It does not have to be the same as the name
1024 -- passed to `lex:tag()`.
1025 -- @param rule LPeg pattern of the rule to add.
1026 function M.add_rule(lexer, id, rule)
1027 if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
1028 if not lexer._rules then lexer._rules = {} end
1029 if id == 'whitespace' and lexer._rules[id] then -- legacy
1030 lexer:modify_rule(id, rule)
1031 return
1032 end
1033 lexer._rules[#lexer._rules + 1], lexer._rules[id] = id, rule
1034 lexer._grammar_table = nil -- invalidate
1035 end
1036
1037 --- Replaces a lexer's existing rule.
1038 -- @param lexer Lexer to modify.
1039 -- @param id String id of the rule to replace.
1040 -- @param rule LPeg pattern of the new rule.
1041 function M.modify_rule(lexer, id, rule)
1042 if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
1043 assert(lexer._rules[id], 'rule does not exist')
1044 lexer._rules[id] = rule
1045 lexer._grammar_table = nil -- invalidate
1046 end
1047
1048 --- Returns a unique grammar rule name for one of the rule names in a lexer.
1049 local function rule_id(lexer, name) return lexer._name .. '.' .. name end
1050
1051 --- Returns a lexer's rule.
1052 -- @param lexer Lexer to fetch a rule from.
1053 -- @param id String id of the rule to fetch.
1054 function M.get_rule(lexer, id)
1055 if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
1056 if id == 'whitespace' then return V(rule_id(lexer, id)) end -- special case
1057 return assert(lexer._rules[id], 'rule does not exist')
1058 end
1059
1060 --- Embeds a child lexer into a parent lexer.
1061 -- @param lexer Parent lexer.
1062 -- @param child Child lexer.
1063 -- @param start_rule LPeg pattern matches the beginning of the child lexer.
1064 -- @param end_rule LPeg pattern that matches the end of the child lexer.
1065 -- @usage html:embed(css, css_start_rule, css_end_rule)
1066 -- @usage html:embed(lex, php_start_rule, php_end_rule) -- from php lexer
1067 function M.embed(lexer, child, start_rule, end_rule)
1068 if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
1069
1070 -- Add child rules.
1071 assert(child._rules, 'cannot embed lexer with no rules')
1072 if not child._start_rules then child._start_rules = {} end
1073 if not child._end_rules then child._end_rules = {} end
1074 child._start_rules[lexer], child._end_rules[lexer] = start_rule, end_rule
1075 if not lexer._CHILDREN then lexer._CHILDREN = {} end
1076 lexer._CHILDREN[#lexer._CHILDREN + 1], lexer._CHILDREN[child] = child, true
1077
1078 -- Add child tags.
1079 for name in pairs(child._extra_tags) do lexer:tag(name, true) end
1080
1081 -- Add child fold symbols.
1082 if child._fold_points then
1083 for tag_name, symbols in pairs(child._fold_points) do
1084 if tag_name ~= '_symbols' then
1085 for symbol, v in pairs(symbols) do lexer:add_fold_point(tag_name, symbol, v) end
1086 end
1087 end
1088 end
1089
1090 -- Add child word lists.
1091 if child._WORDLISTS then
1092 for name, i in pairs(child._WORDLISTS) do
1093 if type(name) == 'string' and type(i) == 'number' then
1094 name = child._name .. '.' .. name
1095 lexer:word_match(name) -- for side effects
1096 lexer:set_word_list(name, child._WORDLISTS[i])
1097 end
1098 end
1099 end
1100
1101 child._lexer = lexer -- use parent's rules if child is embedding itself
1102 end
1103
1104 --- Adds a fold point to a lexer.
1105 -- @param lexer Lexer to add a fold point to.
1106 -- @param tag_name String tag name of fold point text.
1107 -- @param start_symbol String fold point start text.
1108 -- @param end_symbol Either string fold point end text, or a function that returns whether or
1109 -- not *start_symbol* is a beginning fold point (1), an ending fold point (-1), or not a fold
1110 -- point at all (0). If it is a function, it is passed the following arguments:
1111 -- - `text`: The text being processed for fold points.
1112 -- - `pos`: The position in *text* of the beginning of the line currently being processed.
1113 -- - `line`: The text of the line currently being processed.
1114 -- - `s`: The position of *start_symbol* in *line*.
1115 -- - `symbol`: *start_symbol* itself.
1116 -- @usage lex:add_fold_point(lexer.OPERATOR, '{', '}')
1117 -- @usage lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
1118 -- @usage lex:add_fold_point('custom', function(text, pos, line, s, symbol) ... end)
1119 function M.add_fold_point(lexer, tag_name, start_symbol, end_symbol)
1120 if not start_symbol and not end_symbol then return end -- from legacy fold_consecutive_lines()
1121 if not lexer._fold_points then lexer._fold_points = {_symbols = {}} end
1122 local symbols = lexer._fold_points._symbols
1123 if not lexer._fold_points[tag_name] then lexer._fold_points[tag_name] = {} end
1124 if lexer._case_insensitive_fold_points then
1125 start_symbol = start_symbol:lower()
1126 if type(end_symbol) == 'string' then end_symbol = end_symbol:lower() end
1127 end
1128
1129 if type(end_symbol) == 'string' then
1130 if not symbols[end_symbol] then symbols[#symbols + 1], symbols[end_symbol] = end_symbol, true end
1131 lexer._fold_points[tag_name][start_symbol] = 1
1132 lexer._fold_points[tag_name][end_symbol] = -1
1133 else
1134 lexer._fold_points[tag_name][start_symbol] = end_symbol -- function or int
1135 end
1136 if not symbols[start_symbol] then
1137 symbols[#symbols + 1], symbols[start_symbol] = start_symbol, true
1138 end
1139
1140 -- If the lexer is a proxy or a child that embedded itself, copy this fold point to the
1141 -- parent lexer.
1142 if lexer._lexer then lexer._lexer:add_fold_point(tag_name, start_symbol, end_symbol) end
1143 end
1144
1145 --- Recursively adds the rules for a lexer and its children to a grammar.
1146 -- @param g Grammar to add rules to.
1147 -- @param lexer Lexer whose rules to add.
1148 local function add_lexer(g, lexer)
1149 local rule = P(false)
1150
1151 -- Add this lexer's rules.
1152 for _, name in ipairs(lexer._rules) do
1153 local id = rule_id(lexer, name)
1154 g[id] = lexer._rules[name] -- ['lua.keyword'] = keyword_patt
1155 rule = rule + V(id) -- V('lua.keyword') + V('lua.function') + V('lua.constant') + ...
1156 end
1157 local any_id = lexer._name .. '_fallback'
1158 g[any_id] = lexer:tag(M.DEFAULT, M.any) -- ['lua_fallback'] = any_char
1159 rule = rule + V(any_id) -- ... + V('lua.operator') + V('lua_fallback')
1160
1161 -- Add this lexer's word lists.
1162 if lexer._WORDLISTS then
1163 for i = 1, #lexer._WORDLISTS do
1164 local id = word_list_id(lexer, i)
1165 local list, case_insensitive = lexer._WORDLISTS[i], lexer._WORDLISTS.case_insensitive[i]
1166 local patt = list ~= '' and M.word_match(list, case_insensitive) or P(false)
1167 g[id] = patt -- ['lua_wordlist.1'] = word_match_patt or P(false)
1168 end
1169 end
1170
1171 -- Add this child lexer's end rules.
1172 if lexer._end_rules then
1173 for parent, end_rule in pairs(lexer._end_rules) do
1174 local back_id = lexer._name .. '_to_' .. parent._name
1175 g[back_id] = end_rule -- ['css_to_html'] = css_end_rule
1176 rule = rule - V(back_id) + -- (V('css.property') + ... + V('css_fallback')) - V('css_to_html')
1177 V(back_id) * V(parent._name) -- V('css_to_html') * V('html')
1178 end
1179 end
1180
1181 -- Add this child lexer's start rules.
1182 if lexer._start_rules then
1183 for parent, start_rule in pairs(lexer._start_rules) do
1184 local to_id = parent._name .. '_to_' .. lexer._name
1185 g[to_id] = start_rule * V(lexer._name) -- ['html_to_css'] = css_start_rule * V('css')
1186 end
1187 end
1188
1189 -- Finish adding this lexer's rules.
1190 local rule_id = lexer._name .. '_rule'
1191 g[rule_id] = rule -- ['lua_rule'] = V('lua.keyword') + ... + V('lua_fallback')
1192 g[lexer._name] = V(rule_id)^0 -- ['lua'] = V('lua_rule')^0
1193
1194 -- Add this lexer's children's rules.
1195 -- TODO: preprocessor languages like PHP should also embed themselves into their parent's
1196 -- children like HTML's CSS and Javascript.
1197 if not lexer._CHILDREN then return end
1198 for _, child in ipairs(lexer._CHILDREN) do
1199 add_lexer(g, child)
1200 local to_id = lexer._name .. '_to_' .. child._name
1201 g[rule_id] = V(to_id) + g[rule_id] -- ['html_rule'] = V('html_to_css') + V('html.comment') + ...
1202
1203 -- Add a child's inherited parent's rules (e.g. rhtml parent with rails child inheriting ruby).
1204 if child._parent_name then
1205 local name = child._name
1206 child._name = child._parent_name -- ensure parent and transition rule names are correct
1207 add_lexer(g, child)
1208 child._name = name -- restore
1209 local to_id = lexer._name .. '_to_' .. child._parent_name
1210 g[rule_id] = V(to_id) + g[rule_id] -- ['html_rule'] = V('html_to_ruby') + V('html.comment') + ...
1211 end
1212 end
1213 end
1214
1215 --- Returns the grammar for a lexer and its initial rule, (re)constructing it if necessary.
1216 -- @param lexer Lexer to build a grammar for.
1217 -- @param init_style Current style number. Multiple-language lexers use this to determine which
1218 -- language to start lexing in.
1219 local function build_grammar(lexer, init_style)
1220 if not lexer._rules then return end
1221 if not lexer._initial_rule then lexer._initial_rule = lexer._parent_name or lexer._name end
1222 if not lexer._grammar_table then
1223 local grammar = {lexer._initial_rule}
1224 if not lexer._parent_name then
1225 add_lexer(grammar, lexer)
1226 -- {'lua',
1227 -- ['lua.keyword'] = patt, ['lua.function'] = patt, ...,
1228 -- ['lua_wordlist.1'] = patt, ['lua_wordlist.2'] = patt, ...,
1229 -- ['lua_rule'] = V('lua.keyword') + ... + V('lua_fallback'),
1230 -- ['lua'] = V('lua_rule')^0
1231 -- }
1232 -- {'html'
1233 -- ['html.comment'] = patt, ['html.doctype'] = patt, ...,
1234 -- ['html_wordlist.1'] = patt, ['html_wordlist.2'] = patt, ...,
1235 -- ['html_rule'] = V('html_to_css') * V('css') + V('html.comment') + ... + V('html_fallback'),
1236 -- ['html'] = V('html')^0,
1237 -- ['css.property'] = patt, ['css.value'] = patt, ...,
1238 -- ['css_wordlist.1'] = patt, ['css_wordlist.2'] = patt, ...,
1239 -- ['css_to_html'] = patt,
1240 -- ['css_rule'] = ((V('css.property') + ... + V('css_fallback')) - V('css_to_html')) +
1241 -- V('css_to_html') * V('html'),
1242 -- ['html_to_css'] = patt,
1243 -- ['css'] = V('css_rule')^0
1244 -- }
1245 else
1246 local name = lexer._name
1247 lexer._name = lexer._parent_name -- ensure parent and transition rule names are correct
1248 add_lexer(grammar, lexer)
1249 lexer._name = name -- restore
1250 -- {'html',
1251 -- ...
1252 -- ['html_rule'] = V('html_to_php') * V('php') + V('html_to_css') * V('css') +
1253 -- V('html.comment') + ... + V('html_fallback'),
1254 -- ...
1255 -- ['php.keyword'] = patt, ['php.type'] = patt, ...,
1256 -- ['php_wordlist.1'] = patt, ['php_wordlist.2'] = patt, ...,
1257 -- ['php_to_html'] = patt,
1258 -- ['php_rule'] = ((V('php.keyword') + ... + V('php_fallback')) - V('php_to_html')) +
1259 -- V('php_to_html') * V('html')
1260 -- ['html_to_php'] = patt,
1261 -- ['php'] = V('php_rule')^0
1262 -- }
1263 end
1264 lexer._grammar, lexer._grammar_table = Ct(P(grammar)), grammar
1265 end
1266
1267 -- For multilang lexers, build a new grammar whose initial rule is the current language
1268 -- if necessary. LPeg does not allow a variable initial rule.
1269 if lexer._CHILDREN then
1270 for style_num, tag in ipairs(lexer._TAGS) do
1271 if style_num == init_style then
1272 local lexer_name = tag:match('^whitespace%.(.+)$') or lexer._parent_name or lexer._name
1273 if lexer._initial_rule == lexer_name then break end
1274 if not lexer._grammar_table[lexer_name] then
1275 -- For proxy lexers like RHTML, the 'whitespace.rhtml' tag would produce the 'rhtml'
1276 -- lexer name, but there is no 'rhtml' rule. It should be the 'html' rule (parent)
1277 -- instead.
1278 lexer_name = lexer._parent_name or lexer._name
1279 end
1280 lexer._initial_rule = lexer_name
1281 lexer._grammar_table[1] = lexer._initial_rule
1282 lexer._grammar = Ct(P(lexer._grammar_table))
1283 return lexer._grammar
1284 end
1285 end
1286 end
1287
1288 return lexer._grammar
1289 end
1290
1291 --- Lexes a chunk of text.
1292 -- @param lexer Lexer to lex text with.
1293 -- @param text String text to lex, which may be a partial chunk, single line, or full text.
1294 -- @param init_style Number of the text's current style. Multiple-language lexers use this to
1295 -- determine which language to start lexing in.
1296 -- @return table of tag names and positions.
1297 -- @usage lex:lex(...) --> {'keyword', 2, 'whitespace.lua', 3, 'identifier', 7}
1298 function M.lex(lexer, text, init_style)
1299 local grammar = build_grammar(lexer, init_style)
1300 if not grammar then return {M.DEFAULT, #text + 1} end
1301 if M._standalone then M._text, M.line_state = text, {} end
1302
1303 if lexer._lex_by_line then
1304 local line_from_position = M.line_from_position
1305 local function append(tags, line_tags, offset)
1306 for i = 1, #line_tags, 2 do
1307 tags[#tags + 1], tags[#tags + 2] = line_tags[i], line_tags[i + 1] + offset
1308 end
1309 end
1310 local tags = {}
1311 local offset = 0
1312 rawset(M, 'line_from_position', function(pos) return line_from_position(pos + offset) end)
1313 for line in text:gmatch('[^\r\n]*\r?\n?') do
1314 local line_tags = grammar:match(line)
1315 if line_tags then append(tags, line_tags, offset) end
1316 offset = offset + #line
1317 -- Use the default tag to the end of the line if none was specified.
1318 if tags[#tags] ~= offset + 1 then
1319 tags[#tags + 1], tags[#tags + 2] = 'default', offset + 1
1320 end
1321 end
1322 rawset(M, 'line_from_position', line_from_position)
1323 return tags
1324 end
1325
1326 return grammar:match(text)
1327 end
1328
1329 --- Determines fold points in a chunk of text.
1330 -- @param lexer Lexer to fold text with.
1331 -- @param text String text to fold, which may be a partial chunk, single line, or full text.
1332 -- @param start_line Line number *text* starts on, counting from 1.
1333 -- @param start_level Fold level *text* starts with. It cannot be lower than `lexer.FOLD_BASE`
1334 -- (1024).
1335 -- @return table of line numbers mapped to fold levels
1336 -- @usage lex:fold(...) --> {[1] = 1024, [2] = 9216, [3] = 1025, [4] = 1025, [5] = 1024}
1337 function M.fold(lexer, text, start_line, start_level)
1338 if rawget(lexer, 'fold') then return rawget(lexer, 'fold')(lexer, text, start_line, start_level) end
1339 local folds = {}
1340 if text == '' then return folds end
1341 local fold = M.property_int['fold'] > 0
1342 local FOLD_BASE, FOLD_HEADER, FOLD_BLANK = M.FOLD_BASE, M.FOLD_HEADER, M.FOLD_BLANK
1343 if M._standalone then M._text, M.line_state = text, {} end
1344 if fold and lexer._fold_points then
1345 local lines = {}
1346 for p, l in (text .. '\n'):gmatch('()(.-)\r?\n') do lines[#lines + 1] = {p, l} end
1347 local fold_zero_sum_lines = M.property_int['fold.scintillua.on.zero.sum.lines'] > 0
1348 local fold_compact = M.property_int['fold.scintillua.compact'] > 0
1349 local fold_points = lexer._fold_points
1350 local fold_point_symbols = fold_points._symbols
1351 local style_at, fold_level = M.style_at, M.fold_level
1352 local line_num, prev_level = start_line, start_level
1353 local current_level = prev_level
1354 for _, captures in ipairs(lines) do
1355 local pos, line = captures[1], captures[2]
1356 if line ~= '' then
1357 if lexer._case_insensitive_fold_points then line = line:lower() end
1358 local ranges = {}
1359 local function is_valid_range(s, e)
1360 if not s or not e then return false end
1361 for i = 1, #ranges - 1, 2 do
1362 local range_s, range_e = ranges[i], ranges[i + 1]
1363 if s >= range_s and s <= range_e or e >= range_s and e <= range_e then
1364 return false
1365 end
1366 end
1367 ranges[#ranges + 1] = s
1368 ranges[#ranges + 1] = e
1369 return true
1370 end
1371 local level_decreased = false
1372 for _, symbol in ipairs(fold_point_symbols) do
1373 local word = not symbol:find('[^%w_]')
1374 local s, e = line:find(symbol, 1, true)
1375 while is_valid_range(s, e) do
1376 -- if not word or line:find('^%f[%w_]' .. symbol .. '%f[^%w_]', s) then
1377 local word_before = s > 1 and line:find('^[%w_]', s - 1)
1378 local word_after = line:find('^[%w_]', e + 1)
1379 if not word or not (word_before or word_after) then
1380 local style_name = style_at[pos + s - 1]
1381 local symbols = fold_points[style_name]
1382 if not symbols and style_name:find('%.') then
1383 symbols = fold_points[style_name:match('^[^.]+')]
1384 end
1385 local level = symbols and symbols[symbol]
1386 if type(level) == 'function' then
1387 level = level(text, pos, line, s, symbol)
1388 end
1389 if type(level) == 'number' then
1390 current_level = current_level + level
1391 if level < 0 and current_level < prev_level then
1392 -- Potential zero-sum line. If the level were to go back up on the same line,
1393 -- the line may be marked as a fold header.
1394 level_decreased = true
1395 end
1396 end
1397 end
1398 s, e = line:find(symbol, s + 1, true)
1399 end
1400 end
1401 folds[line_num] = prev_level
1402 if current_level > prev_level then
1403 folds[line_num] = prev_level + FOLD_HEADER
1404 elseif level_decreased and current_level == prev_level and fold_zero_sum_lines then
1405 if line_num > start_line then
1406 folds[line_num] = prev_level - 1 + FOLD_HEADER
1407 else
1408 -- Typing within a zero-sum line.
1409 local level = fold_level[line_num] - 1
1410 if level > FOLD_HEADER then level = level - FOLD_HEADER end
1411 if level > FOLD_BLANK then level = level - FOLD_BLANK end
1412 folds[line_num] = level + FOLD_HEADER
1413 current_level = current_level + 1
1414 end
1415 end
1416 if current_level < FOLD_BASE then current_level = FOLD_BASE end
1417 prev_level = current_level
1418 else
1419 folds[line_num] = prev_level + (fold_compact and FOLD_BLANK or 0)
1420 end
1421 line_num = line_num + 1
1422 end
1423 elseif fold and
1424 (lexer._fold_by_indentation or M.property_int['fold.scintillua.by.indentation'] > 0) then
1425 -- Indentation based folding.
1426 -- Calculate indentation per line.
1427 local indentation = {}
1428 for indent, line in (text .. '\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
1429 indentation[#indentation + 1] = line ~= '' and #indent
1430 end
1431 -- Find the first non-blank line before start_line. If the current line is indented, make
1432 -- that previous line a header and update the levels of any blank lines inbetween. If the
1433 -- current line is blank, match the level of the previous non-blank line.
1434 local current_level = start_level
1435 for i = start_line, 1, -1 do
1436 local level = M.fold_level[i]
1437 if level >= FOLD_HEADER then level = level - FOLD_HEADER end
1438 if level < FOLD_BLANK then
1439 local indent = M.indent_amount[i]
1440 if indentation[1] and indentation[1] > indent then
1441 folds[i] = FOLD_BASE + indent + FOLD_HEADER
1442 for j = i + 1, start_line - 1 do folds[j] = start_level + FOLD_BLANK end
1443 elseif not indentation[1] then
1444 current_level = FOLD_BASE + indent
1445 end
1446 break
1447 end
1448 end
1449 -- Iterate over lines, setting fold numbers and fold flags.
1450 for i = 1, #indentation do
1451 if indentation[i] then
1452 current_level = FOLD_BASE + indentation[i]
1453 folds[start_line + i - 1] = current_level
1454 for j = i + 1, #indentation do
1455 if indentation[j] then
1456 if FOLD_BASE + indentation[j] > current_level then
1457 folds[start_line + i - 1] = current_level + FOLD_HEADER
1458 current_level = FOLD_BASE + indentation[j] -- for any blanks below
1459 end
1460 break
1461 end
1462 end
1463 else
1464 folds[start_line + i - 1] = current_level + FOLD_BLANK
1465 end
1466 end
1467 else
1468 -- No folding, reset fold levels if necessary.
1469 local current_line = start_line
1470 for _ in text:gmatch('\r?\n') do
1471 folds[current_line] = start_level
1472 current_line = current_line + 1
1473 end
1474 end
1475 return folds
1476 end
1477
1478 --- Creates a new lexer.
1479 -- @param name String lexer name. Use `...` to inherit from the file's name.
1480 -- @param[opt] opts Table of lexer options. Options currently supported:
1481 -- - `lex_by_line`: Only processes whole lines of text at a time (instead of arbitrary chunks
1482 -- of text). Line lexers cannot look ahead to subsequent lines. The default value is `false`.
1483 -- - `fold_by_indentation`: Calculate fold points based on changes in line indentation. The
1484 -- default value is `false`.
1485 -- - `case_insensitive_fold_points`: Fold points added via `lexer.add_fold_point()` should
1486 -- ignore case. The default value is `false`.
1487 -- - `no_user_word_lists`: Do not automatically allocate word lists that can be set by
1488 -- users. This should really only be set by non-programming languages like markup languages.
1489 -- - `inherit`: Lexer to inherit from. The default value is `nil`.
1490 -- @return lexer object
1491 -- @usage lexer.new(..., {inherit = lexer.load('html')}) -- name is 'rhtml' in rhtml.lua file
1492 function M.new(name, opts)
1493 local lexer = setmetatable({
1494 _name = assert(name, 'lexer name expected'), _lex_by_line = opts and opts['lex_by_line'],
1495 _fold_by_indentation = opts and opts['fold_by_indentation'],
1496 _case_insensitive_fold_points = opts and opts['case_insensitive_fold_points'],
1497 _no_user_word_lists = opts and opts['no_user_word_lists'], _lexer = opts and opts['inherit']
1498 }, {
1499 __index = {
1500 tag = M.tag, word_match = M.word_match, set_word_list = M.set_word_list,
1501 add_rule = M.add_rule, modify_rule = M.modify_rule, get_rule = M.get_rule,
1502 add_fold_point = M.add_fold_point, embed = M.embed, lex = M.lex, fold = M.fold, --
1503 add_style = function() end -- legacy
1504 }
1505 })
1506
1507 -- Add initial whitespace rule.
1508 -- Use a unique whitespace tag name since embedded lexing relies on these unique names.
1509 lexer:add_rule('whitespace', lexer:tag('whitespace.' .. name, M.space^1))
1510
1511 return lexer
1512 end
1513
1514 --- Creates a substitute for some Scintilla tables, functions, and fields that Scintillua
1515 -- depends on when using it as a standalone module.
1516 local function initialize_standalone_library()
1517 M.property = setmetatable({['scintillua.lexers'] = package.path:gsub('/%?%.lua', '/lexers')}, {
1518 __index = function() return '' end, __newindex = function(t, k, v) rawset(t, k, tostring(v)) end
1519 })
1520
1521 M.line_from_position = function(pos)
1522 local line = 1
1523 for s in M._text:gmatch('[^\n]*()') do
1524 if pos <= s then return line end
1525 line = line + 1
1526 end
1527 return line - 1 -- should not get to here
1528 end
1529
1530 M.text_range = function(pos, length) return M._text:sub(pos, pos + length - 1) end
1531
1532 --- Returns a line number's start and end positions.
1533 -- @param line Line number (1-based) to get the start and end positions of.
1534 local function get_line_range(line)
1535 local current_line = 1
1536 for s, e in M._text:gmatch('()[^\n]*()') do
1537 if current_line == line then return s, e end
1538 current_line = current_line + 1
1539 end
1540 return 1, 1 -- should not get to here
1541 end
1542
1543 M.line_start = setmetatable({}, {__index = function(_, line) return get_line_range(line) end})
1544 M.line_end = setmetatable({}, {
1545 __index = function(_, line) return select(2, get_line_range(line)) end
1546 })
1547
1548 M.indent_amount = setmetatable({}, {
1549 __index = function(_, line)
1550 local current_line = 1
1551 for s in M._text:gmatch('()[^\n]*') do
1552 if current_line == line then
1553 return #M._text:match('^[ \t]*', s):gsub('\t', string.rep(' ', 8))
1554 end
1555 current_line = current_line + 1
1556 end
1557 end
1558 })
1559
1560 M.FOLD_BASE, M.FOLD_HEADER, M.FOLD_BLANK = 0x400, 0x2000, 0x1000
1561
1562 M._standalone = true
1563 end
1564
1565 --- Searches for a lexer to load.
1566 -- This is a safe implementation of Lua 5.2's `package.searchpath()` function that does not
1567 -- require the package module to be loaded.
1568 -- @param name String lexer name to search for.
1569 -- @param path String list of ';'-separated paths to search for lexers in.
1570 -- @return path to a lexer or `nil` plus an error message
1571 local function searchpath(name, path)
1572 local tried = {}
1573 for part in path:gmatch('[^;]+') do
1574 local filename = part:gsub('%?', name)
1575 local ok, errmsg = loadfile(filename)
1576 if ok or not errmsg:find('cannot open') then return filename end
1577 tried[#tried + 1] = string.format("no file '%s'", filename)
1578 end
1579 return nil, table.concat(tried, '\n')
1580 end
1581
1582 --- Initializes or loads a lexer.
1583 -- Scintilla calls this function in order to load a lexer. Parent lexers also call this function
1584 -- in order to load child lexers and vice-versa. The user calls this function in order to load
1585 -- a lexer when using Scintillua as a Lua library.
1586 -- @param name String name of the lexing language.
1587 -- @param[opt] alt_name String alternate name of the lexing language. This is useful for
1588 -- embedding the same child lexer with multiple sets of start and end tags.
1589 -- @return lexer object
1590 function M.load(name, alt_name)
1591 assert(name, 'no lexer given')
1592 if not M.property then initialize_standalone_library() end
1593 if not M.property_int then
1594 -- Separate from initialize_standalone_library() so applications that choose to define
1595 -- M.property do not also have to define this.
1596 M.property_int = setmetatable({}, {
1597 __index = function(t, k) return tonumber(M.property[k]) or 0 end,
1598 __newindex = function() error('read-only property') end
1599 })
1600 end
1601
1602 -- Load the language lexer with its rules, tags, etc.
1603 local path = M.property['scintillua.lexers']:gsub(';', '/?.lua;') .. '/?.lua'
1604 local ro_lexer = setmetatable({
1605 WHITESPACE = 'whitespace.' .. (alt_name or name) -- legacy
1606 }, {__index = M})
1607 local env = {
1608 'assert', 'error', 'ipairs', 'math', 'next', 'pairs', 'print', 'select', 'string', 'table',
1609 'tonumber', 'tostring', 'type', 'utf8', '_VERSION', lexer = ro_lexer, lpeg = lpeg, --
1610 require = function() return ro_lexer end -- legacy
1611 }
1612 for _, name in ipairs(env) do env[name] = _G[name] end
1613 local lexer = assert(loadfile(assert(searchpath(name, path)), 't', env))(alt_name or name)
1614 assert(lexer, string.format("'%s.lua' did not return a lexer", name))
1615
1616 -- If the lexer is a proxy or a child that embedded itself, set the parent to be the main
1617 -- lexer. Keep a reference to the old parent name since embedded child start and end rules
1618 -- reference and use that name.
1619 if lexer._lexer then
1620 lexer = lexer._lexer
1621 lexer._parent_name, lexer._name = lexer._name, alt_name or name
1622 end
1623
1624 M.property['scintillua.comment.' .. (alt_name or name)] = M.property['scintillua.comment']
1625
1626 return lexer
1627 end
1628
1629 --- Returns a table of all known lexer names.
1630 -- This function is not available to lexers and requires the LuaFileSystem (`lfs`) module to
1631 -- be available.
1632 -- @param[opt] path String list of ';'-separated directories to search for lexers in. The
1633 -- default value is Scintillua's configured lexer path.
1634 function M.names(path)
1635 local lfs = require('lfs')
1636 if not path then path = M.property and M.property['scintillua.lexers'] end
1637 if not path or path == '' then
1638 for part in package.path:gmatch('[^;]+') do
1639 local dir = part:match('^(.-[/\\]?lexers)[/\\]%?%.lua$')
1640 if dir then
1641 path = dir
1642 break
1643 end
1644 end
1645 end
1646 local lexers = {}
1647 for dir in assert(path, 'lexer path not configured or found'):gmatch('[^;]+') do
1648 if lfs.attributes(dir, 'mode') == 'directory' then
1649 for file in lfs.dir(dir) do
1650 local name = file:match('^(.+)%.lua$')
1651 if name and name ~= 'lexer' and not lexers[name] then
1652 lexers[#lexers + 1], lexers[name] = name, true
1653 end
1654 end
1655 end
1656 end
1657 table.sort(lexers)
1658 return lexers
1659 end
1660
1661 --- Map of file extensions, without the '.' prefix, to their associated lexer names.
1662 -- @usage lexer.detect_extensions.luadoc = 'lua'
1663 M.detect_extensions = {}
1664
1665 --- Map of first-line patterns to their associated lexer names.
1666 -- These are Lua string patterns, not LPeg patterns.
1667 -- @usage lexer.detect_patterns['^#!.+/zsh'] = 'bash'
1668 M.detect_patterns = {}
1669
1670 --- Returns the name of the lexer often associated a particular filename and/or file content.
1671 -- @param[opt] filename String filename to inspect. The default value is read from the
1672 -- "lexer.scintillua.filename" property.
1673 -- @param[optchain] line String first content line, such as a shebang line. The default value
1674 -- is read from the "lexer.scintillua.line" property.
1675 -- @return string lexer name to pass to `lexer.load()`, or `nil` if none was detected
1676 function M.detect(filename, line)
1677 if not filename then filename = M.property and M.property['lexer.scintillua.filename'] or '' end
1678 if not line then line = M.property and M.property['lexer.scintillua.line'] or '' end
1679
1680 -- Locally scoped in order to avoid persistence in memory.
1681 local extensions = {
1682 as = 'actionscript', asc = 'actionscript', --
1683 adb = 'ada', ads = 'ada', --
1684 g = 'antlr', g4 = 'antlr', --
1685 ans = 'apdl', inp = 'apdl', mac = 'apdl', --
1686 apl = 'apl', --
1687 applescript = 'applescript', --
1688 asm = 'asm', ASM = 'asm', s = 'asm', S = 'asm', --
1689 asa = 'asp', asp = 'asp', hta = 'asp', --
1690 ahk = 'autohotkey', --
1691 au3 = 'autoit', a3x = 'autoit', --
1692 awk = 'awk', --
1693 bat = 'batch', cmd = 'batch', --
1694 bib = 'bibtex', --
1695 boo = 'boo', --
1696 cs = 'csharp', --
1697 c = 'c', C = 'c', cc = 'cpp', cpp = 'cpp', cxx = 'cpp', ['c++'] = 'cpp', h = 'cpp', hh = 'cpp',
1698 hpp = 'cpp', hxx = 'cpp', ['h++'] = 'cpp', --
1699 ck = 'chuck', --
1700 clj = 'clojure', cljs = 'clojure', cljc = 'clojure', edn = 'clojure', --
1701 ['CMakeLists.txt'] = 'cmake', cmake = 'cmake', ['cmake.in'] = 'cmake', ctest = 'cmake',
1702 ['ctest.in'] = 'cmake', --
1703 coffee = 'coffeescript', --
1704 cr = 'crystal', --
1705 css = 'css', --
1706 cu = 'cuda', cuh = 'cuda', --
1707 d = 'd', di = 'd', --
1708 dart = 'dart', --
1709 desktop = 'desktop', --
1710 diff = 'diff', patch = 'diff', --
1711 Dockerfile = 'dockerfile', --
1712 dot = 'dot', --
1713 e = 'eiffel', eif = 'eiffel', --
1714 ex = 'elixir', exs = 'elixir', --
1715 elm = 'elm', --
1716 erl = 'erlang', hrl = 'erlang', --
1717 fs = 'fsharp', --
1718 factor = 'factor', --
1719 fan = 'fantom', --
1720 dsp = 'faust', --
1721 fnl = 'fennel', --
1722 fish = 'fish', --
1723 forth = 'forth', frt = 'forth', --
1724 f = 'fortran', ['for'] = 'fortran', ftn = 'fortran', fpp = 'fortran', f77 = 'fortran',
1725 f90 = 'fortran', f95 = 'fortran', f03 = 'fortran', f08 = 'fortran', --
1726 fstab = 'fstab', --
1727 gd = 'gap', gi = 'gap', gap = 'gap', --
1728 gmi = 'gemini', --
1729 po = 'gettext', pot = 'gettext', --
1730 feature = 'gherkin', --
1731 gleam = 'gleam', --
1732 glslf = 'glsl', glslv = 'glsl', --
1733 dem = 'gnuplot', plt = 'gnuplot', --
1734 go = 'go', --
1735 groovy = 'groovy', gvy = 'groovy', --
1736 gtkrc = 'gtkrc', --
1737 ha = 'hare', --
1738 hs = 'haskell', --
1739 htm = 'html', html = 'html', shtm = 'html', shtml = 'html', xhtml = 'html', vue = 'html', --
1740 icn = 'icon', --
1741 idl = 'idl', odl = 'idl', --
1742 ni = 'inform', --
1743 cfg = 'ini', cnf = 'ini', inf = 'ini', ini = 'ini', reg = 'ini', --
1744 io = 'io_lang', --
1745 bsh = 'java', java = 'java', --
1746 js = 'javascript', jsfl = 'javascript', --
1747 jq = 'jq', --
1748 json = 'json', --
1749 jsp = 'jsp', --
1750 jl = 'julia', --
1751 bbl = 'latex', dtx = 'latex', ins = 'latex', ltx = 'latex', tex = 'latex', sty = 'latex', --
1752 ledger = 'ledger', journal = 'ledger', --
1753 less = 'less', --
1754 lily = 'lilypond', ly = 'lilypond', --
1755 cl = 'lisp', el = 'lisp', lisp = 'lisp', lsp = 'lisp', --
1756 litcoffee = 'litcoffee', --
1757 lgt = 'logtalk', --
1758 lua = 'lua', --
1759 GNUmakefile = 'makefile', iface = 'makefile', mak = 'makefile', makefile = 'makefile',
1760 Makefile = 'makefile', --
1761 md = 'markdown', markdown = 'markdown', --
1762 ['meson.build'] = 'meson', --
1763 moon = 'moonscript', --
1764 myr = 'myrddin', --
1765 n = 'nemerle', --
1766 link = 'networkd', network = 'networkd', netdev = 'networkd', --
1767 nim = 'nim', --
1768 nix = 'nix', --
1769 nsh = 'nsis', nsi = 'nsis', nsis = 'nsis', --
1770 obs = 'objeck', --
1771 m = 'objective_c', mm = 'objective_c', objc = 'objective_c', --
1772 caml = 'caml', ml = 'caml', mli = 'caml', mll = 'caml', mly = 'caml', --
1773 org = 'org', --
1774 dpk = 'pascal', dpr = 'pascal', p = 'pascal', pas = 'pascal', --
1775 al = 'perl', perl = 'perl', pl = 'perl', pm = 'perl', pod = 'perl', --
1776 inc = 'php', php = 'php', php3 = 'php', php4 = 'php', phtml = 'php', --
1777 p8 = 'pico8', --
1778 pike = 'pike', pmod = 'pike', --
1779 PKGBUILD = 'pkgbuild', --
1780 pony = 'pony', --
1781 eps = 'ps', ps = 'ps', --
1782 ps1 = 'powershell', --
1783 prolog = 'prolog', --
1784 props = 'props', properties = 'props', --
1785 proto = 'protobuf', --
1786 pure = 'pure', --
1787 sc = 'python', py = 'python', pyw = 'python', --
1788 R = 'r', Rout = 'r', Rhistory = 'r', Rt = 'r', ['Rout.save'] = 'r', ['Rout.fail'] = 'r', --
1789 re = 'reason', --
1790 r = 'rebol', reb = 'rebol', --
1791 rst = 'rest', --
1792 orx = 'rexx', rex = 'rexx', --
1793 erb = 'rhtml', rhtml = 'rhtml', --
1794 rsc = 'routeros', --
1795 spec = 'rpmspec', --
1796 Rakefile = 'ruby', rake = 'ruby', rb = 'ruby', rbw = 'ruby', --
1797 rs = 'rust', --
1798 sass = 'sass', scss = 'sass', --
1799 scala = 'scala', --
1800 sch = 'scheme', scm = 'scheme', --
1801 bash = 'bash', bashrc = 'bash', bash_profile = 'bash', configure = 'bash', csh = 'bash',
1802 ksh = 'bash', mksh = 'bash', sh = 'bash', zsh = 'bash', --
1803 changes = 'smalltalk', st = 'smalltalk', sources = 'smalltalk', --
1804 sml = 'sml', fun = 'sml', sig = 'sml', --
1805 sno = 'snobol4', SNO = 'snobol4', --
1806 spin = 'spin', --
1807 ddl = 'sql', sql = 'sql', --
1808 automount = 'systemd', device = 'systemd', mount = 'systemd', path = 'systemd',
1809 scope = 'systemd', service = 'systemd', slice = 'systemd', socket = 'systemd', swap = 'systemd',
1810 target = 'systemd', timer = 'systemd', --
1811 taskpaper = 'taskpaper', --
1812 tcl = 'tcl', tk = 'tcl', --
1813 texi = 'texinfo', --
1814 toml = 'toml', --
1815 ['1'] = 'troff', ['2'] = 'troff', ['3'] = 'troff', ['4'] = 'troff', ['5'] = 'troff',
1816 ['6'] = 'troff', ['7'] = 'troff', ['8'] = 'troff', ['9'] = 'troff', ['1x'] = 'troff',
1817 ['2x'] = 'troff', ['3x'] = 'troff', ['4x'] = 'troff', ['5x'] = 'troff', ['6x'] = 'troff',
1818 ['7x'] = 'troff', ['8x'] = 'troff', ['9x'] = 'troff', --
1819 t2t = 'txt2tags', --
1820 ts = 'typescript', --
1821 vala = 'vala', --
1822 vcf = 'vcard', vcard = 'vcard', --
1823 v = 'verilog', ver = 'verilog', --
1824 vh = 'vhdl', vhd = 'vhdl', vhdl = 'vhdl', --
1825 bas = 'vb', cls = 'vb', ctl = 'vb', dob = 'vb', dsm = 'vb', dsr = 'vb', frm = 'vb', pag = 'vb',
1826 vb = 'vb', vba = 'vb', vbs = 'vb', --
1827 wsf = 'wsf', --
1828 dtd = 'xml', svg = 'xml', xml = 'xml', xsd = 'xml', xsl = 'xml', xslt = 'xml', xul = 'xml', --
1829 xs = 'xs', xsin = 'xs', xsrc = 'xs', --
1830 xtend = 'xtend', --
1831 yaml = 'yaml', yml = 'yaml', --
1832 zig = 'zig'
1833 }
1834 local patterns = {
1835 ['^#!.+[/ ][gm]?awk'] = 'awk', ['^#!.+[/ ]lua'] = 'lua', ['^#!.+[/ ]octave'] = 'matlab',
1836 ['^#!.+[/ ]perl'] = 'perl', ['^#!.+[/ ]php'] = 'php', ['^#!.+[/ ]python'] = 'python',
1837 ['^#!.+[/ ]ruby'] = 'ruby', ['^#!.+[/ ]bash'] = 'bash', ['^#!.+/m?ksh'] = 'bash',
1838 ['^#!.+/sh'] = 'bash', ['^%s*class%s+%S+%s*<%s*ApplicationController'] = 'rails',
1839 ['^%s*class%s+%S+%s*<%s*ActionController::Base'] = 'rails',
1840 ['^%s*class%s+%S+%s*<%s*ActiveRecord::Base'] = 'rails',
1841 ['^%s*class%s+%S+%s*<%s*ActiveRecord::Migration'] = 'rails', ['^%s*<%?xml%s'] = 'xml',
1842 ['^#cloud%-config'] = 'yaml'
1843 }
1844
1845 for patt, name in pairs(M.detect_patterns) do if line:find(patt) then return name end end
1846 for patt, name in pairs(patterns) do if line:find(patt) then return name end end
1847 local name, ext = filename:match('[^/\\]+$'), filename:match('[^.]*$')
1848 return M.detect_extensions[name] or extensions[name] or M.detect_extensions[ext] or
1849 extensions[ext]
1850 end
1851
1852 -- The following are utility functions lexers will have access to.
1853
1854 -- Common patterns.
1855
1856 --- A pattern that matches any single character.
1857 M.any = P(1)
1858 --- A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
1859 M.alpha = R('AZ', 'az')
1860 --- A pattern that matches any digit ('0'-'9').
1861 M.digit = R('09')
1862 --- A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z', '0'-'9').
1863 M.alnum = R('AZ', 'az', '09')
1864 --- A pattern that matches any lower case character ('a'-'z').
1865 M.lower = R('az')
1866 --- A pattern that matches any upper case character ('A'-'Z').
1867 M.upper = R('AZ')
1868 --- A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
1869 M.xdigit = R('09', 'AF', 'af')
1870 --- A pattern that matches any graphical character ('!' to '~').
1871 M.graph = R('!~')
1872 --- A pattern that matches any punctuation character ('!' to '/', ':' to '@', '[' to ''', '{'
1873 -- to '~').
1874 M.punct = R('!/', ':@', '[\'', '{~')
1875 --- A pattern that matches any whitespace character ('\t', '\v', '\f', '\n', '\r', space).
1876 M.space = S('\t\v\f\n\r ')
1877
1878 --- A pattern that matches an end of line, either CR+LF or LF.
1879 M.newline = P('\r')^-1 * '\n'
1880 --- A pattern that matches any single, non-newline character.
1881 M.nonnewline = 1 - M.newline
1882
1883 --- Returns a pattern that matches a decimal number, whose digits may be separated by a particular
1884 -- character.
1885 -- @param c Digit separator character.
1886 function M.dec_num_(c) return M.digit * (P(c)^-1 * M.digit)^0 end
1887 --- Returns a pattern that matches a hexadecimal number, whose digits may be separated by
1888 -- a particular character.
1889 -- @param c Digit separator character.
1890 function M.hex_num_(c) return '0' * S('xX') * (P(c)^-1 * M.xdigit)^1 end
1891 --- Returns a pattern that matches an octal number, whose digits may be separated by a particular
1892 -- character.
1893 -- @param c Digit separator character.
1894 function M.oct_num_(c) return '0' * (P(c)^-1 * R('07'))^1 * -M.xdigit end
1895 --- Returns a pattern that matches a binary number, whose digits may be separated by a particular
1896 -- character.
1897 -- @param c Digit separator character.
1898 function M.bin_num_(c) return '0' * S('bB') * (P(c)^-1 * S('01'))^1 * -M.xdigit end
1899 --- Returns a pattern that matches either a decimal, hexadecimal, octal, or binary number,
1900 -- whose digits may be separated by a particular character.
1901 -- @param c Digit separator character.
1902 function M.integer_(c)
1903 return S('+-')^-1 * (M.hex_num_(c) + M.bin_num_(c) + M.oct_num_(c) + M.dec_num_(c))
1904 end
1905 local function exp_(c) return S('eE') * S('+-')^-1 * M.digit * (P(c)^-1 * M.digit)^0 end
1906 --- Returns a pattern that matches a floating point number, whose digits may be separated by a
1907 -- particular character.
1908 -- @param c Digit separator character.
1909 function M.float_(c)
1910 return S('+-')^-1 *
1911 ((M.dec_num_(c)^-1 * '.' * M.dec_num_(c) + M.dec_num_(c) * '.' * M.dec_num_(c)^-1 * -P('.')) *
1912 exp_(c)^-1 + (M.dec_num_(c) * exp_(c)))
1913 end
1914 --- Returns a pattern that matches a typical number, either a floating point, decimal, hexadecimal,
1915 -- octal, or binary number, and whose digits may be separated by a particular character.
1916 -- @param c Digit separator character.
1917 -- @usage lexer.number_('_') -- matches 1_000_000
1918 function M.number_(c) return M.float_(c) + M.integer_(c) end
1919
1920 --- A pattern that matches a decimal number.
1921 M.dec_num = M.dec_num_(false)
1922 --- A pattern that matches a hexadecimal number.
1923 M.hex_num = M.hex_num_(false)
1924 --- A pattern that matches an octal number.
1925 M.oct_num = M.oct_num_(false)
1926 --- A pattern that matches a binary number.
1927 M.bin_num = M.bin_num_(false)
1928 --- A pattern that matches either a decimal, hexadecimal, octal, or binary number.
1929 M.integer = M.integer_(false)
1930 --- A pattern that matches a floating point number.
1931 M.float = M.float_(false)
1932 --- A pattern that matches a typical number, either a floating point, decimal, hexadecimal,
1933 -- octal, or binary number.
1934 M.number = M.number_(false)
1935
1936 --- A pattern that matches a typical word. Words begin with a letter or underscore and consist
1937 -- of alphanumeric and underscore characters.
1938 M.word = (M.alpha + '_') * (M.alnum + '_')^0
1939
1940 --- Returns a pattern that matches a prefix until the end of its line.
1941 -- @param[opt] prefix String or pattern prefix to start matching at. The default value is any
1942 -- non-newline character.
1943 -- @param[optchain=false] escape Allow newline escapes using a '\\' character.
1944 -- @usage local line_comment = lexer.to_eol('//')
1945 -- @usage local line_comment = lexer.to_eol(S('#;'))
1946 function M.to_eol(prefix, escape)
1947 return (prefix or M.nonnewline) *
1948 (not escape and M.nonnewline or 1 - (M.newline + '\\') + '\\' * M.any)^0
1949 end
1950
1951 --- Returns a pattern that matches a bounded range of text.
1952 -- This is a convenience function for matching more complicated ranges like strings with escape
1953 -- characters, balanced parentheses, and block comments (nested or not).
1954 -- @param s String or LPeg pattern start of the range.
1955 -- @param[opt=s] e String or LPeg pattern end of the range. The default value is *s*.
1956 -- @param[optchain=false] single_line Restrict the range to a single line.
1957 -- @param[optchain] escapes Allow the range end to be escaped by a '\\' character. The default
1958 -- value is `false` unless *s* and *e* are identical, single-character strings. In that case,
1959 -- the default value is `true`.
1960 -- @param[optchain=false] balanced Match a balanced range, like the "%b" Lua pattern. This flag
1961 -- only applies if *s* and *e* are different.
1962 -- @usage local dq_str_escapes = lexer.range('"')
1963 -- @usage local dq_str_noescapes = lexer.range('"', false, false)
1964 -- @usage local unbalanced_parens = lexer.range('(', ')')
1965 -- @usage local balanced_parens = lexer.range('(', ')', false, false, true)
1966 function M.range(s, e, single_line, escapes, balanced)
1967 if type(e) ~= 'string' and type(e) ~= 'userdata' then
1968 e, single_line, escapes, balanced = s, e, single_line, escapes
1969 end
1970 local any = M.any - e
1971 if single_line then any = any - '\n' end
1972 if balanced then any = any - s end
1973 -- Only allow escapes by default for ranges with identical, single-character string delimiters.
1974 if escapes == nil then escapes = type(s) == 'string' and #s == 1 and s == e end
1975 if escapes then any = any - '\\' + '\\' * M.any end
1976 if balanced and s ~= e then return P{s * (any + V(1))^0 * P(e)^-1} end
1977 return s * any^0 * P(e)^-1
1978 end
1979
1980 --- Returns a pattern that only matches when it comes after certain characters (or when there
1981 -- are no characters behind it).
1982 -- @param set String character set like one passed to `lpeg.S()`.
1983 -- @param patt LPeg pattern to match after a character in *set*.
1984 -- @param skip String character set to skip over when looking backwards from *patt*. The default
1985 -- value is " \t\r\n\v\f" (whitespace).
1986 -- @usage local regex = lexer.after_set('+-*!%^&|=,([{', lexer.range('/'))
1987 -- -- matches "var re = /foo/;", but not "var x = 1 / 2 / 3;"
1988 function M.after_set(set, patt, skip)
1989 if not skip then skip = ' \t\r\n\v\f' end
1990 local set_chars, skip_chars = {}, {}
1991 -- Note: cannot use utf8.codes() because Lua 5.1 is still supported.
1992 for char in set:gmatch('.') do set_chars[string.byte(char)] = true end
1993 for char in skip:gmatch('.') do skip_chars[string.byte(char)] = true end
1994 return (B(S(set)) + -B(1)) * patt + Cmt(C(patt), function(input, index, match, ...)
1995 local pos = index - #match
1996 if #skip > 0 then while pos > 1 and skip_chars[input:byte(pos - 1)] do pos = pos - 1 end end
1997 if pos == 1 or set_chars[input:byte(pos - 1)] then return index, ... end
1998 return nil
1999 end)
2000 end
2001
2002 --- Returns a pattern that matches only at the beginning of a line.
2003 -- @param patt LPeg pattern to match at the beginning of a line.
2004 -- @param[opt=false] allow_indent Allow *patt* to match after line indentation.
2005 -- @usage local preproc = lex:tag(lexer.PREPROCESSOR, lexer.starts_line(lexer.to_eol('#')))
2006 function M.starts_line(patt, allow_indent)
2007 return M.after_set('\r\n\v\f', patt, allow_indent and ' \t' or '')
2008 end
2009
2010 M.colors = {} -- legacy
2011 M.styles = setmetatable({}, { -- legacy
2012 __index = function() return setmetatable({}, {__concat = function() return nil end}) end,
2013 __newindex = function() end
2014 })
2015 M.property_expanded = setmetatable({}, {__index = function() return '' end}) -- legacy
2016
2017 -- Legacy function for creates and returns a token pattern with token name *name* and pattern
2018 -- *patt*.
2019 -- Use `tag()` instead.
2020 -- @param name The name of token.
2021 -- @param patt The LPeg pattern associated with the token.
2022 -- @usage local number = token(lexer.NUMBER, lexer.number)
2023 -- @usage local addition = token('addition', '+' * lexer.word)
2024 function M.token(name, patt) return Cc(name) * (P(patt) / 0) * Cp() end
2025
2026 -- Legacy function that creates and returns a pattern that verifies the first non-whitespace
2027 -- character behind the current match position is in string set *s*.
2028 -- @param s String character set like one passed to `lpeg.S()`.
2029 -- @usage local regex = #P('/') * lexer.last_char_includes('+-*!%^&|=,([{') * lexer.range('/')
2030 function M.last_char_includes(s) return M.after_set(s, true) end
2031
2032 function M.fold_consecutive_lines() end -- legacy
2033
2034 -- The functions and fields below were defined in C.
2035
2036 --- Map of line numbers (starting from 1) to their fold level bit-masks. (Read-only)
2037 -- Fold level masks are composed of an integer level combined with any of the following bits:
2038 --
2039 -- - `lexer.FOLD_BASE`
2040 -- The initial fold level (1024).
2041 -- - `lexer.FOLD_BLANK`
2042 -- The line is blank.
2043 -- - `lexer.FOLD_HEADER`
2044 -- The line is a header, or fold point.
2045 -- @table fold_level
2046
2047 --- Map of line numbers (starting from 1) to their indentation amounts, measured in character
2048 -- columns. (Read-only)
2049 -- @table indent_amount
2050
2051 --- Map of line numbers (starting from 1) to their 32-bit integer line states.
2052 -- Line states can be used by lexers for keeping track of persistent states (up to 32 states
2053 -- with 1 state per bit). For example, the output lexer uses this to mark lines that have
2054 -- warnings or errors.
2055 -- @table line_state
2056
2057 --- Map of key-value string pairs.
2058 -- The contents of this map are application-dependant.
2059 -- @table property
2060
2061 --- Alias of `lexer.property`, but with values interpreted as numbers, or `0` if not
2062 -- found. (Read-only)
2063 -- @table property_int
2064
2065 --- Map of buffer positions (starting from 1) to their string style names. (Read-only)
2066 -- @table style_at
2067
2068 --- Returns a position's line number (starting from 1).
2069 -- @param pos Position (starting from 1) to get the line number of.
2070 -- @function line_from_position
2071
2072 --- Map of line numbers (starting from 1) to their start positions. (Read-only)
2073 -- @table line_start
2074
2075 --- Map of line numbers (starting from 1) to their end positions. (Read-only)
2076 -- @table line_end
2077
2078 --- Returns a range of buffer text.
2079 -- The current text being lexed or folded may be a subset of buffer text. This function can
2080 -- return any text in the buffer.
2081 -- @param pos Position (starting from 1) of the text range to get. It needs to be an absolute
2082 -- position. Use a combination of `lexer.line_from_position()` and `lexer.line_start`
2083 -- to get one.
2084 -- @param length Length of the text range to get.
2085 -- @function text_range
2086
2087 return M