Scintilla
Scintilla contains lexers for various types of languages: programming languages, assembler languages, markup languages, and data languages.
Some languages can be used in different ways. JavaScript is a programming language but also the basis of JSON data files. Similarly, Lisp s-expressions can be used for both source code and data.
Each language type has common elements such as identifiers in programming languages. These common elements should be identified so that languages can be displayed with common styles for these elements. Style tags are used for this purpose in Scintilla.
Every style has a list of tags where a tag is a lower-case word containing only the common ASCII letters 'a'-'z' such as "comment" or "operator".
Tags are ordered from most important to least important.
While applications may assign visual attributes for tag lists in many different ways, one reasonable technique is to
apply tag-specific attributes in reverse order so that earlier and more important tags override less important tags.
For example, the tag list "error comment documentation keyword" with a set of tag attributes
{ comment=fore:green,back:very-light-green,font:Serif documentation=fore:light-green error=strikethrough keyword=bold }
could be rendered as
bold,fore:green,back:very-light-green,font:Serif,strikethrough.
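The reverse-order technique can be sketched in a few lines of Python. This is only an illustration and not part of Scintilla: the names parse_attributes and render are hypothetical, and the comma-separated attribute syntax is assumed from the example above.

```python
# Hypothetical sketch: none of these names are part of Scintilla's API.
TAG_ATTRIBUTES = {
    "comment": "fore:green,back:very-light-green,font:Serif",
    "documentation": "fore:light-green",
    "error": "strikethrough",
    "keyword": "bold",
}


def parse_attributes(spec):
    """Turn 'fore:green,bold' into {'fore': 'green', 'bold': None}."""
    attributes = {}
    for item in spec.split(","):
        name, _, value = item.partition(":")
        attributes[name] = value or None
    return attributes


def render(tag_list, tag_attributes):
    """Combine attributes so that earlier tags override later ones."""
    combined = {}
    # Walk from the least important tag to the most important one, so
    # that each more important tag overwrites what came before it.
    for tag in reversed(tag_list.split()):
        if tag in tag_attributes:
            combined.update(parse_attributes(tag_attributes[tag]))
    return ",".join(
        name if value is None else f"{name}:{value}"
        for name, value in combined.items()
    )


print(render("error comment documentation keyword", TAG_ATTRIBUTES))
```

With this strict ordering the script prints bold,fore:green,back:very-light-green,font:Serif,strikethrough: the foreground from comment replaces the one from documentation because comment appears earlier in the tag list.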
Alternative renderings could check for multi-tag combinations like
{ comment.documentation=fore:light-green comment.line=dark-green comment=green }.
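One possible way to check for multi-tag combinations is sketched below. It assumes, purely for illustration, that a rule key such as comment.documentation matches when every tag named in the key is present in the tag list, and that rules are tried from most specific to least specific. The name match_combination is hypothetical.

```python
# Hypothetical sketch: a rule key joins the tags that must all be present.
RULES = [
    ("comment.documentation", "fore:light-green"),
    ("comment.line", "dark-green"),
    ("comment", "green"),
]


def match_combination(tag_list, rules):
    """Return the attributes of the first rule whose tags are all present."""
    tags = set(tag_list.split())
    for key, attributes in rules:
        if set(key.split(".")) <= tags:
            return attributes
    return None  # No rule applies to this tag list.


print(match_combination("comment documentation", RULES))
print(match_combination("comment", RULES))
```

Listing the more specific rules first means a plain comment only falls through to the last rule when no combination matches.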
Commonly, a tag list will contain an optional embedded language; optional statuses; a base type; and a set of type modifiers:
embedded-language? status* base-type modifiers*
The embedded language may be a source (client | server) followed by a language name (javascript | php | python | basic).
This may be extended in the future with other programming languages and style-definition languages like CSS.
The statuses may be (error | unused | predefined | inactive).
The error status is used for lexical states that indicate errors in the source code such as unterminated quoted strings.
The unused status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.
The predefined status indicates a style in the range 32 to 39 that is used for non-lexical purposes in Scintilla.
The inactive status is used for text that is not currently interpreted such as C++ code that is contained within a '#if 0' preprocessor block.
The basic types for programming languages are (default | operator | keyword | identifier | literal | comment | preprocessor | label).
The default type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.
Assembler languages add (instruction | register) to the basic types from programming languages.
The basic types for markup languages are (default | tag | attribute | comment | preprocessor).
The basic types for data languages are (default | key | data | comment).
Programming languages may differentiate between line and stream comments and treat documentation comments as distinct from other comments.
Documentation comments may be marked up with documentation keywords.
The additional attributes commonly used are (line | documentation | keyword | taskmarker).
Programming and assembler languages contain a rich set of literals including numbers like 7 and 3.89e23; strings like "string\n"; and null values like nullptr. Differentiating between these is often wanted.
The common literal types are (numeric | boolean | string | regex | date | time | uuid | nil | compound).
Numeric literal types are subdivided into (integer | real).
String literal types may add (perhaps multiple) further attributes from (heredoc | character | escapesequence | interpolated | multiline | raw).
An escape sequence within an interpolated heredoc may thus be "literal string heredoc escapesequence".
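The common structure embedded-language? status* base-type modifiers* can be split apart with a short routine. The following Python sketch is an assumption-laden illustration rather than anything Scintilla provides: the name parse_tag_list and the dictionary it returns are hypothetical, and the vocabularies are copied from the lists above. Since the structure is only described as common, a tag list that lacks a recognized base type is left with its base set to None.

```python
import re

# Vocabularies copied from the lists in this document.
SOURCES = {"client", "server"}
LANGUAGES = {"javascript", "php", "python", "basic"}
STATUSES = {"error", "unused", "predefined", "inactive"}
BASE_TYPES = {
    "default", "operator", "keyword", "identifier", "literal", "comment",
    "preprocessor", "label",   # programming languages
    "instruction", "register", # assembler languages
    "tag", "attribute",        # markup languages
    "key", "data",             # data languages
}


def parse_tag_list(tag_list):
    """Split a tag list into embedded language, statuses, base and modifiers."""
    tags = tag_list.split()
    for tag in tags:
        # A tag is a lower-case word containing only the letters 'a'-'z'.
        if not re.fullmatch(r"[a-z]+", tag):
            raise ValueError(f"invalid tag: {tag!r}")
    result = {"embedded": None, "statuses": [], "base": None, "modifiers": []}
    position = 0
    # Optional embedded language: a source followed by a language name.
    if len(tags) >= 2 and tags[0] in SOURCES and tags[1] in LANGUAGES:
        result["embedded"] = (tags[0], tags[1])
        position = 2
    # Zero or more statuses.
    while position < len(tags) and tags[position] in STATUSES:
        result["statuses"].append(tags[position])
        position += 1
    # The base type, if the next tag is a recognized one.
    if position < len(tags) and tags[position] in BASE_TYPES:
        result["base"] = tags[position]
        position += 1
    # Everything that remains is treated as a type modifier.
    result["modifiers"] = tags[position:]
    return result


print(parse_tag_list("literal string heredoc escapesequence"))
```

For the escape sequence example this yields the base type literal with the modifiers string, heredoc and escapesequence, and no embedded language or status.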
attribute | Markup attribute |
basic | Embedded Basic |
boolean | True or false literal |
character | Single character literal as opposed to a string literal |
client | Script executed on client |
comment | The standard comment type in a language: may be stream or line |
compound | Literal containing multiple subliterals such as a tuple or complex number |
data | A value in a data file |
date | Literal representing a date such as '19/November/1975' |
default | Starting state commonly also used for white space |
documentation | Comment that can be extracted into documentation |
error | State indicating an invalid or erroneous element |
escapesequence | Parts of a string that are not literal such as '\t' for tab in C |
heredoc | Lengthy text literal marked by a word at both ends |
identifier | Name that identifies an object or class of object |
inactive | Code that is not currently interpreted |
instruction | Mnemonic in assembler languages like 'addc' |
integer | Numeric literal with no fraction or exponent like '738' |
interpolated | String that can contain expressions |
javascript | Embedded JavaScript |
key | Element which allows finding associated data |
keyword | Reserved word with special meaning like 'while' |
label | Destination for jumps in programming and assembler languages |
line | Differentiates between stream comments and line comments in languages that have both |
literal | Fixed value in source code |
multiline | Differentiates between single line and multiline elements, commonly strings |
nil | Literal for the null pointer such as nullptr in C++ or NULL in C |
numeric | Literal number like '16' |
operator | Punctuation character such as '&' or '[' |
php | Embedded PHP |
predefined | Style in the range 32 to 39 that is used for non-lexical purposes |
preprocessor | Element that is recognized in an early stage of translation |
python | Embedded Python |
raw | String type that avoids interpretation: may be used for regular expressions in languages without a specific regex type |
real | Numeric literal which may have a fraction or exponent like '3.84e-15' |
regex | Regular expression literal like '^[a-z]+' |
register | CPU register in assembler languages |
server | Script executed on server |
string | Sequence of characters |
tag | Markup tag like '<br />' |
taskmarker | Word in comment that marks future work like 'FIXME' |
time | Literal representing a time such as '9:34:31' |
unused | Style that is not currently used |
uuid | Universally unique identifier often used in interface definition files which may look like '{098f2470-bae0-11cd-b579-08002b30bfeb}' |
Each element in this scheme may be extended in the future. This may be done by revising this document to provide a common approach to new features. Individual lexers may also choose to expose unique language features through new tags.
Tags could be exposed directly in user interfaces or configuration languages.
However, an application may also translate these to match its naming schema.
Capitalization and punctuation could be different (like Here-Doc instead of heredoc), terminology changed ("constant" instead of "literal"), or human language changed from English to Chinese or Spanish.
Starting from a common set of tags makes these modifications tractable.
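As a rough illustration of such a translation, an application could keep a small table from the common tags to its own names. The table contents and the name translate below are hypothetical and only reuse the examples given above.

```python
# Hypothetical application-side naming table; not defined by Scintilla.
DISPLAY_NAMES = {
    "heredoc": "Here-Doc",
    "literal": "constant",
}


def translate(tag_list, names):
    """Map each tag to the application's name, keeping unknown tags as-is."""
    return [names.get(tag, tag) for tag in tag_list.split()]


print(translate("literal string heredoc", DISPLAY_NAMES))
```

Tags without an entry pass through unchanged, so the table only needs to list the names that differ.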
The C++ lexer (for example) has inactive states and dynamically allocated substyles. These should be exposed through the metadata mechanism but are not currently.