Scintilla

Language Types

Scintilla contains lexers for various types of languages: programming languages, assembler languages, markup languages, and data languages.

Some languages can be used in different ways. JavaScript is a programming language but also the basis of JSON data files. Similarly, Lisp s-expressions can be used for both source code and data.

Each language type has common elements such as identifiers in programming languages. These common elements should be identified so that languages can be displayed with common styles for these elements. Style tags are used for this purpose in Scintilla.

Style Tags

Every style has a list of tags where a tag is a lower-case word containing only the common ASCII letters 'a'-'z' such as "comment" or "operator".
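As a minimal sketch, the constraint on tag names can be checked with a regular expression. The helper names `is_valid_tag` and `split_tag_list` are hypothetical and not part of Scintilla, and the space-separated form of a tag list is an assumption based on the examples in this document.

```python
import re

# A tag is one or more lower-case ASCII letters 'a'-'z' and nothing else.
_TAG_PATTERN = re.compile(r"[a-z]+")


def is_valid_tag(tag: str) -> bool:
    """Return True if tag consists only of the letters 'a'-'z'."""
    return _TAG_PATTERN.fullmatch(tag) is not None


def split_tag_list(tag_list: str) -> list:
    """Split a space-separated tag list (assumed format), rejecting malformed tags."""
    tags = tag_list.split()
    for tag in tags:
        if not is_valid_tag(tag):
            raise ValueError(f"invalid tag: {tag!r}")
    return tags
```

For instance, `is_valid_tag("comment")` holds, while a translated display name such as "Here-Doc" is rejected because it contains upper-case letters and punctuation.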

Tags are ordered from most important to least important.

While applications may assign visual attributes for tag lists in many different ways, one reasonable technique is to apply tag-specific attributes in reverse order so that earlier and more important tags override less important tags. For example, the tag list "error comment documentation keyword" with a set of tag attributes
{
 comment=fore:green,back:very-light-green,font:Serif
 documentation=fore:light-green
 error=strikethrough
 keyword=bold
}
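could be rendered as
bold,fore:green,back:very-light-green,font:Serif,strikethrough
because the fore:green of the earlier tag "comment" overrides the fore:light-green of the later tag "documentation".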
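The reverse-order technique can be sketched as follows. This is only an illustration: the function names `parse_attributes` and `resolve_style`, and the representation of attributes as a dictionary, are assumptions and not part of Scintilla. Under a strict reading of the rule, "comment" precedes "documentation" in the tag list, so the fore value of "comment" wins when both tags set it.

```python
def parse_attributes(spec: str) -> dict:
    """Parse 'fore:green,bold' into {'fore': 'green', 'bold': ''}."""
    attributes = {}
    for item in spec.split(","):
        name, _, value = item.partition(":")
        attributes[name] = value
    return attributes


def resolve_style(tag_list: str, tag_attributes: dict) -> dict:
    """Apply tag attributes from the least to the most important tag."""
    resolved = {}
    # Tags are listed most important first, so walk the list backwards;
    # each later update overrides values set by less important tags.
    for tag in reversed(tag_list.split()):
        spec = tag_attributes.get(tag)
        if spec:
            resolved.update(parse_attributes(spec))
    return resolved


# Attribute set taken from the example in the text.
TAG_ATTRIBUTES = {
    "comment": "fore:green,back:very-light-green,font:Serif",
    "documentation": "fore:light-green",
    "error": "strikethrough",
    "keyword": "bold",
}

style = resolve_style("error comment documentation keyword", TAG_ATTRIBUTES)
```

An application that wants the more specific "documentation" colour to take precedence would need a different policy, such as checking multi-tag combinations.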

Alternative renderings could check for multi-tag combinations like { comment.documentation=fore:light-green comment.line=dark-green comment=green }.
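One possible way to check multi-tag combinations is sketched below. The matching policy is an assumption made for illustration: a rule key is a dot-separated set of tags, it matches when all of its tags appear in the tag list, and the rule naming the most tags wins. The name `match_combination` is hypothetical.

```python
from typing import Optional


def match_combination(tag_list: str, rules: dict) -> Optional[str]:
    """Return the value of the most specific rule matching the tag list."""
    tags = set(tag_list.split())
    best_key = None
    best_size = 0
    for key in rules:
        parts = key.split(".")
        # A rule matches only if every tag it names is present.
        # On a tie in size, the rule listed first is kept.
        if set(parts) <= tags and len(parts) > best_size:
            best_key, best_size = key, len(parts)
    return rules[best_key] if best_key is not None else None


# Rule set taken from the example in the text.
RULES = {
    "comment.documentation": "fore:light-green",
    "comment.line": "dark-green",
    "comment": "green",
}
```

With these rules, the tag list "comment documentation" selects the comment.documentation entry, while a plain "comment" falls back to the single-tag rule.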

Commonly, a tag list will contain an optional embedded language; optional statuses; a base type; and a set of type modifiers:
embedded-language? status* base-type modifiers*

Embedded language

The embedded language may be a source (client | server) followed by a language name (javascript | php | python | basic). This may be extended in the future with other programming languages and style-definition languages like CSS.

Status

The statuses may be (error | unused | predefined | inactive).
The error status is used for lexical statuses that indicate errors in the source code such as unterminated quoted strings.
The unused status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.
The predefined status indicates a style in the range 32..39 that is used for non-lexical purposes in Scintilla.
The inactive status is used for text that is not currently interpreted such as C++ code that is contained within a '#if 0' preprocessor block.

Basic Types

The basic types for programming languages are (default | operator | keyword | identifier | literal | comment | preprocessor | label).
The default type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.

Assembler languages add (instruction | register) to the basic types from programming languages.

The basic types for markup languages are (default | tag | attribute | comment | preprocessor).

The basic types for data languages are (default | key | data | comment).
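The common tag-list shape (embedded-language? status* base-type modifiers*) can be sketched as a small parser over the vocabularies listed above. The names `TagList` and `parse_tag_list` are hypothetical, and treating the base type as optional is an assumption made so that unusual lists such as a lone "unused" still parse.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SOURCES = {"client", "server"}
LANGUAGES = {"javascript", "php", "python", "basic"}
STATUSES = {"error", "unused", "predefined", "inactive"}
BASE_TYPES = {
    # programming languages
    "default", "operator", "keyword", "identifier",
    "literal", "comment", "preprocessor", "label",
    # assembler languages
    "instruction", "register",
    # markup languages
    "tag", "attribute",
    # data languages
    "key", "data",
}


@dataclass
class TagList:
    embedded: List[str] = field(default_factory=list)
    statuses: List[str] = field(default_factory=list)
    base_type: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)


def parse_tag_list(text: str) -> TagList:
    """Split a tag list into embedded language, statuses, base type, modifiers."""
    tags = text.split()
    result = TagList()
    i = 0
    # Optional embedded language: a source followed by a language name.
    if i + 1 < len(tags) and tags[i] in SOURCES and tags[i + 1] in LANGUAGES:
        result.embedded = tags[i:i + 2]
        i += 2
    # Zero or more statuses.
    while i < len(tags) and tags[i] in STATUSES:
        result.statuses.append(tags[i])
        i += 1
    # The base type, if present, then everything else as modifiers.
    if i < len(tags) and tags[i] in BASE_TYPES:
        result.base_type = tags[i]
        i += 1
    result.modifiers = tags[i:]
    return result
```

Because only the first tag after the statuses is treated as the base type, a word such as "keyword" can still appear later as a modifier, as in "comment documentation keyword".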

Comments

Programming languages may differentiate between line and stream comments and treat documentation comments as distinct from other comments. Documentation comments may be marked up with documentation keywords.
The additional attributes commonly used are (line | documentation | keyword | taskmarker).

Literals

Programming and assembler languages contain a rich set of literals, including numbers like 7 and 3.89e23, strings like "string\n", and null values like nullptr. Differentiating between these is often wanted.
The common literal types are (numeric | boolean | string | regex | date | time | uuid | nil | compound).
Numeric literal types are subdivided into (integer | real).
String literal types may add (perhaps multiple) further attributes from (heredoc | character | escapesequence | interpolated | multiline | raw).

An escape sequence within an interpolated heredoc may thus have the tag list "literal string heredoc escapesequence".

List of known tags

attribute       Markup attribute
basic           Embedded Basic
boolean         True or false literal
character       Single character literal as opposed to a string literal
client          Script executed on client
comment         The standard comment type in a language: may be stream or line
compound        Literal containing multiple subliterals such as a tuple or complex number
data            A value in a data file
date            Literal representing a date such as '19/November/1975'
default         Starting state commonly also used for white space
documentation   Comment that can be extracted into documentation
error           State indicating an invalid or erroneous element
escapesequence  Parts of a string that are not literal such as '\t' for tab in C
heredoc         Lengthy text literal marked by a word at both ends
identifier      Name that identifies an object or class of object
inactive        Code that is not currently interpreted
instruction     Mnemonic in assembler languages like 'addc'
integer         Numeric literal with no fraction or exponent like '738'
interpolated    String that can contain expressions
javascript      Embedded JavaScript
key             Element which allows finding associated data
keyword         Reserved word with special meaning like 'while'
label           Destination for jumps in programming and assembler languages
line            Differentiates between stream comments and line comments in languages that have both
literal         Fixed value in source code
multiline       Differentiates between single line and multiline elements, commonly strings
nil             Literal for the null pointer such as nullptr in C++ or NULL in C
numeric         Literal number like '16'
operator        Punctuation character such as '&' or '['
php             Embedded PHP
predefined      Style in the range 32..39 that is used for non-lexical purposes
preprocessor    Element that is recognized in an early stage of translation
python          Embedded Python
raw             String type that avoids interpretation: may be used for regular expressions in languages without a specific regex type
real            Numeric literal which may have a fraction or exponent like '3.84e-15'
regex           Regular expression literal like '^[a-z]+'
register        CPU register in assembler languages
server          Script executed on server
string          Sequence of characters
tag             Markup tag like '<br />'
taskmarker      Word in comment that marks future work like 'FIXME'
time            Literal representing a time such as '9:34:31'
unused          Style that is not currently used
uuid            Universally unique identifier often used in interface definition files which may look like '{098f2470-bae0-11cd-b579-08002b30bfeb}'

Extension

Each element in this scheme may be extended in the future. This may be done by revising this document to provide a common approach to new features. Individual lexers may also choose to expose unique language features through new tags.

Translation

Tags could be exposed directly in user interfaces or configuration languages. However, an application may also translate these to match its naming schema. Capitalization and punctuation could be different (like Here-Doc instead of heredoc), terminology changed ("constant" instead of "literal"), or human language changed from English to Chinese or Spanish.

Starting from a common set of tags makes these modifications tractable.
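A translation layer of this kind might be sketched as a simple lookup table. The mapping below uses only the two renamings mentioned above; the names `DISPLAY_NAMES`, `display_name`, and `display_tag_list` are hypothetical, and falling back to the original tag when no translation exists is an assumption.

```python
# Application-specific display names for selected tags.
DISPLAY_NAMES = {
    "heredoc": "Here-Doc",   # different capitalization and punctuation
    "literal": "constant",   # different terminology
}


def display_name(tag: str, names: dict = DISPLAY_NAMES) -> str:
    """Translate one tag, falling back to the tag itself when unmapped."""
    return names.get(tag, tag)


def display_tag_list(tag_list: str, names: dict = DISPLAY_NAMES) -> str:
    """Translate every tag in a space-separated tag list."""
    return " ".join(display_name(tag, names) for tag in tag_list.split())
```

A localized application could supply a different dictionary per human language while lexers continue to report the same common tags.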

Open issues

The C++ lexer (for example) has inactive states and dynamically allocated substyles. These should be exposed through the metadata mechanism but are not currently.