1.1 Tokens & Lexemes

一种语言（language）是字符串（string）组成的集合。而字符串是 **符号（symbol）**的有限序列。而符号则来源于这个语言所定义的有限字母表（alphabet）；

Lexeme：它是一种语言的字符串子串，由一串字符构成（抽象概念），是语言的基本单元。例如在一段代码中，if、10、+ 等等，都是 lexeme；

Token：一种语言对于 lexeme 的抽象的分类表示，可以将其看作程序设计语言的文法单位。token 可以归类为有限的几组单词类型（称 token type），例如 ID、KEYWORD、RELATION、NUM、REAL 等。它通常使用 (token-type, token-value) 对表示。

点符号、操作符、分隔符等也算 token；
例外：注释、预处理指令、宏、空格符/指标符/换行符等空白字符不是 token；

其中，IF / RETURN 等由字母字符组成的单词称为保留字（reserved word），在多数语言中，它们不能作为标识符（ID）使用。

二者的比较如下：

Aspect	Lexeme	Token
Definition	A sequence of characters forming a basic unit of a language.	An abstract, categorized representation of a lexeme.
Representation	Concrete string (e.g., `123`, `if`, `+`).	Abstract pair: type + value (e.g., `(NUMBER, "123")`).
Role	The actual text or code input.	The classified, processed unit passed to the parser.
Generated by	Directly from the source code or text.	Lexical analyzer/tokenizer converts lexemes into tokens.

而实现一个词法分析器（Lexical Analyzer）就是在完成以下的工作：

扫描输入的内容，识别其中某些子串（lexeme）为 token；
返回一组 (token type, token value) 对；

这个过程被称为 Tokenization；

举例：对于下面的一个语句：
int num = 10;
int、num、=、10、; 都是 C 语言的 lexeme；

tokenization 的过程：

Lexeme int -> Token (KEYWORD, "int")；

Lexeme num -> Token (IDENTIFIER, "num")；

Lexeme = -> Token (OPERATOR, "=")；

Lexeme 10 -> Token (NUMBER, "10")；

Lexeme ; -> Token (SEPERATOR, ";")；

这个过程就是词法分析器需要做的事。

为了定义一类 token 的规则，我们引入正则表达式来描述这套词法规则。