April 2002 Draft
JavaScript 2.0
Core Language
Lexer

Monday, November 26, 2001

This section presents an informal overview of the JavaScript 2.0 lexer. See the stages and lexical semantics sections in the formal description chapter for the details.

Changes since JavaScript 1.5

The JavaScript 2.0 lexer behaves in the same way as the JavaScript 1.5 lexer except for the following:

Source Code

JavaScript 2.0 source text consists of a sequence of UTF-16 Unicode version 2.1 or later characters normalized to Unicode Normalization Form C (canonical composition), as described in Unicode Technical Report #15.

Comments and White Space

Comments and white space behave exactly as they do in JavaScript 1.5.

Punctuators

The following JavaScript 1.5 punctuation tokens are recognized in JavaScript 2.0:

!   !=   !==   %   %=   &   &&   &=   (   )   *   *=   +   ++   +=   ,   -   --   -=   .   /   /=   :   ::   ;   <   <<   <<=   <=   =   ==   ===   >   >=   >>   >>=   >>>   >>>=   ?   [   ]   ^   ^=   {   |   |=   ||   }   ~

The following punctuation tokens are new in JavaScript 2.0:

&&=   ...   ^^   ^^=   ||=

Keywords

The following reserved words are used in JavaScript 2.0:

abstract   as   break   case   catch   class   const   continue   default   delete   do   else   export   extends   false   final   finally   for   function   if   implements   import   in   instanceof   interface   is   namespace   new   null   package   private   public   return   static   super   switch   this   throw   true   try   typeof   use   var   void   while   with

The following reserved words are reserved for future expansion:

debugger   enum   goto   implements   interface   native   protected   synchronized   throws   transient   volatile

The following words have special meaning in some contexts in JavaScript 2.0 but are not reserved and may be used as identifiers:

exclude   get   include   named   set

Any of the above keywords may be used as an identifier by including a \_ escape anywhere within the identifier, which strips it of any keyword meaning. The two-digit and four-digit hexadecimal escapes \xdd and \udddd may also be used in identifiers; these strip the identifier of any keyword meaning as well.
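The escape rule above can be sketched as a small classifier over the raw source text of an identifier-like token. This is only an illustration, not the lexer's actual implementation: the names RESERVED_WORDS and classifyIdentifierToken are hypothetical, and the check simply treats any backslash in the raw text as an escape.

```javascript
// Reserved words taken from the two lists above (used words plus future expansion).
const RESERVED_WORDS = new Set([
  "abstract", "as", "break", "case", "catch", "class", "const", "continue",
  "default", "delete", "do", "else", "export", "extends", "false", "final",
  "finally", "for", "function", "if", "implements", "import", "in",
  "instanceof", "interface", "is", "namespace", "new", "null", "package",
  "private", "public", "return", "static", "super", "switch", "this", "throw",
  "true", "try", "typeof", "use", "var", "void", "while", "with",
  "debugger", "enum", "goto", "native", "protected", "synchronized", "throws",
  "transient", "volatile",
]);

// Hypothetical helper: `raw` is the token exactly as written in the source,
// with escapes not yet decoded.
function classifyIdentifierToken(raw) {
  // Any escape (\_, \xdd or \udddd) strips the token of keyword meaning.
  const hasEscape = raw.includes("\\");
  if (!hasEscape && RESERVED_WORDS.has(raw)) return "Keyword";
  return "Identifier";
}

console.log(classifyIdentifierToken("class"));       // Keyword
console.log(classifyIdentifierToken("cl\\_ass"));    // Identifier
console.log(classifyIdentifierToken("\\u0063lass")); // Identifier
console.log(classifyIdentifierToken("get"));         // Identifier
```

Context-sensitive words such as get and set come out as plain identifiers here, on the assumption that their special meaning is resolved by the parser rather than the lexer.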

Changes from JavaScript 1.5

The following words were reserved in JavaScript 1.5 but are not reserved in JavaScript 2.0:

boolean   byte   char   double   float   int   long   short

The following words were not reserved in JavaScript 1.5 but are reserved in JavaScript 2.0:

as   is   namespace   use

Semicolon Insertion

The JavaScript 2.0 syntactic grammar explicitly makes semicolons optional in the following situations:

Semicolons are optional in these situations even if they would construct empty statements. Strict mode has no effect on semicolon insertion in the above cases.

In addition, line breaks in the input stream are sometimes turned into VirtualSemicolon tokens. Specifically, if the first through the nth tokens of a JavaScript program are grammatically valid but the first through the (n+1)st tokens are not, and there is a line break (or a comment containing a line break) between the nth and the (n+1)st tokens, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the (n+1)st tokens. This kind of VirtualSemicolon insertion does not occur in strict mode.
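The retry rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual parser: insertVirtualSemicolons, toyIsValidPrefix and the token shape { text, lineBreakBefore } are hypothetical, the toy grammar accepts only statements of the form Identifier = Number, and the sketch checks prefixes incrementally rather than reparsing the whole program.

```javascript
// Marker for the inserted token; the angle brackets keep the toy grammar
// from mistaking it for an identifier.
const VIRTUAL_SEMICOLON = "<VirtualSemicolon>";

// Toy stand-in for the real parser (an assumption for illustration only).
// Grammar: Program = (Identifier "=" Number Terminator)*, where Terminator
// is ";" or a VirtualSemicolon. Returns true if `texts` is a valid prefix.
function toyIsValidPrefix(texts) {
  let state = 0; // 0: identifier, 1: "=", 2: number, 3: terminator
  for (const text of texts) {
    if (state === 0 && /^[A-Za-z]\w*$/.test(text)) state = 1;
    else if (state === 1 && text === "=") state = 2;
    else if (state === 2 && /^\d+$/.test(text)) state = 3;
    else if (state === 3 && (text === ";" || text === VIRTUAL_SEMICOLON)) state = 0;
    else return false;
  }
  return true;
}

// Applies the rule: when tokens 1..n are valid but 1..n+1 are not, and a line
// break separates token n from token n+1, retry with a VirtualSemicolon
// between them. No insertion happens in strict mode.
function insertVirtualSemicolons(tokens, isValidPrefix, strict) {
  const out = [];
  for (const token of tokens) {
    if (isValidPrefix([...out, token.text])) {
      out.push(token.text);
      continue;
    }
    if (!strict && token.lineBreakBefore && out.length > 0 &&
        isValidPrefix([...out, VIRTUAL_SEMICOLON, token.text])) {
      out.push(VIRTUAL_SEMICOLON, token.text);
      continue;
    }
    throw new SyntaxError(`unexpected token ${token.text}`);
  }
  return out;
}

// Tokens for the two-line program "a = 1" / "b = 2".
const exampleTokens = [
  { text: "a", lineBreakBefore: false },
  { text: "=", lineBreakBefore: false },
  { text: "1", lineBreakBefore: false },
  { text: "b", lineBreakBefore: true },
  { text: "=", lineBreakBefore: false },
  { text: "2", lineBreakBefore: false },
];
console.log(insertVirtualSemicolons(exampleTokens, toyIsValidPrefix, false).join(" "));
// a = 1 <VirtualSemicolon> b = 2
```

With strict set to true, or with no line break before b, the same input is rejected with a SyntaxError instead.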

See also the semicolon insertion syntax rationale.

Regular Expression Literals

Regular expression literals begin with a slash (/) character that is not immediately followed by another slash (two slashes start a line comment). As in JavaScript 1.5, regular expression literals are ambiguous with the division (/) and division-assignment (/=) tokens. The lexer treats a / or /= as a division or division-assignment token if either of these tokens would be allowed by the syntactic grammar as the next token; otherwise, the lexer treats a / or /= as the start of a regular expression.
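A minimal sketch of this decision is shown below. The function name classifySlash and the divisionAllowed flag are hypothetical; the flag stands in for the syntactic grammar's answer to whether / or /= is permitted as the next token, which a real lexer would obtain from the parser.

```javascript
// Hypothetical helper: decides what a "/" at `pos` in `source` starts.
// `divisionAllowed` is supplied by the caller (in a real implementation,
// by the parser) and says whether the grammar permits / or /= here.
function classifySlash(source, pos, divisionAllowed) {
  if (source[pos] !== "/") {
    throw new Error(`expected '/' at position ${pos}`);
  }
  const next = source[pos + 1];
  // Comments behave as in JavaScript 1.5 and take priority.
  if (next === "/") return "LineComment";
  if (next === "*") return "BlockComment";
  if (divisionAllowed) {
    return next === "=" ? "DivideAssign" : "Divide";
  }
  // Neither / nor /= is grammatical here, so this starts a regular expression.
  return "RegExpStart";
}

console.log(classifySlash("a / b", 2, true));        // Divide
console.log(classifySlash("a /= b", 2, true));       // DivideAssign
console.log(classifySlash("x = /ab+c/", 4, false));  // RegExpStart
console.log(classifySlash("x = /=/", 4, false));     // RegExpStart
```

The last call illustrates why the lexer cannot decide alone: the same two characters /= are a division-assignment token after an operand and the start of a regular expression elsewhere.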

This unfortunate dependence of lexical parsing on grammatical parsing is inherited from JavaScript 1.5. See the regular expression syntax rationale for a discussion of the issues.

Units

When a numeric literal is immediately followed by an identifier, the lexer converts the identifier to a string literal. The parser then treats the number and string as a unit expression. The identifier cannot start with an underscore, but there are no reserved word restrictions on the identifier; any identifier that begins with a letter will work, even if it is a reserved word.

For example, 3in is converted to 3 "in", and 5xena is converted to 5 "xena". On the other hand, 0xena is converted to 0xe "na". Because of potential ambiguities with exponential or hexadecimal notation, it is unwise to define unit names that begin with the letter e or E (either alone or followed by a decimal digit) or with the letter x or X followed by a hexadecimal digit.
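The splitting behavior in these examples can be reproduced with a short sketch. The function lexNumberWithUnit is hypothetical and deliberately simplified: it assumes the numeric literal is matched greedily (hexadecimal first, then decimal with an optional exponent) and it only recognizes ASCII letters, digits, underscore and $ in unit names.

```javascript
// Hypothetical helper: splits text such as "3in" into a number and a unit string.
function lexNumberWithUnit(text) {
  // Longest numeric literal at the start: hex, or decimal with optional exponent.
  const numberPattern =
    /^(?:0[xX][0-9a-fA-F]+|(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)/;
  const match = numberPattern.exec(text);
  if (match === null) {
    throw new SyntaxError(`not a numeric literal: ${text}`);
  }
  const literal = match[0];
  const rest = text.slice(literal.length);
  const value = Number(literal);
  if (rest === "") return { value, unit: null };
  // The unit must begin with a letter (not an underscore); reserved words
  // such as "in" are allowed.
  if (!/^[A-Za-z][A-Za-z0-9_$]*$/.test(rest)) {
    throw new SyntaxError(`invalid unit after ${literal}: ${rest}`);
  }
  return { value, unit: rest };
}

console.log(lexNumberWithUnit("3in"));   // { value: 3, unit: 'in' }
console.log(lexNumberWithUnit("5xena")); // { value: 5, unit: 'xena' }
console.log(lexNumberWithUnit("0xena")); // { value: 14, unit: 'na' }
```

The third result shows the hazard described above: the leading 0xe is consumed as the hexadecimal literal 14, leaving only na as the unit.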


Waldemar Horwat
Last modified Monday, November 26, 2001