Like any other programming language, the Java programming language is defined by grammar rules that specify how syntactically legal constructs can be formed using the language elements, and by a semantic definition that specifies the meaning of syntactically legal constructs.
The low-level language elements are called lexical tokens (or just tokens for short) and are the building blocks for more complex constructs. Identifiers, numbers, operators, and special characters are all examples of tokens that can be used to build high-level constructs like expressions, statements, methods, and classes.
A name in a program is called an identifier. Identifiers can be used to denote classes, methods, variables, and labels.
In Java an identifier is composed of a sequence of characters, where each character can be either a letter, a digit, a connecting punctuation (such as underscore _), or any currency symbol (such as $, ¢, ¥, or £). However, the first character in an identifier cannot be a digit. Since Java programs are written in the Unicode character set (see p. 23), the definitions of letter and digit are interpreted according to this character set.
Identifiers in Java are case sensitive; for example, price and Price are two different identifiers.
Examples of legal identifiers:
number, Number, sum_$, bingo, $$_100, mål, grüß
Examples of illegal identifiers:
48chevy, all@hands, grand-sum
The name 48chevy is not a legal identifier as it starts with a digit. The character @ is not a legal character in an identifier. It is also not a legal operator, so all@hands cannot be interpreted as a legal expression with two operands. The character - is also not a legal character in an identifier. However, it is a legal operator, so grand-sum could be interpreted as a legal expression with two operands.
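The character rules above can be checked programmatically. The following is a minimal sketch, assuming the standard methods Character.isJavaIdentifierStart and Character.isJavaIdentifierPart; the class IdentifierCheck and its method isLegalIdentifier are hypothetical names chosen for illustration. The check looks at characters only and does not reject reserved words.

```java
final class IdentifierCheck {
    // Hypothetical helper: returns true if s obeys the character rules
    // for identifiers. It does not check whether s is a reserved word.
    static boolean isLegalIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {"number", "sum_$", "$$_100", "48chevy", "all@hands", "grand-sum"};
        for (String name : names) {
            System.out.println(name + " -> " + isLegalIdentifier(name));
        }
    }
}
```

The first three names should be reported as legal and the last three as illegal, matching the discussion above.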
Keywords are reserved identifiers that are predefined in the language and cannot be used to denote other entities. All the keywords are in lowercase, and incorrect usage results in compilation errors.
Keywords currently defined in the language are listed in Table 2.1. In addition, three identifiers are reserved as predefined literals in the language: the null reference and the Boolean literals true and false (see Table 2.2). Keywords currently reserved, but not in use, are listed in Table 2.3. All these reserved words cannot be used as identifiers. The index contains references to relevant sections where currently defined keywords are explained.
Table 2.1: Keywords in Java
abstract | default | implements | protected | throw |
assert | do | import | public | throws |
boolean | double | instanceof | return | transient |
break | else | int | short | try |
byte | extends | interface | static | void |
case | final | long | strictfp | volatile |
catch | finally | native | super | while |
char | float | new | switch | |
class | for | package | synchronized | |
continue | if | private | this | |
Table 2.2: Reserved literals in Java
null | true | false |
Table 2.3: Reserved keywords not currently in use
const | goto |
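Whether a word is reserved can also be checked from code. The sketch below assumes the standard method javax.lang.model.SourceVersion.isKeyword, which reports keywords as well as the reserved literals for the latest source version known to the running JDK; the class ReservedWordCheck and its method isReserved are hypothetical names used only for illustration.

```java
import javax.lang.model.SourceVersion;

final class ReservedWordCheck {
    // Hypothetical helper: true if the word is a keyword or a reserved literal.
    static boolean isReserved(String word) {
        return SourceVersion.isKeyword(word);
    }

    public static void main(String[] args) {
        // Keywords are all lowercase, so "Class" is an ordinary identifier.
        String[] words = {"class", "Class", "goto", "null", "price"};
        for (String word : words) {
            System.out.println(word + " -> " + isReserved(word));
        }
    }
}
```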
A literal denotes a constant value, that is, the value a literal represents remains unchanged in the program. Literals represent numerical (integer or floating-point), character, boolean or string values. In addition, there is the literal null that represents the null reference.
Integer | 2000 0 -7 |
Floating-point | 3.14 -3.14 .5 0.5 |
Character | 'a' 'A' '0' ':' '-' ')' |
Boolean | true false |
String | "abba" "3.14" "for" "a piece of the action" |
The integer data types comprise the following primitive data types: int, long, byte, and short (see Section 2.2).
The default data type of an integer literal is always int, but it can be specified as long by appending the suffix L (or l) to the integer value. Without the suffix, the long literals 2000L and 0l will be interpreted as int literals. There is no direct way to specify a short or a byte literal.
In addition to the decimal number system, integer literals can also be specified in octal (base 8) and hexadecimal (base 16) number systems. Octal and hexadecimal numbers are specified with the prefix 0 and 0x (or 0X), respectively. Examples of decimal, octal and hexadecimal literals are shown in Table 2.5. Note that the leading 0 (zero) digit is not the uppercase letter O. The hexadecimal digits from a to f can also be specified with the corresponding uppercase forms (A to F). Negative integers (e.g., -90) can be specified by prefixing the minus sign (-) to the magnitude of the integer regardless of number system (e.g., -0132 or -0X5A). Number systems and number representation are discussed in Appendix G. Java versions before Java 7 do not support literals in binary notation; Java 7 introduced binary literals with the prefix 0b (or 0B).
Decimal | Octal | Hexadecimal |
---|---|---|
8 | 010 | 0x8 |
10L | 012L | 0XaL |
16 | 020 | 0x10 |
27 | 033 | 0x1B |
90L | 0132L | 0x5aL |
-90 | -0132 | -0X5A |
2147483647 (i.e., 2^31-1) | 017777777777 | 0x7fffffff |
-2147483648 (i.e., -2^31) | -020000000000 | -0x80000000 |
1125899906842624L (i.e., 2^50) | 040000000000000000L | 0x4000000000000L |
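The equivalences in the table can be confirmed with a short program. This is only an illustrative sketch; the class IntegerLiteralDemo and its method names are hypothetical.

```java
final class IntegerLiteralDemo {
    // Each row of the table denotes one value written in three number systems.
    static boolean sameValueInAllBases() {
        return 27 == 033 && 27 == 0x1B
            && 90L == 0132L && 90L == 0x5aL
            && -90 == -0132 && -90 == -0X5A;
    }

    // The largest and smallest int values written in octal and hexadecimal.
    static boolean limitsMatch() {
        return 0x7fffffff == Integer.MAX_VALUE
            && 017777777777 == Integer.MAX_VALUE
            && -0x80000000 == Integer.MIN_VALUE
            && -020000000000 == Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(010);   // octal literal, prints 8
        System.out.println(0x10);  // hexadecimal literal, prints 16
        System.out.println(0XaL);  // long literal in hexadecimal, prints 10
        System.out.println(sameValueInAllBases());
        System.out.println(limitsMatch());
    }
}
```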
Floating-point data types come in two flavors: float or double.
The default data type of a floating-point literal is double, but it can be explicitly designated by appending the suffix D (or d) to the value. A floating-point literal can also be specified to be a float by appending the suffix F (or f).
Floating-point literals can also be specified in scientific notation, where E (or e) stands for Exponent. For example, the double literal 194.9E-2 in scientific notation is interpreted as 194.9*10^-2 (i.e., 1.949).
Examples of double literals:
0.0 0.0d 0D 0.49 .49 .49D 49.0 49. 49D 4.9E+1 4.9E+1D 4.9e1d 4900e-2 .49E2
Examples of float literals:
0.0F 0f 0.49F .49F 49.0F 49.F 49F 4.9E+1F 4900e-2f .49E2F
Note that the decimal point and the exponent are optional and that at least one digit must be specified.
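As a small illustration of these notations, the sketch below compares several spellings of the same value. The class FloatingPointLiteralDemo and its method names are hypothetical.

```java
final class FloatingPointLiteralDemo {
    // Different spellings of the double value forty-nine.
    static boolean allDenoteFortyNine() {
        return 49.0 == 49. && 49.0 == 49D && 49.0 == 4.9E+1
            && 49.0 == 4.9e1d && 49.0 == 4900e-2 && 49.0 == .49E2;
    }

    // The same value as a float; it is exactly representable in both types.
    static boolean floatFortyNineMatches() {
        return 49F == 49.0 && 4.9E+1F == 49.0 && .49E2F == 49.0;
    }

    // 0.49 is not exactly representable in binary, so the float literal
    // and the double literal are rounded to different values.
    static boolean floatAndDoubleDiffer() {
        return 0.49F != 0.49;
    }

    public static void main(String[] args) {
        System.out.println(194.9E-2 == 1.949);
        System.out.println(allDenoteFortyNine());
        System.out.println(floatFortyNineMatches());
        System.out.println(floatAndDoubleDiffer());
    }
}
```

The last method shows why the F suffix matters: a float literal carries less precision than the double literal with the same digits.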
The primitive data type boolean represents the truth-values true or false that are denoted by the reserved literals true or false, respectively.
A character literal is quoted in single-quotes ('). All character literals have the primitive data type char.
Characters in Java are represented by the 16-bit Unicode character set, which subsumes the 8-bit ISO-Latin-1 and the 7-bit ASCII characters. In Table 2.6, note that digits (0 to 9), upper-case letters (A to Z), and lower-case letters (a to z) have contiguous Unicode values. Any Unicode character can be specified as a four-digit hexadecimal number (i.e., 16 bits) with the prefix \u.
Character Literal | Character Literal using Unicode value | Character |
---|---|---|
' ' | '\u0020' | Space |
'0' | '\u0030' | 0 |
'1' | '\u0031' | 1 |
'9' | '\u0039' | 9 |
'A' | '\u0041' | A |
'B' | '\u0042' | B |
'Z' | '\u005a' | Z |
'a' | '\u0061' | a |
'b' | '\u0062' | b |
'z' | '\u007a' | z |
'Ñ' | '\u00d1' | Ñ |
'å' | '\u00e5' | å |
'ß' | '\u00df' | ß |
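The following sketch illustrates two points from the table: a character literal and its Unicode form denote the same char value, and digits and letters have contiguous values. The class CharLiteralDemo and its method names are hypothetical.

```java
final class CharLiteralDemo {
    // A character literal and its four-digit hexadecimal form are the same value.
    static boolean unicodeFormsMatch() {
        return ' ' == '\u0020' && '0' == '\u0030'
            && 'A' == '\u0041' && 'z' == '\u007a';
    }

    // Contiguous values make it possible to step from one character to the next.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        System.out.println(unicodeFormsMatch());
        System.out.println(next('a'));   // prints b
        System.out.println('Z' - 'A');   // prints 25
        System.out.println('9' - '0');   // prints 9
    }
}
```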
Certain escape sequences define special character values as shown in Table 2.7. These escape sequences can be single-quoted to define character literals. For example, the character literals '\t' and '\u0009' are equivalent. However, the character literals '\u000a' and '\u000d' should not be used to represent newline and carriage return in the source code. These values are interpreted as line-terminator characters by the compiler, and will cause compile time errors. One should use the escape sequences '\n' and '\r', respectively, for correct interpretation of these characters in the source code.
Escape Sequence | Unicode Value | Character |
---|---|---|
\b | \u0008 | Backspace (BS) |
\t | \u0009 | Horizontal tab (HT or TAB) |
\n | \u000a | Linefeed (LF) a.k.a., Newline (NL) |
\f | \u000c | Form feed (FF) |
\r | \u000d | Carriage return (CR) |
\' | \u0027 | Apostrophe-quote |
\" | \u0022 | Quotation mark |
\\ | \u005c | Backslash |
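The values in the table can be verified by comparing each escape sequence with its numeric value. This is a minimal sketch with hypothetical names (EscapeSequenceDemo, escapesMatchTable); the values are written as hexadecimal integers rather than as Unicode escapes, since some Unicode escapes would be misinterpreted in the source code, as explained above.

```java
final class EscapeSequenceDemo {
    // Each escape sequence compared with the value listed for it in the table.
    static boolean escapesMatchTable() {
        return '\b' == 0x0008 && '\t' == 0x0009 && '\n' == 0x000a
            && '\f' == 0x000c && '\r' == 0x000d
            && '\'' == 0x0027 && '\"' == 0x0022 && '\\' == 0x005c;
    }

    public static void main(String[] args) {
        System.out.println(escapesMatchTable());
        System.out.println((int) '\n');  // prints 10
        System.out.println((int) '\\');  // prints 92
    }
}
```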
We can also use the escape sequence \ddd to specify a character literal by octal value, where each digit d can be any octal digit (0-7), as shown in Table 2.8. The number of digits must be three or fewer, and the octal value cannot exceed \377, that is, only the first 256 characters can be specified with this notation.
Escape Sequence \ddd | Character Literal |
---|---|
'\141' | 'a' |
'\46' | '&' |
'\60' | '0' |
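A short check of the octal notation, using the hypothetical class OctalEscapeDemo:

```java
final class OctalEscapeDemo {
    // Octal escapes from the table compared with the ordinary character literals.
    static boolean octalEscapesMatch() {
        return '\141' == 'a' && '\46' == '&' && '\60' == '0';
    }

    // The largest value that can be written with an octal escape.
    static int largestOctalEscape() {
        return '\377';
    }

    public static void main(String[] args) {
        System.out.println(octalEscapesMatch());
        System.out.println(largestOctalEscape());  // prints 255
    }
}
```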
A string literal is a sequence of characters, which must be quoted in quotation marks and which must occur on a single line. All string literals are objects of the class String (see Section 10.5, p. 407).
Escape sequences as well as Unicode values can appear in string literals:
"Here comes a tab.\t And here comes another one\u0009!" (1)
"What's on the menu?" (2)
"\"String literals are double-quoted.\"" (3)
"Left!\nRight!" (4)
In (1), the tab character is specified using the escape sequence and the Unicode value, respectively. In (2), the single apostrophe need not be escaped in strings, but it would be if specified as a character literal ('\''). In (3), the double quotes in the string must be escaped. In (4), we use the escape sequence \n to insert a newline. Printing these strings would give the following result:
Here comes a tab.        And here comes another one      !
What's on the menu?
"String literals are double-quoted."
Left!
Right!
One should also use the string literals "\n" and "\r", respectively, for correct interpretation of the characters "\u000a" and "\u000d" in the source code.
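The string literals discussed above can be tried out as follows. This is an illustrative sketch; the class StringLiteralDemo, its constant names, and the helper count are hypothetical.

```java
final class StringLiteralDemo {
    static final String TABS = "Here comes a tab.\t And here comes another one\u0009!";
    static final String MENU = "What's on the menu?";
    static final String QUOTED = "\"String literals are double-quoted.\"";
    static final String LINES = "Left!\nRight!";

    // Hypothetical helper: counts the occurrences of c in s.
    static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(TABS);
        System.out.println(MENU);
        System.out.println(QUOTED);
        System.out.println(LINES);
        // Both ways of writing the tab give the same character.
        System.out.println(count(TABS, '\t'));
    }
}
```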
A white space is a sequence of spaces, tabs, form feeds, and line terminator characters in a Java source file. Line terminators can be newline, carriage return, or carriage return-newline sequence.
A Java program is a free-format sequence of characters that is tokenized by the compiler, that is, broken into a stream of tokens for further analysis. Separators and operators help to distinguish tokens, but sometimes white space has to be inserted explicitly as separators. For example, the identifier classRoom will be interpreted as a single token, unless white space is inserted to distinguish the keyword class from the identifier Room.
White space aids not only in separating tokens, but also in formatting the program so that it is easy for humans to read. The compiler ignores the white spaces once the tokens are identified.
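The role of white space in separating tokens can be seen in a small example. In this sketch, with the hypothetical class WhiteSpaceDemo, the two sum methods consist of the same tokens and differ only in formatting.

```java
final class WhiteSpaceDemo {
    // One token: the identifier classRoom.
    static int classRoom = 101;

    // With white space inserted: the keyword class followed by the identifier Room.
    static class Room { }

    // Separators and operators delimit most tokens here. White space is needed
    // only where two tokens would otherwise run together, as in "int a".
    static int compactSum(){int a=1,b=2;return a+b;}

    // The same tokens, formatted for human readers.
    static int spacedSum() {
        int a = 1, b = 2;
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(compactSum() == spacedSum());
    }
}
```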
A program can be documented by inserting comments at relevant places. These comments are for documentation purposes and are ignored by the compiler.
Java provides three types of comments to document a program:
A single-line comment: // ... to the end of the line
A multiple-line comment: /* ... */
A documentation (Javadoc) comment: /** ... */
All characters after the comment-start sequence // through to the end of the line constitute a single-line comment.
// This comment ends at the end of this line.
int age; // From comment-start sequence to the end of the line is a comment.
A multiple-line comment, as the name suggests, can span several lines. Such a comment starts with /* and ends with */.
/* A comment
   on several
   lines.
*/
The comment-start sequences (//, /*, /**) are not treated differently from other characters when occurring within comments, and are thus ignored. This means that trying to nest multiple-line comments will result in a compile-time error:
/* Formula for alchemy.
   gold = wizard.makeGold(stone);
   /* But it only works on Sundays. */
*/
The second occurrence of the comment-start sequence /* is ignored. The last occurrence of the sequence */ in the code is now unmatched, resulting in a syntax error.
A documentation comment is a special-purpose comment that when placed before class or class member declarations can be extracted and used by the javadoc tool to generate HTML documentation for the program. Documentation comments are usually placed in front of classes, interfaces, methods and field definitions. Groups of special tags can be used inside a documentation comment to provide more specific information. Such a comment starts with /** and ends with */:
/**
 * This class implements a gizmo.
 * @author K.A.M.
 * @version 2.0
 */
For details on the javadoc tool, see the documentation for the tools in the Java 2 SDK.
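The three kinds of comments can be combined in one source file. The sketch below uses the hypothetical class CommentDemo; it also shows that comment-start sequences are ordinary characters inside a string literal.

```java
/**
 * Hypothetical class used only to illustrate the three kinds of comments.
 */
final class CommentDemo {
    // A single-line comment: the sequence /* is ignored here.
    static int age = 30; // Comment to the end of the line.

    /* A multiple-line comment:
       the sequence // is ignored here as well. */
    static String notAComment() {
        // Inside a string literal, comment-start sequences are ordinary characters.
        return "/* not a comment */";
    }

    public static void main(String[] args) {
        System.out.println(age);
        System.out.println(notAComment());
    }
}
```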