Characters
Character Sets
- A character set is a list of characters that a computer recognises from their binary representations.
- These character sets include standard printable characters (letters, numbers, symbols and the space), as well as non-printing control characters (like tab and newline).
- Each character in a character set is assigned a unique number, often represented in binary, which identifies that character.
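The mapping between characters and numbers can be explored with a minimal Python sketch. It relies only on the built-in `ord` and `chr` functions; the sample characters and variable names are illustrative choices, not part of any standard.

```python
# Look up the number assigned to a character, and show it in binary.
for ch in ["A", "a", "7", "\t"]:
    code = ord(ch)               # character -> number
    bits = format(code, "08b")   # the same number as 8 binary digits
    print(repr(ch), code, bits)

# Convert a short string to its numbers and back again.
codes = [ord(ch) for ch in "Hi!"]
restored = "".join(chr(c) for c in codes)   # number -> character
print(codes, restored)
```

Note that upper-case and lower-case letters have different numbers, and that the digit character "7" is stored as 55, not as the value 7.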
ASCII
- The American Standard Code for Information Interchange (ASCII) is a widely used character set.
- ASCII originally used a 7-bit binary code to represent each character. This allowed for 128 characters (2^7) in total.
- ASCII itself is a 7-bit set made up of 95 printable characters and 33 control codes. "Extended ASCII" is a loose name for the various 8-bit character sets (such as ISO 8859-1) that keep the 128 ASCII characters and use the extra bit to add up to 128 more.
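The figures above can be checked with a short Python sketch. It assumes the usual split of ASCII into printable codes 32 to 126 and control codes 0 to 31 plus 127; the variable names are illustrative.

```python
# 7 bits give 2**7 possible codes.
total = 2 ** 7

# Printable characters run from space (32) to tilde (126).
printable = [chr(c) for c in range(32, 127)]

# Control codes are 0-31 plus DEL (127).
control = [c for c in range(total) if c < 32 or c == 127]

print(total, len(printable), len(control))

# A string is pure ASCII only if every character's number is below 128.
plain_is_ascii = all(ord(ch) < total for ch in "Hello")
accented_is_ascii = "caf\u00e9".isascii()   # contains e-acute (code 233)
print(plain_is_ascii, accented_is_ascii)
```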
Unicode
- Unicode is another character set, created to cover the characters of all the world's writing systems, as ASCII could only represent unaccented Latin (English) letters, digits and basic symbols.
- Unicode assigns each character a code point in the range U+0000 to U+10FFFF, giving room for over a million (1,114,112) code points. Depending on the encoding, a character is stored using between 8 and 32 bits.
- Unicode can represent a wider range of characters, including those used in non-Western languages, emojis, and other special characters.
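As a rough illustration, the sketch below prints the Unicode code point and official name of a few characters using Python's standard `unicodedata` module. The sample characters are arbitrary choices, and only ASCII text is printed so the output does not depend on the terminal's encoding.

```python
import unicodedata

# A Latin letter, an accented letter, a Chinese character and an emoji.
samples = ["A", "é", "你", "😀"]

for ch in samples:
    # ord() gives the code point; it is conventionally written as U+ plus hex.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

code_points = [ord(ch) for ch in samples]

# The highest code point Unicode allows, and the size of the code space.
highest = 0x10FFFF
code_space_size = highest + 1
print(code_space_size)
```

The first sample has the same number in Unicode as in ASCII (65), since the first 128 Unicode code points match ASCII.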
Encodings
- Encoding is the process of transforming a sequence of characters into a sequence of bytes. Decoding is the reverse process.
- Common encoding systems include UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format.
- UTF-8 is widely used. It can represent any character in the Unicode standard using one to four bytes, so it supports multilingual text, and it is backward-compatible with ASCII because ASCII characters keep their single-byte ASCII values.
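The difference between the three encodings can be seen by counting the bytes each one needs per character. This is a minimal sketch using Python's built-in `str.encode`; the little-endian codec names (`utf-16-le`, `utf-32-le`) are chosen here so that no byte-order mark is added to the counts, and the sample characters are arbitrary.

```python
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# A Latin letter, an accented letter, the euro sign and an emoji.
text = "Aé€😀"

# Bytes needed for each character under each encoding.
sizes = {ch: [len(ch.encode(enc)) for enc in encodings] for ch in text}

for ch, counts in sizes.items():
    print(f"U+{ord(ch):04X}", dict(zip(encodings, counts)))

# Backward compatibility: ASCII text gives identical bytes in ASCII and UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)
```

UTF-8 and UTF-16 are variable-width, while UTF-32 always uses four bytes per character.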
Importance of Character Sets and Encodings
- Understanding how data is represented as characters is crucial in computing. It helps to handle text correctly, including proper display, storage, and transmission.
- Agreed standard character sets and encoding schemes, such as Unicode and UTF-8, make text data interoperable across different platforms and support global communication.
- A sequence of bytes only becomes text once a character set and encoding are applied to it. By knowing which set and encoding were used, we can interpret the data as intended; the wrong choice produces garbled characters or a decoding error.
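To illustrate why the encoding must be known, the sketch below decodes the same two bytes in two different ways, then tries to decode a byte that is not valid UTF-8. The example values are illustrative, and `ascii()` is used when printing so the output stays readable on any terminal.

```python
# The two bytes produced by encoding e-acute in UTF-8.
data = "é".encode("utf-8")
print(data.hex())

# Decoded with the encoding that was actually used: one character.
as_utf8 = data.decode("utf-8")

# Decoded as Latin-1 (ISO 8859-1): each byte becomes its own character.
as_latin1 = data.decode("latin-1")

print(ascii(as_utf8), len(as_utf8))
print(ascii(as_latin1), len(as_latin1))

# Some byte values are not valid UTF-8 at all and cannot be decoded.
try:
    b"\xff".decode("utf-8")
    invalid_rejected = False
except UnicodeDecodeError:
    invalid_rejected = True
print(invalid_rejected)
```

The Latin-1 result is the kind of garbled text often seen when a file saved as UTF-8 is opened with the wrong encoding.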