4.8 KiB
Encoding and Decoding
Text encoding conversion between different UTF formats with BOM support and binary data encoding.
UTF Encoding with BOM Support
Use BomEncoder and BomDecoder for UTF encoding with BOM support.
BomEncoder and BomDecoder convert data between wchar_t and a specified UTF encoding with BOM added to the very beginning.
UTF Encoding without BOM
Use UtfGeneralEncoder<Native, Expect> and UtfGeneralDecoder<Native, Expect> for UTF conversion without BOM.
UtfGeneralEncoder<Native, Expect> encode from Expect to Native, UtfGeneralDecoder<Native, Expect> decode from Native to Expect. They should be one of wchar_t, char8_t, char16_t, char32_t and char16be_t.
Unlike BomEncoder and BomDecoder, UtfGeneralEncoder and UtfGeneralDecodes is without BOM.
char16be_t means UTF-16 Big Endian, which is not a C++ native type, it can't be used with any string literal.
Specific UTF Conversion Aliases
There are aliases for them to convert between wchar_t and any other UTF encoding:
- Use
Utf8EncoderandUtf8Decoderfor UTF-8 conversion - Use
Utf16EncoderandUtf16Decoderfor UTF-16 conversion - Use
Utf16BEEncoderandUtf16BEDecoderfor UTF-16 Big Endian conversion - Use
Utf32EncoderandUtf32Decoderfor UTF-32 conversion
ASCII/MBCS Encoding
Use MbcsEncoder and MbcsDecoder for ASCII/MBCS conversion.
MbcsEncoder and MbcsDecoder convert data between wchar_t and char, which is ASCII.
BomEncoder::Mbcs also handles ASCII meanwhile there is no BOM for ASCII. A BomEncoder(BomEncoder::Mbcs) works like a MbcsEncoder.
The actual encoding of char depends on the user setting in the running OS.
Automatic Encoding Detection
Use TestEncoding for automatic encoding detection.
There is a function TestEncoding to scan a binary data and guess the most possible UTF encoding.
Base64 Encoding
Use Utf8Base64Encoder and Utf8Base64Decoder for Base64 encoding in UTF-8.
Utf8Base64Encoder and Utf6Base64Decoder convert between binary data to Base64 in UTF8 encoding.
They can work with UtfGeneralEncoder and UtfGeneralDecoder to convert binary data to Base64 in a WString.
Example: Converting Binary Data to Base64 WString
MemoryStream memoryStream;
{
UtfGeneralEncoder<wchar_t, char8_t> u8towEncoder;
EncoderStream u8towStream(memoryStream, u8towEncoder);
Utf8Base64Encoder base64Encoder;
EncoderStream base64Stream(u8t0wStream, base64Encoder);
base64Stream.Write(binary ...);
}
memoryStream.SeekFromBegin(0);
{
StreamReader reader(memoryStream);
auto base64 = reader.ReadToEnd(reader);
}
Example: Converting Base64 WString to Binary Data
MemoryStream memoryStreamn;
{
StreamWriter writer(memoryStream);
writer.WriteString(base64);
}
memoryStream.SeekFromBegin(0);
{
UtfGeneralEncoder<wchar_t, char8_t> wtou8Decoder;
DecoderStream wtou8Stream(memoryStream, wtou8Decoder);
Utf8Base64Decoder base64Decoder;
DecoderStream base64Stream(wtou8Stream, base64Decoder);
base64Stream.Read(binary ...);
}
Data Compression
Use LzwEncoder and LzwDecoder for data compression.
LzwEncodercompress binary data.LzwDecoderdecompress binary data.
Helper Functions
Use CopyStream, CompressStream, DecompressStream helper functions.
There are help functions CopyStream, CompressStream and DecompressStream to make the code simpler.
Extra Content
Encoding Selection Guidelines
When choosing between different encoding methods:
- Use BOM encoders when you need to ensure proper encoding detection by other applications
- Use general UTF encoders for maximum compatibility and control over BOM presence
- Use MBCS encoders only when working with legacy systems that require ASCII compatibility
Performance Considerations
Different encoding operations have varying performance characteristics:
- ASCII/MBCS encoding is fastest but limited to basic character sets
- UTF-8 encoding provides good balance between space efficiency and Unicode support
- UTF-16 and UTF-32 provide different trade-offs between processing speed and memory usage
Error Handling
Encoding operations may fail when:
- Invalid byte sequences are encountered during decoding
- Characters cannot be represented in the target encoding
- BOM detection fails for corrupted files
Always handle potential encoding exceptions, especially when processing user-provided files.
Pipeline Design
The encoder/decoder system is designed for pipeline composition. You can chain multiple encoding operations together to create complex data transformation workflows, such as:
- Base64 decode → UTF-8 decode → String processing
- String processing → UTF-8 encode → Compression → File output
This design provides flexibility for handling various data transformation scenarios efficiently.