{ "cells": [ { "cell_type": "markdown", "id": "41177892", "metadata": {}, "source": [ "# 第4章 Unicode文本和字节序列" ] }, { "cell_type": "markdown", "id": "5f3824d9", "metadata": {}, "source": [ "## 4.1 字符问题" ] }, { "cell_type": "markdown", "id": "2a521e6c", "metadata": {}, "source": [ "- 字符的标识:码点,是0\\~1114111范围内的数(十进制),在Unicode标准中以4\\~6个十六进制数表示,前面加“U+”,取值范围是U+0000\\~U+10FFFF。\n", "- 字符的具体描述取决于所用的编码。编码是在码点和字节序列之间转换时使用的算法。" ] }, { "cell_type": "code", "execution_count": 6, "id": "8cd49bc3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 编码和解码\n", "s = 'café'\n", "len(s)" ] }, { "cell_type": "code", "execution_count": 7, "id": "31de7c8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(b'caf\\xc3\\xa9', 5)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = s.encode('utf8')\n", "b, len(b)" ] }, { "cell_type": "code", "execution_count": 8, "id": "a84e5337", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'café'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.decode('utf8')" ] }, { "cell_type": "markdown", "id": "934d0a3b", "metadata": {}, "source": [ "## 4.2 字节概要" ] }, { "cell_type": "code", "execution_count": 9, "id": "29c31d79", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'caf\\xc3\\xa9'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 包含5个字节的bytes和bytearray对象\n", "cafe = bytes('café', encoding='utf_8')\n", "cafe" ] }, { "cell_type": "code", "execution_count": 10, "id": "4e690ee2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(99, b'c')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cafe[0], cafe[:1]" ] }, { "cell_type": "code", "execution_count": 11, "id": "c7aa6a88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(bytearray(b'caf\\xc3\\xa9'), bytearray(b'\\xa9'))" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cafe_arr = bytearray(cafe)\n", "cafe_arr, cafe_arr[-1:]" ] }, { "cell_type": "markdown", "id": "0254a521", "metadata": {}, "source": [ "## 4.3 处理编码和解码问题" ] }, { "cell_type": "markdown", "id": "9438730d", "metadata": {}, "source": [ "1. 处理UnicodeEncodeError" ] }, { "cell_type": "code", "execution_count": 12, "id": "635fa4b8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'S\\xc4\\x81o Paulo'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 把str编码成字节序列,有些成功,有些需要处理错误\n", "city = 'Sāo Paulo'\n", "city.encode('utf_8')" ] }, { "cell_type": "code", "execution_count": 13, "id": "2122e9d6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'\\xff\\xfeS\\x00\\x01\\x01o\\x00 \\x00P\\x00a\\x00u\\x00l\\x00o\\x00'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "city.encode('utf_16')" ] }, { "cell_type": "code", "execution_count": 27, "id": "4d0c47bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'So Paulo'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "city.encode('iso8859_1', errors='ignore')" ] }, { "cell_type": "code", "execution_count": 19, "id": "d0a9f85f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'S?o Paulo'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "city.encode('cp437', errors='replace')" ] }, { "cell_type": "code", "execution_count": 20, "id": "e0a008aa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Sāo Paulo'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "city.encode('cp437', errors='xmlcharrefreplace')" ] }, { "cell_type": "markdown", "id": "1ee581ac", "metadata": {}, "source": [ "2. 处理UnicodeDecodeError" ] }, { "cell_type": "code", "execution_count": 22, "id": "d478cf99", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Montréal'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 把字节序列解码成str,有些成功,有些需要处理错误\n", "octets = b'Montr\\xe9al'\n", "octets.decode('cp1252')" ] }, { "cell_type": "code", "execution_count": 23, "id": "00d3eea8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Montrιal'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "octets.decode('iso8859_7')" ] }, { "cell_type": "code", "execution_count": 28, "id": "0bcb255c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'MontrИal'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "octets.decode('koi8_r')" ] }, { "cell_type": "code", "execution_count": 29, "id": "0d2235a6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Montr�al'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "octets.decode('utf_8', errors='replace')" ] }, { "cell_type": "markdown", "id": "fd2815ae", "metadata": {}, "source": [ "## 4.4 默认编码" ] }, { "cell_type": "code", "execution_count": 30, "id": "86d92162", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " locale.getpreferredencoding() -> 'cp936'\n", " type(my_file) -> \n", " my_file.encoding -> 'cp936'\n", " sys.stdout.isatty() -> False\n", " sys.stdout.encoding -> 'UTF-8'\n", " sys.stdin.isatty() -> False\n", " sys.stdin.encoding -> 'gbk'\n", " sys.stderr.isatty() -> False\n", " sys.stderr.encoding -> 'UTF-8'\n", " sys.getdefaultencoding() -> 'utf-8'\n", " sys.getfilesystemencoding() -> 'utf-8'\n" ] } ], "source": [ "import locale\n", "import sys\n", "\n", "expressions = \"\"\"\n", " locale.getpreferredencoding()\n", " type(my_file)\n", " my_file.encoding\n", " sys.stdout.isatty()\n", " sys.stdout.encoding\n", " sys.stdin.isatty()\n", " sys.stdin.encoding\n", " sys.stderr.isatty()\n", " sys.stderr.encoding\n", " sys.getdefaultencoding()\n", " sys.getfilesystemencoding()\n", " \"\"\"\n", "\n", "my_file = open('dummy', 'w')\n", "\n", "for expression in expressions.split():\n", " value = eval(expression)\n", " print(f'{expression:>30} -> {value!r}')" ] }, { "cell_type": "markdown", "id": "8356bb33", "metadata": {}, "source": [ "## 4.5 为了正确比较而规范化Unicode字符" ] }, { "cell_type": "markdown", "id": "af688878", "metadata": {}, "source": [ "- NFC(Normalization Form C):使用最少的码点构成等价的字符串。\n", "- NFD:把合成字符分解成基字符和单独的组合字符。" ] }, { "cell_type": "code", "execution_count": 31, "id": "ec3e4218", "metadata": {}, "outputs": [], "source": [ "from unicodedata import normalize" ] }, { "cell_type": "code", "execution_count": 32, "id": "a8353462", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 5)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = 'café'\n", "s2 = 'cafe\\N{COMBINING ACUTE ACCENT}'\n", "len(s1), len(s2)" ] }, { "cell_type": "code", "execution_count": 33, "id": "884bb2c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 4)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(normalize('NFC', s1)), len(normalize('NFC', s2))" ] }, { "cell_type": "code", "execution_count": 34, "id": "7be91a4b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5, 5)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(normalize('NFD', s1)), len(normalize('NFD', s2))" ] }, { "cell_type": "markdown", "id": "080b1b9a", "metadata": {}, "source": [ "## 4.6 极端“规范化”:去掉变音符" ] }, { "cell_type": "code", "execution_count": 38, "id": "fe8f3e0a", "metadata": {}, "outputs": [], "source": [ "import unicodedata\n", "\n", "\n", "def shave_marks(txt):\n", " \"\"\"删除所有变音符\"\"\"\n", " # 把所有字符分解成基字符和组合记号\n", " norm_txt = unicodedata.normalize('NFD', txt)\n", " # 过滤所有组合记号\n", " shaved = ''.join(c for c in norm_txt\n", " if not unicodedata.combining(c))\n", " # 重组所有字符\n", " return unicodedata.normalize('NFC', shaved)" ] }, { "cell_type": "code", "execution_count": 39, "id": "1bd314e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'\n", "shave_marks(order)" ] }, { "cell_type": "code", "execution_count": 40, "id": "1119f53d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ζεφυρος, Zefiro'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "greek = 'Ζέφυρος, Zéfiro'\n", "shave_marks(greek)" ] }, { "cell_type": "code", "execution_count": 47, "id": "748fa3f2", "metadata": {}, "outputs": [], "source": [ "import string\n", "\n", "\n", "def shave_marks_latin(txt):\n", " \"\"\"删除所有拉丁基字符上的变音符\"\"\"\n", " norm_txt = unicodedata.normalize('NFD', txt)\n", " latin_base = False\n", " preserve = []\n", " for c in norm_txt:\n", " if unicodedata.combining(c) and latin_base:\n", " continue # 忽略拉丁基字符的变音符\n", " preserve.append(c)\n", " # 如果不是组合字符,那就是新的基字符\n", " if not unicodedata.combining(c):\n", " latin_base = c in string.ascii_letters\n", " shaved = ''.join(preserve)\n", " return unicodedata.normalize('NFC', shaved)" ] }, { "cell_type": "code", "execution_count": 48, "id": "8577c615", "metadata": {}, "outputs": [], "source": [ "single_map = str.maketrans(\"\"\"‚ƒ„ˆ‹‘’“”•–—˜›\"\"\", # <1>\n", " \"\"\"'f\"^<''\"\"---~>\"\"\")\n", "\n", "multi_map = str.maketrans({ # <2>\n", " '€': 'EUR',\n", " '…': '...',\n", " 'Æ': 'AE',\n", " 'æ': 'ae',\n", " 'Œ': 'OE',\n", " 'œ': 'oe',\n", " '™': '(TM)',\n", " '‰': '',\n", " '†': '**',\n", " '‡': '***',\n", "})\n", "\n", "multi_map.update(single_map) # <3>\n", "\n", "\n", "def dewinize(txt):\n", " \"\"\"把cp1252符号替换为ASCII字符或字符序列\"\"\"\n", " return txt.translate(multi_map)\n", "\n", "def asciize(txt):\n", " # 去掉变音符\n", " no_marks = shave_marks_latin(dewinize(txt))\n", " no_marks = no_marks.replace('ß', 'ss')\n", " # 使用NFKC规范化形式把字符和码点组合起来\n", " return unicodedata.normalize('NFKC', no_marks)" ] }, { "cell_type": "code", "execution_count": 49, "id": "32e2428e", "metadata": {}, "outputs": [], "source": [ "order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'" ] }, { "cell_type": "code", "execution_count": 50, "id": "c5fd5ea7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí.\"'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dewinize(order)" ] }, { "cell_type": "code", "execution_count": 51, "id": "bca0566b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai.\"'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asciize(order)" ] }, { "cell_type": "markdown", "id": "ea72b7d8", "metadata": {}, "source": [ "## 4.7 Unicode文本排序" ] }, { "cell_type": "code", "execution_count": 53, "id": "694e4434", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['açaí', 'acerola', 'atemoia', 'cajá', 'caju']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 巴西产水果的列表排序\n", "import locale\n", "my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')\n", "\n", "fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']\n", "sorted(fruits, key=locale.strxfrm)" ] }, { "cell_type": "markdown", "id": "64a0d14b", "metadata": {}, "source": [ "## 4.8 Unicode数据库" ] }, { "cell_type": "markdown", "id": "27c5cf92", "metadata": {}, "source": [ "Unicode标准提供了一个完整的数据库,不仅包括码点与字符名称之间的映射表,还包括各个字符的元数据,以及字符之间的关系。" ] }, { "cell_type": "markdown", "id": "da7913e1", "metadata": {}, "source": [ "- unicodedata.name():返回一个字符在标准中的官方名称\n", "- unicodedata.numeric():返回一个字符在标准中的数值" ] }, { "cell_type": "markdown", "id": "130a2eac", "metadata": {}, "source": [ "## 4.9 支持str和bytes的双模式API" ] }, { "cell_type": "markdown", "id": "2f2feb4b", "metadata": {}, "source": [ "1. 正则表达式中的str和bytes" ] }, { "cell_type": "code", "execution_count": 55, "id": "aff8aaf4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text\n", " 'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'\n", "Numbers\n", " str : ['௧௭௨௯', '1729', '1', '12', '9', '10']\n", " bytes: [b'1729', b'1', b'12', b'9', b'10']\n", "Words\n", " str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']\n", " bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']\n" ] } ], "source": [ "import re\n", "\n", "# str类型\n", "re_numbers_str = re.compile(r'\\d+')\n", "re_words_str = re.compile(r'\\w+')\n", "# bytes类型\n", "re_numbers_bytes = re.compile(rb'\\d+')\n", "re_words_bytes = re.compile(rb'\\w+')\n", "\n", "text_str = (\"Ramanujan saw \\u0be7\\u0bed\\u0be8\\u0bef\"\n", " \" as 1729 = 1³ + 12³ = 9³ + 10³.\")\n", "\n", "# bytes正则表达式只能搜索bytes字符串\n", "text_bytes = text_str.encode('utf_8')\n", "\n", "print(f'Text\\n {text_str!r}')\n", "print('Numbers')\n", "# str模式r'\\d+'只能匹配泰米尔数值和ASCII数字\n", "print(' str :', re_numbers_str.findall(text_str)) \n", "# bytes模式rb'\\d+'只能匹配ASCII字节中的数字\n", "print(' bytes:', re_numbers_bytes.findall(text_bytes)) \n", "print('Words')\n", "# str模式r'\\w+'能匹配字母、上标、泰米尔数字和ASCII数字\n", "print(' str :', re_words_str.findall(text_str)) \n", "# bytes模式rb'\\w+'只能匹配ASCII字节中的字母和数字\n", "print(' bytes:', re_words_bytes.findall(text_bytes)) " ] }, { "cell_type": "markdown", "id": "f09cad22", "metadata": {}, "source": [ "2. os函数中的str和bytes" ] }, { "cell_type": "markdown", "id": "e25e6cbc", "metadata": {}, "source": [ "- `os`模块中所有接收文件名或路径名的函数,既可以传入`str`参数,也可以传入`bytes`参数。\n", "- 传入`str`参数时,使用`sys.getfilesystemencoding()`获得的编码解码器自动转换参数,操作系统回显时也使用编码解码器进行解码。\n", "- `os`模块提供了特殊的编码解码函数`os.fsencode(name_or_path)`和`os.fsdecode(name_or_path)`。" ] }, { "cell_type": "markdown", "id": "dc7f7498", "metadata": {}, "source": [ "## 4.10 杂谈" ] }, { "cell_type": "markdown", "id": "86b5f663", "metadata": {}, "source": [ "在源码中应该使用非ASCII名称吗?\n", " \n", "- Python3允许在源码中使用非ASCII标识符。\n", "- 让每个人都能轻松地阅读和编辑代码,需要分情况:\n", " - 如果在一家跨国企业,或者一个开源项目中,标识符推荐使用英语,也要使用ASCII字符。\n", " - 如果是一名教师,可以使用本地化语言进行变量和函数命名。\n", "- 任何人都应该选择让团队成员更容易理解代码的语言,并使用正确的字符拼写。 " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }