“Better code, better life. ”
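Pitfalls with Python's string type
Xiao Zhang, a non-professional Python programmer, wrote a script in Python today and, as expected, ran into trouble again →.→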
-
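Problem description
The problem started with a comparison between two variables:

    str1 = b'abcd'
    str2 = 'abcd'

str1 is of type bytes and str2 is of type str. I had been writing Go, where []byte and string can mostly be treated interchangeably, so I naively assumed that a Python string was just bytes underneath and wrote this line:

    if str(str1) == str2:
        pass  # do something

Clearly I was far too naive...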
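To see concretely why that comparison never succeeds, here is a minimal sketch for Python 3. The names `as_repr` and `as_text` are introduced only for illustration, and UTF-8 is an assumption about how the bytes were produced.

```python
str1 = b'abcd'
str2 = 'abcd'

# str() on a bytes object, with no encoding given, returns its repr,
# including the b prefix and the quotes.
as_repr = str(str1)
print(as_repr)            # b'abcd'
print(as_repr == str2)    # False

# Decoding turns the bytes into text (UTF-8 is assumed here).
as_text = str1.decode('utf-8')
print(as_text == str2)    # True

# Passing an encoding to str() has the same effect as decode().
print(str(str1, 'utf-8') == str2)  # True
```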
-
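Problem analysis
First, let's look at how Go handles a similar situation:

    package main

    import (
        "bytes"
        "fmt"
    )

    func main() {
        var bf bytes.Buffer
        bf.WriteByte('a')
        var b []byte
        b = append(b, 'a')
        var str string
        str = "a"
        // All three hold the single byte 'a', so every comparison is true.
        fmt.Println(str == bf.String())
        fmt.Println(str == string(b))
        fmt.Println(string(b) == bf.String())
    }

Output:

    true
    true
    true

Later I looked into it further: a Go string is not simply the same thing as []byte either, but I won't go into that here.

For Python 2, the official documentation says the following about string literals: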
* The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
* String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences. A prefix of 'u' or 'U' makes the string a Unicode string.
* A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

So in Python 2, besides b, a string literal can also take the prefixes r/R and u/U, which mark it as a raw string and a Unicode string respectively, while the b prefix is simply ignored.

The Python 3 documentation says this:
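* Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

- A literal that starts with b has the type bytes, a sequence of bytes.
- A byte is only 8 bits and holds a value from 0 to 255, so it cannot represent an arbitrary character. A bytes literal may contain only ASCII characters directly; values of 128 or greater must be written with escapes.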
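The two points above can be checked with a small sketch. The names `literal_rejected`, `high`, and `first` are made up for this example; `compile()` is used only so that the syntax error can be caught rather than stopping the script.

```python
# A bytes literal containing a non-ASCII character is rejected at compile time.
try:
    compile("b'é'", "<example>", "eval")
    literal_rejected = False
except SyntaxError:
    literal_rejected = True

# Values of 128 or greater have to be written with escapes.
high = b'\xe9'

# Indexing a bytes object yields an int in range(256), not a one-character string.
first = high[0]

print(literal_rejected, first, type(first).__name__)
```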
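Similarly, in C++ a std::string is a sequence of char, and a char is exactly one byte, so a character outside ASCII takes several char elements when it is encoded as UTF-8.

So we arrive at one conclusion: A CHARACTER IS NOT A BYTE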
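A quick way to convince yourself of this is to compare the length of a str with the length of its encoded bytes. This is only a sketch: UTF-8 is an assumed encoding, and other encodings would give different byte counts.

```python
text = 'שלום עולם'               # nine code points, including the space
data = text.encode('utf-8')      # UTF-8 is an assumption

char_count = len(text)           # counts code points
byte_count = len(data)           # counts bytes

print(char_count)                # 9
print(byte_count)                # 17: two bytes per Hebrew letter plus one for the space

euro_chars = len('€')                    # 1
euro_bytes = len('€'.encode('utf-8'))    # 3
print(euro_chars, euro_bytes)
```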
-
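Summary
We use str (string) to output text, for example:

    print('שלום עולם')

Output:

    שלום עולם

We use bytes to output lower-level information, for example how the string above is stored in the computer as 0s and 1s:

    bytes('שלום עולם', 'utf-8')

Output:

    b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d \xd7\xa2\xd7\x95\xd7\x9c\xd7\x9d'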
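To get closer to the actual 0s and 1s, each byte can be rendered as eight binary digits. This is a small sketch using only the first word of the sample string; the variable names are illustrative.

```python
text = 'שלום'
data = bytes(text, 'utf-8')      # same result as text.encode('utf-8')

# Render each byte as eight binary digits.
bits = ' '.join(format(byte, '08b') for byte in data)
print(bits)
```

Each Hebrew letter here becomes two bytes, so the four-letter word prints as eight groups of eight bits.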
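However, converting between bytes and str always requires an explicit encode or decode, and that is exactly the silly mistake I made above. The following snippets illustrate the point well:

    b'\xE2\x82\xAC'.decode('UTF-8')

Output:

    '€'

But the two cannot be concatenated directly, because there is no implicit conversion from bytes to str:

    b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'

Output:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't concat bytes to str

The exact wording of this message varies between Python 3 versions.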
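One way to make the concatenation work is to bring both operands to the same type first. The sketch below shows both directions; the variable names are illustrative, and UTF-8 is assumed since the bytes in question are a UTF-8 BOM.

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: stay in bytes by encoding the text first.
as_bytes = bom + text.encode('utf-8')

# Option 2: stay in text by decoding the bytes first.
# The 'utf-8-sig' codec drops a leading BOM while decoding.
as_text = as_bytes.decode('utf-8-sig')

# The plain 'utf-8' codec keeps the BOM as the character U+FEFF.
with_bom_char = as_bytes.decode('utf-8')

print(as_bytes)
print(as_text == text)
print(with_bom_char == '\ufeff' + text)
```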
Since the ASCII code of A is 0x41 (65 in decimal), these two spellings are equivalent:

    b'A' == b'\x41'

Output:

    True

But:

    'A' == b'A'

Output:

    False
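If a script has to compare values that may arrive as either type, one option is to normalize them to text before comparing. The helper below is a hypothetical sketch, not a standard library function, and its default of UTF-8 is an assumption.

```python
def same_text(left, right, encoding='utf-8'):
    """Compare two values as text, decoding any bytes input first."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left == right


direct = ('A' == b'A')
print(direct)                       # False
print(same_text('A', b'A'))         # True
print(same_text(b'abcd', 'abcd'))   # True
```

Running the interpreter with the `-b` option makes Python warn about direct str/bytes comparisons like the first one, which can help locate this kind of mistake.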