“Better code, better life. ”

Pitfalls of the Python string type

Xiao Zhang, a non-professional Python programmer, wrote a script in Python today and, unsurprisingly, it went wrong again →.→

Problem description

The problem started with a comparison of two variables:
```
str1 = b'abcd'
str2 = 'abcd'
```
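str1 is of type bytes and str2 is of type str. I had been writing Go, where []byte and string can mostly be treated interchangeably, so I naively assumed that a Python string is just bytes underneath and wrote this line: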
```
if str(str1) == str2:
    # do something
    pass
```
Obviously I really was too naive...
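To see why the comparison fails, here is a minimal sketch, assuming Python 3; the names `as_repr` and `as_text` are only illustrative. Calling `str()` on a bytes object without an encoding returns its printable representation, including the `b` prefix and the quotes, rather than decoding it.

```python
str1 = b'abcd'
str2 = 'abcd'

# str() without an encoding returns the repr of the bytes object.
as_repr = str(str1)
print(as_repr)            # b'abcd'
print(as_repr == str2)    # False

# Decoding explicitly turns the bytes into text.
as_text = str1.decode('utf-8')
print(as_text == str2)    # True

# str() with an encoding argument decodes as well.
print(str(str1, 'utf-8') == str2)  # True
```

Either `str1.decode('utf-8')` or `str(str1, 'utf-8')` would have made the original comparison behave as intended.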

Problem analysis

First, let's look at how Go handles a similar situation:

```
package main

import (
	"bytes"
	"fmt"
)

func main() {
	var bf bytes.Buffer
	bf.WriteByte('a')
	var b []byte
	b = append(b, 'a')
	var str string
	str = "a"
	fmt.Println(str == bf.String())
	fmt.Println(str == string(b))
	fmt.Println(string(b) == bf.String())
}
```
Output:
```
true
true
true
```

I later looked into it and found that a Go string is not simply equal to []byte either, but I won't go into that here.

For Python 2, the official documentation says the following about string literals:

* The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
* String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences. A prefix of 'u' or 'U' makes the string a Unicode string. 
* A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

In Python 2, besides b, the string prefixes also include r / R and u / U, which mark the literal as a raw string and a Unicode string respectively. The b prefix is ignored in Python 2.

The Python 3 documentation says:

* Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

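A literal prefixed with b is a bytes object, that is, a sequence of bytes.
A byte has only 8 bits and holds a value from 0 to 255. Only ASCII characters may be typed directly in a bytes literal; values of 128 or greater have to be written as escapes.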
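A short sketch of what this means in practice, assuming Python 3 (the names `data`, `source` and `rejected` are illustrative): a bytes object can hold any value from 0 to 255, but only ASCII characters may appear literally in the source.

```python
# Values >= 128 are written with \x escapes.
data = b'A\xd7\xa9'
print(list(data))   # [65, 215, 169]
print(len(data))    # 3

# A non-ASCII character typed directly into a bytes literal is rejected
# at compile time; compile() is used so the error can be caught here.
source = "b'\u05e9'"
try:
    compile(source, '<example>', 'eval')
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)     # True
```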

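Similarly, in C++ a std::string is a sequence of char, and a char is exactly one byte, so a non-ASCII character encoded as UTF-8 occupies several char elements.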

So we arrive at one conclusion: A CHARACTER IS NOT A BYTE
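A minimal illustration of that conclusion, assuming Python 3 (the names `text` and `encoded` are illustrative): one character can take several bytes, and how many depends on the encoding.

```python
text = '€'
encoded = text.encode('utf-8')

print(len(text))      # 1 (one character)
print(len(encoded))   # 3 (three bytes in UTF-8)
print(encoded)        # b'\xe2\x82\xac'

# The same character takes a different number of bytes in another encoding.
print(len(text.encode('utf-16-le')))  # 2
```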

Summary

We use str for text, for example:
```
print('שלום עולם')
```
Output:
```
שלום עולם
```
We use bytes for lower-level data, for example how the string above is actually stored in the computer as bytes (ultimately zeros and ones):
```
bytes('שלום עולם', 'utf-8')
```
Output:
```
b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d \xd7\xa2\xd7\x95\xd7\x9c\xd7\x9d'
```
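As a rough sketch of the same idea, assuming Python 3 and UTF-8 (the names `text`, `encoded` and `bits` are illustrative), `str.encode` gives the same result as the `bytes(..., 'utf-8')` call above, and the individual bytes can be printed as bits:

```python
text = 'שלום עולם'

encoded = text.encode('utf-8')
print(encoded == bytes(text, 'utf-8'))   # True
print(len(text), len(encoded))           # 9 17

# Show the two bytes of the first character as bits.
bits = ' '.join(format(b, '08b') for b in encoded[:2])
print(bits)                              # 11010111 10101001

# Decoding with the same encoding restores the original text.
print(encoded.decode('utf-8') == text)   # True
```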
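However, converting between bytes and str always requires an explicit encode or decode. That is exactly the silly mistake I made above. The following snippets illustrate the point: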
```
b'\xE2\x82\xAC'.decode('UTF-8')
```
Output:
```
'€'
```
But the two cannot be concatenated directly, because there is no implicit conversion between bytes and str:
```
b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
```
Output:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
```
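One way to make the concatenation work is to convert one side explicitly first. This is only a sketch, assuming Python 3 and UTF-8; the names `bom`, `as_bytes` and `as_str` are illustrative.

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: encode the text and stay in bytes.
as_bytes = bom + text.encode('utf-8')

# Option 2: decode the bytes and stay in str.
# Decoding the BOM with 'utf-8' yields the character U+FEFF.
as_str = bom.decode('utf-8') + text
print(as_str == '\ufeff' + text)             # True

# The 'utf-8-sig' codec strips a leading BOM while decoding.
print(as_bytes.decode('utf-8-sig') == text)  # True
```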
Because the ASCII code of A is 0x41 (65 in decimal), these two spellings are equivalent:
```
b'A' == b'\x41'
```
Output:
```
True
```
But a str never compares equal to a bytes object:
```
'A' == b'A'
```
Output:
```
False
```
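A small sketch of how the two sides can still be compared, assuming Python 3 and ASCII data (the names `first_byte`, `decoded` and `encoded` are illustrative): convert one side explicitly so both operands have the same type.

```python
# Indexing a bytes object gives an int, not a one-character string.
first_byte = b'A'[0]
print(first_byte)             # 65
print(first_byte == 0x41)     # True
print(ord('A'))               # 65

# To compare text with bytes, convert one side explicitly.
decoded = b'A'.decode('ascii')
encoded = 'A'.encode('ascii')
print('A' == decoded)         # True
print(encoded == b'A')        # True
```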
