Work notes: pitfalls with Python's str type

"Python"

Posted by Simon on March 5, 2020

“Better code, better life. ”

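Pitfalls with Python's string type

Xiao Zhang, a non-professional Python programmer, wrote a script in Python today and, as expected, ran into trouble again →.→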

  • Problem description

    The problem started with a comparison between two variables

    str1 = b'abcd'
    str2 = 'abcd'
    

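    str1 is of type bytes and str2 is of type str. I had been writing Go, where []byte and string can mostly be treated interchangeably, so I naively assumed that a Python string is just bytes underneath and wrote this line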

    if str(str1) == str2:
        pass  # do something
    

    Obviously, I really was too naive...
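    To see why that comparison can never succeed, here is a minimal sketch, assuming Python 3 (the names as_repr and as_text are mine, not from the original script). Calling str() on a bytes object without an encoding returns its printable representation, b prefix and quotes included, while decode() turns the bytes into text.

```python
str1 = b'abcd'
str2 = 'abcd'

# str() with no encoding argument returns the repr of the bytes object,
# including the b prefix and the quotes.
as_repr = str(str1)
print(as_repr)            # b'abcd'
print(as_repr == str2)    # False

# Decoding with an explicit encoding yields real text.
as_text = str1.decode('utf-8')
print(as_text == str2)    # True

# str() also decodes when it is given an encoding.
print(str(str1, 'utf-8') == str2)  # True
```

    UTF-8 is an assumption here; the right codec is whatever encoding the bytes were produced with.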

  • Problem analysis

    First, let's look at how Go handles a similar situation

    package main

    import (
    	"bytes"
    	"fmt"
    )

    func main() {
    	var bf bytes.Buffer
    	bf.WriteByte('a')
    	var b []byte
    	b = append(b, 'a')
    	str := "a"
    	fmt.Println(str == bf.String())
    	fmt.Println(str == string(b))
    	fmt.Println(string(b) == bf.String())
    }
    

    Output:

    true
    true
    true
    

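    I looked into it afterwards: a Go string is not simply the same thing as []byte either, but I won't go into that here

    For Python 2, the official documentation says the following about string literals: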

    * The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
    * String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences. A prefix of 'u' or 'U' makes the string a Unicode string. 
    * A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
    

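    In Python 2, besides b, the string prefixes also include r / R and u / U, which mark the literal as a raw string or a Unicode string respectively. The b prefix is ignored in Python 2.

    Python 3 puts it this way: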

    * Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
    
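    1. A literal that starts with b is of type bytes, a sequence of bytes
    2. A byte has only 8 bits, so each element is an integer from 0 to 255; a bytes literal may contain only ASCII characters directly, and values of 128 or greater must be written with escapes

    Similarly, in C++ a std::string is a sequence of char, and a char is one byte, so a std::string also holds bytes rather than characters: a non-ASCII character encoded as UTF-8 occupies several char elements

    So we arrive at a conclusion: A CHARACTER IS NOT A BYTE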
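    A small sketch of that conclusion, assuming Python 3 and UTF-8 (the variable names are only illustrative): one character can take several bytes, and indexing the two types gives different kinds of values.

```python
euro = '€'                      # one character (one code point)
encoded = euro.encode('utf-8')  # its UTF-8 encoding

print(len(euro))      # 1
print(len(encoded))   # 3
print(list(encoded))  # [226, 130, 172]

# Indexing str gives a one-character str; indexing bytes gives an int.
print(type(euro[0]).__name__)     # str
print(type(encoded[0]).__name__)  # int
```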

  • Summary

    We use str to represent text, for example:

    print('שלום עולם')
    

    Output:

    שלום עולם
    

    We use bytes to represent lower-level information, such as how the string above is stored in the computer as 0s and 1s:

    bytes('שלום עולם', 'utf-8')
    

    Output:

    b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d \xd7\xa2\xd7\x95\xd7\x9c\xd7\x9d'
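    To make the "0s and 1s" literal, the following sketch (assuming UTF-8; the names text, data and bits are mine) formats each byte of the same string as eight binary digits.

```python
text = 'שלום עולם'
data = text.encode('utf-8')

# Eight Hebrew letters at two bytes each, plus one space.
print(len(text))   # 9
print(len(data))   # 17

# Format every byte as eight binary digits.
bits = ' '.join(format(byte, '08b') for byte in data)
print(bits)

# The first letter (U+05E9) is the byte pair 0xd7 0xa9.
print(bits.split()[:2])   # ['11010111', '10101001']
```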
    

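    But any conversion between bytes and str must go through encode or decode. That is exactly the silly mistake I made above. The following snippets illustrate the point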

    b'\xE2\x82\xAC'.decode('UTF-8')
    

    Output:

    '€'
    

    However, the two cannot be concatenated directly, because there is no implicit conversion between bytes and str

    b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
    

    Output:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't concat bytes to str
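    One way around this error, sketched here assuming the bytes are UTF-8 (the variable names are illustrative), is to bring both operands to the same type before concatenating.

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: encode the text, then concatenate bytes with bytes.
as_bytes = bom + text.encode('utf-8')
print(as_bytes)       # b'\xef\xbb\xbfText with a UTF-8 BOM'

# Option 2: decode the bytes, then concatenate str with str.
as_str = bom.decode('utf-8') + text
print(repr(as_str))   # '\ufeffText with a UTF-8 BOM'

# The 'utf-8-sig' codec drops a leading BOM while decoding.
print(as_bytes.decode('utf-8-sig'))   # Text with a UTF-8 BOM
```

    Which option fits depends on what the result is for: bytes for writing to a file or socket, str for text processing.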
    

    Since the ASCII code of A is 0x41 (hexadecimal, 65 in decimal), these two spellings are equivalent

    b'A' == b'\x41'
    

    Output:

    True
    

    But

    'A' == b'A'
    

    Output:

    False
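    One possible habit for avoiding this trap is to normalize values to str at the boundary before comparing them. The helper below is a hypothetical sketch, not part of the original script, and it assumes the bytes are UTF-8 unless told otherwise.

```python
def as_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value

print(as_text('A') == as_text(b'A'))   # True
print(as_text(b'abcd') == 'abcd')      # True
```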