diff --git a/docs/advanced/cast/strings.rst b/docs/advanced/cast/strings.rst index c70fb0bec..9328d3674 100644 --- a/docs/advanced/cast/strings.rst +++ b/docs/advanced/cast/strings.rst @@ -3,14 +3,23 @@ Strings, bytes and Unicode conversions .. note:: - This section discusses string handling in terms of Python 3 strings. For Python 2.7, replace all occurrences of ``str`` with ``unicode`` and ``bytes`` with ``str``. Python 2.7 users may find it best to use ``from __future__ import unicode_literals`` to avoid unintentionally using ``str`` instead of ``unicode``. + This section discusses string handling in terms of Python 3 strings. For + Python 2.7, replace all occurrences of ``str`` with ``unicode`` and + ``bytes`` with ``str``. Python 2.7 users may find it best to use ``from + __future__ import unicode_literals`` to avoid unintentionally using ``str`` + instead of ``unicode``. Passing Python strings to C++ ============================= -When a Python ``str`` is passed from Python to a C++ function that accepts ``std::string`` or ``char *`` as arguments, pybind11 will encode the Python string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation does not fail. +When a Python ``str`` is passed from Python to a C++ function that accepts +``std::string`` or ``char *`` as arguments, pybind11 will encode the Python +string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation +does not fail. -The C++ language is encoding agnostic. It is the responsibility of the programmer to track encodings. It's often easiest to simply `use UTF-8 everywhere `_. +The C++ language is encoding agnostic. It is the responsibility of the +programmer to track encodings. It's often easiest to simply `use UTF-8 +everywhere `_. .. code-block:: c++ @@ -39,20 +48,27 @@ The C++ language is encoding agnostic. It is the responsibility of the programme .. note:: - Some terminal emulators do not support UTF-8 or emoji fonts and may not display the example above correctly. + Some terminal emulators do not support UTF-8 or emoji fonts and may not + display the example above correctly. -The results are the same whether the C++ function accepts arguments by value or reference, and whether or not ``const`` is used. +The results are the same whether the C++ function accepts arguments by value or +reference, and whether or not ``const`` is used. Passing bytes to C++ -------------------- -A Python ``bytes`` object will be passed to C++ functions that accept ``std::string`` or ``char*`` *without* conversion. +A Python ``bytes`` object will be passed to C++ functions that accept +``std::string`` or ``char*`` *without* conversion. Returning C++ strings to Python =============================== -When a C++ function returns a ``std::string`` or ``char*`` to a Python caller, **pybind11 will assume that the string is valid UTF-8** and will decode it to a native Python ``str``, using the same API as Python uses to perform ``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will raise a ``UnicodeDecodeError``. +When a C++ function returns a ``std::string`` or ``char*`` to a Python caller, +**pybind11 will assume that the string is valid UTF-8** and will decode it to a +native Python ``str``, using the same API as Python uses to perform +``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will +raise a ``UnicodeDecodeError``. .. code-block:: c++ @@ -68,16 +84,22 @@ When a C++ function returns a ``std::string`` or ``char*`` to a Python caller, * True -Because UTF-8 is inclusive of pure ASCII, there is never any issue with returning a pure ASCII string to Python. If there is any possibility that the string is not pure ASCII, it is necessary to ensure the encoding is valid UTF-8. +Because UTF-8 is inclusive of pure ASCII, there is never any issue with +returning a pure ASCII string to Python. If there is any possibility that the +string is not pure ASCII, it is necessary to ensure the encoding is valid +UTF-8. .. warning:: - Implicit conversion assumes that a returned ``char *`` is null-terminated. If there is no null terminator a buffer overrun will occur. + Implicit conversion assumes that a returned ``char *`` is null-terminated. + If there is no null terminator a buffer overrun will occur. Explicit conversions -------------------- -If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one can perform a explicit conversion and return a ``py::str`` object. Explicit conversion has the same overhead as implicit conversion. +If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one +can perform a explicit conversion and return a ``py::str`` object. Explicit +conversion has the same overhead as implicit conversion. .. code-block:: c++ @@ -95,15 +117,20 @@ If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one ca >>> str_output() 'Send your résumé to Alice in HR' -The `Python C API `_ provides several built-in codecs. +The `Python C API +`_ provides +several built-in codecs. -One could also use a third party encoding library such as libiconv to transcode to UTF-8. +One could also use a third party encoding library such as libiconv to transcode +to UTF-8. Return C++ strings without conversion ------------------------------------- -If the data in a C++ ``std::string`` does not represent text and should be returned to Python as ``bytes``, then one can return the data as a ``py::bytes`` object. +If the data in a C++ ``std::string`` does not represent text and should be +returned to Python as ``bytes``, then one can return the data as a +``py::bytes`` object. .. code-block:: c++ @@ -120,7 +147,8 @@ If the data in a C++ ``std::string`` does not represent text and should be retur b'\xba\xd0\xba\xd0' -Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly. +Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without +encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly. .. code-block:: c++ @@ -128,7 +156,7 @@ Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without e [](std::string s) { // Accepts str or bytes from Python return s; // Looks harmless, but implicitly converts to str } - ); + ); .. code-block:: python @@ -142,7 +170,12 @@ Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without e Wide character strings ====================== -When a Python ``str`` is passed to a C++ function expecting ``std::wstring``, ``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each type, in the platform's endian. When strings of these types are returned, they are assumed to contain valid UTF-16 or UTF-32, and will be decoded to Python ``str``. +When a Python ``str`` is passed to a C++ function expecting ``std::wstring``, +``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be +encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each +type, in the platform's endian. When strings of these types are returned, they +are assumed to contain valid UTF-16 or UTF-32, and will be decoded to Python +``str``. .. code-block:: c++ @@ -171,17 +204,23 @@ When a Python ``str`` is passed to a C++ function expecting ``std::wstring``, `` .. warning:: - Wide character strings may not work as described on Python 2.7 or Python 3.3 compiled with ``--enable-unicode=ucs2``. + Wide character strings may not work as described on Python 2.7 or Python + 3.3 compiled with ``--enable-unicode=ucs2``. -Strings in multibyte encodings such as Shift-JIS must transcoded to a UTF-8/16/32 before being returned to Python. +Strings in multibyte encodings such as Shift-JIS must transcoded to a +UTF-8/16/32 before being returned to Python. Character literals ================== -C++ functions that accept character literals as input will receive the first character of a Python ``str`` as their input. If the string is longer than one Unicode character, trailing characters will be ignored. +C++ functions that accept character literals as input will receive the first +character of a Python ``str`` as their input. If the string is longer than one +Unicode character, trailing characters will be ignored. -When a character literal is returned from C++ (such as a ``char`` or a ``wchar_t``), it will be converted to a ``str`` that represents the single character. +When a character literal is returned from C++ (such as a ``char`` or a +``wchar_t``), it will be converted to a ``str`` that represents the single +character. .. code-block:: c++ @@ -189,26 +228,34 @@ When a character literal is returned from C++ (such as a ``char`` or a ``wchar_t m.def("pass_wchar", [](wchar_t w) { return w; }); .. code-block:: python - + >>> example.pass_char('A') 'A' -While C++ will cast integers to character types (``char c = 0x65;``), pybind11 does not convert Python integers to characters implicitly. The Python function ``chr()`` can be used to convert integers to characters. +While C++ will cast integers to character types (``char c = 0x65;``), pybind11 +does not convert Python integers to characters implicitly. The Python function +``chr()`` can be used to convert integers to characters. .. code-block:: python - + >>> example.pass_char(0x65) TypeError >>> example.pass_char(chr(0x65)) 'A' -If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t`` as the argument type. +If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t`` +as the argument type. Grapheme clusters ----------------- -A single grapheme may be represented by two or more Unicode characters. For example 'é' is usually represented as U+00E9 but can also be expressed as the combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by a combining acute accent). The combining character will be lost if the two-character sequence is passed as an argument, even though it renders as a single grapheme. +A single grapheme may be represented by two or more Unicode characters. For +example 'é' is usually represented as U+00E9 but can also be expressed as the +combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by +a combining acute accent). The combining character will be lost if the +two-character sequence is passed as an argument, even though it renders as a +single grapheme. .. code-block:: python @@ -226,18 +273,22 @@ A single grapheme may be represented by two or more Unicode characters. For exam >>> example.pass_wchar(combining_e_acute) 'e' -Normalizing combining characters before passing the character literal to C++ may resolve *some* of these issues: +Normalizing combining characters before passing the character literal to C++ +may resolve *some* of these issues: .. code-block:: python >>> example.pass_wchar(unicodedata.normalize('NFC', combining_e_acute)) 'é' -In some languages (Thai for example), there are `graphemes that cannot be expressed as a single Unicode code point `_, so there is no way to capture them in a C++ character type. +In some languages (Thai for example), there are `graphemes that cannot be +expressed as a single Unicode code point +`_, so there is +no way to capture them in a C++ character type. References ========== -* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) `_ -* `C++ - Using STL Strings at Win32 API Boundaries `_ \ No newline at end of file +* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) ` +* `C++ - Using STL Strings at Win32 API Boundaries `_