PEP: 675
Title: Arbitrary Literal String Type
Version: $Revision$
Last-Modified: $Date$
Author: Pradeep Kumar Srinivasan <gohanpra@gmail.com>, Graham Bleaney <gbleaney@gmail.com>
Sponsor: Jelle Zijlstra <jelle.zijlstra@gmail.com>
Discussions-To: https://mail.python.org/archives/list/typing-sig@python.org/thread/VB74EHNM4RODDFM64NEEEBJQVAUAWIAW/
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Nov-2021
Python-Version: 3.11
Post-History: 07-Feb-2022
Resolution: https://mail.python.org/archives/list/python-dev@python.org/message/XEOOSSPNYPGZ5NXOJFPLXG2BTN7EVRT5/

Abstract
========

There is currently no way to specify, using type annotations, that a function parameter can be of any literal string type. We have to specify a precise literal string type, such as ``Literal["foo"]``. This PEP introduces a supertype of literal string types: ``LiteralString``. This allows a function to accept arbitrary literal string types, such as ``Literal["foo"]`` or ``Literal["bar"]``.


Motivation
==========

Powerful APIs that execute SQL or shell commands often recommend that they be invoked with literal strings, rather than arbitrary user controlled strings. There is no way to express this recommendation in the type system, however, meaning security vulnerabilities sometimes occur when developers fail to follow it. For example, a naive way to look up a user record from a database is to accept a user id and insert it into a predefined SQL query:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)

    query_user(conn, "user123")  # OK.

However, the user-controlled data ``user_id`` is being mixed with the SQL command string, which means a malicious user could run arbitrary SQL commands:

::

    # Delete the table.
    query_user(conn, "user123; DROP TABLE data;")

    # Fetch all users (since 1 = 1 is always true).
    query_user(conn, "user123 OR 1 = 1")

To prevent such SQL injection attacks, SQL APIs offer parameterized queries, which separate the executed query from user-controlled data and make it impossible to run arbitrary queries. For example, with `sqlite3 <https://docs.python.org/3/library/sqlite3.html>`_, our original function would be written safely as a query with parameters:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        conn.execute(query, (user_id,))

The problem is that there is no way to enforce this discipline. sqlite3's own `documentation <https://docs.python.org/3/library/sqlite3.html>`_ can only admonish the reader to not dynamically build the ``sql`` argument from external input; the API's authors cannot express that through the type system. Users can (and often do) still use a convenient f-string as before and leave their code vulnerable to SQL injection.

Existing tools, such as the popular security linter `Bandit <https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_, attempt to detect unsafe external data used in SQL APIs, by inspecting the AST or by other semantic pattern-matching. These tools, however, preclude common idioms like storing a large multi-line query in a variable before executing it, adding literal string modifiers to the query based on some conditions, or transforming the query string using a function. (We survey existing tools in the `Rejected Alternatives`_ section.) For example, many tools will detect a false positive issue in this benign snippet:

::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """
        if limit:
            query += " LIMIT 1"

        conn.execute(query, (user_id,))

We want to forbid harmful execution of user-controlled data while still allowing benign idioms like the above and not requiring extra user work.
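The difference between the interpolated and the parameterized query can be observed with an in-memory ``sqlite3`` database. The following is a minimal sketch: the table layout, the sample rows, and the function names ``query_unsafe`` and ``query_safe`` are assumptions made for illustration, and the interpolated value is quoted because the hypothetical ``user_id`` column holds text.

```python
import sqlite3

# Hypothetical schema and rows, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "Alice"), ("user456", "Bob")],
)


def query_unsafe(conn, user_id):
    # The user-controlled value becomes part of the SQL text itself.
    query = f"SELECT name FROM data WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


def query_safe(conn, user_id):
    # The SQL text is a fixed literal; the value is bound separately.
    query = "SELECT name FROM data WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()


payload = "nobody' OR '1' = '1"
unsafe_rows = query_unsafe(conn, payload)  # the injected OR clause matches every row
safe_rows = query_safe(conn, payload)      # the payload is treated as an ordinary value
```

Both functions take a plain ``str`` for ``user_id`` and pass a plain ``str`` to ``execute``, so their signatures alone do not tell a type checker which one builds its query safely.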
To meet this goal, we introduce the ``LiteralString`` type, which only accepts string values that are known to be made of literals. This is a generalization of the ``Literal["foo"]`` type from :pep:`586`. A string of type ``LiteralString`` cannot contain user-controlled data. Thus, any API that only accepts ``LiteralString`` will be immune to injection vulnerabilities (with `pragmatic limitations <Appendix B: Limitations_>`_).

Since we want the ``sqlite3`` ``execute`` method to disallow strings built with user input, we would make its `typeshed stub <https://github.com/python/typeshed/blob/1c88ceeee924ec6cfe05dd4865776b49fec299e6/stdlib/sqlite3/dbapi2.pyi#L153>`_ accept a ``sql`` query that is of type ``LiteralString``:

::

    from typing import LiteralString

    def execute(self, sql: LiteralString, parameters: Iterable[str] = ...) -> Cursor: ...

This successfully forbids our unsafe SQL example. The variable ``query`` below is inferred to have type ``str``, since it is created from a format string using ``user_id``, and cannot be passed to ``execute``:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        # Error: Expected LiteralString, got str.

The method remains flexible enough to allow our more complicated example:

::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        # This is a literal string.
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """

        if limit:
            # Still has type LiteralString because we added a literal string.
            query += " LIMIT 1"

        conn.execute(query, (user_id,))  # OK

Notice that the user did not have to change their SQL code at all. The type checker was able to infer the literal string type and complain only in case of violations.
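The same idea can be tried on one's own code without waiting for a stub change. The sketch below assumes a hypothetical wrapper named ``run_query`` and a hypothetical table; the ``LiteralString`` import is placed under ``TYPE_CHECKING`` so that the file still runs on interpreters that lack ``typing.LiteralString``, since only the type checker needs it.

```python
from __future__ import annotations

import sqlite3
from typing import TYPE_CHECKING, Any, Iterable

if TYPE_CHECKING:
    # Needed only by the type checker (Python 3.11+).
    from typing import LiteralString


def run_query(
    conn: sqlite3.Connection,
    sql: LiteralString,
    parameters: Iterable[Any] = (),
) -> list[tuple[Any, ...]]:
    """Hypothetical wrapper that only accepts literal-derived SQL text."""
    return conn.execute(sql, tuple(parameters)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, name TEXT)")
conn.execute("INSERT INTO data VALUES ('user123', 'Alice')")

limit = True
query = "SELECT name FROM data WHERE user_id = ?"
if limit:
    query += " LIMIT 1"  # literal + literal, so still literal-derived

rows = run_query(conn, query, ("user123",))

# A checker that implements this PEP would be expected to reject the call
# below, because the f-string interpolates a plain ``str``:
#
#     user_id: str = ...
#     run_query(conn, f"SELECT name FROM data WHERE user_id = {user_id}")
```

At runtime the annotation has no effect; the protection comes entirely from running a type checker over the calling code.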
``LiteralString`` is also useful in other cases where we want strict command-data separation, such as when building shell commands or when rendering a string into an HTML response without escaping (see `Appendix A: Other Uses`_). Overall, this combination of strictness and flexibility makes it easy to enforce safer API usage in sensitive code without burdening users.

Usage statistics
----------------

In a sample of open-source projects using ``sqlite3``, we found that ``conn.execute`` was called `~67% of the time <https://grep.app/search?q=conn%5C.execute%5C%28%5Cs%2A%5B%27%22%5D&regexp=true&filter[lang][0]=Python>`_ with a safe string literal and `~33% of the time <https://grep.app/search?current=3&q=conn%5C.execute%5C%28%5Ba-zA-Z_%5D%2B%5C%29&regexp=true&filter[lang][0]=Python>`_ with a potentially unsafe, local string variable. Using this PEP's literal string type along with a type checker would prevent the unsafe portion of that 33% of cases (i.e., the ones where user-controlled data is incorporated into the query), while seamlessly allowing the safe ones to remain.

Rationale
=========

Firstly, why use *types* to prevent security vulnerabilities?

Warning users in documentation is insufficient - most users either never see these warnings or ignore them. Using an existing dynamic or static analysis approach is too restrictive - these prevent natural idioms, as we saw in the `Motivation`_ section (and will discuss more extensively in the `Rejected Alternatives`_ section). The typing-based approach in this PEP strikes a user-friendly balance between strictness and flexibility.

Runtime approaches do not work because, at runtime, the query string is a plain ``str``. While we could prevent some exploits using heuristics, such as regex-filtering for obviously malicious payloads, there will always be a way to work around them (perfectly distinguishing good and bad queries reduces to the halting problem).
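A short sketch makes the runtime limitation concrete. The blocklist pattern and the helper ``looks_malicious`` are hypothetical, chosen only to show how such a heuristic can be sidestepped.

```python
import re

literal_query = "SELECT * FROM data WHERE user_id = ?"

user_id = "user123 OR 1 = 1"  # stands in for attacker-controlled input
dynamic_query = f"SELECT * FROM data WHERE user_id = {user_id}"

# Once built, both values are ordinary ``str`` objects; nothing at runtime
# records that one came from a literal and the other from user input.
same_runtime_type = type(literal_query) is type(dynamic_query) is str

# Hypothetical heuristic filter: flag statement separators and DROP.
BLOCKLIST = re.compile(r";|\bDROP\b", re.IGNORECASE)


def looks_malicious(query: str) -> bool:
    return BLOCKLIST.search(query) is not None


caught = looks_malicious("SELECT * FROM data WHERE user_id = user123; DROP TABLE data;")
missed = looks_malicious(dynamic_query)  # the "OR 1 = 1" payload matches nothing in the blocklist
```

Extending the pattern to cover ``OR 1 = 1`` only moves the problem, since an attacker can pick another payload that the pattern does not anticipate.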
Static approaches, such as checking the AST to see if the query string is a literal string expression, cannot tell when a string is assigned to an intermediate variable or when it is transformed by a benign function. This makes them overly restrictive.

The type checker, surprisingly, does better than both because it has access to information not available in the runtime or static analysis approaches. Specifically, the type checker can tell us whether an expression has a literal string type, say ``Literal["foo"]``. The type checker already propagates types across variable assignments or function calls.

In the current type system itself, if the SQL or shell command execution function only accepted three possible input strings, our job would be done. We would just say:

::

    def execute(query: Literal["foo", "bar", "baz"]) -> None: ...

But, of course, ``execute`` can accept *any* possible query. How do we ensure that the query does not contain an arbitrary, user-controlled string?

We want to specify that the value must be of some type ``Literal[<...>]`` where ``<...>`` is some string. This is what ``LiteralString`` represents. ``LiteralString`` is the "supertype" of all literal string types. In effect, this PEP just introduces a type in the type hierarchy between ``Literal["foo"]`` and ``str``. Any particular literal string, such as ``Literal["foo"]`` or ``Literal["bar"]``, is compatible with ``LiteralString``, but not the other way around. The "supertype" of ``LiteralString`` itself is ``str``. So, ``LiteralString`` is compatible with ``str``, but not the other way around.

Note that a ``Union`` of literal types is naturally compatible with ``LiteralString`` because each element of the ``Union`` is individually compatible with ``LiteralString``. So, ``Literal["foo", "bar"]`` is compatible with ``LiteralString``.

However, recall that we don't just want to represent exact literal queries.
We also want to support composition of two literal strings, such as ``query + " LIMIT 1"``. This too is possible with the above concept. If ``x`` and ``y`` are two values of type ``LiteralString``, then ``x + y`` will also be of type compatible with ``LiteralString``. We can reason about this by looking at specific instances such as ``Literal["foo"]`` and ``Literal["bar"]``; the value of the added string ``x + y`` can only be ``"foobar"``, which has type ``Literal["foobar"]`` and is thus compatible with ``LiteralString``. The same reasoning applies when ``x`` and ``y`` are unions of literal types; the result of pairwise adding any two literal types from ``x`` and ``y`` respectively is a literal type, which means that the overall result is a ``Union`` of literal types and is thus compatible with ``LiteralString``.

In this way, we are able to leverage Python's concept of a ``Literal`` string type to specify that our API can only accept strings that are known to be constructed from literals. More specific details follow in the remaining sections.

Specification
=============


Runtime Behavior
----------------

We propose adding ``LiteralString`` to ``typing.py``, with an implementation similar to ``typing.NoReturn``.

Note that ``LiteralString`` is a special form used solely for type checking. There is no expression for which ``type(<expr>)`` will produce ``LiteralString`` at runtime. So, we do not specify in the implementation that it is a subclass of ``str``.


Valid Locations for ``LiteralString``
-----------------------------------------

``LiteralString`` can be used where any other type can be used:

::

    variable_annotation: LiteralString

    def my_function(literal_string: LiteralString) -> LiteralString: ...

    class Foo:
        my_attribute: LiteralString

    type_argument: List[LiteralString]

    T = TypeVar("T", bound=LiteralString)

It cannot be nested within unions of ``Literal`` types:

::

    bad_union: Literal["hello", LiteralString]  # Not OK
    bad_nesting: Literal[LiteralString]  # Not OK


Type Inference
--------------

.. _inferring_literal_string:

Inferring ``LiteralString``
'''''''''''''''''''''''''''

Any literal string type is compatible with ``LiteralString``. For example, ``x: LiteralString = "foo"`` is valid because ``"foo"`` is inferred to be of type ``Literal["foo"]``.

As per the `Rationale`_, we also infer ``LiteralString`` in the following cases:

+ Addition: ``x + y`` is of type ``LiteralString`` if both ``x`` and ``y`` are compatible with ``LiteralString``.

+ Joining: ``sep.join(xs)`` is of type ``LiteralString`` if ``sep``'s type is compatible with ``LiteralString`` and ``xs``'s type is compatible with ``Iterable[LiteralString]``.

+ In-place addition: If ``s`` has type ``LiteralString`` and ``x`` has type compatible with ``LiteralString``, then ``s += x`` preserves ``s``'s type as ``LiteralString``.

+ String formatting: An f-string has type ``LiteralString`` if and only if its constituent expressions are literal strings. ``s.format(...)`` has type ``LiteralString`` if and only if ``s`` and the arguments have types compatible with ``LiteralString``.

+ Literal-preserving methods: In `Appendix C <appendix_C_>`_, we have provided an exhaustive list of ``str`` methods that preserve the ``LiteralString`` type.

In all other cases, if one or more of the composed values has a non-literal type ``str``, the composition of types will have type ``str``. For example, if ``s`` has type ``str``, then ``"hello" + s`` has type ``str``. This matches the pre-existing behavior of type checkers.

``LiteralString`` is compatible with the type ``str``. It inherits all methods from ``str``. So, if we have a variable ``s`` of type ``LiteralString``, it is safe to write ``s.startswith("hello")``.
Some type checkers refine the type of a string when doing an equality check:

::

    def foo(s: str) -> None:
        if s == "bar":
            reveal_type(s)  # => Literal["bar"]

Such a refined type in the if-block is also compatible with ``LiteralString`` because its type is ``Literal["bar"]``.


Examples
''''''''

See the examples below to help clarify the above rules:

::

    literal_string: LiteralString
    s: str = literal_string  # OK

    literal_string: LiteralString = s  # Error: Expected LiteralString, got str.
    literal_string: LiteralString = "hello"  # OK

Addition of literal strings:

::

    def expect_literal_string(s: LiteralString) -> None: ...

    expect_literal_string("foo" + "bar")  # OK
    expect_literal_string(literal_string + "bar")  # OK

    literal_string2: LiteralString
    expect_literal_string(literal_string + literal_string2)  # OK

    plain_string: str
    expect_literal_string(literal_string + plain_string)  # Not OK.

Join using literal strings:

::

    expect_literal_string(",".join(["foo", "bar"]))  # OK
    expect_literal_string(literal_string.join(["foo", "bar"]))  # OK
    expect_literal_string(literal_string.join([literal_string, literal_string2]))  # OK

    xs: List[LiteralString]
    expect_literal_string(literal_string.join(xs))  # OK
    expect_literal_string(plain_string.join([literal_string, literal_string2]))
    # Not OK because the separator has type 'str'.

In-place addition using literal strings:

::

    literal_string += "foo"  # OK
    literal_string += literal_string2  # OK
    literal_string += plain_string  # Not OK

Format strings using literal strings:

::

    literal_name: LiteralString
    expect_literal_string(f"hello {literal_name}")
    # OK because it is composed from literal strings.

    expect_literal_string("hello {}".format(literal_name))  # OK

    expect_literal_string(f"hello")  # OK

    username: str
    expect_literal_string(f"hello {username}")
    # NOT OK. The format-string is constructed from 'username',
    # which has type 'str'.

    expect_literal_string("hello {}".format(username))  # Not OK

Other literal types, such as literal integers, are not compatible with ``LiteralString``:

::

    some_int: int
    expect_literal_string(some_int)  # Error: Expected LiteralString, got int.

    literal_one: Literal[1] = 1
    expect_literal_string(literal_one)  # Error: Expected LiteralString, got Literal[1].

We can call functions on literal strings:

::

    def add_limit(query: LiteralString) -> LiteralString:
        return query + " LIMIT 1"

    def my_query(query: LiteralString, user_id: str) -> None:
        sql_connection().execute(add_limit(query), (user_id,))  # OK

Conditional statements and expressions work as expected:

::

    def return_literal_string() -> LiteralString:
        return "foo" if condition1() else "bar"  # OK

    def return_literal_str2(literal_string: LiteralString) -> LiteralString:
        return "foo" if condition1() else literal_string  # OK

    def return_literal_str3() -> LiteralString:
        if condition1():
            result: Literal["foo"] = "foo"
        else:
            result: LiteralString = "bar"

        return result  # OK


Interaction with TypeVars and Generics
''''''''''''''''''''''''''''''''''''''

TypeVars can be bound to ``LiteralString``:

::

    from typing import Literal, LiteralString, TypeVar

    TLiteral = TypeVar("TLiteral", bound=LiteralString)

    def literal_identity(s: TLiteral) -> TLiteral:
        return s

    hello: Literal["hello"] = "hello"
    y = literal_identity(hello)
    reveal_type(y)  # => Literal["hello"]

    s: LiteralString
    y2 = literal_identity(s)
    reveal_type(y2)  # => LiteralString

    s_error: str
    literal_identity(s_error)
    # Error: Expected TLiteral (bound to LiteralString), got str.

``LiteralString`` can be used as a type argument for generic classes:

::

    class Container(Generic[T]):
        def __init__(self, value: T) -> None:
            self.value = value

    literal_string: LiteralString = "hello"
    x: Container[LiteralString] = Container(literal_string)  # OK

    s: str
    x_error: Container[LiteralString] = Container(s)  # Not OK

Standard containers like ``List`` work as expected:

::

    xs: List[LiteralString] = ["foo", "bar", "baz"]


Interactions with Overloads
'''''''''''''''''''''''''''

Literal strings and overloads do not need to interact in a special way: the existing rules work fine. ``LiteralString`` can be used as a fallback overload where a specific ``Literal["foo"]`` type does not match:

::

    @overload
    def foo(x: Literal["foo"]) -> int: ...
    @overload
    def foo(x: LiteralString) -> bool: ...
    @overload
    def foo(x: str) -> str: ...

    x1: int = foo("foo")  # First overload.
    x2: bool = foo("bar")  # Second overload.
    s: str
    x3: str = foo(s)  # Third overload.


Backwards Compatibility
=======================

We propose adding ``typing_extensions.LiteralString`` for use in earlier Python versions.

As :pep:`PEP 586 mentions <586#backwards-compatibility>`, type checkers "should feel free to experiment with more sophisticated inference techniques". So, if the type checker infers a literal string type for an unannotated variable that is initialized with a literal string, the following example should be OK:

::

    x = "hello"
    expect_literal_string(x)
    # OK, because x is inferred to have type 'Literal["hello"]'.

This enables precise type checking of idiomatic SQL query code without annotating the code at all (as seen in the `Motivation`_ section example).

However, like :pep:`586`, this PEP does not mandate the above inference strategy.
In case the type checker doesn't infer ``x`` to have type ``Literal["hello"]``, users can aid the type checker by explicitly annotating it as ``x: LiteralString``:

::

    x: LiteralString = "hello"
    expect_literal_string(x)


Rejected Alternatives
=====================

Why not use tool X?
-------------------

Tools to catch issues such as SQL injection seem to come in three flavors: AST based, function level analysis, and taint flow analysis.

**AST-based tools**: `Bandit <https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_ has a plugin to warn when SQL queries are not literal strings. The problem is that many perfectly safe SQL queries are dynamically built out of string literals, as shown in the `Motivation`_ section. At the AST level, the resultant SQL query is not going to appear as a string literal anymore and is thus indistinguishable from a potentially malicious string. To use these tools would require significantly restricting developers' ability to build SQL queries. ``LiteralString`` can provide similar safety guarantees with fewer restrictions.

**Semgrep and pyanalyze**: Semgrep supports a more sophisticated function level analysis, including `constant propagation <https://semgrep.dev/docs/writing-rules/data-flow/#constant-propagation>`_ within a function. This allows us to prevent injection attacks while permitting some forms of safe dynamic SQL queries within a function. `pyanalyze <https://github.com/quora/pyanalyze/blob/afcb58cd3e967e4e3fea9e57bb18b6b1d9d42ed7/README.md#extending-pyanalyze>`_ has a similar extension. But neither handles function calls that construct and return safe SQL queries. For example, in the code sample below, ``build_insert_query`` is a helper function to create a query that inserts multiple values into the corresponding columns.
Semgrep and pyanalyze forbid this natural usage whereas ``LiteralString`` handles it with no burden on the programmer:

::

    def build_insert_query(
        table: LiteralString,
        insert_columns: Iterable[LiteralString],
    ) -> LiteralString:
        # Materialize the iterable so that it can be counted.
        columns = list(insert_columns)
        sql = "INSERT INTO " + table

        column_clause = ", ".join(columns)
        value_clause = ", ".join(["?"] * len(columns))

        sql += f" ({column_clause}) VALUES ({value_clause})"
        return sql

    def insert_data(
        conn: Connection,
        kvs_to_insert: Dict[LiteralString, str]
    ) -> None:
        query = build_insert_query("data", kvs_to_insert.keys())
        conn.execute(query, tuple(kvs_to_insert.values()))

    # Example usage
    data_to_insert = {
        "column_1": value_1,  # Note: values are not literals
        "column_2": value_2,
        "column_3": value_3,
    }
    insert_data(conn, data_to_insert)

**Taint flow analysis**: Tools such as `Pysa <https://pyre-check.org/docs/pysa-basics/>`_ or `CodeQL <https://codeql.github.com/>`_ are capable of tracking data flowing from a user controlled input into a SQL query. These tools are powerful but involve considerable overhead in setting up the tool in CI, defining "taint" sinks and sources, and teaching developers how to use them. They also usually take longer to run than a type checker (minutes instead of seconds), which means feedback is not immediate. Finally, they move the burden of preventing vulnerabilities on to library users instead of allowing the libraries themselves to specify precisely how their APIs must be called (as is possible with ``LiteralString``).

One final reason to prefer using a new type over a dedicated tool is that type checkers are more widely used than dedicated security tooling; for example, mypy was downloaded `over 7 million times <https://www.pypistats.org/packages/mypy>`_ in Jan 2022 vs `less than 2 million times <https://www.pypistats.org/packages/bandit>`_ for Bandit. Having security protections built right into type checkers will mean that more developers benefit from them.


Why not use a ``NewType`` for ``str``?
--------------------------------------

Any API for which ``LiteralString`` would be suitable could instead be updated to accept a different type created within the Python type system, such as ``NewType("SafeSQL", str)``:

::

    SafeSQL = NewType("SafeSQL", str)

    def execute(self, sql: SafeSQL, parameters: Iterable[str] = ...) -> Cursor: ...

    execute(SafeSQL("SELECT * FROM data WHERE user_id = ?"), user_id)  # OK

    user_query: str
    execute(user_query)  # Error: Expected SafeSQL, got str.

Having to create a new type to call an API might give some developers pause and encourage more caution, but it doesn't guarantee that developers won't just turn a user controlled string into the new type, and pass it into the modified API anyway:

::

    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    execute(SafeSQL(query))  # No error!

We are back to square one with the problem of preventing arbitrary inputs to ``SafeSQL``. This is not a theoretical concern either. Django uses the above approach with ``SafeString`` and `mark_safe <https://docs.djangoproject.com/en/dev/_modules/django/utils/safestring/#SafeString>`_. Issues such as `CVE-2020-13596 <https://github.com/django/django/commit/2dd4d110c159d0c81dff42eaead2c378a0998735>`_ show how this technique can `fail <https://nvd.nist.gov/vuln/detail/CVE-2020-13596>`_.

Also note that this requires invasive changes to the source code (wrapping the query with ``SafeSQL``) whereas ``LiteralString`` requires no such changes. Users can remain oblivious to it as long as they pass in literal strings to sensitive APIs.

Why not try to emulate Trusted Types?
-------------------------------------

`Trusted Types <https://w3c.github.io/webappsec-trusted-types/dist/spec/>`_ is a W3C specification for preventing DOM-based Cross Site Scripting (XSS). XSS occurs when dangerous browser APIs accept raw user-controlled strings. The specification modifies these APIs to accept only the "Trusted Types" returned by designated sanitizing functions.
These sanitizing functions must take in a potentially malicious string and validate it or render it benign somehow, for example by verifying that it is a valid URL or HTML-encoding it.

It can be tempting to assume porting the concept of Trusted Types to Python could solve the problem. The fundamental difference, however, is that the output of a Trusted Types sanitizer is usually intended *to not be executable code*. Thus it's easy to HTML encode the input, strip out dangerous tags, or otherwise render it inert. With a SQL query or shell command, the end result *still needs to be executable code*. There is no way to write a sanitizer that can reliably figure out which parts of an input string are benign and which ones are potentially malicious.

Runtime Checkable ``LiteralString``
-----------------------------------

The ``LiteralString`` concept could be extended beyond static type checking to be a runtime checkable property of ``str`` objects. This would provide some benefits, such as allowing frameworks to raise errors on dynamic strings. Such runtime errors would be a more robust defense mechanism than type errors, which can potentially be suppressed, ignored, or never even seen if the author does not use a type checker.

This extension to the ``LiteralString`` concept would dramatically increase the scope of the proposal by requiring changes to one of the most fundamental types in Python. While runtime taint checking on strings, similar to Perl's `taint <https://metacpan.org/pod/Taint>`_, has been `considered <https://bugs.python.org/issue500698>`_ and `attempted <https://github.com/felixgr/pytaint>`_ in the past, and others may consider it in the future, such extensions are out of scope for this PEP.


Rejected Names
--------------

We considered a variety of names for the literal string type and solicited ideas on `typing-sig <https://mail.python.org/archives/list/typing-sig@python.org/thread/VB74EHNM4RODDFM64NEEEBJQVAUAWIAW/>`_.
Some notable alternatives were:

+ ``Literal[str]``: This is a natural extension of the ``Literal["foo"]`` type name, but typing-sig `objected <https://mail.python.org/archives/list/typing-sig@python.org/message/2ZQO4NTJEI42KTRJDBL77MNANEXOW7UI/>`_ that users could mistake this for the literal type of the ``str`` class.

+ ``LiteralStr``: This is shorter than ``LiteralString`` but looks weird to the PEP authors.

+ ``LiteralDerivedString``: This (along with ``MadeFromLiteralString``) best captures the technical meaning of the type. It represents not just the type of literal expressions, such as ``"foo"``, but also that of expressions composed from literals, such as ``"foo" + "bar"``. However, both names seem wordy.

+ ``StringLiteral``: Users might confuse this with the existing concept of `"string literals" <https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals>`_ where the string exists as a syntactic token in the source code, whereas our concept is more general.

+ ``SafeString``: While this comes close to our intended meaning, it may mislead users into thinking that the string has been sanitized in some way, perhaps by escaping HTML tags or shell-related special characters.

+ ``ConstantStr``: This does not capture the idea of composing literal strings.

+ ``StaticStr``: This suggests that the string is statically computable, i.e., computable without running the program, which is not true. The literal string may vary based on runtime flags, as seen in the `Motivation`_ examples.

+ ``LiteralOnly[str]``: This has the advantage of being extensible to other literal types, such as ``bytes`` or ``int``. However, we did not find the extensibility worth the loss of readability.

Overall, there was no clear winner on typing-sig over a long period, so we decided to tip the scales in favor of ``LiteralString``.


``LiteralBytes``
----------------

We could generalize literal byte types, such as ``Literal[b"foo"]``, to ``LiteralBytes``.
However, literal byte types are used much less frequently than literal string types and we did not find much user demand for ``LiteralBytes``, so we decided not to include it in this PEP. Others may, however, consider it in future PEPs.


Reference Implementation
========================

This is implemented in Pyre v0.9.8 and is actively being used.

The implementation simply extends the type checker with ``LiteralString`` as a supertype of literal string types.

To support composition via addition, join, etc., it was sufficient to overload the stubs for ``str`` in Pyre's copy of typeshed.


Appendix A: Other Uses
======================

To simplify the discussion and require minimal security knowledge, we focused on SQL injections throughout the PEP. ``LiteralString``, however, can also be used to prevent many other kinds of `injection vulnerabilities <https://owasp.org/www-community/Injection_Flaws>`_.

Command Injection
-----------------

APIs such as ``subprocess.run`` accept a string which can be run as a shell command:

::

    subprocess.run(f"echo 'Hello {name}'", shell=True)

If user-controlled data is included in the command string, the code is vulnerable to "command injection"; i.e., an attacker can run malicious commands. For example, a value of ``' && rm -rf / #`` would result in the following destructive command being run:

::

    echo 'Hello ' && rm -rf / #'

This vulnerability could be prevented by updating ``run`` to only accept ``LiteralString`` when used in ``shell=True`` mode. Here is one simplified stub:

::

    def run(command: LiteralString, *args: str, shell: bool=...): ...

Cross Site Scripting (XSS)
--------------------------

Most popular Python web frameworks, such as Django, use a templating engine to produce HTML from user data. These templating languages auto-escape user data before inserting it into the HTML template and thus prevent cross site scripting (XSS) vulnerabilities.
But a common way to `bypass auto-escaping <https://django.readthedocs.io/en/stable/ref/templates/language.html#how-to-turn-it-off>`_ and render HTML as-is is to use functions like ``mark_safe`` in `Django <https://docs.djangoproject.com/en/dev/ref/utils/#django.utils.safestring.mark_safe>`_ or ``do_mark_safe`` in `Jinja2 <https://github.com/pallets/jinja/blob/077b7918a7642ff6742fe48a32e54d7875140894/src/jinja2/filters.py#L1264>`_, which cause XSS vulnerabilities:

::

    dangerous_string = django.utils.safestring.mark_safe(f"<script>{user_input}</script>")
    return(dangerous_string)

This vulnerability could be prevented by updating ``mark_safe`` to only accept ``LiteralString``:

::

    def mark_safe(s: LiteralString) -> str: ...

Server Side Template Injection (SSTI)
-------------------------------------

Templating frameworks, such as Jinja, allow Python expressions which will be evaluated and substituted into the rendered result:

::

    template_str = "There are {{ len(values) }} values: {{ values }}"
    template = jinja2.Template(template_str)
    template.render(values=[1, 2])
    # Result: "There are 2 values: [1, 2]"

If an attacker controls all or part of the template string, they can insert expressions which execute arbitrary code and `compromise <https://www.onsecurity.io/blog/server-side-template-injection-with-jinja2/>`_ the application:

::

    malicious_str = "{{''.__class__.__base__.__subclasses__()[408]('rm - rf /',shell=True)}}"
    template = jinja2.Template(malicious_str)
    template.render()
    # Result: The shell command 'rm - rf /' is run

Template injection exploits like this could be prevented by updating the ``Template`` API to only accept ``LiteralString``:

::

    class Template:
        def __init__(self, source: LiteralString): ...

Logging Format String Injection
-------------------------------

Logging frameworks often allow their input strings to contain formatting directives.
At its worst, allowing users to control the logged string has led to
`CVE-2021-44228 <https://nvd.nist.gov/vuln/detail/CVE-2021-44228>`_
(colloquially known as ``log4shell``), which has been described as the
`"most critical vulnerability of the last decade"
<https://www.theguardian.com/technology/2021/dec/10/software-flaw-most-critical-vulnerability-log-4-shell>`_.
While no Python frameworks are currently known to be vulnerable to a
similar attack, the built-in logging framework does provide formatting
options which are vulnerable to Denial of Service attacks from
externally controlled logging strings. The following example
illustrates a simple denial of service scenario:

::

    external_string = "%(foo)999999999s"
    ...
    # Tries to add > 1GB of whitespace to the logged string:
    logger.info(f'Received: {external_string}', some_dict)

This kind of attack could be prevented by requiring that the format
string passed to the logger be a ``LiteralString`` and that all
externally controlled data be passed separately as arguments (as
proposed in `Issue 46200 <https://bugs.python.org/issue46200>`_):

::

    def info(msg: LiteralString, *args: object) -> None:
        ...


Appendix B: Limitations
=======================

There are a number of ways ``LiteralString`` could still fail to
prevent users from passing strings built from non-literal data to an
API:

1. If the developer does not use a type checker or does not add type
   annotations, then violations will go uncaught.

2. ``cast(LiteralString, non_literal_string)`` could be used to lie to
   the type checker and allow a dynamic string value to masquerade as a
   ``LiteralString``. The same goes for a variable that has type
   ``Any``.

3. Comments such as ``# type: ignore`` could be used to ignore
   warnings about non-literal strings.

4. Trivial functions could be constructed to convert a ``str`` to a
   ``LiteralString``:

::

    def make_literal(s: str) -> LiteralString:
        letters: Dict[str, LiteralString] = {
            "A": "A",
            "B": "B",
            ...
        }
        output: List[LiteralString] = [letters[c] for c in s]
        return "".join(output)

We could mitigate the above using linting, code review, etc., but
ultimately a clever, malicious developer attempting to circumvent the
protections offered by ``LiteralString`` will always succeed. The
important thing to remember is that ``LiteralString`` is not intended
to protect against *malicious* developers; it is meant to protect
against benign developers accidentally using sensitive APIs in a
dangerous way (without getting in their way otherwise).

Without ``LiteralString``, the best enforcement tool API authors have
is documentation, which is easily ignored and often not seen. With
``LiteralString``, API misuse requires conscious thought and artifacts
in the code that reviewers and future developers can notice.

.. _appendix_C:

Appendix C: ``str`` methods that preserve ``LiteralString``
===========================================================

The ``str`` class has several methods that would benefit from
``LiteralString``. For example, users might expect
``"hello".capitalize()`` to have the type ``LiteralString`` similar to
the other examples we have seen in the `Inferring LiteralString
<inferring_literal_string_>`_ section. Inferring the type
``LiteralString`` is correct because the string is not an arbitrary
user-supplied string - we know that it has the type
``Literal["Hello"]``, which is compatible with ``LiteralString``. In
other words, the ``capitalize`` method preserves the ``LiteralString``
type. There are several other ``str`` methods that preserve
``LiteralString``.

We propose updating the stub for ``str`` in typeshed so that the
methods are overloaded with the ``LiteralString``-preserving
versions. This means type checkers do not have to hardcode
``LiteralString`` behavior for each method. It also lets us easily
support new methods in the future by updating the typeshed stub.
For example, to preserve literal types for the ``capitalize`` method,
we would change the stub as below:

::

    # before
    def capitalize(self) -> str: ...

    # after
    @overload
    def capitalize(self: LiteralString) -> LiteralString: ...
    @overload
    def capitalize(self) -> str: ...

The downside of changing the ``str`` stub is that the stub becomes
more complicated and can make error messages harder to
understand. Type checkers may need to special-case ``str`` to make
error messages understandable for users.

Below is an exhaustive list of ``str`` methods which, when called with
arguments of type ``LiteralString``, must be treated as returning a
``LiteralString``. If this PEP is accepted, we will update these method
signatures in typeshed:

::

    @overload
    def capitalize(self: LiteralString) -> LiteralString: ...
    @overload
    def capitalize(self) -> str: ...

    @overload
    def casefold(self: LiteralString) -> LiteralString: ...
    @overload
    def casefold(self) -> str: ...

    @overload
    def center(self: LiteralString, __width: SupportsIndex, __fillchar: LiteralString = ...) -> LiteralString: ...
    @overload
    def center(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...

    if sys.version_info >= (3, 8):
        @overload
        def expandtabs(self: LiteralString, tabsize: SupportsIndex = ...) -> LiteralString: ...
        @overload
        def expandtabs(self, tabsize: SupportsIndex = ...) -> str: ...
    else:
        @overload
        def expandtabs(self: LiteralString, tabsize: int = ...) -> LiteralString: ...
        @overload
        def expandtabs(self, tabsize: int = ...) -> str: ...

    @overload
    def format(self: LiteralString, *args: LiteralString, **kwargs: LiteralString) -> LiteralString: ...
    @overload
    def format(self, *args: str, **kwargs: str) -> str: ...

    @overload
    def join(self: LiteralString, __iterable: Iterable[LiteralString]) -> LiteralString: ...
    @overload
    def join(self, __iterable: Iterable[str]) -> str: ...

    @overload
    def ljust(self: LiteralString, __width: SupportsIndex, __fillchar: LiteralString = ...) -> LiteralString: ...
    @overload
    def ljust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...

    @overload
    def lower(self: LiteralString) -> LiteralString: ...
    @overload
    def lower(self) -> str: ...

    @overload
    def lstrip(self: LiteralString, __chars: LiteralString | None = ...) -> LiteralString: ...
    @overload
    def lstrip(self, __chars: str | None = ...) -> str: ...

    @overload
    def partition(self: LiteralString, __sep: LiteralString) -> tuple[LiteralString, LiteralString, LiteralString]: ...
    @overload
    def partition(self, __sep: str) -> tuple[str, str, str]: ...

    @overload
    def replace(self: LiteralString, __old: LiteralString, __new: LiteralString, __count: SupportsIndex = ...) -> LiteralString: ...
    @overload
    def replace(self, __old: str, __new: str, __count: SupportsIndex = ...) -> str: ...

    if sys.version_info >= (3, 9):
        @overload
        def removeprefix(self: LiteralString, __prefix: LiteralString) -> LiteralString: ...
        @overload
        def removeprefix(self, __prefix: str) -> str: ...

        @overload
        def removesuffix(self: LiteralString, __suffix: LiteralString) -> LiteralString: ...
        @overload
        def removesuffix(self, __suffix: str) -> str: ...

    @overload
    def rjust(self: LiteralString, __width: SupportsIndex, __fillchar: LiteralString = ...) -> LiteralString: ...
    @overload
    def rjust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...

    @overload
    def rpartition(self: LiteralString, __sep: LiteralString) -> tuple[LiteralString, LiteralString, LiteralString]: ...
    @overload
    def rpartition(self, __sep: str) -> tuple[str, str, str]: ...

    @overload
    def rsplit(self: LiteralString, sep: LiteralString | None = ..., maxsplit: SupportsIndex = ...) -> list[LiteralString]: ...
    @overload
    def rsplit(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ...

    @overload
    def rstrip(self: LiteralString, __chars: LiteralString | None = ...) -> LiteralString: ...
    @overload
    def rstrip(self, __chars: str | None = ...) -> str: ...
    @overload
    def split(self: LiteralString, sep: LiteralString | None = ..., maxsplit: SupportsIndex = ...) -> list[LiteralString]: ...
    @overload
    def split(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ...

    @overload
    def splitlines(self: LiteralString, keepends: bool = ...) -> list[LiteralString]: ...
    @overload
    def splitlines(self, keepends: bool = ...) -> list[str]: ...

    @overload
    def strip(self: LiteralString, __chars: LiteralString | None = ...) -> LiteralString: ...
    @overload
    def strip(self, __chars: str | None = ...) -> str: ...

    @overload
    def swapcase(self: LiteralString) -> LiteralString: ...
    @overload
    def swapcase(self) -> str: ...

    @overload
    def title(self: LiteralString) -> LiteralString: ...
    @overload
    def title(self) -> str: ...

    @overload
    def upper(self: LiteralString) -> LiteralString: ...
    @overload
    def upper(self) -> str: ...

    @overload
    def zfill(self: LiteralString, __width: SupportsIndex) -> LiteralString: ...
    @overload
    def zfill(self, __width: SupportsIndex) -> str: ...

    @overload
    def __add__(self: LiteralString, __s: LiteralString) -> LiteralString: ...
    @overload
    def __add__(self, __s: str) -> str: ...

    @overload
    def __iter__(self: LiteralString) -> Iterator[LiteralString]: ...
    @overload
    def __iter__(self) -> Iterator[str]: ...

    @overload
    def __mod__(self: LiteralString, __x: Union[LiteralString, Tuple[LiteralString, ...]]) -> LiteralString: ...
    @overload
    def __mod__(self, __x: Union[str, Tuple[str, ...]]) -> str: ...

    @overload
    def __mul__(self: LiteralString, __n: SupportsIndex) -> LiteralString: ...
    @overload
    def __mul__(self, __n: SupportsIndex) -> str: ...

    @overload
    def __repr__(self: LiteralString) -> LiteralString: ...
    @overload
    def __repr__(self) -> str: ...

    @overload
    def __rmul__(self: LiteralString, n: SupportsIndex) -> LiteralString: ...
    @overload
    def __rmul__(self, n: SupportsIndex) -> str: ...

    @overload
    def __str__(self: LiteralString) -> LiteralString: ...
    @overload
    def __str__(self) -> str: ...
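
As a rough illustration of what these overloads mean for callers, the
sketch below builds a query using only operations from the list above
(``format``, ``__add__``, and ``strip``). The names ``run_query``,
``build_query``, and ``build_unsafe_query`` are hypothetical and not
part of any library. The ``LiteralString`` import is guarded by
``TYPE_CHECKING`` on the assumption that the snippet may be run on a
Python version without ``typing.LiteralString``. The annotations have
no effect at runtime; only a type checker implementing this PEP would
be expected to flag the call shown in the comment.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only by the type checker (in ``typing`` from Python 3.11).
    from typing import LiteralString


def run_query(sql: LiteralString) -> str:
    """Hypothetical sensitive API that accepts only literal-derived strings."""
    return sql


def build_query(limit: bool) -> str:
    # Each step uses a LiteralString-preserving overload, so ``query``
    # should keep the type ``LiteralString`` throughout.
    columns: LiteralString = "name" + ", " + "age"
    query: LiteralString = "SELECT {} FROM data".format(columns)
    query = query + " WHERE user_id = ?"
    if limit:
        query += " LIMIT 1"
    return run_query(query.strip())


def build_unsafe_query(user_id: str) -> str:
    # Formatting with a plain ``str`` argument selects the ``-> str``
    # overload, so a type checker is expected to reject:
    #     run_query("SELECT * FROM data WHERE user_id = {}".format(user_id))
    return "SELECT * FROM data WHERE user_id = {}".format(user_id)
```

At runtime both functions simply return strings; the difference is
visible only to the type checker, which is the point of keeping the
rules in the ``str`` stub rather than in checker-specific logic.
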

Appendix D: Guidelines for using ``LiteralString`` in Stubs
===========================================================

Libraries that do not contain type annotations within their source may
specify type stubs in Typeshed. Libraries written in other languages,
such as those for machine learning, may also provide Python type
stubs. This means the type checker cannot verify that the type
annotations match the source code and must trust the type stub. Thus,
authors of type stubs need to be careful when using ``LiteralString``,
since a function may falsely appear to be safe when it is not.

We recommend the following guidelines for using ``LiteralString`` in
stubs:

+ If the stub is for a pure function, we recommend using
  ``LiteralString`` in the return type of the function or of its
  overloads only if all the corresponding parameters have literal types
  (i.e., ``LiteralString`` or ``Literal["a", "b"]``).

  ::

      # OK
      @overload
      def my_transform(x: LiteralString, y: Literal["a", "b"]) -> LiteralString: ...
      @overload
      def my_transform(x: str, y: str) -> str: ...

      # Not OK
      @overload
      def my_transform(x: LiteralString, y: str) -> LiteralString: ...
      @overload
      def my_transform(x: str, y: str) -> str: ...

+ If the stub is for a ``staticmethod``, we recommend the same
  guideline as above.

+ If the stub is for any other kind of method, we recommend against
  using ``LiteralString`` in the return type of the method or any of
  its overloads. This is because, even if all the explicit parameters
  have type ``LiteralString``, the object itself may be created using
  user data and thus the return type may be user-controlled.

+ If the stub is for a class attribute or global variable, we also
  recommend against using ``LiteralString`` because the untyped code
  may write arbitrary values to the attribute.

However, we leave the final call to the library author.
They may use ``LiteralString`` if they feel confident that the string
returned by the method or function or the string stored in the
attribute is guaranteed to have a literal type - i.e., the string is
created by applying only literal-preserving ``str`` operations to a
string literal.

Note that these guidelines do not apply to inline type annotations
since the type checker can verify that, say, a method returning
``LiteralString`` does in fact return an expression of that type.


Resources
=========

Literal String Types in Scala
-----------------------------

Scala `uses
<https://www.scala-lang.org/api/2.13.x/scala/Singleton.html>`_
``Singleton`` as the supertype for singleton types, which include
literal string types, such as ``"foo"``. ``Singleton`` is Scala's
generalized analogue of this PEP's ``LiteralString``.

Tamer Abdulradi showed how Scala's literal string types can be used
for "Preventing SQL injection at compile time" in the Scala Days talk
`Literal types: What are they good for?
<https://slideslive.com/38907881/literal-types-what-they-are-good-for>`_
(slides 52 to 68).

Thanks
------

Thanks to the following people for their feedback on the PEP:

Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев,
CAM Gerlach, Arie Bovenberg, David Foster, and Shengye Wan.


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: