[HN Gopher] Python strings are immutable, but only sometimes
___________________________________________________________________
 
Python strings are immutable, but only sometimes
 
Author : jermaustin1
Score  : 85 points
Date   : 2021-02-15 18:48 UTC (4 hours ago)
 
web link (web.eecs.utk.edu)
w3m dump (web.eecs.utk.edu)
 
| Animats wrote:
| Python string semantics are very strange because of the way
| Unicode was implemented. The internal representation can be one,
| two, or four
| bytes per character, depending on the widest character. Plus I've
| heard that there's sometimes a UTF-8 representation in there,
| too.
| 
| Python strings have to be indexable by integers. But, in fact,
| most indexing is iteration. So a UTF-8 representation could work.
| You can index forwards and backwards through UTF-8 with opaque
| cursors. The search, match, and regular expression functions can
| all work on UTF-8. Returned indexes would be opaque cursors.
| Addition and subtraction of small integers to a cursor can be
| handled by indexing forwards and backwards through UTF-8. If a
| program does random access indexing, or uses an opaque
| cursor in a context where it really has to be converted to an
| integer, the implementation has to build an index of the entire
| string. Index building costs O(n), but should be rare.
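| 
| A rough sketch of what such byte-offset cursors could look like
| (next_cursor/prev_cursor are made-up helper names, not anything
| CPython actually provides):
| 
|     def next_cursor(buf, cur):
|         # skip the lead byte, then any 0b10xxxxxx continuation bytes
|         cur += 1
|         while cur < len(buf) and (buf[cur] & 0xC0) == 0x80:
|             cur += 1
|         return cur
| 
|     def prev_cursor(buf, cur):
|         # back up until we land on a byte that is not a continuation byte
|         cur -= 1
|         while cur > 0 and (buf[cur] & 0xC0) == 0x80:
|             cur -= 1
|         return cur
| 
|     buf = "naïve".encode("utf-8")
|     cur, chars = 0, []
|     while cur < len(buf):
|         nxt = next_cursor(buf, cur)
|         chars.append(buf[cur:nxt].decode("utf-8"))
|         cur = nxt
|     print(chars)  # ['n', 'a', 'ï', 'v', 'e']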
 
| js2 wrote:
| > Does this really count as mutability though? Not really. The
| string would be thrown away anyway so this is just an
| optimization to reuse the memory. The important takeaway here is
| that you are not allocating a new string every single time like
| the internet says.
| 
| This is an undocumented optimization, so you should assume you're
| allocating a new string every single time, like the internet says.
| 
| I've been coding Python since the 1.5.2 days, so I'll continue to
| use lists as intermediate string builders and then join
| afterwards, because I know this works in past, present, and
| likely future versions of Python.
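| 
| For reference, a minimal version of that pattern (the data here
| is just a stand-in):
| 
|     records = [1, 2.5, "three"]
| 
|     parts = []
|     for record in records:
|         parts.append(str(record))
|     result = ",".join(parts)  # one allocation for the final string
| 
|     print(result)  # 1,2.5,three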
 
  | orf wrote:
  | Some interesting results. Joining on a generator is slower than
  | joining on a list, because the list has a known size beforehand,
  | which lets Python perform some optimizations, even though it's
  | more memory intensive. Appending to a string is surprisingly
  | fast. Building an intermediate list is not.
  | 
  | While I do get your point, Python likes making optimizations.
  | They rarely, if ever, make patterns slower by choice. There's
  | nothing in any spec that says it _has_ to be slow: that's as
  | much an implementation detail as "joining lists is fast".
  | In [18]: sys.version
  | Out[18]: '3.9.1 (default, Feb  3 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
  | 
  | In [6]: strings = ["".join(random.choice(string.hexdigits) for _ in range(10)) for _ in range(10)]
  | 
  | In [9]: def one():
  |    ...:     "".join(s for s in strings)
  |    ...:
  | 
  | In [10]: def two():
  |     ...:     "".join([s for s in strings])
  |     ...:
  | 
  | In [11]: def three():
  |     ...:     x = []
  |     ...:     for s in strings:
  |     ...:         x.append(s)
  |     ...:     "".join(x)
  |     ...:
  | 
  | In [12]: def four():
  |     ...:     x = ""
  |     ...:     for s in strings:
  |     ...:         x += s
  |     ...:
  | 
  | In [13]: %timeit one()
  | 753 ns +- 9 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [14]: %timeit two()
  | 521 ns +- 5.63 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [15]: %timeit three()
  | 696 ns +- 3.94 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [16]: %timeit four()
  | 620 ns +- 4.82 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
 
  | Dylan16807 wrote:
  | Does python document anything about performance?
  | 
  | I assume my array sort won't be n^3, but how do we draw the
  | line on what to assume and what not to assume?
 
  | nine_k wrote:
  | This does play a role if for some reason you use object IDs,
  | e.g. when deserializing data with cross-links. Keeping a real
  | second reference becomes important.
 
    | wizzwizz4 wrote:
    | You should never use object IDs that way; that's what weak
    | references / dictionaries are for. (See pickle for details.)
 
    | heavenlyblue wrote:
    | Python docs explicitly say that object IDs are only unique
    | for objects with overlapping lifetimes.
    | 
    | Congrats, by trying to be smart you just shot yourself in the
    | foot.
 
    | [deleted]
 
  | lostcolony wrote:
  | Yeah; I'm pretty sure this isn't anything you can take
  | advantage of. But I think the point of the post is an "isn't
  | this neat?" rather than "because of an optimization one
  | approach isn't as bad as it could be and you should take
  | advantage of that".
 
    | masklinn wrote:
    | > Yeah; I'm pretty sure this isn't anything you can take
    | advantage of.
    | 
    | You probably _can_ but _should not_ rely on it:
    | 
    | * it was a very official part of the Python 2.4 release notes
    | 
    | * it's unlikely the devs would remove it as they know how
    | implementation details tend to leak into the software (see:
    | dict ordering)
    | 
    | * but it _is_ an optimisation so there's no guarantee, a
    | simple trace function could break it
    | 
    | * and obviously there's no guarantee (and likely no way)
    | other implementations can implement it
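    | 
    | For anyone who wants to poke at it, a quick demonstration
    | (CPython only; the first result is typical, not guaranteed):
    | 
    |     n = 1000
    |     s = "a" * n             # built at run time so nothing else references it
    |     before = id(s)
    |     s += "b"
    |     print(id(s) == before)  # often True on CPython, but nothing promises it
    | 
    |     t = "a" * n
    |     alias = t               # a second reference defeats the fast path
    |     before = id(t)
    |     t += "b"
    |     print(id(t) == before)  # False: a new string had to be allocated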
 
      | lostcolony wrote:
      | Right, but my thought is "at what times would it make sense
      | to join, and not append, but this optimization would be
      | faster than joining"? I'm guessing not very many.
      | 
      | Because, yeah, the fact that str += "?" is faster isn't a
      | big deal; that's the natural, ergonomic way to deal with
      | it, because you're appending all of one string. Likewise
      | foo + " " + bar + "?" is probably easier to write that way,
      | than to drop them into a list and join (but even if not,
      | I'd be curious if it's actually any faster; this article
      | doesn't measure). By the time you get to joining large
      | amounts together, concatenating a CSV or something, you're
      | going to use join, naturally, and join is going to be more
      | performant.
      | 
      | So to my mind it's kind of an open question, at what points
      | is the non-ergonomic thing going to be the faster thing?
      | That's what "taking advantage of (an optimization)" feels
      | like; otherwise you're just writing code and letting the
      | performance fall where it may.
 
        | masklinn wrote:
        | > Right, but my thought is "at what times would it make
        | sense to join, and not append, but this optimization
        | would be faster than joining"? I'm guessing not very
        | many.
        | 
        | I'm guessing all of them. Because what it does is what
        | join will do internally anyway, but without the overhead
        | of the list itself, and with operations which have lower
        | overhead than the list's.
 
        | lostcolony wrote:
        | "I'm guessing"
        | 
        | Yes, you are. I'm saying this is a more interesting
        | question, and one that isn't answered. Because the
        | reality is that if you're appending just a handful of
        | things together, the idiomatic approach is to concatenate
        | them, and this optimization comes into play. If you're
        | concatenating a lot of items together...you probably
        | already have a list, and the idiomatic approach to join
        | them is probably faster than a for loop to concat them
        | over and over. So the question, then, is: is there a point
        | where idiomatically it makes more sense to join things
        | (putting them in a list if they aren't already, but they
        | might be; depends on the example), but this will be more
        | efficient?
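        | 
        | One way to actually poke at that question (the data and
        | sizes here are arbitrary stand-ins, and the answer will
        | vary by workload):
        | 
        |     import timeit
        | 
        |     parts = ["x" * 10] * 10_000
        | 
        |     def concat(ps):
        |         out = ""
        |         for p in ps:
        |             out += p   # candidate for CPython's in-place resize
        |         return out
        | 
        |     print(timeit.timeit(lambda: concat(parts), number=100))
        |     print(timeit.timeit(lambda: "".join(parts), number=100))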
 
      | dmurray wrote:
      | This one is less painful for the devs to break, if they
      | need to, because it only works 99% of the time. If you rely
      | on this behaviour for correctness, rather than performance,
      | it will set you right pretty quickly.
      | 
      | If you do want to take advantage of this, maybe you can go
      | a level deeper: dig into Python's behaviour when allocating
      | memory for strings, and figure out in what circumstances
      | you can be guaranteed to get the same id back. E.g. maybe
      | if you create a 28-character string, there will always be
      | room to append 4 more.
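      | 
      | Something like this could be a starting point for that kind
      | of digging (whether the id survives depends on the allocator
      | and the CPython version, so treat the output as a curiosity,
      | not a contract):
      | 
      |     import sys
      | 
      |     for n in (10, 28, 40, 100):
      |         s = "x" * n
      |         size, addr = sys.getsizeof(s), id(s)
      |         s += "yyyy"
      |         print(n, size, sys.getsizeof(s), id(s) == addr)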
 
      | dmw_ng wrote:
      | The optimization got moved to unicode objects during the
      | great Python 3 upheaval and AFAIK still remains
      | unimplemented for what are now bytes objects. The paramiko
      | SSH library relied on it heavily; as a result, its runtime
      | is (AFAIK still) horrific on Python 3.
 
| [deleted]
 
| pwdisswordfish6 wrote:
| This behaviour (id not changing after concatenation) is
| consistent with the original str object being deallocated and a
| new one allocated in its place with an identical id. It's not
| what actually happens behind the scenes (in CPython), but it's
| still an implementation detail that this demonstration does not
| even unambiguously expose.
 
  | nine_k wrote:
  | CPython could do both; allocations are usually not byte-
  | perfect, and the small remaining room at the end of a string can
  | be used as an optimization, or a new buffer can be allocated if
  | there's no more room.
  | 
  | Also, cutting long strings with .strip() can use the same
  | optimization. Allocation isn't fast.
  | 
  | IDK if it's used, but it's entirely plausible to implement.
 
    | masklinn wrote:
    | > CPython could do both; allocations are usually not byte-
    | perfect
    | 
    | Though it's probably not the case for strings, it should be
    | noted that for most objects allocations _are_ byte-perfect,
    | because the average Python object is a bunch of pointers and
    | a refcount: two gc pointers, a refcount, a weakref
    | pointer, a class pointer and a dict pointer, for a total of
    | 48 bytes (as of 3.9, ignoring all the pointees).
    | 
    | That's not the case for `__slots__` though: they drop the
    | instance dict and the weakref pointer, but then each field
    | will add a pointer to the total (then again that's space
    | which won't be allocated as part of the instance dict).
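    | 
    | You can eyeball the difference with sys.getsizeof (exact
    | numbers depend on the CPython version and build):
    | 
    |     import sys
    | 
    |     class Plain:
    |         pass
    | 
    |     class Slotted:
    |         __slots__ = ("x", "y")
    | 
    |     p, s = Plain(), Slotted()
    |     print(sys.getsizeof(p))           # the bare instance (plus GC header)
    |     print(sys.getsizeof(p.__dict__))  # the per-instance dict is a separate allocation
    |     print(sys.getsizeof(s))           # slot values live inline; no __dict__ at all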
 
| nine_k wrote:
| If a tree falls in a forest, and nobody observes it, does it make
| a sound? The tree can choose, because there's nobody to notice
| the difference.
| 
| Mutable state is not bad; _shared_ mutable state is bad -- or, at
| least, takes a lot of care to handle. So CPython mutates strings
| which are not observed by anyone else but the mutating code.
| 
| This is exactly the idea behind uniqueness types [1] and move
| semantics in C++ and Rust: mutation is safe as long as it's not
| shared.
| 
| [1]: https://en.wikipedia.org/wiki/Uniqueness_type
 
| ymbeld wrote:
| The title seems like false advertising. It seems that strings
| in Python are indeed immutable. How it implements immutability is
| up to it (the implementation), and making a brand new allocation
| for each operation is not the only way of adhering to the
| contract.
| 
| Get back to me if you can make several references to the same
| string and make them step on each others' toes by only assigning
| a new value to one of them.
 
| teddyh wrote:
| > _the builtin id() function, which is just the memory address._
| 
| This, crucially, is _not_ guaranteed. Instead, id() is only
| guaranteed to give a unique integer value for every object _valid
| for the lifetime of that object_. So if Python deallocated your
| first string object and allocated a new string object, Python
| could give the same id() for the new string object and this would
| be fine; this does _not_ necessarily mean that the two string
| objects are the "same" object! In fact, since the objects do not
| share lifetimes, they _cannot_ be said to be the same object!
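| 
| A concrete way to see this (whether the address actually gets
| reused depends on the allocator, so the result isn't
| deterministic):
| 
|     n = 100
|     a = "spam" * n
|     addr = id(a)
|     del a                 # the first object's lifetime ends here
|     b = "eggs" * n
|     print(id(b) == addr)  # can be True: the freed slot may be reused,
|                           # even though a and b were never the same object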
 
| postit wrote:
| I just spent two days chasing a bug because someone shoved a
| variable inside a dataclass in the wrong place; nothing is
| immutable in Python.
 
  | iainmerrick wrote:
  | No, strings are immutable, despite the semantic gymnastics this
  | article is performing.
 
    | masklinn wrote:
    | Strings are immutable on the Python side. On the C side,
    | however, PyUnicode_Resize is part of the PEP 384 Stable ABI
    | and very specifically tries to resize the string in-place
    | (it's normally intended for string-building).
 
      | ymbeld wrote:
      | If it is immutable from the client's perspective, it's
      | immutable. Anything else is ill-defined.
 
        | masklinn wrote:
        | The C api is "from a client's perspective". Any rando can
        | build native modules with it.
 
| cs702 wrote:
| My favorite Python WTF "feature" is that integers can have the
| same reference, but only sometimes:
| 
|     >>> a = 256
|     >>> b = 256
|     >>> a is b
|     True
|     >>> a = 257
|     >>> b = 257
|     >>> a is b
|     False
|     >>> a = 257; b = 257
|     >>> a is b
|     True
| 
| Sometimes I think of Python as the _Nash Equilibrium_ [a] of
| programming languages:
| 
| It's never the absolute best language for anything, but it's hard
| to improve it on any front (e.g., execution speed) without
| hindering it on other fronts (e.g., ad-hoc interactivity), and
| it's never the absolute worst language for anything. Often
| enough, it's good enough.
| 
| Python might well be "the least-worst language for everything."
| 
| --
| 
| [a] https://en.wikipedia.org/wiki/Nash_equilibrium
 
  | patrickthebold wrote:
  | Java basically has the same thing:
  | 
  | Quick Google gives: https://stackoverflow.com/a/1515811
 
  | justinnhli wrote:
  | > Sometimes I think of Python as the Nash Equilibrium of
  | programming languages
  | 
  | FYI: What you're describing is not a Nash equilibrium, but a
  | Pareto optimal point [1]. They are similar in that you couldn't
  | do any better, but Nash equilibria are about whether this would
  | cause other players to change their strategies, while
  | Pareto optimality is only about trading off different
  | features/dimensions.
  | 
  | [1]: https://en.wikipedia.org/wiki/Pareto_efficiency
 
  | th0ma5 wrote:
  | This is outside of the spec... "is" is for testing the exact
  | same reference and it is only coincidence that to speed things
  | up they made smaller integers the same objects in memory. See:
  |     >>> a = 257
  |     >>> b = a
  |     >>> a is b
  |     True
  | 
  | What you want is double equals.
 
    | [deleted]
 
    | oivey wrote:
    | He names references in his post. I highly doubt he's confused
    | about the difference between is and ==. It's a weird leak of
    | interpreter details that could, in very narrow situations,
    | cause a bug.
 
  | coldtea wrote:
  | > _My favorite Python WTF "feature" is that integers can have
  | have the same reference, but only sometimes_
  | 
  | Many languages do the same, ditto for strings (as in TFA).
 
  | Delk wrote:
  | As other commenters have pointed out, this is an
  | implementation-specific optimization rather than a
  | property of Python as a language.
  | 
  | It is, at a first glance, a bit weird. But the way you should
  | look at it is that Python the language doesn't say the two
  | integers have the same identity, and you shouldn't assume they
  | will. But it also doesn't say they _can 't_ be the same object.
  | Since Python integers are immutable, and thus having the two
  | variables actually reference the same object can't create side
  | effects unless you're directly playing with identities and
  | making assumptions in your code that you shouldn't make, the
  | implementation can have the two variables reference the same
  | object as an optimization without breaking anything.
 
  | patrec wrote:
  | > It's never the absolute best language for anything, but it's
  | hard to improve it on any front (e.g., execution speed) without
  | hindering it on other fronts (e.g., ad-hoc interactivity),
  | 
  | This belief seems common, but I always wonder if anyone with
  | familiarity with dynamic programming languages that were
  | implemented by people who knew what they were doing (as
  | implementers) thinks so. Self, Smalltalk and Common Lisp, for
  | example, are doing _much_ better on the ad-hoc interactivity
  | front in non-trivial ways whilst offering implementations with
  | vastly better performance preceding (C)Python by many years.
  | The fact that python has terrible execution speed is mostly due
  | to a lack of relevant skills in the community, not some conscious
  | engineering trade-off.
  | 
  | Having said that, I don't think you are wrong on python being
  | "the least worst language for everything" -- very few other
  | languages have an ecosystem of remotely comparable
  | expansiveness and quality (the top minds in several disciplines
  | mostly use python for their work), which alone kills off huge
  | swathes of would-be competitors.
 
  | orf wrote:
  | `is` is for identity whereas `==` is for equality. You rarely
  | want `is` unless you're asking if two references _are the same
  | object_. This is almost exclusively used for `x is False/True`,
  | but sometimes used to denote "missing" arguments (where
  | true/false may be valid):
  | 
  |     missing = object()
  | 
  |     def func(a=missing):
  |         if a is missing:
  |             raise ValueError('you must pass a')
  | 
  | This "numbers less than 256 are the same objects" thing is a
  | fairly common entry on the list of "wtf python" but I've never
  | understood it. You don't use `is` like that, and you would never
  | use it in code, because the operator you're using is not the
  | right one for the thing you're trying to do.
  | 
  | Plus if this is the biggest wtf then that's pretty good going.
 
    | cs702 wrote:
    | Yes, of course. It's not the "biggest" WTF feature, but it's
    | my favorite. There's a long list of WTF features here:
    | 
    | https://github.com/satwikkansal/wtfpython
    | 
    | Otherwise, I agree, it's pretty good going :-)
 
| nneonneo wrote:
| I mean, who says Python strings are immutable?
| 
|     s = "asd"
|     import ctypes
|     ctypes.memmove(id(s) + ord('1'), b'X', 1)
|     print(s)  # prints aXd
| 
| (Python 3.3+, 64-bit; this is a joke so please do not use this in
| anything real)
 
  | yccs27 wrote:
  | Nice!
  | 
  | Can you please explain to a rookie what the ord('1') is for? Is
  | it just a way of writing a magic number of value 49, or is
  | there some significance to it?
 
    | panda88888 wrote:
    | The CPython implementation of the string stores metadata to save
    | memory. The ord('1') simply jumps to the correct memory
    | offset in the string structure. I would think it's easier to
    | remember ord('0') as the offset of the string byte array in
    | the structure than the magic number 48?
 
    | nneonneo wrote:
    | It's mostly just a mildly obfuscated way to write 49. It also
    | happens to line up nicely with the index of the character
    | being overwritten (index 1); ord('0') through ord('9') work
    | likewise.
    | 
    | ----
    | 
    | I'm personally kind of shocked at how big strings are - the
    | minimum size is 6 native words, corresponding to the
    | following fields: refcount, type pointer, string length,
    | string hash (precomputed), flags, and a wstr pointer which
    | just seems to be NULL for most strings. It seems like they
    | could have at least merged the string length and flags
    | together for short strings - a possible subject for a future
    | PEP...
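    | 
    | Same caveat as the ctypes trick above (CPython-specific
    | poking, not an API), but you can see those six words laid out
    | by dumping the object's raw bytes:
    | 
    |     import ctypes, sys
    | 
    |     s = "asd"
    |     size = sys.getsizeof(s)  # header + payload + trailing NUL for a compact ASCII str
    |     raw = ctypes.string_at(id(s), size)
    |     print(size, raw[48:])    # e.g. 52 b'asd\x00' on 64-bit CPython 3.9: data starts at offset 48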
 
    | rasjani wrote:
    | It feels like it's just a "magic" number. Probably id() is the
    | actual raw pointer to the str object, and the string data
    | starts at offset 48?
 
___________________________________________________________________
(page generated 2021-02-15 23:00 UTC)