[HN Gopher] Python strings are immutable, but only sometimes
___________________________________________________________________
 
Python strings are immutable, but only sometimes
 
Author : jermaustin1
Score  : 85 points
Date   : 2021-02-15 18:48 UTC (4 hours ago)
 
web link (web.eecs.utk.edu)
w3m dump (web.eecs.utk.edu)
 
| Animats wrote:
| Python string semantics are very strange because of the way
| Unicode was implemented. The internal representation can be one,
| two, or four
| bytes per character, depending on the widest character. Plus I've
| heard that there's sometimes a UTF-8 representation in there,
| too.
| 
| Python strings have to be indexable by integers. But, in fact,
| most indexing is iteration. So a UTF-8 representation could work.
| You can index forwards and backwards through UTF-8 with opaque
| cursors. The search, match, and regular expression functions can
| all work on UTF-8. Returned indexes would be opaque cursors.
| Addition and subtraction of small integers to a cursor can be
| handled by indexing forwards and backwards through UTF-8. If a
| program does random access indexing, or uses an opaque
| cursor in a context where it really has to be converted to an
| integer, the implementation has to build an index of the entire
| string. Index building costs O(n), but should be rare.
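| 
| A rough sketch of what such byte-offset cursors could look like
| (next_cursor/prev_cursor are made-up helper names, not anything
| CPython actually provides):
| 
|     def next_cursor(buf, cur):
|         # skip the lead byte, then any 0b10xxxxxx continuation bytes
|         cur += 1
|         while cur < len(buf) and (buf[cur] & 0xC0) == 0x80:
|             cur += 1
|         return cur
| 
|     def prev_cursor(buf, cur):
|         # back up until we land on a byte that is not a continuation byte
|         cur -= 1
|         while cur > 0 and (buf[cur] & 0xC0) == 0x80:
|             cur -= 1
|         return cur
| 
|     buf = "naïve".encode("utf-8")
|     cur, chars = 0, []
|     while cur < len(buf):
|         nxt = next_cursor(buf, cur)
|         chars.append(buf[cur:nxt].decode("utf-8"))
|         cur = nxt
|     print(chars)  # ['n', 'a', 'ï', 'v', 'e']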
 
| js2 wrote:
| > Does this really count as mutability though? Not really. The
| string would be thrown away anyway so this is just an
| optimization to reuse the memory. The important takeaway here is
| that you are not allocating a new string every single time like
| the internet says.
| 
| This is an undocumented optimization, so you should assume you're
| allocating a new string every single time, like the internet says.
| 
| I've been coding Python since the 1.5.2 days, so I'll continue to
| use lists as intermediate string builders and then join
| afterwards, because I know this works in past, present, and
| likely future versions of Python.
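| 
| For reference, a minimal version of that pattern (the data here
| is just a stand-in):
| 
|     records = [1, 2.5, "three"]
| 
|     parts = []
|     for record in records:
|         parts.append(str(record))
|     result = ",".join(parts)  # one allocation for the final string
| 
|     print(result)  # 1,2.5,three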
 
  | orf wrote:
  | Some interesting results. Joining on a generator is slower than
  | joining on a list, because the list has a known size beforehand,
  | which lets Python perform some optimizations, even though it's
  | more memory intensive. Appending to a string is surprisingly
  | fast. Building an intermediate list is not.
  | 
  | While I do get your point, Python likes making optimizations.
  | They rarely, if ever, make patterns slower by choice. There's
  | nothing in any spec that says it _has_ to be slow: that's as
  | much an implementation detail as "joining lists is fast".
  | In [18]: sys.version
  | Out[18]: '3.9.1 (default, Feb  3 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
  | 
  | In [6]: strings = ["".join(random.choice(string.hexdigits) for _ in range(10)) for _ in range(10)]
  | 
  | In [9]: def one():
  |    ...:     "".join(s for s in strings)
  |    ...:
  | 
  | In [10]: def two():
  |     ...:     "".join([s for s in strings])
  |     ...:
  | 
  | In [11]: def three():
  |     ...:     x = []
  |     ...:     for s in strings:
  |     ...:         x.append(s)
  |     ...:     "".join(x)
  |     ...:
  | 
  | In [12]: def four():
  |     ...:     x = ""
  |     ...:     for s in strings:
  |     ...:         x += s
  |     ...:
  | 
  | In [13]: %timeit one()
  | 753 ns +- 9 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [14]: %timeit two()
  | 521 ns +- 5.63 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [15]: %timeit three()
  | 696 ns +- 3.94 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
  | 
  | In [16]: %timeit four()
  | 620 ns +- 4.82 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
 
  | Dylan16807 wrote:
  | Does python document anything about performance?
  | 
  | I assume my array sort won't be n^3, but how do we draw the
  | line on what to assume and what not to assume?
 
  | nine_k wrote:
  | This does play a role if for some reason you use object IDs,
  | e.g. when deserializing data with cross-links. Keeping a real
  | second reference becomes important.
 
    | wizzwizz4 wrote:
    | You should never use object IDs that way; that's what weak
    | references / dictionaries are for. (See pickle for details.)
 
    | heavenlyblue wrote:
    | Python docs explicitly say that object IDs are only unique
    | for objects with overlapping lifetimes.
    | 
    | Congrats, by trying to be smart you just shot yourself in the
    | foot.
 
    | [deleted]
 
  | lostcolony wrote:
  | Yeah; I'm pretty sure this isn't anything you can take
  | advantage of. But I think the point of the post is an "isn't
  | this neat?" rather than "because of an optimization one
  | approach isn't as bad as it could be and you should take
  | advantage of that".
 
    | masklinn wrote:
    | > Yeah; I'm pretty sure this isn't anything you can take
    | advantage of.
    | 
    | You probably _can_ but _should not_ rely on it:
    | 
    | * it was a very official part of the Python 2.4 release notes
    | 
    | * it's unlikely the devs would remove it as they know how
    | implementation details tend to leak into the software (see:
    | dict ordering)
    | 
    | * but it _is_ an optimisation so there's no guarantee, a
    | simple trace function could break it
    | 
    | * and obviously there's no guarantee (and likely no way)
    | other implementations can implement it
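    | 
    | For anyone who wants to poke at it, a quick demonstration
    | (CPython only; the first result is typical, not guaranteed):
    | 
    |     n = 1000
    |     s = "a" * n             # built at run time so nothing else references it
    |     before = id(s)
    |     s += "b"
    |     print(id(s) == before)  # often True on CPython, but nothing promises it
    | 
    |     t = "a" * n
    |     alias = t               # a second reference defeats the fast path
    |     before = id(t)
    |     t += "b"
    |     print(id(t) == before)  # False: a new string had to be allocated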
 
      | lostcolony wrote:
      | Right, but my thought is "at what times would it make sense
      | to join, and not append, but this optimization would be
      | faster than joining"? I'm guessing not very many.
      | 
      | Because, yeah, the fact that str += "?" is faster isn't a
      | big deal; that's the natural, ergonomic way to deal with
      | it, because you're appending all of one string. Likewise
      | foo + " " + bar + "?" is probably easier to write that way,
      | than to drop them into a list and join (but even if not,
      | I'd be curious if it's actually any faster; this article
      | doesn't measure). By the time you get to joining large
      | amounts together, concatenating a CSV or something, you're
      | going to use join, naturally, and join is going to be more
      | performant.
      | 
      | So to my mind it's kind of an open question, at what points
      | is the non-ergonomic thing going to be the faster thing?
      | That's what "taking advantage of (an optimization)" feels
      | like; otherwise you're just writing code and letting the
      | performance fall where it may.
 
        | masklinn wrote:
        | > Right, but my thought is "at what times would it make
        | sense to join, and not append, but this optimization
        | would be faster than joining"? I'm guessing not very
        | many.
        | 
        | I'm guessing all of them. Because what it does is what
        | join will do internally anyway, but without the overhead
        | of the list itself, and with operations which have lower
        | overhead than the list's.
 
        | lostcolony wrote:
        | "I'm guessing"
        | 
        | Yes, you are. I'm saying this is a more interesting
        | question, and one that isn't answered. Because the
        | reality is that if you're appending just a handful of
        | things together, the idiomatic approach is to concatenate
        | them, and this optimization comes into play. If you're
        | concatenating a lot of items together...you probably
        | already have a list, and the idiomatic approach to join
        | them is probably faster than a for loop to concat them
        | over and over. So the question, then, is: is there a point
        | where idiomatically it makes more sense to join things
        | (putting them in a list if they aren't already, but they
        | might be; depends on the example), but this will be more
        | efficient?
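        | 
        | One way to actually poke at that question (the data and
        | sizes here are arbitrary stand-ins, and the answer will
        | vary by workload):
        | 
        |     import timeit
        | 
        |     parts = ["x" * 10] * 10_000
        | 
        |     def concat(ps):
        |         out = ""
        |         for p in ps:
        |             out += p   # candidate for CPython's in-place resize
        |         return out
        | 
        |     print(timeit.timeit(lambda: concat(parts), number=100))
        |     print(timeit.timeit(lambda: "".join(parts), number=100))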
 
      | dmurray wrote:
      | This one is less painful for the devs to break, if they
      | need to, because it only works 99% of the time. If you rely
      | on this behaviour for correctness, rather than performance,
      | it will set you right pretty quickly.
      | 
      | If you do want to take advantage of this, maybe you can go
      | a level deeper: dig into Python's behaviour when allocating
      | memory for strings, and figure out in what circumstances
      | you can be guaranteed to get the same id back. E.g. maybe
      | if you create a 28-character string, there will always be
      | room to append 4 more.
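      | 
      | Something like this could be a starting point for that kind
      | of digging (whether the id survives depends on the allocator
      | and the CPython version, so treat the output as a curiosity,
      | not a contract):
      | 
      |     import sys
      | 
      |     for n in (10, 28, 40, 100):
      |         s = "x" * n
      |         size, addr = sys.getsizeof(s), id(s)
      |         s += "yyyy"
      |         print(n, size, sys.getsizeof(s), id(s) == addr)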
 
      | dmw_ng wrote:
      | The optimization got moved to unicode objects during the
      | great Python 3 upheaval and AFAIK still remains
      | unimplemented for what are now bytes objects. The paramiko
      | SSH library relied on it heavily; as a result, its runtime
      | is (AFAIK still) horrific on Python 3.
 
| [deleted]
 
| pwdisswordfish6 wrote:
| This behaviour (id not changing after concatenation) is
| consistent with the original str object being deallocated and a
| new one allocated in its place with an identical id. It's not
| what actually happens behind the scenes (in CPython), but it's
| still an implementation detail that this demonstration does not
| even unambiguously expose.
 
  | nine_k wrote:
  | CPython could do both; allocations are usually not byte-
  | perfect, and the small remaining room at the end of a string can
  | be used as an optimization, or a new buffer can be allocated if
  | there's no more room.
  | 
  | Also, cutting long strings with .strip() can use the same
  | optimization. Allocation isn't fast.
  | 
  | IDK if it's used, but it's entirely plausible to implement.
 
    | masklinn wrote:
    | > CPython could do both; allocations are usually not byte-
    | perfect
    | 
    | Though it's probably not the case for strings, it should be
    | noted that for most objects allocations _are_ byte-perfect,
    | because the average Python object is a bunch of pointers and
    | a refcount: two gc pointers, a refcount, a weakref
    | pointer, a class pointer and a dict pointer, for a total of
    | 48 bytes (as of 3.9, ignoring all the pointees).
    | 
    | That's not the case for `__slots__` though: they drop the
    | instance dict and the weakref pointer, but then each field
    | will add a pointer to the total (then again that's space
    | which won't be allocated as part of the instance dict).
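    | 
    | You can eyeball the difference with sys.getsizeof (exact
    | numbers depend on the CPython version and build):
    | 
    |     import sys
    | 
    |     class Plain:
    |         pass
    | 
    |     class Slotted:
    |         __slots__ = ("x", "y")
    | 
    |     p, s = Plain(), Slotted()
    |     print(sys.getsizeof(p))           # the bare instance (plus GC header)
    |     print(sys.getsizeof(p.__dict__))  # the per-instance dict is a separate allocation
    |     print(sys.getsizeof(s))           # slot values live inline; no __dict__ at all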
 
| nine_k wrote:
| If a tree falls in a forest, and nobody observes it, does it make
| a sound? The tree can choose, because there's nobody to notice
| the difference.
| 
| Mutable state is not bad; _shared_ mutable state is bad -- or, at
| least, takes a lot of care to handle. So CPython mutates strings
| which are not observed by anyone else but the mutating code.
| 
| This is exactly the idea behind uniqueness types [1] and move
| semantics in C++ and Rust: mutation is safe as long as it's not
| shared.
| 
| [1]: https://en.wikipedia.org/wiki/Uniqueness_type
 
| ymbeld wrote:
| The title seems like false advertising. It seems that strings
| in Python are indeed immutable. How it implements immutability is
| up to it (the implementation), and making a brand new allocation
| for each operation is not the only way of adhering to the
| contract.
| 
| Get back to me if you can make several references to the same
| string and make them step on each others' toes by only assigning
| a new value to one of them.
 
| teddyh wrote:
| > _the builtin id() function, which is just the memory address._
| 
| This, crucially, is _not_ guaranteed. Instead, id() is only
| guaranteed to give a unique integer value for every object _valid
| for the lifetime of that object_. So if Python deallocated your
| first string object and allocated a new string object, Python
| could give the same id() for the new string object and this would
| be fine; this does _not_ necessarily mean that the two string
| objects are the "same" object! In fact, since the objects do not
| share lifetimes, they _cannot_ be said to be the same object!
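| 
| A concrete way to see this (whether the address actually gets
| reused depends on the allocator, so the result isn't
| deterministic):
| 
|     n = 100
|     a = "spam" * n
|     addr = id(a)
|     del a                 # the first object's lifetime ends here
|     b = "eggs" * n
|     print(id(b) == addr)  # can be True: the freed slot may be reused,
|                           # even though a and b were never the same object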
 
| postit wrote:
| I just spent two days chasing a bug because someone shoved a
| variable inside a dataclass in the wrong place; nothing is
| immutable in Python.
 
  | iainmerrick wrote:
  | No, strings are immutable, despite the semantic gymnastics this
  | article is performing.
 
    | masklinn wrote:
    | Strings are immutable on the Python side. On the C side,
    | however, PyUnicode_Resize is part of the PEP 384 Stable ABI
    | and very specifically tries to resize the string in-place
    | (it's normally intended for string-building).
 
      | ymbeld wrote:
      | If it is immutable from the client's perspective, it's
      | immutable. Anything else is ill-defined.
 
        | masklinn wrote:
        | The C api is "from a client's perspective". Any rando can
        | build native modules with it.
 
| cs702 wrote:
| My favorite Python WTF "feature" is that integers can have the
| same reference, but only sometimes:
| 
|     >>> a = 256
|     >>> b = 256
|     >>> a is b
|     True
|     >>> a = 257
|     >>> b = 257
|     >>> a is b
|     False
|     >>> a = 257; b = 257
|     >>> a is b
|     True
| 
| Sometimes I think of Python as the _Nash Equilibrium_ [a] of
| programming languages:
| 
| It's never the absolute best language for anything, but it's hard
| to improve it on any front (e.g., execution speed) without
| hindering it on other fronts (e.g., ad-hoc interactivity), and
| it's never the absolute worst language for anything. Often
| enough, it's good enough.
| 
| Python might well be "the least-worst language for everything."
| 
| --
| 
| [a] https://en.wikipedia.org/wiki/Nash_equilibrium
 
  | patrickthebold wrote:
  | Java basically has the same thing:
  | 
  | Quick Google gives: https://stackoverflow.com/a/1515811
 
  | justinnhli wrote:
  | > Sometimes I think of Python as the Nash Equilibrium of
  | programming languages
  | 
  | FYI: What you're describing is not a Nash equilibrium, but a
  | Pareto optimal point [1]. They are similar in that you couldn't
  | do any better, but Nash equilibria are about whether this would
  | cause other players to change their strategies, while
  | Pareto optimality is only about trading off different
  | features/dimensions.
  | 
  | [1]: https://en.wikipedia.org/wiki/Pareto_efficiency
 
  | th0ma5 wrote:
  | This is outside of the spec... "is" is for testing the exact
  | same reference and it is only coincidence that to speed things
  | up they made smaller integers the same objects in memory. See:
  |     >>> a = 257
  |     >>> b = a
  |     >>> a is b
  |     True
  | 
  | What you want is double equals.
 
    | [deleted]
 
    | oivey wrote:
    | He names references in his post. I highly doubt he's confused
    | about the difference between is and ==. It's a weird leak of
    | interpreter details that could, in very narrow situations,
    | cause a bug.
 
  | coldtea wrote:
  | > _My favorite Python WTF "feature" is that integers can have
  | have the same reference, but only sometimes_
  | 
  | Many languages do the same, ditto for strings (as in TFA).
 
  | Delk wrote:
  | As other commenters have pointed out, this is an
  | implementation-specific optimization rather than a
  | property of Python as a language.
  | 
  | It is, at a first glance, a bit weird. But the way you should
  | look at it is that Python the language doesn't say the two
  | integers have the same identity, and you shouldn't assume they
  | will. But it also doesn't say they _can 't_ be the same object.
  | Since Python integers are immutable, and thus having the two
  | variables actually reference the same object can't create side
  | effects unless you're directly playing with identities and
  | making assumptions in your code that you shouldn't make, the
  | implementation can have the two variables reference the same
  | object as an optimization without breaking anything.
 
  | patrec wrote:
  | > It's never the absolute best language for anything, but it's
  | hard to improve it on any front (e.g., execution speed) without
  | hindering it on other fronts (e.g., ad-hoc interactivity),
  | 
  | This belief seems common, but I always wonder if anyone with
  | familiarity with dynamic programming languages that were
  | implemented by people who knew what they were doing (as
  | implementers) thinks so. Self, Smalltalk and Common Lisp, for
  | example, are doing _much_ better on the ad-hoc interactivity
  | front in non-trivial ways whilst offering implementations with
  | vastly better performance preceding (C)Python by many years.
  | The fact that python has terrible execution speed is mostly due
  | to a lack of relevant skills in the community, not some conscious
  | engineering trade-off.
  | 
  | Having said that, I don't think you are wrong on python being
  | "the least worst language for everything" -- very few other
  | languages have an ecosystem of remotely comparable
  | expansiveness and quality (the top minds in several disciplines
  | mostly use python for their work), which alone kills off huge
  | swathes of would-be competitors.
 
  | orf wrote:
  | `is` is for identity whereas `==` is for equality. You rarely
  | want `is` unless you're asking if two references _are the same
  | object_. This is almost exclusively used for `x is False/True`,
  | but sometimes used to denote "missing" arguments (where
  | true/false may be valid):
  | 
  |     missing = object()
  | 
  |     def func(a=missing):
  |         if a is missing:
  |             raise ValueError('you must pass a')
  | 
  | This "numbers less than 256 are the same objects" thing is a
  | fairly common entry on the list of "wtf python" but I've never
  | understood it. You don't use `is` like that, and you would never
  | use it in code, because the operator you're using is not the
  | right one for the thing you're trying to do.
  | 
  | Plus if this is the biggest wtf then that's pretty good going.
 
    | cs702 wrote:
    | Yes, of course. It's not the "biggest" WTF feature, but it's
    | my favorite. There's a long list of WTF features here:
    | 
    | https://github.com/satwikkansal/wtfpython
    | 
    | Otherwise, I agree, it's pretty good going :-)
 
| nneonneo wrote:
| I mean, who says Python strings are immutable?
| 
|     s = "asd"
|     import ctypes
|     ctypes.memmove(id(s) + ord('1'), b'X', 1)
|     print(s)  # prints aXd
| 
| (Python 3.3+, 64-bit; this is a joke so please do not use this in
| anything real)
 
  | yccs27 wrote:
  | Nice!
  | 
  | Can you please explain to a rookie what the ord('1') is for? Is
  | it just a way of writing a magic number of value 49, or is
  | there some significance to it?
 
    | panda88888 wrote:
    | The CPython implementation of the string stores metadata to save
    | memory. The ord('1') simply jumps to the correct memory
    | offset in the string structure. I would think it's easier to
    | remember ord('0') as the offset of the string byte array in
    | the structure than the magic number 48?
 
    | nneonneo wrote:
    | It's mostly just a mildly obfuscated way to write 49. It also
    | happens to line up nicely with the index of the character
    | being overwritten (index 1); ord('0') through ord('9') work
    | likewise.
    | 
    | ----
    | 
    | I'm personally kind of shocked at how big strings are - the
    | minimum size is 6 native words, corresponding to the
    | following fields: refcount, type pointer, string length,
    | string hash (precomputed), flags, and a wstr pointer which
    | just seems to be NULL for most strings. It seems like they
    | could have at least merged the string length and flags
    | together for short strings - a possible subject for a future
    | PEP...
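    | 
    | Same caveat as the ctypes trick above (CPython-specific
    | poking, not an API), but you can see those six words laid out
    | by dumping the object's raw bytes:
    | 
    |     import ctypes, sys
    | 
    |     s = "asd"
    |     size = sys.getsizeof(s)  # header + payload + trailing NUL for a compact ASCII str
    |     raw = ctypes.string_at(id(s), size)
    |     print(size, raw[48:])    # e.g. 52 b'asd\x00' on 64-bit CPython 3.9: data starts at offset 48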
 
    | rasjani wrote:
    | It feels like it's just a "magic" number. Probably id() is the
    | actual raw pointer to the str object, and the string data
    | starts at offset 48?
 
___________________________________________________________________
(page generated 2021-02-15 23:00 UTC)