|
| Animats wrote:
| Python string semantics are very strange, as the way Unicode was
| implemented. The internal representation can be one, two, or four
| bytes per character, depending on the widest character. Plus I've
| heard that there's sometimes a UTF-8 representation in there,
| too.
|
| Python strings have to be indexable by integers. But, in fact,
| most indexing is iteration. So a UTF-8 representation could work.
| You can index forwards and backwards through UTF-8 with opaque
| cursors. The search, match, and regular expression functions can
| all work on UTF-8. Returned indexes would be opaque cursors.
| Addition and subtraction of small integers to a cursor can be
| handled by indexing forwards and backwards through UTF-8. If a
| program does does random access indexing, or uses an opaque
| cursor in a context where it really has to be convereted to an
| integer, the implementation has to build an index of the entire
| string. Index building costs O(n), but should be rare.
| js2 wrote:
| > Does this really count as mutability though? Not really. The
| string would be thrown away anyway so this is just an
| optimization to reuse the memory. The important takeaway here is
| that you are not allocating a new string every single time like
| the internet says.
|
| This is an undocumented optimization, so you should assume you're
| allocating a new sting every single time like the internet says.
|
| I've been coding Python since 1.5.2 days and so I'll continue to
| use lists as intermediate string builders and then join
| afterwards because I know this works in past, present, a likely
| future versions of Python.
| orf wrote:
| Some interesting results. Joining on a generator is slower than
| joining on a list because the list has a known size beforehand,
| which lets Python perform some optimizations. Even though it's
| more memory intensive. Appending a string is surprisingly fast.
| Building an intermediate list is not.
|
| While I do get your point, Python likes making optimizations.
| They rarely, if ever, make patterns slower by choice. There's
| nothing in any spec that says it _has_ to be slow: that's as
| much as an implementation detail as "joining lists are fast".
| In [18]: sys.version Out[18]: '3.9.1 (default, Feb 3
| 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
| In [6]: strings = ["".join(random.choice(string.hexdigits) for
| _ in range(10)) for _ in range(10)] In [9]:
| def one(): ...: "".join(s for s in strings)
| ...: In [10]: def two(): ...:
| "".join([s for s in strings]) ...:
| In [11]: def three(): ...: x = []
| ...: for s in strings: ...: x.append(s)
| ...: "".join(x) ...: In [12]:
| def four(): ...: x = "" ...:
| for s in strings: ...: x += s
| ...: In [13]: %timeit one() 753 ns +-
| 9 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
| In [14]: %timeit two() 521 ns +- 5.63 ns per loop (mean
| +- std. dev. of 7 runs, 1000000 loops each) In
| [15]: %timeit three() 696 ns +- 3.94 ns per loop (mean
| +- std. dev. of 7 runs, 1000000 loops each) In
| [16]: %timeit four() 620 ns +- 4.82 ns per loop (mean
| +- std. dev. of 7 runs, 1000000 loops each)
| Dylan16807 wrote:
| Does python document anything about performance?
|
| I assume my array sort won't be n^3, but how do we draw the
| line on what to assume and what not to assume?
| nine_k wrote:
| This does play a role if for some reason you use object IDs,
| e.g. when deserializing data with cross-links. Keeping a real
| second reference becomes important.
| wizzwizz4 wrote:
| You should never use object IDs that way; that's what weak
| references / dictionaries are for. (See pickle for details.)
| heavenlyblue wrote:
| Python docs explicitly say that object IDs are only unique
| for objects with overlapping lifetimes.
|
| Congrats, by trying to be smart you just shot yourself in the
| foot.
| [deleted]
| lostcolony wrote:
| Yeah; I'm pretty sure this isn't anything you can take
| advantage of. But I think the point of the post is a "isn't
| this neat?" rather than "because of an optimization one
| approach isn't as bad as it could be and you should take
| advantage of that".
| masklinn wrote:
| > Yeah; I'm pretty sure this isn't anything you can take
| advantage of.
|
| You probably _can_ but _should not_ rely on it:
|
| * it was a very official part of the Python 2.4 release notes
|
| * it's unlikely the devs would remove it as they know how
| implementation details tend to leak into the software (see:
| dict ordering)
|
| * but it _is_ an optimisation so there 's no guarantee, a
| simple trace function could break it
|
| * and obviously there's no guarantee (and likely no way)
| other implementations can implement it
| lostcolony wrote:
| Right, but my thought is "at what times would it make sense
| to join, and not append, but this optimization would be
| faster than joining"? I'm guessing not very many.
|
| Because, yeah, the fact that str += "?" is faster isn't a
| big deal; that's the natural, ergonomic way to deal with
| it, because you're appending all of one string. Likewise
| foo + " " + bar + "?" is probably easier to write that way,
| than to drop them into a list and join (but even if not,
| I'd be curious if it's actually any faster; this article
| doesn't measure). By the time you get to joining large
| amounts together, concatenating a CSV or something, you're
| going to use join, naturally, and join is going to be more
| performant.
|
| So to my mind it's kind of an open question, at what points
| is the non-ergonomic thing going to be the faster thing?
| That's what "taking advantage of (an optimization)" feels
| like; otherwise you're just writing code and letting the
| performance fall where it may.
| masklinn wrote:
| > Right, but my thought is "at what times would it make
| sense to join, and not append, but this optimization
| would be faster than joining"? I'm guessing not very
| many.
|
| I'm guessing all of them. Because what it does is what
| join will do internally anyway, but without the overhead
| of the list itself, and with operations which have lower
| overhead than the list's.
| lostcolony wrote:
| "I'm guessing"
|
| Yes, you are. I'm saying this is a more interesting
| question, and one that isn't answered. Because the
| reality is that if you're appending just a handful of
| things together, the idiomatic approach is to concatenate
| them, and this optimization comes into play. If you're
| concatenating a lot of items together...you probably
| already have a list, and the idiomatic approach to join
| them is probably faster than a for loop to concat them
| over and over. So the question then, is is there a point
| where idiomatically it makes more sense to join things
| (putting them in a list if they aren't already, but they
| might be; depends on the example), but this will be more
| efficient?
| dmurray wrote:
| This one is less painful for the devs to break, if they
| need to, because it only works 99% of the time. If you rely
| on this behaviour for correctness, rather than performance,
| it will set you right pretty quickly.
|
| If you do want to take advantage of this, maybe you can go
| a level deeper: dig into Python's behaviour when allocating
| memory for strings, and figure out in what circumstances
| you can be guaranteed to get the same id back. E.g. maybe
| if you create a 28-character string, there will always be
| room to append 4 more.
| dmw_ng wrote:
| The optimization got moved to unicode objects during the
| great Python 3 upheaval and AFAIK still remains
| unimplemented for what are now bytes objects. The paramiko
| SSH library relied on it heavily, as a result its runtime
| is (AFAIK still) horrific on Python 3
| [deleted]
| pwdisswordfish6 wrote:
| This behaviour (id not changing after concatenation) is
| consistent with the original str object being deallocated and a
| new one allocated in its place with an identical id. It's not
| what actually happens behind the scenes (in CPython), but it's
| still an implementation detail that this demonstration does not
| even unambiguously expose.
| nine_k wrote:
| CPython could do both; allocations are usually not byte-
| perfect, and the small remaining room at the end of string can
| be used as an optimization, or a new buffer can be allocated if
| there's no more room.
|
| Also, cutting long strings with .strip() can use the same
| optimization. Allocation isn't fast.
|
| IDK if it's used, but it's entirely plausible to implement.
| masklinn wrote:
| > CPython could do both; allocations are usually not byte-
| perfect
|
| Though it's probably not the case for strings, it should be
| noted that for most objects allocations _are_ byte-perfect,
| because the average Python object is a bunch of pointers and
| and a refcount: two gc pointers, a refcount, a weakref
| pointer, a class pointer and a dict pointer, for a total of
| 48 bytes (as of 3.9, ignoring all the pointees).
|
| That's not the case for `__slots__` though: they drop the
| instance dict and the weakref pointer, but then each field
| will add a pointer to the total (then again that's space
| which won't be allocated as part of the instance dict).
| nine_k wrote:
| If a tree falls in a forest, and nobody observes it, does it make
| a sound? The tree can choose, because there's nobody to notice
| the difference.
|
| Mutable state is not bad; _shared_ mutable state is bad -- or, at
| least, takes a lot of care to handle. So CPython mutates strings
| which are not observed by anyone else but the mutating code.
|
| This is exactly the idea behind uniqueness types [1] and move
| semantics in C++ and Rust: mutation is safe as long as it's not
| shared.
|
| [1]: https://en.wikipedia.org/wiki/Uniqueness_type
| ymbeld wrote:
| The title seems like false advertisement. It seems that strings
| in Python are indeed immutable. How it implements immutability is
| up to it (the implementation), and making a brand new allocation
| for each operation is not the only way of adhering to the
| contract.
|
| Get back to me if you can make several references to the same
| string and make them step on each others' toes by only assigning
| a new value to one of them.
| teddyh wrote:
| > _the builtin id() function, which is just the memory address._
|
| This, crucially, is _not_ guaranteed. Instead, id() is only
| guaranteed to give a unique integer value for every object _valid
| for the lifetime of that object_. So if Python deallocated your
| first string object and allocated a new string object, Python
| could give the same id() for the new string object and this would
| be fine; this does _not_ necessarily mean that the two string
| objects are the "same" object! In fact, since the objects do not
| share lifetimes, they _cannot_ be said to be the same object!
| postit wrote:
| I just spend two days chasing a bug because someone shoved a
| variable inside a dataclass in the wrong place, nothing is
| immutable in python.
| iainmerrick wrote:
| No, strings are immutable, despite the semantic gymnastics this
| article is performing.
| masklinn wrote:
| Strings are immutable on the Python side. On the C size,
| however, PyUnicode_Resize is part of the PEP 384 Stable ABI
| and very specifically tries to resize the string in-place
| (it's normally intended for string-building).
| ymbeld wrote:
| If it is immutable from the client's perspective, it's
| immutable. Anything else is ill-defined.
| masklinn wrote:
| The C api is "from a client's perspective". Any rando can
| build native modules with it.
| cs702 wrote:
| My favorite Python WTF "feature" is that integers can have have
| the same reference, but only sometimes: >>> a =
| 256 >>> b = 256 >>> a is b True >>> a =
| 257 >>> b = 257 >>> a is b False >>> a =
| 257; b = 257 >>> a is b True
|
| Sometimes I think of Python as the _Nash Equilibrium_ [a] of
| programming languages:
|
| It's never the absolute best language for anything, but it's hard
| to improve it on any front (e.g., execution speed) without
| hindering it on other fronts (e.g., ad-hoc interactivity), and
| it's never the absolute worst language for anything. Often
| enough, it's good enough.
|
| Python might well be "the least-worst language for everything."
|
| --
|
| [a] https://en.wikipedia.org/wiki/Nash_equilibrium
| patrickthebold wrote:
| Java basically has the same thing:
|
| Quick Google gives: https://stackoverflow.com/a/1515811
| justinnhli wrote:
| > Sometimes I think of Python as the Nash Equilibrium of
| programming languages
|
| FYI: What you're describing is not a Nash equilibrium, but a
| Pareto optimal point [1]. They are similar in that you couldn't
| do any better, but Nash equilibria is in terms of whether this
| would cause other players to change their strategies, while
| Pareto optimality is only about trading off different
| features/dimensions.
|
| [1]: https://en.wikipedia.org/wiki/Pareto_efficiency
| th0ma5 wrote:
| This is outside of the spec... "is" is for testing the exact
| same reference and it is only coincidence that to speed things
| up they made smaller integers the same objects in memory. See:
| >>> a=257 >>> b=a >>> a is b True
|
| What you want is double equals.
| [deleted]
| oivey wrote:
| He names references in his post. I highly doubt he's confused
| about the difference between is and ==. It's a weird leak of
| interpreter details that could, in very narrow situations,
| cause a bug.
| coldtea wrote:
| > _My favorite Python WTF "feature" is that integers can have
| have the same reference, but only sometimes_
|
| Many languages do the same, ditto for strings (as in TFA).
| Delk wrote:
| As other commenters have pointed out, this is an
| implementation-specification optimization rather than a
| property of Python as a language.
|
| It is, at a first glance, a bit weird. But the way you should
| look at it is that Python the language doesn't say the two
| integers have the same identity, and you shouldn't assume they
| will. But it also doesn't say they _can 't_ be the same object.
| Since Python integers are immutable, and thus having the two
| variables actually reference the same object can't create side
| effects unless you're directly playing with identities and
| making assumptions in your code that you shouldn't make, the
| implementation can have the two variables reference the same
| object as an optimization without breaking anything.
| patrec wrote:
| > It's never the absolute best language for anything, but it's
| hard to improve it on any front (e.g., execution speed) without
| hindering it on other fronts (e.g., ad-hoc interactivity),
|
| This belief seems common, but I always wonder if anyone with
| familiarity with dynamic programming languages that were
| implemented by people who knew what they are doing (as
| implementers) thinks so. Self, Smalltalk and Common Lisp, for
| example, are doing _much_ better on the ad-hoc interactivity
| front in non-trivial ways whilst offering implementations with
| vastly better performance preceding (C)Python by many years.
| The fact that python has terrible execution speed is most due
| to lack of relevant skills in the community not some conscious
| engineering trade-off.
|
| Having said that, I don't think you are wrong on python being
| "the least worst language for everything" -- very few other
| languages have an eco system of remotely comparable
| expansiveness and quality (the top minds in several disciplines
| mostly use python for their work) which alone kills of huge
| swathes of would-be-competitors.
| orf wrote:
| `is` is for identity whereas `=` is for equality. You rarely
| want `is` unless you're asking if two references _are the same
| object_. This is almost exclusively used for `x is False
| /True`, but sometimes used to denote "missing" arguments (where
| true/false may be valid): missing = object()
| def func(a=missing): if a is missing:
| raise ValueError('you must pass a')
|
| This "numbers less than 256 are the same objects" is a fairly
| common on the list of "wtf python" but I've never understood
| it. You don't use `is` like that and you would never use it in
| code because the operator you're using is not the right one for
| the thing you're trying to do.
|
| Plus if this is the biggest wtf then that's pretty good going.
| cs702 wrote:
| Yes, of course. It's not the "biggest" WTF feature, but it's
| my favorite. There's a long list of WTF features here:
|
| https://github.com/satwikkansal/wtfpython
|
| Otherwise, I agree, it's a pretty good going :-)
| nneonneo wrote:
| I mean, who says Python strings are immutable? s
| = "asd" import ctypes ctypes.memmove(id(s) +
| ord('1'), b'X', 1) print(s) # prints aXd
|
| (Python 3.3+, 64-bit; this is a joke so please do not use this in
| anything real)
| yccs27 wrote:
| Nice!
|
| Can you please explain to a rookie what the ord('1') is for? Is
| it just a way of writing a magic number of value 49, or is
| there some significance to it?
| panda88888 wrote:
| CPython implementation of the string stores metadata to save
| memory. The ord('1') simply jumps to the correct memory
| offset in the string structure. I would think it's easier to
| remember ord('0') as the offset of the string byte array in
| the structure than the magic number 48?
| nneonneo wrote:
| It's mostly just a mildly obfuscated way to write 49. It also
| happens to line up nicely with the index of the character
| being overwritten (index 1); ord('0') through ord('9') work
| likewise.
|
| ----
|
| I'm personally kind of shocked at how big strings are - the
| minimum size is 6 native words in size, corresponding to the
| following fields: refcount, type pointer, string length,
| string hash (precomputed), flags, and a wstr pointer which
| just seems to be NULL for most strings. It seems like they
| could have at least merged the string length and flags
| together for short strings - a possible subject for a future
| PEP...
| rasjani wrote:
| It feels like its just "magic" number. Probably id() is the
| actual raw pointer to str class and the string starts at
| offset 48 ?
___________________________________________________________________
(page generated 2021-02-15 23:00 UTC) |