12 марта 2013 г.

How foreach actually works

Note: This answer assumes that you have some basic knowledge about how zvals work in PHP, in particular you should know what a refcount is and what is_ref means.
foreach works with all kinds of traversables, i.e. with arrays, with plain objects (where the accessible properties are traversed) and Traversable objects (or rather objects that define the internal get_iterator handler). This answer will mainly focus on arrays, I'll just mention the others at the end.
But before getting into that some background on arrays and their iteration that is important in this context:

Background on array iteration

Arrays in PHP are ordered hashtables (i.e. the hash buckets are part of a doubly linked list) and foreach will traverse the array according to that order.
PHP internally has two mechanisms to traverse an array: The first one is the internal array pointer. This pointer is part of the HashTable structure and is basically just a pointer to the current hashtable Bucket. The internal array pointer is safe against modification, i.e. if the current Bucket is removed, then the internal array pointer will be updated to point to the next bucket.
The second iteration mechanism is an external array pointer, called HashPosition. This is basically the same as the internal array pointer, but not stored in the HashTable itself. This external iteration mechanism is not safe against modification. If you remove the bucket that the HashPosition currently points to, then you'll leave behind a dangling pointer, leading to a segmentation fault.
As such external array pointers can only be used when you are absolutely, positively sure that during the iteration no user code will be executed. And user code can be run through a lot of ways, e.g. in an error handler, a cast or a zval destruction. That's why in most cases PHP has to use the internal array pointer instead of an external one. If it would not PHP could segfault when the user started doing weird things.
The issue with the internal array pointer is that it is part of the HashTable. So when you modify it, you are modifying the HashTable and as such the array. And as PHP arrays have by-value (rather than by-reference) semantics that means you need to copy the array whenever you want to iterate an array.
A simple example of why the copying really is necessary (and not just some puristic concern) is a nested iteration:
foreach ($array as $a) {
    foreach ($array as $b) {
        // ...
    }
}
Here you want both loops to be independent and not share the same array pointer in some weird way.
And this leads us to foreach:

Array iteration in foreach

Now you know why foreach has to perform an array copy before traversing the array. But that's obviously not the whole story. Whether or not PHP will actually do the copy depends on a few factors:
  • If the iterated array is a reference, then the copy will not happen, instead only an addref will be done:
    $ref =& $array; // $array has is_ref=1 now
    foreach ($array as $val) {
        // ...
    }
    Why? Because any change of the array should also propagate to the reference, including the internal array pointer. If foreach did a copy in this case it would break reference semantics.
  • If the array has a refcount of 1, the copy will not be done either. refcount=1 means that the array isn't used anywhere else so foreach can use it directly. If the refcount is higher than 1, it means that the array is shared with other variables and in order to avoid modification foreach has to copy it (apart from the reference case mentioned above).
  • If the array is iterated by-reference (foreach ($array as &$ref)), then - apart from the copy/no-copy behavior from above - the array will be made a reference afterwards.
So this is the first part of the mystery: the copying behavior. The second part is how the actual iteration is done, which is also a bit odd. The "usual" iteration pattern that you know (and that is also usually used in PHP - apart from foreach) is something like this (pseudocode):
reset();
while (get_current_data(&data) == SUCCESS) {
    code();
    move_forward();
}
foreach iteration looks a tiny bit different:
reset();
while (get_current_data(&data) == SUCCESS) {
    move_forward();
    code();
}
The difference is that move_forward() is not called at the end of one cycle, but at the beginning. So when your userland code is working on the element $i, then the internal array pointer is already at element $i+1.
This behavior of foreach is also the reason why the internal array pointer is set to the next bucket if the current one is removed, rather than the previous one (as you would expect). It's done so it works nicely with foreach (but it obviously won't work nicely with everything else and will skip array elements in that case).

Implications for code

The first implication of the behavior described above is that foreach has to copy the array it iterates in many cases (slow). But fear not: I actually tried out removing the requirement to copy it and I couldn't really see a performance difference in anything but artificial benchmarks (in those iteration got two times faster). Seems like people aren't iterating enough :P
The second implication is that usually there shouldn't be any implications. The foreach behavior is usually rather transparent to the user and just works as it should do. You don't need to worry about if and how a copy has been made and just when exactly a pointer is advanced.
The third implication is - and now we're getting to your problems - that sometimes you do get to see some very weird and hard to understand behavior. This happens in particular when you try to modify the array within the foreach.
A large collection of various edge-case behaviors that occur when you modify an array during iteration can be found in the PHP testsuite. Starting with this test and then incrementing the number 012 to 013 etc you can see how foreach will behave in various situations (different combinations of references etc).
But now to your actual examples:
  • Test case 1: Here $array has refcount=1 before the loop, so it won't be copied, but it'll get an addref. Once you do the $array[] the zval will be separated, so the array that you are pushing the elements to and the one in the iteration will be different ones.
  • Test case 2: Here the same applies as in test case 1.
  • Test case 3: Again, same story. At the foreach loop you have refcount=1, so you only get an addref and as such the internal pointer of $array will be modified. So at the end of the loop the pointer is NULL (meaning iteration done). each indicated this by returning false.
  • Test cases 4 and 5: Both each and reset are by-reference functions. The $array has a refcount=2 when it is passed to them, so it has to be separated. So here again foreach will be working on a separate array.
But those test cases are lame. The behavior only starts to get really unintuitive when you use a function like current in the loop:
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 2 2 2 2 */
Here you should know that current is also a by-ref function, even though it does not modify the array. It has to be in order to play nice with all the other functions like next which are all by-ref. (current is actually a prefer-ref function. It can also take a value, but will use a ref if it can.) Reference means that the array has to be separated and $array and the foreach-array will be different. The reason you get 2 instead of 1 is also mentioned above: foreach advances the array pointer before running the user code, not after. So even though the code is at the first element, foreach already advanced the pointer to the second.
Now lets try a small modification:
$ref = &$array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */
Here we have the is_ref=1 case, so the array is not copied (just like above). But now that it is_ref the array has no longer to be separated when passing to the by-ref current function. So now current and foreach work on the same array. You still see the off-by-one behavior though, due to the way foreach advances the pointer.
You get the same behavior when doing by-ref iteration:
foreach ($array as &$val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */
Here the important part is that foreach will make $array an is_ref=1 when it is iterated by reference, so basically you have the same situation as above.
Another small variation, this time we'll assign the array to another variable:
$foo = $array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 1 1 1 1 1 */
Here the refcount of the $array is 2 when the loop is started, so for once we actually have to do the copy upfront. Thus $array and the array used by foreach will be completely separate from the starts. That's why you get the position of the internal array pointer wherever it was before the loop (in this case it was at the first position).

Iteration of objects

When an object is iterated there are two cases:
a) The object is not Traversable (or rather: Does not specify the internal get_iterator handler). In this case the iteration happens very similar to arrays. The same copying semantics apply. The only real difference is that foreach will run some additional code to skip properties that are not visible from the current scope. A few more random facts:
  • For declared properties PHP optimizes the property hashtable away. If you are iterating over an object though it has to reconstruct this hashtable (increasing memory usage). [Not that this should bother you, just a bit of trivia]
  • The hashtable for the properties is refetched on every iteration, i.e. PHP will call the get_properties handler again and again and again. For "normal" properties this makes little difference, but if the properties are dynamically created in the handler (this is something internal classes quite commonly do) then it means that the properties table will be recomputed every time.
b) The object is Traversable. In this case pretty much all what has been said above does not apply in any way. In this case PHP will not do copying and it will also not do any "advance pointer before loop already" tricks. I think that the iteration behavior on Traversables is a lot more predictable and doesn't really require explanation :)

Substituting the iterated entity during the loop

Another odd case that I didn't mention yet is that PHP allows you to substitute the iterated entity during the loop. So you can start iterating on one array and then replace it with another array halfway through. Or start iterating on an array and then replace it with an object:
$arr = [1, 2, 3, 4, 5];
$obj = (object) [6, 7, 8, 9, 10];

$ref =& $arr;
foreach ($ref as $val) {
    echo "$val\n";
    if ($val == 3) {
        $ref = $obj;
    }
}
/* Output: 1 2 3 6 7 8 9 10 */
As you can see in this case PHP will just start iterating the other entity from the start once the substitution has happened.

Changing the internal array pointer during iteration

One last detail of the foreach behavior that I did not yet mention (because it can be used to get really weird behavior) is what happens when you try to change the internal array pointer during iteration.
It may not do what you expect: When you call next or prev in the loop body (in the by-ref case) you can see that the internal array pointer is moved, but it will still not change the iteration behavior. The reason is that foreach will back up the current position and the hash of the current element into a HashPointer after every iteration. On the next iteration foreach will check whether the internal position changed and try to restore it back to the old element (based on the hash).
Let's see what "try" means with a few examples. First, here is an example that shows how a change of the internal pointer does not change the foreach behavior:
$array = [1, 2, 3, 4, 5];
$ref =& $array;
foreach ($array as $value) {
    var_dump($value);
    reset($array);
}
// output: 1, 2, 3, 4, 5
Now lets unset the element that foreach will be at on the first iteration (key 1):
$array = [1, 2, 3, 4, 5];
$ref =& $array;
foreach ($array as $value) {
    var_dump($value);
    unset($array[1]);
    reset($array);
}
// output: 1, 1, 3, 4, 5
You can see that the reset happened this time (double 1) because the element with the backed up hash was removed.
Now remember that a hash is just that: A hash. I.e. it has collisions. So, let's first try the following snippet:
$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
$ref =& $array;
foreach ($array as $value) {
    unset($array['EzFY']);
    $array['FYFZ'] = 4;
    reset($array);
    var_dump($value);
}
// output: 1 1 3 4
This behaves as expected. We removed the EzFY key (where foreach currently was), so the reset happens. Also we set an additional key, so the 4 is added at the end of the iteration.
But now comes the odd part. What happens if we set the FYFY key instead of FYFZ? Lets try:
$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
$ref =& $array;
foreach ($array as $value) {
    unset($array['EzFY']);
    $array['FYFY'] = 4;
    reset($array);
    var_dump($value);
}
// output: 1 4
Now the loop went directly to the new element, skipping everything else. The reason is that the FYFY key collides with EzFY (actually all keys in that array collide). Furthermore the Bucket for FYFY happens to be at the same memory address as the Bucket of EzFY that was just dropped. So for PHP it will look like it is still the same position, with the same hash. So the position is "restored" to it, thus jumping to the end of the array.