Alternate precise GC implementation #1057
Conversation
Do you have the
No, the RTInfo is only generated for bare structs and classes. But "shared" could be available in the type info passed to the GC in most invocations.
With the current way of working with shared (i.e. casting it away while being protected by some mutex), I don't have a clue how this can work with thread-local operations. Also note that immutable is implicitly shared.
It seems like the only important part would be when using
immutable objects would be unable to carry thread-local instances (as per the D syntax)... which seems to be the case already. Of course, there doesn't seem to be much to do other than branch off the allocate function and change the Gcx to show that the local GC requires neither locks, signals, nor "stop the world"...
How is this going to work? Consider a shared array shared(T[]) locked by a mutex, then some factory function newing T without being aware of sharedness. The thread-local T is added to the array cast to T[]. A thread-local GC would not find that reference and would free the instance of T.
That absolutely can't be permitted by the compiler. The compiler has to force all underlying types to be implicitly shared, or they're invalid, i.e.
@etcimon This compiles with no errors, and is not a bug.

```d
void main()
{
    shared x = new Object();
}
```
It should be one. Leaving this semantic check out makes no sense. Storing a thread-local pointer in shared storage is like sending a local string pointer to a shared DB for later. You can jump through many hoops to make it dereference properly, but you'll need some really tight coupling to keep it working. This is exactly why the GC is currently shared and locks/stops the world even for local allocations: someone believed local pointers should be allowed to be in shared storage.
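The escape hazard described in this exchange can be sketched roughly as follows. The names (`registry`, `makeLocal`, `publish`) are invented for illustration and are not from the PR:

```d
// A minimal sketch of the hazard: a thread-local reference escaping into
// shared storage via the usual cast-away-shared idiom.
shared Object[] registry;   // shared storage, assumed guarded by a mutex elsewhere

Object makeLocal()          // factory unaware of sharedness
{
    return new Object();    // a thread-local GC would allocate this locally
}

void publish()
{
    // casting away shared lets the thread-local reference escape into
    // shared storage; a GC that scans only this thread's heap and roots
    // could free the object while other threads still reach it via registry
    auto arr = cast(Object[]) registry;
    arr ~= makeLocal();
    registry = cast(shared Object[]) arr;
}
```

Nothing here uses explicitly unsafe features beyond the cast, which is exactly the idiom the language currently sanctions for working with shared data.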
It was designed for the single global GC that the language currently requires. Another example is this:

```d
Object maker() pure
{
    return new Object();
}

void main()
{
    immutable x = maker();
}
```

The function that calls `maker` may implicitly convert the result to immutable because `maker` is pure. Neither of these examples involves casting or other explicitly 'unsafe' features. I don't see how a proposal for thread-local heaps that simply declares these use cases invalid and breaks existing code would ever be accepted.
It doesn't need to break existing code. The dual tls/shared support in the GC can be handled with something like:

```d
Object maker() pure
{
    return new Object();
}

void main()
{
    immutable x = maker().idup();
    shared y = maker().sdup();
}
```

There are countless performance issues that can be solved simply by avoiding the locking. D could beat Java in the benchmarks with this change alone. http://forum.dlang.org/thread/jhbogjnxmcpjmemgaigs@forum.dlang.org
The tests fail because of https://issues.dlang.org/show_bug.cgi?id=8262
I think most of the locking for allocations can be removed with a few CAS operations or just thread-local memory pools that are still scanned by the "global" garbage collection. |
Scanning involves locking as well if the garbage collection isn't thread-local.
Only seldom, and for a short period of time, if scanning is done concurrently. I agree that a thread-local GC would be nice to have, but I don't see how it fits with the current language.
The best way to achieve a thread-local GC would be to improve and enforce this. If even a 30% speedup is possible down the line, it's worth it. The more threads, the bigger the improvement. I have over 90k lines of D code to maintain and I'm ready to put in the work for this on druntime as well.
I actually implemented a bloom filter at some point and tried to use it to eliminate false pointers; it was much slower and only found very few false pointers in the GC benchmark suite. Almost all false pointers were already caught by the earlier checks. The other thing you can do with a precise GC is to skip non-pointer fields in an allocated object.
That's also my experience, but it depends a lot on how much memory is actually used and how much random data is scanned.
Considering that 32-bit programs are quite a bit faster (e.g. the benchmarks take 43 sec instead of 52 sec for 64-bit on my system), people might still prefer it sometimes.
Knowing the exact size of an object or array can also help a lot. It can avoid scanning up to half of the allocation block. We might also skip zeroing that area during allocation.
Yeah, it's pretty hard to predict which version is better. It's also system-dependent. The TypeInfo/RTInfo is probably used quite often, so it should usually be in the L1/L2 cache.
It will be crucial to store the bitmaps very efficiently.
It might also be interesting to add a statistics option to the GC that reports what kind of data is allocated (currently classes, NO_SCAN and the rest; later also structs and arrays) and the mean and standard deviation of the percentage of pointers in each allocation (grouped by kind and size).
I'm not so sure. Most of the bitmaps will fit into a cache line anyway. Maybe it's even better to store a series of bytes with alternating "number of 1s" and "number of 0s" so you can easily jump over areas without pointers. Alternatively, we might just check the bitmap after the
That could also be the task of the linker (identical comdat folding on the RTInfo). |
Right, but compressed they will fit in fewer cache lines, so you'll cause less cache pollution during GC scanning.
Coming up with an efficient RLE will be tricky; anyhow, that's fairly advanced and should only be done if necessary, or as an optional optimization step.
The deduplication is mandatory IMO, and ICF only works with Microsoft's linker (and is currently disabled due to Issue 10664), so please handle this explicitly.
See pp. 8-10 of http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them; that's fairly trivial, efficient, and should be close to optimal (though we don't know the distribution).
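The alternating run-length idea floated above could look roughly like this (`runLengthEncode` is a hypothetical name; the byte stream alternates 0-run and 1-run lengths, starting with the count of leading 0s):

```d
// Hypothetical sketch: encode a pointer bitmap as alternating run lengths
// of 0s and 1s. A run longer than 255 is split by emitting 255 followed by
// a zero-length run of the other value.
ubyte[] runLengthEncode(const(bool)[] bits)
{
    ubyte[] runs;
    bool cur = false;           // encoding always starts with a run of 0s
    ubyte len = 0;
    foreach (b; bits)
    {
        if (b == cur && len < ubyte.max)
            ++len;
        else if (b == cur)      // run overflowed one byte
        {
            runs ~= len;        // emit 255
            runs ~= 0;          // zero-length run of the other value
            len = 1;
        }
        else                    // value flipped: close the current run
        {
            runs ~= len;
            cur = b;
            len = 1;
        }
    }
    runs ~= len;
    return runs;
}
```

During marking, the decoder would advance the scan pointer by whole 0-runs without touching memory, which is the point of the scheme.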
32 bytes for each type? The bitmap will only be 1 byte in many cases.
Why is this size_t and not ubyte? How many different mark functions do you have?
Maybe it's better to turn this into a ubyte* pointing to [MarkType, length (varint), bitmap bytes] or [MarkType, bitmap bytes, sentinel].
We currently have about 8000 types in the unittest libphobos and 2000 in the release one; that's quite a number of relocations to add. It would be great to use non-global symbols here (hidden visibility) to avoid relocations during loading, but that's currently not possible in D.
32 bytes for each type? The bitmap will only be 1 byte in many cases.
Even if the bitmap needs only 1 byte, alignment will make it 16 bytes. I chose size_t[] because of the handling in markPrecise, but that could be changed depending on what's faster.
Why is this size_t and not ubyte? How many different mark functions do you have?
There are currently 3: default, dynamic array, associative array. But as there is no "emplace" in this version of the GC, I'd like to allow the user to specify a mark delegate to scan void[N] fields precisely according to the emplaced types.
Ubyte has an alignment of 1 byte.
Well, then you could have a fourth type meaning "user callback"; that would be the only one requiring alignment.
One other option would be to generate a mark function per type by string mixins. That's an idea that @andralex wanted to pursue. It will wreck compilation speed, though. I currently suspect most of the performance in the benchmark is lost to a lot of indirections when marking arrays of strings, which iterate over small types (char[] = { length, ptr }).
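The per-type mark function idea can be sketched as compile-time reflection over a type's fields. Everything here is hypothetical (`markPrecise` and the `markPtr` delegate are stand-ins, not the PR's API; a real implementation would likely be generated code rather than a template):

```d
// Hypothetical sketch of a per-type precise mark function; markPtr stands
// in for the GC's "enqueue this candidate pointer" operation.
void markPrecise(T)(ref T obj, scope void delegate(void*) markPtr)
{
    foreach (ref field; obj.tupleof)
    {
        alias F = typeof(field);
        static if (is(F == class))
            markPtr(cast(void*) field);           // class reference
        else static if (is(F == U*, U))
            markPtr(cast(void*) field);           // raw pointer
        else static if (is(F == E[], E))
        {
            markPtr(cast(void*) field.ptr);       // array payload
            static if (is(E == struct))
                foreach (ref e; field)
                    markPrecise(e, markPtr);      // recurse into elements
            else static if (is(E == D[], D))      // array of arrays, e.g. string[]
                foreach (e; field)
                    markPtr(cast(void*) e.ptr);
        }
        else static if (is(F == struct))
            markPrecise(field, markPtr);          // recurse into nested struct
        // purely non-pointer fields (int, char[8], ...) are skipped entirely
    }
}
```

The string[] branch illustrates the indirection problem mentioned above: each element is a tiny { length, ptr } pair, so the per-element work is dominated by call overhead unless the loop is flattened.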
While working on the (array of) struct destructors I realized that the array marking needs at least 2 versions: one for the array layout in lifetime.d, and one for plain arrays as created by std.array.Appender. Maybe RTInfo!T should be evaluated in the scope of the type to allow specialization. That could open some options to set up specific mark functions without implementing everything in the runtime.
Why should the runtime be concerned about a Phobos type's misuse of the runtime?
If it is a common idiom (e.g. Appender) to allocate arrays of objects like this, we need a way to specify how to scan them. It might also be C-malloc'ed memory added as a GC range. Even if I wouldn't care if Appender were dumped, I still think that if precise scanning is done through the TypeInfo, a struct/class must have the ability to specify how a class emplaced into a field has to be scanned. The other precise GC does this by modifying the pointer bitmap stored by the GC; this version needs to specify a mark callback function.
OK, as long as the marking function is defined and maintained in Phobos, that sounds correct.
As far as bitmap compression goes, I will say that there is nothing, as far as I know, in the spec that prevents the compiler from re-ordering the fields of a non-decorated (non-
Because the discussion came up again, a summary of my points:
I think there are lower-hanging fruit to speed up the GC, e.g. background/parallel finalization, or a vectorized (SIMD) loop in mark.
A few answers interspersed with as many questions (I'm not very familiar with the current implementation).
Indeed I could not find much information about that. Searching for quoted "false pointers" yields little data (a subset of which points to discussions within the D community), because so few languages have them! I fear this might be a political issue - we're the only language that sticks with conservative GC when all other languages and systems use precise GCs and build upon them.
A more interesting case is structures with pointers on one cache line and no pointers on the next one. Consider:

```d
struct Customer {
    string lastname;
    string firstname;
    string address;
    string zipcode;
    int id;
    char[8] ssn;
    ...
}
```

An array of such structures alternates between cache lines that contain pointers and cache lines that don't. Another question: what is the cost per word to decide whether it is a real pointer or a false pointer?
I agree bit patterns are not the best way to go about characterizing data shapes. We've discussed in the past the variant of generating a mark function per allocated type. So instead of bit patterns we have specialized code that knows how to scan types (and arrays thereof) transitively.
In the scheme sketched above, the only indirect calls are for roots. All others are direct calls because the functions are generated with full type information.
At some point surely prefetching will become slower than loading a fraction of the pages. I think we'd need to test this. Until then I think it's difficult to base decisions on it.
Historically, framing things as education issues is problematic. Also, data schemata often grow by accretion: one member here, one member there, and each member grows in size as features are added to the code. I don't think we can count this as a strong argument.
That'd be great indeed, but it has a hard limit on what you can do. One theme in precise GC is that there are more levers for improving it than for a conservative one.
For the most part these don't compete with precision and would improve either approach.
Erm, the point about false pointers is factually incorrect. While yes, in theory, a 64-bit address space should "solve" the false pointer problem, the reality is, as always, a little more complicated. There was a recent thread on the forum by someone doing big-data work who ran into a false pointer problem in a 64-bit app: http://forum.dlang.org/thread/mafpzsuiuhzlriixymvy@forum.dlang.org My guess is that while the OS allocator could randomly map addresses inside the virtual address space, it does not. I strongly suspect the behavior is more along the lines of: find free space in physical memory, then map the address to the next available block of contiguous free space in the virtual address space. This would necessarily involve mostly filling up the 32-bit-sized space before spilling over into 64-bit space. As we move into an era of larger data sets and the need to accurately work on them, I fear that the mantra "64-bit solves false pointers!" is going to be proven false at an ever-increasing rate.
A MAJOR win for precise or mostly-precise GC is the ability to compact all (resp. most) of the heap. This eliminates (resp. dramatically reduces) fragmentation.
Also, the precise GC has already been written – it is probably a good idea to use it!
I don't really buy this. Just adding std.datetime to the parsed files of the vdparser test shows a process memory usage varying from 400 to 900 MB for the x64 benchmark under Win64 (address randomization is likely to cause more trouble there than the usual addresses under Linux). That doesn't happen with the precise GC. I agree with most of what Andrei said, but not this part:
I don't think transitivity works in D: it moves the memory type description from the allocation to the referrer, but that information is incomplete (a class reference doesn't know about derived classes) and ambiguous (multiple referrers might use different pointer types, especially in system code). You have to keep the type information for every block anyway (in case it becomes a root), so tracking whether a transitive scan covered the actual allocation type seems like duplicated effort. Please note that the implementation in this PR is rather slow due to the number of indirections. Converting the scan function into CTFE-generated code (making compilation even slower than the bitmap generation used to be before it was converted into a trait) just reduces the number of indirections by one. I still prefer #1022, but it might need some optimizations for saving the pointer bitmap during allocation. Tuning these with dmd's weak optimizations was frustrating, though. Large arrays might use a simple scheme to save the pattern just once.
Interesting, re class references - so for classes you'd need an indirect call (including a hidden virtual method). Regarding storing different pointer types, what cases do you have in mind? For
Couldn't that be significant?
The void* pointer was my main concern, too. Unions of different pointer types are similar. Most other pointer casts are just temporary and might only become a problem if the stack is scanned more precisely. Other problems:
I'm not sure. IIRC it was also painful to have to call back into the GC to follow references. It could help if some of this could be inlined, but that would disallow swapping the GC without recompilation. If there is interest, I can try to rebase this PR for some experiments.
There is a huge difference between false pointers (as in a random number pinning an array) and stale pointers on the stack or in registers. The latter is a real problem even for 64-bit, but this PR doesn't solve it.
Mostly a single comparison; see https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/src/gc/gc.d#L2022-L2027.
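The check being linked can be illustrated with the classic unsigned range trick (function and parameter names here are invented, not druntime's):

```d
// Illustrative only: rejecting a candidate word against the GC's pooled
// address range costs a subtraction and one unsigned comparison.
bool mightBePointer(size_t word, size_t heapLo, size_t heapHi)
{
    // equivalent to: heapLo <= word && word < heapHi
    return word - heapLo < heapHi - heapLo;
}
```

Only words that pass this cheap filter proceed to the per-pool lookup, which is why the per-word cost of conservative scanning stays low.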
Well, code will require even more memory than bit patterns, and requires the same deduplication.
Yes, but we have limited resources, with which this does compete.
Let's please keep guesswork out of discussions; it's just wasting the time of everyone involved.
Yes, compaction is interesting. And fully solving the problem of precise register/stack marking would make this much more interesting. Just precise heap/root marking alone isn't that interesting.
We certainly can't, but you can easily craft a benchmark to test this (https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/benchmark/gcbench/tree1.d). Also, the question is not only what you win for some types, but how much you'd lose for others.
@MartinNowak JavaScriptCore and Steel Bank Common Lisp use a mostly-copying collector, where objects that are pointed to only by other heap objects (the vast majority) can be moved, while objects that are pointed to by registers or the stack are pinned. This provides good performance in practice. So precise heap/root marking is already a big win, even if the stack and roots are still marked conservatively. This is also shown by the dramatic reduction in memory leaks on 32-bit systems. Also, I think that fully precise GC might actually be feasible, at least for LDC. LLVM already supports precise, moving collectors thanks to work done at Azul Systems, which is being used by an LLVM backend for CoreCLR that is intended to become production-quality at some point. GDC and DMD will need more work, though, unless one wants to use a shadow stack (don't – slow).
Outdated and unlikely to perform reasonably fast, closing. |
This version of the precise GC saves type info in allocated memory.
The memory footprint of this version differs from that of the original precise GC (#1022), which allocated a pointer bitmap alongside every memory pool (one bit per size_t):
Some more notes:
The druntime benchmark suite runs a few percent slower than the non-precise version, unless it hits false pointers. That doesn't seem to happen very often for the test suite, though.
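A rough footprint comparison of the two approaches can be sketched from the descriptions above. The numbers and helper names are illustrative only, derived from "one bit per size_t" (#1022) versus "type info stored in the allocation" (this PR):

```d
// Illustrative back-of-the-envelope comparison, not actual druntime code.
//
// #1022: pool-wide bitmap, one bit per size_t word
//   => on 64-bit, 1/64 of the pool, roughly 1.6 % regardless of block size
size_t bitmapOverheadBytes(size_t poolBytes)
{
    return poolBytes / (8 * size_t.sizeof);   // one bit per word
}

// this PR: one hidden type-info reference per allocation
//   => 8 bytes per block on 64-bit: 25 % for a 32-byte block,
//      under 1 % for a 1 KiB block
size_t perBlockOverheadBytes(size_t numBlocks)
{
    return numBlocks * size_t.sizeof;         // one hidden slot per block
}
```

So the in-allocation scheme favors workloads with large blocks, while the pool bitmap has a flat, size-independent cost.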