# Slow Chat Archives > Slow Chat: Developing Multithreaded Applications >  Interlocked* API functions and "memory access semantics"

## GeoRanger

I watched with interest 'Parallel Programming Talk #66 - Listener Question "What is acquire memory access semantics and do I need to worry about this in parallel programming?"' which features Mr. Tersteeg and Dr. Breshears.

http://software.intel.com/en-us/blog...l-programming/

The Talk is very informative and helped me learn quite a bit.  I ended up with four additional questions after pondering for a while.

(I hope it is OK to ask about this video here on a different website.  That is, I hope I'm not violating some internet etiquette of which I am unaware.)

1.  I did not know the Interlocked* API functions effected a fence as described in the video.  I have used them quite a bit in multithreading as a faster alternative to a critical section for simple changes to variables, and assumed that the operating system was responsible for ensuring only one thread was inside the call for any given destination variable.  Thus, it seems confusing to me that these API functions are seemingly accomplishing two not-necessarily-related purposes.  That is, on the one hand they are a quick critical section (so I suppose), and on the other they are a fence to the processor via a special machine instruction they use internally.  Can someone explain why these two purposes are intertwined in these functions?

2.  Dr. Breshears states loading from a volatile variable effects an acquire fence and storing into a volatile effects a release fence (or maybe I've got that backwards).  Can someone expound on this?  I don't see why this would be the case.

3. I think I generally get the "[acquire|release] memory access semantics" in terms of their asymmetric fence properties.  Can someone give a simple example of how this asymmetry can be used to achieve something more efficient than a full fence?

4. Is it only the Itanium that supports the acquire/release versions of these functions, or are they available on Xeons and i7s?

Thanks to anyone for information, and to Mr. Tersteeg and Dr. Breshears for the informative Talk video!
GeoRanger

----------


## Codeplug

>> *Can someone explain why these two purposes are intertwined in these functions?*
Win32 critical sections are implemented using "memory barriers". Or in other words, they use instructions which guarantee a deterministic outcome in the face of multiple threads accessing the same memory location at the same time.

Critical sections provide the strong guarantee of sequential consistency - analogous to a full memory fence. A full memory fence is a relatively expensive operation: it can involve flushing cache lines between cores, or completing accesses to/from main memory. An acquire/release fence is typically less expensive, so algorithms which don't require a full fence can benefit by using the cheaper acquire/release fence.

Another aspect is that a critical section may make a kernel call in order to block the thread until the critical section becomes available. Using only the Interlocked APIs avoids this relatively expensive operation. However, for most use cases critical sections are just as fast, since they already implement optimizations which avoid the kernel call (using Interlocked operations under the hood).

>> *2.*
At ~24:22 of the video, he mentions the acquire/release semantics of volatile loads/stores. He also goes on to say "this only works at the compiler level". In other words, it doesn't prevent reordering of reads and writes at the hardware level.

So "at the compiler level" really means the order of the compiler-generated instructions (which the HW may be free to reorder). From a standards perspective, volatile accesses are more like a full fence - because the compiler can't move any volatile load/store instruction before/after another. But again, this doesn't help when multi-threading since the HW can reorder reads and writes.

>> *3.*
In general, a full fence is more expensive than an acquire/release fence. Identifying when you need one and not the other is the tricky part. Typically only lock-free or wait-free algorithms are concerned with this level of granularity. A simple example would be the double-checked locking idiom, which is covered here: "C++ and the Perils of Double-Checked Locking" (also a good read for understanding how reordering can bite you).

gg

----------


## dvyukov

> From a standards perspective, volatile accesses are more like a full fence - because the compiler can't move any volatile load/store instruction before/after another.

Nope. That's not true. Volatile accesses can't be reordered relative to other volatile accesses, but only that: they do not affect the ordering of plain non-volatile accesses in any way.
So if a program contains non-volatile accesses (which is usually the case), then volatiles do not act as fences.

----------


## dvyukov

> Can someone explain why these two purposes are intertwined in these functions?

Because they are almost always required at the same time.
If they did not provide fences, it would hardly be the case that you could use them quite a bit without knowing what fences are. They would brutally bite you every time you used them, causing problems that can take weeks to localize.

----------


## dvyukov

> Dr. Breshears states loading from a volatile variable effects an acquire fence and storing into a volatile effects a release fence

From the C++ point of view, volatile has nothing to do with ordering and multithreading.
Perhaps Clay was referring to Microsoft Visual C++'s extended volatile semantics. Indeed, in MSVC a volatile load/store is acquire/release at both the compiler and hardware level.

----------


## dvyukov

> Can someone give a simple example of how this asymmetry can be used to acheive something more efficient than a full fence?

Here is a single-producer/single-consumer queue which is based on release/acquire (or more precisely - consume) fences:
http://software.intel.com/en-us/arti...onsumer-queue/

----------


## dvyukov

> Is it only the Itanium that supports the acquire/release versions of these functions, or are they available on Xeons and i7s?

Yes and no. They are implicit on Xeons and i7s. I.e. every load is an acquire, and [almost] every store is a release.

----------


## Codeplug

>>>> *the compiler can't move any volatile load/store instruction before/after another [volatile load/store].* 

>> *Volatile accesses can't be reordered only relative to other volatile accesses.*
Absolutely, and that's what I intended. I should have been more clear. Thanks for pointing that out.

>> *Perhaps, Clay referred to Microsoft Visual C++ extended volatile semantics.*
That was my first thought as well. But in the video at around 24:22, it sounds like he was only referring to compiler reordering - in that the compiler cannot reorder volatile accesses relative to other volatile accesses - but the HW can, making standard volatile useless in terms of ordering and multi-threading. I think that's the point he was trying to get across.

>> *Here is a single-producer/single-consumer queue which is based on release/acquire (or more precisely - consume) fences:*
Is the volatile cast only there to ensure the compiler generates a real load/store? If so, would it be "better" to use some sort of inline assembly trick to ensure the load/store is generated?

By "better", I'm thinking in terms of not using a keyword that was never intended for multi-threading use (even though part of the original intention may of been to ensure real loads/stores are generated). The following c.l.c.m post is kinda what got me thinking about "better" in this regard - http://groups.google.com/group/comp....825fe3595f93bb

gg

----------


## dvyukov

>> Volatile accesses can't be reordered only relative to other volatile accesses.
> Absolutely, and that's what I intended. I should have been more clear. Thanks for pointing that out.

But then they are *not* fences in any way, shape or form. Fences order *other* accesses.

----------


## dvyukov

> But in the video at around 24:22, it sounds like he was only referring to compiler reordering - in that the compiler cannot reorder volatile accesses relative to other volatile accesses - but the HW can, making standard volatile useless in terms of ordering and multi-threading. I think that's the point he was trying to get across.

If so, then the statement is incorrect. According to ISO C++ volatile accesses are not acquire/release on any level. Acquire/release accesses must order *other* accesses, which volatiles do not.

Btw, ordering on the compiler level only is sometimes useful too. The examples are: (1) communication between a thread and a signal handler, (2) communication between threads bound to a single hardware processor, (3) asymmetric synchronization patterns where a thread executes a fence "for me and for that guy".

----------


## dvyukov

> Is the volatile cast only there to ensure the compiler generates a real load/store? If so, would it be "better" to use some sort of inline assembly trick to ensure the load/store is generated?

Until C++0x there is just no best way. Some people prefer this and some prefer that. I prefer volatiles because they are easy to use and work for me. As for asm, in the presence of link-time code generation/inter-procedural optimization it does not look 100% safe either (just like volatiles and everything else).

----------


## dvyukov

> By "better", I'm thinking in terms of not using a keyword that was never intended for multi-threading use

I do not use volatiles for multi-threading. I use load_consume()/store_release() for multi-threading. And how they are implemented is no more than an implementation detail. Implementation requires several things - actual stores and loads, ordering on the compiler level, ordering on the hardware level. In this case, volatile is just one of the measures to ensure some of these things for a particular compiler. I see nothing wrong with this. Anyway, there are just no constructs intended for multi-threading, so that's just what we have to live with for now. I will happily switch to std::atomic/std::atomic_thread_fence ASAP.

----------


## Codeplug

Thanks Dmitriy!

>> *Can someone give a simple example of how this asymmetry can be used to achieve something more efficient than a full fence?*
One more example using Peterson's lock, with an introduction to C++0x atomics:
http://bartoszmilewski.wordpress.com...mory-ordering/ 
(with corrections, thanks to Dmitriy and Anthony)
http://www.justsoftwaresolutions.co....x_atomics.html

gg

----------

