Wednesday, September 24, 2008

Memory barriers and SSE2

After reading that lock-free code could be unstable on multiprocessor machines (dual cores, etc.), I learned that there is a way to prevent such problems: memory barriers.

A memory barrier makes sure that all the data you've written effectively reaches memory and becomes visible to the other processors, instead of lingering in the CPU's cache or store buffers.

Unfortunately, according to a recently reported GCC bug, GCC's memory-barrier primitive, __sync_synchronize(), is flawed on x86-64 processors (Core 2 Duo, Athlon 64 X2, etc.) because it doesn't emit the "mfence" instruction.

mfence is an SSE2 instruction (SSE2 is available on Pentium 4 and later processors) that implements an efficient memory barrier. There are other ways to implement a memory barrier on x86 processors, which I won't cover here for reasons of space. So I replaced the __sync_synchronize() primitive with qprof's AO_nop_full(), which issues a memory barrier, using mfence if available. I also added this call to all the atomic primitives, and to the syAudioBuffer class.
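
For the curious, here's roughly what such a full barrier looks like with GCC-style inline assembly. This is a minimal sketch of the technique, not the actual libatomic_ops source:

/* A full memory barrier for x86, as a sketch. All x86-64 CPUs have SSE2,
   so the fallback only matters for older 32-bit processors. */
static inline void FullMemoryBarrier() {
#if defined(__SSE2__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    /* Pre-SSE2 x86: any locked instruction acts as a full barrier. */
    __asm__ __volatile__("lock; addl $0,(%%esp)" ::: "memory");
#endif
}

The "memory" clobber also tells the compiler itself not to reorder memory accesses across the barrier, which is half the battle.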

Now I can be 100% sure that the code won't crash. At least for multiprocessor-caused race conditions, anyway. ;-)

Tuesday, September 23, 2008

"mutable" keyword, and Major rewrite in the core...

After redesigning syAudioBuffer and implementing some stuff in AudioOutputDevice, I realized that my assumption that only one thread would do the reading and writing led me to write some very defective code, so I almost had to rewrite the entire thing. Fortunately, I have already finished doing so with AudioOutputDevice. But VideoOutputDevice, VideoInputDevice and AudioInputDevice have yet to be rewritten. Sigh.

Additionally, I need to redeclare the majority of the functions as const. Fortunately, I can now do so freely, because I have discovered the mutable keyword!

mutable allows certain member variables of an object to be modified even inside a const member function, or when the object itself was passed as "const". This way, I can declare the thread-synchronization variables as "mutable" and still enjoy the compile-time protection of const.
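
For example, a getter can now lock a mutex even though it's declared const. A minimal sketch, borrowing the project's syMutex / syMutexLocker names for illustration:

class syAudioBuffer {
public:
    unsigned int GetLength() const {
        syMutexLocker lock(m_Mutex); // OK: m_Mutex is mutable
        return m_Length;
    }
private:
    mutable syMutex m_Mutex; // may be locked even in const member functions
    unsigned int m_Length;
};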

Update (Sep 23, 2008):

Finally, some good results from the rewrite! I discovered a bug in syBitmapCopier: instead of copying the source to the destination, it copied the destination to the source!! :-S Yikes.

It's nice how all these knowledge tidbits contribute to improving a programmer's skill as a whole.

Wednesday, September 17, 2008

Happy 17th birthday, Linux!



Some hours ago, Linux turned 17 years old! I have to mention that after studying the kernel code (because that's what Linux is: a kernel, NOT an operating system!), I realized that calling the different distributions "Linux" is wrong.

You'll probably say "aw man, not that Stallman crap again". Nope. It's actually the opposite. I've been having some annoying problems with the Linux DISTRIBUTION (which includes the kernel, the GNU operating system files, KDE and some other stuff) that I'm using right now. But saying "aw man, my Linux isn't working" is WRONG. It's not the kernel's fault! It's the stupid distro's fault.

Because as I studied the multithreading code and the POSIX mutex implementation, and read about the Completely Fair Scheduler, I realized:

Man, this kernel is one of the most beautiful things created by men. For starters, it implements posix threads, which have a beautiful and simple API compared to... eeew... Windows - trust me, I've been using this thing.

Another example of Linux's beauty is one of the recent patches to the scheduling code, merged in 2.6.25: it implements something called ticket spinlocks. To explain: a spinlock is an item that can be possessed by only one thread at a time, and all the threads keep competing for it, never giving up until they get it. Not a nanosecond of rest. Imagine a discount sale at a women's clothing shop. Yes, it gets that ugly :P.

But the latest patch makes each thread take a "ticket" and wait for its turn. When I read that I was like... whoa... you just blew my mind. I didn't even SUSPECT things like this could be implemented! Even after 17 years, Linux (yes, the kernel) is getting more and more beautiful.
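
To get the idea across, here's what a userspace ticket spinlock might look like in C++ with GCC's atomic builtins. The kernel's real version lives in assembly, so this is just a sketch of the concept:

class TicketSpinlock {
public:
    TicketSpinlock() : m_NextTicket(0), m_NowServing(0) {}
    void Lock() {
        // Take a ticket: every thread gets a unique, increasing number.
        unsigned int ticket = __sync_fetch_and_add(&m_NextTicket, 1);
        // Wait until our number is called. Strictly FIFO: no starvation.
        while (m_NowServing != ticket) { /* spin */ }
        __sync_synchronize(); // keep protected reads below the lock
    }
    void Unlock() {
        __sync_synchronize(); // flush protected writes before releasing
        m_NowServing = m_NowServing + 1; // admit the next ticket holder
    }
private:
    volatile unsigned int m_NextTicket;
    volatile unsigned int m_NowServing;
};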

Now, repeat after me:

Linux. Is. Beautiful.

So now on my to-do list is to replace all the "Linux" references on Saya's website with "GNU/Linux", to give the kernel (represented by that cute little penguin we call Tux :) ) the place it deserves.


HAPPY BIRTHDAY, LINUX!

As a gift, here's a cute paper tux to decorate your desk. By the way, you can edit the image to replace that logo with your favorite distro's logo. Enjoy the party :)

Monday, September 15, 2008

New AVController design: Four threads, audio frames.

After not visiting the audio/video playback/sync module for a while (due to my work on the threads and the API cleanup in the timeline model), I finally decided to print the source code. Believe it or not, printing lets you concentrate better and see the code as a whole. It came out to 20 or 30 pages, stapled per file for better organization.

Anyway - I studied the code for avcontroller.cpp, and I saw a fatal flaw in the preliminary design: there was only ONE worker thread, and I need four!

Why?

The code does more or less this: it tells the VideoInputDevice to send data to the VideoOutputDevice, and then tells the VideoOutputDevice to load the data. The problem is that VideoOutputDevice also renders the data after loading it, instead of handing that off to a separate thread or something.

So there's a sequence to load the video:

1) VideoInputDevice fetches the video from a file or whatever. This implies a HEAVY OVERHEAD, as it can involve either decoding or compositing.

2) When the data is available, VideoInputDevice sends the data to the VideoOutputDevice.

3) After receiving the data, VideoOutputDevice sends the data to the display or encoder.

This means that while the decoder processes the input, the output (display or encoder) is blocked. While this can be fine for video, where you can just play the same frame over and over, it's NOT fine for audio, where the buffer can simply go blank.

And then I realized an even WORSE problem: there is no separation between video and audio processing time! It's all done with ONE thread!! So while that thread is busy decoding the video or processing the video effects, the audio can run out of data.

To solve this problem, we will use FOUR threads: Video In, Audio In, Video Out, and Audio Out. If one operation is blocked, it won't affect the others. Of course, if the video or audio simply runs out of data, we'll have to stay silent. But at least the audio input won't affect the video output, or vice versa.

Another problem is that using mutexes can slow things down, especially for audio. I've been researching lock-free data transfer for a while, and I remember one guy saying something about audio FRAMES.

What? O.O *blink blink* Frames in audio? O.O *blink blink*

That's right. By splitting the buffer into separate small frames of time (say, 1000 samples each), we can use a fixed buffer of POINTERS to those frames.

This means: Instead of having this:

[..........................................................]

we have this:

A [.....]
B [.....]
C [.....]
D [.....]
E [.....]
F [.....]
G [.....]
H [.....]
I [.....]
J [.....]

[ABCDEFGHIJ]

So a thread can reserve one of the N frames of audio for buffering, and instead of having to avoid touching a HUGE GODZILLA chunk of data, we can just meddle with one of the little godzukis. And the list that tells us in which order they should be played can be a lock-free list (meaning it uses thread-safe algorithms that require no mutexes or any of that).
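
In code, the layout would be something like this (a sketch with made-up names and sizes):

const unsigned int kFrameSize  = 1000; // samples per frame
const unsigned int kFrameCount = 10;   // frames A through J

struct syAudioFrame {
    float samples[kFrameSize];
};

syAudioFrame  g_FramePool[kFrameCount];  // the little godzukis
syAudioFrame* g_PlayOrder[kFrameCount];  // [ABCDEFGHIJ]: playback order, by pointer

A writer thread grabs a free frame from the pool, fills it, and then publishes just the POINTER into the play-order list; the reader never has to share a buffer that is still being written to.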

Finally, with the lessons learned during the thread module's development, I can at last delete that awful sleep-based synchronization and start using semaphores and/or condition variables. I think I'll go for condition variables, as I can wake up the thread either when I need to send data or when I need to stop it.
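
Here's the basic shape of that pattern with raw POSIX threads; our thread classes would wrap something like this (a sketch, not Saya code):

#include <pthread.h>

pthread_mutex_t g_Mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  g_Cond  = PTHREAD_COND_INITIALIZER;
bool g_DataReady = false;
bool g_MustStop  = false;

void* WorkerThread(void*) {
    pthread_mutex_lock(&g_Mutex);
    for (;;) {
        while (!g_DataReady && !g_MustStop) {
            pthread_cond_wait(&g_Cond, &g_Mutex); // sleeps; atomically releases the mutex
        }
        if (g_MustStop) { break; }
        g_DataReady = false;
        // ... send the data ...
    }
    pthread_mutex_unlock(&g_Mutex);
    return 0;
}

void SignalData() { // called by the producer
    pthread_mutex_lock(&g_Mutex);
    g_DataReady = true;
    pthread_cond_signal(&g_Cond); // wakes the worker
    pthread_mutex_unlock(&g_Mutex);
}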

Edit (Oct 2, 2008):

The actual redesign didn't use audio frames, after all. It was much simpler to use a circular buffer with one sample per slot.
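
Something like this, say (a minimal single-producer/single-consumer sketch; with exactly one reader and one writer, the two indices need no mutex at all):

const unsigned int kBufSize = 4096;

class syCircularBuffer {
public:
    syCircularBuffer() : m_Head(0), m_Tail(0) {}
    bool Write(float sample) {                // called only by the producer
        unsigned int next = (m_Head + 1) % kBufSize;
        if (next == m_Tail) { return false; } // full
        m_Data[m_Head] = sample;
        __sync_synchronize();                 // publish the sample before the index
        m_Head = next;
        return true;
    }
    bool Read(float& sample) {                // called only by the consumer
        if (m_Tail == m_Head) { return false; } // empty
        sample = m_Data[m_Tail];
        __sync_synchronize();
        m_Tail = (m_Tail + 1) % kBufSize;
        return true;
    }
private:
    float m_Data[kBufSize];
    volatile unsigned int m_Head, m_Tail;
};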

Sunday, September 14, 2008

Forward declarations: Not good enough?

It's widely known among C++ programmers that replacing class-type member variables with pointers allows you to use a forward declaration in your headers and save compilation time.

Example:

Bad:

a.h:

class A {
};


b.h:

#include "a.h"
class B {
    A myvar;
};


Good:

a.h:

class A {
};


b.h:

class A;
class B {
    A* myvar;
};


Then, in the .cpp, you do the #include. But this is NOT enough! Since most of my classes are pretty brief (some are just template instantiations), each .cpp has to include the headers for all the classes it uses, and for ALL the classes contained in those classes, ad infinitum. OK, if I followed the first approach I would only "include one file", but that's not the point, since that "one file" would start an include chain. So no matter what I do, each .cpp ends up pulling in all the .h files, and I end up with a gigantic object file.

Let's see what this little code does:


#include "smap.h"
#include "avclip.h"
#include "avtransition.h"
#include "aveffect.h"
#include "aveffects.h"


AVClip::AVClip()
{
    //ctor
    m_Effects = new AVEffects;
    m_EndingTransition = new AVTransition;
    m_Markers = new SMapUintUint;
}

AVClip::~AVClip()
{
    //dtor
    delete m_Markers;
    delete m_EndingTransition;
    delete m_Effects;
}



Why do I need to include all the headers JUST TO CREATE AN OBJECT? Because by doing "m_Effects = new AVEffects", I call AVEffects::AVEffects(), and for that I need the include file. Which also ends up pulling in svector.h, which in turn includes serializable.h, and so on. Eew!!

Enter forward declarations in the .cpp files!

To solve this problem, I created a class named AVClasses, which encapsulates object creation, deletion, serialization AND deserialization for ALL THE KNOWN CLASSES.

avclasses.h:

#include <string>

class AVSettings;
class FXParameterList;
class AVEffectParamDeclaration;
class AVEffectParamDeclarations;
// ...

class AVClasses {
public:
    static void Create(AVSettings* &ptr);
    static void Create(FXParameterList* &ptr);
    static void Create(AVEffectParamDeclaration* &ptr);
    static void Create(AVEffectParamDeclarations* &ptr);
    // ...

    static void Delete(AVSettings* &ptr);
    static void Delete(FXParameterList* &ptr);
    static void Delete(AVEffectParamDeclaration* &ptr);
    static void Delete(AVEffectParamDeclarations* &ptr);
    // ...

    static std::string serialize(AVSettings* &ptr);
    static std::string serialize(FXParameterList* &ptr);
    static std::string serialize(AVEffectParamDeclaration* &ptr);
    static std::string serialize(AVEffectParamDeclarations* &ptr);
    // ...

    static bool unserialize(AVSettings* &ptr, const std::string& s);
    static bool unserialize(FXParameterList* &ptr, const std::string& s);
    static bool unserialize(AVEffectParamDeclaration* &ptr, const std::string& s);
    static bool unserialize(AVEffectParamDeclarations* &ptr, const std::string& s);
    // ...
};


This little toy hides the object creation, destruction and (un)serialization. Of course, the avclasses.cpp file will include ALL THE HEADERS. It's fun!

avclasses.cpp:

#include "avclasses.h"

#include "avsettings.h"
#include "fxparameterlist.h"
#include "aveffectparamdeclaration.h"
#include "aveffectparamdeclarations.h"
// ... etc.

void AVClasses::Create(AVSettings* &ptr) { ptr = new AVSettings; }
void AVClasses::Create(FXParameterList* &ptr) { ptr = new FXParameterList; }
void AVClasses::Create(AVEffectParamDeclaration* &ptr) { ptr = new AVEffectParamDeclaration; }
void AVClasses::Create(AVEffectParamDeclarations* &ptr) { ptr = new AVEffectParamDeclarations; }
// ...

void AVClasses::Delete(AVSettings* &ptr) { delete ptr; ptr = NULL; }
void AVClasses::Delete(FXParameterList* &ptr) { delete ptr; ptr = NULL; }
void AVClasses::Delete(AVEffectParamDeclaration* &ptr) { delete ptr; ptr = NULL; }
void AVClasses::Delete(AVEffectParamDeclarations* &ptr) { delete ptr; ptr = NULL; }
// ...

std::string AVClasses::serialize(AVSettings* &ptr) { return ptr->serialize(); }
std::string AVClasses::serialize(FXParameterList* &ptr) { return ptr->serialize(); }
std::string AVClasses::serialize(AVEffectParamDeclaration* &ptr) { return ptr->serialize(); }
std::string AVClasses::serialize(AVEffectParamDeclarations* &ptr) { return ptr->serialize(); }
// ...

bool AVClasses::unserialize(AVSettings* &ptr, const std::string& s) { return ptr->unserialize(s); }
bool AVClasses::unserialize(FXParameterList* &ptr, const std::string& s) { return ptr->unserialize(s); }
bool AVClasses::unserialize(AVEffectParamDeclaration* &ptr, const std::string& s) { return ptr->unserialize(s); }
bool AVClasses::unserialize(AVEffectParamDeclarations* &ptr, const std::string& s) { return ptr->unserialize(s); }
// ...


I took a look at the file size of avclasses.o. It measures 579K! Half a meg!

But now let's see what happens if I use AVClasses inside the previous avclip.cpp:


#include "avclip.h"
#include "avclasses.h"

AVClip::AVClip()
{
    //ctor
    AVClasses::Create(m_Effects);
    AVClasses::Create(m_EndingTransition);
    AVClasses::Create(m_Markers);
}

AVClip::~AVClip()
{
    //dtor
    AVClasses::Delete(m_Effects);
    AVClasses::Delete(m_EndingTransition);
    AVClasses::Delete(m_Markers);
}


And the results are the following:

Before:

avclip.o: 105356 bytes.

After:

avclip.o: 33136 bytes.

Needless to say, compilation is almost instant!

Yes, there is a drawback to all this: that extra call when creating the objects. But if we look at the AVClasses code, it's little more than a jump table. The exception is the serialization stuff, which duplicates all the memory transfers by returning strings by value. But that can be solved by simply modifying the code to pass a destination string as a parameter instead of returning one.
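
Something like this hypothetical signature, where the caller supplies the destination:

// Hypothetical variant: fill a caller-provided string instead of returning one.
static void serialize(AVSettings* &ptr, std::string& dest);

(The classes' own serialize() methods would need the same treatment for the copy to really go away.)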

So we save a lot of duplicate code in the object files for a tiny runtime penalty. Is it worth it? WAY YES!!!

When I finish changing all the code in my files, I'll tell you how much the compilation time decreased.

Update: After changing lots of files to use this approach, I noticed that the gain in compilation time is only marginal. I wonder if the change is really worth it.

Update (Sept 15, 2008): I finally realized that this is nonsense; it's not worth it, and just plain stupid. I also came to another realization: the current STL-based model is optimal in complexity. I may not like the STL bloat, but the fact is that it keeps the programming SIMPLE. I don't need to complicate things with more pointers and lots of "exception-to-the-rule" routines.

So I reverted my SVN working copy to the revision before these changes and kept going. However, it was a good experiment to make, and this way I won't have to torture myself with yet another "what if I had..." thought. Now move along.

Thursday, September 11, 2008

Threads module FINISHED! (for real this time)

After moving all the private data to the .cpp files, cleaning up the API (which means it's stable now, unless I add new functions) and fixing a few things, I'm glad to announce that the threads module is finished!

It cost me many hours of sleep and many e-mail exchanges with Linux kernel experts, but it's finally done.

Now, back to the playback framework.

Wednesday, September 10, 2008

Con Kolivas answers: Yield() is bad. VERY bad.

Con Kolivas, the mind behind the new Linux scheduler, answered a simple question for me.

It's the same Sleep() vs. Yield() issue that has been debated for ages, but since we've already discarded Sleep(), the question this time was more Yield() vs. waiting.

Question: What should I use to put a thread to sleep for a limited time if there's no work to be done? A semaphore, or a yield?

Answer:

Yield just drops the cpu for other processes and threads to go first. It does not specify who should go next, how many processes should use cpu next, nor does it say when to return. Let's say you yield and one nanosecond later the work is available. On a loaded system it's theoretically possible to yield for up to 10 seconds. That's 9.99* seconds too long that you've yielded. Alternatively, if the thread yields for .5 seconds and the work is available at .6 seconds, the yielding thread comes back and there is no work available, that thread now wastes cpu uselessly, checking for work and just yielding again for however long.

With semaphores, you rely on the work actually being available before waking up the thread. There is no wasted cpu at all, and the thread is signalled very fast that it should now wake up.

It takes more effort to keep track of where you've put your semaphores, who's waiting on what and so on, but in the long run the results are potentially much better. How big the gain is depends on how parallelisable your workloads are. Nonetheless, wasting cpu doing nothing when there is a signal mechanism to only wake things when there's work available is much better.

Con


Thanks a lot, Con! Now onto rewriting that userspace mutex again... *sigh*
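
For reference, this is the shape of the semaphore-based wait with POSIX semaphores; Saya's wrappers would sit on top of something like this (a sketch, with made-up names):

#include <semaphore.h>

sem_t g_WorkAvailable; // initialize once with: sem_init(&g_WorkAvailable, 0, 0);

void* WorkerThread(void*) {
    for (;;) {
        sem_wait(&g_WorkAvailable); // sleeps in the kernel; no CPU wasted
        // ... do the work that was signalled ...
    }
    return 0;
}

void PostWork() { // producer: queue the work, THEN signal
    // ... enqueue the work ...
    sem_post(&g_WorkAvailable); // wakes exactly one waiting thread
}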

--
Update: I forgot to thank Joseph Seigh of atomic-ptr-plus; he was the one who suggested using a semaphore in the first place.

Monday, September 8, 2008

Mutexes revisited; Scheduling questions.

I finally completed implementing the sySafeMutex class.

My locking routine was something like this:

bool SafeMutex::Lock() {
    bool success = false;
    while(!success) {
        success = TryLock();
        if(success) { break; }
        if(aborter && aborter->MustAbort()) return false;
        Sleep(1); // Sleep for one millisecond
    }
    return true; // the loop only exits via break, i.e. with the lock held
}

The problem is that if another thread grabs the mutex during that sleep, we run into a problem called resource starvation: the sleeping thread never gets the mutex, because the other threads keep taking the lock while it sleeps through its chance. So instead of sleeping for a whole millisecond, I only sleep for one scheduling cycle:

bool SafeMutex::Lock() {
    bool success = false;
    while(!success) {
        success = TryLock();
        if(success) { break; }
        if(aborter && aborter->MustAbort()) return false;
        Yield(); // Sleep as little as possible
    }
    return true;
}
But the problem of resource starvation still threatens us. So what can we do? I added a tight loop that tries to lock the mutex for the first 30 iterations. A tight loop that repeatedly tries to lock a mutex is called a spinlock. Spinlocks are very CPU-hungry, but combined with sleeping they make things more balanced.

bool SafeMutex::Lock() {
    bool success = false;
    unsigned i = 0;
    while(!success) {
        success = TryLock();
        if(success) { break; }
        if(aborter && aborter->MustAbort()) return false;
        if(++i>=30) { // spin for 30 tries, then yield once
            i = 0; Yield();
        }
    }
    return true;
}
Then I thought: let's make the last thread that used the mutex wait longer the next time:

bool SafeMutex::Lock() {
    bool success = false;
    unsigned i = 0;
    bool waspenalized = false;
    while(!success) {
        // Penalize the previous winner with one extra Yield() per Lock() call.
        if(!waspenalized && m_LastThread == GetCurrentThreadId()) {
            Yield();
            waspenalized = true;
        }
        success = TryLock();
        if(success) { m_LastThread = GetCurrentThreadId(); break; }
        if(aborter && aborter->MustAbort()) return false;
        if(++i>=30) {
            i = 0; Yield();
        }
    }
    return true;
}
By penalizing the last thread that got the mutex with an additional Yield(), we avoid the situation of one thread getting the lock all the time. Unfortunately, this makes the previous winner pay an unnecessary Yield() every single time it tries to get the lock!

After I studied some more, it turns out that mutexes, semaphores and all those synchronization primitives rely much more on thread scheduling than on simple tricks. For example, a semaphore puts the first waiting thread next in line for execution, which goes way beyond simple userspace implementations. But for our purposes, I think the proposed implementation (without the yield) will suffice.

Update: I finally figured it out. Instead of penalizing the last winning thread BEFORE it tries to lock the mutex, penalize it AFTER it tries to lock it. This way, the other threads will spin on the mutex while the last winning thread sleeps. Extend that to the TWO last winning threads, and this is what we get:



bool sySafeMutex::Lock(syAborter* aborter) {
    bool result = false;
    unsigned int i = 0;
    unsigned long id = syThread::GetCurrentId();
    for(;;) {
        result = TryLock(aborter);
        if(result) { break; }
        if(aborter && aborter->MustAbort()) { break; }
        // Scheduling strategy:
        // After the first try, the last owner loses its chance of winning;
        // After the 5th try, the previous-to-last owner loses its chance of winning;
        // After the 50th try, all threads get the same chance of winning.
        if((i >= 50) || (id==m_LastOwner) || (id==m_LastOwner2 && i >= 5)) {
            syThread::Yield();
        } else {
            ++i;
        }
    }
    return result;
}

void sySafeMutex::Unlock() {
    unsigned long id = syThread::GetCurrentId();
    if(m_Owner == id) {
        if(m_Recursive && m_LockCount > 1) {
            --m_LockCount;
        } else {
            // Fully release: remember the last two owners for the
            // scheduling strategy above, then mark the mutex as free.
            m_LockCount = 0;
            m_LastOwner2 = m_LastOwner;
            m_LastOwner = id;
            m_Owner = 0xFFFFFFFF;
        }
    }
}

The RAII pattern; Sentry classes

I just found an amazing article on a seemingly little-known pattern in programming: RAII (Resource Acquisition Is Initialization).

Basically, it is a class whose constructor acquires some resources and whose destructor frees them. This prevents problems such as leaking memory or keeping a file locked when an exception occurs.

The concept can be generalized into "Sentry Classes". These classes perform an action on construction, and on destruction they clean up.

I had wondered what would happen in one of my classes if something raised an exception (code has to be exception-safe, did you know?). We had already guarded against mutexes staying locked after an exception with the class wxMutexLocker (or in our case, syMutexLocker, sySafeMutexLocker, etc.).

But what about flags? In VideoOutputDevice, there's a flag called m_Playing that tells us whether we're sending a frame to the output buffer. What happens when (not if) an exception occurs during that?


void VideoOutputDevice::LoadVideoData(syBitmap* bitmap) {
    bool result = true;

    {
        syMutexLocker mylocker(*m_mutex);
        if(m_Playing || MustAbort()) {
            result = false;
        } else {
            m_Playing = true;
        }
    }

    if(result) {
        LoadDeviceVideoData(bitmap);
        // EXCEPTION OCCURS! The code is aborted and
        // m_Playing is never set to false! The whole class
        // is rendered useless.
        m_Playing = false;
    }
}

Eeew. Looks ugly, doesn't it? Well, pay attention to the new, improved version using sentry classes:

void VideoOutputDevice::LoadVideoData(syBitmap* bitmap) {
    sySafeMutexLocker lock(*m_Busy, this);
    if(lock.IsLocked()) {
        syBoolSetter setter(m_Playing, true);
        LoadDeviceVideoData(bitmap);
    }
}

What? That's it? Yes!! The sySafeMutexLocker tries to lock the safe mutex m_Busy. If successful, we set the m_Playing flag to true, and then proceed to load the data from the bitmap.

Whether or not an exception occurs in LoadDeviceVideoData, the setter is destroyed, restoring m_Playing to its old value (which is false), and on function exit the safe mutex m_Busy is unlocked.
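
(I haven't shown sySafeMutexLocker itself; a minimal version might look like this, assuming Lock() returns false when the aborter fires:)

class sySafeMutexLocker {
public:
    sySafeMutexLocker(sySafeMutex& mutex, syAborter* aborter = 0)
    : m_Mutex(mutex) {
        m_Locked = m_Mutex.Lock(aborter);
    }
    ~sySafeMutexLocker() {
        if (m_Locked) { m_Mutex.Unlock(); } // runs even if an exception unwinds
    }
    bool IsLocked() const { return m_Locked; }
private:
    sySafeMutex& m_Mutex;
    bool m_Locked;
};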

Here's the code for syBoolSetter:
/** @brief Class that sets a flag to a specific value during its lifetime. */
class syBoolSetter {
public:
    /** Constructor. Saves the old value and sets the flag to newvalue. */
    syBoolSetter(bool& flag, bool newvalue);

    /** Destructor. Sets the flag back to its original value. */
    ~syBoolSetter();
private:
    bool& m_Flag;
    bool m_Old;
};

syBoolSetter::syBoolSetter(bool& flag, bool newvalue) :
m_Flag(flag) {
    m_Old = m_Flag;
    m_Flag = newvalue;
}

syBoolSetter::~syBoolSetter() {
    m_Flag = m_Old;
}

Simple, effective and elegant.

Sunday, September 7, 2008

New addition to threads module: Atomic Operations.

I realized that I could simplify a lot of code in the Saya core by replacing some mutexes with atomic operations. I decided to encapsulate them in a class named syAtomic. This means Saya now requires not only GCC compatibility, but also a 486 or better CPU (I added -march=i486 to the compiler flags). Given that most CPUs right now are Pentiums (and you'd certainly need at least a Pentium for video editing :P ), I'm sure this won't be a problem.

Regarding the simplification, here's an example: the VideoInputDevice class has a flag called m_IsBusy, which it sets and unsets whenever it's doing an operation. I realized that, for all intents and purposes, this was in fact a mutex (with an auxiliary mutex to lock access to it). I also realized that the only thing I did while waiting for m_IsBusy was check whether an abort signal had been sent.

So I made my own class, sySafeMutex (now possible thanks to the atomic operations). When a thread tries to lock it, instead of just waiting for the mutex to be unlocked, it periodically checks whether an abort signal has been sent (a syAborter* parameter is passed to the Lock() function). Additionally, I added another flag to optionally run the check in a tight loop for the first 3 milliseconds. This avoids wasted CPU cycles when the only thing we want is to change a variable.
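
The heart of it is a single compare-and-swap. Roughly like this (illustrative names, not the actual syAtomic API):

class sySafeMutexSketch {
public:
    sySafeMutexSketch() : m_Owner(0) {}
    bool TryLock(unsigned long id) {
        // Atomically: if m_Owner == 0 (free), make 'id' the owner.
        return __sync_bool_compare_and_swap(&m_Owner, 0ul, id);
    }
    void Unlock() { m_Owner = 0; }
private:
    volatile unsigned long m_Owner;
};

Lock() then just calls TryLock() in a loop, checking aborter->MustAbort() between attempts.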

I'll also make more simplifications to the code of the a/v device classes. Stay tuned.

Saturday, September 6, 2008

Threads module finished!!

Whew. I thought I would never finish this behemoth. It's still untested, and some things MIGHT not compile on Windows, but that can be fixed with a tiny bit of code revision.

Later I'll decide whether to add support for atomic functions or not.
If anyone's an expert in multithreading and OS design, please help me debug this thing.

Decided to rely on GCC features.

At first I wanted the project to be compilable by any compiler, but I have decided to stick with GCC (which is cross-platform, anyway), for two reasons:

1) GCC provides 64-bit integers for C++, which I need for the timeline.
2) It provides atomic operations, which are VERY useful for efficient multithreading programming.

I won't say more, since it's kinda late and I need sleep. Good night.

Wednesday, September 3, 2008

Lock-free algorithms part 2: CAS

Hello again. I'll wrap up this lock-free programming topic with three links:

The first link is about the compare-and-swap atomic operation. It's atomic because most processors already have an instruction for it, and nothing can be more atomic than a single CPU instruction :)

http://en.wikipedia.org/wiki/Compare-and-swap

In computer science, the compare-and-swap CPU instruction ("CAS") (or the Compare & Exchange - CMPXCHG instruction in the x86 and Itanium architectures) is a special instruction that atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it).

CAS is used to implement synchronization primitives like semaphores and mutexes, as well as more sophisticated lock-free and wait-free algorithms. Maurice Herlihy (1991) proved that CAS can implement more of these algorithms than atomic read, write, and fetch-and-add, and that, assuming a fairly large amount of memory, it can implement all of them [1].
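
To make that concrete, here's the push half of a lock-free (Treiber) stack using GCC's CAS builtin, as a sketch. Push is the easy half; pop is where the ABA problem bites, and where the hazard pointers linked below come in.

struct Node {
    int value;
    Node* next;
};

Node* volatile g_Top = 0;

void Push(int value) {
    Node* node = new Node;
    node->value = value;
    do {
        node->next = g_Top; // snapshot the current top
        // Swing g_Top from our snapshot to the new node, atomically.
        // If another thread pushed in between, the CAS fails and we retry.
    } while (!__sync_bool_compare_and_swap(&g_Top, node->next, node));
}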

The second link is an article from Dr. Dobb's Journal implementing the aforementioned hazard pointers.

http://www.ddj.com/cpp/184401890

And finally, an algorithm from AT&T research to implement STL-compatible (well, more or less) vectors using Hazard Pointers:

http://www.research.att.com/~bs/lock-free-vector.pdf

Enjoy.

Lock-free algorithms and hazard pointers

And here I thought mutexes, critical sections, conditions and semaphores were all that would have to be learned about multithreaded programming. Turns out I was wrong.

It all began with a guy on irc.freenode.net (his nick was dshs) asking how to do inter-thread synchronization for a multimedia project (I invited him to join the team, but he politely declined). He said something about "lock-free queues". So I looked it up on Wikipedia, and here's what I read:

In contrast to algorithms that protect access to shared data with locks, lock-free and wait-free algorithms are specially designed to allow multiple threads to read and write shared data concurrently without corrupting it. "Lock-free" refers to the fact that a thread cannot lock up: every step it takes brings progress to the system. This means that no synchronization primitives such as mutexes or semaphores can be involved, as a lock-holding thread can prevent global progress if it is switched out. "Wait-free" refers to the fact that a thread can complete any operation in a finite number of steps, regardless of the actions of other threads. All wait-free algorithms are lock-free, but the reverse is not necessarily true. An intuitive way of understanding the difference between wait- and lock-free algorithms is that a thread executing an operation of a lock-free algorithm may not be impeded if another thread's execution is prematurely halted, whereas if the algorithm was wait-free, the first thread may not be impeded even if the second thread is aggressively interfering with the shared state.

Lock-free algorithms are one kind of non-blocking synchronization.

As I kept reading, I learned that algorithms for lock-free stacks, queues and even vectors are already out there. This means you can read and write all the stuff you want in a multithreaded environment without using a single semaphore, mutex or condition. Wow.

And that wasn't all. I just learned that a NEW algorithm called "Hazard pointers" was created for non-blocking synchronization, in 2004.

So here's the description of a Hazard Pointer, right from Wikipedia:

In a multithreaded computing environment, a hazard pointer is an element used by a methodology that allows the memory allocated to the nodes of lock-free dynamic shared objects to be reclaimed. Using the methodology, each thread keeps a list of hazard pointers indicating which nodes the thread may later access. This list can only be written to by the particular thread but can be read by any thread. When a thread wishes to remove a node, it places it on a private list and periodically scans the lists of all other threads for pointers referencing that node. If no such pointers are found the memory occupied by the node can be safely freed.
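
A deliberately simplified sketch of the idea, for ONE reader thread with ONE hazard slot (the real methodology handles many threads and many slots; see the paper linked below):

#include <vector>

struct Node { int data; Node* next; };

Node* volatile g_Head   = 0;  // the shared lock-free structure
Node* volatile g_Hazard = 0;  // the reader's published hazard pointer
std::vector<Node*> g_Retired; // the reclaimer's private retired list

// Reader: publish the pointer BEFORE using it, then re-check that it is
// still current, so the reclaimer is guaranteed to see our claim.
Node* AcquireHead() {
    Node* p;
    do {
        p = g_Head;
        g_Hazard = p;
        __sync_synchronize();  // make the hazard visible to the reclaimer
    } while (p != g_Head);     // changed underneath us? retry
    return p;                  // safe to dereference until released
}

void ReleaseHead() { g_Hazard = 0; }

// Reclaimer: a removed node is freed only when no hazard pointer names it.
void Retire(Node* removed) {
    g_Retired.push_back(removed);
    for (unsigned int i = 0; i < g_Retired.size(); ) {
        if (g_Retired[i] != g_Hazard) {
            delete g_Retired[i];             // nobody is reading it: free it
            g_Retired[i] = g_Retired.back(); // compact the list
            g_Retired.pop_back();
        } else {
            ++i;                             // still hazardous: keep for later
        }
    }
}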

This was JUST four years ago! It's a revolution that could be comparable to fast genetic sequencing, and I'm only learning about it right now! Where have I been these years? Oh, right. Working.

If you want to know more about Hazard pointers, you can read the Hazard Pointers Full Article (PDF), courtesy of IBM Research.

I don't know if this will be of use for Saya later, where we'll have to deal with low-latency stuff quickly. But I'll probably need it eventually.