This article discusses synchronization mechanisms in DirectShow (part of the
DirectX Media SDK) for playing a video stream synchronized with its audio stream.
The article provides custom solutions for that purpose, using the Microsoft
Foundation Classes framework.
Playing a video stream
on a standard PC is not easy or natural. If the video stream is encoded at a
frame rate of, say, 25 images a second, one cannot guarantee that a given PC is
able to play the stream on screen at exactly that rate. Whether we are talking
about an old Pentium-based PC or a recent one, having more horsepower in hand
moves the problem rather than solving it. Indeed, a PIII-based PC is very likely
able to play the stream at such a rate, but if you're doing something else at the
same time, say compiling source code or processing something in the background,
the computer will drastically lose performance.
Core mechanisms in DirectShow are designed to address this issue. The accepted,
and working, solution is to play the video stream as fast as possible while
staying in sync with the audio stream, and to drop images from the video stream
whenever it loses synchronization with the audio stream.
In this article,
we discuss synchronization mechanisms from several points of view, and provide
custom solutions along the way, with and without multithreading:
- Solution #1: using CWinApp::OnIdle from the MFC framework
- Solution #2: using asynchronous notifications through a custom video renderer filter
- Solution #3: using asynchronous streaming through multithreading
Why video loses synchronization with audio
It's well known that
audio decoding needs less CPU time than video decoding. This is because, for a
given time duration, there is much more data to decode in the video stream than
in the audio stream. Thus, whenever there is a lack of synchronization between
the video and the audio stream, it's very likely the video stream that is
struggling to catch up with the audio stream's timestamp.
For this reason,
if the computer or software had to choose a reference stream out of the two,
it would be the audio stream. Good reasons for that are:
- The audio stream can't be delayed without producing awful, undesirable sounds
- The video stream takes much more bandwidth than the audio stream
The GraphEditor tool
from the SDK exemplifies this quite well. Let's launch it, insert a source filter
for a video file, then Render its output pin. You can see that once the filter
graph construction is finished, the audio renderer has a yellow circle in it,
indicating that it provides the reference clock for the whole stream. Because of
this, when you play the video file, the video stream will do its best to stay
close to the audio stream, so that only small, inaudible delays occur, with the
immediate effect of dropped image frames.
You may prefer not
to use any reference clock, in which case the two streams run their own way.
Let's try it: go to the Graph menu, uncheck "Use Clock", then run the video file.
Both streams will go as fast as possible, without trying at all to synchronize.
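The same thing can be done in code. A minimal sketch, assuming pGraph holds the
IGraphBuilder interface of an already constructed filter graph:

    CComPtr <IMediaFilter> pMF;
    if (SUCCEEDED(pGraph->QueryInterface(IID_IMediaFilter, (void **)&pMF)))
        pMF->SetSyncSource(NULL);   // no reference clock: streams run free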
The audio reference clock helps to synchronize the streams
In a piece of code
using multimedia streaming without reference clock customization, the audio
renderer provides the reference clock.
MultimediaStreaming is a part of DirectShow which simplifies playing/encoding,
with a set of interfaces that help play a video/audio stream by only calling
a given method, Update(), as often as possible until the streams are completed.
These are the multimedia streaming interfaces. Check out the documentation at
the following node: DirectShow / MultimediaStreaming / MultimediaStreaming reference.
The multimedia streaming interfaces can be regarded as interfaces sitting on top
of the filter graph interfaces. That's not strictly true, but it helps in
understanding that DirectShow doesn't require you to use one set of interfaces
and nothing else. The multimedia streaming interfaces help play a stream: just
say OpenFile, Run, then call Update as long as the stream is not completed, then
Stop, and good bye. They can also be used for streams output to the disk. They
use a different set of functions and metaphors than the filter graph, and are
suitable for progressive rendering under your own control.
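In its simplest form, the resulting playback loop looks like the sketch below,
assuming pAMStream and pMediaSample have been created as shown later in this article:

    // run the stream, then pump frame buffers until the stream is done
    pAMStream->SetState(STREAMSTATE_RUN);
    while (pMediaSample->Update(0, NULL, NULL, 0) == S_OK)
    {
        // blit the DirectDraw surface attached to pMediaSample on screen here
    }
    pAMStream->SetState(STREAMSTATE_STOP);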
Information about dropped frames
Let me recall first
that trying to sync is an internal job. You can't customize it without writing
one or more filters yourself, in which some clock will either synchronize with
some other clock or assert itself as the reference clock.
To get statistics
about how many frames are dropped per second, you may use another, more verbose
video renderer filter. For example, the SampVid.ax video renderer filter
is ready in the /bin directory of the DirectShow SDK. Let's launch the GraphEditor,
insert the filter called "Sample video renderer" from the list of DirectShow
filters, then insert a source filter and render its output pin. The filter
graph manager will construct a filter graph using the video renderer filter
you have inserted. It features a property page giving lots of information about
what we are dealing with.
The property page from the "Sample video renderer" filter does give information.
Let's note, by the way, that the full source code for this filter is available
in the samples directory.
For example, the renderer
frame rate can be compared with the frame rate encoded in the stream itself,
which can be retrieved with code like this (a sketch; IDirectDrawMediaStream::GetTimePerFrame
returns the frame duration in 100-nanosecond units):

// returns the encoded frame duration, in seconds
double CMovieView::getTimePerFrame()
{
    STREAM_TIME result = 0;
    if (m_pDDStream != NULL)
        m_pDDStream->GetTimePerFrame(&result);
    return ((REFTIME)result) / 10000000;
}

The frame rate is thus 1 / getTimePerFrame(). getTimePerFrame()
often returns 0.040 s (40 milliseconds), which is 25 frames a second.
I can live without audio too
For a given set of
video-based applications, there is no particular need to render the audio stream.
In this case, synchronization is not needed anymore. For example, an application
that automatically cuts the video stream into sequences of short shots, basing
the cut algorithm solely on image details, doesn't need audio at all.
For this purpose,
let's first of all take the following piece of code (without error handlers):
CComPtr <IAMMultiMediaStream> pAMStream;
CoCreateInstance( CLSID_AMMultiMediaStream, NULL, CLSCTX_INPROC_SERVER,
IID_IAMMultiMediaStream, (void **)&pAMStream);
pAMStream->Initialize(STREAMTYPE_READ, 0, NULL);
pAMStream->AddMediaStream(m_pDD3, &MSPID_PrimaryVideo, 0, NULL);
pAMStream->AddMediaStream(NULL, &MSPID_PrimaryAudio, AMMSF_ADDDEFAULTRENDERER, NULL);
This opens a standard
video/audio stream. Now let's replace the last three lines of code so that no
reference clock is used and no audio stream is added:
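A sketch of the replacement (the key points are the AMMSF_NOCLOCK flag passed
to Initialize(), and the absence of the audio AddMediaStream() call):

    pAMStream->Initialize(STREAMTYPE_READ, AMMSF_NOCLOCK, NULL);
    pAMStream->AddMediaStream(m_pDD3, &MSPID_PrimaryVideo, 0, NULL);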
Now there is no audio
anymore, and no sync. Video frames won't be dropped. This is a suitable piece
of code for critical applications that require rendering every frame of the video.
Please also notice that because there is no reference clock, the frame rate
won't be calculated by the renderer, so the piece of code given above doesn't
work in this context. You will have to retrieve the timestamps of each media
sample, and for example calculate the midpoint from the begin and end times.
These, as a series of data, help to calculate an average frame rate.
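A minimal sketch, assuming pMediaSample is the stream sample of the primary
video stream:

    STREAM_TIME tStart, tEnd, tCurrent;
    if (SUCCEEDED(pMediaSample->GetSampleTimes(&tStart, &tEnd, &tCurrent)))
    {
        STREAM_TIME tMid = (tStart + tEnd) / 2;   // midpoint, in 100 ns units
        // accumulate successive midpoints: the average difference between
        // two of them is the average time per frame, hence the frame rate
    }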
What about the video size & color depth
I haven't underlined
much at this point the fact that the video is rendered either in a window or
fullscreen, that is, in any case it covers a given surface of the desktop window.
The CPU time needed to blit the current video frame, which comes in the initial
core video size, to the final size is proportional to both the extents in X and
Y and the color depth. For this reason, it's advised to use hardware acceleration
as much as possible. The set of multimedia streaming interfaces lets you create
a video stream with a custom DirectDraw surface at the end, so it's up to you to
do what you want with it. The standard thing is to stretch blit the surface so
that it fills the client rect of a given working video window, or the equivalent
in fullscreen mode. Check out the samples.
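A sketch of such a stretch blit, assuming m_pDDSPrimary and m_pDDSOffscreen are
the primary and offscreen DirectDraw surfaces of the view:

    CRect rc;
    GetClientRect(&rc);
    ClientToScreen(&rc);   // the primary surface uses screen coordinates
    m_pDDSPrimary->Blt(&rc, m_pDDSOffscreen, NULL, DDBLT_WAIT, NULL);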
Many applications simply render the video at its initial size. This is of course
a preferred solution to avoid synchronization problems with the audio, but tell
me why the hell all my windows can be stretched while this damn expensive
third-party video application I am using doesn't allow stretching? All of you
third-party developers, what about Windows compliance?
A note about color
depth: as DirectDraw doesn't allow color mapping on the fly, you will have
to either insert an additional Color Space Converter filter or, even
worse, use the standard BitBlt/StretchBlt methods from the GDI.
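A sketch of inserting the converter by hand, assuming m_pGraph holds the filter
graph (CLSID_Colour is the Color Space Converter filter declared in uuids.h):

    CComPtr <IBaseFilter> pCSC;
    if (SUCCEEDED(CoCreateInstance(CLSID_Colour, NULL, CLSCTX_INPROC_SERVER,
                                   IID_IBaseFilter, (void **)&pCSC)))
        m_pGraph->AddFilter(pCSC, L"Color Space Converter");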
Samples in the DirectShow SDK
Dealing with video/audio
synchronization, the DirectShow SDK comes with three main samples worth looking at.
The first is in the
sample directory. It's called ShowStrm. It plays a video/audio stream
using the set of multimedia streaming interfaces. It's small and self-explanatory,
but it also lacks some windowing stuff.
The second, on which I will base all explanations, is in plain text in the
documentation itself.
It's called MovieWin. Let's check it out at the following node:
DirectShow / Application developer's guide / How to / Play a movie in a window using
DirectDrawEx and multimedia streaming / Entire MovieWin example code.
MovieWin is not in the sample directory.
The other sample
is VidClip, from the sample directory. It encodes a video/audio stream
to an AVI video/audio stream using the multimedia streaming interfaces for both
reading and writing at the same time. It's a good starting point for an encoder
under DirectShow.
As MovieWin is written
in raw WIN32, I provide an MFC version of this sample, so we'll be able, without
much work, to use one of the simplified OnIdle mechanisms that the MFC framework
provides. The source code for it is available on this web site.
In the following,
I try to exhibit several custom solutions for managing synchronized video/audio
rendering:
- The first of them is basic. It uses the standard OnIdle mechanism from the
CWinApp MFC class.
- The second uses a custom renderer filter notification mechanism.
- The third uses multithreading.
All of them are
working solutions. The reason I wanted to show all of them is that, while the
first is straightforward, it has a potential drawback: the CPU keeps calling
RenderToSurface() inside the OnIdle() method, resulting in full CPU load the
whole time the application is running. The two other solutions don't have this
drawback, but are significantly more complex to implement.
Solution #1: rendering mechanism using OnIdle
I use the set of
multimedia streaming interfaces to construct the filter graph and run the
video/audio stream. This is standard stuff.
I don't use the simple
FilterGraph metaphor (as in the MFCPlay sample from the SDK), because
it doesn't give the control you may need over the rendered surface. For example,
MFCPlay renders a multimedia stream to a default ActiveMovie window. What if
you want to blit the frame buffer somewhere else? In fact, there is not much to
say about rendering multimedia streams through a simple FilterGraph Play(),
because all the synchronization is handled at a hidden level; that speaks for
itself at demo time, but is unsuitable for a real video application.
We know that, relying
on this hidden controlling mechanism, not all frames will be shown. Some will be
dropped depending on the capacity of the CPU to handle video/audio streaming,
decoding and rendering.
But what's interesting
to check now is which process actually tells the video engine to render each
newly available frame buffer (the image media sample) ready for the final blit
on screen.
In fact, we have
nothing else to do than render each new frame buffer, then wait for the new
frame buffer to come. It looks like we are going to call the Update() method
in such a way that this method doesn't return until the frame buffer is ready
for final blit. At this point, it should be noted that we have the option to
call Update() with an ASYNC flag so that the method immediately returns, and
it's up to us to know when the new frame buffer is ready. More on that later
in this article.
Let's get back to it. It's up to our video
engine (the CMovieView class) to call the Update() method:

if (pMediaSample->Update(0, NULL, NULL, 0) != S_OK)
    bAppactive = FALSE;   // stream completed, or an error occurred
Now, what class/component will call
the RenderToSurface() method, and how often?
And, while we are
at it, why don't we loop the RenderToSurface() call in the video engine itself
until the stream is finished? That's simple: who would then process the windows
messages for this application? Usually a CView processes a message, then gives
control back to the core Afx code, which more or less calls the Run/OnIdle
mechanism from the CWinApp entry-point class (inherited from CWinThread).
Now for the main loop (a sketch; returning TRUE asks MFC to keep calling OnIdle):

BOOL CMovieApp::OnIdle( LONG lCount )
{
    CWinApp::OnIdle(lCount);
    STREAM_TIME total = m_viewChild->getDuration();
    STREAM_TIME current = m_viewChild->getCurrentPosition();
    long pos = long( current * MAXFRAME / total );   // progress position
    m_viewChild->RenderToSurface();                  // render the next frame
    return TRUE;   // keep being called, even without new messages
}

m_viewChild is a pointer to the CMovieView class (the video engine).
The OnIdle() method
is called from within the main Run() method of CWinApp. It's called whenever
there are no more windows messages to process and dispatch to the child windows.
On one hand, this
implementation is straightforward: we don't need a timer to signal that we
should update the buffer on screen. On the other hand, the CPU is always working,
and thus has no real idle time. If you launch the performance monitor, you will
immediately notice this.
Download the source code for solution #1: rendering mechanism using OnIdle.
Solution #2: asynchronous notifications through a custom video renderer filter
Here we get a bit
deeper. For full details about building a video renderer filter, check one of
my other articles; it features and explains everything about that.
Though the OnIdle()
technique works really well, it's not exactly suitable for an application which
does other processing at the same time, i.e. needs CPU time to process something
else, for example applying a real-time filter to a part of the current frame
buffer.
The idea is to move to
the asynchronous metaphor. We tell the DirectShow engine to render the
new frame buffer and, while we are at it, we register as a party receiving a
notification when the new frame buffer is ready (or the stream is complete, or
whatever error occurs). This way, there is no synchronous Update() call at all.
Once the stream is run, it notifies us of any new frame buffer, and we can even
ignore the notifications if we want. The stream is still controlled with methods
such as Pause()/Stop()/Run()/Seek(nFrame).
That's the filter graph metaphor.
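For example, controlling the stream through the filter graph manager looks like
this sketch, assuming m_pGraph holds the graph interface introduced below:

    CComPtr <IMediaControl> pMC;
    if (SUCCEEDED(m_pGraph->QueryInterface(IID_IMediaControl, (void **)&pMC)))
    {
        pMC->Run();     // start streaming; notifications will follow
        // ...
        pMC->Pause();
        pMC->Stop();
    }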
Well, in fact, we
are using the filter graph manager. The custom thing is a renderer filter which
is able to use a notification sink interface and communicate with the filter
graph component to tell it a new frame buffer is ready. And, as the filter graph
lives at the application level, we register as listeners for any new messages
that may be sent to it, and proceed as we want. That's the magic behind it. Now
for the code.
Below is the filter graph construction
mechanism. We use the standard filter graph manager method calls. That is, we
create an empty filter graph, then we add our renderer filter by hand (we know
its CLSID), then we add a source filter and call Render on its output pin,
exactly as we would do with the GraphEditor.
HRESULT CMovieView::RenderFile(LPCTSTR szFilename)
{
    WCHAR wPath[MAX_PATH];
    CComPtr <IBaseFilter> pFilter;
    if ( !CreateFilterGraph() )
        return E_FAIL;
    // add our custom renderer first, so that RenderFile() connects to it
    HRESULT hr = CoCreateInstance(CLSID_ARST_DirectDrawVIDEORenderer, NULL,
                   CLSCTX_INPROC_SERVER, IID_IBaseFilter, (LPVOID *)&pFilter);
    if (SUCCEEDED(hr))
        hr = m_pGraph->AddFilter(pFilter, L"ARST DirectDraw Video Renderer");
    MultiByteToWideChar( CP_ACP, 0, szFilename, -1, wPath, MAX_PATH );
    if (FAILED( m_pGraph->RenderFile(wPath, NULL) ))
        return E_FAIL;
    return hr;
}
We then create a
listener for all notification messages sent to the filter graph. As you can
see, we also register the EC_REPAINT message as a specific message that we will
handle ourselves (it is normally handled by the filter graph). What follows is
a piece of code which can be embedded in any init() routine:

CComPtr <IMediaEvent> pME;
hr = m_pGraph->QueryInterface(IID_IMediaEvent, (void **) &pME);
hr = pME->GetEventHandle((OAEVENT*) &m_hGraphNotifyEvent);
hr = pME->CancelDefaultHandling(EC_REPAINT);   // we handle EC_REPAINT ourselves
The EC_REPAINT message is sent each
time a new frame buffer is ready for final rendering to screen. What we
do in the main active loop is check for any message received and process it
accordingly. The GetGraphEventHandle() method retrieves the handle of the
listener we have just created:
int CMovieApp::Run()
{
    if (m_pMainWnd == NULL && AfxOleGetUserCtrl())
    {
        TRACE0("Warning: m_pMainWnd is NULL in CMovieApp::Run - quitting application.\n");
        ::PostQuitMessage(0);
    }
    BOOL bIdle = TRUE;
    LONG lIdleCount = 0;
    const int cObjects = 1;
    HANDLE ahObjects[cObjects];
    if( (ahObjects[ 0 ] = GetView()->GetGraphEventHandle()) == NULL )
        return CWinApp::Run();   // no graph event: default message loop
    for (;;)
    {
        // idle phase: call OnIdle while no message is waiting
        while ( bIdle && !::PeekMessage(&m_msgCur, NULL, NULL, NULL, PM_NOREMOVE))
            if (!OnIdle(lIdleCount++))
                bIdle = FALSE;
        do
        {
            // sleep until a graph event or a windows message arrives
            DWORD result = ::MsgWaitForMultipleObjects( cObjects, ahObjects,
                               FALSE, (bIdle ? 0 : INFINITE), QS_ALLINPUT );
            if( result != (WAIT_OBJECT_0 + cObjects) )
            {
                if( result == WAIT_OBJECT_0 ) GetView()->OnGraphNotify();
                else if( result == WAIT_TIMEOUT ) bIdle = FALSE;
            }
            else if (!PumpMessage())         // a windows message is pending
                return ExitInstance();
            if (IsIdleMessage(&m_msgCur))
            {
                bIdle = TRUE;
                lIdleCount = 0;
            }
        } while (::PeekMessage(&m_msgCur, NULL, NULL, NULL, PM_NOREMOVE));
    }
}
The structure of the method is such that it sleeps, wakes up when EC_REPAINT (or any
other event notification code) arrives, processes it, and then processes all
windows messages, before calling OnIdle and looping again. All in all, the CPU
performs an active sleep. When your application is paused, the CPU usage is
almost 0, and when it's streaming, it only takes the processing time needed for
the final rendering on screen: nothing like the previous solution.
Please note that we override
the CWinApp::Run() method, not OnIdle(), which is left with its default
implementation, i.e. the CPU can properly turn to an idle state.
Each time we get
a new message (not necessarily EC_REPAINT, it can be EC_COMPLETE; see the
following node in the official documentation for a full list of event codes:
DirectShow / C++ reference / Event Notification codes), we call back the
OnGraphNotify() method implemented by our video engine.
For any EC_REPAINT message received,
we get the parameters passed along with the message itself, and perform the
final rendering on screen. To be short, because it's explained in another article
on this web site, the other fundamental parameter passed is a pointer to a
core DirectDraw surface fully handled by our custom renderer filter. Thus this
message gives us all the data needed for the final blit.
The OnGraphNotify() implementation
must also process other event codes, such as EC_COMPLETE, otherwise the program
may break down at the end of the stream, for example:
void CMovieView::OnGraphNotify()
{
    long lEventCode, lParam1, lParam2;
    CComPtr <IMediaEvent> pME;
    ASSERT( m_hGraphNotifyEvent != NULL );
    ASSERT( m_pGraph != NULL);
    if( SUCCEEDED(m_pGraph->QueryInterface(IID_IMediaEvent, (void **) &pME)))
        while( SUCCEEDED(pME->GetEvent(&lEventCode, &lParam1, &lParam2, 0)))
        {
            if (lEventCode == EC_REPAINT)
            {
                // the renderer passes its offscreen DirectDraw surface along
                m_pDDSOffscreen2 = (IDirectDrawSurface*)lParam1;
                RenderToSurface();           // final blit on screen
            }
            else if (lEventCode == EC_COMPLETE)
                Stop();                      // end of stream
            pME->FreeEventParams(lEventCode, lParam1, lParam2);
        }
}
Now what about synchronization?
It's working and it's hidden! The cascaded filters manage the video &
audio streams. The audio renderer provides a reference clock, which is the
default behaviour whenever you use the filter graph manager. Thus any frame
that can't be processed in time by the filters is dropped, and the frames we
get back are frames that are synchronized with the audio stream, hence
candidates for consistent, immediate rendering on screen while the sister audio
data is being heard.
Solution #3: asynchronous streaming through multithreading
This time we get back
to the multimedia streaming interfaces. We haven't yet used all the power behind
the Update() method of the IDirectDrawStreamSample interface. Indeed, when
passing no specific flag, the Update() call returns as soon as, but not before,
the next frame buffer is ready for final rendering. But Update() has more to
offer: we may use the SSUPDATE_ASYNC flag and pass along either a WIN32 event
to signal when the next frame buffer is ready, or a pointer to a callback
function which is automatically called at that time.
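The callback variant is a WIN32 APC routine; note that an APC only runs when the
thread that called Update() enters an alertable wait. A sketch, where
OnFrameReady is a hypothetical name:

    VOID CALLBACK OnFrameReady(DWORD dwParam)
    {
        // notify the video engine that the frame buffer should be blitted
    }
    ...
    pMediaSample->Update(SSUPDATE_ASYNC, NULL, OnFrameReady, 0);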
The implementation we suggest uses the event model from WIN32. At init time, we
create an apartment thread (i.e. a thread with a windows message loop, which
helps serialize method calls), then create an instance of a trigger event, and
pass the handle of this event along to the thread.
Then what we do in
this thread is trivial: we wait for the signal, and then send a message to
the video engine to notify it that the frame buffer should be rendered now.
Otherwise, we simply keep waiting.
The following shows the code for the auxiliary thread:
// declarations, from the class header:
//   BOOL StartThread(CMovieView *pView, DWORD WaitTime = 500);
//   static DWORD WINAPI RealThreadProc(void* pv);
//   BOOL WaitWithMessageLoop(HANDLE hEvent);

CStreamUpdater::CStreamUpdater()
{
    m_ThreadId = 0;
    m_hThread = NULL;
    m_hComponentReadyEvent = NULL;
    m_WaitTime = 500;
}

BOOL CStreamUpdater::StartThread(CMovieView *pView, DWORD WaitTime)
{
    m_cpView = pView;
    // create the thread suspended; resume it once the event exists
    m_hThread = ::CreateThread(NULL, 0, RealThreadProc, this,
                               CREATE_SUSPENDED, &m_ThreadId);
    if (m_hThread == NULL)
        return FALSE;
    m_hComponentReadyEvent = ::CreateEvent(NULL, FALSE, FALSE, NULL);
    if (m_hComponentReadyEvent == NULL)
        return FALSE;
    m_WaitTime = WaitTime;
    m_bShouldStopNow = FALSE;   // set to TRUE later to stop the thread
    DWORD r = ResumeThread(m_hThread);
    assert(r != 0xffffffff);
    return (m_hThread != NULL);
}

DWORD WINAPI CStreamUpdater::RealThreadProc(void* pv)
{
    CStreamUpdater* pApartment = reinterpret_cast<CStreamUpdater*>(pv);
    HANDLE hUpdater = pApartment->m_hComponentReadyEvent;
    BOOL bContinue = TRUE;
    while (bContinue && !pApartment->m_bShouldStopNow)
    {
        switch( ::WaitForSingleObject( hUpdater, pApartment->m_WaitTime) )
        {
        case WAIT_OBJECT_0:   // frame buffer ready: tell the video engine
            pApartment->m_cpView->PostMessage(WM_USER+100, 0, 0);
            break;
        case WAIT_TIMEOUT:    // nothing yet, loop again
            break;
        default:
            bContinue = FALSE;
        }
    }
    return 0;
}

BOOL CStreamUpdater::WaitWithMessageLoop(HANDLE hEvent)
{
    MSG msg;
    DWORD dwReturn = ::MsgWaitForMultipleObjects(1, &hEvent,
                         FALSE, INFINITE, QS_ALLINPUT);
    if (dwReturn == WAIT_OBJECT_0)
        return TRUE;    // the event is signaled
    else if (dwReturn == WAIT_OBJECT_0 + 1)
        while(::PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
            ::DispatchMessage(&msg);   // keep serving windows messages
    return FALSE;
}
Please note that
we need neither a custom CWinApp::OnIdle() nor a custom CWinApp::Run()
implementation at all. For this reason, the CPU won't be loaded up to 100% as
in solution #1.
Whenever the auxiliary
thread gets the signal, it doesn't call RenderToSurface(), the method performing
the final rendering on screen, because if it did, it would do it from this
particular execution unit (this thread), whereas the rendering step should
always be performed by the video engine itself. Thus, a message is sent to the
window wrapping the video engine, and the rendering step is actually done when
the message is processed by the message loop within the CWinApp::Run()
implementation. Because a lot of other messages may occur, including
WM_MOUSEMOVE and so on, the WM_USER+100 message will probably not be processed
immediately. Make sure no message handler takes a long processing time, because
otherwise there would be a delay between the moment the frame buffer is ready
(and the audio sound heard) and the moment the image is actually seen on screen.
Now, what's left
to see is the method called when the WM_USER+100 message is processed:

ON_MESSAGE( WM_USER+100, OnRefresh )

afx_msg void CMovieView::OnRefresh(UINT wparam, LONG lparam)
{
    TRACE ("Refresh msg received\n");
    RenderToSurface();   // final blit on screen
    Update(FALSE);       // re-arm the asynchronous update
}

void CMovieView::Update(BOOL bForceUpdate)
{
    if (!m_bFileLoaded) return;
    if ( bForceUpdate || IsPlaying() )
        pMediaSample->Update(SSUPDATE_ASYNC, m_hUpdateEvent, NULL, 0);
}
Note also that when
the multimedia stream is run, you should call the previous method once so that
the signaling is ignited.
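For instance, right after running the stream (a sketch, assuming m_pAMStream
holds the IAMMultiMediaStream pointer):

    m_pAMStream->SetState(STREAMSTATE_RUN);
    Update(TRUE);   // first SSUPDATE_ASYNC call, ignites the signaling chain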
Note that there is another option for the Update() call. Still using the
SSUPDATE_ASYNC flag, we could instead pass no event handle at all, and
iteratively call CompletionStatus(), a method implemented at the same level
as Update(). This method helps in looping and waiting for a new frame buffer.
A possible implementation can be:
pMediaSample->Update(SSUPDATE_ASYNC, NULL, NULL, 0);
do {
    hr = pMediaSample->CompletionStatus(COMPSTAT_WAIT, 10);   // wait at most 10 ms
} while (hr == MS_S_PENDING);
In other words, the
return value of CompletionStatus() is MS_S_PENDING while a frame buffer is
pending, until it turns to either S_OK or an error. Each call returns when the
frame is ready, or after at most 10 milliseconds. This rate works for most cases
because many video streams are encoded at 15 or 25 frames per second, that is,
at least 40 milliseconds between two frames.
Download the source code for solution #3: using a separate thread for asynchronous rendering.
This article has
presented several synchronization techniques for rendering multimedia
streams under DirectShow. The source code has no copyright.
Stephane Rodriguez -
September 20, 1999