Subtitle Synchronization with C#






Demonstrates the use of regular expressions for subtitle synchronization
Introduction
This article shows an example of text file manipulation techniques and regular expression use for subtitle synchronization.
Background
Yesterday I watched "P.S. I love you", and I loved it (no pun intended). I had downloaded it and come across a subtitle file with a correct translation, but it was out of sync. So I started to write a small program to fix that.
In my research, I found nice programs to do that, but it was early Sunday and I had nothing better to do. You know how it ends.
Using the Code
This is a console application which receives a filename. I expect that to be a regular .SRT (SubRip) file, so it's a text file with the following structure:
203
00:16:38,731 --> 00:16:41,325
<i>Happy Christmas, your arse
I pray God it's our last</i>
So we have a sequential number, start and end time for display and the text; those blocks are separated by an empty line. In my case, I just wanted to add a time offset.
I started by creating a regular expression for that pattern:
    private static Regex unit = new Regex(
        @"(?<sequence>\d+)\r\n(?<start>\d{2}\:\d{2}\:\d{2},\d{3}) --\> " +
        @"(?<end>\d{2}\:\d{2}\:\d{2},\d{3})\r\n(?<text>[\s\S]*?\r\n\r\n)",
        RegexOptions.Compiled | RegexOptions.ECMAScript);
I used named groups, written as (?<name>), to identify the relevant parts of each text unit, so we can read:
- the text unit sequence, as one or more digits, and a line break
- the start and end times, formatted as HH:mm:ss,fff, and a line break
- the subtitle text and the end of that block (two consecutive line breaks)
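To make the named groups concrete, here is a minimal sketch that runs a lightly simplified variant of the pattern against the sample unit shown earlier. The class name SrtUnitDemo and the helper Describe are hypothetical, and the variant drops the redundant escapes and the ECMAScript option, so it is an illustration rather than the exact expression used by the program.

```csharp
using System;
using System.Text.RegularExpressions;

static class SrtUnitDemo
{
    // Simplified variant of the article's pattern (assumption: no ECMAScript
    // option, no redundant escapes). Each unit must end with a blank line.
    static readonly Regex Unit = new Regex(
        @"(?<sequence>\d+)\r\n(?<start>\d{2}:\d{2}:\d{2},\d{3}) --> " +
        @"(?<end>\d{2}:\d{2}:\d{2},\d{3})\r\n(?<text>[\s\S]*?\r\n\r\n)");

    // Returns "sequence|start|end" for the first unit found, or null if
    // nothing in the input matches.
    public static string Describe(string srt)
    {
        Match m = Unit.Match(srt);
        if (!m.Success)
        {
            return null;
        }
        return m.Groups["sequence"].Value + "|" +
               m.Groups["start"].Value + "|" +
               m.Groups["end"].Value;
    }

    static void Main()
    {
        string sample =
            "203\r\n" +
            "00:16:38,731 --> 00:16:41,325\r\n" +
            "<i>Happy Christmas, your arse\r\n" +
            "I pray God it's our last</i>\r\n" +
            "\r\n";

        Console.WriteLine(Describe(sample)); // prints 203|00:16:38,731|00:16:41,325
    }
}
```

Note that a unit is only matched when it is followed by an empty line, so a final unit without a trailing blank line would be left untouched.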
Believe me, that's the "hardest" part. Let's ask for the time offset:
    double offset = 0;
    Console.Write("offset, in seconds (+1.1, -2.75): ");
    while (!Double.TryParse(Console.ReadLine(), out offset))
    {
        Console.WriteLine("Invalid value, try again");
    }
Note the use of Double.TryParse. As you can probably imagine, it tries to parse a string into a double value, but it doesn't throw an exception when parsing fails; it returns false instead. That is very useful when validating user input, and avoiding exceptions can also speed up execution.
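One thing worth knowing is that the two-argument Double.TryParse uses the current culture, so on a machine whose decimal separator is a comma, an input such as 1.1 may not be read as intended. The sketch below is one possible way to pin the parsing to the invariant culture; the class OffsetInput and the method TryReadOffset are hypothetical names, not part of the original program.

```csharp
using System;
using System.Globalization;

static class OffsetInput
{
    // Parses an offset such as "+1.1" or "-2.75" using '.' as the decimal
    // separator regardless of the machine's regional settings.
    // Returns false (and leaves offset at 0) when the text is not a number.
    public static bool TryReadOffset(string text, out double offset)
    {
        return Double.TryParse(
            text,
            NumberStyles.Float,
            CultureInfo.InvariantCulture,
            out offset);
    }

    static void Main()
    {
        double offset;
        Console.Write("offset, in seconds (+1.1, -2.75): ");
        while (!TryReadOffset(Console.ReadLine(), out offset))
        {
            Console.WriteLine("Invalid value, try again");
        }
        Console.WriteLine("Using offset: " +
            offset.ToString(CultureInfo.InvariantCulture));
    }
}
```

Whether to pin the culture is a design choice: if the tool is meant for users who type a comma as the decimal separator, the culture-sensitive overload used in the article may be the better fit.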
Now we just need to read one file and write another one.
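    // Note: 'sequence' is not declared in this excerpt; it is assumed to be an
    // int counter defined elsewhere (for example, static int sequence = 1;)
    // that renumbers the units as they are written.
    using (StreamReader input = new StreamReader(args[0], Encoding.Default))
    {
        using (StreamWriter output =
            new StreamWriter(args[0] + ".srt", false, Encoding.Default))
        {
            output.Write(
                // The MatchEvaluator delegate runs once per matched unit.
                unit.Replace(input.ReadToEnd(), delegate(Match m)
                {
                    // Swap the original header (number and times) for a shifted one.
                    return m.Value.Replace(
                        String.Format("{0}\r\n{1} --> {2}\r\n",
                            m.Groups["sequence"].Value,
                            m.Groups["start"].Value,
                            m.Groups["end"].Value),
                        String.Format(
                            "{0}\r\n{1:HH\\:mm\\:ss\\,fff} --> " +
                            "{2:HH\\:mm\\:ss\\,fff}\r\n",
                            sequence++,
                            // DateTime.Parse expects '.', not ',', before the milliseconds.
                            DateTime.Parse(m.Groups["start"].Value.Replace(",", "."))
                                .AddSeconds(offset),
                            DateTime.Parse(m.Groups["end"].Value.Replace(",", "."))
                                .AddSeconds(offset)));
                }));
        }
    }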
So we read the entire input file into memory and replace, one unit at a time, the original times with new ones that include the offset. Write everything to the output file and you are good to go.
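The time shift itself can also be sketched in isolation. The example below is an alternative to the DateTime approach used above, not the program's actual code: it parses the timestamp as a TimeSpan with a custom format, so no comma-to-period replacement is needed. The names SrtTime and Shift are hypothetical, and clamping negative results to zero is my own assumption about what a sensible result would be.

```csharp
using System;
using System.Globalization;

static class SrtTime
{
    // TimeSpan custom format for SRT timestamps; literals must be escaped.
    const string Format = @"hh\:mm\:ss\,fff";

    // Shifts an "HH:mm:ss,fff" timestamp by offsetSeconds (may be negative).
    public static string Shift(string stamp, double offsetSeconds)
    {
        TimeSpan time = TimeSpan.ParseExact(
            stamp, Format, CultureInfo.InvariantCulture);

        // Round the offset to whole milliseconds to avoid floating-point drift.
        long offsetMs = (long)Math.Round(offsetSeconds * 1000);
        TimeSpan shifted = time + TimeSpan.FromTicks(
            offsetMs * TimeSpan.TicksPerMillisecond);

        // Assumption: a subtitle cannot start before the beginning of the film.
        if (shifted < TimeSpan.Zero)
        {
            shifted = TimeSpan.Zero;
        }
        return shifted.ToString(Format, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(Shift("00:16:38,731", 1.5));   // prints 00:16:40,231
        Console.WriteLine(Shift("00:16:38,731", -2.75)); // prints 00:16:35,981
    }
}
```

A helper like this could be called from inside the MatchEvaluator delegate for both the start and end groups.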
Points of Interest
The most interesting part was formatting the shifted times. The .SRT format uses a comma as the milliseconds separator, so I needed an extra Replace call inside the MatchEvaluator delegate to turn the comma into a period before parsing.
Another point was the file encoding. The trick was to read the file with Encoding.Default and make sure the output file was written with the same encoding, so that accented characters were preserved correctly.
Besides that, this program ran correctly on the first try.
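The encoding point can be illustrated with a small sketch: text encoded with one encoding survives only if it is decoded with the same one. The class EncodingRoundTrip and the method Reinterpret are hypothetical, and ISO-8859-1 is used here purely as an example of a single-byte encoding; the actual value of Encoding.Default depends on the runtime and the system.

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    // Encodes text with one encoding and decodes the bytes with another,
    // mimicking a file written and then read back.
    public static string Reinterpret(string text, Encoding writeWith, Encoding readWith)
    {
        byte[] bytes = writeWith.GetBytes(text);
        return readWith.GetString(bytes);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string word = "cora\u00e7\u00e3o"; // a Portuguese word with accented letters

        // Same encoding on both sides: the accents are preserved.
        Console.WriteLine(Reinterpret(word, latin1, latin1) == word); // prints True

        // Mismatched encodings: the accented bytes are not valid UTF-8.
        Console.WriteLine(Reinterpret(word, latin1, Encoding.UTF8) == word); // prints False
    }
}
```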
History
- 1.0: Initial version
- 1.1: Fixed some typos and added some external links