Hey guys, help!!!

Recently I did a genome annotation job for a bacterial genome, and one of the annotation results is an NCBI blastp output log text file. It is about 100GB in size and may grow larger later. I had written a class for parsing blastp output files before, but the text files I handled with that class were all smaller than 2GB.

Now, facing a text file of more than 100GB, the .NET Regex and StringBuilder classes can no longer handle such a huge string; they throw an OutOfMemoryException even on our Linux server with 1TB of memory!

Who can help me? I know this is very crazy...

Here is the loading code; based on the exception information, the program gets stuck here:

VB
Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

Dim SourceText As String = IO.File.ReadAllText(LogFile) 
Dim Sections As String() = (From matche As Match
                            In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                            Select matche.Value).ToArray



The blast output file consists of many, many sections like the one below. I think a regular expression is the easiest way to parse it, and the regular expression keeps the code clean:

blablablabla.........

Query= XC_0118 transcriptional regulator

Length=1113
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  lcl5167|ana:all4503 two-component response regulator; K07657 tw...  57.4    2e-009
  lcl2658|cyp:PCC8801_3460 winged helix family two component tran...  56.6    4e-009
  lcl8962|tnp:Tnap_0756 two component transcriptional regulator, ...  55.1    7e-009
  lcl9057|kol:Kole_0706 two component transcriptional regulator, ...  55.1    1e-008
  lcl9114|ter:Tery_2902 two component transcriptional regulator; ...  55.1    1e-008
  lcl9051|trq:TRQ2_0821 two component transcriptional regulator (...  54.7    1e-008
  lcl9023|tpt:Tpet_0798 two component transcriptional regulator (...  54.7    1e-008
  lcl8929|tma:TM0126 response regulator; K02483 two-component sys...  54.3    1e-008
  lcl8992|tna:CTN_0563 Response regulator (A)|[Regulog=ZnuR - The...  52.4    6e-008


blablablabla.........


> lcl5167|ana:all4503 two-component response regulator; K07657 
two-component system, OmpR family, phosphate regulon response 
regulator PhoB (A)|[Regulog=SphR - Cyanobacteria] [tfbs=all3651:-119;alr5291:-229;all4021:-136;alr5259:-329;all1758:-49;all0129:-101;all3822:-149;alr4975:-10;alr2234:-275;all0207:-98;all0911:-105;all4575:-324]
Length=253

 Score = 57.4 bits (137),  Expect = 2e-009, Method: Compositional matrix adjust.
 Identities = 36/109 (33%), Positives = 52/109 (48%), Gaps = 1/109 (1%)

Query  3    LRSERVTQLGSVPRFRLGPLLVEPERLMLIGDGERITLEPRMMEVLVALAERAGEVISAE  62
            LR +R+  L  +P  +   + + P+   ++  G+ + L P+   +L      A  V S E
Sbjct  144  LRRQRLITLPQLPVLKFKDVTLNPQECRVLVRGQEVNLSPKEFRLLELFMSYARRVWSRE  203

Query  63   QLLIDVWHGSFYGDNP-VHKTIAQLRRKLGDDSRQPRFIETIRKRGYRL  110
            QLL  VW   F GD+  V   I  LR KL  D   P +I T+R  GYR 
Sbjct  204  QLLDQVWGPDFVGDSKTVDVHIRWLREKLEQDPSHPEYIVTVRGFGYRF  252


blablablabla.........


Lambda      K        H        a         alpha
   0.321    0.133    0.395    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6 

Effective search space used: 1655396995

blablablabla.........
Comments
Tomas Takac 13-Nov-14 15:00pm    
Interesting, how do you parse the input? Can you post some code?
Mr. xieguigang 谢桂纲 13-Nov-14 15:13pm    
The text file was read using IO.File.ReadAllText and split using Regex.Matches.

Tomas Takac 13-Nov-14 15:20pm    
Regex, or at least the .NET implementation, won't help you here. Are there any line breaks in your input? Then you could parse it line by line, assuming the matched string doesn't span multiple lines. Or you could write your own parser, using ANTLR for example.
BillWoodruff 13-Nov-14 15:04pm    
Well, you're going to have to read a "chunk" at a time. However, if you have a 1-terabyte server, why not run a program on the server to serve up chunks of a size you can deal with, as you request them?
Mr. xieguigang 谢桂纲 13-Nov-14 15:15pm    
The blastp output file consists of result sections, and the sections have varying lengths, so it is difficult to decide on a chunk length. That is why I think the best method is to use a regular expression...

From your Regex pattern
C#
const string BLAST_QUERY_HIT_SECTION = @"Query=.+?Effective search space used: \d+";  // verbatim string: \d is not a valid escape sequence in an ordinary C# string literal
(Yes I converted to C#)
I'd guess that your file format is something like:
possibly some stuff in the line ahead of Query=something query identifier-ish
lots (multiple? lines) of query result stuff ...
possibly some stuff in the line ahead of Effective search space used: some digits

So why not write a fairly simple loop that looks at the Blast log file a line at a time and assembles the query results from the pieces?
C#
// requires: using System.Collections.Generic; using System.IO; using System.Text;
public static IEnumerable<string> BlastQueryHits(string logFile)
{
  bool inHit = false;
  StringBuilder sb = new StringBuilder();
  foreach (string line in File.ReadLines(logFile))
  {
    if (inHit)
    {
      sb.AppendLine(line);  // AppendLine keeps the line breaks that File.ReadLines strips
      if (line.StartsWith("Effective search space used:"))  // from discussion comment below, this is more efficient
      {
        yield return sb.ToString();
        inHit = false;
      }
    }
    else if (line.StartsWith("Query="))  // from discussion comment below, this is more efficient
    {
      sb.Clear().AppendLine(line);
      inHit = true;
    }
  }
  if (inHit)
     yield return sb.ToString();   // just in case there's any "leftovers"
}

This will return each of the query hits as a string.
They will be assembled as needed, so there's no need to have everything in memory at the same time.
Work with them one at a time!!!

Caveat: This code is "off the top of my head"... I didn't compile or execute it. It might have an "issue" or two, but should be pretty close! ;-)

Edit: revised based on comments below.

The way this could be used in a parallel processing scenario would be with something like:
C#
string logFile = "path to blast log file";
Parallel.ForEach(BlastQueryHits(logFile), DoSomethingWithOneQuery);

private void DoSomethingWithOneQuery(string queryHit)
{
  // do the per-query-hit processing
}

This will still produce the query hits only as needed: if it parallelizes over 3 cores, it will produce the first 3 right away and then the next ones as the previous ones are completed. It still will not produce them all at once and leave them sitting around in memory until the processing gets to them, so this should have a substantially lower memory footprint.
The BlastQueryHits() method is equivalent to your Regex.Match, but it is much more efficient and will work with an arbitrarily large file (as long as each chunk is less than the 2GB limit for strings).
 
Comments
Mr. xieguigang 谢桂纲 14-Nov-14 3:57am    
Here is the structure of a section: each section starts with "Query=" and ends with "Effective search space used:", exactly as in the sample output shown in the question above.

And yes, this may be a solution, but it cannot be parallelized: with a For Each loop the program only utilizes one CPU core, and dealing with the 100GB text file on a single core is impossible...
Matt T Heffron 14-Nov-14 12:47pm    
I have updated the Solution to show how this could be used in a parallel scenario.
Also, see my other discussion comment here.
Matt T Heffron 14-Nov-14 12:38pm    
My solution addresses the issue of breaking the HUGE file into chunks (query hits). That cannot be done in parallel anyway.
What can be done in parallel is whatever processing you want to do with each of the query hits.
It will still "play well" with parallel processing of the hits, as it returns the query hit strings only as needed, so they are not "sitting around" in memory (in an array) waiting to be processed. And once processed, they can be released; again, not sitting in an array when no longer necessary.
BillWoodruff 14-Nov-14 13:00pm    
+5 Matt, I appreciate the effort you've gone to in your thoughtful reply ! If the file format contained discrete "entities" that did not have forwards/backwards dependencies, then I see no reason why it could not be processed in parallel, queried in parallel, etc., or at least pre-processed in some way.
Matt T Heffron 14-Nov-14 13:11pm    
Thanks.
The method I wrote just breaks the HUGE file into the individual query hits, just like the OP's Regex.Match, except it produces them as needed, instead of all at once.
Even if the chunks are not of a predefined size, you should find a way to split the file into chunks, or at least to get the position of every chunk boundary.
Still, if this is a lab application, I would not use C#. I would use Python, for example; it has more sophisticated text-manipulation facilities.

On the other hand, whatever you can do by loading everything into memory, you can also do by jumping around in the file. It won't be quick, but it will work. In your case, matching the pattern "Query=.+?Effective search space used: \d+" is not complex at all; you only need a simple pattern-matching algorithm. Matt T Heffron's answer contains one approach, but it might not be "low level" enough in some cases.
Even better, your pattern describes a regular (Chomsky type 3) language, so you can look for it with a simple state machine. Think of the file as a single array of characters; seen this way, you don't need to load it into memory at all. If you want to speed things up, you can read fixed-size pages at increasing offsets, or you can try MemoryMappedFile[^], since that is exactly what it was made for; a sketch of that approach follows.
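To make this concrete, here is a minimal, untested C# sketch of that idea (the BlastSectionScanner class and Sections method names are placeholders of mine, not an existing API): it maps the log with MemoryMappedFile and runs a simple two-state scanner over it line by line, yielding one "Query= ... Effective search space used:" section at a time. It is essentially the same state machine as in Matt T Heffron's answer, only reading through a memory-mapped view instead of File.ReadLines.
C#
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class BlastSectionScanner
{
    // Streams "Query= ... Effective search space used:" sections out of an
    // arbitrarily large BLAST log without holding the whole file in memory.
    public static IEnumerable<string> Sections(string logFile)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(logFile, FileMode.Open))
        using (var view = mmf.CreateViewStream())   // a stream over the whole mapping; pages are faulted in on demand
        using (var reader = new StreamReader(view, Encoding.UTF8))
        {
            var section = new StringBuilder();
            bool inSection = false;                 // the two states of the scanner
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!inSection)
                {
                    if (line.StartsWith("Query=", StringComparison.Ordinal))
                    {
                        section.Clear().AppendLine(line);
                        inSection = true;           // opening marker found: start collecting
                    }
                }
                else
                {
                    section.AppendLine(line);
                    if (line.StartsWith("Effective search space used:", StringComparison.Ordinal))
                    {
                        yield return section.ToString();  // closing marker found: hand the section over
                        inSection = false;
                    }
                }
            }
        }
    }
}

Because it returns a lazy IEnumerable<string>, it can be fed to Parallel.ForEach exactly like the BlastQueryHits() method above.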
 
Comments
Matt T Heffron 13-Nov-14 18:08pm    
"... I would use python for example."
Perl lives for string manipulation like this!
And it has the advantage of being self-encrypting! ;-)
(Almost like old DEC TECO macros.)
Zoltán Zörgő 13-Nov-14 18:36pm    
Yes, Perl is a good one, but a harder nut to crack for many. Still, I suppose such a big file won't be straightforward for Perl either. :)
Mr. xieguigang 谢桂纲 14-Nov-14 4:06am    
I am trying your advice of splitting the big file into chunks and then searching within each chunk; maybe I can find a solution tonight. The difficulty of this job is how to make the loading process parallel, otherwise the loading alone may take a whole day...
Zoltán Zörgő 15-Nov-14 9:43am    
Parallelism might look promising, but keep in mind that what you lack in this case is memory, not processing power. So if you still try to load everything, even into different threads, you can still run out of memory, or at least slow down considerably because of paging.

Still, I am really interested in finding a good solution for this problem. I suppose you can't share your original input file with us, but could you share a portion of it, or a link to a publicly available file with similar content, even one of considerably smaller size?
Hey guys, I have worked out how to deal with this ultra-large text file parsing job. It consists of three steps:

1. Read the data into memory in chunks of 786MB (it seems the UTF8.GetString function cannot handle anything much larger than 1GB) and cache the chunks in a list.
2. Use the regular expression to parse out the sections. The Regex matching function runs on a single thread, so parallel LINQ over the cached chunks speeds this step up.
3. Parse each section's text as I did before.

Here is my code:

VB
''' <summary>
''' Parses a blast output file smaller than 1GB in size.
''' </summary>
''' <param name="LogFile">Path of the original plain-text blast output file.</param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function TryParse(LogFile As String) As v228
   Call Console.WriteLine("Regular Expression parsing blast output...")

   Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
      Call p.Start()

      Dim SourceText As String = IO.File.ReadAllText(LogFile) 'LogFile.ReadUltraLargeTextFile(System.Text.Encoding.UTF8)
      Dim Sections As String() = (From matche As Match
                                  In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                                  Select matche.Value).ToArray

      Call Console.WriteLine("Parsing job done!")

      Dim Sw As Stopwatch = Stopwatch.StartNew
#If DEBUG Then
      Dim LQuery = (From Line As String In Sections Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#Else
      Dim LQuery = (From Line As String In Sections.AsParallel Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#End If
      Dim BLASTOutput As v228 = New v228 With {.FilePath = LogFile & ".xml", .Queries = LQuery}
      Console.WriteLine("BLASTOutput file loaded: {0}ms", Sw.ElapsedMilliseconds)

      Return BLASTOutput
   End Using
End Function

''' <summary>
''' It seems 786MB is possibly the upper bound of the UTF8.GetString function.
''' </summary>
''' <remarks></remarks>
Const CHUNK_SIZE As Long = 1024 * 1024 * 786
Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

''' <summary>
''' Deals with blast output files larger than 2GB.
''' </summary>
''' <param name="LogFile"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function TryParseUltraLarge(LogFile As String, Optional Encoding As System.Text.Encoding = Nothing) As v228
    Call Console.WriteLine("Regular Expression parsing blast output...")

    'The default text encoding of the blast log is utf8
    If Encoding Is Nothing Then Encoding = System.Text.Encoding.UTF8

    Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
       Call p.Start()

       Dim TextReader As IO.FileStream = New IO.FileStream(LogFile, IO.FileMode.Open)
       Dim ChunkBuffer As Byte() = New Byte(CHUNK_SIZE - 1) {}
       Dim LastIndex As String = ""
       'Dim Sections As List(Of String) = New List(Of String)
       Dim SectionChunkBuffer As List(Of String) = New List(Of String)

       Do While TextReader.Position < TextReader.Length
          Dim Delta As Long = TextReader.Length - TextReader.Position 'Bytes left in the file; must be Long for files larger than 2GB

          If Delta < CHUNK_SIZE Then ChunkBuffer = New Byte(CInt(Delta) - 1) {}

          Dim BytesRead As Integer = TextReader.Read(ChunkBuffer, 0, ChunkBuffer.Length)

          Dim SourceText As String = Encoding.GetString(ChunkBuffer, 0, BytesRead) 'Only decode the bytes that were actually read in this pass

          If Not String.IsNullOrEmpty(LastIndex) Then
             SourceText = LastIndex & SourceText
          End If

          Dim i_LastIndex As Integer = InStrRev(SourceText, "Effective search space used:")
          If i_LastIndex = 0 Then  'InStrRev returns 0 when the marker is missing: no complete section in this chunk
             LastIndex &= SourceText
             Continue Do
          Else
             i_LastIndex += 42  'Skip past the marker text and the digits that follow it

             If Not i_LastIndex >= Len(SourceText) Then
                LastIndex = Mid(SourceText, i_LastIndex)  'The tail of this chunk belongs to a section that continues in the next chunk
             Else
                LastIndex = ""
             End If
             Call SectionChunkBuffer.Add(SourceText)
          End If

      Loop

      Call TextReader.Dispose() 'Loading finished; release the file stream

      Call Console.WriteLine("Loading job done, start to regex parsing!")

       'The Regex engine runs on a single thread, so the cached chunks are matched
       'with parallel LINQ to speed up the parsing of this ultra-large text file.
       Dim Sections As String() = (From strLine As String 
                                   In SectionChunkBuffer.AsParallel
                                   Select (From matche As Match
                                           In Regex.Matches(strLine, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                                           Select matche.Value).ToArray).ToArray.MatrixToVector

       Call Console.WriteLine("Parsing job done!")
       '#Const DEBUG = 1
       Dim Sw As Stopwatch = Stopwatch.StartNew
#If DEBUG Then
       Dim LQuery = (From Line As String In Sections Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#Else
       Dim LQuery = (From Line As String In Sections.AsParallel Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#End If
       Dim BLASTOutput As v228 = New v228 With {.FilePath = LogFile & ".xml", .Queries = LQuery}
       Console.WriteLine("BLASTOutput file loaded: {0}ms", Sw.ElapsedMilliseconds)

       Return BLASTOutput
   End Using
End Function
 