Comments by Mr. xieguigang 谢桂纲 (Top 15 by date)

Mr. xieguigang 谢桂纲 6-May-15 11:32am
Control.BeginInvoke(MethodInvoker) is still not working; the code still gets stuck at this point.
Mr. xieguigang 谢桂纲 15-Nov-14 2:33am
Hey, I have worked out how to deal with this ultra-large text file parsing job. It takes 3 steps:
1. Load all of the data into memory, split it into chunks of 786MB, and cache the chunks in a list. It seems the UTF8.GetString function cannot handle input larger than 1GB.
2. Use the regular expression to parse out the sections. Because the regex matching function runs on a single thread, parallel LINQ can speed this job up.
3. Parse the section text as I did before.
Here is my code:

    ''' <summary>
    ''' It seems 786MB is possibly the upper bound of the Utf8.GetString function.
    ''' </summary>
    ''' <remarks></remarks>
    Const CHUNK_SIZE As Long = 1024 * 1024 * 786
    Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

    ''' <summary>
    ''' Deals with files larger than 2GB
    ''' </summary>
    ''' <param name="LogFile"></param>
    ''' <returns></returns>
    ''' <remarks></remarks>
    Public Shared Function TryParseUltraLarge(LogFile As String, Optional Encoding As System.Text.Encoding = Nothing) As v228
        Call Console.WriteLine("Regular Expression parsing blast output...")

        'The default text encoding of the blast log is utf8
        If Encoding Is Nothing Then Encoding = System.Text.Encoding.UTF8

        Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
            Call p.Start()

            Dim TextReader As IO.FileStream = New IO.FileStream(LogFile, IO.FileMode.Open)
            Dim ChunkBuffer As Byte() = New Byte(CHUNK_SIZE - 1) {}
            Dim LastIndex As String = ""
            'Dim Sections As List(Of String) = New List(Of String)
            Dim SectionChunkBuffer As List(Of String) = New List(Of String)

            Do While TextReader.Position < TextReader.Length
                'Long instead of Integer: the remaining length can exceed 2GB
                Dim Delta As Long = TextReader.Length - TextReader.Position

                If Delta < CHUNK_SIZE Then ChunkBuffer = New Byte(CInt(Delta) - 1) {}

                'Read the whole buffer (Count - 1 would leave the last byte unread)
                Call TextReader.Read(ChunkBuffer, 0, ChunkBuffer.Length)

                Dim SourceText As String = Encoding.GetString(ChunkBuffer)

                If Not String.IsNullOrEmpty(LastIndex) Then
                    SourceText = LastIndex & SourceText
                End If

                Dim i_LastIndex As Integer = InStrRev(SourceText, "Effective search space used:")

                'InStrRev returns 0 (not -1) when nothing is found
                If i_LastIndex = 0 Then 'There is no complete section in the current chunk
                    LastIndex = SourceText 'SourceText already contains the previous remainder
                    Continue Do
                Else
                    i_LastIndex += 42
                    If Not i_LastIndex >= Len(SourceText) Then
                        LastIndex = Mid(SourceText, i_LastIndex) 'Some text at the end of this chunk is part of a section that continues in the next chunk.
                    Else
                        LastIndex = ""
                    End If
                    Call SectionChunkBuffer.Add(SourceText)
                End If

                'This part of the code is non-parallel
                'Dim SectionsTempChunk = (From matche As Match
                '                         In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                '                         Select matche.Value).ToArray
                'If SectionsTempChunk.IsNullOrEmpty Then
                '    LastIndex &= SourceText
                '    Continue Do
                'Else
                '    Call Sections.AddRange(SectionsTempChunk)
                'End If
                'LastIndex = SectionsTempChunk.Last()
                'Dim Last_idx As Integer = InStr(SourceText, LastIndex) + Len(La
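Step 2 of the approach above, running the regular expression over the cached chunks with parallel LINQ, might look roughly like the following. This is only a minimal sketch and not the rest of the truncated listing: `MatchSections`, `ExtractSections`, and the sample chunks are hypothetical names introduced for illustration, and the pattern is copied from `BLAST_QUERY_HIT_SECTION`.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module ParallelSectionMatching

    ' Same pattern as BLAST_QUERY_HIT_SECTION in the comment above.
    Const SECTION_PATTERN As String = "Query=.+?Effective search space used: \d+"

    ' Hypothetical helper: returns every complete section found in one chunk.
    Function ExtractSections(chunk As String) As IEnumerable(Of String)
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        Return Regex.Matches(chunk, SECTION_PATTERN, options).
            Cast(Of Match)().
            Select(Function(m) m.Value)
    End Function

    ' Hypothetical helper: each chunk is matched on its own worker,
    ' and AsOrdered keeps the sections in file order.
    Function MatchSections(chunks As IEnumerable(Of String)) As List(Of String)
        Return chunks.AsParallel().
            AsOrdered().
            SelectMany(Function(chunk) ExtractSections(chunk)).
            ToList()
    End Function

    Sub Main()
        ' Tiny made-up chunks standing in for the 786MB ones.
        Dim chunks As New List(Of String)
        chunks.Add("Query= A" & vbLf & "hits" & vbLf & "Effective search space used: 1" & vbLf &
                   "Query= B" & vbLf & "Effective search space used: 22")
        chunks.Add("Query= C" & vbLf & "Effective search space used: 333")

        Console.WriteLine(MatchSections(chunks).Count)
    End Sub

End Module
```

The parallelism here is per chunk, so it only helps when the list holds at least as many chunks as there are CPU cores; with a handful of very large chunks, a smaller chunk size would probably spread the work better.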
Mr. xieguigang 谢桂纲 14-Nov-14 4:06am
I am trying your advice of splitting the big file into chunks and then searching within each chunk; maybe I can find a solution tonight. The difficulty of this job is how to make the loading process parallel, otherwise the loading alone may take a whole day.
Mr. xieguigang 谢桂纲 14-Nov-14 3:57am
Here is an example of a section. Each section starts with "Query=" and ends with "Effective search space used:".

    blablablabla.........
    Query= XC_0118 transcriptional regulator
    Length=1113
    Score E
    Sequences producing significant alignments: (Bits) Value
    lcl5167|ana:all4503 two-component response regulator; K07657 tw... 57.4 2e-009
    lcl2658|cyp:PCC8801_3460 winged helix family two component tran... 56.6 4e-009
    lcl8962|tnp:Tnap_0756 two component transcriptional regulator, ... 55.1 7e-009
    lcl9057|kol:Kole_0706 two component transcriptional regulator, ... 55.1 1e-008
    lcl9114|ter:Tery_2902 two component transcriptional regulator; ... 55.1 1e-008
    lcl9051|trq:TRQ2_0821 two component transcriptional regulator (... 54.7 1e-008
    lcl9023|tpt:Tpet_0798 two component transcriptional regulator (... 54.7 1e-008
    lcl8929|tma:TM0126 response regulator; K02483 two-component sys... 54.3 1e-008
    lcl8992|tna:CTN_0563 Response regulator (A)|[Regulog=ZnuR - The... 52.4 6e-008
    blablablabla.........
    > lcl5167|ana:all4503 two-component response regulator; K07657 two-component system, OmpR family, phosphate regulon response regulator PhoB (A)|[Regulog=SphR - Cyanobacteria] [tfbs=all3651:-119;alr5291:-229;all4021:-136;alr5259:-329;all1758:-49;all0129:-101;all3822:-149;alr4975:-10;alr2234:-275;all0207:-98;all0911:-105;all4575:-324]
    Length=253
    Score = 57.4 bits (137), Expect = 2e-009, Method: Compositional matrix adjust.
    Identities = 36/109 (33%), Positives = 52/109 (48%), Gaps = 1/109 (1%)
    Query 3 LRSERVTQLGSVPRFRLGPLLVEPERLMLIGDGERITLEPRMMEVLVALAERAGEVISAE 62
    LR +R+ L +P + + + P+ ++ G+ + L P+ +L A V S E
    Sbjct 144 LRRQRLITLPQLPVLKFKDVTLNPQECRVLVRGQEVNLSPKEFRLLELFMSYARRVWSRE 203
    Query 63 QLLIDVWHGSFYGDNP-VHKTIAQLRRKLGDDSRQPRFIETIRKRGYRL 110
    QLL VW F GD+ V I LR KL D P +I T+R GYR
    Sbjct 204 QLLDQVWGPDFVGDSKTVDVHIRWLREKLEQDPSHPEYIVTVRGFGYRF 252
    blablablabla.........
    Lambda K H a alpha
    0.321 0.133 0.395 0.792 4.96
    Gapped
    Lambda K H a alpha sigma
    0.267 0.0410 0.140 1.90 42.6 43.6
    Effective search space used: 1655396995
    blablablabla.........

And yes, this may be a solution, but it cannot run in parallel. If we use a For Each loop, the program only uses 1 CPU core, and dealing with a 100GB text file that way is impossible.
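One possible way around the single-core limit is to let one thread read the file line by line and cut out the sections, while parallel LINQ parses the finished sections on the other cores. The sketch below assumes that every section starts on a line beginning with "Query=" and ends on a line beginning with "Effective search space used:", as in the example above. `ReadSections` is a hypothetical helper, the sample text is made up, and the `Iterator`/`Yield` keywords need Visual Basic 11 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text

Module StreamingSections

    ' Hypothetical helper: yields one complete section at a time, so only the
    ' section currently being assembled is buffered by the reader.
    Iterator Function ReadSections(reader As TextReader) As IEnumerable(Of String)
        Dim buffer As New StringBuilder()
        Dim inSection As Boolean = False
        Dim line As String = reader.ReadLine()

        Do While line IsNot Nothing
            If line.StartsWith("Query=", StringComparison.Ordinal) Then
                buffer.Clear()
                inSection = True
            End If

            If inSection Then
                buffer.AppendLine(line)
                If line.StartsWith("Effective search space used:", StringComparison.Ordinal) Then
                    Yield buffer.ToString()
                    inSection = False
                End If
            End If

            line = reader.ReadLine()
        Loop
    End Function

    Sub Main()
        ' Made-up sample standing in for a BLAST log.
        Dim lines As String() = {
            "BLASTP header",
            "Query= XC_0118",
            "some hits",
            "Effective search space used: 1655396995",
            "Query= XC_0119",
            "Effective search space used: 42"
        }

        Using reader As New StringReader(String.Join(vbLf, lines))
            ' s.Length stands in for the real per-section parser.
            Dim sizes As Integer() = ReadSections(reader).
                AsParallel().
                AsOrdered().
                Select(Function(s) s.Length).
                ToArray()
            Console.WriteLine("Sections parsed: " & sizes.Length)
        End Using
    End Sub

End Module
```

For a real log, the `StringReader` would be replaced by something like `New StreamReader(LogFile, Encoding.UTF8)`. Reading stays sequential, so this only pays off when parsing a section costs more than reading it from disk.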
Mr. xieguigang 谢桂纲 14-Nov-14 3:50am
Yes, IO.File.ReadAllText is perfect and clean for files smaller than 2GB, but it crashes when the file is very large. I think Microsoft should improve the .NET classes for processing ultra-large text files.
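For comparison, `File.ReadLines` (available since .NET Framework 4) enumerates a file lazily, one line at a time, instead of building one giant string the way `ReadAllText` does. A minimal sketch, with `CountQueries` and the temporary file as hypothetical stand-ins for the real log:

```vbnet
Imports System
Imports System.IO
Imports System.Linq

Module LazyLineReading

    ' Hypothetical helper: counts the lines that open a section without
    ' ever holding more than one line of the file in memory.
    Function CountQueries(fileName As String) As Long
        Return File.ReadLines(fileName).
            LongCount(Function(line) line.StartsWith("Query=", StringComparison.Ordinal))
    End Function

    Sub Main()
        Dim tempFile As String = Path.GetTempFileName()
        Dim sample As String() = {"Query= A", "some hit", "Query= B"}

        Try
            File.WriteAllLines(tempFile, sample)
            Console.WriteLine(CountQueries(tempFile))
        Finally
            File.Delete(tempFile)
        End Try
    End Sub

End Module
```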