How to code a program to decrease a text file size?

Question

I have a very large text file whose size I want to decrease, so I want to write a program that does this by deleting some data I don't need from the file.
Here is a small sample of that file.

START
POINT 1000 5356 4720.589395 33044.474616 111.699997 10005356
2.197266 1554.908813
2.278646 1309.400635
5.615234 572.443115
5.696615 572.070190
6.510417 616.282471
6.591797 611.210938
6.673177 615.655396
POINT 1050 5360 4770.576031 33044.253728 112.699997 10005360
2.197266 883.810486
2.278646 1237.972656
2.360026 1187.120972
2.522787 922.997620
2.604167 868.807739
2.685547 810.683044
2.766927 794.258240
2.929688 706.232666

The program should ask for the location of the file. The first line, "START", is not repeated and should be ignored (not deleted). Then the program should 'group' the lines: each 'POINT' line, together with the data lines up to the next 'POINT' line, should be a group of its own.
So for example, this would be considered as a group:
POINT 1000 5356 4720.589395 33044.474616 111.699997 10005356
2.197266 1554.908813
2.278646 1309.400635
5.615234 572.443115
5.696615 572.070190
6.510417 616.282471
6.591797 611.210938
6.673177 615.655396
And so on....

The program would then delete groups according to their heading ("POINT ......").
I want the program to ask me a few questions, such as:
Start point (for example, from POINT 1000 ......)
End point (until POINT 3521 .....)
Increment (for example, with an increment of 5, delete POINT 10, then POINT 15, then POINT 20, and so on until the end point)

I hope this is clear. I would prefer to do this in VB, but I suspect it won't work because the file has 9 million lines. If not, please tell me whether it could be done in C++ or C#, and point me to a method or tutorial that could help.
Thanks in advance

Posted 12-Dec-14 5:40am

Member 9472140

Comments

ZurdoDev 12-Dec-14 11:44am

Where are you stuck?

Member 9472140 12-Dec-14 12:00pm

I am stuck from the beginning.
I mean, I don't know how to start.

OriginalGriff 12-Dec-14 12:01pm

And what have you done so far?
Where are you stuck?
What help do you need?
And why do you think "it can't be done in VB"? If it can be done in C# (and it can) it can be done in VB: they both compile to the same IL...

Member 9472140 12-Dec-14 12:03pm

I heard that vb can't import huge files like this.

Sergey Alexandrovich Kryukov 12-Dec-14 12:28pm

Perhaps you are posing the problem in a wrong, counter-productive way. Instead of trying to stick huge files somewhere, why not think about getting rid of them? Big text files really make little sense, whatever the purpose. But you can help by explaining your ultimate goals.
—SA

Member 9472140 12-Dec-14 12:32pm

This file contains data used by a piece of software, but the software can't open it because the file is too large.

Sergey Alexandrovich Kryukov 12-Dec-14 13:08pm

In principle, reading files of any size is not a problem. You just should not try to read it all at once; instead, find a way to navigate to individual records (again, a text file is a bad idea).
—SA

Richard MacCutchan 13-Dec-14 4:50am

Then read it line by line instead.
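
A minimal sketch of the line-by-line approach, written in Python for brevity; the same loop carries over to VB.NET, C#, or C++. It assumes that the first number after POINT is the point number, and that a group is deleted when that number lies between the start and end points and is a whole number of increments above the start. The names `keep_lines` and `filter_file` are hypothetical, not anything from the question.

```python
def keep_lines(lines, start, end, step):
    """Yield the lines to keep, skipping every selected POINT group.

    Assumption: the first number after "POINT" is the point number, and a
    group is dropped when start <= number <= end and (number - start) is a
    multiple of step.
    """
    dropping = False
    for line in lines:
        if line.startswith("POINT"):
            number = int(line.split()[1])
            dropping = start <= number <= end and (number - start) % step == 0
        if not dropping:
            # The START line and every group that was not selected pass through.
            yield line


def filter_file(in_path, out_path, start, end, step):
    """Stream in_path to out_path; only one line is held in memory at a time."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(keep_lines(src, start, end, step))
```

A call such as `filter_file("big.txt", "smaller.txt", 1000, 3521, 5)` would write the reduced copy to a new file and leave the original untouched. Because the file is never loaded as a whole, 9 million lines should not be an obstacle.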

Maciej Los 12-Dec-14 16:05pm

C++ has structures. You can use them to store binary data in a file. The resulting file should be considerably smaller than the text file.
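
To get a rough feel for the saving, here is a small sketch that uses Python's `struct` module as a stand-in for a C++ structure. It packs one data line from the question's sample as two binary numbers; the choice of 64-bit versus 32-bit fields is an assumption, since the thread does not say how much precision the consuming software needs.

```python
import struct

text_line = "2.197266 1554.908813\n"   # one data line from the sample
x, y = (float(token) for token in text_line.split())

as_doubles = struct.pack("<dd", x, y)  # two 64-bit floats, round-trips exactly
as_floats = struct.pack("<ff", x, y)   # two 32-bit floats, smaller but lossy

print(len(text_line), len(as_doubles), len(as_floats))
```

With 64-bit doubles the line shrinks from 21 bytes to 16; with 32-bit floats it shrinks to 8, but 32-bit floats keep only about 7 significant digits, so a value like 1554.908813 would lose its trailing digits.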

1 solution


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2014-12-12T07:54:00

Please see my comments to the question. Again, you are approaching it the wrong way.

When I replied that using big text files is a bad thing and asked about your goals, you did not really explain them, but you mentioned "some sort of software", which probably doesn't give you a choice. But then, why ask about "decreasing a size"? Who is going to decrease it? That isn't logical.

And still, we don't know the essential information: the structure and semantics of the file. Okay, here is one possible approach:

You can index the file to make it possible to read it in smaller chunks. Let's assume the file has some shallow structure; in particular, that it can be decomposed into smaller logical chunks we shall call "records". A record can be a line, but it could also be a group of lines, like the group you've shown in your example. The only problem then is that the groups have different sizes; to begin with, the lines themselves have different lengths, so you don't know the location of each record before you read the whole file.

So, on the first run, you can read the whole file line by line and create another, smaller file: the index file. In the index file, you write the location of each record as a file position. It would be better to make the index file binary, so that navigating it is faster. You can have more than one index file, each sorted by a different criterion (for example, one sorted by record number in the order of the original file, another sorted by some kind of keyword). Then you can hold the index file in memory or, if even the index files are too big, keep in memory only an index of the index file and read the index files on request.
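
The first pass could look roughly like this. It is only a sketch in Python: it treats each POINT group from the question's sample as one record and stores each record's byte offset as a little-endian unsigned 64-bit integer. The function name `build_index` and that index layout are assumptions made for illustration.

```python
import struct


def build_index(data_path, index_path):
    """Write the byte offset of every POINT line to a binary index file.

    Returns the number of records indexed.
    """
    count = 0
    offset = 0
    # Binary mode keeps the offsets exact, whatever the line endings are.
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:  # streams the file one line at a time
            if line.startswith(b"POINT"):
                index.write(struct.pack("<Q", offset))
                count += 1
            offset += len(line)
    return count
```

Each record costs 8 bytes in the index, so even a few hundred thousand groups give an index of only a few megabytes, which is small enough to hold in memory.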

Now, on a request or query, you get the information about a record from one of the index files (taken from memory or read from the index file). From the index information, get the record's position in the original big file and seek to that position in the file stream (open it once and keep it open for the whole lifetime of the application). Then read your record from the original file.
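
A lookup along these lines might be sketched as follows, in Python. It assumes an index file that holds one little-endian unsigned 64-bit byte offset per record, and that a record is a POINT line plus the data lines that follow it; `read_record` and `OFFSET_SIZE` are hypothetical names.

```python
import struct

OFFSET_SIZE = 8  # assumed index format: one "<Q" byte offset per record


def read_record(data, index, n):
    """Return the lines of record n (0-based).

    `data` and `index` are file objects opened in binary mode and kept open
    by the caller between queries.
    """
    index.seek(n * OFFSET_SIZE)
    raw = index.read(OFFSET_SIZE)
    if len(raw) < OFFSET_SIZE:
        raise IndexError(f"no record {n}")
    (offset,) = struct.unpack("<Q", raw)

    data.seek(offset)
    lines = [data.readline()]  # the POINT header itself
    while True:
        line = data.readline()
        if not line or line.startswith(b"POINT"):
            break  # end of file or start of the next record
        lines.append(line)
    return [item.decode("ascii").rstrip("\r\n") for item in lines]
```

Only the requested record is read, so the cost of a query does not depend on the size of the big file.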

A slightly different alternative: do everything as described above but, on the first run, completely rewrite the original text file into something more convenient for navigation, which could be a much shorter binary file. In that binary file, don't store numbers as strings; this will save you a lot of space and, more importantly, greatly improve performance.
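
As an illustration only, a conversion of this kind could be sketched in Python like this. The binary layout is made up for the example: each group becomes a header with the six POINT fields stored as 64-bit doubles plus a sample count, followed by that many pairs of doubles. It assumes every POINT line has exactly six numeric fields, as in the sample; all names here are hypothetical.

```python
import struct

HEADER = struct.Struct("<6dI")  # six POINT fields as doubles + sample count
SAMPLE = struct.Struct("<2d")   # one "x y" data line


def _write_group(out, fields, samples):
    out.write(HEADER.pack(*fields, len(samples)))
    for pair in samples:
        out.write(SAMPLE.pack(*pair))


def text_to_binary(lines, out):
    """Convert text lines to the binary layout; return the number of groups."""
    fields, samples, groups = None, [], 0
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == "START":
            continue  # blank lines and the START marker carry no data
        if tokens[0] == "POINT":
            if fields is not None:
                _write_group(out, fields, samples)
                groups += 1
            fields, samples = [float(t) for t in tokens[1:]], []
        else:
            samples.append((float(tokens[0]), float(tokens[1])))
    if fields is not None:  # flush the last group
        _write_group(out, fields, samples)
        groups += 1
    return groups


def read_groups(src):
    """Yield (fields, samples) for each group in the binary layout."""
    while True:
        raw = src.read(HEADER.size)
        if len(raw) < HEADER.size:
            return
        *fields, count = HEADER.unpack(raw)
        samples = [SAMPLE.unpack(src.read(SAMPLE.size)) for _ in range(count)]
        yield fields, samples
```

Because every header states how many samples follow, a reader can skip a whole group with a single seek instead of parsing its lines. How much smaller the file gets depends on the field widths chosen; doubles are used here so that every value round-trips exactly.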

—SA