I'm working on a Bag of Visual Words module. I extract SIFT descriptors from 984 training images and store them on disk as a .yaml file. I wonder why my descriptor .yaml file is so huge: its size is 923MB, although the size of my whole training image set is just 146MB. My problem is that I cannot load the descriptor file, which is a matrix with 1.4 million rows and 128 columns, into memory; it looks like there is a threshold in OpenCV's alloc.cpp that prevents me from allocating more than 730MB. Is it normal that the descriptor .yaml file is so huge compared to the 146MB of training images?
Comments
Sergey Alexandrovich Kryukov 10-Apr-15 18:44pm    
Sorry, this is what it is: no relevant information on the problem => no advice. The only advice would be: hit Improve question and provide a lot more detail.
—SA
Member 10644036 10-Apr-15 18:59pm    
I store a Matrix object whose dimensions are about a million x 128 on disk in YAML format; its size is 923MB. I need about 730MB to load this object into memory, but OpenCV aborts with "Insufficient memory". What is the solution?
Sergey Alexandrovich Kryukov 10-Apr-15 19:06pm    
This is still not enough to solve your problem immediately, but I just answered your question in a very general way, which may even be better, because it arms you with an approach applicable to a very wide range of problems. Please see Solution 1 and ask any follow-up questions.
—SA
Sergey Alexandrovich Kryukov 10-Apr-15 19:12pm    
I wrote that on the assumption that you need access to a YAML tree. If you use a matrix object (if I understand you right, this is an in-memory representation of a mathematical matrix of rank 2), the problem becomes truly trivial. It is not a problem to represent a matrix as a file and provide an interface to it identical to the interface of the Matrix object you have right now. You would have to keep in memory only one element of the matrix at a time (well, maybe a very few of them). You can decide if you want to keep the whole index file in memory, which is quite possible (it would be just about a million entries). If all matrix elements are the same size, you don't even need an index file/table.

I'll repeat: this is really trivial and can be done in no time.

Are you getting the idea?

—SA

Solution 1

Please see my comment to the question: not enough information.

On second thought, I can give you a very general idea. Yes, your data structure representing the training data is heavily over-populated, too big to hold in memory. One common and universal approach is to keep the same interface to this data that you have right now, but mimic access to memory through access to a file.

Create a file with some appropriate structure (which can be different from your input format; you will need to digest the input into a suitable structure), open it for read-only non-shared access, and keep it open during the lifetime of your application. You will need to seek to records in this file on request, so it should probably be binary, and you may need well-defined records in it. Yes, I understand that YAML data is hierarchical, not sequential, but you can artificially subdivide it into records at some reasonable level of granularity.
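As a minimal sketch of that one-off "digest" step, assuming the matrix was saved under a node named "descriptors" (adjust to whatever name you actually used), the YAML can be rewritten once as raw binary, which is both far smaller and trivially seekable:

#include <cstdio>
#include <opencv2/core.hpp>

// One-off conversion: read the YAML once, dump the matrix as raw row-major
// 32-bit floats. If even this single full load fails, the same idea can be
// applied in row chunks.
void yamlToBinary(const char* yamlPath, const char* binPath)
{
    cv::FileStorage fs(yamlPath, cv::FileStorage::READ);
    cv::Mat m;
    fs["descriptors"] >> m;            // assumed node name; CV_32F expected
    CV_Assert(m.isContinuous() && m.type() == CV_32F);
    std::FILE* out = std::fopen(binPath, "wb");
    std::fwrite(m.ptr<float>(0), sizeof(float),
                (size_t)m.rows * m.cols, out);
    std::fclose(out);
}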

Then you need to read this file once and remember the file positions of its records. Most likely this should be a hash table of those file positions, with the ability to perform a quick search by one or several keys (in which case there could be several hash tables). Please see: http://en.wikipedia.org/wiki/Hash_table.
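For illustration, such a position index can be built in one pass over the file; treating each line as one record here is a simplifying assumption, since real YAML records span several lines:

#include <cstdio>
#include <unordered_map>

// Scan the file once and remember the byte offset where each record starts,
// so that every later lookup costs a single seek instead of a full parse.
std::unordered_map<int, long> buildOffsetIndex(std::FILE* f)
{
    std::unordered_map<int, long> offsets;
    char line[4096];
    int id = 0;
    long pos = std::ftell(f);
    while (std::fgets(line, sizeof line, f)) {
        offsets[id++] = pos;       // record `id` begins at byte `pos`
        pos = std::ftell(f);
    }
    return offsets;
}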

In the case of hierarchical data like YAML, the search key could be the path of an item in the YAML tree, such as 1-3-203313-9. I hope you understand that this index should be a binary structure, not text. You can also store some metadata in this structure, such as the number of children of each node.

You can store this hash table permanently in another file to be used on the next run; that file should then be kept until you update your training data. If your index entries are binary structures of fixed size, you will not need a secondary index ("the index of the index file"): you can calculate each entry's position directly from the size of its representation in the file.

Good. Now, the next step. Apparently you already have some software interface providing fast access to your YAML structure, or whatever else takes too much space. It could be the usual interface for access to a tree node, or anything similar. Do the following: replace the implementation of this interface with another one that wraps access to the same structure represented as a file. This lets you keep the rest of the code intact.
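A sketch of that interface swap, with hypothetical names (this is not an existing OpenCV API):

#include <vector>

// The rest of the program only ever calls row(y); which backend sits behind
// the interface is invisible to the callers.
struct DescriptorSource {
    virtual ~DescriptorSource() = default;
    virtual std::vector<float> row(int y) = 0;
};
// One implementation wraps the current in-memory matrix; a second one opens
// the binary file and reads a single row per call, as in the element-access
// sketch below.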

[EDIT]

If the data structure in question is a matrix, this indexing problem becomes even simpler, much simpler. Please see my comment on this in the comments to the question.

Basically, any item is indexed by just two indices, Y and X (row, column). Even if matrix elements take different sizes when written to a file, this is not a problem: you can keep an index table, a rank-2 array of file positions. This position table (index table) serves as an additional structure for quickly finding the position in the bigger file from the (Y, X) index. This is very, very simple to do.
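For equal-sized elements the index table disappears entirely, because the file offset of (Y, X) is pure arithmetic. A minimal sketch, assuming the row-major 32-bit-float layout from the conversion step above:

#include <cstdio>

// Read a single matrix element directly from the file: one seek, one read,
// and never more than a few bytes of the matrix held in memory.
float readElement(std::FILE* f, long y, long x, long cols)
{
    float v = 0.f;
    std::fseek(f, (y * cols + x) * (long)sizeof(float), SEEK_SET);
    std::fread(&v, sizeof(float), 1, f);
    return v;
}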

—SA
 
Solution 2
Your descriptor file is so large because each SIFT descriptor value is a floating-point number written out as uncompressed text in the YAML file, while your images are stored compressed and read into memory one by one. Do the arithmetic: 1.4 million rows × 128 columns is about 179 million values; even as raw 32-bit floats in memory that is roughly 717MB, which is right around the ~730MB allocation that fails for you, and the text representation on disk is larger still. Each image generates on the order of a thousand or more features (about 1,400 on average here), so the feature database is bound to be large. The memory issue is usually a very hard problem if you wish to create a scalable system: you must somehow split the matrix into smaller matrices and store those separately on disk. You will then need a search algorithm that automatically loads the right matrix for a given query descriptor at runtime. Such an algorithm should be approximate for speed, for example one based on a modified best-bin-first search. You can also explore locality-sensitive hashing for such a problem. See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
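As a hedged sketch of the approximate-search idea, here is how OpenCV's own FLANN wrapper could be pointed at a (sub-)matrix of descriptors. Randomized kd-trees are a best-bin-first style method; note that the LSH index in OpenCV's FLANN is for binary descriptors, so float SIFT descriptors are better served by kd-trees. The matrix names are placeholders:

#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>

// "descriptors" and "queries" are assumed CV_32F matrices of shape N x 128.
void approximateMatch(const cv::Mat& descriptors, const cv::Mat& queries)
{
    // Randomized kd-trees give a best-bin-first style approximate search.
    cv::flann::Index index(descriptors, cv::flann::KDTreeIndexParams(4));
    cv::Mat indices, dists;
    index.knnSearch(queries, indices, dists, /*knn=*/2,
                    cv::flann::SearchParams(64));
    // indices.row(i) holds the two nearest training descriptors for
    // queries.row(i); apply the usual ratio test on dists.
}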

Hope this helps.
 
