Many times I found myself sorting through my Opera cache (Opera is a fantastic web browser, for those who don't know it), keeping some HTML files in a safe place and deleting other useless ones. In the files I kept, I often had to search for URLs and retrieve the associated files. It was a huge task, even though Opera manages downloads nicely. So I created HTML files listing all the URLs, then used a web grabber to retrieve them in the background.
But compiling all the files, looking for all the URLs, and creating a single HTML file still had to be done by hand. So I created this little tool to look for a specific pattern in a set of files and output the URLs to multiple files or to a single one. After that, compiling stuff became really easy ;)
The tool lets you fetch only the wanted part of the text, insert that text one or more times into another text, include the unmatched text, and so on, so you can do almost anything with it ;)
Using the tool
Imagine that an HTML file contains, between the mess and the ads, links such as these:
<br /><img src="http://galleries.amberlace.com/ndnikki1/pics/01.jpg">
<br /><img src="http://galleries.amberlace.com/ndnikki1/pics/02.jpg">
<br /><img src="http://galleries.amberlace.com/ndnikki1/pics/03.jpg">
<br /><img src="http://galleries.amberlace.com/ndnikki1/pics/04.jpg">
Now you want to save them cleanly in the same file without the mess and the ads, or compile every URL you can find into one big file, in order to get this :
Here is the explanation, step by step and section by section :
Source : Source folder that contains the files. Sub-directories are currently not supported.
Extension : File extensions to process, separated by semicolons. If empty, every file in the source folder is processed.
Destination : Destination folder where the modified files are written. If the path ends with a filename, everything extracted is written to that single file.
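To make the three settings above more concrete, here is a minimal Python sketch of how the file selection could work. It is not the tool's actual code: the function names `select_files` and `destination_for` are hypothetical, and the rule "a destination with a file suffix means a single output file" is only an assumption about how "ends with a filename" might be detected.

```python
from pathlib import Path


def select_files(source, extensions=""):
    """List the files directly inside `source` that match `extensions`.

    `extensions` is a semicolon-separated string such as "htm;html".
    An empty string selects every file. Sub-directories are not visited.
    """
    wanted = {
        ext.strip().lstrip("*").lstrip(".").lower()
        for ext in extensions.split(";")
        if ext.strip()
    }
    files = []
    for entry in sorted(Path(source).iterdir()):
        if not entry.is_file():
            continue  # skip sub-directories: no recursion
        if not wanted or entry.suffix.lstrip(".").lower() in wanted:
            files.append(entry)
    return files


def destination_for(source_file, destination):
    """Pick the output path for one source file.

    Assumption: a destination that has a file suffix is a single output
    file shared by all sources; otherwise it is a folder and each source
    file keeps its own name inside it.
    """
    dest = Path(destination)
    if dest.suffix:
        return dest
    return dest / Path(source_file).name
```

For instance, `select_files("cache", "htm;html")` would return only the `.htm` and `.html` files found directly in a folder named `cache`, ignoring anything in its sub-folders.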
Here comes the most interesting part, but be careful :
Start : The beginning of the text to look for. In the previous example/pattern, it would be
Body : Name of the body part you want to keep and write to the destination file. Leave it as body if you want; it is not that important...
End : The end of the text that encloses the body. In the previous example, it would be whatever follows the interesting URL, such
:<br /><img src=
Replace : The replacement line, including the retrieved body. To include the body, write the body name preceded by a percent sign. Hence, to create a valid URL, write
Include unprocessed text : Include in the output file the text found before the start and after the end of each processed match. This way you can simply modify or clean a file ;)
Add end of line : Add an end-of-line after each match. Useful to create one line per URL.
Header : Header of the new file.
Tail : Tail of the new file.
!!! : Let's GO !
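The options above can be sketched in a few lines of Python. This is only an illustration of the start/body/end idea, not the tool's real implementation: the names `extract` and `build_output` are hypothetical, and the start and end markers used in the usage note (`<br /><img src="` and `">`) are assumptions based on the sample links, as is the `example.com` URL.

```python
def extract(text, start, end, replace, body_name="body",
            include_unprocessed=False, add_eol=False):
    """Find every start...end span and emit `replace` for each one.

    Each occurrence of "%" + body_name in `replace` is substituted with
    the text found between `start` and `end`.
    """
    if not start or not end:
        raise ValueError("start and end must not be empty")
    placeholder = "%" + body_name
    out = []
    pos = 0
    while True:
        s = text.find(start, pos)
        if s == -1:
            break
        body_start = s + len(start)
        e = text.find(end, body_start)
        if e == -1:
            break  # a start without a matching end is left unprocessed
        if include_unprocessed:
            out.append(text[pos:s])  # text before this match
        out.append(replace.replace(placeholder, text[body_start:e]))
        if add_eol:
            out.append("\n")
        pos = e + len(end)
    if include_unprocessed:
        out.append(text[pos:])  # text after the last match
    return "".join(out)


def build_output(text, header="", tail="", **options):
    """Wrap the extracted text between the header and the tail."""
    return header + extract(text, **options) + tail
```

With a page containing `<br /><img src="http://example.com/01.jpg">`, calling `extract(page, '<br /><img src="', '">', '<a href="%body">%body</a>', add_eol=True)` would produce one `<a href=...>` line per matched URL. Using the body twice in the replacement shows how the matched text can be inserted more than once.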
None yet ;)
- First, the tool IS NOT Unicode compliant. It only processes single-byte character sets.
- Second, apologies to anyone shocked by the links I gave as an example. They're just cute ;)
- Third, this is provided as is. I'll make upgrades as needed. But feel free to modify the tool for your own use.
- Fourth, this tool is far more useful than you might expect. It still lacks a way to save its configuration, which would let you reload a parameter set to process another batch of files.