A Simple Example of Scraping a Web Page Using Visual FA

20 Apr 2024 | MIT license | 2 min read
Scraping the web is easy with Visual FA. Here's an example of how.
Here I present a simple example of scraping a web page looking for URLs using the Visual FA engine.

Introduction

I wrote this tip to demonstrate how easy it can be to use Visual FA for tasks like scraping the web. I thought a simple example would help show how to use it.

Background

Visual FA is my lexing/tokenizing engine for C#. It is essentially an augmented regular expression engine. Unlike the one built into .NET, this one is built for performance rather than features, so it doesn't support things like backtracking or capturing. As a result, it also operates more efficiently than .NET's. Furthermore, it can tokenize, whereas .NET's engine is simply a matcher.

Here we use it to scrape the web. This is a very simple use of it; normally you'd be lexing/tokenizing the result instead of doing simple flat matches.

Using the code

The Scrape project is included with Visual FA.

Here's a simple example of pulling all of the URLs from google.com:

C#
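using System;
using System.IO;
using System.Net.Http;
using VisualFA;

// Build a state machine that matches http:// and https:// URLs
var expr = FA.Parse(@"https?\://[^"";\)]+");
var client = new HttpClient();
using (var msg = new HttpRequestMessage(HttpMethod.Get, "https://www.google.com"))
{
    // Send() is the synchronous counterpart of SendAsync()
    using (var resp = client.Send(msg))
    {
        // Read the response body through a TextReader
        using (var reader = new StreamReader(resp.Content.ReadAsStream()))
        {
            foreach (var match in expr.Run(reader))
            {
                // The runner returns all content, so keep only successful matches
                if (match.IsSuccess)
                {
                    Console.WriteLine(match.Value);
                }
            }
        }
    }
}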

This prints every http or https URL on the page to the console. The key step is building a state machine from the regular expression https?\://[^";\)]+, which says: find http:// or https:// and keep matching until we reach a quote, a semicolon, or a closing parenthesis. Once we've called Parse(), we can use the FA instance's Run() method to return a series of FAMatch objects. This can be done over a TextReader, as shown above, or over a string. Lexers like Visual FA's runners return all of the content, matched or not, but we only care about successful matches, so we check the IsSuccess property to decide whether to print the Value.

Obviously, you can do this with .NET's engine, but it requires reading the whole page into memory before matching, and the result will be marginally less performant than doing so with Visual FA. That doesn't really justify using Visual FA in and of itself, but normally you'd be using it to lex content, which I've covered in the Visual FA series.
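
For comparison, here is a minimal sketch of the same flat match using .NET's built-in System.Text.RegularExpressions engine. It assumes the page has already been read into a string, which is the in-memory requirement mentioned above. The `page` literal and the `urls` list are hypothetical stand-ins for illustration and are not part of the Scrape project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same pattern as the Visual FA example, handed to .NET's Regex class
var urlPattern = new Regex(@"https?\://[^"";\)]+");

// Hypothetical stand-in for a downloaded page. With HttpClient you would
// first have to read the entire response body into a string like this one.
var page = "<a href=\"https://example.com/a\">A</a> url(http://example.org/b.png); plain text";

// Regex only reports successful matches, so there is no IsSuccess check here
var urls = new List<string>();
foreach (Match m in urlPattern.Matches(page))
{
    urls.Add(m.Value);
    Console.WriteLine(m.Value);
}
```

Each match stops at the first quote, semicolon, or closing parenthesis, just as described for the Visual FA version, so the sample input yields one URL from the href attribute and one from the url(...) fragment.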

History

  • 21st April, 2024 - Initial submission

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
United States
Just a shiny lil monster. Casts spells in C++. Mostly harmless.

Comments and Discussions

General: My vote of 5
Ștefan-Mihai MOGA, 23-Apr-24 7:03

Question: Nice example, however
Graeme_Grant, 20-Apr-24 23:36
Page scraping is involved. Not all websites like being scraped. I plan to cover this topic in detail, with solutions. But for now, try scraping these two pages:
* sample 1 (access denied unless HTTP/1.1)
* sample 2 (403 unless HTTP/2)
There are others that aren't as friendly.

Use the following with your HttpClient:
C#
using var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url)) { Version = new Version(2, 0) };
request.Headers.TryAddWithoutValidation("Connection", "keep-alive");
request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" ); // "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
request.Headers.TryAddWithoutValidation("Upgrade-Insecure-Requests", "1");
request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
request.Headers.TryAddWithoutValidation("Accept-Language", "en-US;q=0.9");
request.Headers.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");

using var response = await client.SendAsync(request).ConfigureAwait(false);


Graeme


"I fear not the man who has practiced ten thousand kicks one time, but I fear the man that has practiced one kick ten thousand times!" - Bruce Lee

Answer: Re: Nice example, however
honey the codewitch, 21-Apr-24 3:53
