Web Scraping using JSOUP

Question

I am using the JSOUP API to scrape the contents of a web page. Below are the Java source code I have tried and the HTML source of the page.

Java

Document document = Jsoup
.connect(
"http://sports.williamhill.com/bet/en-gb/betting/e/4509085/St-Patricks-v-Limerick.html")
.timeout(1000 * 1000).get();
Elements content = document.select("div.livePushContent");
Elements info = content.select("div#allMarketsTab");
Elements primerycollect = info.select("div#primaryCollectionContainer");
 
System.out.println("1..............................................");
System.out.println("No of Tab's: "
+ info.select("div#primaryCollectionContainer")
.select("div.marketHolderExpanded").size());
Elements market = primerycollect.select("table.tableData");
Elements tbody = market.select("tbody");
for (int i = 0; i < info.select("div#primaryCollectionContainer")
.select("div.marketHolderExpanded").size(); i++) {
    int a = 1;
    a+=i;
    System.out.println("Tab No"+a+": "+info.select("div#primaryCollectionContainer")
    .select("div.marketHolderExpanded").get(i)
    .select("table.tableData").select("thead").select("tr")
    .select("th[class~=leftPad title]").select("span").last().text());
}
 
Elements primerycollect1 = info.select("div#sur_collection_267");
System.out.println(".............................................."); 
System.out.println("2..............................................");
System.out.println("No of Tab's: "
+ info.select("div#sur_collection_267")
.select("div.marketHolderExpanded").size());
Elements market1 = primerycollect1.select("table.tableData");
Elements tbody1 = market1.select("tbody");
for (int i = 0; i < info.select("div#sur_collection_267")
.select("div.marketHolderExpanded").size(); i++) {
    System.out.println("Tab No"+i+++": "+info.select("div#sur_collection_267")
    .select("div.marketHolderExpanded")
    .select("table.tableData")
    .select("thead").select("tr")
    .select("th[class~=leftPad title]").select("span").last().text());
}

System.out.println("3..............................................");
System.out.println("No of Tab's: "
+ info.select("div#sur_collection_25")
.select("div#collection25").size());
Elements primerycollect2 = info.select("div#sur_collection_25");
Elements market2 = primerycollect2.select("table.tableData");
Elements tbody2 = market2.select("tbody");
for (int i = 0; i < info.select("div#sur_collection_25")
.select("div#collection25").size(); i++) {
    System.out.println("Tab No"+i+": "+info.select("div#sur_collection_25")
    .select("div.collectionContainer displayBlock displayNone")
    .select("div.marketHolderCollapsed").get(i)
    .select("table.tableData").select("thead").select("tr")
    .select("th[class~=leftPad title]").select("span").last().text());
}
 
System.out.println("..............................................");

The output I am getting is:

1..............................................
No of Tab's: 10
Tab No1: Match Betting
Tab No2: Correct Score
Tab No3: Double Result
Tab No4: Draw No Bet
Tab No5: Match Handicaps
Tab No6: Double Chance
Tab No7: Both Teams To Score
Tab No8: 1st Half Betting
Tab No9: 2nd Half Betting
Tab No10: Total Match Goals Under/Over
..............................................
2..............................................
No of Tab's: 1
Tab No0: GOAL scored in the first 5 minutes? 00:00 - 04:59
3..............................................
No of Tab's: 1
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at org.jsoup.select.Elements.get(Elements.java:523)
at com.yotechnologies.scraper.williamhill.test.Trial3.main(Trial3.java:67)

Here is the HTML source that I am not able to scrape:

Goals

Goals

Show All 30 Markets

My actual requirement is to scrape all the tab names on the page, such as Match Betting, Correct Score, etc. The attached HTML source has 52 tabs. The problem occurs in the third part of the code, as you can see in the output. It should return St Patricks To Score Both Halves, Limerick To Score Both Halves, etc. Instead the selection comes back empty and the exception shown above is thrown, so I am not able to scrape these tabs. I want to know the reason behind that. Please help me. The site is updated frequently, so if the page no longer has any tabs, please let me know and I will post a different URL.

Posted 3-Jun-13 19:41pm

shanmuga1509

Updated 3-Jun-13 21:51pm

v3
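
A possible explanation for the empty third section, offered as a sketch rather than a confirmed diagnosis: in a jsoup selector a space is the descendant combinator, so "div.collectionContainer displayBlock displayNone" looks for a <displayNone> tag inside a <displayBlock> tag inside the div, matches nothing, and get(i) then fails on an empty list. Several classes on the same element are chained with dots instead. The markup below is hypothetical (the real page source is not reproduced here); it only reuses the ids and class names from the selectors in the question, and the class name TabNameSketch and its two helper methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TabNameSketch {

    // Hypothetical markup: an assumption that mimics the ids and class names
    // used in the question, not the actual page source.
    static final String SAMPLE_HTML =
          "<div id='sur_collection_25'>"
        + "<div id='collection25' class='collectionContainer displayBlock'>"
        + "<div class='marketHolderCollapsed'><table class='tableData'><thead><tr>"
        + "<th class='leftPad title'><span>+</span>"
        + "<span>St Patricks To Score Both Halves</span></th>"
        + "</tr></thead></table></div>"
        + "<div class='marketHolderCollapsed'><table class='tableData'><thead><tr>"
        + "<th class='leftPad title'><span>+</span>"
        + "<span>Limerick To Score Both Halves</span></th>"
        + "</tr></thead></table></div>"
        + "</div></div>";

    // Number of elements a CSS query matches in the given HTML.
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Collects the title of every market holder (expanded or collapsed)
    // inside the container, without indexing into a possibly empty list.
    static List<String> tabNames(String html, String containerQuery) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element holder : doc.select(containerQuery).select("div[class^=marketHolder]")) {
            // last() returns null when nothing matched, so check before use.
            Element title = holder.select("th.leftPad.title span").last();
            if (title != null) {
                names.add(title.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Space-separated words after the class are treated as tag names.
        System.out.println(countMatches(SAMPLE_HTML,
                "div.collectionContainer displayBlock displayNone"));
        // Dots chain several classes on the same element.
        System.out.println(countMatches(SAMPLE_HTML,
                "div.collectionContainer.displayBlock"));
        for (String name : tabNames(SAMPLE_HTML, "div#sur_collection_25")) {
            System.out.println("Tab: " + name);
        }
    }
}
```

Iterating over the matched elements with a for-each loop, rather than computing the loop bound from one query and calling get(i) on another, also avoids the IndexOutOfBoundsException when the two queries return different numbers of elements. If the live page fills these sections with JavaScript after loading, the elements may not be in the HTML that jsoup downloads at all; that would need to be checked against the raw response.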

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)