I am new to Python. I want to extract the names of the categories and pages (the category tree) of a Wikipedia category by crawling it. While running my crawler I get the following error, and I have not been able to resolve it. Any help is greatly appreciated.

Traceback (most recent call last):
  File "C:\Users\SIBA\Desktop\PDF\Code\", line 100, in <module>
    printTree(name, 0)
  File "C:\Users\SIBA\Desktop\PDF\Code\", line 80, in printTree
    content = open("categories/Category:"+catName+".html").readlines()
FileNotFoundError: [Errno 2] No such file or directory: 'categories/Category:Cricket.html'

The code I have tried is shown below. I am using Python 3.6.

What I have tried:

import httplib2
from bs4 import BeautifulSoup
import subprocess
import time
import os,sys

catRoot = ""
done = []
ignore = []
# Removes all newline characters and replaces with spaces
def removeNewLines(in_text):
return in_text.replace('\n', ' ')

# Downloads a link into the destination
def download(link, dest):
# print link
if not os.path.exists(dest) or os.path.getsize(dest) == 0:
subprocess.getoutput('wget "' + link + '" -O "' + dest+ '"')
print ("Downloading")

def ensureDir(f):
    if not os.path.exists(f):

# Cleans a text by removing tags
def clean(in_text):
s_list = list(in_text)
i,j = 0,0
while i < len(s_list):
    # iterate until a left-angle bracket is found
    if s_list[i] == '<':
        if s_list[i+1] == 'b' and s_list[i+2] == 'r' and s_list[i+3] == '>':
            print (hello)
        while s_list[i] != '>':
            # pop everything from the the left-angle bracket until the right-angle bracket
        # pops the right-angle bracket, too

    elif s_list[i] == '\n':

# convert the list back into text
return (join_char.join(s_list))#.replace("<br>","\n")

# Gets bullets
def getBullets(content):
    mainSoup = BeautifulSoup(contents)

# Gets empty bullets
def getAllBullets(content):
mainSoup = BeautifulSoup(str(content))
subcategories = mainSoup.findAll('div',attrs={"class" :  "CategoryTreeItem"})
empty = []
full = []
for x in subcategories:
    subSoup = BeautifulSoup(str(x))
    link = str(subSoup.findAll('a')[0])
    if (str(x)).count("CategoryTreeEmptyBullet") > 0:
        empty.append(clean(link).replace(" ","_"))
    elif (str(x)).count("CategoryTreeBullet") > 0:
        full.append(clean(link).replace(" ","_"))


def printTree(catName, count):
catName = catName.replace("\\'","'")
if count == MAX_DEPTH: return
   download(catRoot+catName, path)
content = ("Category:"+catName+".html")
(emptyBullets,fullBullets) = getAllBullets(content)

for x in emptyBullets:
    for i in range(count): print ("  "),
    download(catRoot+x, "categories/Category:"+x+".html")
    print (x)

for x in fullBullets:
    for i in range(count): print ("  "),
    print (x)
    if x in done:
        print ("Done... "+x)
    try: printTree(x, count + 1)
    except: print ("ERROR: " + x)

name = "Cricket"
printTree(name, 0)
Updated 11-Feb-20 19:11pm

1 solution

The error message is quite clear: The mentioned file does not exist.

However, the posted code has indentation errors and does not match the line quoted in the error message, so it is nearly impossible to diagnose the problem from the posted code alone.

In any case, you can check whether the file exists before trying to open it and act accordingly.
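A minimal sketch of such a check follows. The helper name read_category_lines is hypothetical (it is not part of the posted code), and treating an empty file as missing is an assumption borrowed from the size test in the asker's download function.

```python
import os

def read_category_lines(path):
    """Return the lines of the file at `path`, or None if it is missing or empty.

    Hypothetical helper: the name and the None return value are illustrative.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        # Report the problem instead of letting open() raise FileNotFoundError.
        print("Missing or empty file: " + path)
        return None
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()
```

The caller can then skip a category, or retry the download, when the result is None instead of crashing with a traceback.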

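Note also that using relative paths is error-prone: they are resolved against the current working directory, which is not necessarily the directory that contains your script.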
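One way to avoid depending on the working directory is to build an absolute path from a known base directory. The sketch below is only an illustration: category_path is a hypothetical helper, and the "categories" folder name is taken from the question. It also replaces the colon in the file name with an underscore, on the assumption that the script runs on Windows (as the traceback path suggests), where ':' is not permitted in ordinary file names.

```python
import os

def category_path(base_dir, cat_name):
    """Build an absolute path for a cached category page under base_dir/categories.

    Hypothetical helper; the "Category_" prefix is an assumption that avoids
    the ':' character, which Windows does not allow in ordinary file names.
    """
    file_name = "Category_" + cat_name + ".html"
    return os.path.abspath(os.path.join(base_dir, "categories", file_name))

# Typical use inside a script file (not executed here):
# base_dir = os.path.dirname(os.path.abspath(__file__))
# path = category_path(base_dir, "Cricket")
```

The same function should be used both when downloading and when opening the file, so that the two paths cannot drift apart.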

Finally, you should check whether the wget tool ran successfully. If it did not (for example, because wget is not installed or the target directory does not exist), the file is not created.
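A rough sketch of that check, assuming the download is started with subprocess.run rather than subprocess.getoutput so that the exit code is available. The function name run_download and its return convention are hypothetical, and the exact wget arguments are copied from the question rather than verified against your setup.

```python
import os
import subprocess

def run_download(command, dest):
    """Run a download command and report whether it produced a non-empty file.

    Hypothetical helper: `command` is an argument list such as
    ["wget", link, "-O", dest], and `dest` is the expected output file.
    """
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except FileNotFoundError:
        # The tool itself (e.g. wget) is not installed or not on PATH.
        return False
    if result.returncode != 0:
        # The tool ran but reported a failure.
        return False
    # A successful exit code should still be backed by an actual file.
    return os.path.isfile(dest) and os.path.getsize(dest) > 0

# Example call (link and dest are placeholders):
# ok = run_download(["wget", link, "-O", dest], dest)
```

Passing the command as a list avoids the manual quoting used in the original string concatenation, and a False result tells the caller not to open the file.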

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
