Extract the image tag and URL from an RSS feed using Python and the feedparser module

Question

I currently have this code in Python using the feedparser module:

Python

import feedparser

RSS_FEEDS = {'cnn': 'http://rss.cnn.com/rss/edition.rss'}

def get_news_test(publication="cnn"):
    feed = feedparser.parse(RSS_FEEDS[publication])
    articles_cnn = feed['entries']

    for article in articles_cnn:
        print(article)


get_news_test()

This returns the following information (a single iteration):

XML

<item>
            <title>
                <![CDATA[Are China's latest weapons science fiction or battle-ready?]]>
            </title>
            <description>
                <![CDATA[Since the beginning of January, the Chinese military has revealed a dizzying array of sophisticated and powerful new weaponry. ]]>
            </description>
            <link>https://www.cnn.com/2019/01/19/asia/china-new-weapons-2019-intl/index.html</link>
            <guid isPermaLink="true">https://www.cnn.com/2019/01/19/asia/china-new-weapons-2019-intl/index.html</guid>
            <pubDate>Sun, 20 Jan 2019 06:04:16 GMT</pubDate>
            <media:group>
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-super-169.jpg" height="619" width="1100" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-large-11.jpg" height="300" width="300" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-vertical-large-gallery.jpg" height="552" width="414" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-video-synd-2.jpg" height="480" width="640" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-live-video.jpg" height="324" width="576" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-t1-main.jpg" height="250" width="250" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-vertical-gallery.jpg" height="360" width="270" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-story-body.jpg" height="169" width="300" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-t1-main.jpg" height="250" width="250" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-assign.jpg" height="186" width="248" />
                <media:content medium="image" url="https://cdn.cnn.com/cnnnext/dam/assets/190119113947-china-df-26-missile-beijing-hp-video.jpg" height="144" width="256" />
            </media:group>
        </item>

I know I can return some portions of this, for instance, the title by calling:

Python

print(article.title)

Someone said this is JSON data, but I am having a hard time getting at the individual image tags.

What I have tried:

I have tried using <media:content> as a key, but that doesn't work.
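For reference, here is a minimal sketch of one way to reach the image attributes through feedparser itself. It assumes feedparser's Media RSS handling, which (as far as I know) collects the attributes of each <media:content> element into a list of dicts under the entry key media_content rather than under the tag name. The helper names image_urls and print_feed_images are made up for this illustration.

```python
def image_urls(entry):
    """Return the image URLs found in one parsed feed entry."""
    # Assumption: feedparser stores each <media:content> element's attributes
    # as a dict in the list entry["media_content"]. Entries without media
    # lack the key, hence the .get() default.
    return [
        media["url"]
        for media in entry.get("media_content", [])
        if media.get("medium") == "image" and "url" in media
    ]


def print_feed_images(feed_url):
    """Fetch a feed and print each entry's title with its image URLs."""
    # Imported here so the helper above can be used without feedparser.
    import feedparser  # third-party: pip install feedparser

    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        print(entry.get("title", "(no title)"))
        for url in image_urls(entry):
            print("   ", url)
```

Calling print_feed_images('http://rss.cnn.com/rss/edition.rss') would then list the URLs per article. Attribute values such as height and width arrive as strings, so convert them with int() before comparing sizes.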

Posted 19-Jan-19 21:11pm

Member 14123629

Updated 19-Jan-19 21:44pm

v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard MacCutchan · Answer 1 · 2019-01-19T21:45:00

Solution 1

It is not JSON, it is XML. See the "XML Processing Modules" page in the Python 3.7.2 documentation.
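As a rough illustration of what those modules offer, the sketch below uses xml.etree.ElementTree from the standard library on a small hand-written sample. The sample feed, the example.com URLs, and the function name item_images are invented for the example; the namespace URI is the one the media: prefix is normally bound to in Media RSS feeds, and a real feed should be checked for its own xmlns:media declaration.

```python
import xml.etree.ElementTree as ET

# Media RSS namespace that the "media:" prefix is normally bound to.
NAMESPACES = {"media": "http://search.yahoo.com/mrss/"}

# Hand-written sample shaped like the item in the question (not real data).
SAMPLE_RSS = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example item</title>
      <media:group>
        <media:content medium="image" url="https://example.com/large.jpg" height="619" width="1100" />
        <media:content medium="image" url="https://example.com/small.jpg" height="300" width="300" />
      </media:group>
    </item>
  </channel>
</rss>"""


def item_images(xml_text):
    """Return a list of (title, [attribute dicts]) pairs, one per <item>."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        # ".//" searches all descendants, so the <media:group> wrapper
        # does not need to be named explicitly.
        images = [
            dict(content.attrib)
            for content in item.findall(".//media:content", NAMESPACES)
        ]
        results.append((title, images))
    return results


for title, images in item_images(SAMPLE_RSS):
    print(title)
    for image in images:
        print("   ", image["url"], image["width"], "x", image["height"])
```

Each image is a plain dict of the tag's attributes, so an individual value is read with image["url"], image["height"], and so on.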

Posted 19-Jan-19 21:45pm

Richard MacCutchan

Comments

Member 14123629 20-Jan-19 4:44am

Thanks! I did this and can get a list of the image URLs, but I still don't know how to get to the individual elements. :(

from bs4 import BeautifulSoup
import requests

# Download the raw feed XML
source = requests.get('http://rss.cnn.com/rss/edition.rss')

# Parse as XML (this parser requires lxml to be installed)
soup = BeautifulSoup(source.text, 'xml')

# Print every <media:content> tag in the feed
for url in soup.find_all("media:content"):
    print(url)

Richard MacCutchan 20-Jan-19 6:49am

Sorry, but I do not know BeautifulSoup. Try a Google search to find sample code.