Filtering rows in pandas dataframe considering day and month

Question

0.00/5 (No votes)

See more:

Hi there I have the Following Code, which I run in Jupyter Notebook :-

<pre>import pandas as pd
import requests
from bs4 import BeautifulSoup
   
#res = requests.get("http://web.archive.org/web/20011108193342/http://www.raf.mod.uk/bbmf/calendar.html")     
res = requests.get("http://web.archive.org/web/20041020000138/http://www.raf.mod.uk/bbmf/displaydates.html")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table', align="CENTER")[0]
df = pd.read_html(str(table))
   
df = df[0]
  
  
##################
##################
##################
  
  
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
 
 
#make df[0] to list
list=[]
for i in df[0]:
    list.append(i)
  
#reverse the list to make split to sublist easier
list.reverse()
  
#split list to sublist using condition len(val)> 2 
size = len(list) 
idx_list = [idx + 1 for idx, val in
            enumerate(list) if len(val) > 2] 
res = [list[i: j] for i, j in
        zip([0] + idx_list, idx_list + 
        ([size] if idx_list[-1] != size else []))] 
  
#make monthname to numbers and print
for i in res:
    for j in range(len(i)):
        if i[j].upper()=='JUNE':
            i[j]='6'
        elif i[j].upper() =='MAY':
            i[j]='5'
        elif i[j].upper() == 'APRIL':
            i[j]='4'
        elif i[j].upper() =='JANUARY':
            i[j]='1'
        elif i[j].upper() == 'FEBRUARY':
            i[j]='2'
        elif i[j].upper() =='MARCH':
            i[j]='3'
        elif i[j].upper() == 'JULY':
            i[j]='7'        
        elif i[j].upper() =='AUGUST':
            i[j]='8'
        elif i[j].upper() == 'SEPTEMBER':
            i[j]='9'
        elif i[j].upper() =='OCTOBER':
            i[j]='10'
        elif i[j].upper() == 'NOVEMBER':
            i[j]='11'
        elif i[j].upper() =='DECEMBER':
            i[j]='12'       
  
  
#append string and append to new list
finallist=[]
for i in res:
    for j in range(len(i)):
        if j < len(i) - 1:
            #print(f'2004-{i[-1]}-{i[j]}')
            finallist.append(f'2004-{i[-1]}-{i[j]}')
#print(finallist)
finallist.reverse()
  
  
  
  
#print("\n=== ORIGINAL DF ===\n")
#print(df)
  
  
  
#convert dataframe to list
listtemp1=df.values.tolist()
  
#replace found below values with 0000_removable
removelist=['LOCATION','LANCASTER','SPITFIRE','HURRICANE','DAKOTA','DATE','JUNE','JANUARY','FEBRUARY','MARCH','MAY','JULY','AUGUST','SEPTEMBER','OCTOBER','NOVEMBER','DECEMBER','APRIL']
for i in listtemp1:
    for j in range(len(i)):
        for place in removelist:
            if str(i[j]).upper()==place:
                i[j]='0000_removable'
            else:
                pass
  
                  
#remove sublists with the replaced values we redirected
dellist=['0000_removable', '0000_removable', '0000_removable', '0000_removable', '0000_removable', '0000_removable']
res = [i for i in listtemp1 if i != dellist]
  
#assign back to dataframe DF3
df3=pd.DataFrame()
df3=pd.DataFrame(res, columns=['Date','LOCATION','LANCASTER','SPITFIRE','HURRICANE','DAKOTA'])
#print("\n=== AFTER REMOVE month and column names from DF, assigned to new as DF3 ===\n")
#print(df3)
  
  
#now assign that sorted date list to dataframe DF3
idx = 0
df3.insert(loc=idx, column='DATE', value=finallist)
pd.options.display.max_rows = 500
  
#print("\n=== FINAL DF3 after joining the edited date format column list ===\n")
#print(df3)
  
  
#validation logic if needed compare processed date from new joined "edited_Date_format" column with already existing "Date" column
#df3['ED1']=  pd.to_datetime(df3['EDITED_DATE_FORMAT'],format='%Y-%m-%d').dt.day
#df3['validation of date'] = df3.apply(lambda x: str(x['ED1']) == x['Date'], axis=1)
  
  
#convert df3['EDITED_DATE_FORMAT'] column from object to datetime64 foramt
#df3['EDITED_DATE_FORMAT']= pd.to_datetime(df3['EDITED_DATE_FORMAT'],format='%Y-%m-%d')
  
  
  
##################
##################
##################
  
  
 
 
  
 
#df3 = df3.rename(columns=df.iloc[0])
#df3 = df.iloc[2:]
#df3.head(15)
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 1000

df3['LANCASTER'] = df3['LANCASTER'].replace({'X': 'L'})
df3['HURRICANE'] = df3['HURRICANE'].replace({'X': 'H'})
df3['SPITFIRE'] = df3['SPITFIRE'].replace({'X': 'S'})
df3['SPITFIRE'] = df3['SPITFIRE'].replace({'X x 2': 'SS'})
df3['DAKOTA'] = df3['DAKOTA'].replace({'X': 'D'})


#display = df3[(df3['LOCATION'].str.contains('[a-zA-Z]')) & (df3['DAKOTA'].str.contains('X')) & (df3['SPITFIRE'].str.contains('X', na=True)) & (df3['LANCASTER'] != 'X')]    
#display = df3[(df3['LOCATION'].str.contains('[a-zA-Z]')) & (df3['DAKOTA'].str.contains('D')) & (df3['SPITFIRE'].str.contains('S', na=True)) & (df3['LANCASTER'] != 'L')]    
    
display = df3[(df3['LOCATION'].str.contains('[a-zA-Z]')) & (df3['DATE'].str.contains('-10$|15$')) & (df3['DAKOTA'].str.contains('D')) & (df3['SPITFIRE'].str.contains('S', na=True)) & (df3['LANCASTER'] != 'L')]    
    
#Months = May Jun Jul Aug Sep
#Months = -5- -6- -7- -8- -9-

#print(display) 

display['DATE']= pd.to_datetime(display['DATE'],format='%Y-%m-%d')
display['DATE']= pd.to_datetime(display['DATE']).dt.strftime('%m-%d-%Y')
##added two lines above to convert date format

display=display.rename_axis('MyIdx')
display=display.sort_values(by=['DATE','MyIdx'])
display['DATE']= pd.to_datetime(display['DATE']).dt.strftime('%d-%b-%Y')
#display.drop_duplicates(subset=['LOCATION', 'DATE'], keep='last', inplace=True)

#display.drop('LANCASTER', axis=1, inplace=True)
#display.drop('Date', axis=1, inplace=True)
display=display.dropna(subset=['SPITFIRE', 'HURRICANE'], how='all')
display=display[['LOCATION','DATE','DAKOTA','HURRICANE','SPITFIRE']]
display=display.fillna('--')
#display.reset_index(drop=True, inplace=True)
display.to_csv(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF-2004-Code (Dakota With Fighters).csv')


display

display.sort_values(by=['DATE'])

#print(display)

And when I include that last line of Code, it correctly outputs the earliest days in the DataFrame Output first, i.e. 10 before 15, but not in the month order I want :-

LOCATION DATE DAKOTA HURRICANE SPITFIRE
MyIdx
176 Duxford 10-Jul-2004 D H S
177 Cirencester 10-Jul-2004 D H S
178 Brize Norton 10-Jul-2004 D H S
74 Shrivenham 20:00 10-Jun-2004 D H S
257 Campbletown 15-Aug-2004 D -- S
258 Sunderland 15-Aug-2004 D -- S
261 Scampton 15-Aug-2004 D -- S
200 RIAT Fairford 15-Jul-2004 D -- SS
22 Tilford 15-May-2004 D -- S
23 Abingdon 15-May-2004 D -- S
24 Hyde Heath Village 15-May-2004 D -- S

I want 10th June 2004 first then the 10th of July/s then the 15th of May's then the 15th of July Rows, then the 15th August Rows. How do I modify that line of Code, so that I can filter to get that order, without changing the index position of the Rows via code, which I know how to do ?

I mean add something to the first line of code e.g. the displays.sort_values Line, so that the Earlier month with a day, is shown 'favoured' before the later month with the same day ? i.e. 10-Jun-2004 is shown before 10-Jul-2004 , 15-May-2004 is shown before 15-Jul-2004 Rows etc. But still dates with day 10, showing before day 15 Rows.

(df3['DATE'].str.contains('-10$|15$'))

Was the part of the Code, I use to filter the days of the month, I want included
in the DataFrame Output.

Any help would be much appreciated.

Eddie Winch

What I have tried:

I run the Code as shown above, and get the stated Output.

Posted 20-Aug-20 3:23am

Eddie Winch

Updated 21-Aug-20 7:14am

v2

Add a Solution

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2020-08-20T03:46:00

Solution 1

Parse each line, and convert the date information to a datetime — Basic date and time types — Python 3.8.5 documentation[^]
Those should sort "properly" where strings sort by finding the first different character pair and basing the whole comparison on the difference between those, ignoring the rest of the string completely.

Posted 20-Aug-20 3:46am

OriginalGriff

v2

Comments

Eddie Winch 20-Aug-20 9:59am

Hi OriginalGriff,

Sorry could you explain how to do that ? I am not sure where to modify the Code exactly, many thanks for your help.

Regards

Eddie Winch

OriginalGriff 20-Aug-20 10:50am

Which bit is giving you problems?

Eddie Winch 20-Aug-20 10:56am

I think I just need to modify the Line of Code :- display.sort_values(by=['DATE'])

and or the Line of Code :- (df3['DATE'].str.contains('-10$|15$')) to get the answer I want, could you suggest what I should add or change in those lines of Code, to get the DataFrame Output I want ?

Best Regards

Eddie

OriginalGriff 20-Aug-20 11:22am

Read what I said again.

Eddie Winch 20-Aug-20 11:32am

I am not sure what parsing each line means, I should be able to convert the date information to datetime though.

OriginalGriff 20-Aug-20 11:56am

"Parsing" means looking at each line and breaking it down into it's constituent parts:
200 RIAT Fairford 15-Jul-2004 D -- SS
Parts:
200
RIAT FairFord
15-Jul-2004
D
--
SS

One of those is the part you want - the date - so find it (and remember that the second field appears to be single or multiple words so you can't just pick out "word 3", you have to be a little cleverer than that.
you have it, you can convert it to a date.

Eddie Winch 21-Aug-20 13:12pm

Hi OriginalGriff,

I have added the following lines to my Code :-

display['DATE'] = pd.to_datetime(display['DATE'])

display = display.sort_values(by='DATE', key=lambda col: 100 * col.dt.day + col.dt.month)

display['DATE']= pd.to_datetime(display['DATE']).dt.strftime('%d-%b-%Y')

display

And after upgrading to the latest Version of Pandas, which accepts sort_values with a Key argument, when I run my full Code, I get the DataFrame Output I want.

Many thanks for all your help ))

Best Regards

Eddie Winch

OriginalGriff 22-Aug-20 1:57am

You're welcome!