To help readability I tend to do something like this:
f1 = (df["col1"] == condition1)
f2 = (df["col2"] == condition2)
df[f1 & f2]
This is equivalent to the 'pandas boolean indexing multiple conditions' method.
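A self-contained version of the named-mask approach, with made-up column names and conditions just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "a", "c"],
    "col2": [1, 2, 1, 1],
})

# Name each mask so the final filter reads like a sentence
f1 = df["col1"] == "a"
f2 = df["col2"] == 1
result = df[f1 & f2]
```

The named masks also compose nicely: `df[f1 | f2]` or `df[f1 & ~f2]` stay readable where the inlined version would not.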
This is probably gonna be sacrilege to the Pythonians, but I often wish there was support for some SQL-like syntax when working with (pandas) data frames. It certainly would make the process a lot smoother for some tasks.
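pandas does ship a partial nod in this direction: `DataFrame.query()` accepts a WHERE-clause-like expression string. A minimal sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [120, 80, 150],
    "age": [45, 30, 62],
})

# Roughly: SELECT * FROM df WHERE salary >= 100 AND age < 60
out = df.query("salary >= 100 and age < 60")
```

For actual SQL, libraries like DuckDB can run SELECT statements directly against in-memory pandas DataFrames, though that is a heavier dependency.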
This seems to be about doing filtering with pandas, not pure Python. The title should probably be changed to reflect this.
Yeah, it looks like the code this person uploaded isn't escaping the HTML, or it's being unescaped when it should stay escaped.
df.loc[
    (df['Salary_in_1000'] >= 100)
    & (df['Age'] < 60)
    & (df['FT_Team'].str.startswith('S')),
    ['Name', 'FT_Team']
]
Would have been nice to see a performance comparison, or at least which style is recommended.
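A rough way to run that comparison yourself (the data, thresholds, and repeat count here are arbitrary; results vary with DataFrame size and dtypes):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, 100_000),
    "b": rng.integers(0, 100, 100_000),
})

bracket = lambda: df[(df["a"] > 50) & (df["b"] < 25)]
loc     = lambda: df.loc[(df["a"] > 50) & (df["b"] < 25)]
query   = lambda: df.query("a > 50 and b < 25")

# Sanity-check that all three approaches agree before timing them
assert bracket().equals(loc()) and bracket().equals(query())

for name, fn in [("bracket", bracket), ("loc", loc), ("query", query)]:
    print(name, timeit.timeit(fn, number=20))
```

On small frames, `query()` tends to lose because of expression-parsing overhead; on large ones it can win if the optional `numexpr` backend is installed.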
One thing that really surprises me: NONE of these methods work with grouped DataFrames.
But grouping data is extremely common in data analysis.
Basically, the strategy with grouped data is taking the loc approach and sprinkling in a bunch of additional .transform calls. :/
I strongly prefer .query() for legibility and because it can be used in a pipe. My only problem is that flake8 often won't detect the use of a variable inside the query string. Has anyone else come across this?
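The flake8 issue presumably comes from the `@variable` syntax: the reference lives inside a string literal, so static analysis sees the variable as unused. A sketch of both the pipe usage and the `@` reference (column names invented):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 70]})

cutoff = 60  # flake8 may flag this as unused: it only appears inside the string

out = (
    df
    .query("age < @cutoff")            # @cutoff pulls in the local variable
    .assign(decade=lambda d: d["age"] // 10)
)
```

One workaround is interpolating with an f-string (`df.query(f"age < {cutoff}")`) so the linter sees the reference, though that gives up the safer `@`-style lookup; a `# noqa` comment is the blunter option.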
A speed comparison on a larger dataset would be interesting.
Title should probably clarify that this is with Pandas, that's much more specific and less generally useful than "in Python".
Original title: "Pandas dataframe filter with Multiple conditions"