It seems to be largely focused on very recent American politics - well over 10% of entries contain the words "trump", "biden", "obama" or "brandon" (I just learned about this last one a couple of days ago; it's incredibly stupid and esoteric, yet it makes up 3% of all entries).
And just eyeballing through a few entries it's very clear this is only targeted towards American toxicity - "fux news" and Alec Baldwin and Fauci and "LIBFUK DEMOCRAPS". This is not a useful dataset for any project that needs to filter beyond US-centric toxicity (to the extent that a dataset of 1,000 comments is useful at all).
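For anyone who wants to check the keyword share themselves, here is a rough sketch of the kind of count I mean. The sample comments and the function name `keyword_share` are made up for illustration; they are not taken from the repo, and you would need to load the actual entries yourself. It does a case-insensitive substring match, so it will also count words that merely contain a keyword.

```python
import re

# Keywords mentioned above; matching is case-insensitive and substring-based.
KEYWORDS = ("trump", "biden", "obama", "brandon")


def keyword_share(comments, keywords=KEYWORDS):
    """Return the fraction of comments containing at least one keyword."""
    if not comments:
        return 0.0
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = sum(1 for comment in comments if pattern.search(comment))
    return hits / len(comments)


# Hypothetical sample data, NOT real entries from the dataset.
sample = [
    "Let's go Brandon!",
    "I disagree with this take.",
    "TRUMP 2024",
    "Nice weather today.",
]

print(f"{keyword_share(sample):.0%}")  # prints 50%
```

Running the same function once per keyword would give the per-word breakdown (e.g. the share for "brandon" alone).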
Toxicity isn’t defined, and it seems the dataset is mainly populated with inflammatory political statements.
I don’t really think you can judge toxic behavior accurately just by analyzing the words in individual comments. I have doubts that this dataset will be useful for “saving the Internet.” Honestly, I still don’t even think toxic comments are the main thing ruining the Internet right now.
Who on earth has appointed themselves as the arbiter of "toxicity"? Half the comments marked as "toxic" are just political opinions or sentences with bad language in them.
I tend to align with the person who suggested that the word "toxic" was just a secular isomorph of "sinful".
“Largest”
“500 comments”
I could scrape way more of that shit just from under one facebook post.
OTOH, I like how nonpartisan the list is, showing you can have bad apples no matter which color they are.
It’s only 500 entries. One could probably scrape more from controversial reddit comments. Good effort nonetheless.
Woof, maybe don't build a classifier off this. It's very much a snapshot in time sort of thing.
I was really hoping this was about chemistry or biology...
Then again, 4 commits, 0 forks, 3 watchers... I don't really care what these people get their knickers in a twist about. Someone on the internet is offended, more news at 11.
Pretty interesting idea, as when you scan down the list, a theme that emerges is concrete declarative statements which lack originality and sophistication, and include values/tribe-centric epithets, watchwords, and jargon.
I'd be concerned that such a corpus would automatically discriminate against people below a slightly advanced cognitive ability, and against people well above it who use irony. Then again, from the perspective of the makers of this, that may be a feature.
Huh... reading through their dataset of "toxic comments", there's an awful lot of subjectivity and borderline comments that paint an... interesting picture of what they consider to be toxicity. There are definite examples of hate speech in there, but a lot of them are fairly benign too. I have to wonder what an internet without those kinds of comments would look like...
I liked one thing about this:
It is quite inclusive, including slurs against Democrats, Republicans, Covid conspiracy believers and even Christians.
Are certain people using "toxic" as a synonym for "adds virtually nothing intelligent to the conversation"?
GitHub needs a dislike button
Without getting into the specifics of this project, I'd like to make a general observation that I just made on another post today.
Most of us didn't receive any formal education or training in communication or conflict resolution. Yet healthy communication is a skill most of us have to learn, just like reading and writing.
I'd like to suggest that we all consider investing proactively in ourselves, and our teams, to learn and practice healthy communication.
I believe such proactive learning can be a standard part of every team and organization, to reduce both the quantity and intensity of any conflicts. As well, such training provides effective tools for the team to deal with "conflict" in a constructive manner.
https://hbr.org/2005/03/want-collaboration-accept-and-active...
This is obviously going to be subjective. I'd argue that people prone to using the currently popular buzzword "toxic" when referring to opinions hold certain common political and social viewpoints and beliefs. In other words, this is implicitly a partisan (not objective) project that is mostly an attempt to enumerate wrongthink.