Show HN: Fast HTML Sanitization in Python

  • Author here.

    Recently I was looking for a way to strip malicious content, such as JavaScript, from user-generated HTML.

    Solutions like bleach, html_sanitizer, and lxml's Cleaner all work, but I found their performance on complicated HTML snippets lacking because they rely on html5lib to parse HTML5. Without html5lib, completely normal content would get mangled.

    I ended up writing these Python bindings to the bluemonday library[1]. They seem to perform much better than existing Python solutions for the same problem[2]. I suspect this is because more of the work can be done in native code instead of having to pass an XML tree around.
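
    For anyone who wants to check the speed difference on their own content, here is a minimal timing sketch. It is not the benchmark from [2]: the helper name time_sanitizers is hypothetical, and html.escape is only a standard-library stand-in (it escapes everything rather than sanitizing) so the snippet runs without extra installs. Swap in the sanitize callables of the libraries you want to compare.

```python
import html
import timeit
from typing import Callable, Dict


def time_sanitizers(
    sanitizers: Dict[str, Callable[[str], str]],
    document: str,
    repeat: int = 5,
    number: int = 20,
) -> Dict[str, float]:
    """Return the best observed seconds-per-call for each sanitizer.

    `sanitizers` maps a label to any callable that takes an HTML string
    and returns a string. Hypothetical helper, for illustration only.
    """
    results = {}
    for name, fn in sanitizers.items():
        # Run `number` calls per trial and keep the fastest of `repeat`
        # trials, which reduces noise from other processes.
        trials = timeit.repeat(lambda fn=fn: fn(document), repeat=repeat, number=number)
        results[name] = min(trials) / number
    return results


if __name__ == "__main__":
    # A deliberately repetitive snippet; use real user content for a fair test.
    sample = "<div><p onclick='x()'>hello <b>world</b></p></div>" * 200

    # html.escape is a placeholder, not a sanitizer. Replace or extend this
    # dict with the sanitize functions of the libraries being compared.
    timings = time_sanitizers({"html.escape (stand-in)": html.escape}, sample)
    for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
        print(f"{label}: {seconds * 1e6:.1f} microseconds per call")
```

    Taking the minimum over several trials is a common way to compare pure CPU cost, but results will still vary a lot with the input, so it is worth timing the kind of HTML you actually receive.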

    I'm hoping this is useful to someone else, and I'm also looking for feedback, especially on how the bindings were written.

    [1] https://github.com/microcosm-cc/bluemonday

    [2] https://github.com/ColdHeat/pybluemonday#performance