BS turns HTML into tag soup

Posted by ospalh on 26 Apr, 2013 10:01 AM

For example, when the HTML editor is closed, BeautifulSoup turns valid HTML5 such as
<figure><embed width="120" height="120" src="識-Kaisho.svg"><figcaption>Kaisho</figcaption></figure>
into the invalid
<figure><embed width="120" height="120" src="識-Kaisho.svg"><figcaption>Kaisho</figcaption></embed></figure>

Apparently BeautifulSoup doesn’t know that <embed> is a void element, i.e. one that has no content and takes no closing tag. (Writing <embed /> with a trailing slash is allowed, but not required.)
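
As a rough way to check the two snippets without a full validator, here is a minimal sketch using only Python 3's standard html.parser. It flags closing tags that belong to void elements. The names VOID_ELEMENTS, StrayEndTagFinder and stray_void_end_tags are made up for this illustration, and the tag set is my reading of the HTML specification's void-element list (older drafts listed a few more), so treat it as an assumption.

```python
from html.parser import HTMLParser

# Assumed set of HTML void elements (elements that never take a closing tag).
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
])


class StrayEndTagFinder(HTMLParser):
    """Collects closing tags such as </embed> that void elements must not have."""

    def __init__(self):
        super().__init__()
        self.stray = []

    def handle_startendtag(self, tag, attrs):
        # <embed /> is allowed, so the self-closing form is deliberately ignored.
        # (The default implementation would also call handle_endtag.)
        pass

    def handle_endtag(self, tag):
        # HTMLParser passes tag names in lower case.
        if tag in VOID_ELEMENTS:
            self.stray.append(tag)


def stray_void_end_tags(markup):
    """Return the names of void elements that were given an explicit closing tag."""
    finder = StrayEndTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.stray
```

Calling stray_void_end_tags() on the first snippet should give an empty list, and on the second it should report the embed element. This is only a spot check for this one problem, not a replacement for a real validator.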

  1. 1 Posted by ospalh on 26 Apr, 2013 10:21 AM

    ... Just try validating the two examples.

  2. Support Staff 2 Posted by Damien Elmes on 28 Apr, 2013 06:24 AM

    Anki ships bs3. Is it an issue in bs4?

  3. 3 Posted by ospalh on 28 Apr, 2013 08:50 AM

    I haven’t tried bs4.
    I was looking into using lxml instead; its developers claim it is “several times faster”, and it can handle at least some broken HTML.
    The problem is that I couldn’t stop it from percent-encoding src attributes:
    >>> from lxml import html

    >>> print(html.tostring(html.fromstring(u'<figure><embed width="120" height="120" src="識-Kaisho.svg"><figcaption>Kaisho</figcaption></figure>'), encoding=unicode))

    <figure><embed width="120" height="120" src="%E8%AD%98-Kaisho.svg"><figcaption>Kaisho</figcaption></embed></figure>

  4. 4 Posted by ospalh on 28 Apr, 2013 12:10 PM

    I’ve tried bs4 now. Same deal.

    What does work, however, is registering self-closing tags:

    >>> from BeautifulSoup import BeautifulSoup

    >>> BeautifulSoup.SELF_CLOSING_TAGS['embed'] = None

    >>> print(unicode(BeautifulSoup(u'<figure><embed width="120" height="120" src="識-Kaisho.svg"><figcaption>Kaisho</figcaption></figure>')))

    <figure><embed width="120" height="120" src="識-Kaisho.svg" /><figcaption>Kaisho</figcaption></figure>

    The problem, of course, is that you have to do this for every tag somebody might use.

    And as far as I can see, bs4 wraps the output in <html><body> ... </body></html>.

    Looks like a real pain.

    On the other hand, the display worked reasonably well even with the figcaption nested inside the embed and the stray </embed> present, so I guess this can be seen as low priority.

  5. Support Staff 5 Posted by Damien Elmes on 30 Apr, 2013 03:28 PM

    I've added a note about it to the todo, but as you say it'll be a low priority.

  6. Damien Elmes closed this discussion on 30 Apr, 2013 03:28 PM.
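
The workaround from comment 4 registers one tag at a time. A minimal sketch of doing it for all void elements in one go might look like the following. It assumes, as the thread shows for 'embed', that bs3's BeautifulSoup.SELF_CLOSING_TAGS is a plain mapping from tag name to None. The helper register_void_elements and the VOID_ELEMENTS tuple are hypothetical names for this illustration, and the tag list is my reading of the HTML specification, not something bs3 provides.

```python
# Assumed list of HTML void elements; adjust it to the tags your content uses.
VOID_ELEMENTS = (
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
)


def register_void_elements(table, tags=VOID_ELEMENTS):
    """Add every tag in `tags` to a bs3-style self-closing table (name -> None).

    Existing entries are left untouched; the same table is returned for convenience.
    """
    for tag in tags:
        table.setdefault(tag, None)
    return table


# With bs3 (Python 2 only) the call would presumably be:
#   from BeautifulSoup import BeautifulSoup
#   register_void_elements(BeautifulSoup.SELF_CLOSING_TAGS)
```

The helper works on any dictionary, so it can be tried without bs3 installed. It does not address the other points raised above, namely the <html><body> wrapping in bs4 and lxml's percent-encoding of src attributes.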
