Showcase atom feed can be malformed

Simon Anciaux

#26933

October 13, 2022

I recently wrote an XML parser that follows the XML 1.0 spec to replace the quick and dirty parser I use in a feed reader. While testing it, I got an error with the Handmade Network showcase feed. If you open the feed in your browser, it should show the error: the entry > content element contains an unclosed element.

This is caused by a <br> element in the content element, which isn't valid in XML (an empty element in XML needs either a / before the > or a closing tag).
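A minimal sketch of that rule, assuming Python's standard xml.etree.ElementTree as the XML parser (it is not the parser described in this post, and the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical content elements, not copied from the real feed.
unclosed = '<content type="xhtml"><div>line one<br>line two</div></content>'
self_closed = '<content type="xhtml"><div>line one<br/>line two</div></content>'
paired = '<content type="xhtml"><div>line one<br></br>line two</div></content>'

for name, sample in [("unclosed", unclosed),
                     ("self_closed", self_closed),
                     ("paired", paired)]:
    print(name, is_well_formed(sample))
```

Only the first sample should be rejected, since the <br> is still open when the parser reaches </div>.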

For reference, here is a link to some of the relevant parts of the Atom spec.

I believe the best way to solve this issue would be to change the type attribute in the content tag to html instead of xhtml, and either escape < and & in the element body or use a CDATA section to avoid escaping. The Handmade Network post feed already uses html for the content type.

<content type="html">
<![CDATA[ anything in here, even malformed html ]]>
</content>

An alternative would be to add a CDATA section after the opening tag of the div in the content element (I believe putting it before the div wouldn't be valid according to the Atom spec).

<content type="xhtml">
<div><![CDATA[ anything in here, even malformed xhtml ]]></div>
</content>

In my opinion (which is biased toward me making my feed reader), the xhtml type isn't useful (particularly without the CDATA section) because it causes the XML parser to parse the XHTML (text -> XML tree), and if I want to pass the complete XHTML content to my program I need to reconstruct it (XML tree -> text), even though the intermediate XML nodes were never used.

For example, the dion.systems Atom feed is well formed, but since it uses xhtml I need to reconstruct the text to display it.
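To illustrate the round trip, here is a rough sketch assuming Python's xml.etree.ElementTree stands in for the feed reader's parser. The entry below is invented (the example.com link is a placeholder) and is not taken from any real feed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)

# Hypothetical xhtml content element.
entry = (
    '<content type="xhtml">'
    f'<div xmlns="{XHTML_NS}">'
    '<p>Hello <a href="https://example.com">link</a></p>'
    '</div>'
    '</content>'
)

# Step 1: text -> XML tree (done by the parser whether we want it or not).
content = ET.fromstring(entry)
div = content[0]

# Step 2: XML tree -> text, to hand the markup to an HTML renderer.
markup = (div.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in div
)
print(markup)
```

The tree built in step 1 is only used to produce the string in step 2, which is the extra work that the html type avoids.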

Asaf Gartner

#26934

October 13, 2022

I'll take a look at it, but just FYI, the code for the website is available here and you can submit a pull request if you want.

Mārtiņš Možeiko

#26935

October 13, 2022

I think this is the same issue I mentioned here: https://discord.com/channels/239737791225790464/601850747872870401/1011139856094744606

Asaf Gartner

#26956

October 20, 2022

I added a CDATA section, but now FreshRSS isn't parsing the contents as HTML. We might just need to fix our markdown renderer to output valid xml.

Simon Anciaux

#26958

October 20, 2022

EDIT: The layout of the XML made it look like it wasn't well formed, but it is well formed. So the start of this reply doesn't actually point out any issue.

I just looked at it quickly and the content of the feed seems weird. The Atom spec says that if you use type="xhtml", the content element must contain a single child <div>, and that div can contain any XHTML (which means its structure is valid XML).

This is what I get in the feed.

<content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml">
        <div>
            <![CDATA[
                <p>Hey there, hello everyone, In this video we're going to create unique pipeline state objects that can be shared between the render items. <a href="https://youtu.be/kLQGfgZwYFw">https://youtu.be/kLQGfgZwYFw</a></p>
            ]]>
        </div>
        <div>
        </div>
    </div>
</content>

But I would expect

<content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml">
        <![CDATA[
            <p>Hey there, hello everyone, In this video we're going to create unique pipeline state objects that can be shared between the render items. <a href="https://youtu.be/kLQGfgZwYFw">https://youtu.be/kLQGfgZwYFw</a></p>
        ]]>
    </div>
</content>

Other entries contain things (like video) in the extra divs.

About FreshRSS not parsing the content as HTML: that's expected, and it's something I didn't think about. Sorry.

As the content is directly xhtml (and thus XML), FreshRSS probably uses the result from the xml tree directly, and so the CDATA section is considered as a CDATA section in the xhtml tree and outputs the content as regular text (I don't use FreshRSS but I'm assuming that's what is happening).
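A small sketch of what an XML parser sees in that case, assuming Python's xml.etree.ElementTree (I don't know what FreshRSS actually uses, and the entry is shortened for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical xhtml content whose markup is wrapped in a CDATA section.
entry = (
    '<content type="xhtml">'
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<![CDATA[<p>Hello</p>]]>'
    '</div>'
    '</content>'
)

content = ET.fromstring(entry)
div = content[0]

# The CDATA body is character data, so the div has no <p> child element.
child_count = len(div)
div_text = div.text
print(child_count, repr(div_text))
```

The div ends up with zero child elements and the literal string <p>Hello</p> as its text, so a reader that renders the XHTML tree directly would show the tags as plain text.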

If you use type=html and the CDATA section, it should work as expected (hopefully), because the CDATA section would only be part of the XML tree, and only its content would be passed to the "application". And there is no need for the div when using html. I'm assuming the content of posts is HTML5 anyway; that's one of the reasons it was my preferred option in the OP. If you want to skip the CDATA section you could also escape '<' and '&' with &lt; and &amp;.

<content type="html">
    <![CDATA[
        <p>Hey there, hello everyone, In this video we're going to create unique pipeline state objects that can be shared between the render items. <a href="https://youtu.be/kLQGfgZwYFw">https://youtu.be/kLQGfgZwYFw</a></p>
    ]]>
</content>
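Both options can be sketched as follows, assuming Python's standard library on the generating side. The helper names content_escaped and content_cdata are made up for this example and are not from the website's code. One thing to watch for with CDATA is that a literal ]]> in the post would end the section early, so the sketch splits it across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def content_escaped(html: str) -> str:
    """Hypothetical helper: escape &, < and > in the HTML body."""
    return f'<content type="html">{escape(html)}</content>'

def content_cdata(html: str) -> str:
    """Hypothetical helper: wrap the HTML body in a CDATA section."""
    # Split any "]]>" so it cannot terminate the CDATA section early.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return f'<content type="html"><![CDATA[{safe}]]></content>'

# Made-up post body with an unclosed <br>, a bare & and a "]]>".
post = '<p>first<br>second & third ]]> end</p>'

# Either way, an XML parser hands back the original HTML as plain text.
recovered = [
    ET.fromstring(build(post)).text
    for build in (content_escaped, content_cdata)
]
print(recovered)
```

In both cases the feed stays well-formed XML even though the HTML inside is not, and the reader gets the HTML string back unchanged.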

Edited by Simon Anciaux on October 20, 2022, 4:39pm

Replying to AsafG (#26956)