
[XML] Issue with newlines in character['bio']

Discussion in 'Census: General Discussion' started by Lantis, Feb 9, 2012.

  1. Lantis

    Lantis Guest

    I'm running into an issue formatting the character bio for HTML display.  The character 'bio' element can contain newlines (it is a paragraph of text written by the player).  However, the XML feed returns these as actual newlines (\n) rather than replacing them with a meta-entity (which would be &#10;).  Once it goes through the SimpleXML parser, those newlines are lost.

    I can probably work around it for now by doing a mass replace on newlines to replace them with the appropriate meta-entity before parsing the XML, but I think valid XML should be using meta-entities (unless I'm mistaken)?
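The lost newlines are consistent with XML attribute-value normalization: a conforming parser turns a literal newline inside an attribute value into a space, while a `&#10;` character reference survives as a real newline. Below is a minimal Python sketch of that behavior, assuming the bio is delivered as an attribute. The `character` element and `bio` attribute are hypothetical stand-ins, not the actual feed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the real feed's element and attribute names may differ.
raw_doc = '<character bio="First Line\nSecond Line"/>'
escaped_doc = '<character bio="First Line&#10;Second Line"/>'

# A literal newline in an attribute value is normalized to a space by the parser.
raw_bio = ET.fromstring(raw_doc).get("bio")

# A character reference is decoded to a newline and is not normalized away.
escaped_bio = ET.fromstring(escaped_doc).get("bio")

print(repr(raw_bio))
print(repr(escaped_bio))
```

If the bio were element text rather than an attribute, a parser would normally keep the newline, so the attribute assumption matters here.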


     
  2. DanKinney

    DanKinney Guest

    Got it.  We'll take a look.  Thanks for finding it.

    -dan

     
  3. Lempo

    Lempo Guest

    Dan, you are really putting a lot of hours in on this. You can sleep some.

    Edit - Had to change smiley

     
  4. Lantis

    Lantis Guest

    Agreed.  The amount of time you spend on this is simply crazy.  Something tells me this is a project that you are seriously enjoying working on, am I right? :)

     
  5. DanKinney

    DanKinney Guest

    The load is actually spread amongst a lot of folks - I'm just the visible person.  I'm getting way more sleep than I have in the past.  Checking in on email is pretty easy in the grand scope of things.

    This is a very important project for me, yes.  Not only what we are doing, but how we are doing it.  And I do appreciate the kind words.

    -dan

     
  6. Quicktiger

    Quicktiger Guest

    There are also some other items with HTML entities even in the JSON and YAML output.  If you make the change to the bio section, be careful not to just escape it all.

    I will hunt down an example or two, but they were in the achievement lists descriptions or events.

     
  7. Dethdlr

    Dethdlr Guest

    If you change this, please don't hose the json version of it in the process.  The php function nl2br() is working just fine to convert the new line characters in the bio into br tags.  :)

     
  8. feldon30

    feldon30 Guest

    Also, going whole hog and converting everything to html entities can sometimes lead to double-encoding which can be tricky to reverse.
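As a small illustration of the double-encoding concern, here is a sketch using Python's standard `html` module. The item name is only a sample string, and nothing here reflects how the feed itself is produced.

```python
import html

name = 'Some "cool" item'

# First pass: the quote marks become &quot;
once = html.escape(name)

# Second pass: the ampersand of each entity is escaped again, giving &amp;quot;
twice = html.escape(once)

# A single unescape only peels off one layer, so the consumer has to know
# how many times the producer encoded the value.
undone_once = html.unescape(twice)

print(once)
print(twice)
print(undone_once)
```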
     
  9. Lantis

    Lantis Guest

    That shouldn't be an issue (at least with XML, can't say for JSON as I'm not familiar with the format).  Pretty sure the folks who devised the XML standard thought about that :)

     
  10. DanKinney

    DanKinney Guest

    We will not be doing a global replace.  We'll be asking the game team to encode CRs as the appropriate entity within specific fields that need to support them (like the bio).

    -dan

     
  11. Quicktiger

    Quicktiger Guest

    This is good for XML, but for yaml and JSON, entity encoding makes things harder.  Encoding should be a transport-only issue, and the data encoded into XML output should be encoded such that a typical XML parser would pull the raw newlines out again.

    Otherwise, think of how I will present the bio (or item names, etc.)  If there was an item called:

      Some "cool" item

    right now that is encoded as:  Some &quot;cool&quot; item

    even in the JSON output.  This means to present it I must send it raw to the user, which is something that makes me nervous.  The JSON format is perfectly capable of transmitting a raw " mark, so this encoding is unneeded and in fact dangerous.

    If I send this to the user raw, which is safe so long as everything is encoded, I'm making a security decision to use largely untrusted data (sorry Dan...) in my web presentation.  Everything else coming from the database is not HTML-safe, and gets encoded, EXCEPT some of these pre-encoded fields.
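The "escape on output" approach described here could look roughly like the following Python sketch. It assumes the feed delivers plain, unencoded text; `render_bio` is a hypothetical helper name, not part of any Census client.

```python
import html


def render_bio(raw_text: str) -> str:
    """Return an HTML-safe fragment for untrusted plain text."""
    # Escape first, so any markup in the source text is neutralized.
    escaped = html.escape(raw_text)
    # Only then turn newlines into line breaks that we control ourselves.
    return escaped.replace("\n", "<br>\n")


fragment = render_bio('Some "cool" item\n<script>alert(1)</script>')
print(fragment)
```

Escaping before inserting the `<br>` tags matters: doing it the other way round would escape the line breaks as well.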

     
  12. Dethdlr

    Dethdlr Guest

    Could you give us a sample of what the proposed new method would look like in xml and json for CRs?

    Right now it looks something like this in json:

    "bio": "First Line\nSecond Line"

     
  13. Lantis

    Lantis Guest

    In XML, it would probably look like this:

    "First Line&#10;Second Line"

    No idea as for JSON, as I don't know if the JSON specification requires newlines to be encoded as HTML entities (like XML does), or if it's up to the implementation to use whichever they want.

     
  14. DanKinney

    DanKinney Guest

    I haven't yet talked with Zoltaroth on this, but the suggestion was that we encode it with the HTML entity &#10;

    -dan

     
  15. Quicktiger

    Quicktiger Guest

    yes but please, only do this in XML.

    JSON can store the newlines natively, and there are ample methods to present this in a web page correctly.  If you encode them as &#10; in JSON-format strings, this complicates the proper handling to avoid XSS and other issues.  It's already difficult enough to deal with achievements with HTML entities in their names when using JSON.

    This is a Big Deal and the current scheme used is broken.
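To make the difference concrete, here is a short Python sketch comparing a JSON string that uses the native `\n` escape with one that carries the XML-style `&#10;` reference. The `bio` key and both payloads are illustrative samples, not captured feed output.

```python
import json

# JSON's own escape sequence: the parser hands back a real newline.
native = json.loads('{"bio": "First Line\\nSecond Line"}')["bio"]

# An XML/HTML character reference means nothing to a JSON parser:
# the five characters & # 1 0 ; come back as literal text.
entity = json.loads('{"bio": "First Line&#10;Second Line"}')["bio"]

print(repr(native))
print(repr(entity))
```

In the second case the client would have to HTML-decode the value itself before it could treat the text as plain data.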

     
  16. DanKinney

    DanKinney Guest

    The problem is that the data is consistent whether we deliver it through XML or JSON.

    What would happen if the data contains the entity but delivered through JSON?  Wouldn't that work?

    -dan

     
  17. DanKinney

    DanKinney Guest

    Nevermind - it will be correct for the format you get.

    json will get \n

    xml will get &#10;

    -dan

     
  18. DanKinney

    DanKinney Guest

    What other fields need this to be fixed, other than "bio"?

    -dan

     
  19. Quicktiger

    Quicktiger Guest

    JSON uses pretty standard C or Javascript-style escaping.  That is, no HTML entities are required at that layer.

    This is why this achievement is problematic:

    census.daybreakgames.com/json/get/eq2/ac...ement/186351809

    Because &quot; is put in the text fields rather than a literal " mark, this means it's pre-HTML escaped.

    Many items are also encoded on Sony's end even when sending them over non-XML channels:

    Item quick-list:

     1672197449 | &quot;The Storm Shepherds - Darnalithenis of Felwithe&quot;

     1684041159 | &quot;a letter from the Coalition of Tradesfolke&quot;

     1686826574 | &quot;a message from the Freeport Militia&quot;

     1689524419 | &quot;The Pirate Queen and the Temple&quot;

     1693970424 | &quot;The Varsoon Collection, Volume 5 - The War of Plagues&quot;

    If we can rely on this always being the case, I can undo this encoding on my end before storing it in the database, or on pulling it from the database again.  However, I believe the HTML encoding is a mistake unless you're all storing it in your actual database on the game-side that way.

    I might be too purist about this, but I follow the rule of "don't trust what you get from sources you don't control" -- I would prefer to display the strings HTML-escaped on my side, and treat them as dangerous otherwise.

    To see why this is a problem, btw:

     1345642928 | Bangers & Mush

      865735431 | Simple Stove & Keg



    If " -> &quot; then & -> &amp; but this isn't the case.  There are several items which do not have properly escaped & marks, so the escaping is not being done correctly.  It seems simplest to let the XML renderer on your side just ensure that all data is escaped properly, and let the game data use the literals.
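One way a client might cope with the mixed encoding is to decode entities once on ingest and escape again only at display time. This is a hedged Python sketch of that idea; `normalize_name` and `display_name` are hypothetical helpers, and the sample strings are modeled on the item names quoted above.

```python
import html


def normalize_name(feed_value: str) -> str:
    """Decode HTML entities so the stored value is plain text."""
    return html.unescape(feed_value)


def display_name(stored_value: str) -> str:
    """Escape the plain-text value at the moment it goes into a page."""
    return html.escape(stored_value)


samples = ["&quot;The Pirate Queen and the Temple&quot;", "Bangers & Mush"]
stored = [normalize_name(s) for s in samples]
shown = [display_name(s) for s in stored]

print(stored)
print(shown)
```

This only works while no item name legitimately contains text that looks like an entity. That ambiguity is the reason the inconsistent pre-encoding is a problem in the first place.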

     
  20. Quicktiger

    Quicktiger Guest

    To be clear:

      JSON needs JSON-style escaping (really just newlines as \n, " marks as \", etc.) -- anything that takes a Ruby or Python hash or array and renders JSON will do this automatically.

      YAML needs YAML-style escaping.  It's similar to the JSON style, and again any library will do this for you.

      XML needs entity encoding in some cases.  In this case, you could use the &#10; format, or you could use a CDATA block I suspect.  But whatever is chosen, if the &#10; markup leaks over to the JSON or YAML format too, you're doing it wrong.
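The summary above can be sketched in Python: keep one raw value and let each serializer apply its own escaping. The `character` element, the `bio` attribute and the sample text are assumptions for illustration. YAML is left out because it needs a third-party library.

```python
import json
import xml.etree.ElementTree as ET

# One raw, unencoded record; neither format's escaping is stored in the data.
record = {"bio": 'First "Line"\nSecond Line & more'}

# The JSON serializer applies JSON-style escapes (\n, \") on its own.
json_text = json.dumps(record)

# The XML serializer applies entity encoding to the attribute value on its own.
element = ET.Element("character", record)
xml_text = ET.tostring(element, encoding="unicode")

# Both formats decode back to the identical raw string.
json_back = json.loads(json_text)["bio"]
xml_back = ET.fromstring(xml_text).get("bio")

print(json_text)
print(xml_text)
```

Because the escaping lives in the serializers, the XML-only `&#10;` reference never appears in the JSON output.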

     
