ruby - Nokogiri XML Parser with Bad Attribute Values
I can't find documentation on the difference between how Nokogiri (or, by implication, libxml) handles attribute values in XML vs. HTML. One of our projects is still using the defunct Hpricot gem because of its lax acceptance of attributes.
The crux of the problem seems to be that our XML input has both unquoted and missing attribute values. I'm not a spec lawyer, but I gather that some of the HTML variants allow these attribute patterns and XML does not.
If Nokogiri (or libxml) is going to be strict, shouldn't there be an option to make it less strict on attributes? If the HTML parser did not strip namespaces, we could maybe use that.
We can't change the input: a team has XML-ish formats that are neither fish nor fowl but something in between. If we could fix it at the source we might do that, but in the meantime we have to handle the format as it is.
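This hack fixes the attributes before sending the input to Nokogiri:

require 'nokogiri'

# One attribute: a name, optionally followed by an unquoted,
# double-quoted, or single-quoted value.
attr_re = /[^\s=>]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/mo
# An opening tag: "<name", zero or more attributes, then ">".
element_re = /(<\s*[:\w]+)((?:\s+#{attr_re})*)(\s*>)/mo

# data holds the raw XML-ish input string.
Nokogiri::XML(
  data.gsub(element_re) do
    open, close = $1, $3
    attrs = $2.scan(attr_re).map do |atr|
      if atr =~ /=\s*['"]/
        atr                                # value is already quoted
      elsif atr =~ /=/
        "#{$`.strip}=\"#{$'.strip}\""      # unquoted value: add quotes
      else
        "#{atr.strip}=\"#{atr.strip}\""    # missing value: reuse the name
      end
    end
    ([open] + attrs) * ' ' + close
  end
)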
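To see what the attribute fix does on its own, here is a minimal sketch that applies the same two regular expressions to a small sample string, without involving Nokogiri. The method name `quote_attributes` and the sample input are made up for illustration, and the branch logic uses `split` instead of the match globals, so treat it as an approximation of the hack rather than a drop-in replacement.

```ruby
# Same patterns as in the post, stored as constants so the method can see them.
ATTR_RE = /[^\s=>]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/m
ELEMENT_RE = /(<\s*[:\w]+)((?:\s+#{ATTR_RE})*)(\s*>)/m

# Hypothetical helper: returns a copy of data in which every attribute
# of every opening tag has a quoted value.
def quote_attributes(data)
  data.gsub(ELEMENT_RE) do
    open_tag, attrs, close = $1, $2, $3
    fixed = attrs.scan(ATTR_RE).map do |atr|
      name, value = atr.split('=', 2).map(&:strip)
      if value.nil?
        %(#{name}="#{name}")        # missing value: reuse the name
      elsif value.start_with?('"', "'")
        atr.strip                   # already quoted: keep as written
      else
        %(#{name}="#{value}")       # unquoted value: add quotes
      end
    end
    ([open_tag] + fixed).join(' ') + close
  end
end

sample = '<item id=42 checked label="ok">text</item>'
puts quote_attributes(sample)
# prints: <item id="42" checked="checked" label="ok">text</item>
```

The returned string is what would then be handed to Nokogiri::XML. Closing tags are left alone because the element pattern requires a name character right after `<`. Self-closing tags such as `<br/>` are not handled specially by these patterns, so input containing them probably needs extra care.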