"Flatten" HTML Content (i.e., strip tags) in Cocoa/Objective-C
I needed to strip the tags from some HTML that was embedded in an XML feed so I could display a short summary of the full content in a UITableView. Rather than go through the effort of parsing HTML on the iPhone (I was already parsing the XML file), I built this simple method from some half-finished snippets I found. It has worked in every case I have needed, but your mileage may vary. It is at least a working method (which cannot be said of most of the other examples). It works on the iPhone and in standard OS X code.
- (NSString *)flattenHTML:(NSString *)html
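The body of the method did not survive in this copy of the post. As a rough sketch only, assuming the NSScanner approach and the `withString:@" "` replacement that the comments below refer to, it might look something like this. The class name `HTMLFlattener` is hypothetical; only the method signature comes from the post.

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper class; only the method signature is from the post.
@interface HTMLFlattener : NSObject
- (NSString *)flattenHTML:(NSString *)html;
@end

@implementation HTMLFlattener
- (NSString *)flattenHTML:(NSString *)html {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *tag = nil;
    while (![scanner isAtEnd]) {
        // Move to the start of the next tag.
        [scanner scanUpToString:@"<" intoString:NULL];
        // Capture everything from "<" up to, but not including, ">".
        if ([scanner scanUpToString:@">" intoString:&tag]) {
            // Replace every occurrence of that tag with a single space.
            html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", tag]
                                                   withString:@" "];
        }
    }
    return html;
}
@end
```

Calling `[[[HTMLFlattener alloc] init] flattenHTML:@"<p>Hello <b>World</b></p>"]` leaves the words with extra spaces where the tags were, so trim or collapse them afterwards if that matters.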




I tried this for flattening RSS feeds. It was great; it works.
Support for changing HTML Entities
Hi All,
I have modified the above to include support for HTML entities:
// Replace common HTML entities (&nbsp; &lt; &gt; &cent; &pound; &yen; &euro; &amp;) with plain text.
// &amp; is handled last so that text such as "&amp;lt;" is not decoded twice.
NSArray *entities = [NSArray arrayWithObjects:@"&nbsp;", @"&lt;", @"&gt;", @"&cent;", @"&pound;", @"&yen;", @"&euro;", @"&amp;", nil];
NSArray *plainText = [NSArray arrayWithObjects:@" ", @"<", @">", @"¢", @"£", @"¥", @"€", @"&", nil];
for (NSUInteger i = 0; i < [entities count]; i++) {
    html = [html stringByReplacingOccurrencesOfString:[entities objectAtIndex:i] withString:[plainText objectAtIndex:i]];
}
Thanks Skeep
This code revealed a memory leak when I used it on a large HTML file, which made the memory climb as high as 1.5GB. This change seemed to patch the leak.
- (NSString *)stripTags:(NSString *)str
{
    NSMutableString *html = [NSMutableString stringWithCapacity:[str length]];
    NSScanner *scanner = [NSScanner scannerWithString:str];
    NSString *tempText = nil;
    while (![scanner isAtEnd])
    {
        // Collect the text in front of the next tag.
        [scanner scanUpToString:@"<" intoString:&tempText];
        if (tempText != nil)
            [html appendString:tempText];
        // Skip the tag itself, then step past its closing ">".
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation] + 1];
        tempText = nil;
    }
    return html;
}
This code works great, but it removes leading whitespace after the HTML tags are stripped, so "Some Text Some More Text" becomes "Some TextSome More Text" instead of "Some Text Some More Text".
This is because the scanner skips whitespace and newlines by default.
To get around this, add this line after creating the scanner, so that only newlines are skipped:
[scanner setCharactersToBeSkipped:[NSCharacterSet newlineCharacterSet]];
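For anyone who wants the two pieces together, here is a minimal sketch of the scanner loop with that line applied. It is written as a free function with a hypothetical name, `StripTagsKeepingSpaces`, rather than as the method above, and it has only been reasoned through on simple inputs.

```objc
#import <Foundation/Foundation.h>

// Hypothetical free-function variant of the stripTags: method above.
static NSString *StripTagsKeepingSpaces(NSString *str) {
    NSMutableString *result = [NSMutableString stringWithCapacity:[str length]];
    NSScanner *scanner = [NSScanner scannerWithString:str];
    // Skip only newlines, so spaces next to tags are preserved.
    [scanner setCharactersToBeSkipped:[NSCharacterSet newlineCharacterSet]];
    NSString *text = nil;
    while (![scanner isAtEnd]) {
        // Append the text in front of the next tag, if there is any.
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }
        // Skip the tag, then step past its closing ">".
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd]) {
            [scanner setScanLocation:[scanner scanLocation] + 1];
        }
    }
    return result;
}
```

With an input such as `@"Some Text<b> Some More Text</b>"` (an assumed example, since the markup in the comment above was lost), the space after the tag is kept.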
Thanks for this great code. I found it to be very useful as well as informative. Thanks for the share.
Thank you!
Thank you very much for this! It was very helpful!
Thanks a lot ...
This code worked fine... great help thanks...learned a new thing too.
Thanks for the code. You should capitalize it correctly, though. And you really should rethink unnecessary comments. Which is all of them in that method.
style prefs are style prefs
and the comments were in lieu of a longer blog post and for newer Objective-C programmers (as I would expect more experienced ones not to need such a basic example)
Great stuff, thanks for the information.
Thank you very much!! It worked.
You could always use regular expressions to strip tags. I guess that would be way faster than scanning stuff in your own loop.
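A sketch of the regular-expression idea, assuming the `NSRegularExpressionSearch` option is available on the target OS (it arrived in later SDKs than the ones this post was written against). The function name `StripTagsWithRegex` is hypothetical, and whether this is actually faster than the scanner loop would need measuring.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes anything that looks like <...> from the string.
static NSString *StripTagsWithRegex(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<[^>]+>"
                                           withString:@""
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, [html length])];
}
```

The pattern is deliberately simple, so a `>` inside an attribute value or an HTML comment will confuse it, just as it confuses the scanner versions.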
good post
Thanks for reviewing such a nice block of program code. It will be really helpful.
Simply put: you rock!!!
Works great. Thanks!
hi, thanks sir
Thank you! Just what I searched for!
Hi, I'm having trouble reading your site in the Avant browser (the font size is way too small). I've tried raising the font size in my browser, but that didn't do the trick. Do you have any tips on what I can do? (By the way, I'm using Windows Vista.)
Hey man, thanks for the tip!
Some of my returned strings were starting off with spaces, so I tacked on a bit of code to the end:
// Trimmed return
return [html stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
THANK you very much!!! You saved me a lot of time!!! You made my day!!!!!!!
As others have wondered: where would this go? Would I put this in the parser?
It just takes a string as input, so it can go anywhere you want it. I'd grab the contents of a URL (slurp it into an NSString) and then pass it through here before doing anything else with it. Remember, it's for stripping tags from HTML pages, not sifting through an XML file.
this snippet is pretty good:) thanx for this
http://netdeveload.wordpress.com for .net development
Thank you very much,
I have used code almost identical to this in my projects, and it worked perfectly but very slowly. I need a more optimized solution. I think I'll try converting to a C string and doing the parsing in plain C.
cheers
Wow, like everyone else has said, Thank you! I've been looking for something like this and nothing ever seemed to really work right but this is great. Again, much appreciated!
Where did you put this snippet? Any help would be appreciated.
*from a frustrated would-be app coder*
Works perfectly for me! Great, thanks!
Where do you place this code? In the view controller or the RSS parser? I would like to use this in my RSS reader app and cannot get the HTML tags out of the blog detail view.
Thanks, that worked well for flattening RSS feeds.
Thank you so much!! That has really solved a big problem for me.
Thanks a lot for this snippet! Had the same issue of already parsing an XML and not wanting to do it again :)
After using this code, I'm getting my content, but it is completely messed up with whitespace, as it is a big four-paragraph string. I don't know how to remove only the unnecessary whitespace. The author should have written that code sample too. Waiting...
Change:
withString:@" "
to
withString:@"".
Should work fine.
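If the goal is to keep single spaces between words but drop the runs of extra whitespace, a small helper along these lines may be enough. This is only a sketch with a hypothetical name, `CollapseWhitespace`; it splits on whitespace and newlines, drops the empty pieces, and joins the rest with single spaces.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collapses every run of whitespace/newlines to one space
// and removes leading and trailing whitespace.
static NSString *CollapseWhitespace(NSString *text) {
    NSArray *parts = [text componentsSeparatedByCharactersInSet:
                      [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    NSMutableArray *words = [NSMutableArray array];
    for (NSString *part in parts) {
        // Consecutive separators produce empty strings; skip them.
        if ([part length] > 0) {
            [words addObject:part];
        }
    }
    return [words componentsJoinedByString:@" "];
}
```

Note that this also flattens paragraph breaks into single spaces, which is usually fine for a short summary in a table cell but not for multi-paragraph display.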
Awesome, thanks a million!!!!! I tried a number of things... adapting to Objective C is great with a community that helps push adoption of techniques. Thanks again.
Thanks. Worked well for flattening RSS feeds.
Much thanks for this code snippet. I have been messing around with the TouchXML libraries, only to realize that they do not output XML (they only read it), and was beginning to get frustrated. It's very bad when you know *how* to do something but you don't know the underlying libraries to accomplish it.
Great little snippet. Thanks again,
jb