java - Split up jSoup scraping result -

- July 15, 2014

I am scraping this link using the jsoup library in Java. My source works, but I want to ask how to split up the elements I get.

Here is my source:

    package javaapplication1;

    import java.io.IOException;
    import java.sql.SQLException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class coba {

        public static void main(String[] args) throws SQLException {
            masukdb db = new masukdb();
            try {
                Document doc = null;
                for (int page = 1; page < 2; page++) {
                    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                    System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                    System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                    System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                    System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

In the result, every page of the website becomes one line. How do I split it up? Also, how do I get the link of every article? I think my CSS selector for the link is still wrong.

 doc.select(".entry-title>a").text()

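This searches the entire document and returns a list of all matching links, and you are then scraping the text of that whole list at once. That is why each page comes out as one line: text() on a list of elements returns the combined text of every match, while attr("href") returns the attribute of the first match only. What you probably want instead is to select each article and then scrape the pertinent data from each one.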
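To see the difference without hitting the network, here is a minimal sketch that parses a small inline page. The HTML string, the example.com URLs, and the SelectDemo class name are made up for illustration; only the jsoup calls (Jsoup.parse, select, text, attr) are the library's own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {

    // Hypothetical two-article page standing in for a real download.
    static final String HTML =
            "<article><h1 class=\"entry-title\">"
          + "<a href=\"http://example.com/a\">First</a></h1></article>"
          + "<article><h1 class=\"entry-title\">"
          + "<a href=\"http://example.com/b\">Second</a></h1></article>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select(".entry-title > a");

        // Called on the whole match set: text() joins the text of all
        // matches, and attr() reads the attribute of the first match only.
        System.out.println("all text   : " + links.text());
        System.out.println("first href : " + links.attr("href"));

        // Called per element: one title and one link at a time.
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

The first two prints mirror what the question's code does; the loop is the per-element version that keeps each title paired with its own link.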

    // Needs, in addition to the imports in the question:
    // java.text.MessageFormat, org.jsoup.nodes.Element, org.jsoup.select.Elements
    Document doc;
    for (int page = 1; page < 2; page++) {

        doc = Jsoup.connect("http://hackaday.com/page/" + page).get();

        // Get the list of articles on this page
        Elements articles = doc.select("main#main article");

        // Iterate over the article list
        for (Element article : articles) {

            // Find the article header, which includes the title and date
            Element header = article.select("header.entry-header").first();

            // Find and scrape the title/link from the header
            Element headerTitle = header.select("h1.entry-title > a").first();
            String title = headerTitle.text();
            String link = headerTitle.attr("href");

            // Find and scrape the date from the header
            String date = header.select("div.entry-meta > span.entry-date > a").text();

            // Find and scrape every paragraph in the article content.
            // You will probably want to refine the logic here,
            // since there may be paragraphs you don't want to include.
            String body = article.select("div.entry-content p").text();

            // View the results
            System.out.println(
                    MessageFormat.format(
                            "title={0} link={1} date={2} body={3}",
                            title, link, date, body));
        }
    }
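If the goal is to keep each article separate for later use (for example, before writing it to a database), the four values from the inner loop could be collected into a small holder object instead of being printed straight away. The following is only a sketch using the standard library: the Article class, the formatLine helper, and the sample values are hypothetical and not part of the original answer.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;

public class ArticleDemo {

    // Hypothetical holder for the four values scraped per article.
    static final class Article {
        final String title;
        final String link;
        final String date;
        final String body;

        Article(String title, String link, String date, String body) {
            this.title = title;
            this.link = link;
            this.date = date;
            this.body = body;
        }
    }

    // Formats one article on one line, with the same pattern as the answer's loop.
    static String formatLine(Article a) {
        return MessageFormat.format(
                "title={0} link={1} date={2} body={3}",
                a.title, a.link, a.date, a.body);
    }

    public static void main(String[] args) {
        List<Article> articles = new ArrayList<>();

        // Placeholder values; in the scraper these would be the title, link,
        // date and body variables set inside the loop over articles.
        articles.add(new Article("First post", "http://example.com/first",
                "July 15, 2014", "Body of the first post"));
        articles.add(new Article("Second post", "http://example.com/second",
                "July 14, 2014", "Body of the second post"));

        // One line of output per article.
        for (Article a : articles) {
            System.out.println(formatLine(a));
        }
    }
}
```

Keeping the values in a list this way means the printing, or a database insert, can happen after the scraping loop rather than inside it.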

See CSS selectors for more examples of how to scrape this kind of data.
