java - Split up jsoup scraping result
I am scraping this link using the jsoup library in Java. The source works, but I want to ask how to split up every element I get.
Here is my source:
package javaapplication1;

import java.io.IOException;
import java.sql.SQLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class coba {
    public static void main(String[] args) throws SQLException {
        masukdb db = new masukdb();
        try {
            Document doc = null;
            for (int page = 1; page < 2; page++) {
                doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In the result, every page of the website becomes one line. How can I split it up? Also, how do I get the link for every article? I think my CSS selector for the link is still wrong.
doc.select(".entry-title>a").text()
This searches the entire document and returns a list of all matching links, then scrapes the text of every one of them at once. However, you want to scrape each article and the pertinent data for each one.
// Needs org.jsoup.nodes.Element, org.jsoup.select.Elements and java.text.MessageFormat
Document doc;
for (int page = 1; page < 2; page++) {
    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();

    // Get the list of articles on the page
    Elements articles = doc.select("main#main article");

    // Iterate over the article list
    for (Element article : articles) {
        // Find the article header, which includes the title and date
        Element header = article.select("header.entry-header").first();

        // Find and scrape the title/link from the header
        Element headerTitle = header.select("h1.entry-title > a").first();
        String title = headerTitle.text();
        String link = headerTitle.attr("href");

        // Find and scrape the date from the header
        String date = header.select("div.entry-meta > span.entry-date > a").text();

        // Find and scrape every paragraph in the article content.
        // You may want to further refine the logic here, since
        // there may be paragraphs you don't want to include.
        String body = article.select("div.entry-content p").text();

        // View the results
        System.out.println(
            MessageFormat.format(
                "title={0} link={1} date={2} body={3}",
                title, link, date, body));
    }
}
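To make the difference concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup, the class name `SelectScopeDemo` and both helper methods are made up for illustration; only the `.entry-title > a` selector is taken from the question. It shows that calling `text()` on a document-wide selection merges every title into one string and that `attr("href")` only returns the first match, while selecting inside each article gives one row per article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectScopeDemo {
    // Hypothetical two-article page that mimics the class names used in the question.
    static final String HTML =
        "<main id=\"main\">"
      + "<article><h1 class=\"entry-title\"><a href=\"/a\">First</a></h1></article>"
      + "<article><h1 class=\"entry-title\"><a href=\"/b\">Second</a></h1></article>"
      + "</main>";

    // Document-wide select: text() joins the text of every match into one string.
    static String combinedTitles(Document doc) {
        return doc.select(".entry-title > a").text();
    }

    // Per-article select: builds one "title -> link" row for each article.
    static List<String> perArticle(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element article : doc.select("article")) {
            Element a = article.select(".entry-title > a").first();
            if (a != null) {
                rows.add(a.text() + " -> " + a.attr("href"));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // All titles merged together
        System.out.println(combinedTitles(doc));
        // Only the href of the first matching link
        System.out.println(doc.select(".entry-title > a").attr("href"));
        // One line per article
        for (String row : perArticle(doc)) {
            System.out.println(row);
        }
    }
}
```

The `null` check on `first()` is there because an article without a matching title link would otherwise throw a `NullPointerException`; the same guard is worth adding when scraping the real page.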
See CSS selectors for more examples of how to scrape this kind of data.
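If you also want the body split into separate paragraphs rather than one long string, one possible refinement is to loop over the `p` elements yourself. The sketch below again uses invented inline markup; the `read-more` class and the `paragraphs` helper are assumptions for illustration and are not taken from the real page, so adjust the filter to whatever you actually want to skip.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitDemo {
    // Hypothetical article body; the "read-more" class is an assumption for illustration.
    static final String HTML =
        "<div class=\"entry-content\">"
      + "<p>Intro paragraph.</p>"
      + "<p>   </p>"
      + "<p class=\"read-more\">Continue reading</p>"
      + "<p>Second paragraph.</p>"
      + "</div>";

    // Returns each kept paragraph as its own list entry.
    static List<String> paragraphs(Element content) {
        List<String> out = new ArrayList<>();
        for (Element p : content.select("p")) {
            // Skip paragraphs we decided not to include
            if (p.hasClass("read-more")) {
                continue;
            }
            // Skip paragraphs that contain only whitespace
            String text = p.text().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Element content = Jsoup.parse(HTML).select("div.entry-content").first();
        for (String paragraph : paragraphs(content)) {
            System.out.println(paragraph);
        }
    }
}
```

In the per-article loop you would pass `article.select("div.entry-content").first()` to a helper like this, after checking that it is not `null`.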