Extract data from JavaScript using Python
I am a new Python user and have inherited a Python notebook from my predecessor that I want to improve. Its purpose is to grab product details from a website.
How it works:
It scrapes the scripts from the website using Beautiful Soup:
import urllib2
import bs4

source = urllib2.urlopen('http://www.testwebsite.html').read()
soup = bs4.BeautifulSoup(source)
job_postings = soup.findAll("script")
# keep only the <script> tags whose type is text/javascript
job_postings = [jp for jp in job_postings if jp.get('type') is not None and ''.join(jp.get('type')) == "text/javascript"]
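For anyone wanting to try the filtering step without the site or Beautiful Soup at hand, here is a minimal sketch of the same idea using only the Python 3 standard library. It swaps in `html.parser` for Beautiful Soup, so it is not the notebook's actual code; `ScriptCollector`, `collect_scripts` and `sample_html` are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script type="text/javascript"> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # text of each matching script, in page order
        self._capturing = False  # True while inside a matching <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/javascript":
            self._capturing = True
            self.scripts.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._capturing:
            self.scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


def collect_scripts(html):
    """Return the contents of all text/javascript scripts found in html."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Hypothetical page standing in for the real site.
sample_html = (
    '<html><head>'
    '<script type="text/javascript">var a = {"sku": "test123"};</script>'
    '<script type="application/json">{"ignored": true}</script>'
    '</head></html>'
)
scripts = collect_scripts(sample_html)
```

Each entry in `scripts` is the raw script text as a plain string, which is easier to search with a regex than the `str()` of a list of tags.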
It returns the scripts in the webpage (1st part of the data):
window.wf=window.wf||{};wf.appdata=wf.appdata||{};wf.appdata.product_data_test123=wf.appdata.product_data_test123||{};wf.appdata.product_data_test123 = {"sku":"tes123","is_grid_view":false,,"default_img_display":0,"manufacturer_name":"supplier1","product_name":"product test","part_number":"1234","list_price":1000,"is_price_hidden":false,"base_price":1000,"has_opt":true,"opt_details":[{"option_ids":[],"regular_price":2681.25],"has_free_shipping":false,,"total_qty":1,"display_set_quantity":1,"is_standard_layout":true,"page_type":"productpage"};yui_config.app.product_data_test123 = {"sku":"test123",........ same info here ....};
2nd part of the data:
\n wf.extend({"yui_config":{"app":{"pagealias":"productpage"}},"wf":{"appdata":{"pagealias":"productpage",,"mkcname":"au: furnitureroom","productreviews":{"b_show_review_tags":false,"kit_subgroup_price":null,"catalog_currency":"aud","price_model":null,"colors":"",,"available_after":{"date":"2016-07-28 18:05:16.000000","timezone":"australia\\/sydney"},"inventory_info":{"sku":"test123",,"latest_inventory_update":"2016-07-29 00:45:06","option_ids":[],"available_quantity":17,"display_quantity":17,","quantity_available_string":" more 10 in stock","short_lead_time_id":2,"short_lead_time_string":"leaves warehouse in 1 3 business days"}}};
Then I extract the data I need:
jsonfile = re.findall(r'wf.appdata.product_data_[a-z]{4}[0-9]{4} = (\{.*});yui_config.app.product_data_',str(job_postings))
This gives me:
{"sku":"test123","is_grid_view":false,,"default_img_display":0,"manufacturer_name":"supplier1","product_name":"product test","part_number":"1234","list_price":1000,"is_price_hidden":false,"base_price":1000,"has_opt":true,"opt_details":[{"option_ids":[],"regular_price":2681.25],"has_free_shipping":false,,"total_qty":1,"display_set_quantity":1,"is_standard_layout":true,"page_type":"productpage"}
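Once the regex has captured the object, the string can be turned into a Python dict with `json.loads`. The sketch below is only an illustration under assumptions: `script_text` is a hypothetical, shortened stand-in for the real script (which has many more fields), the SKU is changed to `test1234` so that it fits the `[a-z]{4}[0-9]{4}` pattern, and the pattern escapes the dots and uses a non-greedy `.*?`, which differs slightly from the one in the notebook.

```python
import json
import re

# Hypothetical, simplified script text; the real one has many more fields.
script_text = (
    'wf.appdata.product_data_test1234 = {"sku":"test1234","list_price":1000,'
    '"has_opt":true};yui_config.app.product_data_test1234 = {"sku":"test1234"};'
)

# Dots are escaped so they match literal dots only, and .*? stops at the
# first "};yui_config.app.product_data_" rather than the last one.
pattern = re.compile(
    r'wf\.appdata\.product_data_[a-z]{4}[0-9]{4} = (\{.*?\});'
    r'yui_config\.app\.product_data_'
)

match = pattern.search(script_text)
# json.loads converts the captured JSON text into a dict (true -> True, etc.).
product = json.loads(match.group(1)) if match else None
```

After this, fields can be read directly, for example `product["list_price"]`. `json.loads` raises `ValueError` if the captured text is not valid JSON, so the real page data has to be complete for this to work.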
My problem now: I want to add the "inventory_info" data to the list.
I've tried:
jsonfile = re.findall(r'inventory_info' = (\{.*}),str(job_postings))
or
jsonfile = re.compile('inventory_info' = ({.*?});', re.DOTALL)
Neither of these works.
My knowledge of Python is limited and I'm a bit lost now. Please help.
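A few things stand out in the two attempts: the quotes around inventory_info end the pattern string early, the page data has `"inventory_info":{...}` with a colon rather than ` = `, and `re.compile` only builds a pattern without searching anything. One possible way around matching nested braces with a regex is to locate the key and let `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. This is a minimal sketch, not a tested fix for the real page: `extract_json_value` and `script_text` are hypothetical, and it assumes the raw script text is valid JSON from the key onwards.

```python
import json


def extract_json_value(text, key):
    """Return the decoded JSON value that follows "key": in text, or None."""
    marker = '"%s":' % key
    start = text.find(marker)
    if start == -1:
        return None  # key not present in the text
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while idx < len(text) and text[idx].isspace():
        idx += 1
    # raw_decode parses one complete JSON value starting at idx and
    # ignores whatever comes after it (here the trailing "}}});").
    value, _end = json.JSONDecoder().raw_decode(text, idx)
    return value


# Hypothetical, shortened version of the second script on the page.
script_text = (
    'wf.extend({"wf":{"appdata":{"inventory_info":{"sku":"test123",'
    '"option_ids":[],"available_quantity":17,"display_quantity":17}}}});'
)
inventory = extract_json_value(script_text, "inventory_info")
```

With the sample above, `inventory["available_quantity"]` is 17. The function should be given the script's own text rather than `str(job_postings)`, since the string form of a list may not preserve the JSON exactly.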