Deep scraping
The previous section shows how to define recipe to scrape list pages, where we locate blocks on the page and fields in the block, and declare how to do pagination for various scenario. Sometime, the list pages cover items' information in brief, and more details are coverd on the page after clicking the item. This section shows you how to scrape not only items on the list, but also the detail page behind each item.
Here we take Amazon for example.
List page: https://www.amazon.com/Smart-Watch-Accessories/b/ref=dp_bc_aui_C_3?ie=UTF8&node=7939902011
and the first item as detail example.
Here we repeat the step 1 to step 5 on Multiple paginated list scraping
and then
Step 6: click 'Next' wizard button on the bottom of the editor, and swith to 'Navigation' tab.
There are 3 options on the navigation tab, (2 options when the current node is not the last node on the flow).
There is a simple guide to select the correct navigation option:
- Do we need any deep scraping, or any additional actions to execute?
- If no, then select 'No further node needed'
- Do we need to click any element on the CURRENT page to continue deep scraping?
- If yes, select 'Click one Field to continue' mode, and select the field on the current node to click
- Do we want to open the page after clicking on a new tab?
- If yes, check the 'Open new tab'. NDS will TRY its best to open the page on a new tab. BUT it is not ensured and is depending on the link attributes.
- Do we need to do deep scraping, or additional actions to execute on the same page but with another independent node?
- If yes, select 'Enter next Node silently'
Here we describe some scenario to show the usage of navigation:
In a list page, except repeating blocks and fields, we also want to extract some one-time elements. Then we can create two data extraction nodes:
- First - Detail node:
- to extract all these one-time elements,
- Detail node's navigation we select 'Enter next Node silently'
Second - List node:
- to extract all repeating blocks and fields.
Here, when detail node is executed, NDS will extract data from all one-time elements on the current page, and then start execute List node, where all repeating blocks and field on the same CURRENT page will be extracted.
- to extract all repeating blocks and fields.
- First - Detail node:
In a list page, we would like click each item's title to enter corresponding detail page and extract more details there. Then we create two data extraction nodes:
- First - List node:
- to extract each block's fields
- and click the CURRENT block's title field to jump to the item's corresponding detail page.
Second - Detail node:
- to extract the fields on the detail page.
Here, NDS handles detail pages' opening, closing/ navigating backward to the list automatically.
- to extract the fields on the detail page.
- First - List node:
In a list page, we would like click each items' title, enter next detail page and extract more details there. But the detail page loading is slow, so we would like ensure the body of the detail page is ready before executing extraction. So we create three nodes :
- First - List node:
- to extract each block's fields
- and click the CURRENT block's title field to jump to the detail page
- Second - Transit node:
- the node is working on the detail page. so we add an action to wait the details page 's body ready
Third - Detail node:
- to extract details.
After clicking one block's title, NDS will execute the actions on the Transit node. Here the action will check if the target element is ready. If yes, NDS will continue to excute detail node.
- to extract details.
- First - List node:
Now based on the context, we have selected 'enter next node silently', or 'click one field to continue' to do deep scraping. We click 'Next' wizard button on the bottom of the editor, the editor will prompt us what kind of node to create for the next.
If the navigation mode is 'enter next node silently', then a new node is created, and the editor switchs to the new node immediately.
If the navigation mode is 'click one field to continue', then a new node is created. At the same time, one field (the first one if more than one block) is clicked, the editor will shown on the new page once loading done.
Step 7: now the new current node is ready, and we can add more actions or declare blocks or fields to scrape data from the current page.
Up to now, we learned how to link multiple nodes to do deep scraping. When NDS runs the recipe, it will open the start url automatically, and then excutes the actions on each node. When all done, NDS will end the scraping, and you can export data to local CSV from Data Center.