File and image scraping
During scraping, besides text content, we often want to scrape files such as images, videos, and PDF documents. When browsing web pages, we usually download files manually in one of three ways:
- Click a button or link, and the browser starts downloading the file automatically.
- Right-click an image and choose 'Save image as' to save it locally. This works whether the image's src is a URL or a base64 string.
- Right-click a file link and choose 'Save link as' to download the file behind the link.
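To illustrate why the form of an image's src does not matter for saving, here is a minimal Python sketch that turns either form into raw bytes. It is not NDS code; the function name `image_bytes_from_src` and the demo value are made up for illustration, and the sketch assumes a non-data src is an absolute http(s) URL.

```python
import base64
from urllib.parse import unquote_to_bytes
from urllib.request import urlopen


def image_bytes_from_src(src: str) -> bytes:
    """Return the raw image bytes for an <img> src value (hypothetical helper)."""
    if src.startswith("data:"):
        # A data URI looks like: data:[<media type>][;base64],<payload>
        header, _, payload = src.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        # Without ";base64" the payload is percent-encoded bytes.
        return unquote_to_bytes(payload)
    # Assumption: anything else is an absolute URL that can be fetched directly.
    with urlopen(src) as response:
        return response.read()


# Demo value only: a tiny base64 payload, not a complete image.
demo_src = "data:image/gif;base64,R0lGODlh"
print(image_bytes_from_src(demo_src))
```

In both cases the result is a byte string that can be written to a file, which is essentially what 'Save image as' does.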
Besides these scenarios, there is another, special scenario (scenario 4), where the web page displays images via CSS:
Here the page uses the CSS background-image property to show the image. Such images cannot easily be saved via right-click.
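For such images, the address sits inside the CSS value as `url(...)`. A rough sketch of pulling it out with a regular expression follows. This is only an illustration under simple assumptions (a single `url(...)` with no escaped quotes or parentheses inside it); the helper name `background_image_url` is hypothetical, and it does not describe how NDS does this internally.

```python
import re
from typing import Optional

# Matches url(...) with optional single or double quotes around the address.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def background_image_url(css_value: str) -> Optional[str]:
    """Return the first address found in a background-image value, or None."""
    match = _CSS_URL.search(css_value)
    return match.group(2) if match else None


# Example values as they might appear in a style attribute or computed style.
print(background_image_url('url("https://example.com/bg.png")'))
print(background_image_url("none"))
```

A value such as `none` or a pure gradient contains no `url(...)`, so the helper returns None for it.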
NDS currently supports scenarios 1, 2, and 4.
For scenario 1, add a click action on the target element. The click action can be placed in a Transit node or in the field's pre-action list. When the recipe runs, the element is clicked and the browser starts downloading the file automatically. In this scenario the file can only be downloaded to a local folder. To avoid being asked where to save each file before downloading, turn off the corresponding browser setting before running such a recipe. Please refer to the tip on closing the browser's download prompt.
For scenarios 2 and 4, declare a field for the target element. NDS shows a download (D/L) checkbox when you set 'Src(Image)' or 'Css Background Image URL' as the field's extraction attribute. Checking the box tells NDS to extract the image's URL and then download the image.
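The "extract the URL first, then download" idea can be sketched outside NDS with Python's standard library. The class name `ImageSrcCollector` and the sample page address are assumptions for illustration only. The sketch collects the src of every img element and resolves relative addresses against the page URL, so that each result could be downloaded on its own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect absolute image URLs from <img src=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Relative addresses are resolved against the page URL.
            self.urls.append(urljoin(self.base_url, src))


collector = ImageSrcCollector("https://example.com/gallery/")
collector.feed('<p><img src="img/a.png"><img src="https://cdn.example.com/b.jpg"></p>')
print(collector.urls)
```

Images without a src attribute are skipped, since there is nothing to download for them.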
When saving the recipe, tell NDS where to store the images. Images can currently be saved to a local folder or to remote cloud storage; Baidu Cloud Pan is supported.
Figure: Set field image download (set the attribute to 'Image URL' or 'CSS Background Image URL').
Figure: Select image saving target.
Note: To download files, NDS needs the 'downloads' permission, which is requested the first time you run a recipe that downloads images.
Click 'Allow' to let NDS download images.
Granting the download permission is a one-time operation, and it takes effect after you restart your browser.
- Close the browser's download prompt: If you plan to download images to a local folder, the download process is controlled by the browser. By default, the browser asks where to save each file before downloading, which interrupts NDS's downloads. To stop the prompt in Chrome, for example, open Chrome's Settings and go to the Downloads section:
Select the target folder for NDS to download files to, and turn off 'Ask where to save each file before downloading'.
- Save downloaded files to Baidu Pan: To scrape pictures and save them to Baidu Pan, create a Baidu Pan storage app via Popup/Setting/3rd Party Apps, and then select that app under 'Save File to' when saving the recipe.
Video - How to scrape files to local
Video - How to scrape files to Baidu Pan