Collect Data
The "Collect Data" operation is the core operation when designing web crawler tasks: it collects/extracts data from any element on the web page (such as a link, title, or text box) and saves it to a data store, such as a .csv file or a database.
With EasySpider, collecting data via point-and-click is very easy.
Steps to define the collect/extract data operation are:
- Select the element we want to collect/extract by right-clicking it or pressing F7.
- Select the "Extract element's text" option, or another option such as "Collect Inner/Outer Html of this element", "Collect address of this link", or "Extract text/selection box's text", based on your requirements.
- Example parameters will be shown in the Operation Toolbox; if needed, we can delete unused ones by clicking the × mark in the "Delete" field.
- Click the "Confirm Collect" option to confirm.
Then the "Collect Data" operation will be added to the Workflow Manager.
This section shows the available properties of the "Collect Data" operation in the Workflow Manager.
- Option Name: the name of the operation; click the "Confirm" button to refresh the name after modification.
- Parameter Box: displays all parameters of the current operation; we can rename a parameter, delete it, adjust its order, or modify other information with the corresponding "Delete", "Up", "Down", and "Modify" options.
- Current parameter name: shows the name of the parameter being modified.
- XPath: the XPath of the element for the current parameter, generated by EasySpider and freely modifiable by the user. If the "Use relative XPATH" option is not selected, this XPath is a full XPath, meaning EasySpider will use this single XPath alone to locate the element; otherwise it is combined with the XPath defined in the closest "Loop" operation.
- Use relative XPATH: whether to combine this XPath with the one defined in the closest loop. This option only appears when the "Collect Data" operation is inside a "Loop" operation. E.g., if we want to loop-collect product titles from a list, and the XPaths of the 1st, 2nd, 3rd, ..., 100th product titles inside this list are:
/html/body/div/a[1]
/html/body/div/a[2]
/html/body/div/a[3]
...
/html/body/div/a[100]
We can also use a single XPath to express all these titles:
/html/body/div/a
This XPath can be divided into two parts: a single parent (/html/body/div) and its many children (/a). Under this condition, we can set the "/html/body/div" XPath in the "Loop" operation and the "/a" XPath in the "Collect Data" operation with "Use relative XPATH" selected, which collects all data from "/html/body/div/a".
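The parent/child combination above can be sketched with Python's standard-library ElementTree (the HTML fragment and variable names are made up for illustration; ElementTree's XPath subset uses relative paths, so "/html/body/div" is written as "./body/div" from the root):

```python
import xml.etree.ElementTree as ET

# Made-up product list mirroring the /html/body/div/a structure above.
doc = ET.fromstring(
    "<html><body><div>"
    "<a>Product 1</a><a>Product 2</a><a>Product 3</a>"
    "</div></body></html>"
)

loop_xpath = "./body/div"  # the parent XPath set on the "Loop" operation
relative_xpath = "./a"     # the child XPath set on "Collect Data"

# For each element matched by the loop XPath, apply the relative XPath.
titles = [
    link.text
    for container in doc.findall(loop_xpath)
    for link in container.findall(relative_xpath)
]
print(titles)  # ['Product 1', 'Product 2', 'Product 3']
```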
- Extraction Type: the type of data to be extracted. There are 4 options in total: Text (include child element), Text (exclude child element), outerHTML, and innerHTML.
For example, if we have an element whose HTML source code is:
<div>123<p>456</p>789</div>
Then the data extraction results for the four types will be:
- Text (include child element): 123456789
- Text (exclude child element): 123789. Compared to the first option, EasySpider only collects text from the direct layer of this element, not from its children.
- outerHTML: <div>123<p>456</p>789</div>
- innerHTML: 123<p>456</p>789
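The four extraction types can be reproduced on the example element with Python's standard library (a sketch for illustration only; EasySpider performs this extraction internally):

```python
import xml.etree.ElementTree as ET

el = ET.fromstring("<div>123<p>456</p>789</div>")

# Text (include child element): every text node under the element.
text_all = "".join(el.itertext())

# Text (exclude child element): only text nodes directly inside <div>
# (the element's own text plus the tail text after each child).
text_direct = (el.text or "") + "".join(c.tail or "" for c in el)

# outerHTML: the element serialized including its own tag.
outer = ET.tostring(el, encoding="unicode")

# innerHTML: everything inside the tag, child markup included.
inner = (el.text or "") + "".join(
    ET.tostring(c, encoding="unicode") for c in el
)

print(text_all)     # 123456789
print(text_direct)  # 123789
print(outer)        # <div>123<p>456</p>789</div>
print(inner)        # 123<p>456</p>789
```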
- Node Type: which kind of data we want to collect. There are five options in total: Ordinary Node, Link Text, Link Address, Form Value, and Image address. E.g., if we choose "Image address", EasySpider will collect the image link address from the <img> tag.
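For example, "Image address" corresponds to reading the src attribute of the matched image element, roughly like this (the HTML fragment is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up fragment containing an image.
root = ET.fromstring('<div><img src="https://example.com/pic.png"/></div>')

# "Image address" resolves to the src attribute of the <img> element.
img = root.find(".//img")
print(img.get("src"))  # https://example.com/pic.png
```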
- Parameter Description: the description of the current parameter.
- Absent value when cannot find the element: the default value of the current parameter when the element cannot be found by the defined XPath at extraction time.
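The fallback behaviour can be sketched as follows (the helper function and absent value are hypothetical, not EasySpider's actual code):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<div><span>in stock</span></div>")

# Hypothetical helper: if the XPath matches nothing, the parameter
# takes the configured absent value instead of failing.
def collect(root, xpath, absent_value="N/A"):
    el = root.find(xpath)
    return el.text if el is not None else absent_value

print(collect(doc, "./span"))   # in stock
print(collect(doc, "./price"))  # N/A
```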
- Seconds after executed: how long EasySpider should wait after the expected data has been automatically collected.
When a task is executed, the text/HTML of the element specified in the "Collect Data" operation will be automatically extracted and saved to the data store.
Every parameter in the "Collect Data" operation will be mapped to an output parameter of the task.