Retrieve value(s) from a website

Retrieves one or more values from a website. See Configuration Rules for details on writing data extraction rules.

Application

  • Web Process Automation

Inputs (what you have)

Name | Description | Data Type | Required? | Example
Browser session ID | The unique identifier of the browser instance. Can be retrieved from the "Open a browser session" Wrk Action. | Text (Short) | Yes | fb9bc380-146c-420e-9130-50ce93614e05
Extraction type | The type of data extraction to use. Choices: CSS Selector(s), AI Extraction. | Predefined Choice List | Yes | CSS Selector(s)

If Extraction type = CSS Selector(s)

Name | Description | Data Type | Required? | Example
Data extraction rules | Instructions for what data should be extracted from the site. | Text (Long) | Yes | { "title" : "h1"}

If Extraction type = AI Extraction

Name | Description | Data Type | Required? | Example
Query | A natural-language description of the information you want to extract from the webpage. | Text (Long) | No | Grab the first three items in the list.
AI Data extraction rules | JSON Schema instructions for what data should be extracted from the site. Use this tool to build your JSON Schema: https://json.ophir.dev/ | Text (Long) | Yes | See the JSON Schema below.

{
  "type": "object",
  "properties": {
    "tagline": {
      "type": "string",
      "description": "tagline of the company"
    }
  },
  "required": [
    "tagline"
  ]
}

Note: The value of an input can be either a set value in the configuration of the Wrk Action within the Wrkflow, or a variable from the Data library section. Variables in the Data library section are the outputs of previous Wrk Actions in the Wrkflow.
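
As a rough illustration of the AI Extraction inputs, the sketch below rebuilds the example JSON Schema in Python and checks a candidate result against its "required" list. The helper name missing_required_keys and the sample result are hypothetical and not part of the Wrk product; this is only one way to sanity-check a schema before pasting it into the AI Data extraction rules input.

```python
import json

# JSON Schema taken from the example for the "AI Data extraction rules" input.
schema = {
    "type": "object",
    "properties": {
        "tagline": {"type": "string", "description": "tagline of the company"}
    },
    "required": ["tagline"],
}


def missing_required_keys(schema: dict, result: dict) -> list:
    """Hypothetical helper: list required schema keys that are absent from result."""
    return [key for key in schema.get("required", []) if key not in result]


# Text that could be pasted into the input field.
schema_text = json.dumps(schema, indent=2)
print(schema_text)

# Made-up result, used only to exercise the check.
sample_result = {"tagline": "Example tagline"}
print(missing_required_keys(schema, sample_result))
```

An empty list from missing_required_keys means every required key is present in the candidate result.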

Outputs (what you get)

Name | Description | Data Type | Required? | Example
Extracted content | JSON result of the data extracted from the website | Text (Long) | Yes | {"title" : "The Wrk Blog"}
Unsuccessful message | If unsuccessful, a message stating what went wrong | Text (Long) | Yes |
Unsuccessful screenshot | If unsuccessful, a screenshot of the website | File | Yes |
Fields to Capture | | | Yes |

Note: Use the "Fields to Capture" input to create additional outputs for the Wrk Action. Each captured field takes the value of the matching first-level key in the JSON output of the Wrk Action.

Example: If the "Extracted content" output is {"title": "The Wrk Blog"}, you can add a "Fields to Capture" input named "title". This adds an output called "title" to the Wrk Action with the value "The Wrk Blog".
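
The first-level lookup described in this note can be sketched in Python as follows. The function name capture_fields is hypothetical, and skipping names that are not present is an assumption; how the Wrk Action itself treats a missing key is not stated here.

```python
import json


def capture_fields(extracted_content: str, field_names: list) -> dict:
    """Hypothetical helper: pick first-level keys out of a JSON output.

    Only top-level keys are considered; nested keys are not searched.
    Names that are not present are skipped (an assumption).
    """
    data = json.loads(extracted_content)
    if not isinstance(data, dict):
        return {}
    return {name: data[name] for name in field_names if name in data}


captured = capture_fields('{"title": "The Wrk Blog"}', ["title"])
print(captured)  # prints {'title': 'The Wrk Blog'}
```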

Outcomes

Name | Description
Success | This status is selected when data was successfully retrieved from the website.
No result | This status is selected in the following scenario: there is no Extracted content.
Impossible to complete | This status is selected in the following scenario: unable to connect to the session for some reason.
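
The three outcomes can be read as a simple decision order, sketched below. The function classify_outcome and its parameters are hypothetical, and treating an empty string or an empty JSON object as "no Extracted content" is an assumption; the Wrk platform's internal logic is not documented on this page.

```python
from typing import Optional


def classify_outcome(session_connected: bool, extracted_content: Optional[str]) -> str:
    """Hypothetical mapping from a run's result to the outcome names in the table."""
    if not session_connected:
        # Could not connect to the browser session.
        return "Impossible to complete"
    if extracted_content is None or extracted_content.strip() in ("", "{}"):
        # Assumption: empty text or an empty JSON object counts as no content.
        return "No result"
    return "Success"


print(classify_outcome(True, '{"title": "The Wrk Blog"}'))  # prints Success
print(classify_outcome(True, "{}"))                         # prints No result
print(classify_outcome(False, None))                        # prints Impossible to complete
```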

Configuration Rules

To simplify setting up data extraction rules for the "Retrieve data from a website" Wrk Action, we recommend a Chrome extension called "jQuery Unique Selector". You can install it from this link: jQuery Unique Selector Chrome Extension.

Here's how to use it:

1. After installing the extension, click on the magnifying glass icon in your browser. This will activate the selector mode.

2. Select the specific item you want to extract from the website by clicking on it.

3. Copy the "Selected Element Selector" that the extension provides, and use it in your data extraction rules for the "Retrieve data from a website" action. Repeat this process for all the content you need to retrieve from the website.

4. To disable the extension, simply click the magnifying glass icon again.

This tool should cover most data extraction tasks on most sites. However, some situations may require additional configuration. If you come across one, don't hesitate to reach out for further assistance.
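
Once the selectors are copied, they need to be assembled into a single JSON object for the Data extraction rules input. A minimal Python sketch of that step follows; the selectors and the helper name build_rules are hypothetical stand-ins for whatever the extension reports on your page.

```python
import json

# Hypothetical selectors, standing in for values copied from
# the extension's "Selected Element Selector" field.
copied_selectors = {
    "title": "h1",
    "sub_heading": "#subtitle",
}


def build_rules(selectors: dict) -> str:
    """Hypothetical helper: turn {output name: CSS selector} into rules JSON."""
    for name, selector in selectors.items():
        if not name.strip() or not selector.strip():
            raise ValueError(f"empty name or selector in rule {name!r}")
    return json.dumps(selectors, indent=2)


rules_text = build_rules(copied_selectors)
print(rules_text)
```

The printed text is what would be pasted into the Data extraction rules input.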

Examples of extraction rules using CSS Selectors

Each entry below gives a description followed by its code example.

Simple rule to extract the text content of an h1 element using a CSS selector.
{"main_heading": "h1"}
Simple rule to extract the text content of a subtitle element using its ID.
{"sub_heading": "#subtitle"}
Rule to extract an HTML attribute, such as the href from a link.
{"link": "a@href"}
Complex rule with a selector, specifying the desired output format.
{
"main_heading": {
"selector": "h1",
"output": "text"
}
}
Rule to extract the HTML content of an element.
{
"title_html": {
"selector": "h1",
"output": "html"
}
}
Rule to extract an attribute by using the "@" prefix in the output field.
{
"title_id": {
"selector": "h1",
"output": "@id"
}
}
Rule to extract data from a table and format it as a JSON object.
{
"table_json": {
"selector": "table",
"output": "table_json"
}
}
Rule to extract data from a table and format it as an array.
{
"table_array": {
"selector": "table",
"output": "table_array"
}
}
Simple syntax for extracting text content and attributes without specifying output or type.
{
"main_heading": "h1",
"link": "a@href"
}
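
The shorthand strings and the longer object form appear to describe the same thing. The sketch below expands a shorthand string into the object form, assuming that "a@href" is equivalent to a rule with "selector": "a" and "output": "@href", and that a bare selector defaults to text output. Both equivalences are assumptions drawn from the examples on this page, and expand_shorthand is a hypothetical helper.

```python
def expand_shorthand(rule: str) -> dict:
    """Hypothetical helper: expand "selector@attr" into the object form."""
    selector, separator, attribute = rule.rpartition("@")
    if separator and selector:
        # "a@href" -> selector "a", attribute output "@href"
        return {"selector": selector, "output": "@" + attribute}
    # No attribute part: assume the text content is wanted.
    return {"selector": rule, "output": "text"}


print(expand_shorthand("h1"))      # prints {'selector': 'h1', 'output': 'text'}
print(expand_shorthand("a@href"))  # prints {'selector': 'a', 'output': '@href'}
```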
Rule to extract and format information from a table using JSON representation. This works with both standard <table> markup and ARIA-style grid layouts built with <div> elements (for example, containers using role="grid", role="columnheader", and role="gridcell"). When a div-based grid is detected, the scraper reads column headers and row cells from those ARIA roles and returns them in the same array/JSON format as a normal HTML table.
{
"table_json": {
"selector": "#table_id",
"output": "table_json"
}
}
Rule to extract and format information from a table using array representation. This works with both standard <table> markup and ARIA-style grid layouts built with <div> elements (for example, containers using role="grid", role="columnheader", and role="gridcell"). When a div-based grid is detected, the scraper reads column headers and row cells from those ARIA roles and returns them in the same array/JSON format as a normal HTML table.
{
"table_array": {
"selector": "#table_id",
"output": "table_array"
}
}
Rule to extract the first matching element for a selector.
{
"first_post_title": {
"selector": ".post-title",
"type": "item"
}
}
Rule to extract all matching elements for a selector.
{
"all_post_titles": {
"selector": ".post-title",
"type": "list"
}
}
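
To make the difference between "item" and "list" concrete, here is a small stand-alone simulation using Python's standard html.parser module. It is not the scraper used by the Wrk Action; it only matches elements by class name, and the sample HTML and the names ClassTextCollector and extract_by_class are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given class.

    Simplification: void tags such as <br> inside a match are not handled.
    """

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def extract_by_class(html: str, class_name: str, rule_type: str):
    """Return the first match for "item", or all matches for "list"."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    texts = [text.strip() for text in collector.matches]
    if rule_type == "list":
        return texts
    return texts[0] if texts else None


sample_html = '<h2 class="post-title">First</h2><h2 class="post-title">Second</h2>'
print(extract_by_class(sample_html, "post-title", "item"))  # prints First
print(extract_by_class(sample_html, "post-title", "list"))  # prints ['First', 'Second']
```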
Rule to clean the extracted text content (the default behavior).
{
"first_post_description": {
"selector": ".card > div",
"clean": true
}
}
Rule to extract content without cleaning, preserving whitespace and special characters.
{
"first_post_description": {
"selector": ".card > div",
"clean": false
}
}
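
The two rules above differ only in the "clean" flag. Exactly what the cleaning step removes is not specified on this page; the sketch below assumes it at least trims and collapses whitespace, which is a guess for illustration. The helper clean_text and the sample string are hypothetical.

```python
def clean_text(raw: str) -> str:
    """Hypothetical cleaning step: trim the ends and collapse whitespace runs."""
    return " ".join(raw.split())


raw_description = "  First post\n\t description  "
print(repr(clean_text(raw_description)))  # prints 'First post description'
print(repr(raw_description))              # unchanged text, as with "clean": false
```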
Nested extraction rule to gather detailed information from multiple items.
{
"articles": {
"selector": ".card",
"type": "list",
"output": {
"title": ".post-title",
"link": {
"selector": ".post-title",
"output": "@href"
},
"description": ".post-description"
}
}
}
Nested extraction rules to gather detailed information both from items inside the selected item and from the selected item itself. Note: This is especially useful when you want to extract information not only from elements nested in the main element but also from the main element itself.
{
"Cards": {
"selector": "div>a",
"type": "list",
"output": {
"name": "h1.Card_name",
"link": {
"selector": "@",
"output": "@href"
}
}
}
}
Rule to extract all links from a page, returning an array of href attributes.
{
"all_links": {
"selector": "a",
"type": "list",
"output": "@href"
}
}
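
Pages often contain relative href values. Whether the Wrk Action returns them resolved or as written is not stated here, so a later step may need to normalize them. The sketch below does that with urljoin from Python's standard library, assuming the Extracted content is a JSON object with an "all_links" array as in the rule above. The base URL, the sample data, and the helper absolutize_links are hypothetical.

```python
import json
from urllib.parse import urljoin


def absolutize_links(extracted_content: str, base_url: str) -> list:
    """Hypothetical helper: resolve each href against the page URL."""
    data = json.loads(extracted_content)
    return [urljoin(base_url, href) for href in data.get("all_links", [])]


# Made-up output for the "all_links" rule.
sample_content = '{"all_links": ["/blog", "https://example.com/about", "contact.html"]}'
links = absolutize_links(sample_content, "https://example.com/docs/")
print(links)
```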
Rule to extract text and href for each link, providing a more detailed structure.
{
"all_links": {
"selector": "a",
"type": "list",
"output": {
"anchor": {
"selector": "@",
"output": "text"
},
"href": {
"selector": "@",
"output": "@href"
}
}
}
}
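
The rules on this page use a small set of keys: "selector", "output", "type", and "clean". A quick structural check before pasting rules into the Wrk Action can catch typos. The validator below is a sketch that only knows the values shown in these examples; the real scraper may accept more, and the names validate_rule and validate_rules are hypothetical.

```python
# Values taken from the examples on this page; the real scraper may accept others.
KNOWN_OUTPUTS = ("text", "html", "table_json", "table_array")
KNOWN_TYPES = ("item", "list")


def validate_rule(name: str, rule) -> list:
    """Hypothetical check of one rule; returns a list of problem messages."""
    if isinstance(rule, str):
        return [] if rule.strip() else [f"{name}: empty selector"]
    if not isinstance(rule, dict):
        return [f"{name}: rule must be a string or an object"]
    problems = []
    if not isinstance(rule.get("selector"), str):
        problems.append(f"{name}: missing 'selector'")
    if "type" in rule and rule["type"] not in KNOWN_TYPES:
        problems.append(f"{name}: unknown type {rule['type']!r}")
    if "clean" in rule and not isinstance(rule["clean"], bool):
        problems.append(f"{name}: 'clean' must be true or false")
    output = rule.get("output")
    if isinstance(output, dict):
        # Nested rules: validate each child the same way.
        for child_name, child_rule in output.items():
            problems.extend(validate_rule(f"{name}.{child_name}", child_rule))
    elif isinstance(output, str):
        if output not in KNOWN_OUTPUTS and not output.startswith("@"):
            problems.append(f"{name}: unknown output {output!r}")
    elif output is not None:
        problems.append(f"{name}: 'output' must be a string or an object")
    return problems


def validate_rules(rules: dict) -> list:
    """Validate every top-level rule and collect all problems."""
    problems = []
    for name, rule in rules.items():
        problems.extend(validate_rule(name, rule))
    return problems


link_rules = {
    "all_links": {
        "selector": "a",
        "type": "list",
        "output": {
            "anchor": {"selector": "@", "output": "text"},
            "href": {"selector": "@", "output": "@href"},
        },
    }
}
print(validate_rules(link_rules))  # prints []
```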
Rule to extract all textual content from a page body.
{"text": "body"}
Rule to extract all email addresses from a page using mailto links.
{
"email_addresses": {
"selector": "a[href^='mailto']",
"output": "@href",
"type": "list"
}
}
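
Because this rule returns href attributes, each value still carries the "mailto:" prefix and may include a query string such as "?subject=". A later step could reduce them to plain addresses. The sketch below assumes the Extracted content is a JSON object with an "email_addresses" array; the sample data and the helper emails_from_mailto are hypothetical.

```python
import json
from urllib.parse import unquote


def emails_from_mailto(extracted_content: str) -> list:
    """Hypothetical helper: strip "mailto:" and any query string, drop duplicates."""
    data = json.loads(extracted_content)
    emails = []
    for href in data.get("email_addresses", []):
        if not href.lower().startswith("mailto:"):
            continue
        # Keep the part between "mailto:" and an optional "?" query.
        address = unquote(href[len("mailto:"):].split("?", 1)[0])
        if address and address not in emails:
            emails.append(address)
    return emails


# Made-up output for the "email_addresses" rule.
sample_content = json.dumps({
    "email_addresses": [
        "mailto:hello@example.com?subject=Hi",
        "mailto:hello@example.com",
        "mailto:sales%40example.com",
    ]
})
print(emails_from_mailto(sample_content))  # prints ['hello@example.com', 'sales@example.com']
```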