dexi.io is the ultimate platform for web scraping!
But the dexi.io platform provides a lot more than just web scraping. In fact, Dexi can do anything you can do in your browser. Your imagination is the limit!
Dexi robots don’t just let you read data from a web page. Extractor robots, the “standard” and most powerful robot type in dexi.io for web scraping, let you perform logins and searches, select items in dropdown lists and dates in calendars, hover over elements, click buttons, submit forms, wait for elements to appear, loop over and paginate through results, and extract the results you want, as plain text or binary, formatted the way you want. Extractor robots also support XML.
Extractor robots are built in a simple but powerful point-and-click editor that shows the web page in the top part and, in the bottom part, a “developer console” like the one you know from your favourite browser.
Classic web scraping examples
But Extractor robots can do even more! Examples of more complex interactions include:
- Solving CAPTCHAs to bypass “I am not a robot” prompts
- Performing infinite scrolling to loop through “endless” result pages, e.g. Twitter feeds
- Performing geocoding, i.e. translating text addresses to GPS coordinates and vice versa
- Downloading images and other binary files
- Taking screenshots, e.g. for auditing purposes
Use the CAPTCHA Add-On to bypass CAPTCHA “I am not a robot” prompts
The platform also lets you control network requests, e.g. to block unneeded resources and optimise speed, or to prevent an error in the website’s source code so that the robot execution can succeed.
Advanced Robot Control Flow
Pipes robots allow you to define a graph of actions to allow for very flexible robot design. More on Pipes later.
Easily set up a sequence of steps and loops to control the flow of execution.
Getting Results: Executing Robots
Once a robot is working as intended it can be executed to get actual results. Robots can be executed with different configurations, most importantly with multiple input values, effectively executing the robot multiple times, say, with different search values or dates.
Other values that can be configured include:
- The proxies to use for localised execution. Dexi provides a pool of proxies and you can bring your own. NB! Proxies can also be an important tool against anti-robot features on a site.
- A timetable and a schedule which control when the robot is executed. Schedules can be expressed in cron syntax.
- Destinations to send results to, specified via add-ons of the types integrations and triggers.
- Concurrency and retry settings.
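The configuration options above can be pictured as a small data structure. The following is a minimal sketch, not the dexi.io configuration format: `RunConfig`, `run_with_retries` and `execute` are hypothetical names, the proxy is just a label, and the cron expression is only carried along (no scheduler is implemented). It shows how multiple inputs lead to one execution per input, with a concurrency limit and retries.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunConfig:
    """Hypothetical run configuration; field names are assumptions."""
    inputs: list[dict]            # one robot execution per input row
    proxy: str = "default"        # label of the proxy to execute through
    schedule: str = "0 6 * * *"   # cron syntax: every day at 06:00
    concurrency: int = 2          # executions running at the same time
    retries: int = 1              # extra attempts after a failure


def run_with_retries(robot: Callable, row: dict, proxy: str, retries: int) -> dict:
    """Call the robot, retrying on any exception up to `retries` times."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return robot(row, proxy)
        except Exception as error:
            last_error = error
    return {"input": row, "error": str(last_error)}


def execute(config: RunConfig, robot: Callable) -> list[dict]:
    """Run the robot once per input, at most `concurrency` at a time."""
    def run_one(row: dict) -> dict:
        return run_with_retries(robot, row, config.proxy, config.retries)

    with ThreadPoolExecutor(max_workers=config.concurrency) as pool:
        # pool.map keeps the results in the same order as the inputs
        return list(pool.map(run_one, config.inputs))
```

Here `robot` is any callable taking an input row and a proxy label, so a stub function can stand in for a real robot while trying out the flow.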
The results of an execution can be viewed directly in the UI, downloaded in common file formats (csv/xls/json) or, as mentioned above, sent to and stored in a number of different places. More on this later.
A single robot can have multiple configurations which can be executed independently.
Robot not working? Debugging to the rescue!
The Extractor robot editor shows the steps, loops and branches of the robot. Just like you would debug a program using a debugger (in an IDE), the state of an Extractor robot can be inspected and debugged directly in the editor (keyboard shortcuts are supported):
- The current step (instruction) is shown
- Move execution to the next step or play up to a certain step
- Loops can be “navigated” by incrementing (or decrementing) the index of the loop
- Results (values of variables) can be viewed at any point
If an execution of a robot has failed, a log shows all events, which can help you debug the robot.
Dexi uses sophisticated anonymization techniques to hide its presence, but if the target site has detected the robot, a typical solution is to change the proxy.
More than “just” browser automation
Via Pipes robots it is possible to define a completely custom robot execution flow performing arbitrarily complex data processing and transformation logic. For example, a Pipes robot could execute an Extractor robot, loop over its results, call an external web service for a specific field in each result, do some custom formatting of the web service result and save the “enriched” results in a data set.
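The enrichment flow in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pipes robots are implemented: `run_extractor` and `geocode` are hypothetical stand-ins that return canned, made-up values instead of executing a robot or calling a real web service.

```python
from typing import Callable


def run_extractor() -> list[dict]:
    """Stand-in for executing an Extractor robot; returns canned rows."""
    return [
        {"name": "Store A", "address": "1 Main Street"},
        {"name": "Store B", "address": "2 High Street"},
    ]


def geocode(address: str) -> dict:
    """Stand-in for an external web service; coordinates are made up."""
    known = {"1 Main Street": {"lat": 55.0, "lng": 12.0}}
    return known.get(address, {"lat": 0.0, "lng": 0.0})


def enrich(rows: list[dict], service: Callable[[str], dict]) -> list[dict]:
    """Loop over results, call the service for one field, format, and save."""
    data_set = []
    for row in rows:
        location = service(row["address"])   # one call per result
        enriched = dict(row)                 # keep the original fields
        # custom formatting of the web service result
        enriched["coordinates"] = f"{location['lat']:.4f},{location['lng']:.4f}"
        data_set.append(enriched)            # "save" to an in-memory data set
    return data_set


if __name__ == "__main__":
    for row in enrich(run_extractor(), geocode):
        print(row)
```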
Other features in Pipes robots include:
- Grouping, filtering, sorting and counting rows
- Eliminating duplicates
- Performing pivot and reverse pivot operations on data
- Formatting, splitting and combining text
- Parsing RSS, Atom and RDF feeds
Web Scale Data & Data Normalisation
Extracting web data typically means huge amounts of data, and that data is often heterogeneous and of varying quality.
For example, one website might provide certain information about a product (name, price, description) whereas another website provides less information (name, price). Furthermore, different spellings or formattings of a product name can make it challenging to normalise/standardise the data. Examples: “Samsung Galaxy” vs “Samsung Gallaxy” and “Tab S2” vs “S2 Tab”.
Dexi provides a number of different ways to overcome these challenges.
Data sets allow millions of rows to be stored and queried efficiently. A dexi.io data set can be seen as a table in a relational/SQL database, a collection in a NoSQL database or a sheet in a spreadsheet. A data set has a data type defining the fields for each row in the data set. Rows can be created (added), viewed (read), modified (updated) and deleted (i.e. the usual CRUD operations).
A dexi.io data set includes an additional feature that makes it more advanced and powerful than its traditional counterpart: it contains a dynamic key configuration which allows data deduplication and record linkage operations to be performed.
The key configuration, or just key, can consist of multiple fields. Depending on the data type of the fields included in the key, different comparison methods are available, e.g. Levenshtein (edit) distance for strings. A threshold defines whether two values are automatically considered duplicates, should be manually verified or are considered distinct.
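To make the threshold idea concrete, here is a minimal sketch of Levenshtein distance with a three-way classification. The function names and the threshold values (`duplicate_max`, `review_max`) are assumptions chosen for illustration; they are not dexi.io settings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def classify(a: str, b: str, duplicate_max: int = 1, review_max: int = 3) -> str:
    """Map an edit distance onto duplicate / needs review / distinct."""
    distance = levenshtein(a.lower(), b.lower())
    if distance <= duplicate_max:
        return "duplicate"
    if distance <= review_max:
        return "needs review"
    return "distinct"


if __name__ == "__main__":
    print(classify("Samsung Galaxy", "Samsung Gallaxy"))
    print(classify("Samsung Galaxy", "Apple iPhone"))
```

Plain edit distance handles misspellings well, but word-order variants such as “Tab S2” vs “S2 Tab” are several edits apart, so they need a different comparison method, for example one based on tokens.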
The key configuration can be changed, e.g. “narrowing down” or “widening” the key, and another deduplication or record linkage operation run to update the data set to reflect the new key.
Data Types and AutoBots
The way to normalise the fields of the results from executions of different robots is to use data types. Data types can be used as both input and output for robots as well as for data sets and dictionaries.
Data types support standard primitive types, e.g. numbers and booleans, as well as complex/object values. AutoBots using data types answer the question of how to normalise the results across a number of robots extracting data from different domains, e.g. product information from amazon.com, alibaba.com and bestbuy.com. At least one example URL per domain is provided, an Extractor robot is created for each domain, and the output data type of the AutoBot ensures a common format for results.
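As a rough illustration of what a common output data type buys you, the sketch below maps two differently shaped raw results onto one `Product` type. The raw field names (`title`, `details`, `product_name`, `price_cents`) and the two mapper functions are hypothetical; real sites and real AutoBot configurations will look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """The shared output data type: every robot must produce this shape."""
    name: str
    price: float
    description: Optional[str] = None  # not every site provides one


def from_site_a(raw: dict) -> Product:
    """Hypothetical site A: price is a string such as "$399.00"."""
    return Product(
        name=raw["title"].strip(),
        price=float(raw["price"].lstrip("$")),
        description=raw.get("details"),
    )


def from_site_b(raw: dict) -> Product:
    """Hypothetical site B: price is an integer number of cents."""
    return Product(
        name=raw["product_name"].strip(),
        price=raw["price_cents"] / 100,
    )


if __name__ == "__main__":
    print(from_site_a({"title": " Tab S2 ", "price": "$399.00", "details": "Tablet"}))
    print(from_site_b({"product_name": "Tab S2", "price_cents": 39900}))
```

Once both results share one type, they can be stored in the same data set and compared field by field.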
A dictionary provides similar functionality to a data set with a key configuration but can, as the name implies, be used to easily perform lookups of keys to values. It is often used to correct misspellings like “Galaxy” vs “Gallaxy”.
Lookups in a dictionary can be exact or “fuzzy”, i.e. using the same Levenshtein distance as for the key configurations described above. They can also be done by tokenizing the key and the lookup value, effectively performing a “contains” query word by word. Finally, keys can be regular expressions, so that a lookup of “Tesla Model 3” will match the key “Tesla Model [SX3]”.
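The four lookup styles can be sketched with a plain Python `dict`. This is an illustration under assumptions, not the dexi.io implementation: the function names are hypothetical, the fuzzy threshold is arbitrary, and the token lookup assumes that “contains” means every word of the key occurs in the lookup value.

```python
import re
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def lookup_exact(entries: dict, term: str) -> Optional[str]:
    return entries.get(term)


def lookup_fuzzy(entries: dict, term: str, max_distance: int = 1) -> Optional[str]:
    """Return the value of the closest key if it is within max_distance."""
    best_key, best_distance = None, max_distance + 1
    for key in entries:
        distance = edit_distance(key.lower(), term.lower())
        if distance < best_distance:
            best_key, best_distance = key, distance
    return entries[best_key] if best_key is not None else None


def lookup_tokens(entries: dict, term: str) -> Optional[str]:
    """Match the first key whose words all occur in the term, in any order."""
    words = set(term.lower().split())
    for key, value in entries.items():
        if set(key.lower().split()) <= words:
            return value
    return None


def lookup_regex(entries: dict, term: str) -> Optional[str]:
    """Treat each key as a regular expression that must match the whole term."""
    for pattern, value in entries.items():
        if re.fullmatch(pattern, term):
            return value
    return None
```

For example, `lookup_fuzzy({"Galaxy": "Galaxy"}, "Gallaxy")` corrects the misspelling, and `lookup_tokens({"S2 Tab": "Tab S2"}, "Samsung Tab S2")` matches regardless of word order.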
Results from robot executions can be delivered in a number of different ways for manual consumption by a human or automatic consumption by a program:
- Downloaded manually from the UI in common file formats (csv/xls/json)
- Sent to a variety of integrations with external services, e.g. Google Drive, Google Sheets, Amazon S3 and more, or to your own custom webhook endpoints
- Saved to a data set or dictionary via triggers
- Written to a SQL database via a Pipes robot action
- Retrieved via the dexi.io API (see below)
Besides retrieving the results of an execution as mentioned above, the API also supports, for example:
- Creating and updating robots and configurations (CRUD)
- Executing robots with inputs
- Searching data sets