Senior Data Scientist at Zyte

Posted on: 01/03/2022

Location: (REMOTE)


Tags: coverage scrapy node pytorch python ml

At Zyte (formerly Scrapinghub), our goal is to help you get data from the web. To that end, we develop services such as smart rotating proxies, a browser rendering API, a data extraction API, and a cloud for running your crawling jobs.

On the data science team, our main project is the data extraction API, which can extract articles, products, job postings and other data types from any website, and can also do automatic crawling and discovery. We approach this as a machine learning problem, with a deep learning model that combines a web page screenshot, text, node information and other features, trained on hundreds of thousands of web pages. We work on improving extraction quality and increasing coverage of attributes and data types.

I find this problem really fascinating to work on: on the one hand, you get to work on a neural network that takes images, text and graphs as inputs and can draw inspiration from current ML literature; on the other hand, web extraction is not so well studied, so a great deal of experimentation is required.

Our tech stack on the ML side is Python and PyTorch. We love open source: Zyte's founders are the authors of the popular Scrapy framework, and we open source many libraries we rely on heavily internally, such as dateparser and extruct.

The company has been fully remote since the start and hires from a large number of countries. For more details, see <>, and feel free to check our other positions at <>