Site Reliability Engineer at Promoted.ai
Posted on: 04/29/2022
[Promoted.ai](http://promoted.ai) powers marketplace search and feed to better match buyers and sellers. For example, when you search on apps like Airbnb, we sort the results and promote the item you’re most likely to book, which increases revenue. We're a [team](https://www.promoted.ai/about-us) of ads infra and ML engineers from Google, Facebook, and Pinterest, and we've built this together before.

We're hiring an experienced SRE to lead our production engineering.

Our current stack: AWS, Terraform, AWS managed services. We will expand to multi-cloud in the future; our current need is for AWS.

Technical challenges:

* Latency - Our services are in the path of generating search results, so they must be fast. We need end-to-end retrieval, full ML scoring, allocation, and ads auction in under 100ms, round trip.
* Reliability - Our services are in the path of revenue and end-user experience for our customers. High reliability and mature failure management practices are critical.
* Complexity - We run both streaming user metrics and machine learning systems that ingest these metrics to power marketplace search in different customer environments.
* Scale - To power our recommendations, we collect a lot of data. For example, in our v1, we log impressions and join them with ML features and clicks using streaming server-side joins for scale and security. Given the size of our enterprise customers, we will push the boundaries of scale for our dependencies.

Responsibilities:

* Own our production setup. Improve our plan. Execute it. Help us hire the people needed to get it done.