Troops is fundamentally rethinking the way we do work. Autonomous drones and self-driving cars, once the stuff of science fiction, are now a reality. Meanwhile, the software we use for work is still stuck in the stone age.
At Troops, we believe that the future of work looks a lot more like a conversation with your personal assistant than pushing buttons and filling out forms. Just as Iron Man’s Tony Stark used his computerized assistant JARVIS to save the world, the workers of tomorrow will use Troops to become superheroes at their jobs.
Our role as engineers is to make this future a reality. To build a platform capable of supporting any job function, we first focused on a specific vertical. We chose a trillion-dollar industry that has been consistently underserved by modern technology: sales.
The first version of our product takes the form of a Slack bot that allows salespeople to quickly look up information about their customers and accounts, get alerts and notifications when a deal is going bad, and get scheduled reports of the most important deals to focus on every day.
Troops uses multiple data sets to construct the best possible understanding of our customers’ sales teams. This includes constantly crunching CRM data (currently Salesforce) and combining it with various other data sources, including data from email and calendar. The resulting insights are accessible to our users through Slack.
In this article, we’ll examine how Troops has built the data processing and extraction platform that powers our AI, the tools we’ve put in place to support it, and how we think about building the organization to deliver on our vision.
To make our vision a reality, we assembled a small team of world-class engineers in NYC. Technology is the core of our business, which is why we look for top technical chops. In addition to technical breadth and depth, we focus our search on those who embody our core values:
- Positive - have fun, stay positive, listen to others, show empathy, teach and learn from others.
- World Class - be the best at your craft through attention to detail and lifelong learning.
- Self Starting - take initiative. Bring your ideas to the table and run with them.
- Radically Transparent - speak your mind and be honest with yourself and others.
Screening for these characteristics has helped us build a highly collaborative team of amazing human beings who are up for any challenge that comes their way.
Similar to many modern startups, Troops has a flat organizational structure. This allows us to quickly align on customer goals and have full creative freedom to execute on the solution. We are pragmatists. Every day we look to strike a balance between shipping quickly and maintaining a high bar for engineering standards. During our weekly sprints, we use tools like GitHub code review, unit and integration testing, and frequent deployments to production to ensure that our product always delights our users.
Why Did We Choose Scala?
Our advisor, Gokul Rajaram (Square’s Product and Engineering Lead as well as the "Godfather" of AdSense), always asks: "What are you optimizing for?" The engineering team is optimizing for easy product iteration while building scalable infrastructure. Scala's powerful type system, functional idioms, and concise patterns allow our engineering team to quickly model the business solution. Its first-class immutable data structures and safe asynchronous and parallel constructs allow us to easily build distributed systems. Best of all, many great teams (Twitter, SoundCloud, Gilt, etc.) have already blazed the trail to successfully managing large Scala codebases in production.
Like any solution, Scala comes with its own set of tradeoffs. Its learning curve is steeper than that of many imperative languages, and writing in a purely functional style requires an adjustment in thinking. Given that Scala is a relative newcomer to mainstream development, experienced talent can be harder to find. Despite these drawbacks, in our experience, developers who are attracted to functional programming tend to be very strong.
To power our intelligence platform, we had to find a way to collect and process relevant information at scale. In order to operate in real time, the product relies heavily on CRM data intake and constant application of rulesets to drive the AI. To this end, at first we wrote individual batch processing applications in Scala and ran them using supervisord on Amazon EC2 instances. However, as our feature set grew, we needed a more robust system that would support horizontally scalable data ingestion with rapid development cycles. It needed to support on-demand onboarding of large organizations and synchronization of their sales data over multiple sources. Our solution is the Troops distributed data processing framework, code-named "Troops Servant."
Our data ingestion/processing framework has to provide:
- Real time data processing
- Monitoring of job status and schedules
- Concurrency control
- Horizontal Scalability
- Resiliency against ever changing 3rd party datasets
- Intelligent throttling of data flow accounting for API integration limits
- Data security
- Extensibility as product needs grow
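One common way to meet the throttling requirement above is a token bucket. The sketch below is a minimal illustration in Python (chosen for brevity; the production system is written in Scala), and it is an assumption about how such throttling could work, not a description of the Troops implementation. The class name `TokenBucket` and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens and return True if the call may proceed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call `try_acquire()` before each third-party API request and defer the request when it returns `False`, which keeps the average call rate under the integration's limit while still allowing short bursts.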
Even in our closed beta, we were already processing millions of Salesforce and email records a month with only 200 customers. To accomplish this, we set up the system with three modules:
Troops Scheduler — Decides what jobs to run and when to run them
The scheduler is responsible for creating new jobs. Each job does one thing: sending scheduled messages, syncing 3rd party data, processing email, etc. If the scheduler decides a job needs to run, it will 1) create a database record to track status and completion and 2) add the job ID to an SQS queue. By centralizing the creation of jobs in the scheduler, the engineering team can easily control load and abstract logic shared across different job types. It also allows for intelligent retry and exponential backoff when issues like 3rd party service failures arise.
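The two-step job creation and the retry with exponential backoff can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual Scala scheduler: a dict stands in for the job-tracking table, a deque stands in for the SQS queue, and the names `Scheduler`, `backoff_seconds`, and the 30-second base delay are all hypothetical.

```python
import uuid
from collections import deque


def backoff_seconds(attempt, base=30, cap=3600):
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


class Scheduler:
    def __init__(self):
        self.jobs = {}        # stand-in for the job-tracking database table
        self.queue = deque()  # stand-in for the SQS queue

    def schedule(self, job_type, account_id, now):
        job_id = str(uuid.uuid4())
        # 1) Create a record to track status and completion.
        self.jobs[job_id] = {
            "type": job_type,
            "account_id": account_id,
            "status": "pending",
            "attempts": 0,
            "run_at": now,
        }
        # 2) Put only the job ID on the queue.
        self.queue.append(job_id)
        return job_id

    def record_failure(self, job_id, now):
        """Mark a failed job for retry and return the time it becomes due."""
        job = self.jobs[job_id]
        job["run_at"] = now + backoff_seconds(job["attempts"])
        job["attempts"] += 1
        job["status"] = "retry_scheduled"
        return job["run_at"]

    def requeue_due(self, now):
        """Re-enqueue every retry whose backoff delay has elapsed."""
        due = [
            job_id
            for job_id, job in self.jobs.items()
            if job["status"] == "retry_scheduled" and job["run_at"] <= now
        ]
        for job_id in due:
            self.jobs[job_id]["status"] = "pending"
            self.queue.append(job_id)
        return due
```

Because only the scheduler writes to the queue, retry policy lives in one place and workers never need to know about backoff.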
Troops Servant Pool — Each Servant does a particular type of work
There are many servants in the system. They are written in Scala and managed by supervisord to provide resiliency in the case of an unexpected JVM process failure. When a servant application starts, it binds to Amazon SQS, then listens for and performs its specific type of work. The types of work include syncing and processing data from Salesforce, email and calendar. In the future we plan to expand the number of data sources and specific use cases. Troops Servants write out the results of each job to a PostgreSQL database. They also provide their own health checks by sending a heartbeat and current job metadata to the Troops Manager as processing states change.
We configure multiple servants for more intensive jobs to increase throughput and provide real-time ingestion and processing. Each servant application runs in its own JVM, so we can tune and scale each type of work independently based on the resource overhead of its workload.
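The servant loop described above (take a job ID, do one type of work, write the result, report state changes) can be sketched like this. It is a simplified Python illustration, not the real Scala servant: the `Servant` class is hypothetical, plain dicts and lists stand in for PostgreSQL and the Troops Manager, and it assumes each job type has its own queue.

```python
from collections import deque


class Servant:
    """Worker that handles exactly one type of job."""

    def __init__(self, job_type, handler, jobs, results, heartbeats):
        self.job_type = job_type
        self.handler = handler        # function that performs the actual work
        self.jobs = jobs              # stand-in for the job-tracking table
        self.results = results        # stand-in for the results database
        self.heartbeats = heartbeats  # stand-in for reports to the manager

    def _heartbeat(self, job_id, state):
        self.heartbeats.append(
            {"servant": self.job_type, "job_id": job_id, "state": state}
        )

    def run_once(self, queue):
        """Process at most one job from `queue`; return True if one was handled."""
        if not queue:
            return False
        job_id = queue.popleft()
        job = self.jobs[job_id]
        job["status"] = "running"
        self._heartbeat(job_id, "running")
        try:
            self.results[job_id] = self.handler(job)
            job["status"] = "completed"
        except Exception as exc:
            # Record the failure so it can be surfaced and retried later.
            job["status"] = "failed"
            job["error"] = str(exc)
        self._heartbeat(job_id, job["status"])
        return True
```

Running several instances of the same servant against one queue is what raises throughput for the heavier job types, since each instance pulls work independently.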
Troops Manager — Used to manage the system at large
The manager provides a high level view of the components within the system. It shows completed, running, and pending jobs as well as servant status for each type of job. If running jobs encounter data anomalies or integration issues that may need attention, this is where they are surfaced.
To support user onboarding and product customization, we had to build out a set of web-based tools and dashboards. Originally we started with AngularJS, but it quickly became difficult to manage state and share components. This is when we switched to React and Redux, which allowed us to rapidly develop new functionality while preserving separation of concerns and keeping the code maintainable.
Error Detection & Support
Immediate feedback from our applications, infrastructure, and users is necessary to succeed when quickly developing and shipping a highly-available enterprise product. At Troops, we've invested in a few tools to help us understand the running system:
Alerts are essential: if your process relies on humans, it will fail! We're using AWS CloudWatch, Rollbar and New Relic to alert the engineering team of high resource usage, unresponsive systems and backend errors, but we've also built an application-specific alerting framework for the customer-facing (non-engineering) team. These alerts allow our support team to reach out proactively and help our customers get back online.
Alerts can’t always tell you that users are having problems - and many users will give up rather than sending an email to the support team. Drift is an essential tool that allows our users to reach out with feedback and questions from within the application. The Drift-Slack integration is great - everyone on the team can see user messages and collaborate on a response or delegate support duty.
Rather than grepping logs and running ad-hoc SQL queries, we've built a few Operator Tools that allow us to easily diagnose and remediate problems. The most useful is "Assume Account": when customers contact us and describe a problem, we "assume their account" and view our product from their perspective. Because "Assume Account" is a core construct in the user object, we can disallow writes, mask sensitive information and record an accurate audit trail.
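To make the idea concrete, here is a minimal Python sketch of an assumed-account session that blocks writes, masks a sensitive field, and keeps an audit trail. Everything in it is hypothetical (the `Session` type, `mask_email`, the contact fields); it illustrates the pattern, not the Troops user object.

```python
from dataclasses import dataclass, field
from typing import Optional


class ReadOnlySessionError(Exception):
    """Raised when an assumed (support) session attempts a write."""


@dataclass
class Session:
    account_id: str                     # account whose data is being viewed
    operator_id: Optional[str] = None   # set only when support assumes the account
    audit_log: list = field(default_factory=list)

    @property
    def assumed(self):
        return self.operator_id is not None

    def record(self, action):
        actor = self.operator_id or self.account_id
        self.audit_log.append(
            {"account": self.account_id, "actor": actor, "action": action}
        )


def mask_email(value):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def view_contact(session, contact):
    session.record("view_contact")
    if session.assumed:
        return {**contact, "email": mask_email(contact["email"])}
    return dict(contact)


def update_contact(session, contact, **changes):
    if session.assumed:
        session.record("denied_update_contact")
        raise ReadOnlySessionError("assumed sessions are read-only")
    session.record("update_contact")
    contact.update(changes)
    return contact
```

Putting the check on the session itself, rather than in each tool, means every code path that receives the session inherits the same restrictions.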
When you need to explain an alert or understand complex behavior, detailed logs quickly become invaluable. We make this simple by adding consistent logging to our application code. All of our programs log to the console and to a logstash appender. Supervisord handles log file rotation on each server, and Logmatic aggregates and indexes all the information in the cloud for easy lookup, auditing and alerting.
We've followed HootSuite's lead and wrapped our Scala ExecutionContext to automatically propagate meta-information about our state (RequestId, AccountId, etc.) through our call stack and automatically include it in any log message.
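The same idea can be shown outside Scala. The sketch below is a Python analog, not our ExecutionContext wrapper: it uses the standard `contextvars` and `logging` modules to carry a request ID and account ID alongside the running code and stamp them on every log line. The names `request_context`, `bind_context`, and `build_logger` are made up for this example.

```python
import contextvars
import logging

# Holds metadata for whatever request is currently being processed.
request_context = contextvars.ContextVar("request_context", default={})


class ContextFilter(logging.Filter):
    """Copy the current request metadata onto every log record."""

    def filter(self, record):
        ctx = request_context.get()
        record.request_id = ctx.get("request_id", "-")
        record.account_id = ctx.get("account_id", "-")
        return True


def bind_context(**fields):
    """Merge fields into the current context; returns a token for reset()."""
    return request_context.set({**request_context.get(), **fields})


def build_logger(name, handler):
    """Attach a handler whose output always includes the request metadata."""
    handler.addFilter(ContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s request=%(request_id)s account=%(account_id)s %(message)s"
    ))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Call sites then log as usual, for example `log.info("sync started")`, and the identifiers appear without being passed through every function signature.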
Early on, we invested in configuring and automating our infrastructure because we knew product velocity depended on a solid engineering foundation and an easy deployment process.
We use AWS CloudFormation to describe our entire environment — CloudFront Distribution, VPC, Auto Scaling Groups, Route53 Records etc. — and can spin up a new environment of any size in twenty minutes. Rather than writing the JSON ourselves we use troposphere to describe the parameters/resources and generate the template.
In addition to the AWS resources, we also use CloudFormation to describe all the services that make up our backend. Each follows the "Troops App" convention — which outlines instances that will run the service, where to place the artifacts on the filesystem, how to pull config files, how to keep the processes running, etc. Adding a new "Troops App" service is as easy as adding a couple of lines to our troposphere scripts.
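To show why generating the template from code is convenient, here is a small Python sketch that builds a CloudFormation template as plain dictionaries and serializes it to JSON. It deliberately uses only the standard library rather than troposphere, and the logical IDs, CIDR ranges, and helper names are illustrative assumptions, not our actual scripts.

```python
import json


def subnet(vpc_logical_id, cidr, public):
    """Describe one subnet that belongs to the given VPC resource."""
    return {
        "Type": "AWS::EC2::Subnet",
        "Properties": {
            "VpcId": {"Ref": vpc_logical_id},
            "CidrBlock": cidr,
            "MapPublicIpOnLaunch": public,
        },
    }


def build_template(env_name, vpc_cidr, public_cidr, private_cidr):
    """Return a CloudFormation template for a VPC with two subnets."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"{env_name} environment (illustrative sketch)",
        "Resources": {
            "Vpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            "PublicSubnet": subnet("Vpc", public_cidr, public=True),
            "PrivateSubnet": subnet("Vpc", private_cidr, public=False),
        },
    }


def render(template):
    """Serialize the template to the JSON that CloudFormation accepts."""
    return json.dumps(template, indent=2, sort_keys=True)
```

For example, `render(build_template("qa", "10.0.0.0/16", "10.0.0.0/24", "10.0.1.0/24"))` produces a template for a QA environment. Because environments are just function arguments, a second environment is one more call rather than a hand-edited copy of the JSON.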
As the number of backend services grows we are planning on moving from "Troops Apps" to Docker containers.
From the beginning, releases to our QA environment have been automatic and releases to production have been "push-button". The develop branch and any branches associated with a Jira issue are automatically built by CircleCI, and deployable artifacts are created. If we are building from the develop branch, CircleCI automatically starts an AWS CodeDeploy deployment to our QA environment. When master is built, we create the CodeDeploy revision but stop before actually deploying: an engineer needs to affirmatively "press the button" to continue. The best test was asking our summer interns to deploy to production, and it went off without a hitch.
CodeDeploy struck a nice balance between simple scripted SCP/SSH deployments and more complicated tools like Puppet/Chef/Ansible. In a few hours, we had an automated, scalable deployment infrastructure without any extra software to manage.
We've created some small AWS Lambda functions that trigger when a CodeDeploy starts, succeeds or fails and push alerts into our #DevOps Slack Channel — everyone in the company can watch deployments go out!
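A function of this kind can be quite small. The Python sketch below assumes the deployment notifications arrive through SNS and that each message is a JSON document with `deploymentId`, `applicationName`, `deploymentGroupName`, and `status` fields; those field names, the `SLACK_WEBHOOK_URL` environment variable, and the helper names are assumptions for illustration rather than a copy of our functions.

```python
import json
import os
import urllib.request


def format_message(notification):
    """Turn one deployment notification (a dict) into a one-line message."""
    return (
        "Deployment {deploymentId} of {applicationName} "
        "to {deploymentGroupName}: {status}"
    ).format(**notification)


def post_to_slack(text):
    """POST the text to a Slack incoming webhook (URL from a hypothetical env var)."""
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


def handle(event, post):
    """Format and send one message per SNS record in the Lambda event."""
    for record in event["Records"]:
        notification = json.loads(record["Sns"]["Message"])
        post(format_message(notification))


def lambda_handler(event, context):
    """Lambda entry point: forward every notification to Slack."""
    handle(event, post_to_slack)
```

Keeping `handle` separate from the network call makes the formatting logic easy to test locally by passing in a list's `append` method instead of `post_to_slack`.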
We take security seriously at Troops - it starts by securing our API and making sure our network, instances and databases are isolated from the public internet.
Our API is only available over SSL but it’s only as secure as the code behind it - we’ve built some internal libraries that make it easy to check user permissions. Our goal is to ensure that unsafe code feels unnatural to write and is easy for reviewers to spot!
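One way to make unsafe code feel unnatural is to refuse to register any endpoint that has not declared the permission it needs. The Python sketch below illustrates that pattern only; the `requires` decorator, the `route` registry, and the permission strings are hypothetical and are not our internal libraries.

```python
import functools


class Forbidden(Exception):
    """Raised when the calling user lacks the required permission."""


def requires(permission):
    """Decorator that checks `permission` before running the handler."""
    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(user, *args, **kwargs):
            if permission not in user.get("permissions", ()):
                raise Forbidden(f"{user.get('id')} lacks {permission}")
            return handler(user, *args, **kwargs)
        # Marker that the registry (and reviewers) can look for.
        wrapper.required_permission = permission
        return wrapper
    return decorate


ROUTES = {}


def route(path, handler):
    """Register a handler, rejecting any that skipped the permission check."""
    if not hasattr(handler, "required_permission"):
        raise TypeError(f"handler for {path} must declare a permission")
    ROUTES[path] = handler
```

With this shape, a handler without `@requires(...)` fails at startup instead of shipping, and the required permission sits directly above the function where a reviewer will see it.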
Our network is separated into public and private subnets. The public subnet hosts the API endpoint, a VPN server and a NAT instance. All the backend services are in the private subnet and communicate with the internet through the NAT. Any access to our environment is through the VPN. Once inside, engineers can view internal reporting tools, SSH into individual hosts or connect to the database.
Our solution is out in the wild, processing millions of Salesforce records every month. Although we are still early in our journey toward building the ultimate work assistant, our early customers are already excited. As more of them embrace the future of work, we will continue to face new and interesting challenges.
Since the area of bots and AI is so new, a lot of our work revolves around solutions never seen before, especially when they have to work at scale. Currently, we are working on not only reading and delivering data on demand, but also listening to user input, interpreting it and storing it in a structured format for future recall. This is why we are looking for fun and awesome people to join our team. If you are comfortable with the unknown, think creatively about difficult problems and are ready to work on all parts of the stack, drop us a line. We want to hear from you!