Transforming the Management of Application Configurations & Secrets at 24 Hour Fitness

HashiCorp
Powering the software-managed datacenter. Maker of Vagrant, Packer, Terraform, Consul, Serf, and Vault

At 24 Hour Fitness, for many years, operations and development teams have gone through the pain of trying to manage and deploy application configurations with data stored in many files and locations across the ecosystem. The DevOps team was tasked with architecting and implementing a simple, reliable, highly available, testable solution to meet the growing needs of their applications. Through the combined use of Consul and Vault, they have successfully transformed the business.

In this talk, 24 Hour Fitness Senior DevOps Engineer Jason Yoe will describe the challenges faced, the overall design, and the implementation of the solution.

Transcript

It was a Wednesday afternoon before Thanksgiving, and I was finishing a few things up before I was heading out, when the team received an email talking about an intermittent problem that was happening with one of our sales applications.

A user, when selecting an option in our web application, was receiving an error. Not everybody would receive an error when they'd select the option, and as a matter of fact, when this user clicked the option again, they were successful. From the email, it was uncertain when this problem had begun and how many sales had been affected.

Flash forward an hour later, the issue had been escalated. The DevOps team is on a WebEx call with many agitated people. Agitated managers are in our cubicle aisle; they're looking for answers.

It's understandable. It's the day before Thanksgiving, it's a production issue, and it's our sales application. So the team's frantically looking through logs, we're checking deployments that had happened the previous days, any sign of what could have happened. Finally, we stumble upon an email about a configuration change that had happened the previous week.

This change was made directly to an application instance, but unfortunately those instances were not bounced at the time of the change. When a code deploy occurred a few days later, the applications were restarted and the change was applied.

Unfortunately, the configuration change was missed on one of the application instances, hence the intermittent issue. Of course, this change had been applied in other environments and had been tested, but unfortunately the person applying this change in production had made a few small mistakes. So, we had to transform our configuration management.

Transformation is continuous improvement

How do we accomplish this transformation? Well, transformation is really all about continuous improvement. We all envision this glorious end state, where all of our application instances are containerized and we have systems that detect and fix errors while we go get more coffee.

But the reality is nowhere near that. We never really reach that nirvana where there are no challenges, there are no issues. There are always going to be challenges. The goal is to face better challenges.

When I was 15 years old, I had to walk everywhere. I had to walk to school, I had to walk to work, I had to walk to my friends' houses. When I improved my situation and saved up and bought a car, I no longer had to walk anywhere. I still had to pay for gas, I had to pay for car insurance, and I had to get my oil changed.

But these challenges were definitely better than having to walk everywhere.

Today I'm going to discuss the steps we use for continuous improvement. I will also dive into the 24 Hour Fitness case study, where we use Consul and Vault as an integral part of the solution, and then I'll go over some of the new, better challenges in our continuous improvement journey.

My name is Jason Yoe. I'm a senior DevOps engineer at 24 Hour Fitness. 24 Hour Fitness is the second-largest fitness chain in the world. In my previous roles, I worked at Cisco as a cloud engineer, and then as a technical architect at AT&T, so I have about 20 years' experience in the industry.

The continuous improvement process

The 4 steps in the continuous improvement process are:

  • Identify the challenges
  • Find the value
  • Define the path
  • Walk the path

The first step is to identify the challenges, and this step is really about communication. We had many sessions with development teams, operation teams. It's really about gathering all the positives, the negatives, issues, roles, responsibilities, all the processes that currently were in place for configuration and secrets control.

And as you will see in our case study, we found some common challenges that came from these sessions.

The second step is to find the value. The idea is to make the simple changes that provide the greatest benefit. We don't have to solve for everything. Because in continuous improvement, we will eventually face those other challenges, and we'll overcome them. It's really about identifying those key changes that bring the most value right now.

Also remember that value changes are definitely momentum-builders, especially with executive leadership, and they help us along that journey of continuous improvement.

Once we find the value challenges, we end up defining the path, and this is really about building the requirements, it's about architecting the solution, it's about defining those processes, those business processes that other teams need to follow.

It's also defining the tasks that need to be executed. This step also includes the research that goes into it, the technology, the processes. The key point is to look at the successes that are inside your organization and outside your organization.

The next step is walk the path. This is obviously merely the execution of the tasks, it's implementing the technology, and it's socializing the business processes, so it's making sure everyone in the organization's on the same page, that we're all moving in the same direction.

And remember, the improvement journey doesn't end with this step; it continues back again.

The case study

Let us dive into the case study at 24 Hour Fitness. We held sessions across the organization with development teams and our operations teams, and we wound up with a pretty extensive list.

There were a lot of issues that people brought up, but we identified the 4 challenges whose resolution would provide the most benefit for our company:

  • Eliminate configuration sprawl
  • Eliminate secrets sprawl
  • Define a consistent lifecycle process for configuration management
  • Define a consistent lifecycle process for secrets management

The first challenge was to eliminate configuration sprawl. In our environment, we had configurations in many different files, in many different directories, and it wasn't consistent across applications. Different locations and different applications.

We also had different configuration management tools. Some configurations were managed by Chef, some were managed by SVN, some were stored locally on the instance and were managed there.

Our second challenge was to eliminate secrets sprawl. We had secrets that were stored in multiple files and multiple locations, and it wasn't consistent across applications.

One of the big problems was changing passwords. To stay compliant, we have to change our passwords every so often, particularly our database passwords, and our database passwords were stored on local application instances, albeit encrypted. But this process really was tough at 3:00 in the morning on a Sunday when an application wouldn't start because a password change had been missed.

We also had to define a consistent lifecycle process for both configurations and secrets. In talking to different teams, we had multiple processes for how to create, update, store, delete these configurations and secrets.

A lot of it had to do with the fact that each team had multiple tools to manage this, but there was also no clear role definition or responsibility definition for the teams.

What was needed in the solution

Once we identified the value challenges, we were able to define the path. So we built our requirements, we did our research, we architected a solution with specific tasks to be executed. I've highlighted some of the high-level directives for the solution:

  • Implement a single source for configurations using the Consul KV store
  • Implement a single source for secrets using Vault
  • All applications reference the single source for configurations and secrets
  • Applications can receive runtime configuration changes
  • Implement consistent configuration and secrets data structure and naming
  • Consul agent running in client mode on all application instances
  • Secure access to Consul by implementing ACLs
    • Only admin team has access to UI
    • Apps have own token and policy to access specific key prefixes
  • Secure access to Vault
    • Only admin team has access to UI
    • Apps have own token and policy to access specific secrets

The first one is to implement a single source for configurations using the Consul KV store. The next one is to implement a single source for secrets using Vault. What these both resolve is that issue of configuration sprawl, with configurations in multiple files, multiple locations.

When somebody needed to find out where a configuration was, there was no documentation, and we had to hunt through directories to find it. Now we're implementing a single source of truth, where we can go and make changes and manage these configurations.

To go along with that, all of our applications need to be able to access that single source of truth, so if you make a configuration change, all of the instances get that change, instead of having to go from instance to instance and validate that they've got that change.

Another requirement was these applications need to receive runtime configuration changes. In today's world, we want to be a 24-hour shop. Our applications always need to be up for our customers, especially at 24 Hour Fitness. We have people working out at the gym at 2:00 in the morning.

And so we want our applications available to everybody, and the idea is, if we're making configuration changes, we need these applications to take them hot; we don't have time to restart our applications for that change to take effect.

We also need to implement a consistent configuration and secrets data structure and naming. Different teams had different naming standards for their keys and their values and their configurations.

We also wanted to build a structure of, when we talk about a particular value, where is that value found? So that everybody understands and speaks the same language.

To implement all this, we need a Consul agent running on all our application instances, so that they can access their data. We also want to secure access to Consul by implementing ACLs (Access Control Lists).

With the Consul UI, anybody has access to change, modify, add, delete any kind of keys. This creates a problem. We don't have an audit trail, we don't know when it happened.

The idea is to lock the UI down and provide another management tool for developers to create configurations, delete configurations, add configurations.

Applications will access their data using a token and policy procedure.
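As a rough illustration of the "token and policy" idea, the sketch below renders a read-only Consul ACL policy for a set of key prefixes. The `key_prefix` rule syntax is the one used by the Consul ACL system; the function name and the example prefix are assumptions made for this sketch, not the team's actual policy.

```python
def read_only_policy(prefixes):
    """Render Consul ACL rules granting read access to each key prefix."""
    blocks = [
        f'key_prefix "{prefix}" {{\n  policy = "read"\n}}'
        for prefix in prefixes
    ]
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical prefix for one application; a token attached to this
    # policy could read keys under the prefix and nothing else.
    print(read_only_policy(["env/dev/app1/"]))
```

A token created against a policy like this would let an application read its own keys while the UI stays restricted to the admin team.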

We also want to secure access to Vault. We want to secure that access to the UI. We don't want everybody to come in and add secrets. We want a single admin team to be able to manage that, and then applications are going to use tokens and policies to access their data.

Being a 24-hour shop, we need our applications up all the time, and a key to that is to always have a functional path between our application instances and the Consul and Vault servers.

If an application needs to get a configuration change at 2:00 in the morning, we need to have that path available. During maintenance windows, we still need a path available from the instances to our Vault or Consul servers.

We also need to implement a consistent process for managing these configurations and secrets across all the teams, so everybody understands and knows this is how you accomplish this task.

Along those lines, we wanted to create a single point of management, one that had a historical view of configuration changes, so we know who changed what and when they changed it.

We needed the ability to roll back changes, and perform some sort of code review of the configurations before implementation.

And lastly, for our production environment, we wanted to insert some approval process, so that we get approval before we roll changes out to production.

The first iteration

Our initial implementation started in about November of 2018. That month we spent a week or 2 to implement the Consul and Vault architecture, then we proceeded on building the processes for configuration and secrets management.

Then we spent time socializing those processes across the organization. That took a little while.

Since then it's been onboarding the applications into our process and into our architecture. And of course the team continues to go through the continuous improvement process of identifying the challenges, finding the value, defining the path, and walking the path.

This slide shows an overview of our Consul architecture. Pretty simple.

We used the open-source version of Consul, version 1.4.2. We have 2 datacenters and a single Consul cluster that spans both those datacenters, so we have Consul servers in both Datacenter 1 and 2.

We have a global entry point that's load-balanced between the 2 datacenters, so any communication coming in can either go to Datacenter 1 or Datacenter 2, and hit Consul servers in either datacenter. So if we're doing maintenance work in Datacenter 1, or there are network issues, we still have that single path from application instances to Consul.

The same architecture applies with Vault. We use the open-source version, 0.11.2. Again, 2 datacenters. We have a single Consul cluster as the backend; it serves as the database for Vault.

We have Vault servers in Datacenter 1 and Vault servers in Datacenter 2, running in active/standby mode, and a single point of entry globally that's load-balanced between both datacenters so that we always have that single point of access to Vault, so applications can get their secrets.

Defining a naming convention

Once we built out the architecture, we had to define a consistent configuration-structure naming convention. This is the language that we communicate to all the teams: "When you're going to create key-values, your application's going to use these key-values. What paths do I access these on?"

We came up with a standard that was consistent across all the teams. In Consul, the KV store endpoints are referenced by a keypath, and for configurations across applications, we decided to use a keypath that starts with /default. These keys are referenced by any application, so any keys along this path, any application has access to them.

For configurations for specific applications, the keypath would start with /env, an environment name, and the application name. This would specify a path for an application for them to get their key-values.

For example, for app1, its keypath would start with /env/dev/app1, and that would be different in a different environment. For QA, it would be /env/qa/app1. So not only do we segment this by application, but by environment as well.

For host-specific configurations and those instances where a specific host needs something a little bit different from the application itself, we start with /host, and then the hostname and the key.

This is just a consistent naming convention that we wanted to put out there across all of our teams in the organization so we were speaking the same language. For our key names, it's the words separated by a dash, so the key name would be application-url.
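The three keypath shapes described here can be captured in a small helper. This is a minimal sketch, not the team's tooling: the function names are made up for illustration, the leading slash is dropped on the assumption that keys are stored without one, and restricting key names to lowercase is also an assumption.

```python
import re

# Key names are words separated by dashes, e.g. "application-url".
KEY_NAME = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def _check(key):
    if not KEY_NAME.fullmatch(key):
        raise ValueError(f"key name must be dash-separated words: {key!r}")
    return key


def default_key(key):
    """Configuration shared by every application."""
    return f"default/{_check(key)}"


def app_key(env, app, key):
    """Configuration for one application in one environment."""
    return f"env/{env}/{app}/{_check(key)}"


def host_key(hostname, key):
    """Override for a single host."""
    return f"host/{hostname}/{_check(key)}"


if __name__ == "__main__":
    print(app_key("dev", "app1", "application-url"))
```

Centralizing the convention in one place like this is one way to keep teams from drifting back to ad hoc key names.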

We went ahead and defined a similar convention for Vault. In Vault, our secrets engines simply store application secrets; we don't use it for anything else today. Each application is tied to a Vault secret.

The way we structured it, we have 4 secrets engines, 1 for each environment: /apps/dev, /apps/qa, /apps/staging, and /apps/prod. Within those secrets engines we have the secrets, which are tied to each application. For example, we have a secret for app1 and a secret for app2, and their values are stored as key-values within each secret.

The naming convention for key names is words separated by a dot. This is different from the configuration value names. For example, it's spring.datasource.properties.user. What this allows us to do is, when we have applications that are trying to get values from Vault or Consul, we can specify the actual keypaths that will locate the values that they need.
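The Vault side of the convention can be sketched the same way. The helper names below are hypothetical, and the sketch assumes the four secrets engines are mounted at apps/dev, apps/qa, apps/staging, and apps/prod, with one secret per application underneath.

```python
import re

ENVIRONMENTS = ("dev", "qa", "staging", "prod")

# Secret key names are words separated by dots,
# e.g. "spring.datasource.properties.user".
DOTTED_KEY = re.compile(r"[A-Za-z0-9]+(\.[A-Za-z0-9]+)*")


def secret_path(env, app):
    """Path of an application's secret: <secrets engine>/<application>."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"apps/{env}/{app}"


def is_valid_secret_key(key):
    """True when the key follows the dot-separated naming convention."""
    return DOTTED_KEY.fullmatch(key) is not None


if __name__ == "__main__":
    print(secret_path("dev", "app1"))
```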

How applications get their configurations

This slide shows the configuration load at application startup. Our application instance starts up, and it's going to get 3 values. We have Java applications, so I'm showing you the Java opts.

We'll talk a little bit more about these 3 values that the application needs. Basically it's, How am I going to get my values from Consul, and what's going to happen if I need to update my configurations?

Once the application gets those values, it starts up, and it's going to contact the local Consul agent that's sitting on the application instance, and it's going to be able to pull its values. We're going to use these keypaths or key prefixes that then locate the specific configuration information for the application instance.
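To make the startup load concrete, here is a minimal sketch of pulling every key under a prefix from the local agent. It uses the Consul KV HTTP API (`GET /v1/kv/<prefix>?recurse`, which returns base64-encoded values) and the `X-Consul-Token` header. The function names are illustrative, and the talk's applications are Java, so this Python version only shows the shape of the exchange.

```python
import base64
import json
import urllib.request


def parse_kv_response(body, prefix):
    """Turn the JSON list returned by a recursive KV read into {name: value}."""
    prefix = prefix.strip("/") + "/"
    config = {}
    for entry in json.loads(body):
        key = entry["Key"]
        # Skip the folder entry itself and anything outside the prefix.
        if key == prefix or not key.startswith(prefix):
            continue
        raw = entry["Value"]  # base64 text, or null for an empty value
        value = base64.b64decode(raw).decode("utf-8") if raw else ""
        config[key[len(prefix):]] = value
    return config


def load_config(prefix, token, agent="http://localhost:8500"):
    """Read all keys under `prefix` through the local Consul agent."""
    url = f"{agent}/v1/kv/{prefix.strip('/')}/?recurse=true"
    request = urllib.request.Request(url, headers={"X-Consul-Token": token})
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_kv_response(response.read().decode("utf-8"), prefix)
```

Calling `load_config("env/dev/app1", token)` against a running agent would return a plain dictionary such as `{"application-url": ...}`, depending on what is stored.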

We want our application instances to get changes hot, so the bottom half of this slide shows the process for the configuration reload, when the application's running.

The way we implemented it was to use watches and handlers. A watch detects if there's a change along a keypath. These keypaths in our architecture become really important, because they identify what we're looking for and where our values are, so that the applications know, so that Consul knows, and so that our teams know.

This watch, when it detects a change—a create, an update, a delete along a specific keypath—it initiates a handler. This handler is simply a script. It's going to pass a signal to our application, our application's going to listen on a specific port, the handler is going to send this signal to that port, and it's going to tell the application, "Reload your configuration data."
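The handler-to-application handshake might look something like the sketch below. It is a stand-in under stated assumptions: the real applications are Java and the real handler is a script, and the one-line "reload" message and the class and function names here are invented for illustration.

```python
import socket
import socketserver
import threading


class ReloadListener:
    """Application side: listen on a port and call `reload` on each signal."""

    def __init__(self, reload, host="127.0.0.1", port=0):
        class Handler(socketserver.StreamRequestHandler):
            def handle(self):
                if self.rfile.readline().strip() == b"reload":
                    reload()  # re-read configuration from Consul
                    self.wfile.write(b"ok\n")

        self.server = socketserver.TCPServer((host, port), Handler)
        self.port = self.server.server_address[1]
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def close(self):
        self.server.shutdown()
        self.server.server_close()


def send_reload(port, host="127.0.0.1", timeout=5.0):
    """Handler side: what the watch's script does when a key changes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"reload\n")
        return conn.recv(16).strip() == b"ok"
```

Keeping the handler this small means all the real work, re-reading the keys, stays inside the application.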

This slide shows the 3 properties that I talked about that the application needs to pull at startup:

  • Server list
  • Key prefix
  • Listener port

The first one is the server list. This is just the connection that it looks for to find the data. Because we talk to the local Consul agents that are hosted on the application instance, we specify the localhost on port 8500. This could be the VIP for the Consul cluster.

We also specify a key prefix. This goes back to that idea of this standard path for the applications, of how they're going to pull their values. For each application instance, we can list a set of paths, of where they're going to look to get their data.

In this case the key prefix is /env/dev/app1. This would be the instance of application 1 in development, and it's going to look on this path to pull its key-values.

The last piece is the listener port. This is the port that the application's going to listen on when our handler sends a signal to it to reload its configuration data.

More on watches and handlers

A quick dive into watches and handlers for those that aren't familiar. Watches are a way of specifying a view of data which is monitored for updates.

In this case the watcher's looking on /apps/env/dev/app1, and if a key's been created or changed, it's going to initiate the handler, which then sends a signal to a port that the application's listening on to tell it to reload its configuration.

Here's the config.json, which is a configuration file for the Consul agent locally on the application instance, and here we specify the watch. We can see that the path name's been specified, so when any key-value's been changed on this path, the handler script gets executed.
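Since the slide itself isn't reproduced in the transcript, here is a hedged sketch that generates a watch definition of the kind described. The `keyprefix` watch type and the `handler_type`/`args` fields come from Consul's watch configuration; the prefix and script path are hypothetical examples, not the values from the talk.

```python
import json


def watch_config(prefix, handler_script):
    """Build an agent config fragment that runs a script when keys change."""
    return {
        "watches": [
            {
                "type": "keyprefix",
                "prefix": prefix.strip("/") + "/",
                "handler_type": "script",
                "args": [handler_script],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical values; the output could be merged into config.json.
    print(json.dumps(watch_config("env/dev/app1", "/opt/app1/reload.sh"), indent=2))
```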

How applications get their secrets

This is a similar process. The application starts up, it's going to get a few properties to know how to contact Vault, and get its information that it needs.

Once it starts up, it contacts the Vault URL to try to load its secrets. The way we manage that is token and policy and the keypath. We're giving it a path, and it's looking for specific secrets along that path, and we can lock that down with policies and tokens.

Taking a look at the properties, the first one is the Vault URI. This is how the application instance is going to connect to the Vault cluster to get its data.

The Vault path is simply that path to the data or the secrets that the application needs. And the Vault token is how we secure that data. So this application has a specific token to a specific path, and no other applications can access this data.

This slide shows a deeper dive into it. Each application has an associated secret, as we saw before, and the secrets engines are apps/dev, apps/staging, apps/qa. The secrets underneath are associated to an application. And each application has an ACL policy, which then says, "Who has access to this secret?"

There's a token associated with that. When the application calls Vault, it uses that token, which is then validated by the policy to say, "Yes, you have access to the data along this path."
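A minimal sketch of that exchange follows. It assumes a version 1 key-value secrets engine, where a read of `/v1/<engine>/<secret>` returns the values under a top-level `data` field; a version 2 engine uses a different path and nesting. The function names and sample values are illustrative only.

```python
import json
import urllib.request


def read_only_policy(secret_path):
    """Render a Vault policy that lets a token read exactly one secret."""
    return f'path "{secret_path}" {{\n  capabilities = ["read"]\n}}\n'


def parse_secret(body):
    """Extract the key-values from a KV (version 1) read response."""
    return json.loads(body)["data"]


def read_secret(vault_uri, secret_path, token):
    """Fetch an application's secret using its own token."""
    url = f"{vault_uri.rstrip('/')}/v1/{secret_path}"
    request = urllib.request.Request(url, headers={"X-Vault-Token": token})
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_secret(response.read().decode("utf-8"))
```

With a policy like this attached to app1's token, a read of `apps/dev/app1` succeeds while a read of another application's secret is denied.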

Updating, creating, and deleting configurations and secrets

For the new process, we wound up choosing GitLab for source control. It does branching really well, it allows a semblance of code review through merge requests, it has an audit history so we know who changed configurations and when they changed it, and it does rollback, so we can roll back the changes.

The big piece, though, in our organization and in many organizations, is there wasn't a high learning curve. Because developers already use this for source control for the code that they write.

And we chose Jenkins as the orchestrator to run the scripts that will call the Consul APIs to implement these changes. It integrates with GitLab, and the teams that would manage this are already familiar with Jenkins and the processes.

Let's take a look at the process of what we socialize to the teams about how you're going to add and update and delete these configurations.

A developer's going to come in and log into GitLab. We have a properties repository, and the developer's going to update a specific property file, which we segment by environment, for now at least.

The developer pushes the branch to origin. The developer then submits a merge request, which allows a peer review or a team review to come in and check and validate the configuration. The merge is committed, webhook triggers a Jenkins job, and control's passed over to Jenkins.

The Jenkins job calls a script, and this script creates a list of all the changes that have just occurred: any additions, any modifications, any deletions, any actions that are going to be taken against the Consul database.

Once it generates that list, a second script calls the Consul APIs, and it's going to perform those actions. Regardless of how many additions, modifications, or deletions we have, the second script is going to use the Consul APIs to make those modifications.

Then control's handed over to Consul, where the APIs actually do the work.
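The two scripts can be sketched roughly as follows. This is not the team's code: it assumes a simple `key=value` property file format with full keypaths as keys, and the function names are hypothetical. The Consul calls it prepares are the standard KV endpoints, `PUT /v1/kv/<key>` and `DELETE /v1/kv/<key>`.

```python
import urllib.request


def parse_properties(text):
    """Parse 'key=value' lines, ignoring blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def plan_changes(old, new):
    """First script: list additions, modifications, and deletions."""
    actions = []
    for key in sorted(new):
        if old.get(key) != new[key] or key not in old:
            actions.append(("PUT", key, new[key]))
    for key in sorted(old.keys() - new.keys()):
        actions.append(("DELETE", key, None))
    return actions


def to_request(action, token="", consul="http://localhost:8500"):
    """Second script: turn one planned action into a Consul KV API call."""
    verb, key, value = action
    data = value.encode("utf-8") if verb == "PUT" else None
    return urllib.request.Request(
        f"{consul}/v1/kv/{key}",
        data=data,
        method=verb,
        headers={"X-Consul-Token": token},
    )
```

Separating the plan from the apply step makes it possible to log or review the exact list of actions before anything touches Consul.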

The only difference in this process between environments is that, in our production environment, we want some sort of approval gate before these configurations are passed on. So an approval process goes through, people make the approvals, and then the merge happens and it kicks off the stuff to the Jenkins job.

This slide shows an example of our property file that houses the key-values for the application. This is in a GitLab repository. Each one of these property files contains the key and value. Each environment has its own property file. Here's an example of the consul-config-dev.txt file.

As you can see, we have keys and values for app1 in the dev environment. The slide only shows 1 application, but our real file would also include app2, app3, app4, and so on.

The process for managing secrets

This is a little bit different. It starts with a requester sending encrypted emails with some information: the secrets engine, the secret, and a key. The key's going to be something like a username or a client ID.

The great thing about the structure and naming convention that we implemented is, our secrets engine is the environment name, right? And the secret itself is the application name. So the developer doesn't have to figure out these random names for secrets engines and secrets.

They simply go, "I want to update app1 in dev, so..." There you go: secrets engine is dev, the secret is app1, and then they can provide the key. They send a second encrypted email containing the value, which in this case could be a password or client secret, and these emails go to the Vault admin team. Vault admin team gathers the data, they log into Vault, and they update and make the changes.

Improving the processes

As part of our continuous improvement journey, we constantly cycle through these steps that we've been going over. Here's a snippet of some of the new challenges that we face:

  • Issues with different Consul agents in different VLANs communicating with each other for health status (network segmentation)
  • Dynamic password generation for applications and database
  • Devise a better process for secrets administration
  • Split property files into individual app files instead of environment

One is issues with Consul agents in different VLANs communicating with each other for health status. Obviously, we have application instances in multiple VLANs, and a lot of them are protected by firewalls. We have some application instances in the DMZ. The problem becomes when the Consul agents are communicating with each other for health status.

We have all these instances with Consul agents, they're trying to talk to each other for health status, and they're being blocked by firewalls. We could open up the firewall ports, which we did initially, but one of the problems we saw with this is for our Vault cluster.

On the Vault servers, we have a Consul agent, because the Consul database is the backend for Vault. What we saw is, when applications in the DMZ couldn't communicate with the Consul agents on the Vault server, they were reporting them as unhealthy. And Vault was going through a process of electing a new active server.

Normally it's not a problem, except when the applications start up, they need to grab their secrets. So they go to the Vault cluster, they try to grab their secret, there's no active server at the moment, they can't start up, they can't get their secrets, applications fail.

Obviously, one of the resolutions is to implement some sort of network segmentation to separate agents talking to other agents, create separate network segmentations in different VLANs so only those agents can talk to each other, and not across.

Another challenge is that our password rotation is a manual process. Even though we now house passwords in Vault, the creation of those passwords, the sending of the email, the implementing of those passwords is a manual process.

So we need to move to a more dynamic password generation for our applications and database. And along with that, a better process for secrets administration. Sending encrypted emails is great, but when you're doing this a lot of times, it starts getting confusing. "Wait, this password goes to which other email?" It's a challenge that we have to overcome.

The last thing is with the property files. As we saw, we have our property files segmented by environment. All the key-values for dev are in a single file, and all the key-values for QA are in a single file.

In the beginning this was great, because it was the one spot developers could look, and they can say, "Oh yeah, here are all my values."

But as these configurations proliferated and we brought more applications on board, this file is just untenable; it's very big. One solution is to split these property files out, per application. So app1 would have its own property file that would contain its own key-values. That way we'd use the same process, but now we'd have separate property files per application.
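One way the split could work is sketched below, assuming keys follow the env/<environment>/<application>/<key> shape. The output file names are hypothetical, modeled loosely on consul-config-dev.txt, and shared or host-level keys are simply grouped into a catch-all file.

```python
from collections import defaultdict


def split_by_app(props):
    """Group an environment-wide set of key-values into per-application files."""
    files = defaultdict(dict)
    for key, value in props.items():
        parts = key.strip("/").split("/")
        # Expected shape: env/<environment>/<application>/<key>
        if len(parts) >= 4 and parts[0] == "env":
            name = f"consul-config-{parts[1]}-{parts[2]}.txt"
        else:
            name = "consul-config-shared.txt"
        files[name][key] = value
    return dict(files)
```

The rest of the pipeline would stay the same; only the number of files under review in each merge request changes.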

The value of these processes

Today I talked about 24 Hour Fitness' transformation using Vault and Consul. We identified the challenges, we found the value, we defined the path, and we walked the path. But how did overcoming these challenges bring value to us? That's the key point, right? What value did we get from this?

Let's take a look back at the opening story, where a configuration change was missed on a single application instance. Because we implemented that single source of truth, today that scenario would never exist.

We aren't going to individual instances and making changes. We make a configuration change in Consul through our process, and all the application instances get that immediately. So that scenario doesn't even exist, and everybody's happy, they get to start vacation early and go home.

Of course, solving these challenges also allows us at 24 Hour Fitness to dedicate more time to what's important: working out. We're a fitness company; we're supposed to work out. That is, until the next challenge.

Thank you very much.

HashiCorp
Powering the software-managed datacenter. Maker of Vagrant, Packer, Terraform, Consul, Serf, and Vault
Tools mentioned in article
Open jobs at HashiCorp
Sr. Enablement Engineer
United States