Cultivating your Data Lake

Segment
Segment provides the customer data infrastructure that businesses use to put their customers first. With Segment, companies can collect, unify, and connect their first-party data to over 200 marketing, analytics, and data warehousing tools.

This post is by Lauren Reeder


All too often, we hear that businesses want to do more with their customer data. They want to be data-informed, they want to provide better customer experiences, and—most of all—they just want to understand their customers.

Getting there isn’t easy. Not only do you need to collect and store the data, you also need to identify the useful pieces and act on the insights.

At Segment, we’ve helped thousands of businesses walk the path toward becoming more data-informed. One successful technique we’ve seen time and time again is establishing a working data lake.

A data lake is a centralized repository that holds both structured and unstructured data, allowing you to store massive amounts of data in a flexible, cost-effective storage layer. Data lakes have become increasingly popular both because businesses have more data than ever before, and because it’s never been cheaper or easier to collect and store it all.

In this post, we’ll dive into the different layers to consider when working with a data lake.

  • We’ll start with an object store, such as S3 or Google Cloud Storage, as a cheap and reliable storage layer.
  • Next is the query layer, such as Athena or BigQuery, which will allow you to explore the data in your data lake through a simple SQL interface.
  • A central piece is a metadata store, such as the AWS Glue Catalog, which connects all the metadata (its format, location, etc.) with your tools.
  • Finally, you can take advantage of a transformation layer on top, such as EMR, to run aggregations, write to new tables, or otherwise transform your data.

As heavy users of all of these tools in AWS, we’ll share some examples, tips, and recommendations for customer data in the AWS ecosystem. These same concepts also apply to other clouds and beyond.

Storage Layer: S3

If you take one idea away from this blog post, let it be this: store a raw copy of your data in S3.

It’s cheap, scalable, incredibly reliable, and plays nicely with the other tools in the AWS ecosystem. It’s very likely your entire storage bill for S3 will cost you less than a hundred dollars per month. If we look across our entire customer base, less than 1% of our customers have an S3 bill over $100/month for data collected by Segment.
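To put that in perspective, here’s a quick back-of-envelope sketch of what raw event storage costs. The rate used is the approximate S3 Standard first-tier price of $0.023 per GB-month (an assumption; check current pricing for your region).

```python
# Back-of-envelope S3 storage cost. The $0.023/GB-month figure is the
# approximate S3 Standard first-tier rate (an assumption; check current
# pricing for your region and storage class).
S3_STANDARD_USD_PER_GB_MONTH = 0.023

def monthly_s3_cost_usd(total_gb: float) -> float:
    """Approximate monthly S3 Standard storage cost for total_gb of data."""
    return total_gb * S3_STANDARD_USD_PER_GB_MONTH

# Even a couple of terabytes of raw events stays well under $100/month:
print(monthly_s3_cost_usd(2_000))  # 2 TB -> about $46/month
```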

That said, the simplicity of S3 can be a double-edged sword. While S3 is a great place to keep all of your data, it often requires a lot of work to collect the data, load it, and actually get to the insights you’re looking for.

There are three important factors to keep in mind when collecting and storing data on S3:

  • encoding – data files can be encoded in any number of formats (CSV, JSON, Parquet, ORC), and each one has big performance implications.
  • batch size – file size has important ramifications, both for your uploading strategy (and data freshness) and for your query times.
  • partition scheme – the partition scheme is the ‘hierarchy’ under which data is stored, and the way your data is partitioned or structured can have a big impact on search performance.

Structuring data within your data lake

We’ll discuss each of these in a bit more depth, but it’s worth first understanding how data enters your data lake.

There are a number of ways to get data into S3, such as uploading via the S3 UI or CLI. But if you’re talking customer data, it’s easy to start delivering your data to S3 via the Segment platform. The Segment platform provides the infrastructure to collect, clean, and control your first party customer data and send exactly what you need to all the tools you need it in.

Encoding

The encoding of your files has a significant impact on the performance of your queries and data analysis. For large workloads, you’ll want to use a binary format like Parquet or ORC (we’re beginning to support these natively; if you’d like beta access, please get in touch!).

To understand why, consider what a machine has to do to read JSON vs Parquet.

When looking at JSON, the data looks something like this:

{ "userId": "user-1", "name": "Lauren", "company": "Segment" }
{ "userId": "user-2", "name": "Parsa", "company": "Segment" }
{ "userId": "user-3", "company": "Microsoft", "name": "Satya" }
{ "userId": "user-4", "name": "Elon", "company": "Tesla" }

Here, we must parse not only the whole message, but each key individually, and each value. Because each JSON object might have a different schema (and is totally unordered), we have to do roughly the same work for each row.

Additionally, even if we are just picking out companies, or names, we have to parse all of our data. There’s no ‘shortcut’ where we can jump to the middle of a given row.
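To make the contrast concrete, here is a toy Python sketch (an illustration, not a benchmark): pulling one field out of row-oriented JSON means parsing every record in full, while a columnar layout like Parquet’s hands the reader the column directly.

```python
import json

# Four row-oriented JSON records, like the example above.
json_lines = [
    '{"userId": "user-1", "name": "Lauren", "company": "Segment"}',
    '{"userId": "user-2", "name": "Parsa", "company": "Segment"}',
    '{"userId": "user-3", "company": "Microsoft", "name": "Satya"}',
    '{"userId": "user-4", "name": "Elon", "company": "Tesla"}',
]

# Row-oriented: to pull out just the companies, every record must be
# parsed in full, keys and values alike.
companies_from_json = [json.loads(line)["company"] for line in json_lines]

# Column-oriented (conceptually how Parquet lays data out): each column
# is stored contiguously, so a reader grabs only the column it needs.
columns = {
    "userId": ["user-1", "user-2", "user-3", "user-4"],
    "name": ["Lauren", "Parsa", "Satya", "Elon"],
    "company": ["Segment", "Segment", "Microsoft", "Tesla"],
}
companies_from_columns = columns["company"]  # no per-row parsing at all

assert companies_from_json == companies_from_columns
```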

Contrast that with Parquet, and we see a very different layout. In Parquet, we’ve defined the schema ahead of time, and we end up storing columns of data together. Below is an example of the previous JSON documents transformed into Parquet format. You can see the users are stored together on the right, as they are all in the same column.

See users stored together on the right

A reader doesn’t have to parse out and keep a complicated in-memory representation of the object, nor does it have to read entire lines to pick out one field. Instead it can quickly jump to the section of the files it needs and parse out the relevant columns.

Instead of just taking my word for it, below are a few concrete benchmarks which query both JSON and Parquet.

In each of the four scenarios, we can see big gains from using Parquet.

As you can see, with Parquet the data we need to read in each instance is limited to the relevant columns. With JSON, we need to scan the full body of every event each time.

Batch Size

Batch size, or the amount of data in each file, is tricky to tune. Having too large of a batch means that you will have to re-upload or re-process a lot of data in the case of a hiccup or machine failure. Having a bunch of files which are too small means that your query times will likely be much longer.

Batch size is also tied to encoding, which we discussed above. Certain formats like Parquet and ORC are ‘splittable’, meaning files can be split and re-combined at runtime. JSON and CSV are splittable only under certain conditions (for example, uncompressed and newline-delimited), so in practice they typically cannot be split for faster processing.

Generally, we try to target files with sizes ranging from 256 MB to 1 GB. We find this gives the best overall performance mix.

Partitioning

When you start to have more than 1GB of data in each batch, it’s important to think about how a data set is split, or partitioned. Each partition contains only a subset of the data. This increases performance by reducing the amount of data that must be scanned when querying with a tool like Athena or processing data with EMR. For example, a common way to partition data is by date.
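As a sketch of what date partitioning looks like on S3, the snippet below builds one Hive-style `day=YYYY-MM-DD` prefix per day. The bucket name and path layout are illustrative, not a Segment default.

```python
from datetime import date, timedelta

def daily_partition_prefixes(bucket: str, start: date, days: int) -> list:
    """Build one Hive-style date partition prefix per day (illustrative layout)."""
    return [
        's3://{}/segment-logs/day={}/'.format(
            bucket, (start + timedelta(days=d)).isoformat())
        for d in range(days)
    ]

# A query filtered to a single day then scans one prefix instead of the
# whole dataset:
prefixes = daily_partition_prefixes('your-s3-bucket', date(2019, 5, 27), 3)
# ['s3://your-s3-bucket/segment-logs/day=2019-05-27/',
#  's3://your-s3-bucket/segment-logs/day=2019-05-28/',
#  's3://your-s3-bucket/segment-logs/day=2019-05-29/']
```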

Querying

Finally, it’s worth understanding that just having your data in S3 doesn’t really directly help you do any of the things we talked about at the beginning of the post. It’s like having a hard drive, but no CPU.

There are many ways to examine this data — you could download it all, write some code, or try loading it into some other database.

But the easiest is just to write SQL. That’s where Athena comes in.

Query Layer: Athena 🔎

Once you’ve got your data into S3, the best way to start exploring what you’ve collected is through Athena.

Athena is a query engine managed by AWS that allows you to use SQL to query any data you have in S3, and it works with most of the common file formats for structured data, such as Parquet, JSON, and CSV.

In order to get started with Athena, you just need to provide the location of the data, its format, and the specific pieces you care about. Segment events in particular have a specific format, which we can leverage when creating tables for easier analysis.

Setup

Below is an example to set up a table schema in Athena, which we’ll use to look at how many messages we’ve received by type:

CREATE EXTERNAL TABLE IF NOT EXISTS segment_logs.eventlogs (
  anonymousid   string,                -- pick the columns you care about!
  context       map<string,string>,    -- using a map for nested JSON
  messageid     string,
  `timestamp`   timestamp,             -- `timestamp` is a reserved word, so it needs backticks
  type          string,
  userid        string,
  traits        map<string,string>,
  event         string
)
PARTITIONED BY (sourceid string)               -- partition by the axes you expect to query often; sourceid here identifies each source of data
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://your-s3-bucket/segment-logs'    -- location of your data in S3

In addition to creating the table, you will need to add the specific partitions:

ALTER TABLE eventlogs ADD
    PARTITION (sourceid='source1') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source1/'  -- one partition per source of data
    PARTITION (sourceid='source2') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source2/'
    PARTITION (sourceid='source3') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source3/'
    PARTITION (sourceid='source4') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source4/'

There are many ways to partition your data. Here, we’ve partitioned by source for each customer. This works for us when we’re looking at a specific customer, but if you’re looking across all customers over time, you may want to partition by date instead.

Query time!

Let’s answer a simple question from the table above. Say we want to know how many messages of each type we saw for a given data source in the past day. We can simply run some SQL against the table we just created in Athena to find out:

  select  type, count(messageid)
    from  eventlogs
   where  sourceid = 'source1'
     and  date_trunc('day', "timestamp") = current_date
group by  1
order by  2 desc

For any query, the cost of Athena is driven by how much data is scanned ($5 per TB). That, in turn, is tightly related to how you partition your data and the format you store it in.

When scanning JSON, you will scan the entire record every time due to how it’s structured (see above for an example). Alternatively, you can set up Parquet for a subset of your data containing only the columns you care about, which is great for limiting your table scans and therefore limiting cost. This is also why Parquet can be so much faster: it has direct access to specific columns without scanning the full JSON.
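The arithmetic is simple enough to sketch. The byte counts below are illustrative, and we assume the $5/TB rate while ignoring Athena’s per-query scan minimums.

```python
# Rough Athena cost model: you pay for bytes scanned at ~$5/TB, so the
# less data a query touches, the less it costs. Numbers are illustrative.
USD_PER_TB_SCANNED = 5.0

def athena_query_cost_usd(bytes_scanned: int) -> float:
    """Approximate cost of one Athena query from the bytes it scans."""
    return (bytes_scanned / 10**12) * USD_PER_TB_SCANNED

full_json_scan = athena_query_cost_usd(2 * 10**12)   # scan 2 TB of raw JSON
parquet_columns = athena_query_cost_usd(50 * 10**9)  # scan 50 GB of needed columns
print(full_json_scan, parquet_columns)  # roughly $10 vs $0.25 per query
```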

Metadata: AWS Glue 🗺

Staying current

One challenge with Athena is keeping your tables up to date as you add new data to S3. Athena doesn’t know where your new data is stored, so you need to either update or create new tables, similar to the query above, in order to point Athena in the right direction. Luckily there are tools to help manage your schema and keep the tables up to date.

The AWS Glue Catalog is a central location in which to store and populate table metadata across all your tools in AWS, including Athena. You can populate the catalog either by using the out-of-the-box crawlers to scan your data, or by populating it directly via the Glue API or via Hive. You can see how these all fit together in the diagram below.

Once this is populated with your metadata, Athena and EMR can reference the Glue Catalog for the location, type, and more when querying or otherwise accessing data in S3.

From: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
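To make “reference the Glue Catalog” concrete, here is a sketch of the lookup a query tool effectively performs. The dict mimics a slice of the Glue `GetTable` response shape; in real use you would fetch it with boto3’s `glue.get_table`, and the table values here are hypothetical.

```python
# What Athena/EMR effectively do with the Glue Catalog: resolve a table's
# storage location and serde from its metadata. The dict below mimics a
# slice of the Glue GetTable response shape; in real use you would call
# boto3's glue.get_table(DatabaseName=..., Name=...) instead.
hypothetical_response = {
    'Table': {
        'Name': 'eventlogs',
        'StorageDescriptor': {
            'Location': 's3://your-s3-bucket/segment-logs/',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hive.hcatalog.data.JsonSerDe'
            },
        },
    }
}

def table_location_and_serde(get_table_response: dict) -> tuple:
    """Pull the S3 location and serde out of a GetTable-shaped response."""
    sd = get_table_response['Table']['StorageDescriptor']
    return sd['Location'], sd['SerdeInfo']['SerializationLibrary']

location, serde = table_location_and_serde(hypothetical_response)
# ('s3://your-s3-bucket/segment-logs/', 'org.apache.hive.hcatalog.data.JsonSerDe')
```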

Compute Layer: EMR

Moving beyond one-off queries and exploratory analysis, if you want to modify or transform your data, a tool like EMR (Elastic MapReduce) gives you the power not only to read data, but also to transform it and write the results into new tables. You may need to write if, for example, you want to transform the format of your data from JSON to Parquet, or if you want to aggregate the percentage of users who completed the signup flow over the past month and write it to another table for future use.

Operating EMR

EMR provides managed Hadoop on top of EC2 (AWS’s standard compute instances). Some code and configuration are required: internally, we use Spark and Hive heavily on top of EMR. Hive provides a SQL interface over your data, and Spark is a data processing framework that supports many different languages, such as Python, Scala, and Java. We’ll walk through an example and a more in-depth explanation of each below.

Pattern-wise, managing data with EMR is similar to how Athena operates. You need to tell it where your data is and its format. You can do this each time you need to run a job or take advantage of a central metastore like the AWS Glue Catalog mentioned earlier.

Building on our earlier example, let’s use EMR to find out how many messages of each type we received not only over the past day, but for every day over the past year. This requires going through way more data than we did with Athena, which means we should make a few optimizations to help speed this up.

Data Pre-processing

The first optimization we should make is to transform our data from JSON to Parquet. This will allow us to significantly cut down on the amount of data we need to scan for the final query, as shown previously!

For this JSON to Parquet file format transformation, we’ll want to use Hive, then turn to Spark for the aggregation steps.

Hive is a data warehousing system with a SQL interface for processing large amounts of data and has been around since 2010. Hive really shines when you need to do heavy reads and writes on a ton of data at once, which is exactly what we need when converting all our historical data from JSON into Parquet.

Below is an example of how we would execute this JSON to Parquet transformation.

First, we create the destination table with the final Parquet format we want, which we can do via Hive.

CREATE EXTERNAL TABLE `test_parquet`(
  anonymousid                 string                  ,  
  context                     map<string,string>      ,  
  messageid                   string                  ,   
  timestamp                   Timestamp               ,   
  type                        string                  ,   
  userid                      string                  ,   
  traits                      map<string,string>      ,  
  event                       string                   
)
PARTITIONED BY (dt string)  -- dt will be the prefix on your output files, i.e. s3://your-data-lake/parquet/dt=1432432423/object1.gz
STORED AS PARQUET  -- specify the format you want here
location 's3://your-data-lake/parquet/';

Then we simply need to read from the original JSON table and insert into the newly created Parquet table:

INSERT INTO test_parquet partition (dt) 
SELECT anonymousid, context, messageId, `timestamp`, `type`, userid, traits, event 
FROM test_json;

To actually run this step, we will need to create an EMR job to put some compute behind it. You can do this by submitting a job to EMR via the UI:

Or, by submitting a job via the CLI:

# EMR CLI example job, with lots of config!
aws emr add-steps \
  --cluster-id j-xxxxx \
  --steps Type=Spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/]

Aggregations

Now that we have our data in Parquet format, we can take advantage of Spark to sum how many messages of each type we received and write the results into a final table for future reference.

Spark is useful to run computations or aggregations over your data. It supports languages beyond SQL such as Python, R, Scala, Java, etc. which have more complex logic and libraries available. It also has in memory caching, so intermediate data doesn’t write to disk.

Below is a Python example of a Spark job to do this aggregation of messageid by type.

from datetime import datetime, timezone, timedelta
from pyspark.sql.functions import count


# S3 buckets for reading Segment logs and writing aggregation output
read_bucket_prefix = 's3://bizops-data-lake/segment-logs/source1'
write_bucket_prefix = 's3://bizops-data-lake-development/tmp/segment-logs-source1'

# get the datestamp for one year ago, truncated to a day boundary
# (note: timedelta has no 'years' argument, so we subtract 365 days)
today = datetime.now()
last_year_partition = datetime.strftime(today - timedelta(days=365), '%Y-%m-%d')
last_year_ds = datetime.strptime(last_year_partition, '%Y-%m-%d')


"""
  obtain all log partitions for the year
  sample filenames:
  [
    's3://bizops-data-lake/segment-logs/source1/1558915200000/',
    's3://bizops-data-lake/segment-logs/source1/1559001600000/',
    's3://bizops-data-lake/segment-logs/source1/1559088000000/',
    ...
  ]
"""
read_partitions = []
for day in range(365):
    next_ds = last_year_ds + timedelta(days=day)
    ts_partition = int(1000 * next_ds.replace(tzinfo=timezone.utc).timestamp())
    read_partitions.append('{}/{}/'.format(read_bucket_prefix, ts_partition))

# bucket partition for aggregation output
# sample: 's3://bizops-data-lake-development/tmp/segment-logs-source1/week_start_ds=2019-05-27/'
write_year_ds = '{}/week_start_ds={}/'.format(write_bucket_prefix, last_year_partition)

# read last year's logs, produced by the pre-processing step. Faster with Parquet!
df = spark.read.parquet(*read_partitions)

# aggregate message counts by type
agg_df = df.select('type', 'messageid').groupBy('type').agg(
    count('messageid').alias('message_count'),
)

# write the Spark output DataFrame to the final S3 bucket in Parquet format
agg_df.write.parquet(path=write_year_ds, compression='snappy', mode='overwrite')

It is this last step, agg_df.write.parquet, that takes the updated aggregations that are stored in an intermediate format, a DataFrame, and writes these aggregations to a new bucket in Parquet format.

Conclusion

All in, there is a robust ecosystem of tools that are available to get value out of the large amounts of data that can be amassed in a data lake.

It all starts with getting your data into S3. This gives you an incredibly cheap, reliable place to store all your data.

From S3, it’s then easy to query your data with Athena. Athena is perfect for exploratory analysis, with a simple UI that allows you to write SQL queries against any of the data you have in S3. Parquet can help cut down on the amount of data you need to query and save on costs!

AWS Glue makes querying your S3 data even easier, as it serves as the central metastore for what data is where. It is already integrated with both Athena and EMR, and has convenient crawlers that can help map out your data types and locations.

Finally, EMR helps take your data lake to the next level, with the ability to transform, aggregate, and create new rollups of your data with the flexibility of Spark, Hive, and more. It can be more complex to manage, but its data manipulation capabilities are second to none.

At Segment, we help enable a seamless integration with these same systems. Our S3 destination enables customers to have a fresh copy of all their customer and event data in their own AWS accounts.

We’re working on making this even easier by expanding the file format options and integrating with the AWS Glue metastore, so you always have an up-to-date schema for your latest data. Drop us a line if you want to be part of the beta!

Special thanks to Parsa Shabani, Calvin French-Owen, and Udit Mehta

We believe that everyone should receive equal consideration and treatment in all terms and conditions of employment regardless of sex, gender (including pregnancy, childbirth, breastfeeding or related medical conditions), sexual orientation, gender identity, gender expression, race, color, religion, creed, national origin, ancestry, age (over 40), physical disability, mental disability, medical condition, genetic information, marital status, domestic partner status, military or veteran status, height, weight, AIDS/HIV status, and any other protected category under federal, state or local law. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.</span></p> <p>&nbsp;#LI-Remote- Florida, Georgia, Illinois, Louisiana, Maryland, Massachusetts, Nevada, New Jersey, New York, North Carolina, Oregon, Philadelphia, Tennessee, Texas, Utah, Washington&nbsp;</p> </div>