Alternatives to Amazon Athena

Presto, Amazon Redshift Spectrum, Amazon Redshift, Cassandra, and Spectrum are the most popular alternatives and competitors to Amazon Athena.
492
837
49

What is Amazon Athena and what are its top alternatives?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It allows users to run SQL queries on data stored in S3 without the need to set up servers or manage infrastructure. Key features include easy integration with S3, support for various data formats, pay-as-you-go pricing, and compatibility with popular BI tools. However, some limitations of Amazon Athena include slower query performance on large datasets and limited support for complex query operations.

  1. Google BigQuery: Google BigQuery is a fully managed, serverless data warehouse that enables businesses to analyze large datasets quickly using SQL queries. Key features include scalability, real-time analytics, and integration with Google Cloud Platform services. Pros include faster query performance and support for complex queries, while cons include higher pricing compared to Amazon Athena.
  2. Snowflake: Snowflake is a cloud-based data warehouse solution that allows users to store and analyze data without the need for managing infrastructure. Key features include instant elasticity, automatic scaling, and support for multiple data formats. Pros include high performance and concurrency, while cons include higher pricing compared to Amazon Athena.
  3. Presto: Presto is an open-source distributed SQL query engine that allows users to query data where it lives, including in relational databases and Hadoop. Key features include fast query execution, support for various data sources, and extensibility. Pros include high query performance and flexibility in querying multiple data sources, while cons include more complex setup compared to Amazon Athena.
  4. Redshift: Amazon Redshift is a fully managed data warehouse service that enables businesses to run complex queries on large datasets. Key features include fast query performance, scalability, and integration with various BI tools. Pros include high performance for analytical workloads, while cons include higher pricing compared to Amazon Athena.
  5. Databricks: Databricks is a unified analytics platform that allows users to process and analyze big data using SQL, Python, and Scala. Key features include collaboration tools, scalable data processing, and integration with Apache Spark. Pros include ease of use and collaborative features, while cons include higher pricing compared to Amazon Athena.
  6. Cognite Data Fusion: Cognite Data Fusion is a data integration platform that enables businesses to unify and analyze industrial data sources. Key features include data contextualization, asset hierarchies, and time series analysis. Pros include support for industrial data formats and advanced analytics capabilities, while cons include specialized focus on industrial use cases.
  7. Qubole: Qubole is a cloud-based big data platform that enables users to run and optimize Apache Spark, Presto, and Hive jobs. Key features include auto-scaling, job orchestration, and interactive notebooks. Pros include ease of use and optimization for big data workloads, while cons include higher pricing compared to Amazon Athena.
  8. Trino: Trino, formerly known as PrestoSQL, is an open-source distributed SQL query engine that enables users to query data across multiple data sources. Key features include fast query performance, support for various data formats, and extensibility. Pros include high query performance and flexibility in querying multiple data sources, while cons include more complex setup compared to Amazon Athena.
  9. Mode: Mode is a collaborative analytics platform that allows users to analyze data and create visualizations using SQL, Python, and R. Key features include real-time collaboration, visual query builder, and customizable dashboards. Pros include ease of use and collaboration features, while cons include limited support for complex queries compared to Amazon Athena.
  10. PrestoDB: PrestoDB is an open-source distributed SQL query engine that enables users to query data where it lives, including in relational databases and Hadoop. Key features include fast query execution, support for various data sources, and extensibility. Pros include high query performance and flexibility in querying multiple data sources, while cons include more complex setup compared to Amazon Athena.

Top Alternatives to Amazon Athena

  • Presto

    Distributed SQL Query Engine for Big Data

  • Amazon Redshift Spectrum

    With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake” -- without having to load or transform any data. ...

  • Amazon Redshift

    It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions. ...

  • Cassandra

    Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...

  • Spectrum

    The community platform for the future.

  • Amazon Quicksight

    Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. ...

  • Google BigQuery

    Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python. ...

  • Elasticsearch

    Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack). ...

Amazon Athena alternatives & related posts

Presto

394
1K
66
Distributed SQL Query Engine for Big Data
PROS OF PRESTO
  • 18
    Works directly on files in s3 (no ETL)
  • 13
    Open-source
  • 12
    Join multiple databases
  • 10
    Scalable
  • 7
    Gets ready in minutes
  • 6
    MPP

    related Presto posts

    Ashish Singh
    Tech Lead, Big Data Platform at Pinterest · 38 upvotes · 3.1M views

    To meet employees’ critical need for interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges, such as supporting deeply nested and huge Thrift schemas, slow/bad worker detection and remediation, cluster auto-scaling, graceful cluster shutdown, and impersonation support for the LDAP authenticator.

    Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

    We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances and together have over 100 TB of memory and 14K vCPU cores. Within Pinterest, roughly 1,000 monthly active users (out of 1,600+ Pinterest employees in total) use Presto, running about 400K queries on these clusters per month.

    Each query submitted to a Presto cluster is logged to a Kafka topic via Singer, a logging agent built at Pinterest that we talked about in a previous post. Each query is logged when it is submitted and again when it finishes. When a Presto cluster crashes, we will have query-submitted events without corresponding query-finished events. These events enable us to capture the effect of cluster crashes over time.
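
The idea of spotting crash-affected queries from the event log can be sketched roughly as follows. This is a minimal illustration, not Pinterest's implementation: the event shape (`query_id`, `type`, `cluster`) and the function name are hypothetical, and the events are read from an in-memory list rather than a Kafka topic.

```python
def find_unfinished_queries(events):
    """Return IDs of queries that have a 'submitted' event but no 'finished' event."""
    submitted = set()
    finished = set()
    for event in events:
        # Field names here are assumptions made for illustration only.
        if event["type"] == "submitted":
            submitted.add(event["query_id"])
        elif event["type"] == "finished":
            finished.add(event["query_id"])
    # Queries that never reported completion are candidates for crash impact.
    return sorted(submitted - finished)


events = [
    {"query_id": "q1", "type": "submitted", "cluster": "adhoc"},
    {"query_id": "q1", "type": "finished", "cluster": "adhoc"},
    {"query_id": "q2", "type": "submitted", "cluster": "adhoc"},
    {"query_id": "q3", "type": "submitted", "cluster": "batch"},
    {"query_id": "q3", "type": "finished", "cluster": "batch"},
]

unfinished = find_unfinished_queries(events)
print(unfinished)  # ['q2']
```

In a real pipeline a still-running query would also lack a finished event, so some timeout or crash-time window would presumably be needed before counting it as lost.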

    Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Kubernetes platform lets us add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on Kubernetes is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

    #BigData #AWS #DataScience #DataEngineering

    Eric Colson
    Chief Algorithms Officer at Stitch Fix · | 21 upvotes · 6.1M views

    The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.

    Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

    At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan lets our data scientists quickly productionize the models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn) by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

    #DataScience #DataStack #Data


    Amazon Redshift Spectrum

    101
    147
    3
    Exabyte-Scale In-Place Queries of S3 Data
    PROS OF AMAZON REDSHIFT SPECTRUM
    • 1
      Good Performance
    • 1
      Great Documentation
    • 1
      Economical

      Amazon Redshift

      1.5K
      1.4K
      108
      Fast, fully managed, petabyte-scale data warehouse service
      PROS OF AMAZON REDSHIFT
      • 41
        Data Warehousing
      • 27
        Scalable
      • 17
        SQL
      • 14
        Backed by Amazon
      • 5
        Encryption
      • 1
        Cheap and reliable
      • 1
        Isolation
      • 1
        Best Cloud DW Performance
      • 1
        Fast columnar storage

        related Amazon Redshift posts

        Julien DeFrance
        Principal Software Engineer at Tophatter · 16 upvotes · 3.2M views

        Back in 2014, I was given an opportunity to re-architect the SmartZip Analytics platform and its flagship product, SmartTargeting. This is a SaaS product that helps real estate professionals keep up with their prospects and leads in a given neighborhood/territory, find out (thanks to predictive analytics) who is most likely to list/sell their home, and run cross-channel marketing automation against them: direct mail, online ads, email... The company also provides Data APIs to Enterprise customers.

        I had inherited years and years of technical debt and I knew things had to change radically. The first enabler to this was to make use of the cloud and go with AWS, so we would stop re-inventing the wheel, and build around managed/scalable services.

        For the SaaS product, we kept on working with Rails as this was what my team had the most knowledge in. We've however broken up the monolith and decoupled the front-end application from the backend thanks to the use of Rails API so we'd get independently scalable micro-services from now on.

        Our various applications could now be deployed using AWS Elastic Beanstalk, so we wouldn't waste any more effort writing time-consuming Capistrano deployment scripts, for instance. We combined this with Docker so each application would run within its own container, independently of the underlying host configuration.

        Storage-wise, we went with Amazon S3 and ditched any pre-existing local or network storage people used to deal with in our legacy systems. On the database side: Amazon RDS / MySQL initially. Ultimately migrated to Amazon RDS for Aurora / MySQL when it got released. Once again, here you need a managed service your cloud provider handles for you.

        Future improvements / technology decisions included:

        • Caching: Amazon ElastiCache / Memcached
        • CDN: Amazon CloudFront
        • Systems Integration: Segment / Zapier
        • Data-warehousing: Amazon Redshift
        • BI: Amazon Quicksight / Superset
        • Search: Elasticsearch / Amazon Elasticsearch Service / Algolia
        • Monitoring: New Relic

        As our usage grew, patterns changed, and/or our business needs evolved, my role as Engineering Manager and then Director of Engineering was also to ensure my team kept on learning and innovating while delivering business value.

        One of these innovations was to get ourselves into Serverless: adopting AWS Lambda was a big step forward. At the time it was only available for Node.js (not Ruby), but it was a great way to handle cost efficiency, unpredictable traffic, sudden bursts of traffic... Ultimately you want the whole chain of services involved in a call to be serverless, and that's when we started leveraging Amazon DynamoDB on these projects so they'd be fully scalable.

        Ankit Sobti

        Looker, Stitch, Amazon Redshift, dbt

        We recently moved our Data Analytics and Business Intelligence tooling to Looker. It's already helping us create a solid process for reusable SQL-based data modeling, with consistent definitions across the entire organization. Looker allows us to collaboratively build these version-controlled models and push the limits of what we've traditionally been able to accomplish with analytics with a lean team.

        For Data Engineering, we're in the process of moving from maintaining our own ETL pipelines on AWS to a managed ELT system on Stitch. We're also evaluating the command line tool dbt to manage data transformations. Our hope is that Stitch + dbt will streamline the ELT bit, allowing us to focus our energies on analyzing data rather than managing it.


        Cassandra

        3.6K
        3.5K
        507
        A partitioned row store. Rows are organized into tables with a required primary key.
        PROS OF CASSANDRA
        • 119
          Distributed
        • 98
          High performance
        • 81
          High availability
        • 74
          Easy scalability
        • 53
          Replication
        • 26
          Reliable
        • 26
          Multi datacenter deployments
        • 10
          Schema optional
        • 9
          OLTP
        • 8
          Open source
        • 2
          Workload separation (via MDC)
        • 1
          Fast
        CONS OF CASSANDRA
        • 3
          Reliability of replication
        • 1
          Size
        • 1
          Updates

        related Cassandra posts

        Thierry Schellenbach
        Shared insights on Golang, Python, Cassandra

        After years of optimizing our existing feed technology, we decided to make a larger leap with Stream 2.0. While the first iteration of Stream was powered by Python and Cassandra, for Stream 2.0 we switched our infrastructure to Go.

        The main reason why we switched from Python to Go is performance. Certain features of Stream such as aggregation, ranking and serialization were very difficult to speed up using Python.

        We’ve been using Go since March 2017 and it’s been a great experience so far. Go has greatly increased the productivity of our development team. Not only has it improved the speed at which we develop, it’s also 30x faster for many components of Stream. Initially we struggled a bit with package management for Go. However, using Dep together with the VG package contributed to creating a great workflow.

        Go as a language is heavily focused on performance. The built-in PPROF tool is amazing for finding performance issues. Uber’s Go-Torch library is great for visualizing data from PPROF and will be bundled in PPROF in Go 1.10.

        The performance of Go greatly influenced our architecture in a positive way. With Python we often found ourselves delegating logic to the database layer purely for performance reasons. The high performance of Go gave us more flexibility in terms of architecture. This led to a huge simplification of our infrastructure and a dramatic improvement of latency. For instance, we saw a 10 to 1 reduction in web-server count thanks to the lower memory and CPU usage for the same number of requests.

        #DataStores #Databases

        Thierry Schellenbach
        Shared insights on Redis, Cassandra, RocksDB

        Stream 1.0 leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.

        Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

        RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSDs. Nowadays RocksDB is a project in its own right and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much simpler than Cassandra.

        This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

        #InMemoryDatabases #DataStores #Databases


        Spectrum

        22
        32
        0
        A community platform for the future.

            related Spectrum posts

            From a StackShare Community member: “We’re about to start a chat group for our open source project (over 5K stars on GitHub) so we can let our community collaborate more closely. The obvious choice would be Slack (k8s and a ton of major projects use it), but we’ve seen Gitter (webpack uses it) for a lot of open source projects, Discord (Vue.js moved to them), and as of late I’m seeing Spectrum more and more often. Does anyone have experience with these or other alternatives? Is it even worth assessing all these options, or should we just go with Slack? Some things that are important to us: free, all the regular integrations (GitHub, Heroku, etc), mobile & desktop apps, and open source is of course a plus.”


            Amazon Quicksight

            206
            392
            5
            Fast, easy to use business analytics at 1/10th the cost of traditional BI solutions
            PROS OF AMAZON QUICKSIGHT
            • 1
              Dataset versioning
            • 1
              Good integration with AWS Glue ETL services
            • 1
              More features (table calculations, functions, insights)
            • 1
              Better integration with AWS
            • 1
              Super cheap
            CONS OF AMAZON QUICKSIGHT
            • 1
              Very basic BI tool
            • 1
              Only works in AWS environments (not GCP, Azure)

            related Amazon Quicksight posts

            Hi, I'm working on a project to integrate data from Shopify (e-commerce platform) into Amazon Quicksight. I'm thinking about which database to use, either Amazon S3 or MySQL.


            Google BigQuery

            1.7K
            1.5K
            152
            Analyze terabytes of data in seconds
            PROS OF GOOGLE BIGQUERY
            • 28
              High Performance
            • 25
              Easy to use
            • 22
              Fully managed service
            • 19
              Cheap Pricing
            • 16
              Process hundreds of GB in seconds
            • 12
              Big Data
            • 11
              Full table scans in seconds, no indexes needed
            • 8
              Always on, no per-hour costs
            • 6
              Good combination with fluentd
            • 4
              Machine learning
            • 1
              Easy to manage
            • 0
              Easy to learn
            CONS OF GOOGLE BIGQUERY
            • 1
              You can't unit test changes in BQ data

            related Google BigQuery posts

            Context: I wanted to create an end-to-end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I had never touched Terraform meaningfully before working on this project, and it's one of the best explorations in my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted at first impression. I'm glad past Sung did not make this decision either.

            Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine, along with some other fun tools, I can deploy over 40 GCP resources using Terraform!

            Tim Specht
            Co-Founder and CTO at Dubsmash · 14 upvotes · 981.5K views

            In order to accurately measure & track user behaviour on our platform we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource & pricing concerns we had.

            While this does sound complicated, it’s as easy as clients sending JSON blobs of events to Amazon Kinesis from where we use AWS Lambda & Amazon SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it’s available), we can use almost-standard-SQL to simply query for data while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours. Before ingesting their data into the pipeline, our mobile clients are aggregating events internally and, once a certain threshold is reached or the app is going to the background, sending the events as a JSON blob into the stream.
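
The client-side aggregation described above (buffer events, then send one JSON blob when a threshold is reached or the app goes to the background) might look something like the sketch below. The class name, the threshold value, and the `send` callback are hypothetical stand-ins; in the setup described, `send` would write to the Amazon Kinesis stream.

```python
import json


class EventBuffer:
    """Collects events and flushes them as a single JSON blob."""

    def __init__(self, send, threshold=3):
        self._send = send            # callable that ships one JSON string
        self._threshold = threshold  # assumed batch size, for illustration
        self._events = []

    def track(self, event):
        self._events.append(event)
        # Flush as soon as the buffer reaches the threshold.
        if len(self._events) >= self._threshold:
            self.flush()

    def on_background(self):
        # Flush whatever is pending when the app leaves the foreground.
        self.flush()

    def flush(self):
        if not self._events:
            return
        self._send(json.dumps(self._events))
        self._events = []


sent = []  # stands in for the stream; each item is one JSON blob
buffer = EventBuffer(send=sent.append, threshold=3)
for name in ["open", "play", "like", "share"]:
    buffer.track({"event": name})
buffer.on_background()
print(len(sent))  # 2
```

Batching this way trades a little latency for fewer, larger writes to the stream.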

            In the past we had workers running that continuously read from the stream and would validate and post-process the data and then enqueue them for other workers to write them to BigQuery. We went ahead and implemented the Lambda-based approach in such a way that Lambda functions would automatically be triggered for incoming records, pre-aggregate events, and write them back to SQS, from which we then read them, and persist the events to BigQuery. While this approach had a couple of bumps on the road, like re-triggering functions asynchronously to keep up with the stream and proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today.

            #ServerlessTaskProcessing #GeneralAnalytics #RealTimeDataProcessing #BigDataAsAService


            Elasticsearch

            34.4K
            26.8K
            1.6K
            Open Source, Distributed, RESTful Search Engine
            PROS OF ELASTICSEARCH
            • 328
              Powerful api
            • 315
              Great search engine
            • 231
              Open source
            • 214
              Restful
            • 200
              Near real-time search
            • 98
              Free
            • 85
              Search everything
            • 54
              Easy to get started
            • 45
              Analytics
            • 26
              Distributed
            • 6
              Fast search
            • 5
              More than a search engine
            • 4
              Great docs
            • 4
              Awesome, great tool
            • 3
              Highly Available
            • 3
              Easy to scale
            • 2
              Potato
            • 2
              Document Store
            • 2
              Great customer support
            • 2
              Intuitive API
            • 2
              Nosql DB
            • 2
              Great piece of software
            • 2
              Reliable
            • 2
              Fast
            • 2
              Easy setup
            • 1
              Open
            • 1
              Easy to get hot data
            • 1
              Github
            • 1
              Elasticsearch
            • 1
              Actively developing
            • 1
              Responsive maintainers on GitHub
            • 1
              Ecosystem
            • 1
              Not stable
            • 1
              Scalability
            • 0
              Community
            CONS OF ELASTICSEARCH
            • 7
              Resource hungry
            • 6
              Difficult to get started
            • 5
              Expensive
            • 4
              Hard to keep stable at large scale

            related Elasticsearch posts

            Tim Abbott

            We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults, to bad query plans which required a lot of manual query tweaks.

            We ended up getting so frustrated that we tried out PostgreSQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because Postgres just did the right thing automatically).

            And then after that, we've just gotten a ton of value out of postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.
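To illustrate the partial-index idea mentioned above, here is a minimal sketch. It is not Zulip's actual schema: the table, column, and index names are hypothetical, and it uses SQLite from the Python standard library so that it runs without a database server. PostgreSQL accepts the same `CREATE INDEX ... WHERE ...` form.

```python
import sqlite3

# In-memory database so the sketch is self-contained (assumption: SQLite
# stands in for PostgreSQL here; both support partial indexes).
conn = sqlite3.connect(":memory:")

# Hypothetical table tracking which messages each user has read.
conn.execute(
    "CREATE TABLE user_message ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " message_id INTEGER NOT NULL,"
    " is_read INTEGER NOT NULL DEFAULT 0)"
)

# Partial index: only unread rows are indexed, so the index stays small
# even when the vast majority of messages have already been read.
conn.execute(
    "CREATE INDEX user_message_unread"
    " ON user_message (user_id, message_id)"
    " WHERE is_read = 0"
)

conn.executemany(
    "INSERT INTO user_message (user_id, message_id, is_read) VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 0), (1, 12, 0), (2, 10, 0)],
)


def unread_message_ids(connection, user_id):
    """Return the unread message ids for one user, in ascending order."""
    cursor = connection.execute(
        "SELECT message_id FROM user_message"
        " WHERE user_id = ? AND is_read = 0"
        " ORDER BY message_id",
        (user_id,),
    )
    return [row[0] for row in cursor]


print(unread_message_ids(conn, 1))
```

The query repeats the index predicate (`is_read = 0`), which is what lets the planner consider the partial index instead of a separate "unread" table.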

            I can't recommend it highly enough.

            Tymoteusz Paul
            DevOps guy at X20X Development LTD · 23 upvotes · 9.5M views

            Often enough I have to explain my approach to setting up a CI/CD pipeline with multiple deployment platforms. Since I am a bit tired of repeating the same thing every single time, I've decided to write it up, share it with the world, and send people to read it instead ;). I will explain it using a "live example" of how Rome got built, assuming that the current methodology consists only of a readme.md and wishes of good luck (as it usually does ;)).

            It always starts with an app, whatever it may be, and with reading the available readmes while Vagrant and VirtualBox are installing and updating. Next comes the first hurdle: converting all the instructions/scripts into Ansible playbook(s), stopping only when a clean vagrant up or vagrant reload gives us a fully working environment. As our Vagrant environment is now functional, it's time to break it! This is the moment to look for things that can be done better (too rigid/too loose versioning? Sloppy environment setup?) and replace them with the right way to do stuff, one that won't bite us in the backside. This is the point, and the best opportunity, to upcycle the existing way of doing the dev environment into a proper, production-grade product.

            I should probably digress here for a moment and explain why. I firmly believe that the way you deploy production is the same way you should deploy development, short of a few debugging-friendly settings. This way you avoid discrepancies between how production works and how development works, which almost always cause major pains in the neck, and with the use of proper tools this should mean no extra work for the developers. That's why we start with Vagrant, as developer boxes should be as easy as vagrant up, but the meat of our product lies in Ansible, which will do most of the work and can be applied to almost anything: AWS, bare metal, Docker, LXC, on the open net, behind a VPN - you name it.

            We must also give proper consideration to monitoring and log collection at this point. My generic answer here is to grab Elasticsearch, Kibana, and Logstash. While there may be better solutions for particular use cases, this one is well battle-tested, performs reasonably, and is very easy to scale both vertically (within some limits) and horizontally. Logstash rules are easy to write and are easy to maintain through Ansible, which, as I've mentioned earlier, is at the very core of things. Creating triggers/reports and alerts based on Elastic and Kibana is generally a breeze, including some quite complex aggregations.
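As a rough idea of what such an aggregation can look like, the sketch below builds an Elasticsearch search body (query DSL) as a plain Python dict: it counts error-level log events per time bucket and breaks each bucket down by host. The field names `@timestamp`, `level`, and `host`, and the helper name itself, are assumptions for illustration and depend on how Logstash maps your logs; the body would be sent to the `_search` endpoint of whichever index holds them.

```python
import json


def build_error_histogram_query(window="now-1h", interval="1m"):
    """Build a search body that buckets recent error events over time.

    Assumes `level` and `host` are keyword fields and `@timestamp` is a
    date field; adjust to the actual index mapping.
    """
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": window}}},
                    {"term": {"level": "error"}},
                ]
            }
        },
        "aggs": {
            "errors_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                # Nested aggregation: top hosts within each time bucket.
                "aggs": {
                    "by_host": {"terms": {"field": "host", "size": 5}}
                },
            }
        },
    }


print(json.dumps(build_error_histogram_query(), indent=2))
```

An alert could then fire when any bucket's document count crosses a threshold; that part is left out because it depends on the alerting tool in use.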

            If we are happy with the state of the Ansible setup, it's time to move on and put all those roles and playbooks to work. Namely, we need something to manage our CI/CD pipelines. For me, the choice is obvious: TeamCity. It's modern, robust and, unlike most of the lightweight alternatives, transparent. What I mean by that is that it doesn't tell you how to do things, and doesn't limit your ways to deploy, or test, or package for that matter. Instead, it provides a developer-friendly and rich playground for your pipelines. You can do much the same with Jenkins, but it has a quite dated look and feel, while also missing some key functionality that must be brought in via plugins (like a quality REST API, which comes built in with TeamCity). TeamCity also comes with all the common, handy plugins, such as Slack or Apache Maven integration.

            The exact flow between CI and CD varies too greatly from one application to another to describe, so I will outline a few rules that guide me in it:

            1. Make build steps as small as possible. This way when something breaks, we know exactly where, without needing to dig and root around.
            2. All security credentials besides those for the development environment must be sourced from individual Vault instances. Keys to those containers should exist only on the CI/CD box and be accessible by a few people (the fewer the better). This is pretty self-explanatory, as anything besides dev may contain sensitive data and, at times, be public-facing. Because of that, appropriate security must be present. TeamCity shines in this department with excellent secrets management.
            3. Every part of the build chain shall consume and produce artifacts. If it creates nothing, it likely shouldn't be its own build. This way if any issue shows up with any environment or version, all a developer has to do is grab the appropriate artifacts to reproduce the issue locally.
            4. Deployment builds should be directly tied to specific Git branches/tags. This enables much easier tracking of what caused an issue, including automated identifying and tagging of the author (nothing like automated regression testing!).
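The last rule, tying deployment builds to specific Git branches or tags, can be sketched as a small lookup that a CI step might run before deploying. This is only an illustration: the ref patterns, the environment names, and the function name are assumptions, not something TeamCity or the author prescribes.

```python
import re
from typing import Optional

# Hypothetical convention: release tags look like v1.2.3.
RELEASE_TAG = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")


def deployment_target(git_ref: str) -> Optional[str]:
    """Map a full Git ref to the environment it is allowed to deploy to.

    Returns None for refs that should be built and tested but never deployed.
    """
    if RELEASE_TAG.match(git_ref):
        return "production"
    if git_ref == "refs/heads/main":
        return "staging"
    if git_ref.startswith("refs/heads/release/"):
        return "preproduction"
    return None


for ref in ("refs/tags/v1.4.0", "refs/heads/main", "refs/heads/feature/login"):
    print(ref, "->", deployment_target(ref))
```

Because every deployment is derived from a ref, an issue in a given environment can be traced back to the exact branch or tag, and from there to the commits and their authors.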

            Speaking of deployments, I generally try to keep it simple but also keep a close eye on the wallet. Because of that, I am more than happy with AWS or another cloud provider, but I am also constantly peeking at the loads and asking whether we get value for what we are paying. Often enough the pattern of use is not constantly erratic, but rather has a firm baseline which could be migrated away from the cloud and onto bare-metal boxes. That is another area where this approach strongly triumphs over the common Docker and CircleCI setup, where you are very much tied to cloud providers and getting out is expensive. Here, to embrace bare-metal hosting, all you need is the help of some container-based self-hosting software; my personal preference is Proxmox and LXC. After that, all you must write are Ansible scripts to manage the Proxmox hardware, in a similar way as you do for Amazon EC2 (Ansible supports both well), and you are good to go. One does not exclude the other, quite the opposite: they can live in great synergy and cut your costs dramatically (the heavier your base load, the bigger the savings) while providing production-grade resiliency.
