Alternatives to MarkLogic logo

Alternatives to MarkLogic

MongoDB, Neo4j, Oracle, Cassandra, and HBase are the most popular alternatives and competitors to MarkLogic.
42
70
+ 1
26

What is MarkLogic and what are its top alternatives?

MarkLogic is a NoSQL database that combines the benefits of a document database, a relational database, and a search engine all in one platform. Its key features include ACID transactions, horizontal scaling, flexible data model, and powerful search capabilities. However, some limitations include high cost, steep learning curve, and limited community support.

  1. MongoDB: MongoDB is a popular NoSQL database known for its scalability, flexibility, and high performance. Key features include document-based data model, horizontal scaling, and strong community support. Pros include ease of use and integration capabilities, while cons include lack of ACID transactions in earlier versions.
  2. Couchbase: Couchbase is an open-source NoSQL database with features like JSON support, distributed architecture, and strong query capabilities. Pros include high performance and scalability, while cons include limited secondary indexes and analytics capabilities.
  3. Cassandra: Apache Cassandra is a distributed NoSQL database known for its linear scalability and fault tolerance. Key features include decentralized architecture, tunable consistency levels, and support for massive amounts of data. Pros include high availability and fault tolerance, while cons include complex configuration and management.
  4. Elasticsearch: Elasticsearch is a distributed search and analytics engine known for its speed, scalability, and flexibility. Key features include full-text search, real-time data analysis, and horizontal scaling. Pros include powerful search capabilities and ease of use, while cons include lack of ACID transactions.
  5. Amazon DynamoDB: DynamoDB is a fully managed NoSQL database service offered by AWS, providing key-value and document data storage. Key features include automatic scaling, low latency, and seamless integration with other AWS services. Pros include high availability and low maintenance overhead, while cons include costs for read and write operations.
  6. Redis: Redis is an open-source, in-memory data structure store known for its speed and flexibility. Key features include data structures like strings, hashes, and lists, as well as support for high availability and clustering. Pros include low latency and high performance, while cons include limited data persistence options.
  7. Oracle NoSQL Database: Oracle NoSQL Database is a distributed, highly available NoSQL database from Oracle. Key features include automatic partitioning, replication, and failover, as well as ACID transactions. Pros include scalability and reliability, while cons include high cost and complex setup.
  8. Google Cloud Bigtable: Google Cloud Bigtable is a highly scalable, fully managed NoSQL database service offered by Google Cloud Platform. Key features include high performance, scalability to petabytes of data, and low latency. Pros include seamless integration with other GCP services and high availability, while cons include costs for storage and operations.
  9. Neo4j: Neo4j is a graph database known for its high performance and scalability in handling complex relationships. Key features include graph-based data model, ACID transactions, and powerful query language. Pros include fast query performance and strong relationship modeling capabilities, while cons include limited support for large-scale data processing.
  10. TiDB: TiDB is a distributed SQL database that combines the horizontal scalability of NoSQL with the ACID compliance of traditional relational databases. Key features include automatic sharding, strong consistency, and compatibility with MySQL protocols. Pros include scalability and ACID compliance, while cons include limited ecosystem and tooling compared to established databases.

Top Alternatives to MarkLogic

  • MongoDB
    MongoDB

    MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding. ...

  • Neo4j
    Neo4j

    Neo4j stores data in nodes connected by directed, typed relationships with properties on both, also known as a Property Graph. It is a high performance graph store with all the features expected of a mature and robust database, like a friendly query language and ACID transactions. ...

  • Oracle
    Oracle

    Oracle Database is an RDBMS. An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism is called an object-relational database management system (ORDBMS). Oracle Database has extended the relational model to an object-relational model, making it possible to store complex business models in a relational database. ...

  • Cassandra
    Cassandra

    Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...

  • HBase
    HBase

    Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google' Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop. ...

  • Hadoop
    Hadoop

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...

  • Couchbase
    Couchbase

    Developed as an alternative to traditionally inflexible SQL databases, the Couchbase NoSQL database is built on an open source foundation and architected to help developers solve real-world problems and meet high scalability demands. ...

  • Elasticsearch
    Elasticsearch

    Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack). ...

MarkLogic alternatives & related posts

MongoDB logo

MongoDB

93.6K
4.1K
The database for giant ideas
93.6K
4.1K
PROS OF MONGODB
  • 828
    Document-oriented storage
  • 593
    No sql
  • 553
    Ease of use
  • 464
    Fast
  • 410
    High performance
  • 255
    Free
  • 218
    Open source
  • 180
    Flexible
  • 145
    Replication & high availability
  • 112
    Easy to maintain
  • 42
    Querying
  • 39
    Easy scalability
  • 38
    Auto-sharding
  • 37
    High availability
  • 31
    Map/reduce
  • 27
    Document database
  • 25
    Easy setup
  • 25
    Full index support
  • 16
    Reliable
  • 15
    Fast in-place updates
  • 14
    Agile programming, flexible, fast
  • 12
    No database migrations
  • 8
    Easy integration with Node.Js
  • 8
    Enterprise
  • 6
    Enterprise Support
  • 5
    Great NoSQL DB
  • 4
    Support for many languages through different drivers
  • 3
    Schemaless
  • 3
    Aggregation Framework
  • 3
    Drivers support is good
  • 2
    Fast
  • 2
    Managed service
  • 2
    Easy to Scale
  • 2
    Awesome
  • 2
    Consistent
  • 1
    Good GUI
  • 1
    Acid Compliant
CONS OF MONGODB
  • 6
    Very slowly for connected models that require joins
  • 3
    Not acid compliant
  • 2
    Proprietary query language

related MongoDB posts

Jeyabalaji Subramanian

Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.

We set ourselves the following criteria for the optimal tool that would do this job: - The data replication must be near real-time, yet it should NOT impact the production database - The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient

Based on the above criteria, we selected the following tools to perform the end to end data replication:

We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.

We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB stitch offers integration with AWS services.

In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.

Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pickup the data payload & mirror the DB changes on to the target Data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy . We deployed this micro-service as AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda service is dumb-easy.

In the end, we got to implement a highly scalable near realtime Change Data Replication service that "works" and deployed to production in a matter of few days!

See more
Robert Zuber

We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.

As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).

When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.

See more
Neo4j logo

Neo4j

1.2K
351
The world’s leading Graph Database
1.2K
351
PROS OF NEO4J
  • 69
    Cypher – graph query language
  • 61
    Great graphdb
  • 33
    Open source
  • 31
    Rest api
  • 27
    High-Performance Native API
  • 23
    ACID
  • 21
    Easy setup
  • 17
    Great support
  • 11
    Clustering
  • 9
    Hot Backups
  • 8
    Great Web Admin UI
  • 7
    Powerful, flexible data model
  • 7
    Mature
  • 6
    Embeddable
  • 5
    Easy to Use and Model
  • 4
    Highly-available
  • 4
    Best Graphdb
  • 2
    It's awesome, I wanted to try it
  • 2
    Great onboarding process
  • 2
    Great query language and built in data browser
  • 2
    Used by Crunchbase
CONS OF NEO4J
  • 9
    Comparably slow
  • 4
    Can't store a vertex as JSON
  • 1
    Doesn't have a managed cloud service at low cost

related Neo4j posts

Shared insights
on
Neo4jNeo4jKafkaKafkaMySQLMySQL

Hello Stackshare. I'm currently doing some research on real-time reporting and analytics architectures. We have a use case where 1million+ records of users, 4million+ activities, and messages that we want to report against. The start was to present it directly from MySQL, which didn't go well and puts a heavy load on the database. Anybody can suggest something where we feed the data and can report in realtime? Read some articles about ElasticSearch and Kafka https://medium.com/@D11Engg/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33 EDIT: also considering Neo4j

See more
Stephen Gheysens
Lead Solutions Engineer at Inscribe · | 7 upvotes · 473.4K views

Google Maps lets "property owners and their authorized representatives" upload indoor maps, but this appears to lack navigation ("wayfinding").

MappedIn is a platform and has SDKs for building indoor mapping experiences (https://www.mappedin.com/) and ESRI ArcGIS also offers some indoor mapping tools (https://www.esri.com/en-us/arcgis/indoor-gis/overview). Finally, there used to be a company called LocusLabs that is now a part of Atrius and they were often integrated into airlines' apps to provide airport maps with wayfinding (https://atrius.com/solutions/personal-experiences/personal-wayfinder/).

I previously worked at Mapbox and while I believe that it's a great platform for building map-based experiences, they don't have any simple solutions for indoor wayfinding. If I were doing this for fun as a side-project and prioritized saving money over saving time, here is what I would do:

  • Create a graph-based dataset representing the walking paths around your university, where nodes/vertexes represent the intersections of paths, and edges represent paths (literally paths outside, hallways, short path segments that represent entering rooms). You could store this in a hosted graph-based database like Neo4j, Amazon Neptune , or Azure Cosmos DB (with its Gremlin API) and use built-in "shortest path" queries, or deploy a PostgreSQL service with pgRouting.

  • Add two properties to each edge: one property for the distance between its nodes (libraries like @turf/helpers will have a distance function if you have the latitude & longitude of each node), and another property estimating the walking time (based on the distance). Once you have these values saved in a graph-based format, you should be able to easily query and find the data representation of paths between two points.

  • At this point, you'd have the routing problem solved and it would come down to building a UI. Mapbox arguably leads the industry in developer tools for custom map experiences. You could convert your nodes/edges to GeoJSON, then either upload to Mapbox and create a Tileset to visualize the paths, or add the GeoJSON to the map on the fly.

*You might be able to use open source routing tools like OSRM (https://github.com/Project-OSRM/osrm-backend/issues/6257) or Graphhopper (instead of a custom graph database implementation), but it would likely be more involved to maintain these services.

See more
Oracle logo

Oracle

2.3K
113
An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism
2.3K
113
PROS OF ORACLE
  • 44
    Reliable
  • 33
    Enterprise
  • 15
    High Availability
  • 5
    Hard to maintain
  • 5
    Expensive
  • 4
    Maintainable
  • 4
    Hard to use
  • 3
    High complexity
CONS OF ORACLE
  • 14
    Expensive

related Oracle posts

Hi. We are planning to develop web, desktop, and mobile app for procurement, logistics, and contracts. Procure to Pay and Source to pay, spend management, supplier management, catalog management. ( similar to SAP Ariba, gap.com, coupa.com, ivalua.com vroozi.com, procurify.com

We got stuck when deciding which technology stack is good for the future. We look forward to your kind guidance that will help us.

We want to integrate with multiple databases with seamless bidirectional integration. What APIs and middleware available are best to achieve this? SAP HANA, Oracle, MySQL, MongoDB...

ASP.NET / Node.js / Laravel. ......?

Please guide us

See more

I recently started a new position as a data scientist at an E-commerce company. The company is founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee. So I have to take care of both data analysis tasks as well as bringing new technologies to the company.

  1. They have used Elasticsearch (and Kibana) to have reporting dashboards on their daily purchases and users interactions on their e-commerce website.

  2. They also use the Oracle database system to keep records of their daily turnovers and lists of their current products, clients, and sellers lists.

  3. They use Data-Warehouse with cockpit 10 for generating reports on different aspects of their business including number 2 in this list.

At the moment, I grab batches of data from their system to perform predictive analytics from data science perspectives. In some cases, I use a static form of data such as monthly turnover, client values, and high-demand products, and run my predictive analysis using Python (VS code). Also, I use Google Datastudio or Google Sheets to present my findings. In other cases, I try to do time-series analysis using offline batches of data extracted from Elastic Search to do user recommendations and user personalization.

I really want to use modern data science tools such as Apache Spark, Google BigQuery, AWS, Azure, or others where they really fit. I think these tools can improve my performance as a data scientist and can provide more continuous analytics of their business interactions. But honestly, I'm not sure where each tool is needed and what part of their system should be replaced by or combined with the current state of technology to improve productivity from the above perspectives.

See more
Cassandra logo

Cassandra

3.6K
507
A partitioned row store. Rows are organized into tables with a required primary key.
3.6K
507
PROS OF CASSANDRA
  • 119
    Distributed
  • 98
    High performance
  • 81
    High availability
  • 74
    Easy scalability
  • 53
    Replication
  • 26
    Reliable
  • 26
    Multi datacenter deployments
  • 10
    Schema optional
  • 9
    OLTP
  • 8
    Open source
  • 2
    Workload separation (via MDC)
  • 1
    Fast
CONS OF CASSANDRA
  • 3
    Reliability of replication
  • 1
    Size
  • 1
    Updates

related Cassandra posts

Thierry Schellenbach
Shared insights
on
RedisRedisCassandraCassandraRocksDBRocksDB
at

1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance started, out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write heavy workloads very efficiently.

Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSD. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much more simple than Cassandra.

This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

#InMemoryDatabases #DataStores #Databases

See more

Trying to establish a data lake(or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

  1. Ingestion->Secure, role-based, self service portal for users to upload data (1a. bonus points if it can preform basic validations/masking)
  2. Storage->Amazon S3 seems like the cheapest. We probably won't need very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
  3. Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
  4. Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
  5. Processing-> We want to use SAS if at all possible. What will work with SAS code?
  6. Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
  7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
  8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and self-service gui would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
See more
HBase logo

HBase

462
15
The Hadoop database, a distributed, scalable, big data store
462
15
PROS OF HBASE
  • 9
    Performance
  • 5
    OLTP
  • 1
    Fast Point Queries
CONS OF HBASE
    Be the first to leave a con

    related HBase posts

    I am researching different querying solutions to handle ~1 trillion records of data (in the realm of a petabyte). The data is mostly textual. I have identified a few options: Milvus, HBase, RocksDB, and Elasticsearch. I was wondering if there is a good way to compare the performance of these options (or if anyone has already done something like this). I want to be able to compare the speed of ingesting and querying textual data from these tools. Does anyone have information on this or know where I can find some? Thanks in advance!

    See more

    Hi, I'm building a machine learning pipelines to store image bytes and image vectors in the backend.

    So, when users query for the random access image data (key), we return the image bytes and perform machine learning model operations on it.

    I'm currently considering going with Amazon S3 (in the future, maybe add Redis caching layer) as the backend system to store the information (s3 buckets with sharded prefixes).

    As the latency of S3 is 100-200ms (get/put) and it has a high throughput of 3500 puts/sec and 5500 gets/sec for a given bucker/prefix. In the future I need to reduce the latency, I can add Redis cache.

    Also, s3 costs are way fewer than HBase (on Amazon EC2 instances with 3x replication factor)

    I have not personally used HBase before, so can someone help me if I'm making the right choice here? I'm not aware of Hbase latencies and I have learned that the MOB feature on Hbase has to be turned on if we have store image bytes on of the column families as the avg image bytes are 240Kb.

    See more
    Hadoop logo

    Hadoop

    2.5K
    56
    Open-source software for reliable, scalable, distributed computing
    2.5K
    56
    PROS OF HADOOP
    • 39
      Great ecosystem
    • 11
      One stack to rule them all
    • 4
      Great load balancer
    • 1
      Amazon aws
    • 1
      Java syntax
    CONS OF HADOOP
      Be the first to leave a con

      related Hadoop posts

      Shared insights
      on
      KafkaKafkaHadoopHadoop
      at

      The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

      For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

      Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.

      See more
      Conor Myhrvold
      Tech Brand Mgr, Office of CTO at Uber · | 7 upvotes · 3M views

      Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :

      Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

      https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

      (Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )

      See more
      Couchbase logo

      Couchbase

      478
      110
      Document-Oriented NoSQL Database
      478
      110
      PROS OF COUCHBASE
      • 18
        High performance
      • 18
        Flexible data model, easy scalability, extremely fast
      • 9
        Mobile app support
      • 7
        You can query it with Ansi-92 SQL
      • 6
        All nodes can be read/write
      • 5
        Equal nodes in cluster, allowing fast, flexible changes
      • 5
        Both a key-value store and document (JSON) db
      • 5
        Open source, community and enterprise editions
      • 4
        Automatic configuration of sharding
      • 4
        Local cache capability
      • 3
        Easy setup
      • 3
        Linearly scalable, useful to large number of tps
      • 3
        Easy cluster administration
      • 3
        Cross data center replication
      • 3
        SDKs in popular programming languages
      • 3
        Elasticsearch connector
      • 3
        Web based management, query and monitoring panel
      • 2
        Map reduce views
      • 2
        DBaaS available
      • 2
        NoSQL
      • 1
        Buckets, Scopes, Collections & Documents
      • 1
        FTS + SQL together
      CONS OF COUCHBASE
      • 3
        Terrible query language

      related Couchbase posts

      Gabriel Pa

      We implemented our first large scale EPR application from naologic.com using CouchDB .

      Very fast, replication works great, doesn't consume much RAM, queries are blazing fast but we found a problem: the queries were very hard to write, it took a long time to figure out the API, we had to go and write our own @nodejs library to make it work properly.

      It lost most of its support. Since then, we migrated to Couchbase and the learning curve was steep but all worth it. Memcached indexing out of the box, full text search works great.

      See more
      Ilias Mentzelos
      Software Engineer at Plum Fintech · | 9 upvotes · 243.9K views
      Shared insights
      on
      MongoDBMongoDBCouchbaseCouchbase

      Hey, we want to build a referral campaign mechanism that will probably contain millions of records within the next few years. We want fast read access based on IDs or some indexes, and isolation is crucial as some listeners will try to update the same document at the same time. What's your suggestion between Couchbase and MongoDB? Thanks!

      See more
      Elasticsearch logo

      Elasticsearch

      34.5K
      1.6K
      Open Source, Distributed, RESTful Search Engine
      34.5K
      1.6K
      PROS OF ELASTICSEARCH
      • 328
        Powerful api
      • 315
        Great search engine
      • 231
        Open source
      • 214
        Restful
      • 200
        Near real-time search
      • 98
        Free
      • 85
        Search everything
      • 54
        Easy to get started
      • 45
        Analytics
      • 26
        Distributed
      • 6
        Fast search
      • 5
        More than a search engine
      • 4
        Great docs
      • 4
        Awesome, great tool
      • 3
        Highly Available
      • 3
        Easy to scale
      • 2
        Potato
      • 2
        Document Store
      • 2
        Great customer support
      • 2
        Intuitive API
      • 2
        Nosql DB
      • 2
        Great piece of software
      • 2
        Reliable
      • 2
        Fast
      • 2
        Easy setup
      • 1
        Open
      • 1
        Easy to get hot data
      • 1
        Github
      • 1
        Elaticsearch
      • 1
        Actively developing
      • 1
        Responsive maintainers on GitHub
      • 1
        Ecosystem
      • 1
        Not stable
      • 1
        Scalability
      • 0
        Community
      CONS OF ELASTICSEARCH
      • 7
        Resource hungry
      • 6
        Diffecult to get started
      • 5
        Expensive
      • 4
        Hard to keep stable at large scale

      related Elasticsearch posts

      Tim Abbott

      We've been using PostgreSQL since the very early days of Zulip, but we actually didn't use it from the beginning. Zulip started out as a MySQL project back in 2012, because we'd heard it was a good choice for a startup with a wide community. However, we found that even though we were using the Django ORM for most of our database access, we spent a lot of time fighting with MySQL. Issues ranged from bad collation defaults, to bad query plans which required a lot of manual query tweaks.

      We ended up getting so frustrated that we tried out PostgresQL, and the results were fantastic. We didn't have to do any real customization (just some tuning settings for how big a server we had), and all of our most important queries were faster out of the box. As a result, we were able to delete a bunch of custom queries escaping the ORM that we'd written to make the MySQL query planner happy (because postgres just did the right thing automatically).

      And then after that, we've just gotten a ton of value out of postgres. We use its excellent built-in full-text search, which has helped us avoid needing to bring in a tool like Elasticsearch, and we've really enjoyed features like its partial indexes, which saved us a lot of work adding unnecessary extra tables to get good performance for things like our "unread messages" and "starred messages" indexes.

      I can't recommend it highly enough.

      See more
      Tymoteusz Paul
      Devops guy at X20X Development LTD · | 23 upvotes · 10M views

      Often enough I have to explain my way of going about setting up a CI/CD pipeline with multiple deployment platforms. Since I am a bit tired of yapping the same every single time, I've decided to write it up and share with the world this way, and send people to read it instead ;). I will explain it on "live-example" of how the Rome got built, basing that current methodology exists only of readme.md and wishes of good luck (as it usually is ;)).

      It always starts with an app, whatever it may be and reading the readmes available while Vagrant and VirtualBox is installing and updating. Following that is the first hurdle to go over - convert all the instruction/scripts into Ansible playbook(s), and only stopping when doing a clear vagrant up or vagrant reload we will have a fully working environment. As our Vagrant environment is now functional, it's time to break it! This is the moment to look for how things can be done better (too rigid/too lose versioning? Sloppy environment setup?) and replace them with the right way to do stuff, one that won't bite us in the backside. This is the point, and the best opportunity, to upcycle the existing way of doing dev environment to produce a proper, production-grade product.

      I should probably digress here for a moment and explain why. I firmly believe that the way you deploy production is the same way you should deploy develop, shy of few debugging-friendly setting. This way you avoid the discrepancy between how production work vs how development works, which almost always causes major pains in the back of the neck, and with use of proper tools should mean no more work for the developers. That's why we start with Vagrant as developer boxes should be as easy as vagrant up, but the meat of our product lies in Ansible which will do meat of the work and can be applied to almost anything: AWS, bare metal, docker, LXC, in open net, behind vpn - you name it.

      We must also give proper consideration to monitoring and logging hoovering at this point. My generic answer here is to grab Elasticsearch, Kibana, and Logstash. While for different use cases there may be better solutions, this one is well battle-tested, performs reasonably and is very easy to scale both vertically (within some limits) and horizontally. Logstash rules are easy to write and are well supported in maintenance through Ansible, which as I've mentioned earlier, are at the very core of things, and creating triggers/reports and alerts based on Elastic and Kibana is generally a breeze, including some quite complex aggregations.

      If we are happy with the state of the Ansible it's time to move on and put all those roles and playbooks to work. Namely, we need something to manage our CI/CD pipelines. For me, the choice is obvious: TeamCity. It's modern, robust and unlike most of the light-weight alternatives, it's transparent. What I mean by that is that it doesn't tell you how to do things, doesn't limit your ways to deploy, or test, or package for that matter. Instead, it provides a developer-friendly and rich playground for your pipelines. You can do most the same with Jenkins, but it has a quite dated look and feel to it, while also missing some key functionality that must be brought in via plugins (like quality REST API which comes built-in with TeamCity). It also comes with all the common-handy plugins like Slack or Apache Maven integration.

      The exact flow between CI and CD varies too greatly from one application to another to describe, so I will outline a few rules that guide me in it: 1. Make build steps as small as possible. This way when something breaks, we know exactly where, without needing to dig and root around. 2. All security credentials besides development environment must be sources from individual Vault instances. Keys to those containers should exist only on the CI/CD box and accessible by a few people (the less the better). This is pretty self-explanatory, as anything besides dev may contain sensitive data and, at times, be public-facing. Because of that appropriate security must be present. TeamCity shines in this department with excellent secrets-management. 3. Every part of the build chain shall consume and produce artifacts. If it creates nothing, it likely shouldn't be its own build. This way if any issue shows up with any environment or version, all developer has to do it is grab appropriate artifacts to reproduce the issue locally. 4. Deployment builds should be directly tied to specific Git branches/tags. This enables much easier tracking of what caused an issue, including automated identifying and tagging the author (nothing like automated regression testing!).

      Speaking of deployments, I generally try to keep it simple but also with a close eye on the wallet. Because of that, I am more than happy with AWS or another cloud provider, but also constantly peeking at the loads and do we get the value of what we are paying for. Often enough the pattern of use is not constantly erratic, but rather has a firm baseline which could be migrated away from the cloud and into bare metal boxes. That is another part where this approach strongly triumphs over the common Docker and CircleCI setup, where you are very much tied in to use cloud providers and getting out is expensive. Here to embrace bare-metal hosting all you need is a help of some container-based self-hosting software, my personal preference is with Proxmox and LXC. Following that all you must write are ansible scripts to manage hardware of Proxmox, similar way as you do for Amazon EC2 (ansible supports both greatly) and you are good to go. One does not exclude another, quite the opposite, as they can live in great synergy and cut your costs dramatically (the heavier your base load, the bigger the savings) while providing production-grade resiliency.

      See more