Announcing MongoDB 2.8.0-rc0 Release Candidate and Bug Hunt

Nov 12 • Posted 1 month ago

Bug Hunt Extended!

There’s still time to submit your bugs! Along with the recent announcement about MongoDB’s acquisition of WiredTiger, we’ve extended the Bug Hunt. You can file issues until January 6.

Announcing MongoDB 2.8.0-rc0 Release Candidate

We’re truly excited to announce the availability of the first MongoDB 2.8 release candidate (rc0), headlined by improved concurrency (including document-level locking), compression, and pluggable storage engines.

We’ve put the release through extensive testing, and will be hard at work in the coming weeks optimizing and tuning some of the new features. Now it’s your turn to help ensure the quality of this important release. Over the next three weeks, we challenge you to test and uncover any lingering issues by participating in our MongoDB 2.8 Bug Hunt. Winners are entitled to some great prizes (details below).

MongoDB 2.8 RC0

In future posts we’ll share more information about all the features that make up the 2.8 release. We will begin today with our three headliners:

Pluggable Storage Engines

The new pluggable storage API allows external parties to build custom storage engines that seamlessly integrate with MongoDB. This opens the door for the MongoDB Community to develop a wide array of storage engines designed for specific workloads, hardware optimizations, or deployment architectures.

Pluggable storage engines are first-class players in the MongoDB ecosystem. MongoDB 2.8 ships with two storage engines, both of which use the pluggable storage API. Our original storage engine, now named “MMAPv1”, remains as the default. We are also introducing a new storage engine, WiredTiger, that fulfills our desire to make MongoDB burn through write-heavy workloads and be more resource efficient.

WiredTiger was created by the lead engineers of Berkeley DB and achieves high concurrency and low latency by taking full advantage of modern, multi-core servers with access to large amounts of RAM. To minimize on-disk overhead and I/O, WiredTiger uses compact file formats, and optionally, compression. WiredTiger is key to delivering the other two features we’re highlighting today.

Improved Concurrency

MongoDB 2.8 includes significant improvements to concurrency, resulting in greater utilization of available hardware resources, and vastly better throughput for write-heavy workloads, including those that mix reading and writing.

Prior to 2.8, MongoDB’s concurrency model supported database-level locking. MongoDB 2.8 introduces document-level locking with the new WiredTiger storage engine, and brings collection-level locking to MMAPv1. As a result, concurrency will improve for all workloads with a simple version upgrade. For highly concurrent use cases, where writing makes up a significant portion of operations, migrating to the WiredTiger storage engine will dramatically improve throughput and performance.

The improved concurrency also means that MongoDB will more fully utilize available hardware resources. So whereas CPU usage in MongoDB has traditionally been fairly low, it will now correspond more directly to system throughput.

Compression

The WiredTiger storage engine in MongoDB 2.8 provides on-disk compression, reducing disk I/O and storage footprint by 30-80%. Compression is configured individually for each collection and index, so users can choose the compression algorithm most appropriate for their data. In 2.8, WiredTiger compression defaults to Snappy compression, which provides a good compromise between speed and compression rates. For greater compression, at the cost of additional CPU utilization, you can switch to zlib compression.
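
As a rough illustration, per-collection compression in a WiredTiger deployment can be selected when the collection is created. The sketch below assumes the storageEngine options described in the 2.8 Release Notes; the collection name is hypothetical and exact option names may change before the final release:

> // Hypothetical example: create a collection whose WiredTiger files use zlib
> // block compression instead of the snappy default
> db.createCollection(
    "events",
    { storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } } }
  )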

For more information, including how to seamlessly upgrade to the WiredTiger storage engine, please see the 2.8 Release Notes.

The Bug Hunt

The Bug Hunt rewards community members who contribute to improving MongoDB releases through testing. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn to test MongoDB against your development environment. We challenge you to test and uncover any remaining issues in MongoDB 2.8.0-rc0. Bug reports will be judged on three criteria: user impact, severity and prevalence.

All issues submitted against 2.8.0-rc0 will be candidates for the Bug Hunt. Winners will be announced on the MongoDB blog and user forum by December 9. There will be one first place winner, one second place winner and at least two honorable mentions. Awards are described below.

During the first Bug Hunt, for MongoDB 2.6, the community’s efforts were instrumental in improving the release and we’re hoping to get even more people involved in the Bug Hunt this time!

Bug Hunt Rewards

First Prize:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt

Second Prize:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt

Honorable Mentions:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started

  • Download MongoDB 2.8 RC0: You can download this release at MongoDB.org/downloads.
  • Deploy in your test environment: It is best to test software in a real environment with realistic data volumes and load. Help us see how 2.8 works with your code and data so that others can build and run applications on MongoDB 2.8 successfully.
  • Test new features and improvements: There is a lot of new functionality in MongoDB 2.8. See the 2.8 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira (Core Server project). See the documentation for a guide to submitting well-written bug reports.

Don’t Hunt Alone

If you’re new to database testing, you don’t have to do it alone your first time. Join one of our MongoDB User Groups this month to try hacking on the release candidate with MongoDB Performance and QA engineers. Here are some of the upcoming events:

Want to run a Bug Hunt at your local user group or provide a space for the community to hunt? Get in touch with the MongoDB Community team to get started.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan, Alvin and the MongoDB Team

Narrative Science Quill™ and MongoDB: A Match Made In Heaven

Nov 10 • Posted 1 month ago

This is a guest post by Craig Booth and Katy De Leon of Narrative Science

Big Problems with Big Data

Billions of dollars invested in Big Data have produced innumerable business intelligence and visualization tools. No amount of data is too large. No combination of algorithms is too many. No visualization is too complex. But no spreadsheet or dashboard can tell you what you need to know to improve your business. This you must figure out for yourself. Or rely on someone to figure it out for you.

At Narrative Science, we approach data in a different way. We find the insight hidden in data and communicate it in a form that makes natural sense to everyone: a written narrative. Whether it’s an earnings report, a system performance review or a product description, narratives leave you with nothing left to decipher. All you have to do is read.

Since Big Data is vital to what we do, we require a data storage solution that is reliable, high-performing, scalable, flexible and expressive regarding the structure and hierarchy necessary for generating plain English language. MongoDB meets all of these criteria and has been the backend of our entire infrastructure since 2010. Over the years, our MongoDB replica sets have remained available even during region-wide outages in the Amazon cloud computing centers. We also leverage our MongoDB deployments as a job queuing system, and as a store for temporary data for our web services (using TTL indexes).
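
As a small sketch of the TTL-index pattern mentioned above (the collection and field names here are hypothetical):

> // documents in "web_sessions" are removed one hour after their "createdAt" time
> db.web_sessions.ensureIndex( { createdAt: 1 }, { expireAfterSeconds: 3600 } )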

MongoDB also offers us schematic flexibility — a critical feature since we are constantly adding new data from a diverse range of industries and clients. Changes can be made iteratively and instantaneously, a huge advantage for us over a traditional SQL database. As our data requirements and client demands have grown ever more complex, MongoDB has grown with them, providing access to the data structures necessary to generate natural language directly from data.

Quill Automatically Explains MongoDB Usage

Quill™ is our automated narrative generation platform, powered by Artificial Intelligence. Quill integrates structured data from disparate sources, understands what is important to the end user and then generates perfectly written narratives that convey meaningful information to any intended audience. In essence, we do what data visualizations cannot. Although charts and dashboards look appealing, they still require people to explain them. Quill, on the other hand, adds value to data by identifying relevant data points and relaying them through professional, conversational language that people can immediately comprehend, act on and trust.

In order to diagnose potential performance and scaling issues before they affect us or our clients, we decided to put Quill on the task of monitoring our MongoDB usage. Here is a snapshot from a recent Quill-generated report explaining the weekly performance of one of our environments:

Data Use

This report, generated on July 25, 2014, compares today’s MongoDB usage to that from one week ago. Between last week and today:

  • The database “prod_ns4” has grown the most in the past week, increasing by 4.07 Gb. The collection inside of “prod_ns4” that drove most of this growth was “metadata”, which now contains 531,126 documents.
  • The total data size increased from 354.79 Gb to 365.31 Gb.
  • The total number of collections remained constant at 8407.
  • The number of documents has increased from 119.4 million to 121.1 million.
  • The following databases are the largest of those in your MongoDB instance: “prod_pub_service” (with 196 collections, containing a total of 18.4 million objects in 128 Gb) and “prod_harpoon” (with 25 collections, containing a total of 9.9 million objects in 44.54 Gb).
  • The following databases were dropped: “prod_assignment_service” and “stg_sla_monitor”.
  • No new databases were created in the past week.

Slow Queries

The three collections that accounted for the largest number of slow queries this week were ‘stg_finance.qm’, ‘stg_cft.tweets’, and ‘stg_statprovider.latest’ with 8675, 1405 and 601 slow queries respectively.

72% of the total number of slow queries were on the collection ‘stg_finance.qm’. Consider adding indexes (a sketch of one possible index follows the log excerpts below). The slowest queries were:

  • Wed Jul 16 05:01:48 [conn61548] query stg_finance.qm query: { $query: { ticker: “LTG”, exchange: “NYSE” }, $orderby: { date: -1 } } ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 numYields: 20 locks(micros) r:9929394 nreturned:1 reslen:110 7102ms
  • Wed Jul 16 04:53:55 [conn61548] query stg_finance.qm query: { $query: { ticker: “ATR”, exchange: “NYSE” }, $orderby: { date: -1 } } ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 numYields: 18 locks(micros) r:13086372 nreturned:1 reslen:110 7050ms
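
Queries like the ones above filter on ticker and exchange and sort by date descending, so a compound index covering both the filter and the sort should eliminate the in-memory scanAndOrder step and cut nscanned dramatically. A minimal sketch, using the collection named in the log:

> use stg_finance
> // supports equality on { ticker, exchange } plus the descending sort on date
> db.qm.ensureIndex( { ticker: 1, exchange: 1, date: -1 } )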

Better Transparency and Decision-Making

This machine-generated explanation distills thousands of numbers, covering over 5TB of data, into an easily digestible, relevant, and informative story that is sent to our engineering team every day and every week. It allows them to zero in on only those issues that are important and ignore the information that’s irrelevant to performance issues. And, unlike what you get from other monitoring tools, even a non-techy type can immediately understand the report. It gives our team what we need, when we need it: actionable insights specific to the performance of our MongoDB deployment.

These reports have already allowed us to make performance tweaks to our deployments. For example, when inefficient MongoDB queries make it into production code, Quill immediately flags them so we can place indexes in the relevant places. It also makes our lives easier by providing transparency into which projects are hogging data space so we can make the relevant collections smaller, either by making different choices about what data we store, or adding TTL indexes that allow irrelevant data to expire.

MongoDB is a powerful tool for dealing with the deeply nested documents that are needed for Natural Language Generation (NLG). It is a tool that is essential to Quill’s performance. At the same time, natural language reporting from Quill provides us with a powerful way to continually understand and improve our MongoDB usage.

Want to learn how Quill can immediately start adding value to your business? We’ve written a white paper that explains how Quill works and outlines examples of how companies are using it today. You can also visit our website, or contact us directly to see a demonstration.

Four Ways to Optimize Your Cluster With Tag-Aware Sharding

Nov 5 • Posted 1 month ago

By Francesca Krihely, Community Manager and Asya Kamsky, Principal Developer Advocate

MongoDB supports partitioning data using either a user-defined shard key index (composed of one or more fields present in all documents) or a hashed index on a single field (which also must be present in all documents). A lesser known feature is tag-aware sharding, which allows you to create (and adjust) policies that affect where documents are stored based on user-defined tags and tag ranges.

Tag-aware sharding is a very versatile feature: it can be used to maximize the performance of your cluster, optimize storage, and implement data center-awareness in your application. In this post, we’ll go through a few use cases for tag-aware sharding.

Optimizing Physical Resources

In some cases, an application needs to store data that is not frequently accessed. Financial institutions are often required to save transaction details or claims for a number of years to stay in compliance with government regulation. It can be very expensive to keep several years’ worth of data on high-performance hardware, so a useful approach would be to use periodic batch jobs to move data to more cost-effective archive storage—an operational headache for most applications. There are a number of failure points when moving data from one machine to the next and there is a lot of tedious application logic needed in order to execute these operations. Using MongoDB’s tag-aware sharding feature, you could assign tags to different storage tiers, associate a shard key range with each tag, and documents will move from one shard to the next during normal balancing operations (a sketch of the commands involved follows below). To ensure you’re moving the right number of documents from one machine to the next over time, you could write a script to update shard tags once a month. This method is great because it keeps all the data in one database, so your application code doesn’t need to change as data moves between storage tiers. You can read an in-depth overview of this implementation in our blog post on tiered storage.
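
As a rough sketch of the commands involved (the shard names, namespace and tag ranges below are hypothetical), you might tag one shard as a “recent” tier and another as an “archive” tier, then attach shard key ranges to those tags:

> sh.addShardTag("shard0000", "recent")
> sh.addShardTag("shard0001", "archive")
> // documents whose date falls before 2014 migrate to the archive tier
> sh.addTagRange("records.transactions",
      { date: MinKey }, { date: ISODate("2014-01-01") }, "archive")
> sh.addTagRange("records.transactions",
      { date: ISODate("2014-01-01") }, { date: MaxKey }, "recent")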

Location-Based Separation of Data

If you have a large, global user base, it is often preferable to store specific user information in localized data centers. In this case, you can use shard tags to pin documents to specific data centers. There is a great tutorial on this in the MongoDB Manual: http://docs.mongodb.org/manual/data-center-awareness/. While this is a classic use case for tag-aware sharding, there are a few things to look out for when implementing this approach yourself.

Remember that shard keys are immutable: you cannot change the value of a document’s shard key. So if your shard key includes a user’s location and that user moves, you will need to delete and re-insert the document to update it. See more about shard keys in our blog post here.

You should include the shard key with every operation to avoid scattering the query to all shards. This may seem counter-intuitive if your shard key is “region, customerId”, but it’s necessary to make sure that only shards holding a particular value of the “region” field are sent the query, as illustrated below.
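
For example, with a shard key of { region: 1, customerId: 1 }, including the region value lets mongos route the query only to the shards tagged for that region (the collection and field values below are illustrative):

> // targeted: only the shards holding "EU" chunks receive the query
> db.users.find( { region: "EU", customerId: 12345 } )
> // scattered: without the region prefix, every shard must be queried
> db.users.find( { customerId: 12345 } )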

If you’re interested in learning more about this example, take a look at our new white paper on multi datacenter deployments which discusses sharding along with other configurations for distributing databases across regions.

Balancing unsharded collections across a sharded cluster

Sharding can help with many dimensions of scalability, including read and write scalability. With tag-aware sharding, you can take this one step further by balancing collections across shards to distribute read and write load. This would happen in a case where you have a large number of medium-sized collections on a single shard — not typically big enough to warrant sharding — that make your workload unbalanced. Instead of splitting all of these medium-sized collections across all shards, you can distribute them “round-robin” style — one or several collections living on each shard. A minimal sketch of one way to implement that follows below.
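
One common approach (a sketch; the database and shard names are hypothetical) is to keep each unsharded collection, or group of collections, in its own database and move that database’s primary shard so the collections end up spread around the cluster:

> // move the "tenant_17" database, and the unsharded collections it holds, to shard0002
> db.adminCommand( { movePrimary: "tenant_17", to: "shard0002" } )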

In the wild, we have seen this implemented for a “platform as a service” (i.e. an application development platform). If you have many tenants — or individual apps — and none of them are large enough to need sharding, there still may be too many to live on a single replica set. Placing each tenant’s application on an individual shard helps distribute the load by putting individual collections onto specific shards or a subset of shards. You would not have to partition any of the collections (which you would do if you sharded them), but they would live on individual shards, enabling you to balance your read and write load.

Targeting data to individual servers

Balancing collections can also be used to optimize physical resources. In some cases, your collections will have heavy indexes that are taking up a lot of memory. Rather than spreading them out on different machines, you can target that collection to a shard with more RAM, allowing you to separate it from other collections to reduce contention for limited physical resources.

There are a number of ways to implement tag-aware sharding into your infrastructure to increase read and write efficiency, get better use of your hardware and improve overall performance in your application. If you want to learn more:

Need help with your sharding configuration? MongoDB’s Technical Support Engineers can help you plan and architect your sharding setup and offer you guidance to ensure you’re on the right track. Learn more about MongoDB’s 24x365 Production Support, which will offer you proactive advice on your MongoDB deployment.

MongoDB LDAP and Kerberos Authentication with Centrify

Nov 4 • Posted 1 month ago

By Alex Komyagin at MongoDB, with the help of Felderi Santiago and Robertson Pimentel at Centrify

Overview

Centrify provides unified identity management solutions that result in single sign-on (SSO) for users and a simplified identity infrastructure for IT. Centrify’s Server Suite integrates Linux systems into Active Directory domains to enable centralized authentication, access control, privileged user management and auditing for compliance needs.

Since version 2.4, MongoDB Enterprise allows authentication with Microsoft Active Directory Services using LDAP and Kerberos protocols. On Linux systems it is now possible to leverage Centrify’s Server Suite solution for integrating MongoDB with Active Directory.

The use of Centrify’s Active Directory integration with MongoDB greatly simplifies the setup process and allows MongoDB to seamlessly integrate into the most complex Active Directory environments found at enterprise customer sites with hundreds or thousands of employees.

Requirements

  • Existing Active Directory domain
  • MongoDB Enterprise 2.4 or greater
  • Centrify Suite

All further MongoDB commands in this post are given for the current latest stable release, MongoDB 2.6.5. The Linux OS used is RHEL 6.4. The Centrify Server Suite version is 2014.1.

Setup procedure

Preparing a new MongoDB Linux server

In existing enterprise environments that are already using Centrify and MongoDB there are usually specific guidelines on setting up Linux systems. Here we will cover the most basic steps needed, which can be used as a quick reference:

1. Configure hostname and DNS resolution

For Centrify and MongoDB to function properly you must set a hostname on the system and make sure it’s configured to use the proper Active Directory-aware DNS server instance IP address. You can update the hostname using commands that resemble the following:

$ nano /etc/sysconfig/network
HOSTNAME=lin-client.mongotest.com
$ reboot
$ hostname -f
lin-client.mongotest.com

Next, verify the DNS settings and add additional servers, if needed:

$ nano /etc/resolv.conf
search mongotest.com
nameserver 10.10.42.250

2. Install MongoDB Enterprise

The installation process is well outlined in our Documentation. It’s recommended to turn SELinux off for this exercise:

$ nano /etc/selinux/config
SELINUX=disabled

Since MongoDB grants user privileges through role-based authorization, an LDAP user and a Kerberos user should be created in MongoDB:

$ service mongod start
$ mongo
> // user for LDAP (PLAIN) authentication
> db.getSiblingDB("$external").createUser(
    {
      user : "alex",
      roles: [ { role: "root", db: "admin" } ]
    }
)
> // user for Kerberos (GSSAPI) authentication
> db.getSiblingDB("$external").createUser(
    {
      user: "alex@MONGOTEST.COM",
      roles: [ { role: "root", db: "admin" } ]
    }
)

“alex” is a user in AD who is a member of the “Domain Users” group and belongs to the “support” Organizational Unit.

3. Install Centrify agent

Unpack the Centrify suite archive and install the centrify-dc package. Then join the server to your domain as a workstation:

$ rpm -ihv centrifydc-5.2.0-rhel3-x86_64.rpm
$ adjoin -V -w -u ldap_admin mongotest.com
ldap_admin@MONGOTEST.COM's password:

Here “ldap_admin” is a user who is a member of the “Domain Admins” group in AD.

Setting up MongoDB with LDAP authentication using Centrify

The Centrify agent manages all communications with Active Directory, and MongoDB can use the Centrify PAM module to authenticate LDAP users.

1. Configure saslauthd, which is used by MongoDB as an interface between the database and the Linux PAM system.

a. Verify that “MECH=pam” is set in /etc/sysconfig/saslauthd:

$ grep ^MECH /etc/sysconfig/saslauthd
MECH=pam

b. Turn on the saslauthd service and ensure it is started upon reboot:

$ service saslauthd start
Starting saslauthd:                                     [  OK  ]
$ chkconfig saslauthd on
$ chkconfig --list saslauthd
saslauthd  0:off   1:off   2:on    3:on 4:on    5:on    6:off

2. Configure PAM to recognize the mongodb service by creating an appropriate PAM service file. We will use the sshd service file as a template, since it should’ve already been preconfigured to work with Centrify:

$ cp -v /etc/pam.d/{sshd,mongodb}
`/etc/pam.d/sshd' -> `/etc/pam.d/mongodb'

3. Start MongoDB with LDAP authentication enabled, by adjusting the config file:

$ nano /etc/mongod.conf
auth=true
setParameter=saslauthdPath=/var/run/saslauthd/mux
setParameter=authenticationMechanisms=PLAIN
$ service mongod restart

4. Try to authenticate as the user “alex” in MongoDB:

$ mongo
> db.getSiblingDB("$external").auth(
   {
     mechanism: "PLAIN",
     user: "alex",
     pwd:  "xxx",
     digestPassword: false
   }
)
1
>

Returning a value of “1” means the authentication was successful.

Setting up MongoDB with Kerberos authentication using Centrify

The Centrify agent automatically updates the system Kerberos configuration (the /etc/krb5.conf file), so no manual configuration is necessary. Additionally, Centrify provides the means to create an Active Directory service user, service principal name and keytab file directly from the Linux server, thus making automation easier.

1. Create the “lin-client-svc” user in Active Directory with SPN and UPN for the server, and export its keytab to the “mongod_lin.keytab” file:

$ adkeytab -n -P mongodb/lin-client.mongotest.com@MONGOTEST.COM -U mongodb/lin-client.mongotest.com@MONGOTEST.COM -K /home/ec2-user/mongod_lin.keytab -c "OU=support" -V --user ldap_admin lin-client-svc
ldap_admin@MONGOTEST.COM's password:
$ adquery user lin-client-svc -PS
userPrincipalName:mongodb/lin-client.mongotest.com@MONGOTEST.COM
servicePrincipalName:mongodb/lin-client.mongotest.com

Again, “ldap_admin” is a user who is a member of the “Domain Admins” group in AD. The “support” OU will be used to create the “lin-client-svc” service user.

2. Start MongoDB with Kerberos authentication enabled, by adjusting the config file. You also need to make sure that mongod listens on the interface associated with the FQDN. For this exercise, you can just configure mongod to listen on all interfaces:

$ nano /etc/mongod.conf
# Listen to local interface only. Comment out to listen on all interfaces.
#bind_ip=127.0.0.1
auth=true
setParameter=authenticationMechanisms=GSSAPI
$ service mongod stop
$ env KRB5_KTNAME=/home/ec2-user/mongod_lin.keytab mongod -f /etc/mongod.conf

3. Try to authenticate as the user “alex@MONGOTEST.COM” in MongoDB:

$ kinit alex@MONGOTEST.COM
Password for alex@MONGOTEST.COM:
$ mongo --host lin-client.mongotest.com
> db.getSiblingDB("$external").auth(
   {
     mechanism: "GSSAPI",
     user: "alex@MONGOTEST.COM",
   }
)
1
>

The return value of “1” indicates success.

Summary and more information

MongoDB supports different options for authentication, including Kerberos and LDAP external authentication. With MongoDB and Centrify integration, it is now possible to speed up enterprise deployments of MongoDB into your existing security and Active Directory infrastructure and ensure quick day-one productivity without expending days and weeks of labor dealing with open-source tools.

About Centrify

Centrify is a leading provider of unified identity management solutions that result in single sign-on (SSO) for users and a simplified identity infrastructure for IT. Centrify’s Server Suite software integrates Linux systems into Active Directory domains to enable centralized authentication, access control, privileged user management and auditing for compliance needs. Over the last 10 years, more than 5,000 customers around the world, including nearly half of the Fortune 50, have deployed and trusted Centrify solutions across millions of servers, workstations, and applications, and have regularly reduced their identity management and compliance costs by 50% or more.

Video tutorials

Video on how to use Centrify to integrate MongoDB with Active Directory:

Video on how to enforce PAM access rights as an additional security layer for MongoDB with Centrify:

Centrify Community post and videos showcasing Active Directory integration for MongoDB: http://community.centrify.com/t5/Standard-Edition-DirectControl/MongoDB-AD-Integration-made-easy-with-Centrify/td-p/18779

MongoDB security documentation is available here: http://docs.mongodb.org/manual/security/
MongoDB user and role management tutorials: http://docs.mongodb.org/manual/administration/security-user-role-management/

MongoDB University MCP Spotlight: Nestor Campos

Oct 29 • Posted 1 month ago

As part of MongoDB University’s 2 Year Anniversary, we are sharing stories from our MongoDB University students to showcase how they got started with MongoDB and where they’ve gone since graduation.

This is a guest post by Nestor Campos, a software engineer and consultant from Chile and a Certified MongoDB Developer, who will share his experience learning MongoDB through MongoDB University.

As a software engineer, I know I always need to keep up with new technologies to stay relevant in the market.

After several years working with relational databases, I found that they were not meeting the needs of all my projects, so after doing some research, I found “non-relational” (NoSQL) databases and started experimenting with them. I decided to work with MongoDB for its simplicity, ease of use and wide range of use cases.

As soon as I started with MongoDB, MongoDB University opened, and I enrolled in the first course, M101P. It was difficult at first, but I learned quickly and working with MongoDB became more natural to me every day of the course.

After passing the M101 course, it was time to validate my work and get certified. I signed up to take the Developer Certification exam, studied for several months, and passed it successfully. I was very excited and, after getting certified, was motivated to continue working with MongoDB.

All of the courses and certification I’ve completed through MongoDB University have helped give a new vision to the projects I work on, and opened a range of possibilities for my entire organization.

The knowledge I built through MongoDB University allowed me to give back to the community and get ahead at work. After getting certified, I taught some workshops on Big Data. Next week, we are starting a large project that will involve collecting and manipulating data from social networks, and MongoDB has already been chosen as the core database to meet our objectives. Now I feel like a real Big Data engineer.

I am now certified as a MongoDB Developer and my goal is to become a certified MongoDB DBA next year. I hope more community members will take advantage of MongoDB certification and share the benefits of MongoDB within their organizations and to their clients.

260,000 Students Later: MongoDB University Celebrates its Two Year Anniversary

Oct 24 • Posted 1 month ago

By Shannon Bradshaw, Director of Education at MongoDB

We’re at MongoDB SF 2012 and Andrew Erlichson has a crazy idea. “Let’s create free online classes on MongoDB and make them available to everyone in the world.” CEO Dwight Merriman and President Max Schireson love the idea, and within days a small team at MongoDB begins building out the first two courses, M101 and M102.

In October 2012, we open enrollment for the first MongoDB University courses. Andrew and Dwight announce the courses over a webcast from New York City. By the time Andrew and Dwight walk from the webcast back to their desks 1,000 people have already registered for the inaugural courses.

Two years later, we’ve expanded MongoDB University to help professionals around the globe build their MongoDB expertise from start to finish. With 260,000+ registrations, MongoDB University now hosts five online MongoDB courses, making it easier than ever to get started building and administering MongoDB. In collaboration with Udacity, we created a data wrangling course to help Data Scientists take advantage of MongoDB.

MongoDB professionals can also formalize their expertise through our growing Certification program for Application Developers and Database Administrators. Finally, we expanded our in-person training to include classes on MongoDB Diagnostics and Debugging, Advanced Data Modeling, MongoDB Advanced Operations, and more. Through these courses, we reach thousands of individuals each year and help them get their MongoDB projects off the ground. Needless to say, we are excited about the next two years of MongoDB University.

As we celebrate the two year anniversary of MongoDB University, I wanted to take this time to thank our students and the collaborators who have given us the critical feedback we need to build excellent online education and in-person training experiences. Over the next two weeks, we will be featuring stories from our alumni and MongoDB Certified Professionals to highlight the amazing things they are doing with MongoDB. Finally, a huge thank you to all of our MongoDB University instructors and teaching assistants who help our students complete their courses with success.

If you haven’t started with MongoDB University, have a look at the course catalogue. The next sessions of MongoDB for Node.js Developers and MongoDB for Python Developers are still open for registration.

Sharding Pitfalls Part III: Chunk Balancing and Collection Limits

Oct 22 • Posted 1 month ago

In Parts 1 and 2 we covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle but important distinction in terms of balancing a sharded cluster, as well as an interesting limitation that can be worked around relatively easily but is nonetheless surprising when it comes up.

6. Chunk balancing != data balancing != traffic balancing

The balancer in a sharded cluster cares about just one thing:

Are chunks for a given collection evenly balanced across all shards?

If they are not, then it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved the logic is pretty straightforward. If we assume that all chunks are equal, then we can rest assured that our data is being evenly balanced across all the shards in our cluster and rest easy at night.

Although that is sometimes, perhaps even frequently, the case, it is not always true - chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB), completely empty chunks and everything in between.

Let’s use an example from our first pitfall, the monotonically increasing shard key. For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster and so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster.

Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes:

Table 1 - Initial Chunk Layout

Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly. It will do this by moving the lowest range chunks to the new shard until the counts are identical. Once it is finished balancing, our table now looks like this:

Table 2 - Balanced Chunk Layout

That looks pretty good at the moment, but let’s imagine that more recent chunks are more likely to see more activity (updates, say) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%) despite the chunks seeming balanced overall based on the approximate size. Hence, chunk balancing is not equal to traffic balancing.

Using that same example, let’s add another wrinkle - periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015).

Table 3 - Post-Delete Chunk Layout

The distribution of data is now completely skewed toward shard1 - shard2 is in fact empty! However, the balancer is completely unaware of this imbalance - the chunk count has remained the same the entire time, and as far as it is concerned the system is in a steady state. With no data on shard2, our traffic imbalance as seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection.

Possible Mitigation Strategies

  • If data and traffic balance are important, select an appropriate shard key
  • Move chunks manually to address the imbalances - swap “hot” chunks for “cool” chunks, empty chunks for larger chunks

7. Waiting too long to shard a collection (collection too large)

This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when it is initially split, which is a function of the chunk size and data size as noted on the limits page.

If your collection contains less than 256GiB of data, then there will be no issue. If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken.

Otherwise, with larger initial data sizes and the default settings, the initial split will fail. It is worth noting that, once split, the collection may grow as needed and without any real limitations as long as you can continue to add shards as data size grows.

Possible Mitigation Strategies

Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, then the remaining variable is the chunk size. This is adjustable (default is 64MiB) and can be raised in order to let a large collection split initially and then reduced once that has been completed.

The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out - simply divide your data size by 256GiB and then multiply that figure by 64MiB (and round up if it is not a nice even number). As an example, let’s consider a 4TiB collection:

4TiB ÷ 256GiB = 16; 16 × 64MiB = 1024MiB

Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and finally reduce the chunk size back to 64MiB using the same procedure.
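
The chunk size setting lives in the settings collection of the config database, so the adjustment might look like the following sketch, run against a mongos:

> use config
> db.settings.save( { _id: "chunksize", value: 1024 } )  // raise to 1024MiB for the initial split
> // ... shard the large collection ...
> db.settings.save( { _id: "chunksize", value: 64 } )    // restore the default afterwards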

Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.

MongoDB Management Service Re-imagined: The Easiest Way to Run MongoDB

Oct 14 • Posted 2 months ago

Discuss on Hacker News

We consistently hear that getting started with MongoDB is easy, but scaling to large configurations that include replication and sharding can be challenging. With MMS, it is now much easier.

Today we introduced major enhancements to MongoDB Management Service (MMS) that make it significantly easier to run MongoDB. MMS is now centered around the experience of deploying and managing MongoDB on the infrastructure of your choice. You can now deploy a cluster through MMS and then monitor your deployment. You can also optionally back up your MongoDB deployment directly to MongoDB, Inc. Once deployed, you can upgrade or scale a cluster in just a few clicks.

How It Works

MMS works by communicating with an automation agent on each server. The automation agent contacts MMS and gets instructions on the goal state of your MongoDB deployment.

MMS can deploy MongoDB replica sets, sharded clusters and standalones on any Internet-connected server. The servers need only be able to make outbound TCP connections to MMS.

MMS Backup is built directly into MMS. You can enable continuous backup in just a few clicks as you deploy a cluster.

The Infrastructure of Your Choice

By “infrastructure of your choice” we mean that MMS can run and control MongoDB in public cloud, private data center or even your own laptop. For AWS users, we can control virtual machine provisioning directly from MMS.

For example, you might start 20 servers at Google Compute Engine, put the MMS Automation Agent on each server and then launch a new sharded cluster on the servers through MMS.

If you use AWS, you can insert your AWS keys directly into MMS, and MMS will provision EC2 servers for you and start the MMS automation agent. Hence, with AWS, deploying MongoDB is even simpler.

Bringing your own infrastructure has some advantages. The database is not an island. It must interact with your application. With MMS, you can put your database servers in security zones that you design and be assured that the different pieces of the architecture are in the right places for fault tolerance. For example, if you use AWS, deploying MongoDB across availability zones is now a single click away.

Why We’re Excited

We believe MMS is a quantum leap forward for MongoDB developers and operators. Developers can get MongoDB running much more quickly without understanding the vagaries of installation. Ops can confidently create scalable, fault tolerant, backed up, monitored deployments with a small fraction of the work.

For those who have been using MMS for a long time, you know that it was previously a free monitoring and paid backup service. That classic version of MMS is closed to new users. But if you have a classic account it will continue to work the same way it always has.

As much as we are releasing today, we have barely begun to scratch the surface of what is possible with MMS, so expect even more in the future.

We hope you find MMS useful in running MongoDB at scale. You can open a free account and get started at mms.mongodb.com.

Pricing

The new MMS is free for up to eight servers, so most users won’t need to pay anything to run MongoDB through MMS. Full pricing details are available at mms.mongodb.com.

Announcing The William Zola Outstanding Contributor Award

Oct 9 • Posted 2 months ago

Nominate a community champion for the William Zola Outstanding Contributor Award here

We place a tremendous value on community engagement at MongoDB, specifically around working with you on your toughest technical problems. Even in the early days of the project, MongoDB’s co-founders Dwight and Eliot were committed to the success of all users, and we want to continue that spirit as the community grows.

Today we are pleased to announce the William Zola Award for Community Excellence to honor those whose support contributions make a significant difference to users around the globe.

It’s not easy to be a champion at community support, and yet, at MongoDB we see it every day, from community members who provide advice on the MongoDB User Forum, to our own engineers who diagnose and debug problems for users around the globe.

One of our strongest Community Support advocates was William Zola, who passed away unexpectedly this past summer. William, a Lead Technical Services Engineer, had a passion for creating user success, and helped thousands of users with support problems, much of it on his own time. William was so effective at meeting users in their time of distress that people often asked for him by name on the MongoDB User Forum. Most engineers at MongoDB went through his customer skills training to learn how to create an ideal user experience while maintaining technical integrity. William taught us:

  • How the user feels is every bit as important as solving their technical problem
  • We should work to solve the problem and not just close a case or ticket
  • Every user interaction should drive the case one step closer to resolution
  • It’s all about the user

Over time, William’s advice and philosophy towards user success came to permeate MongoDB’s entire organization and community.

The Award

This December at MongoDB SF we will announce the first winner of what we will call “The Zola”, awarded to a user who has offered exceptional support to our community in line with William’s philosophy. The winner of the Zola will receive a complimentary trip to San Francisco to receive the award at the event along with a $1,000 Amazon Gift Card and an award plaque.

Today we open nominations and begin the search for the first winners of the Zola. MongoDB users who support others on StackOverflow, the MongoDB User Forum, and in person through ad-hoc or structured mentoring are all qualified to receive the award.

Nominations will be accepted until November 10 through this form, so please send in names of people who have positively impacted your experience with MongoDB during your time as a user. Individuals will be judged on the impact of their work and their demonstration of William’s values.

William’s extraordinary contributions are remembered in users like you who pass along your knowledge of MongoDB and do it with gusto. Even if you do not qualify for the Zola now, there is always an opportunity for you to contribute to the MongoDB ecosystem by sharing your ideas and experience on StackOverflow, the MongoDB User Forum and in your local communities.

Tell us who should receive the first “Zola.”

Discuss on Hacker News

MongoDB Employees are not eligible for the “Zola”

Sharding Pitfalls Part II: Running a Sharded Cluster

Oct 8 • Posted 2 months ago

By Adam Comerford, Senior Solutions Engineer

In Part I we discussed important considerations when picking a shard key. In this post we will go through some recommendations when running a sharded cluster at scale. Scalability is one of the core benefits of sharding in MongoDB but this can give you a false sense of security; even with that flexibility, you still have to make smart decisions about how and when you deploy resources. In this post, we will cover a couple of common mistakes that people tend to make when it comes to running a sharded cluster.

3. Waiting too long to add a new shard (overloaded)

You sharded your database and scaled horizontally for a reason, perhaps it was to add more memory or disk capacity. Whatever the reason, if your application usage grows over time so (generally) does your database utilization. Eventually, your current sharded cluster will pass a certain point, let’s call it 80% utilized (as a nice round estimate), such that it becomes problematic to add another shard. Why? Well, adding a new shard to a cluster is not free, and it is not instantaneous. It consumes resources and (initially) accepts very little traffic.

Essentially, at the start of its existence, a newly added shard costs you capacity instead of adding capacity. The length of time it will stay in this state will depend on the balancer and how long it takes for a significant portion of “busy/active” chunks to move onto the new shard.

It can often be easier to visualize this process, so let’s make up some hypothetical numbers and set the bar relatively low. Our imaginary existing cluster will be a set of 2 shards, with 2000 chunks (500 considered “active”) and to that we need to add a 3rd shard. This 3rd shard will eventually store one third of the active chunks (and total chunks). The question is, when does this shard stop adding overhead overall and instead become an asset?

In reality, this will vary from cluster to cluster and have a lot of dependencies and variables - in other words you need to have good metrics about your cluster, particularly your load bottleneck.

Therefore we will once again use our imaginations and go with a relatively low bar: when 5% of active chunks—that is, those chunks seeing the most traffic—have migrated to the new shard, you should expect a net gain in performance. In our imaginary system we have evaluated our load levels and the expected impact of migrations, and have determined that once that 5% threshold of active chunks has been migrated to the new shard, the shard can be considered a net gain for the overall system. Once all chunks have been balanced, the migration overhead disappears, but initially this is an expected trade-off.

This chart shows how long it would take for new shards to reach net positive contribution in your cluster (the dotted line implies net gain):

In this fabricated example, it takes almost 2 hours for the new shard to attain a viable level of active chunks and be considered a net gain for the overall system. Although these numbers are fictional, they are based on setups we have seen in real systems with moderate load.

From there it is relatively easy to imagine this set of migrations taking even longer on an overloaded set of shards, and taking far longer for our newly added shard to cross the threshold and become a net gain. As such it is best to be proactive and add capacity before it becomes a necessity.

Possible Mitigation Strategies

  • Manually balance targeted “hot” chunks (chunks that are accessed more than others) to move activity to the new shard more quickly (see the sketch after this list)
  • Add the shard at a low-traffic time so that there is less competition for resources
  • Disable balancing on some collections, and prioritize balancing busy collections first
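
A manual migration of a known hot chunk might look like the following sketch (the namespace, shard key value and shard name are hypothetical):

> // move the chunk containing this shard key value onto the newly added shard
> sh.moveChunk( "mydb.orders", { customerId: 4021 }, "shard0002" )
> // optionally pause balancing of a low-priority collection so busier ones migrate first
> sh.disableBalancing( "mydb.audit_log" )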

4. Under-provisioning Config Servers

Provisioning enough resources without being wasteful is always tricky, and all the more so in a complicated distributed system like a MongoDB sharded cluster. Everyone wants to use their hardware, virtual instances, virtual machines, containers and the like in the most efficient way possible, and get the best bang for their buck. Hence it is only natural to take a look at the various pieces of a distributed cluster and look for lower utilized pieces that could be put on less expensive resources.

The most common pitfall here with MongoDB is the config servers, which are often neglected when stress testing a cluster. In testing environments and smaller deployments (unless specific measures are taken to stress them) they are relatively lightly loaded and usually identified as candidates for lesser instances/hardware.

The problem is that these are critical pieces of infrastructure. They may not be heavily loaded all the time, but when they do see load and struggle to service requests, that can impact all queries (reads, writes, authentication) and add latency to all requests made of the cluster in question.

In particular, the first config server in the list supplied to your mongos processes is vital. This is the config server that all mongos processes will default to read from when fetching or refreshing their view of the data distribution in your cluster. Similarly, this is the server that will be hit when attempting to authenticate a user. If it is under-provisioned and cannot service queries, or if it has problems with networking (packet loss, congestion), then the effects will be significant.

Possible Mitigation Strategies

  • Ensure the config servers are load tested and slightly over-provisioned (the first config server in particular)
  • If using virtual machines or cloud-based instances, investigate increasing available resources
  • Turning off the balancer and disabling chunk splitting will reduce the chances of high read traffic to the config servers (no migrations, no metadata refreshes), but this is only a temporary fix unless you have a perfect write distribution, and it may not eliminate issues completely (a sketch of the relevant commands follows this list)
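
Stopping the balancer as a temporary measure can be done from any mongos; a sketch:

> sh.stopBalancer()      // waits for any in-progress migration round to finish
> sh.getBalancerState()  // verify that balancing is off
false
> sh.startBalancer()     // re-enable once the config servers are healthy again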

5. Using the count() command on sharded collections

This pitfall is very common, and it seems to hit somewhat randomly in terms of how long someone has been running a sharded environment. At some point, a question will arise along the lines of:

“How are we tracking/verifying/checking how many documents we have in each collection on each shard, how balanced are they and do they agree with <some other system that holds the same data>?”

Hopefully no one is actually constructing questions this way in your organization, but you get the basic idea. The most obvious way to do a quick check on this type of thing is to count the documents and see if the numbers make sense and/or agree with counts elsewhere. That thinking naturally leads people to the count command and they proceed to use it to gather figures for their documents and collections.

Unfortunately, on a busy, mature sharded cluster, the results will very rarely be what is expected. The reason for this is that the count command as implemented today has several optimizations in place to make it faster to run in general, and those speed optimizations essentially bypass a key piece of the sharding functionality needed to return accurate results in this case. This is a known bug and is being tracked in SERVER-3645, but that does not stop people from consistently hitting this issue. The nature of the issue means that count will report documents in the results that it should not, for example:

  • Documents that are being deleted as part of a chunk migration
  • Documents that have been left behind from previous chunk migrations (also known as orphans)
  • Documents currently being copied as part of an in-flight chunk migration

A regular query (rather than a count) will have its results filtered by the respective primary and not suffer from the same problem. Hence, if you were to manually count the results from a query client-side you would get an accurate result.

This quirk of sharded environments will eventually be fixed, but for now it will inevitably crop up from time to time in all active sharded clusters used by a large team.

Possible Mitigation Strategies

  • Do counts on the client side, or use targeted, range-based queries (with a primary read preference) to count instead
  • Use cleanupOrphaned and disable the balancer (making sure it has finished its current round) when performing counts across the cluster (a sketch of both approaches follows this list)
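
A sketch of both approaches (the namespace is hypothetical; cleanupOrphaned is run against each shard’s primary rather than through a mongos, and the balancer should be stopped first):

> // client-side count: iterate the cursor instead of using the count command
> db.orders.find( { region: "EU" } ).itcount()
> // on each shard primary: remove orphaned documents left behind by past migrations
> db.adminCommand( { cleanupOrphaned: "mydb.orders" } )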

If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.

Read the whole series

Announcing the Server Discovery and Monitoring Spec

Oct 3 • Posted 2 months ago

By Jesse Davis, Python Engineer at MongoDB

[Image: Space Shuttle Discovery]

Last week, we published a draft of the Server Discovery And Monitoring Spec for MongoDB drivers. This spec defines how a MongoDB client discovers and monitors a single server, a set of mongoses, or a replica set. How does the client determine what types of servers they are? How does it keep this information up to date? How does the client find an entire replica set from a seed list, and how does it respond to a stepdown, election, reconfiguration, or network error?

In the past each MongoDB driver answered these questions a little differently, and mongos differed a little from the drivers. We couldn’t answer questions like, “Once I add a secondary to my replica set, how long does it take for the driver to discover it?” Or, “How does a driver detect when the primary steps down, and how does it react?”

From now on, all drivers answer these questions the same. Or, where there’s a legitimate reason for them to differ, there are as few differences as possible and each is clearly explained in the spec. Even in cases where several answers seem equally good, drivers agree on one way to do it.

The server discovery and monitoring method is specified in five sections. First, a client is constructed. Second, it begins monitoring the server topology by calling the ismaster command on all servers. (The algorithm for multi-threaded and asynchronous clients is described separately from single-threaded clients.) Third, as ismaster responses are received the client parses them, and fourth, it updates its view of the topology. Finally, the spec describes how drivers update their topology view in response to errors.
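
The ismaster command at the heart of this monitoring can be run by hand from the shell; against a replica set member the response carries the set name, the member’s current role and the host list from which the client builds its view of the topology (the host names below are illustrative, and the real response contains additional fields):

> db.runCommand( { ismaster: 1 } )
{
    "setName" : "rs0",
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [ "host1:27017", "host2:27017", "host3:27017" ],
    "primary" : "host1:27017",
    "ok" : 1
}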

I’m particularly excited about the unit tests that accompany the spec. We have 37 tests that are specified formally in YAML files, with inputs and expected outcomes for a variety of scenarios. For each driver we’ll write a test runner that feeds the inputs to the driver and verifies the outcome. This ends confusion about what the spec means, or whether all drivers conform to it.

The Java driver 2.12.1 is the spec’s reference implementation for multi-threaded drivers, and I’m making the upcoming PyMongo 3.0 release conform to the spec as well. Mongos 2.6’s replica set monitor is the reference implementation for single-threaded drivers, with a few differences. The upcoming Perl driver 1.0 implements the spec orthodoxly.

Once we have multiple reference implementations and the dust has settled, the draft spec will be final. We’ll bring the rest of our drivers up to spec over the next year.

You can read more about the Server Discovery And Monitoring Spec at these links:

We have more work to do. For one thing, the Server Discovery And Monitoring Spec only describes how the client gathers information about your server topology—it does not describe which servers the client uses for operations. My Read Preferences Spec only partly answers this second question. My colleague David Golden is writing an improved and expanded version of Read Preferences, which will be called the Server Selection Spec. Once that spec is complete, we’ll have a standard algorithm for all drivers that answers questions like, “Which replica set member does the driver use for a query? What about an aggregation? Which mongos does it use for an insert?” It’ll include tests of the same formality and rigor as the Server Discovery And Monitoring Spec does.

Looking farther ahead, we plan to standardize the drivers’ APIs so we all do basic CRUD operations the same. And since we’ll allow much larger replica sets soon, both the server-discovery and the server-selection specs will need amendments to handle large replica sets. In all cases, we’ll provide a higher level of rigor, clarity, and formality in our specs than we have before.

This was originally posted to Jesse’s blog, Empty Square

Sharding Pitfalls: Part I

Oct 1 • Posted 2 months ago

By Adam Comerford, Senior Solutions Engineer

Sharding is a popular feature in MongoDB, primarily used for distributing data across clusters for horizontal scaling. The benefits of sharding for scalability are well known, and often one of the major factors in choosing MongoDB in the first place, but as you add complexity to a distributed system, you increase the chances of hitting a problem.

The good news is that many of the common issues people encounter when moving to a sharded environment are avoidable, and most of them can be mitigated if you have already hit them.

Forewarned is forearmed, and so with that in mind, we want you to be aware of best practices and situations to avoid when introducing sharding into your environment. In this three-part blog series we will discuss several pitfalls and gotchas that we have seen occur with some regularity among MongoDB users. We’ll give an overview of each problem, how it occurs, and how to avoid it, and then discuss some possible mitigation strategies to employ if you have already run into it.

It should be noted that some of these topics are worthy of full technical articles in their own right, which is beyond the scope of a relatively short blog post. Think of these posts as a good starting point and, if you have not yet hit any of these problems, an informative cautionary tale for anyone running a sharded MongoDB cluster. For additional details, please view the Sharding section in the MongoDB Manual.

Many of these topics are also covered as part of the M102 (MongoDB for DBAs) and M202 (Advanced Deployment and Operations) classes that are available for free on MongoDB University.

For our first set of cautionary tales we will focus on shard keys.

1. Using a monotonically increasing shard key (like ObjectID)

Although this is one of the most commonly covered topics on blogs, training material, MongoDB Days and more, the selection of a shard key remains a formidable exercise for the novice MongoDB DBA or developer.

The most common mistake we see is the selection of a monotonically increasing shard key (a fancy way of saying that the shard key value for new documents only ever increases) when using range-based rather than hashed sharding. Examples would be a timestamp (naturally), or anything that has a time component as its most significant part, like ObjectID (the first 4 bytes are a timestamp).

Why is it a bad idea?

The short answer is insert scalability. If you select such a shard key, all inserts (new documents) will go to a single chunk - the highest range chunk, and that will never change. Hence, regardless of how many shards you add, your maximum write capacity will never increase - you will only ever write new documents to a single chunk and that chunk will only ever live on a single shard.

Occasionally, this type of shard key can be the correct choice, but only if you can accept that you won’t be able to scale write capacity.

Possible Mitigation Strategies

  • Change the shard key - this is problematic with large collections, because the data essentially has to be dumped out and re-imported

  • More specifically, switch to a hashed shard key on the same field, which allows you to keep using that field while providing good write scalability.

2. Trying to Change Value of the Shard Key

Shard keys are immutable (cannot be changed) for an existing document. This issue usually only crops up when sharding a previously unsharded collection. Prior to sharding, certain updates will be possible that are no longer possible after the collection has been sharded.

Attempting to update the shard key for an existing document will fail with the following error:

cannot modify shard key's value fieldid for collection: foo.foo

Possible Mitigation Strategies

  • Delete and re-insert the document to alter the shard key rather than attempting to update it in-place. It should be noted that this will not be an atomic operation, so must be done with caution.

Now you have a better understanding of how to choose and change your shard key if needed. In our next post, we will go through some potential obstacles you will face when scaling your sharded environment.

If you want more insight on scaling techniques for MongoDB, view the slides and video from our recent webinar, which reviews three different ways to achieve scale with MongoDB.

Read Part II in the series and see what to look out for when running a sharded cluster

How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search

Aug 26 • Posted 3 months ago

By Luke Lovett, Python Engineer at MongoDB

Introduction

Suppose you’re running MongoDB. Great! Now you can find exact matches to all the queries you can throw at the database. Now, imagine that you’re also building a text-search feature into your application. It has to draw words out of misspelled noise, and results may match on synonyms, too! For this daunting task you’ve chosen to use one of the Lucene-based projects, Elasticsearch or Solr. But now you have a problem: how will this tool search through your documents stored in MongoDB? And how will you keep the contents of the search engine up-to-date?

Mongo Connector fills a gap between MongoDB and some of the best search tools out there, such as Elasticsearch and Solr. It is not only capable of exporting data from your MongoDB replica set or sharded cluster to these systems, but also keeps your data consistent between these systems: as you insert, update, and remove documents in MongoDB, these changes are soon reflected on the other side through Mongo Connector. You may even use Mongo Connector to stream changes performed on one replica set primary to another, thus simulating a “multi-master” cluster.

When Mongo Connector saw its first release in August of 2012, it was very simplistic in its capabilities and lacked fault tolerance. I’ve been working on Mongo Connector since November, 2013 with the help of the MongoDB Python team, and I’m glad to say that Mongo Connector has come a long way in terms of the features it provides and (especially) stability. This post will show off some of these new features and give an example of how to replicate operations from MongoDB to Elasticsearch, an open-source search engine, using Mongo Connector. At the end of this post, we’ll be able to make fuzzy-match text queries against data streaming into Elasticsearch.

Getting our Dataset

For this post, we’ll be pulling in posts from the popular link aggregation website, Reddit. We recently added safe encoding of MongoDB-specific data types (i.e., BSON types) into types that external database drivers (in this case, elasticsearch-py) can handle. This makes Mongo Connector safe to use for replicating documents whose content we may not have much control over (e.g., from web scraping). Using this script that pulls new posts from Reddit, we’ll stream new Reddit posts to MongoDB:

./reddit2mongo --mongo-host localhost --mongo-port 27017

As each post is processed, you should see the first 20 characters of its title. This is (slowly, I admit, thanks to Reddit API limits) emulating the inserts into MongoDB that your application would be making.

Firing up the Connector

Next, we’ll start Mongo Connector. To download and install Mongo Connector, you can use pip:

pip install mongo-connector

For this demonstration, we’ll assume that you already have Elasticsearch set up and running on your local machine, listening on port 9200. You can start replicating from MongoDB to Elasticsearch using the following command:

Of course, if we only want to perform text search on post titles and text, we can restrict which fields are passed through to Elasticsearch using the --fields option. This way, we minimize the amount of data we are actually duplicating:

Just as you see the Reddit posts printed to STDOUT by reddit2mongo, you should see output coming from Mongo Connector logging the fact that each document has been forwarded to ES at about the same time! What a beautiful scene!

Searching, Elastically

Now we’re ready to use Elasticsearch to perform fuzzy text queries on our dataset as it arrives from MongoDB. Because we’re streaming directly from Reddit’s website, I can’t really say what results you’ll find in your dataset, but as this particular corner of the internet seems to love cats almost as much as we love search engines, it’s probably safe to say that a query for kitten will get you somewhere:

Because we’re performing a fuzzy search, we can even do a search for the non-word kiten. Since most people aren’t too careful with their spelling, you can imagine how powerful this feature is when performing text searches based directly on a user’s input:

The fuzziness parameter determines the maximum “edit distance” allowed between the text query and a matching field. The prefix_length parameter requires that results match the first letter of the query. This article offers a great explanation of how fuzzy matching works. This search yielded the same results for me as its properly-spelled version.

More than just Inserts

Although our demo was just taking advantage of continuously streaming documents from MongoDB to Elasticsearch, Mongo Connector is more than just an import/export tool. When you update or delete documents in MongoDB, those operations are replicated to your other systems as well, keeping all systems in-sync with the current primary of a replica set. If the primary fails over and a rollback occurs, Mongo Connector can detect these and do the Right Thing to maintain consistency regardless.

Recap

The really great thing about this is that we’re performing operations in MongoDB and Elasticsearch at the same time. Without a tool like Mongo Connector, we would have to use a tool like mongoexport to dump data from MongoDB periodically into JSON, then upload this data into an empty Elasticsearch index, so we don’t have previously-deleted documents hanging around. This would probably be an enormous hassle, and we would lose the near real-time capability of our ES-powered search engine.

Although Mongo Connector has improved substantially since its first release, it’s still an experimental project and has a ways to go before official support by MongoDB, Inc. However, I am committed to answering questions as well as reviewing feature requests and bug reports reported to Mongo Connector’s issues page on Github. Also be sure to check out the full documentation on its Github wiki page.

Resources

Setting up Java Applications to Communicate with MongoDB, Kerberos and SSL

Aug 19 • Posted 4 months ago

By Alex Komyagin, Technical Services Engineer at MongoDB

Setting up Kerberos authentication and SSL encryption in a MongoDB Java application is not as simple as it is in other languages. In this post, I’m going to show you how to create a Kerberos- and SSL-enabled Java application that communicates with MongoDB.

My original setup consists of the following:

1) KDC server:

kdc.mongotest.com

Kerberos config file (/etc/krb5.conf):

KDC has the following principals:

  • gssapitest@MONGOTEST.COM - user principal (for the Java app)
  • mongodb/rhel64.mongotest.com@MONGOTEST.COM - service principal (for the MongoDB server)

2) MongoDB server:

rhel64.mongotest.com

MongoDB version: 2.6.0

MongoDB config file:

This server also has the global environment variable $KRB5_KTNAME set to the keytab file exported from the KDC.

Application user is configured in the admin database like this:

3) Application server: has stock OS with krb5 installed

All servers are running RHEL 6.4.

Now let’s talk about how to create a Java application with Kerberos and SSL enabled that will run on the application server. Here is the sample code that we will use (SSLApp.java):
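A minimal sketch of what SSLApp.java might look like, assuming the 2.12 Java driver and the host and principal names above (the test database and collection names are just placeholders):

import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.MongoCredential;
import com.mongodb.ServerAddress;

import javax.net.ssl.SSLSocketFactory;
import java.util.Arrays;

public class SSLApp {

    public static void main(String[] args) throws Exception {
        // Kerberos (GSSAPI) credential for the user principal defined on the KDC
        MongoCredential credential =
                MongoCredential.createGSSAPICredential("gssapitest@MONGOTEST.COM");

        // The default SSL socket factory trusts whatever certificates
        // have been imported into the Java trust store
        MongoClientOptions options = MongoClientOptions.builder()
                .socketFactory(SSLSocketFactory.getDefault())
                .build();

        // The host name must match the one used in the service principal
        MongoClient client = new MongoClient(
                new ServerAddress("rhel64.mongotest.com", 27017),
                Arrays.asList(credential),
                options);

        DBObject doc = client.getDB("test").getCollection("test").findOne();
        System.out.println(doc);

        client.close();
    }
}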

Download the Java driver:

Install java and jdk:

sudo yum install java-1.7.0
sudo yum install java-1.7.0-devel

Create a certificate store for Java and store the server certificate there, so that Java knows who it should trust:

(mongodb.crt is just the public certificate part of mongodb.pem)

Copy the Kerberos config file to the application server, as /etc/krb5.conf (or C:\WINDOWS\krb5.ini on Windows); otherwise you’ll have to specify the KDC and realm as Java runtime options.

Use kinit to store the principal password on the application server:

kinit gssapitest@MONGOTEST.COM

As an alternative to kinit, you can use JAAS to cache kerberos credentials.

Compile and run the Java program

It is important to specify useSubjectCredsOnly=false, otherwise you’ll get the “No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)” exception from Java. As we discovered, this is not strictly necessary in all cases, but it is if you are relying on kinit to get the service ticket.

The Java driver needs to construct the MongoDB service principal name in order to request the Kerberos ticket. The service principal is constructed based on the server name you provide (unless you explicitly ask it to canonicalize the server name). For example, if I change rhel64.mongotest.com to the host’s IP address in the connection URI, I get Kerberos exceptions like “No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)”. So be sure to specify the same server host name as you used in the Kerberos service principal. Adding -Dsun.security.krb5.debug=true to the Java runtime options helps a lot in debugging Kerberos auth issues.

These steps should help simplify the process of connecting Java applications to MongoDB with Kerberos and SSL. Before deploying any application with MongoDB, be sure to read through our 12 tips for going into production and the Security Checklist, which outlines recommended security measures to protect your MongoDB installation. More information on configuring MongoDB security can be found in the MongoDB Manual.

For further questions, feel free to reach out to the MongoDB team through google-groups.

Getting Started with MongoDB and Java: Part II

Aug 14 • Posted 4 months ago

By Trisha Gee, Java Engineer and Advocate at MongoDB

In the last article, we covered the basics of installing and connecting to MongoDB via a Java application. In this post, I’ll give an introduction to CRUD (Create, Read, Update, Delete) operations using the Java driver. As in the previous article, if you want to follow along and code as we go, you can use these tips to get the tests in the Getting Started project to go green.

Creating documents

In the last article, we introduced documents and how to create them from Java and insert them into MongoDB, so I’m not going to repeat that here. But if you want a reminder, or simply want to skip to playing with the code, you can take a look at Exercise3InsertTest.

Querying

Putting stuff in the database is all well and good, but you’ll probably want to query the database to get data from it.

In the last article we covered some basics on using find() to get data from the database. We also showed an example in Exercise4RetrieveTest. But MongoDB supports more than simply getting a single document by ID or getting all the documents in a collection. As I mentioned, you can query by example, building up a query document that has a similar shape to the documents you want to find.

For the following examples I’m going to assume a document which looks something like this:

person = {
  _id: "anId",
  name: "A Name",
  address: {
    street: "Street Address",
    city: "City",
    phone: "12345"
  },
  books: [ 27464, 747854, ...]
}

Find a document by ID

To recap, you can easily get a document back from the database using the unique ID:
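With the 2.x Java driver this looks something like the following sketch (assuming a DBCollection called collection, and BasicDBObject, DBObject and DBCursor imported from com.mongodb):

DBObject query = new BasicDBObject("_id", "anId");
DBCursor cursor = collection.find(query);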

…and you get the values out of the document (represented as a DBObject) using a Map-like syntax:
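For example:

DBObject result = cursor.one();
String name = (String) result.get("name");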

In the above example, because you’ve queried by ID (and you knew that ID existed), you can be sure that the cursor has a single document that matches the query. Therefore you can use cursor.one() to get it.

Find all documents matching some criteria

In the real world, you won’t always know the ID of the document you want. You could be looking for all the people with a particular name, for example.

In this case, you can create a query document that has the criteria you want:
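For example, to find everyone called "Smith" (still a sketch, using the same collection as before):

DBObject query = new BasicDBObject("name", "Smith");
DBCursor cursor = collection.find(query);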

You can find out the number of results:
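For example:

int numberOfResults = cursor.count();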

and you can, naturally, iterate over them:
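The cursor is iterable, so something like this works:

for (DBObject result : cursor) {
    System.out.println(result.get("name"));
}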

A note on batching

The cursor will fetch results in batches from the database, so if you run a query that matches a lot of documents, you don’t have to worry that every document is loaded into memory immediately. For most queries, the first batch returned will be 101 documents. But as you iterate over the cursor, the driver will automatically fetch further batches from the server. So you don’t have to worry about managing batching in your application. But you do need to be aware that if you iterate over the whole of the cursor (for example to put it into a List), you will end up fetching all the results and putting them in memory.

You can get started with Exercise5SimpleQueryTest.

Selecting Fields

Generally speaking, you will read entire documents from MongoDB most of the time. However, you can choose to return just the fields that you care about (for example, you might have a large document and not need all the values). You do this by passing a second parameter into the find method that’s another DBObject defining the fields you want to return. In this example, we’ll search for people called “Smith”, and return only the name field. To do this we pass in a DBObject representing {name: 1}:
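Roughly like this:

DBCursor cursor = collection.find(new BasicDBObject("name", "Smith"),
                                  new BasicDBObject("name", 1));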

You can also use this method to exclude fields from the results. Maybe we might want to exclude an unnecessary subdocument from the results - let’s say we want to find everyone called “Smith”, but we don’t want to return the address. We do this by passing in a zero for this field name, i.e. {address: 0}:
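Along the same lines:

DBCursor cursor = collection.find(new BasicDBObject("name", "Smith"),
                                  new BasicDBObject("address", 0));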

With this information, you’re ready to tackle Exercise6SelectFieldsTest

Query Operators

As I mentioned in the previous article, your fields can be one of a number of types, including numeric. This means that you can do queries for numeric values as well. Let’s assume, for example, that our person has a numberOfOrders field, and we wanted to find everyone who had ordered more than, let’s say, 10 items. You can do this using the $gt operator:
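A sketch of such a query:

DBObject query = new BasicDBObject("numberOfOrders", new BasicDBObject("$gt", 10));
DBCursor cursor = collection.find(query);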

Note that you have to create a further subdocument containing the $gt condition to use this operator. All of the query operators are documented, and work in a similar way to this example.

You might be wondering what terrible things could happen if you try to perform some sort of numeric comparison on a field that is a String, since the database supports any type of value in any of the fields (and in Java the values are Objects so you don’t get the benefit of type safety). So, what happens if you do this?
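For example, a numeric comparison against the String-valued name field:

// name holds Strings, but we compare it against a number
DBCursor cursor = collection.find(
        new BasicDBObject("name", new BasicDBObject("$gt", 10)));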

The answer is you get zero results (assuming all your documents contain names that are Strings), and you don’t get any errors. The flexible nature of the document schema allows you to mix and match types and query without error.

You can use this technique to get the test in Exercise7QueryOperatorsTest to go green - it’s a bit of a daft example, but you get the idea.

Querying Subdocuments

So far we’ve assumed that we only want to query values in our top-level fields. However, we might want to query for values in a subdocument - for example, with our person document, we might want to find everyone who lives in the same city. We can use dot notation like this:
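For example, using the city value from the sample document above:

DBCursor cursor = collection.find(new BasicDBObject("address.city", "City"));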

We’re not going to use this technique in a query test, but we will use it later when we’re doing updates.

Familiar methods

I mentioned earlier that you can iterate over a cursor, and that the driver will fetch results in batches. However, you can also use the familiar-looking skip() and limit() methods. You can use these to fix up the test in Exercise8SkipAndLimitTest.
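For example, to take the second page of 10 results (a sketch, reusing the query document from earlier):

DBCursor cursor = collection.find(query).skip(10).limit(10);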

A last note on querying: Indexes

Like a traditional database, you can add indexes to the database to improve the speed of regular queries. There’s extensive documentation on indexes which you can read at your own leisure. However, it is worth pointing out that, if necessary, you can programmatically create indexes via the Java driver, using createIndex. For example:
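Something along these lines, indexing the name field in ascending order:

collection.createIndex(new BasicDBObject("name", 1));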

There is a very simple example for creating an index in Exercise9IndexTest, but indexes are a full topic on their own, and the purpose of this part of the tutorial is to merely make you aware of their existence rather than provide a comprehensive tutorial on their purpose and uses.

Updating values

Now you can insert into and read from the database. But your data is probably not static, especially as one of the benefits of MongoDB is a flexible schema that can evolve with your needs over time.

In order to update values in the database, you’ll have to define the query criteria that states which document(s) you want to update, and you’ll have to pass in the document that represents the updates you want to make.

There are a few things to be aware of when you’re updating documents in MongoDB; once you understand these, it’s as simple as everything else we’ve seen so far.

Firstly, by default only the first document that matches the query criteria is updated.

Secondly, if you pass in a document as the update value, this new document will replace the whole existing document. If you think about it, the common use case is: you retrieve something from the database; you modify it based on some criteria from your application or the user; then you save the updated document to the database.

I’ll show the various types of updates (and point you to the code in the test class) to walk you through these different cases.

Simple Update: Find a document and replace it with an updated one

We’ll carry on using our simple Person document for our examples. Let’s assume we’ve got a document in our database that looks like:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  address: {
    street: "123 Fake St",
    city: "Faketon",
    phone: "5559991234"
  },
  books: [ 27464, 747854, ...]
}

Maybe Jo goes into witness protection and needs to change his/her name. Assuming we’ve got jo populated in a DBObject, we can make the appropriate changes to the document and save it into the database:
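A sketch of this, with a made-up new name:

jo.put("name", "Jo In Disguise");                       // modify the retrieved DBObject
collection.update(new BasicDBObject("_id", "jo"), jo);  // replace the stored document with it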

You can make a start with Exercise10UpdateByReplacementTest.

Update Operators: Change a field

But sometimes you won’t have the whole document to replace the old one, sometimes you just want to update a single field in whichever document matched your criteria.

Let’s imagine that we only want to change Jo’s phone number, and we don’t have a DBObject with all of Jo’s details but we do have the ID of the document. If we use the $set operator, we’ll replace only the field we want to change:
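Something like this, using dot notation to reach into the address subdocument (the new number is made up):

collection.update(new BasicDBObject("_id", "jo"),
                  new BasicDBObject("$set",
                          new BasicDBObject("address.phone", "5559876543")));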

There are a number of other operators for performing updates on documents, for example $inc which will increment a numeric field by a given amount.

Now you can do Exercise11UpdateAFieldTest

Update Multiple

As I mentioned earlier, by default the update operation updates the first document it finds and no more. You can, however, set the multi flag on update to update every matching document.

So maybe we want to update everyone in the database to have a country field, and for now we’re going to assume all the current people are in the UK:
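A sketch of such an update:

collection.update(new BasicDBObject(),                                       // empty query: match everything
                  new BasicDBObject("$set", new BasicDBObject("country", "UK")),
                  false,                                                     // upsert
                  true);                                                     // multi: update every match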

The query parameter is an empty document, which finds everything; the second boolean (set to true) is the flag that says to update all the documents that were found.

Now we’ve learnt enough to complete the two tests in Exercise12UpdateMultipleDocumentsTest

Upsert

Finally, the last thing to mention when updating documents is Upsert (Update-or-Insert). This will search for a document matching the criteria and either update it if it’s there, or insert it into the database if it isn’t.

Like “update multiple”, you define an upsert operation with a magic boolean. It shouldn’t come as a surprise to find it’s the first boolean param in the update statement (since “multi” was the second):
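For example, assuming jo is the DBObject from the earlier examples:

collection.update(new BasicDBObject("_id", "jo"),
                  jo,      // the document to update or insert
                  true,    // upsert: insert it if no match is found
                  false);  // multi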

Now you know everything you need to complete the test in Exercise13UpsertTest

Removing from the database

Finally the D in CRUD - Delete. The syntax of a remove should look familiar now that we’ve got this far: you pass a document that represents your selection criteria into the remove method. So if we wanted to delete jo from our database, we’d do:
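For example:

collection.remove(new BasicDBObject("_id", "jo"));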

Unlike update, if the query matches more than one document, all those documents will be deleted (something to be aware of!). If we wanted to remove everyone who lives in London, we’d need to do:
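Again using dot notation for the subdocument field:

collection.remove(new BasicDBObject("address.city", "London"));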

That’s all there is to remove; you’re ready to finish off Exercise14RemoveTest.

Conclusion

Unlike traditional databases, you don’t create SQL queries in MongoDB to perform CRUD operations. Instead, operations are done by constructing documents both to query the database, and to define the operations to perform.

While we’ve covered what the basics look like in Java, there’s loads more documentation on all the core concepts in the MongoDB documentation:
