MongoDB connector#

The MongoDB connector allows the use of MongoDB collections as tables in Starburst Enterprise.

SEP includes additional enterprise features built on top of the existing Trino connector functionality. For more information on key feature differences between the Trino and SEP connectors, see the connectors feature matrix.

Requirements#

To connect to MongoDB, you need:

  • MongoDB 4.2 or higher.

  • Network access from the SEP coordinator and workers to MongoDB. Port 27017 is the default port.

Configuration#

To configure the MongoDB connector, create a catalog properties file that specifies the MongoDB connector by setting the connector.name property to mongodb.

For example, to access a database as the example catalog, create the file etc/catalog/example.properties. Replace the connection properties as appropriate for your setup:

connector.name=mongodb
mongodb.connection-url=mongodb://user:pass@sample.host:27017/

The mongodb.connection-url property defines the connection information and parameters to pass to the MongoDB driver. The supported URL parameters are described in the MongoDB documentation.

Multiple MongoDB servers#

You can have as many catalogs as you need. If you have additional MongoDB servers, configure another catalog.

To add another catalog, add a new properties file to etc/catalog. For example, if you name the properties file sales.properties, SEP creates a catalog named sales.
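For example, a second catalog file etc/catalog/sales.properties might look like the following; the host and credentials are placeholders for your second deployment:

```properties
connector.name=mongodb
mongodb.connection-url=mongodb://user:pass@sales.host:27017/
```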

Configuration properties#

The following table describes catalog configuration properties for the connector:

mongodb.allow-local-scheduling
    Assigns MongoDB splits to a specific worker. Set this property to true if SEP and MongoDB share the same cluster, and specific MongoDB splits should be processed on the same worker and MongoDB node. Note that a shared deployment is not recommended, and enabling this property can lead to resource contention.
    Default: false

mongodb.case-insensitive-name-matching
    Match database and collection names case insensitively.
    Default: false

mongodb.connection-timeout
    The socket connection timeout in milliseconds. A value of 0 means no timeout. Only used when establishing a new connection.
    Default: 10000

mongodb.connection-url
    The connection URL string containing the protocol, credentials, and host information for connecting to your MongoDB deployment. For example, the connection string may use the format mongodb://<user>:<pass>@<host>:<port>/?<options> or mongodb+srv://<user>:<pass>@<host>/?<options>, depending on the protocol used. The user/pass credentials must belong to a user with write access to the schema information collection. This property is required.

mongodb.connections-per-host
    The maximum number of connections allowed per host for this MongoClient instance. Idle connections are kept in a pool. Once the pool is exhausted, any operation requiring a connection blocks until a connection becomes available.
    Default: 100

mongodb.cursor-batch-size
    Limits the number of elements returned in one batch. A cursor typically fetches a batch of result objects and stores them locally. If batchSize is 0, the driver's default is used. If batchSize is positive, it represents the size of each batch of objects retrieved, and can be adjusted to optimize performance and limit data transfer. If batchSize is negative, it limits the number of objects returned to as many as fit within the maximum batch size limit (usually 4MB), and the cursor is closed. For example, if batchSize is -10, the server returns a maximum of 10 documents, as many as fit in 4MB, and then closes the cursor. Do not use a batch size of 1.
    Default: 0

mongodb.dynamic-filtering.wait-timeout
    Duration to wait for completion of dynamic filters during split generation.
    Default: 5s

mongodb.max-connection-idle-time
    The maximum idle time of a pooled connection in milliseconds. A value of 0 indicates no limit on idle time. A pooled connection that exceeds its idle time is closed and replaced by a new connection when necessary.
    Default: 0

mongodb.max-wait-time
    The maximum time in milliseconds that a thread may wait for a connection to become available. A value of 0 means it does not wait. A negative value means to wait indefinitely.
    Default: 120000

mongodb.min-connections-per-host
    The minimum number of connections per host for this MongoClient instance. Idle connections are kept in a pool, and the pool ensures over time that it contains at least this minimum number.
    Default: 0

mongodb.read-preference
    The read preference to use for queries, map-reduce, aggregation, and count. The available values are PRIMARY, PRIMARY_PREFERRED, SECONDARY, SECONDARY_PREFERRED, and NEAREST.
    Default: PRIMARY

mongodb.required-replica-set
    The required replica set name. With this option set, the MongoClient instance performs the following actions:

      • Connects in replica set mode, and discovers all members of the set based on the given servers.

      • Ensures that the set name reported by all members matches the required set name.

      • Refuses to service any requests if the authenticated user is not part of a replica set with the required name.

mongodb.schema-collection
    The name of the collection in each MongoDB database that contains the schema information for all tables. Refer to the Table definition section for details.

    At startup, the connector tries to guess the data type of fields based on the type mapping. The initial guess can be incorrect for your specific collection; in that case, you need to modify it manually. Refer to the Table definition section for details. Creating new tables with CREATE TABLE and CREATE TABLE AS SELECT automatically creates an entry for you.
    Default: _schema

mongodb.socket-timeout
    The socket timeout in milliseconds, used for I/O socket read and write operations. A value of 0 means no timeout.
    Default: 0

mongodb.tls.enabled
    Enables TLS/SSL connections to MongoDB servers.
    Default: false

mongodb.tls.keystore-password
    The key password for the key store specified by mongodb.tls.keystore-path.

mongodb.tls.keystore-path
    The path to the PEM or JKS key store.

mongodb.tls.truststore-password
    The key password for the trust store specified by mongodb.tls.truststore-path.

mongodb.tls.truststore-path
    The path to the PEM or JKS trust store.

mongodb.write-concern
    The write concern to use. The available values are ACKNOWLEDGED, JOURNALED, MAJORITY, and UNACKNOWLEDGED.
    Default: ACKNOWLEDGED

Table definition#

The connector maintains table definitions in the special collection specified by the mongodb.schema-collection configuration property.

Note

The plugin cannot detect that a collection has been deleted. You must delete the entry by executing db.getCollection("_schema").remove( { table: deleted_table_name }) in the MongoDB Shell. You can also drop a collection in SEP by running DROP TABLE table_name.

The schema collection contains one MongoDB document for each table:

{
    "table": ...,
    "fields": [
        {
            "name" : ...,
            "type" : "varchar|bigint|boolean|double|date|array(bigint)|...",
            "hidden" : false
        },
        ...
    ]
}

The connector quotes the fields for a row type when auto-generating the schema. However, the auto-generated schema must be corrected manually in the collection to match the information in the tables.

Manually altered fields must be explicitly quoted. For example, row("UpperCase" varchar).
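For example, a schema document for the orders table used later in this section might look like the following; this is a sketch, and the exact entry depends on your collection:

```json
{
    "table": "orders",
    "fields": [
        { "name": "orderkey", "type": "bigint", "hidden": false },
        { "name": "orderstatus", "type": "varchar", "hidden": false },
        { "name": "totalprice", "type": "double", "hidden": false },
        { "name": "orderdate", "type": "date", "hidden": false }
    ]
}
```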

table (required, string)
    Trino table name.

fields (required, array)
    A list of field definitions. Each field definition creates a new column in the Trino table.

Each field definition:

{
    "name": ...,
    "type": ...,
    "hidden": ...
}

name (required, string)
    Name of the column in the Trino table.

type (required, string)
    Trino type of the column.

hidden (optional, boolean)
    Hides the column from DESCRIBE <table name> and SELECT *. Defaults to false.


ObjectId#

Every MongoDB collection has the special field _id. The connector follows the same rules for this special field, so every table contains a hidden _id field:

CREATE TABLE IF NOT EXISTS orders (
    orderkey BIGINT,
    orderstatus VARCHAR,
    totalprice DOUBLE,
    orderdate DATE
);

INSERT INTO orders VALUES(1, 'bad', 50.0, current_date);
INSERT INTO orders VALUES(2, 'good', 100.0, current_date);
SELECT _id, * FROM orders;
                 _id                 | orderkey | orderstatus | totalprice | orderdate
-------------------------------------+----------+-------------+------------+------------
 55 b1 51 63 38 64 d6 43 8c 61 a9 ce |        1 | bad         |       50.0 | 2015-07-23
 55 b1 51 67 38 64 d6 43 8c 61 a9 cf |        2 | good        |      100.0 | 2015-07-23
(2 rows)
SELECT _id, * FROM orders WHERE _id = ObjectId('55b151633864d6438c61a9ce');
                 _id                 | orderkey | orderstatus | totalprice | orderdate
-------------------------------------+----------+-------------+------------+------------
 55 b1 51 63 38 64 d6 43 8c 61 a9 ce |        1 | bad         |       50.0 | 2015-07-23
(1 row)

You can render the _id field to readable values with a cast to VARCHAR:

SELECT CAST(_id AS VARCHAR), * FROM orders WHERE _id = ObjectId('55b151633864d6438c61a9ce');
           _id             | orderkey | orderstatus | totalprice | orderdate
---------------------------+----------+-------------+------------+------------
 55b151633864d6438c61a9ce  |        1 | bad         |       50.0 | 2015-07-23
(1 row)

ObjectId timestamp functions#

The first four bytes of each ObjectId represent an embedded timestamp of its creation time. Trino provides a couple of functions to take advantage of this MongoDB feature.

objectid_timestamp(ObjectId) timestamp#

Extracts the TIMESTAMP WITH TIME ZONE from a given ObjectId:

SELECT objectid_timestamp(ObjectId('507f191e810c19729de860ea'));
-- 2012-10-17 20:46:22.000 UTC
timestamp_objectid(timestamp) ObjectId#

Creates an ObjectId from a TIMESTAMP WITH TIME ZONE:

SELECT timestamp_objectid(TIMESTAMP '2021-08-07 17:51:36 +00:00');
-- 61 0e c8 28 00 00 00 00 00 00 00 00

In MongoDB, you can filter all the documents created after 2021-08-07 17:51:36 with a query like this:

db.collection.find({"_id": {"$gt": ObjectId("610ec8280000000000000000")}})

In SEP, the same can be achieved with this query:

SELECT *
FROM collection
WHERE _id > timestamp_objectid(TIMESTAMP '2021-08-07 17:51:36 +00:00');
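The arithmetic behind both functions is straightforward: the first four bytes of an ObjectId are big-endian seconds since the Unix epoch. The following Python sketch, which is not part of the connector and is shown only to illustrate the conversion, reproduces both directions:

```python
from datetime import datetime, timezone

def objectid_timestamp(objectid_hex: str) -> datetime:
    # The first 4 bytes (8 hex digits) are big-endian seconds since the epoch.
    seconds = int(objectid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def timestamp_objectid(dt: datetime) -> str:
    # Build the smallest ObjectId with the given timestamp:
    # 4 timestamp bytes followed by 8 zero bytes.
    return format(int(dt.timestamp()), "08x") + "00" * 8

print(objectid_timestamp("507f191e810c19729de860ea"))
# 2012-10-17 20:46:22+00:00
print(timestamp_objectid(datetime(2021, 8, 7, 17, 51, 36, tzinfo=timezone.utc)))
# 610ec8280000000000000000
```

Note how the second result matches the ObjectId used in the filter queries above: all bytes after the timestamp are zero, so it sorts before every ObjectId created at or after that instant.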

Fault-tolerant execution support#

The connector supports Fault-tolerant execution of query processing. Read and write operations are both supported with any retry policy.
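Fault-tolerant execution is enabled at the cluster level rather than per catalog. For example, in the coordinator's and workers' config.properties:

```properties
# Either the TASK or QUERY retry policy works with this connector.
retry-policy=TASK
```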

Type mapping#

Because Trino and MongoDB each support types that the other does not, this connector modifies some types when reading or writing data. Data types may not map the same way in both directions between Trino and the data source. Refer to the following sections for type mapping in each direction.

MongoDB to Trino type mapping#

The connector maps MongoDB types to the corresponding Trino types following this table:

MongoDB type    Trino type      Notes
------------    -------------   -----
Boolean         BOOLEAN
Int32           BIGINT
Int64           BIGINT
Double          DOUBLE
Decimal128      DECIMAL(p, s)
Date            TIMESTAMP(3)
String          VARCHAR
Binary          VARBINARY
ObjectId        ObjectId
Object          ROW
Array           ARRAY           Mapped to ROW if the element type is not unique.
DBRef           ROW

No other types are supported.

Trino to MongoDB type mapping#

The connector maps Trino types to the corresponding MongoDB types following this table:

Trino type      MongoDB type
-------------   ------------
BOOLEAN         Boolean
BIGINT          Int64
DOUBLE          Double
DECIMAL(p, s)   Decimal128
TIMESTAMP(3)    Date
VARCHAR         String
VARBINARY       Binary
ObjectId        ObjectId
ROW             Object
ARRAY           Array

No other types are supported.

SQL support#

The connector provides read and write access to data and metadata in MongoDB. In addition to the globally available and read operation statements, the connector supports the following features:

ALTER TABLE#

The connector supports the following ALTER TABLE operations:

  • ALTER TABLE RENAME TO

  • ALTER TABLE ADD COLUMN

  • ALTER TABLE DROP COLUMN

Other uses of ALTER TABLE are not supported.

Table functions#

The connector provides specific table functions to access MongoDB.

query(database, collection, filter) -> table#

The query function lets you query the underlying MongoDB directly. It requires syntax native to MongoDB, because the full query is pushed down and processed by MongoDB. This can be useful for accessing native features which are not available in SEP or for improving query performance in situations where running a query natively may be faster.

For example, get all rows where regionkey field is 0:

SELECT
  *
FROM
  TABLE(
    example.system.query(
      database => 'tpch',
      collection => 'region',
      filter => '{ regionkey: 0 }'
    )
  );

Performance#

The connector includes a number of performance features, detailed in the following sections.

Predicate pushdown support#

The connector supports predicate pushdown, including complex expression predicate pushdown using MongoDB indexing. It also supports predicate pushdown for the following functions:

Complex expression predicate pushdown is supported on columns of the following types:

  • SMALLINT

  • TINYINT

  • INT

  • BIGINT

  • VARCHAR

  • TIMESTAMP(3)

  • TIMESTAMP(3) WITH TIME ZONE

You can configure complex expression pushdown by setting the mongodb.complex-expression-pushdown.enabled property in your catalog configuration file. The equivalent session property is complex_expression_pushdown_enabled. This feature is enabled by default.
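For example, to disable complex expression pushdown for a single session, assuming a catalog named example:

```sql
SET SESSION example.complex_expression_pushdown_enabled = false;
```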

Security#

The connector includes security-related features detailed in the following section.

Kerberos authentication#

The connector supports Kerberos authentication using a keytab.

To configure Kerberos authentication with a keytab, add the following catalog configuration properties to the catalog properties file:

mongo.authentication.type=KERBEROS
kerberos.client.principal=example@example.com
kerberos.client.keytab=etc/kerberos/example.keytab
kerberos.config=etc/kerberos/krb5.conf

With this configuration, the user example@example.com, defined in the principal property, is used to connect to the database, and the related Kerberos service ticket is located in the example.keytab file.

Kerberos credential pass-through#

The connector can be configured to pass through Kerberos credentials received by SEP to the MongoDB database. Configure Kerberos and SEP, following the instructions in Kerberos credential pass-through.

Configure the connector to pass through the credentials from the server to the database in your catalog properties file, and ensure the Kerberos client configuration properties are in place on all nodes.

mongo.authentication.type=KERBEROS_PASS_THROUGH
http.authentication.krb5.config=/etc/krb5.conf
http-server.authentication.krb5.service-name=exampleServiceName
http-server.authentication.krb5.keytab=/path/to/Keytab/File

Note

When delegated Kerberos authentication is configured for the Starburst Enterprise web UI, ensure the http-server.authentication.krb5.service-name property value is set to HTTP to match the configured Kerberos service name.

Now, any MongoDB database accessed using SEP is subject to the Kerberos-defined data access restrictions and permissions.

Password credential pass-through#

The connector supports password credential pass-through. To enable it, edit the catalog properties file to include the authentication type:

mongo.authentication.type=PASSWORD_PASS_THROUGH
mongo.authentication.source=
mongo.authentication.mechanism=

The mongo.authentication.source property is the name of the database where the user’s data is stored. If the user’s data is stored outside of MongoDB, set the property value to $external.

The mongo.authentication.mechanism property is modeled after MongoDB’s authentication mechanisms and accepts a single value from the following set defined by MongoDB:

  • SCRAM-SHA-1

  • SCRAM-SHA-256

  • PLAIN
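For example, a catalog configured for users managed in an external LDAP directory might use the following values; these are illustrative only and must match your MongoDB deployment's authentication setup:

```properties
mongo.authentication.type=PASSWORD_PASS_THROUGH
mongo.authentication.source=$external
mongo.authentication.mechanism=PLAIN
```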

For more information about configurations and limitations, see Password credential pass-through.