Configuration

Data7 configuration is split across three files:

settings.yaml: general server configuration
.secrets.yaml: all sensitive settings or credentials for Data7
data7.yaml: the datasets definition

All configuration files follow general and file-specific rules, described in detail in the following sections.

General rules

Settings can be defined for multiple environments

Data7 configuration management is based on the Dynaconf library, which supports environment-specific settings: you can define different values for the same setting depending on the environment your instance is associated with. By environment we mean development, testing, staging or production, to name a few.

You can define as many environments as you need. If none is active for the current instance (more on this later), Data7 will look for a default configuration.

# settings.yaml
default:
  # The default value if no other environment is defined or active
  debug: false

development:
  debug: true

testing:
  # Speed up tests
  debug: false

staging:
  # Better not expose logs publicly
  debug: false

production:
  # Strongly recommended
  debug: false

To activate a particular environment for your instance, you have two options:

Define the ENV_FOR_DYNACONF environment variable with the environment name you want to activate, e.g. ENV_FOR_DYNACONF=staging.
Set the ENV_FOR_DYNACONF value in a .env file:

ENV_FOR_DYNACONF=development
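
To make the layering concrete, here is a minimal Python sketch of the lookup described above: the active environment's section is consulted first, then the default section. It is a simplified illustration, not Dynaconf's actual implementation; the `SETTINGS` dictionary and the `resolve` helper are hypothetical names introduced for this example.

```python
import os

# Hypothetical in-memory mirror of the settings.yaml example above.
SETTINGS = {
    "default": {"debug": False},
    "development": {"debug": True},
    "production": {"debug": False},
}


def resolve(name, settings=SETTINGS, environ=os.environ):
    """Return `name` from the active environment, else from `default`."""
    # Assumption: with no active environment we fall back to `default`,
    # as described in the text above.
    active = environ.get("ENV_FOR_DYNACONF", "default").lower()
    section = settings.get(active, {})
    if name in section:
        return section[name]
    return settings["default"][name]


print(resolve("debug", environ={"ENV_FOR_DYNACONF": "development"}))  # True
print(resolve("debug", environ={}))  # False
```

An environment that does not define a setting (or is not declared at all) falls through to the default section in this sketch.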

Setting names are case-insensitive

This is an important rule: each setting can be defined in upper or lower case, e.g. debug and DEBUG are the same setting.

Tip for contributors

As a consequence, you can define your settings in lower case because it's more readable in your YAML configuration:

debug: true

And use the upper case form in the code:

from data7.conf import settings


print(f"{settings.DEBUG=}")

Settings can be overridden using environment variables

Every setting can be overridden by defining the corresponding environment variable (in uppercase) prefixed by DATA7_, e.g. for the debug setting, you can define the DATA7_DEBUG=false environment variable to override the value defined in the settings.yaml file.
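
The precedence rule can be sketched in a few lines of Python. This is only an illustration of the naming convention (upper-case setting name prefixed by DATA7_); the `override` helper is hypothetical, and its value parsing is deliberately simplistic compared with what Dynaconf does.

```python
import os

PREFIX = "DATA7_"


def override(name, file_value, environ=os.environ):
    """Prefer DATA7_<NAME> from the environment over the file value."""
    raw = environ.get(PREFIX + name.upper())
    if raw is None:
        return file_value
    # Illustrative parsing only: booleans are recognised,
    # anything else is returned as a plain string.
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return raw


print(override("debug", True, environ={"DATA7_DEBUG": "false"}))  # False
print(override("debug", True, environ={}))  # True
```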

Use `data7 init` to bootstrap your configuration

Data7 comes with a CLI that can help you bootstrap your project (see the tutorial). The data7 init command generates the three required configuration files for you. Once they are generated, it is up to you to define your own environments and change setting values to suit your needs.

Configuration details

`settings.yaml`

`DATASETS_ROOT_URL`

The root URL that prefixes dataset URLs (e.g. the /d in /d/invoices.csv for the invoices dataset).

Default: /d

`CHUNK_SIZE`

Size of batches to process, i.e. the number of SQL query result rows to process at each iteration.

Default: 5000
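
As an illustration of what batch processing means here, the following Python sketch splits any iterable of rows into batches of at most `chunk_size` items. The `iter_chunks` helper is hypothetical and is not taken from the Data7 code base; it only shows the general technique.

```python
from itertools import islice


def iter_chunks(rows, chunk_size=5000):
    """Yield lists of at most `chunk_size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 12,000 rows with the default chunk size give three batches.
sizes = [len(chunk) for chunk in iter_chunks(range(12_000), chunk_size=5000)]
print(sizes)  # [5000, 5000, 2000]
```

A larger chunk size means fewer iterations but more rows held in memory at once.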

`SCHEMA_SNIFFER_SIZE`

The number of SQL query result rows used to infer a table schema (data types).

Default: 1000

`DEFAULT_DTYPE_BACKEND`

The backend used to infer data types while fetching data from the database. Possible values are: numpy_nullable or pyarrow (see Pandas documentation).

Default: pyarrow

`PROFILER_INTERVAL`

From pyinstrument's documentation:

The minimum time, in seconds, between each stack sample. This translates into the resolution of the sampling.

Default: 0.001

`PROFILER_ASYNC_MODE`

From pyinstrument's documentation:

Configures how this Profiler tracks time in a program that uses async/await.

Default: enabled

`DEBUG`

Set to true to enable debugging mode: logs and server responses will be more explicit.

Default: false

Warning

We strongly recommend keeping the default value (false) when running Data7 in production.

`PROFILING`

Enable or disable server request profiling. If set to true, adding the ?profile=1 argument to an HTTP request returns the profiling analysis instead of the requested dataset.

Example query: http://localhost:8000/d/invoices.csv?profile=1

Default: false
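
If you build dataset URLs programmatically, the profile argument can be appended with the standard library. The sketch below is a minimal example; the `with_profiling` helper is a hypothetical name, and only the profile=1 argument comes from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_profiling(url):
    """Return `url` with the profile=1 query argument added."""
    parts = urlsplit(url)
    # Keep any existing query arguments and add (or replace) `profile`.
    query = dict(parse_qsl(parts.query))
    query["profile"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_profiling("http://localhost:8000/d/invoices.csv"))
# http://localhost:8000/d/invoices.csv?profile=1
```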

`HOST`

The host the socket will be bound to. It can be an IPv4 or IPv6 address, or a fully qualified domain name (e.g. data7.example.org). Set this to 0.0.0.0 if you want your application to be available from your local network.

Default: None (required)

`PORT`

The host port the socket will be bound to. It is commonly set to 8000 for a Python application.

Default: None (required)

`EXECUTION_ENVIRONMENT`

Used by Sentry to track the environment in which an issue was raised.

Default: None

`SENTRY_DSN`

The DSN of your Sentry project, e.g. https://account@sentry.io/project_id. When not set, Sentry integration is not active.

Default: None

`SENTRY_ENABLE_TRACING`

Toggle Sentry's performance tracing feature.

Default: True

`SENTRY_TRACES_SAMPLE_RATE`

The sample rate of traces sent to Sentry: 1.0 means 100% while 0.1 means 10%. It should be adapted for production.

Default: 1.0

`SENTRY_PROFILES_SAMPLE_RATE`

The sample rate of profiles sent to Sentry: 1.0 means 100% while 0.1 means 10%. It should be adapted for production.

Default: 1.0

`.secrets.yaml`

`DATABASE_URL`

The URL that Data7 uses for database connections. It follows the usual pattern:

<database engine>://<user>:<password>@<host>:<port>/<database name>

Info

Data7 supports all asynchronous database engines supported by the databases library. Depending on your database engine, you may need to add the related database driver to your project dependencies.

The table below summarizes, for each database engine and driver, the dependency you need to install and an example DATABASE_URL value. It assumes a database user named data7 with the password secret, and a database named chinook.

Database	Dependency	Example value
PostgreSQL	`psycopg[binary,pool]`	`postgresql+psycopg://data7:secret@localhost:5432/chinook`
MySQL	`mariadb-connector`	`mysql://data7:secret@localhost:3306/chinook`
SQLite	-	`sqlite:///chinook.db`
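
To double-check that a DATABASE_URL value matches the pattern above, you can split it into its components with Python's standard library. This is a minimal sketch; the `describe_database_url` helper is a hypothetical name, and it only inspects the string without opening any connection.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Split a database URL into the components of the pattern above."""
    parts = urlsplit(url)
    return {
        "engine": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_database_url(
    "postgresql+psycopg://data7:secret@localhost:5432/chinook"
)
print(info["engine"], info["host"], info["port"], info["database"])
# postgresql+psycopg localhost 5432 chinook
```

For the SQLite example, the user, password, host and port come back as `None` and the database component is `chinook.db`.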

`data7.yaml`

`DATASETS`

This is the core setting of your Data7 instance. DATASETS is a list of dataset definitions. Each dataset is defined by:

a basename: the base name of your dataset, used in its URL (e.g. /d/invoices.csv for the invoices basename) and therefore as the file name when you fetch its content.
a query: the SQL query that will be executed to fetch data.

Here are example definitions for the development environment:

datasets:
  # Base dataset exposing all table records
  #
  - basename: invoices
    query: "SELECT * FROM Invoice"

  # A more complex dataset using related tables
  #
  - basename: tracks
    query: >-
      SELECT Artist.Name as artist, Album.Title as title, Track.Name as track
      FROM Artist
      INNER JOIN Album ON Artist.ArtistId = Album.ArtistId
      INNER JOIN Track ON Album.AlbumId = Track.AlbumId
      ORDER BY Artist.Name, Album.Title

Tip

Remember that this file must be valid YAML, and that each database query must be valid SQL. You can check both YAML and SQL validity using the data7 check command.
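
The relationship between a dataset definition and its URL can be sketched in Python as follows. This is an illustration only, not Data7's implementation: the `dataset_urls` helper is hypothetical, the definitions are given as plain dictionaries rather than parsed from YAML, and the csv extension is taken from the examples in this page.

```python
def dataset_urls(datasets, root_url="/d", extension="csv"):
    """Map each dataset basename to its URL, rejecting incomplete definitions."""
    urls = {}
    for index, dataset in enumerate(datasets):
        # Each definition needs both a basename and a query.
        missing = {"basename", "query"} - dataset.keys()
        if missing:
            raise ValueError(f"dataset #{index} is missing: {sorted(missing)}")
        urls[dataset["basename"]] = f"{root_url}/{dataset['basename']}.{extension}"
    return urls


datasets = [
    {"basename": "invoices", "query": "SELECT * FROM Invoice"},
    {"basename": "tracks", "query": "SELECT Name FROM Track"},
]
print(dataset_urls(datasets))
# {'invoices': '/d/invoices.csv', 'tracks': '/d/tracks.csv'}
```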
