Configuration

Data7 configuration is split across three files:

  1. settings.yaml: general server configuration
  2. .secrets.yaml: all sensitive settings or credentials for Data7
  3. data7.yaml: the datasets definition

All configuration files follow general and file-specific rules, which are described in detail in the following sections.

General rules

Settings can be defined for multiple environments

Data7 configuration management is based on the Dynaconf library, which supports per-environment settings: you can define different values for the same setting depending on the environment your instance is associated with. By environment we mean development, testing, staging, or production, to name a few.

You can define as many environments as you need. If none is active for the current instance (more on this later), Data7 will look for a default configuration.

# settings.yaml
default:
  # The default value if no other environment is defined or active
  debug: false

development:
  debug: true

testing:
  # Speed up tests
  debug: false

staging:
  # Better not expose logs publicly
  debug: false

production:
  # Strongly recommended
  debug: false

To activate a particular environment for your instance, you have two options:

  1. Define the ENV_FOR_DYNACONF environment variable with the environment name you want to activate, e.g. ENV_FOR_DYNACONF=staging.
  2. Set the ENV_FOR_DYNACONF value in a .env file:
ENV_FOR_DYNACONF=development
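
To make the layering concrete, here is a minimal conceptual sketch in plain Python of how an environment section is applied on top of the default section. It is not Dynaconf's implementation; the SETTINGS dictionary and the settings_for helper are hypothetical names used only for illustration.

```python
# Settings as they might look once settings.yaml is loaded (hypothetical values).
SETTINGS = {
    "default": {"debug": False, "chunk_size": 5000},
    "development": {"debug": True},
    "production": {"debug": False},
}


def settings_for(env_name, settings=SETTINGS):
    """Merge the `default` section with the section of the given environment."""
    merged = dict(settings.get("default", {}))
    # Values from the active environment take precedence over the defaults.
    merged.update(settings.get(env_name, {}))
    return merged


print(settings_for("development"))
print(settings_for("staging"))  # no such section here: only defaults apply
```

In this sketch, a setting that an environment does not redefine (chunk_size above) keeps the value from the default section.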

Setting names are case-insensitive

This is an important rule: each setting can be defined in upper or lower case, e.g. debug and DEBUG are the same setting.

Tip for contributors

As a consequence, you can define your settings in lower case because it's more readable in your YAML configuration:

debug: true

And use the upper case form in the code:

from data7.conf import settings


print(f"{settings.DEBUG=}")

Settings can be overridden using environment variables

Every setting can be overridden by defining the corresponding environment variable (in uppercase) prefixed by DATA7_, e.g. for the debug setting, you can define the DATA7_DEBUG=false environment variable to override the value defined in the settings.yaml file.
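
The precedence rule can be sketched in plain Python as follows. This is only an illustration of the lookup order, not the code Data7 or Dynaconf runs: the resolve function is hypothetical, and its value parsing is deliberately simplified to booleans and integers.

```python
import os


def resolve(name, file_settings, environ=None, prefix="DATA7_"):
    """Return the value for `name`, letting a prefixed environment variable win."""
    environ = os.environ if environ is None else environ
    env_key = prefix + name.upper()
    if env_key in environ:
        raw = environ[env_key]
        # Simplified parsing (assumption): only booleans and integers are converted.
        if raw.lower() in ("true", "false"):
            return raw.lower() == "true"
        try:
            return int(raw)
        except ValueError:
            return raw
    # Setting names are case-insensitive, so compare lowercased keys.
    lowered = {key.lower(): value for key, value in file_settings.items()}
    return lowered[name.lower()]


file_settings = {"debug": True, "port": 8000}
print(resolve("debug", file_settings, environ={"DATA7_DEBUG": "false"}))
print(resolve("PORT", file_settings, environ={}))
```

The first call returns the value from the environment variable, the second falls back to the value defined in the file.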

Use data7 init to bootstrap your configuration

Data7 comes with a CLI that can help you bootstrap your project (see the tutorial). Remember that the data7 init command generates the three required configuration files for you. Once they are generated, it's up to you to define your own environments and change setting values to suit your needs.

Configuration details

settings.yaml


DATASETS_ROOT_URL

The root URL that prefixes dataset URLs (e.g. the /d in /d/invoices.csv for the invoices dataset).

Default: /d
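
As an illustration of how the root URL and a dataset name combine, here is a small sketch. The dataset_url helper is hypothetical and only mirrors the /d/invoices.csv example above; it is not Data7's routing code.

```python
def dataset_url(basename, extension, root_url="/d"):
    """Build the path under which a dataset would be served."""
    # Strip a trailing slash so "/d" and "/d/" give the same result.
    return f"{root_url.rstrip('/')}/{basename}.{extension}"


print(dataset_url("invoices", "csv"))
print(dataset_url("invoices", "csv", root_url="/data/"))
```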


CHUNK_SIZE

Size of batches to process, i.e. the number of SQL query result rows to process at each iteration.

Default: 5000
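
To show what processing rows in batches means, here is a minimal sketch of chunked iteration. It assumes rows arrive as any Python iterable; the iter_chunks helper is hypothetical and is not taken from Data7's code base.

```python
from itertools import islice


def iter_chunks(rows, chunk_size=5000):
    """Yield lists of at most `chunk_size` rows from any iterable of rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 12 fake rows processed in batches of 5 give batches of 5, 5 and 2 rows.
rows = ({"id": i} for i in range(12))
print([len(chunk) for chunk in iter_chunks(rows, chunk_size=5)])
```

A smaller chunk size keeps fewer rows in memory at once, at the cost of more iterations.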


SCHEMA_SNIFFER_SIZE

The number of SQL query result rows used to infer a table schema (data types).

Default: 1000


DEFAULT_DTYPE_BACKEND

The backend used to infer data types while fetching data from the database. Possible values are numpy_nullable and pyarrow (see the pandas documentation).

Default: pyarrow


PROFILER_INTERVAL

From pyinstrument's documentation:

The minimum time, in seconds, between each stack sample. This translates into the resolution of the sampling.

Default: 0.001


PROFILER_ASYNC_MODE

From pyinstrument's documentation:

Configures how this Profiler tracks time in a program that uses async/await.

Default: enabled


DEBUG

Set to true to enable debugging mode: logs and server responses will be more explicit.

Default: false

Warning

We strongly recommend keeping the default value (false) when running Data7 in production.


PROFILING

Enables or disables server request profiling. When set to true, adding the ?profile=1 query parameter to an HTTP request returns the profiling analysis instead of the requested dataset.

Example query: http://localhost:8000/d/invoices.csv?profile=1

Default: false


HOST

The host the socket will be bound to. It can be an IPv4 or IPv6 address, or a fully qualified domain name (e.g. data7.example.org). Set it to 0.0.0.0 if you want your application to be available from your local network.

Default: None (required)


PORT

The port the socket will be bound to. It is typically set to 8000 for a Python application.

Default: None (required)


EXECUTION_ENVIRONMENT

Used by Sentry to track the environment in which issues are raised.

Default: None


SENTRY_DSN

The DSN of your Sentry project, e.g. https://account@sentry.io/project_id. When not set, Sentry integration is not active.

Default: None


SENTRY_TRACES_SAMPLE_RATE

The sample rate of traces sent to Sentry: 1.0 means 100% while 0.1 means 10%.

Default: 1.0


.secrets.yaml


DATABASE_URL

The URL that will be used by Data7 for database connections. It uses the classical pattern:

<database engine>://<user>:<password>@<host>:<port>/<database name>

Info

Data7 supports all asynchronous database engines supported by the databases library. Depending on your database engine, you may need to add the related database driver to your project dependencies.

Assuming your database user is data7, its password is secret, and the database you want to query is chinook, the following table lists the dependency to install and an example DATABASE_URL value for each database engine and driver.

Database     Dependency             Example value
PostgreSQL   psycopg[binary,pool]   postgresql+psycopg://data7:secret@localhost:5432/chinook
MySQL        mariadb-connector      mysql://data7:secret@localhost:3306/chinook
SQLite       -                      sqlite:///chinook.db
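
To check that a DATABASE_URL value matches the pattern above, you can split it into its parts with Python's standard library. This is only a sanity-check sketch: the describe_database_url helper is hypothetical and is not how Data7 itself reads the setting.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Break a database URL into the parts named in the URL pattern."""
    parts = urlsplit(url)
    return {
        "engine": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path starts with "/", which is not part of the database name.
        "database": parts.path.lstrip("/"),
    }


print(describe_database_url("postgresql+psycopg://data7:secret@localhost:5432/chinook"))
print(describe_database_url("sqlite:///chinook.db"))
```

For the SQLite URL, the user, password, host and port entries are None because the URL carries only a file name.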

data7.yaml


DATASETS

This is the core setting of your Data7 instance. DATASETS is a list of dataset definitions. Each dataset is defined by:

  • a basename: the base name of your dataset, used in its URL (e.g. /d/invoices.csv for the invoices basename) and therefore in the name of the file you get when you fetch its content.
  • a query: the SQL query that will be executed to fetch data.

Here are example definitions for the development environment:

datasets:
  # Base dataset exposing all table records
  #
  - basename: invoices
    query: "SELECT * FROM Invoice"

  # A more complex dataset using related tables
  #
  - basename: tracks
    query: >-
      SELECT Artist.Name as artist, Album.Title as title, Track.Name as track
      FROM Artist
      INNER JOIN Album ON Artist.ArtistId = Album.ArtistId
      INNER JOIN Track ON Album.AlbumId = Track.AlbumId
      ORDER BY Artist.Name, Album.Title

Tip

Remember that this file must be valid YAML, and each database query must be valid SQL. You can check both YAML and SQL validity using the data7 check command.
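
As a complement, the structure of dataset definitions (a basename and a query per entry) can be sketched as a simple check on already-loaded data. This is not how the data7 check command works; the check_datasets function is hypothetical, and the rule that basenames must be unique is an assumption based on each basename mapping to one URL.

```python
def check_datasets(datasets):
    """Return a list of problems found in a list of dataset definitions."""
    problems = []
    seen = set()
    for index, dataset in enumerate(datasets):
        if not isinstance(dataset, dict):
            problems.append(f"entry {index}: expected a mapping")
            continue
        for key in ("basename", "query"):
            value = dataset.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index}: missing or empty '{key}'")
        basename = dataset.get("basename")
        # Assumption: two datasets sharing a basename would share a URL.
        if isinstance(basename, str) and basename in seen:
            problems.append(f"entry {index}: duplicate basename '{basename}'")
        elif isinstance(basename, str):
            seen.add(basename)
    return problems


datasets = [
    {"basename": "invoices", "query": "SELECT * FROM Invoice"},
    {"basename": "tracks"},  # the query is missing on purpose
]
print(check_datasets(datasets))
```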