Importing your data

Amazon Personalize uses data that you provide to train a model. When you import data, you can import records in bulk, incrementally, or both. With incremental imports, you can add individual historical records, data from live events, or both, depending on your business requirements.

Your notebook provides information about importing historical data into Amazon Personalize. For information about recording live interactions data, see Recording Events.

To import your historical training data into Amazon Personalize, you will do the following:

  1. Create an empty dataset group. Dataset groups are domain-specific containers for related datasets. For more information, see Step 1: Creating a Dataset Group.
import json

import boto3

# Create an Amazon Personalize client from the default session
personalize = boto3.client('personalize')

create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-immersion-day-dsg"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']

print(json.dumps(create_dataset_group_response, indent=2))
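
Dataset group creation is asynchronous, so the group is not usable until its status is ACTIVE. Here is a minimal polling sketch, assuming the dataset_group_arn from above; the 60-second interval and the overall timeout are illustrative choices, not part of the original notebook:

import time

# Wait for the dataset group to finish creating before attaching datasets to it
max_time = time.time() + 3 * 60 * 60  # illustrative upper bound on waiting
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response['datasetGroup']['status']
    print("DatasetGroup: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)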

  2. For each type of dataset you are using, create an empty dataset with an associated schema. Datasets are Amazon Personalize containers for data, and schemas specify the contents of a dataset. For more information, see Step 2: Creating a Dataset and a Schema.
# Avro schema describing the fields in the interactions dataset
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-immersion-day-interactions-schema",
    schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']

print(json.dumps(create_schema_response, indent=2))
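
The schema by itself is only a definition; you still need a dataset that uses it. A minimal sketch of creating the interactions dataset follows, assuming the dataset_group_arn from step 1; the dataset name is an illustrative choice:

# Create the interactions dataset in the dataset group, using the schema above.
# The name below is illustrative; pick any name that suits your notebook.
create_dataset_response = personalize.create_dataset(
    name = "personalize-immersion-day-interactions",
    datasetType = "INTERACTIONS",
    datasetGroupArn = dataset_group_arn,
    schemaArn = interaction_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']

print(json.dumps(create_dataset_response, indent=2))
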
  3. Import your data:

# bucket_name, interactions_filename, and role_arn are defined earlier in the
# notebook: the S3 bucket and CSV file that hold the interactions data, and an
# IAM role that grants Amazon Personalize read access to that bucket.
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-immersion-day-import",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']

print(json.dumps(create_dataset_import_job_response, indent=2))
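
The import job is also asynchronous and typically takes several minutes even for small datasets. The same illustrative polling pattern as for the dataset group applies, this time against describe_dataset_import_job:

import time

# Wait for the import job to finish; the data is usable only once it is ACTIVE
max_time = time.time() + 3 * 60 * 60  # illustrative upper bound on waiting
while time.time() < max_time:
    describe_import_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_import_response['datasetImportJob']['status']
    print("DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)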

You will work through the steps above in your notebook!