The Upload New Data option on the Data Mining Server Management portal allows you to upload your data from a CSV-formatted file.
You can define the type of each column (Text, Numeric or Timestamp) and choose which columns to include or exclude when the data is imported.
Additionally, you can aggregate the data, create a sample field and derive additional time-based fields from an existing column. Please see the sections below for a more detailed explanation.
If a Timestamp type field is included in your data, the upload process can also perform the time-series functions Data Aggregation and Adding Temporal Data Fields.
Data Aggregation collects data within a time window and expresses the data in a summary form. This is particularly useful for statistical analysis on large datasets.
Data Aggregation is performed over the whole dataset and for each column that is included. In the example below we are aggregating the data over a 5-minute (300-second) window for bank loan applicants. This 5-minute window is based on the column whose type is defined as 'Timestamp'. For columns defined as 'Text', the most frequent value in the time window is used when aggregating that column. However, if you enter an 'event' into the text box for that column and the event exists in the time window, the event text is used for the column instead. If the event does not exist in the window, the most frequent value is used. For columns defined as 'Numeric', the average over the time window is used.
Please note that only one column with the type 'Timestamp' can be used per import for data aggregation.
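The aggregation rules described above can be illustrated with a short Python sketch. This is not the portal's own code: the function name aggregate_rows, the column names, and the choice to align windows to multiples of the window length are assumptions made purely for illustration, and timestamps are assumed to be plain seconds.

```python
from collections import Counter, defaultdict
from statistics import mean


def aggregate_rows(rows, ts_col, types, window_s=300, events=None):
    """Summarise rows per time window (hypothetical helper, not the portal's API).

    types maps column name -> 'Text' or 'Numeric'.
    events maps a Text column name -> the event text to look for.
    """
    events = events or {}
    buckets = defaultdict(list)
    for row in rows:
        # Assumption: windows start at multiples of window_s.
        start = int(row[ts_col] // window_s) * window_s
        buckets[start].append(row)

    summary = []
    for start in sorted(buckets):
        group = buckets[start]
        out = {ts_col: start}
        for col, kind in types.items():
            values = [r[col] for r in group]
            if kind == "Numeric":
                # Numeric columns: average over the window.
                out[col] = mean(values)
            else:
                event = events.get(col)
                if event is not None and event in values:
                    # The event text wins if it occurs in the window.
                    out[col] = event
                else:
                    # Otherwise use the most frequent value.
                    out[col] = Counter(values).most_common(1)[0][0]
        summary.append(out)
    return summary


# Made-up loan applicant rows; "ts" is the Timestamp column in seconds.
rows = [
    {"ts": 0, "income": 30000, "status": "approved"},
    {"ts": 120, "income": 50000, "status": "approved"},
    {"ts": 290, "income": 40000, "status": "default"},
    {"ts": 310, "income": 20000, "status": "rejected"},
    {"ts": 500, "income": 60000, "status": "rejected"},
]
types = {"income": "Numeric", "status": "Text"}
summary = aggregate_rows(rows, "ts", types, 300, events={"status": "default"})
```

In this made-up data the first window contains the event 'default', so that text is used even though 'approved' is more frequent; the second window falls back to its most frequent value.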
Temporal Data Fields allows the data mining algorithm to consider whether the target outcome is influenced by the field values in the time leading up to the target outcome's timestamp.
This option allows additional fields (derived from each Numeric field) to be included in the uploaded data table. Each additional field provides a summary of historical data points over a time window (specified in seconds) going back from each value's timestamp. The summaries derived over this time window are the Average, Minimum and Maximum values. The time window can also be divided into equal Sections (up to 4), with each Section providing the same set of derived summary fields. If the time window in seconds is left blank for a given numeric field, no summary fields are derived from that field.
The above selections will result in the following fields being included in the imported data table:
In the above example, the field Temperature_avg_600s represents the average temperature over the last 10 minutes, and Temperature_slope_600s provides a numeric indicator of the direction of change in the value over the last 10 minutes.
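As a rough illustration of how such trailing-window summaries could be computed, here is a minimal Python sketch. It is not the portal's implementation. The function name derive_temporal_fields is hypothetical, the _min_ and _max_ field names and the per-section suffix are guessed by analogy with Temperature_avg_600s, and the sketch assumes the window includes the current value. The slope field is left out because its exact definition is not given here.

```python
from statistics import mean


def derive_temporal_fields(points, name, window_s, sections=1):
    """points is a list of (timestamp_seconds, value) pairs.

    Returns one dict of derived fields per point. All field names other
    than <name>_avg_<window>s are assumptions made for this sketch.
    """
    if not 1 <= sections <= 4:
        raise ValueError("sections must be between 1 and 4")
    section_len = window_s / sections
    derived = []
    for t, _ in points:
        row = {}
        for s in range(sections):
            # Section 0 is the most recent slice of the window.
            hi = t - s * section_len
            lo = hi - section_len
            vals = [v for u, v in points if lo < u <= hi]
            tag = f"{window_s}s" if sections == 1 else f"{window_s}s_{s + 1}"
            # Sections with no historical points are left empty (None).
            row[f"{name}_avg_{tag}"] = mean(vals) if vals else None
            row[f"{name}_min_{tag}"] = min(vals) if vals else None
            row[f"{name}_max_{tag}"] = max(vals) if vals else None
        derived.append(row)
    return derived


# Made-up temperature readings taken every 5 minutes.
readings = [(0, 20.0), (300, 22.0), (600, 24.0), (900, 21.0)]
fields = derive_temporal_fields(readings, "Temperature", 600)
halves = derive_temporal_fields(readings, "Temperature", 600, sections=2)
```

With two sections, the 600-second window is split into two 300-second slices, each with its own average, minimum and maximum.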
You may also wish to consider adding columns to your data file (before the import) to represent time lagged values, by making a copy of a column and then shifting its data down or up to correspond to a time lag period.
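One possible way to prepare such a lagged column before the import is sketched below in Python. The function name add_lagged_column and the Temperature_lag1 naming are hypothetical, and the sketch assumes the rows are evenly spaced in time, so that shifting by a number of rows corresponds to a fixed time lag.

```python
def add_lagged_column(rows, column, lag_rows, new_name=None):
    """Copy a column and shift it down by lag_rows rows (up if negative).

    Rows with no lagged value available get an empty string, which
    becomes an empty cell when written back to CSV.
    """
    new_name = new_name or f"{column}_lag{lag_rows}"
    shifted = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        j = i - lag_rows
        new_row[new_name] = rows[j][column] if 0 <= j < len(rows) else ""
        shifted.append(new_row)
    return shifted


# Made-up readings taken every 60 seconds, so a lag of 1 row is 60 seconds.
rows = [
    {"t": 0, "Temperature": 20},
    {"t": 60, "Temperature": 21},
    {"t": 120, "Temperature": 23},
]
lagged = add_lagged_column(rows, "Temperature", 1)
```

The resulting rows can be written back out with csv.DictWriter from the standard library before uploading the file.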
Data Sampling allows you to analyse the data and identify patterns/trends in a subset of the data. This is particularly useful with large datasets.
Data Sampling is enabled by selecting Add Sample Data Field. You can then enter the percentage of the data that you would like to analyse and the name of the column that will be created, as can be seen below.
When the data is imported, an additional column is created called __sample (double underscore). The data is imported with the specified percentage of the rows having its __sample column set to 0 and the remaining percentage set to 1. The __sample column can then be used in the 'rows filter' option.
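The effect of the sample field can be pictured with the following Python sketch. How the portal actually chooses which rows receive 0 is not stated, so the random selection here is an assumption, as are the function name add_sample_field and its seed parameter.

```python
import random


def add_sample_field(rows, percent, name="__sample", seed=None):
    """Mark the given percentage of rows with 0 and the rest with 1.

    Assumption: the marked rows are picked at random.
    """
    rng = random.Random(seed)
    n_zero = round(len(rows) * percent / 100)
    zero_idx = set(rng.sample(range(len(rows)), n_zero))
    return [
        {**row, name: 0 if i in zero_idx else 1}
        for i, row in enumerate(rows)
    ]


# Ten made-up rows with 30% of them marked as the sample.
rows = [{"id": i} for i in range(10)]
sampled = add_sample_field(rows, 30, seed=1)
subset = [r for r in sampled if r["__sample"] == 0]
```

Filtering on the sample column, as in the last line, mirrors what the 'rows filter' option would do with the imported table.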
An optional extra first row in your CSV data can be used to customise the data or data types for the upload. The following values can be explicitly defined after the column name: