HELP!

Upload New Data

The Upload New Data page on the Data Mining Server Management portal allows you to upload your data from a CSV formatted file.

You can define the type of each column (Text, Numeric or Timestamp) and you can also choose to include or exclude any column when the data is imported.

Additionally, you can aggregate the data, create a sample field and derive additional temporal data fields from an existing column. Please see the sections below for a more detailed explanation.

If a Timestamp type field is included in your data then the upload process can also perform the time-series functions Data Aggregation and Adding Temporal Data Fields.

Data Aggregation

Data Aggregation collects data within a time window and expresses the data in a summary form. This is particularly useful for statistical analysis on large datasets.

Data Aggregation is performed over the whole dataset, for each column that is included. In the example below we are aggregating the data for bank loan applicants over a 5 minute (300 second) window. This window is based on the column whose type is defined as 'Timestamp'.

For columns defined as 'Text', the most frequent value in the time window is used when aggregating that column. However, if you enter an 'event' into the text box for that column and the event exists in the time window, then the event text is used for the column instead. If the event does not exist then the most frequent value is used. For columns defined as 'Numeric', the average over the time window is used.

Please note, only 1 column with the type 'Timestamp' can be used per import for data aggregation.
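
The aggregation rules above can be illustrated with a short Python sketch. This is not the portal's own code: the function name aggregate, the sample column names and the alignment of windows to whole multiples of the window length are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
from statistics import mean


def aggregate(rows, ts_col, window_s, numeric_cols, text_cols, events=None):
    """Group rows into fixed windows of window_s seconds and summarise each."""
    events = events or {}
    buckets = defaultdict(list)
    for row in rows:
        # Assumption: windows are aligned to multiples of window_s since the epoch.
        key = int(row[ts_col].timestamp() // window_s)
        buckets[key].append(row)
    out = []
    for key in sorted(buckets):
        group = buckets[key]
        summary = {ts_col: datetime.fromtimestamp(key * window_s, tz=timezone.utc)}
        for col in numeric_cols:
            # Numeric columns: average over the window.
            summary[col] = mean(r[col] for r in group)
        for col in text_cols:
            values = [r[col] for r in group]
            event = events.get(col)
            if event is not None and event in values:
                # The event text wins if it occurs anywhere in the window.
                summary[col] = event
            else:
                # Otherwise use the most frequent value.
                summary[col] = Counter(values).most_common(1)[0][0]
        out.append(summary)
    return out


def at(hour, minute, second):
    """Hypothetical helper to build sample timestamps in UTC."""
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


rows = [
    {"Time": at(10, 0, 10), "Income": 100, "Occupation": "Clerk"},
    {"Time": at(10, 2, 0), "Income": 200, "Occupation": "Clerk"},
    {"Time": at(10, 4, 59), "Income": 300, "Occupation": "NA"},
    {"Time": at(10, 5, 0), "Income": 500, "Occupation": "Nurse"},
]
plain = aggregate(rows, "Time", 300, ["Income"], ["Occupation"])
with_event = aggregate(rows, "Time", 300, ["Income"], ["Occupation"],
                       events={"Occupation": "NA"})
```

In this sample the first three rows fall into one 300 second window. Without an event the Occupation value for that window is the most frequent one, and with the event 'NA' the event text is used because it occurs in the window.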

Adding Temporal Data Fields

Temporal Data Fields allow the data mining algorithm to consider whether the target outcome is influenced by the field values in the time leading up to the target outcome’s timestamp.

This option allows additional fields (derived from each Numeric field) to be included in the uploaded data table. Each additional field provides a summary of historical data points going back in time from each value’s timestamp, over a time window specified in seconds. The summaries derived over this time window are the Average, Minimum and Maximum values. The time window can also be divided into equal Sections (up to 4), with each Section providing the same set of derived summary fields. If the time window in seconds is left blank for a given numeric field then no summary fields are derived from that field.

The above selections will result in the following fields being included in the imported data table:

In the above example, the field Temperature_avg_600s represents the average temperature over the last 10 minutes and Temperature_slope_600s provides a numeric indicator of the direction of change in the value over the last 10 minutes.
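
As a rough illustration of how the average, minimum and maximum summary fields could be derived, here is a minimal Python sketch. The function name add_temporal_fields is hypothetical, the _min_ and _max_ field names are assumed by analogy with Temperature_avg_600s, and the exact window boundaries used by the upload process are an assumption. Sections and the slope field are not covered.

```python
from datetime import datetime, timedelta


def add_temporal_fields(rows, ts_col, field, window_s):
    """Append avg/min/max of `field` over the window_s seconds up to each row."""
    out = []
    for row in rows:
        start = row[ts_col] - timedelta(seconds=window_s)
        # Assumption: the window includes the current row and excludes
        # rows that are exactly window_s seconds old.
        window = [r[field] for r in rows if start < r[ts_col] <= row[ts_col]]
        derived = dict(row)
        derived[f"{field}_avg_{window_s}s"] = sum(window) / len(window)
        derived[f"{field}_min_{window_s}s"] = min(window)
        derived[f"{field}_max_{window_s}s"] = max(window)
        out.append(derived)
    return out


t0 = datetime(2024, 1, 1, 10, 0, 0)
readings = [
    {"Time": t0, "Temperature": 20},
    {"Time": t0 + timedelta(minutes=5), "Temperature": 22},
    {"Time": t0 + timedelta(minutes=10), "Temperature": 27},
]
enriched = add_temporal_fields(readings, "Time", "Temperature", 600)
```

Each row in enriched keeps its original values and gains three fields summarising the preceding 600 seconds.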

You may also wish to consider adding columns to your data file (before the import) to represent time lagged values, by making a copy of a column and then shifting its data down or up to correspond to a time lag period.
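
If you prepare your file with a script rather than a spreadsheet, a time lagged column can be added as in the sketch below. It assumes the rows are sorted by time and evenly spaced, so that shifting down by a number of rows corresponds to a fixed time lag. The function name add_lagged_column and the _lag suffix are illustrative only, and only a downward shift (lag_rows of 1 or more) is handled.

```python
def add_lagged_column(rows, field, lag_rows):
    """Copy `field` into a new column shifted down by lag_rows rows."""
    name = f"{field}_lag{lag_rows}"
    out = []
    for i, row in enumerate(rows):
        lagged = dict(row)
        # The first lag_rows rows have no earlier value to copy.
        lagged[name] = rows[i - lag_rows][field] if i >= lag_rows else None
        out.append(lagged)
    return out


series = [{"Temperature": 20}, {"Temperature": 22}, {"Temperature": 27}]
lagged_series = add_lagged_column(series, "Temperature", 1)
```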

Data Sampling

Data Sampling allows you to analyse the data and identify patterns/trends in a subset of the data. This is particularly useful with large datasets.

Data Sampling is enabled by selecting the Add Sample Data Field option. When selected, it allows you to enter the percentage of the data that you would like to analyse and the name of the column that will be created, as can be seen below.

When the data is imported an additional column is created called __sample (double underscore). The data is imported with the specified percentage of the rows having their __sample column set to 0 and the remaining rows set to 1. The __sample column can then be used in the 'rows filter' option.
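
The effect of the __sample column can be pictured with the following Python sketch. How the upload process chooses which rows receive 0 is not stated here, so the random selection, the seed parameter and the function name add_sample_field are assumptions.

```python
import random


def add_sample_field(rows, percent, seed=0):
    """Set __sample to 0 for `percent` of the rows and to 1 for the rest."""
    rng = random.Random(seed)
    n_zero = round(len(rows) * percent / 100)
    # Assumption: rows are chosen at random, without replacement.
    chosen = set(rng.sample(range(len(rows)), n_zero))
    return [
        {**row, "__sample": 0 if i in chosen else 1}
        for i, row in enumerate(rows)
    ]


applicants = [{"Id": i} for i in range(10)]
sampled = add_sample_field(applicants, 30)
subset = [row for row in sampled if row["__sample"] == 0]
```

Filtering on __sample equal to 0, as in the last line, mirrors what the 'rows filter' option would do with the imported column.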

Advanced Uploading Options

You can add an optional extra first row to your CSV data to customise the data or data types for the upload. The following values can be explicitly defined after the column name:

[field name]:t This converts a numeric column into a discrete column, e.g. Age:t
[field name]:ts This defines the column as a timestamp. Please note the timestamp must be in dd/mm/yyyy hh:mm:ss format, e.g. TimeStamp:ts
[field name]:[seconds] This option must be used with temporal data. Replace [seconds] with the length of the time window in seconds. If a timestamp column has been defined, the column's data is split into sections, which are then split into subsections, and the average, min and max values are computed from these, e.g. Age:300
[field name]:[event] This option must be used with data aggregation. Replace [event] with any text. If the column is discrete and the event text is found in the time period then that text is used; if it is not found then the most frequent value in the time period is used, e.g. Occupation:NA
[field name]:t:[event] This option must be used with data aggregation. If the column is numeric you can convert it to discrete using :t and also supply an :[event]. Replace [event] with any text. If the event text is found in the time period then that text is used; if it is not found then the most frequent value in the time period is used, e.g. Age:t:NA
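
To make the suffix rules concrete, the sketch below shows one way such header cells and timestamps could be interpreted. It is not the portal's parser: the names parse_header and parse_timestamp and the returned dictionary layout are hypothetical, and treating an all-digit suffix as a window length rather than as event text is an assumption.

```python
from datetime import datetime


def parse_header(cell):
    """Split a header cell such as 'Age:t:NA' into its name and options."""
    name, *opts = cell.split(":")
    spec = {"name": name, "type": None, "window_s": None, "event": None}
    if not opts:
        return spec
    if opts[0] == "ts":
        spec["type"] = "timestamp"
    elif opts[0] == "t":
        spec["type"] = "discrete"
        if len(opts) > 1:
            spec["event"] = opts[1]
    elif opts[0].isdigit():
        # Assumption: an all-digit suffix is a time window in seconds.
        spec["window_s"] = int(opts[0])
    else:
        spec["event"] = opts[0]
    return spec


def parse_timestamp(text):
    """Parse a timestamp in the required dd/mm/yyyy hh:mm:ss format."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


header = ["TimeStamp:ts", "Age:300", "Occupation:NA", "Income"]
specs = [parse_header(cell) for cell in header]
```

Note that with these rules an event text that is itself 't', 'ts' or a whole number could not be told apart from the other options.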
