Social Icons

COLLECT STATISTICS

teradata Collect Statistics
The Purpose of  COLLECT STATISTICS is to gather and store demographic data for one or more columns or indices of a table or join index.In this process it collects data and stores the summary in the Data Dictionary (DD) inside USER DBC.
The optimizer uses this synopsis data to generate efficient table access and join plans.

Below are the statistics will collect
  • The number of rows in the table
  • The average row size
  • Information on all Indexes in which statistics were collected
  • The range of values for the column(s) in which statistics were collected
  • The number of rows per value for the column(s) in which statistics were collected
  • The number of NULLs for the column(s) in which statistics were collected
What to COLLECT STATISTICS On?

Do NOT COLLECT Statistics on all columns and indexes because it takes too much space.
  • Primary Index of a Join Index
  • Secondary Indexes defined on any join index
  • Non-indexed columns used in joins
  • The Unique Primary Index of small tables (less than 1,000 rows per AMP)
  • All Non-Unique Primary Indexes and All Non-Unique Secondary Indexes
  • Join index columns that frequently appear on any additional join index columns that frequently appear in WHERE search conditions
  • The Unique Primary Index of small tables (less than 1,000 rows per AMP)
  • Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins
Syntax for COLLECT STATISTICS

COLLECT STATISTICS on COLUMN ;
COLLECT STATISTICS on INDEX ();


Examples for a better Understanding

COLLECT STATISTICS on Emp_Table ;
COLLECT STATISTICS on Emp_Table COLUMN Dept_no ;
COLLECT STATISTICS on Emp_Table COLUMN(Emp_no, Dept_no);
COLLECT STATISTICS on Emp_Table INDEX Emp_no ;
COLLECT STATISTICS on Emp_Table INDEX (First_name, Last_name);


The New Teradata V14 Way to Collect Statistics


COLLECT STATISTICS COLUMN(First_Name, Last_Name) 
                                         ,COLUMN(First_Name)
                                         ,COLUMN(Dept_No) ON Employee_Table;


With the latest approach, Only a single table scan is required instead multiple table scan like old approach.

Teradata keeps all Collect Statistics information in DBC in the Data Dictionary tables. Those are
  1. DBC.Indexes (for multi-column indexes only)
  2. DBC.TVFields (for all columns and single column indexes)
  3. DBC.StatsTbl (Teradata V14 and beyond)
Collect Statistics on a PPI Table on the Partition
COLLECT STATISTICS on Order_Table_PPI COLUMN PARTITION

The main reasons to collect stats on Partiton are
  • PE will give better plan for PPI tables
  • It helps on  Partition Elimination on Range Queries.
  • Mainly helpful when a table has a lot of empty partitions.
Teradata V14 Statistics Enhancements

1) There is a SUMMARY option to collect table-level statistics
Table-level statistics known as "summary statistics" are collected whenever column or index statistics are collected.

SHOW SUMMARY STATISTICS VALUES ON Employee_Table;

One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count from the table in order to provide more accurate extrapolations

2) SYSTEM SAMPLE option allows the system to determine the sampled system percentage.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level rather than at the system level
COLLECT STATISTICS    USING Sample 20 Percent   COLUMN (Last_Name) ON Employee_Table;

3) Statistics are stored in DBC.StatsTbl to reduce access contention and improve performance.

4) New views DBC.StatsV, DBC.ColumnStatsV, DBC.MultiColumnStatsV, and IndexStatsV.

5) SHOW STATISTICS statement reports detailed statistics in plain text or XML formatting.

6) MaxValueLength
Teradata only placed the first 16 bytes in the statistics before Version 14 but now the default is 25 bytes.
COLLECT STATISTICS USING MAXVALUELENGTH 50  COLUMN (Item_Name)  ON Shipping_Table  
7) MaxIntervals 
Whenever you collected statistics, Teradata did a full table scan on the values, sorted them, and then placed them into 200 intervals. Now, the default is 250 intervals

8)Expressions can now be specified in COLLECT STATS statements

COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (EXTRACT(MONTH FROM Order_Date), Order_Total) AS Month_Plus_Total ON Order_Table;
Also use Substr(), MOD, CONCATENATION, Format Conversion expressions like Uppercase, Lowercase, Data type Conversions, CASE-Expressions, BEGIN and END expressions of PERIOD types.

9)AutoStats feature
Teradata V14.10 Autostats identifies and collects missing statistics needed and detects stale statistics for refreshing. If statistics are not being used by optimizer, they will be removed.
A new repository of a system supplied database named TDSTATS stores metadata for all stats collections. This is created by a new process called DIPSTATS.

10) External stored procedures (XSPs) perform the stats management process.

These are the latest enhancements of COLLECT STATISTICS.

1 comment:

  1. COLLECT STATISTICS COLUMN(First_Name, Last_Name)
    ,COLUMN(First_Name)
    ,COLUMN(Dept_No) ON Employee_Table;
    .........................................
    In this case it scans two times, Because deptno is not selected in first scan.

    COLLECT STATISTICS COLUMN(First_Name, Last_Name)
    ,COLUMN(First_Name)
    ,COLUMN(Last_Name) ON Employee_Table;

    In this case it go for single scan, because first_name & last_name data already available while collecting on the combination. Please correct me if I am wrong...

    Murali Udayagiri
    DWH Evangelist

    ReplyDelete