How do I use and manage Amazon Redshift WLM memory allocation?

Last updated: 2020-10-06

I'm trying to check the concurrency level and the Amazon Redshift workload management (WLM) memory allocation to the queues. How does WLM allocation work, and when should I use it?

Short description

Amazon Redshift workload management (WLM) allows you to manage and define multiple query queues. WLM routes queries to the appropriate queues and allocates memory to them at runtime. Without this management, some queries might consume more cluster resources and affect the performance of other queries.

You can configure workload management to manage resources effectively in either of these ways:

  • Automatic WLM: Allows Amazon Redshift to manage the concurrency level of the queues and the memory allocation for each dispatched query. You can define the query priority of the workloads or user groups mapped to each of the query queues.
  • Manual WLM: Allows you to have more control over the concurrency level and memory allocation of the queues. As a result, queries that consume more resources can run in queues that have more resources allocated to them.

Note: To define metrics-based performance boundaries, use a query monitoring rule (QMR) along with your workload management configuration.

To check the concurrency level and WLM allocation to the queues, perform the following steps:

1.    Check the current WLM configuration of your Amazon Redshift cluster.

2.    Create a test workload management configuration, specifying the query queue's distribution and concurrency level.

3.    (Optional) If you are using manual WLM, determine how the memory is distributed between the slot counts.

Resolution

Checking your WLM configuration and memory usage

Start by checking the current WLM configuration of your Amazon Redshift cluster. Here's an example of a WLM configuration in JSON format:

[
  {
    "query_concurrency": 2,
    "memory_percent_to_use": 30,
    "query_group": [],
    "query_group_wild_card": 0,
    "user_group": [
      "user_group1"
    ],
    "user_group_wild_card": 0,
    "rules": [
      {
        "rule_name": "BlockstoDiskSpill",
        "predicate": [
          {
            "metric_name": "query_temp_blocks_to_disk",
            "operator": ">",
            "value": 50
          }
        ],
        "action": "abort"
      }
    ]
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 40,
    "query_group": [],
    "query_group_wild_card": 0,
    "user_group": [
      "user_group2"
    ],
    "user_group_wild_card": 0
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 10,
    "query_group": [],
    "query_group_wild_card": 0,
    "user_group": [],
    "user_group_wild_card": 0,
    "rules": [
      {
        "rule_name": "abort_query",
        "predicate": [
          {
            "metric_name": "scan_row_count",
            "operator": ">",
            "value": 1000
          }
        ],
        "action": "abort"
      }
    ]
  },
  {
    "query_group": [],
    "query_group_wild_card": 0,
    "user_group": [],
    "user_group_wild_card": 0,
    "auto_wlm": false
  },
  {
    "short_query_queue": false
  }
]

Note: In this example, the WLM configuration is in JSON format and uses query monitoring rules (one on Queue 1 and one on the default queue).

In the WLM configuration, the "memory_percent_to_use" property represents the actual amount of working memory assigned to the service class. Note that Amazon Redshift allocates memory from the shared resource pool in your cluster. Queue 1 has a memory allocation of 30%, which is divided into two equal slots, so each slot gets a 15% share of the current memory allocation. Queue 2 has a memory allocation of 40%, which is divided into five equal slots, so each slot gets 8% of the memory allocation. The default queue uses 10% of the memory allocation with a queue concurrency level of 5.
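The per-slot arithmetic above can be sketched as follows (a minimal Python illustration; the percentages and concurrency levels come from the example configuration, not from a live cluster):

```python
def per_slot_percent(memory_percent: float, concurrency: int) -> float:
    """Each slot in a manual WLM queue gets an equal share of the queue's memory."""
    return memory_percent / concurrency

# Values from the example WLM configuration above
print(per_slot_percent(30, 2))  # Queue 1: 15.0% per slot
print(per_slot_percent(40, 5))  # Queue 2: 8.0% per slot
print(per_slot_percent(10, 5))  # Default queue: 2.0% per slot
```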

Use the following query to check the service class configuration for Amazon Redshift WLM:

select rtrim(name) as name,
num_query_tasks as slots,
query_working_mem as mem,
max_execution_time as max_time,
user_group_wild_card as user_wildcard,
query_group_wild_card as query_wildcard
from stv_wlm_service_class_config
where service_class > 4;

Here's an example output:

                        name                        | slots | mem | max_time | user_wildcard | query_wildcard
----------------------------------------------------+-------+-----+----------+---------------+----------------
 Service class for super user                       |     1 | 297 |        0 | false         | false
 Queue 1                                            |     2 | 522 |        0 | false         | false
 Queue 2                                            |     5 | 278 |        0 | false         | false
 Default queue                                      |     5 |  69 |        0 | false         | false
 Service class for vacuum/analyze                   |     0 |   0 |        0 | false         | false

Queue 1 has a slot count of 2, and the memory allocated to each slot is 522 MB. This value represents the actual amount of current working memory, in MB per slot for each node, assigned to the service class.

Note: If not all of the query slots in a queue are used, then the memory for the unused slots remains unallocated. For example, if the query concurrency is set to "5" but the queue currently uses only one slot, the memory from the four unused slots remains unallocated and can be used by the query running in the current slot.

Identifying high-level tuning parameters

To identify which queries are getting queued and how much memory they consume, check the following system views:

  • Use the SVL_QUERY_METRICS_SUMMARY view to check the detailed execution metrics for a query. The "query_queue_time" column indicates how long the query waited in the queue for a WLM slot before it ran.
  • Use the SVL_QUERY_SUMMARY view to check the memory consumption for each step of the query, even if the query ran in-memory.

dev=# select userid, query, service_class, query_cpu_time, query_blocks_read, query_execution_time, query_cpu_usage_percent, query_temp_blocks_to_disk, query_queue_time  from SVL_QUERY_METRICS_SUMMARY where query=29608;

 userid | query | service_class | query_cpu_time | query_blocks_read | query_execution_time | query_cpu_usage_percent | query_temp_blocks_to_disk | query_queue_time
--------+-------+---------------+----------------+-------------------+----------------------+-------------------------+---------------------------+------------------
    100 | 29608 |             8 |             18 |                  942 |                   64 |                   10.05 |                           |
(1 row)


dev=# select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 29608
order by workmem desc;

 query | step |   rows   | workmem  |                  label                  | is_diskbased
-------+------+----------+----------+-----------------------------------------+--------------
 29608 |    3 |    49999 | 54263808 | hash   tbl=714                          | f
 29608 |    2 |    49999 |        0 | project                                 | f
 29608 |    0 |    49999 |        0 | scan   tbl=255079 name=part             | f
 29608 |    1 |    49999 |        0 | project                                 | f
 29608 |    6 |  1561938 |        0 | return                                  | f
 29608 |    4 |  1561938 |        0 | project                                 | f
 29608 |    5 |  1561938 |        0 | project                                 | f
 29608 |    2 | 29995220 |        0 | project                                 | f
 29608 |    1 |  1561938 |        0 | return                                  | f
 29608 |    1 | 29995220 |        0 | project                                 | f
 29608 |    0 |  1561938 |        0 | scan   tbl=4893 name=Internal Worktable | f
 29608 |    3 |  1561938 |        0 | hjoin  tbl=714                          | f
 29608 |    0 | 29995220 |        0 | scan   tbl=255087 name=lineorder        | f
(13 rows)

Use the SVL_QUERY_SUMMARY view to obtain a detailed view of resource allocation during each step of the query. Check the is_diskbased and workmem columns to view the resource consumption. For more information, see Analyzing the query summary.

Updating to WLM dynamic configuration properties

You can also use WLM dynamic configuration properties to adjust to changing workloads. You can apply dynamic properties to the database without a cluster reboot. However, WLM static configuration properties require a cluster reboot for changes to take effect.

Here's an example of a cluster that is configured with two queues:

Queue    Concurrency    % Memory to Use            
1        5              60%
2        5              40%

If the cluster has 200 GB of available memory, then the current memory allocation for each of the queue slots might look like this:

Queue 1: (200 GB * 60% ) / 5 slots  = 24 GB
Queue 2: (200 GB * 40% ) / 5 slots  = 16 GB

To accommodate a changed workload, you can dynamically update the WLM configuration properties like this:

Queue    Concurrency    % Memory to Use
1        3              75%
2        4              25%

As a result, the memory allocation has been updated to accommodate the changed workload:

Queue 1: (200 GB * 75% ) / 3 slots = 50 GB
Queue 2: (200 GB * 25% ) / 4 slots = 12.5 GB

Note: If there are any queries running in the WLM queue during a dynamic configuration update, Amazon Redshift waits for those queries to complete. After the queries complete, Amazon Redshift applies the updated settings to the cluster.
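The before-and-after arithmetic can be sketched as follows (a minimal Python illustration using the 200 GB example above; the total memory figure is hypothetical):

```python
def slot_memory_gb(total_gb: float, memory_percent: float, slots: int) -> float:
    """Memory per slot = (total memory * queue's memory fraction) / slot count."""
    return total_gb * (memory_percent / 100) / slots

total = 200  # GB of available cluster memory in the example

# Before the dynamic update
print(slot_memory_gb(total, 60, 5))  # Queue 1: 24.0 GB per slot
print(slot_memory_gb(total, 40, 5))  # Queue 2: 16.0 GB per slot

# After the dynamic update
print(slot_memory_gb(total, 75, 3))  # Queue 1: 50.0 GB per slot
print(slot_memory_gb(total, 25, 4))  # Queue 2: 12.5 GB per slot
```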

Use the STV_WLM_SERVICE_CLASS_CONFIG table to monitor the cluster while the transition to the new dynamic WLM configuration is in process. When the num_query_tasks (concurrency) and query_working_mem (dynamic memory percentage) columns reach their target values, the transition is complete.

Identifying insufficient memory allocated to the query

If a query execution plan in SVL_QUERY_SUMMARY has an is_diskbased value of "true", then consider allocating more memory to the query. You can allocate more memory by increasing the number of query slots used. For more information, see Step 1: Override the concurrency level using wlm_query_slot_count.

Note: It's a best practice to first identify the step that is causing a disk spill. Then, decide if allocating more memory to the queue can resolve the issue. Alternatively, you can optimize your query.
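To illustrate why increasing wlm_query_slot_count helps: a query that claims multiple slots receives the combined memory of those slots. This sketch reuses Queue 1's 522 MB-per-slot figure from the earlier STV_WLM_SERVICE_CLASS_CONFIG output (an illustrative Python calculation, not Redshift behavior you run directly):

```python
def query_memory_mb(slot_mem_mb: float, wlm_query_slot_count: int) -> float:
    """A query occupying several WLM slots gets the combined memory of those slots."""
    return slot_mem_mb * wlm_query_slot_count

# Queue 1 from the earlier output: 522 MB per slot, 2 slots total
print(query_memory_mb(522, 1))  # 522 MB with the default single slot
print(query_memory_mb(522, 2))  # 1044 MB when the query takes both slots
```

Note that while a query holds extra slots, fewer slots remain for other queries in that queue, so raising the slot count trades concurrency for per-query memory.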

