A brief update on NUMA and MySQL

Some time ago, I wrote a rather popular post, “The MySQL ‘swap insanity’ problem and the effects of the NUMA architecture” (if you haven’t read it, stop now and do that!), which described using numactl --interleave=all to balance memory allocation across nodes in a NUMA system.

I should’ve titled it differently

In reality, the problem posed by uneven allocation across nodes under NUMA is not entirely a swapping problem. I titled and framed the previous post as I did largely to address a specific problem seen in the MySQL community. However, the problem described actually has very little to do with swap itself. It is really about Linux’s behavior under memory pressure, and specifically the pressure imposed by running a single NUMA node (especially node 0) completely out of memory.

Even when swap is disabled completely, problems still occur, usually in the form of extremely slow performance and failed memory allocations.

A more thorough solution

The original post also addressed only one part of the solution: using interleaved allocation. A complete and reliable solution actually requires three things, as we found when implementing this change for production systems at Twitter:

  • Forcing interleaved allocation with numactl --interleave=all. This is exactly as described previously, and works well.
  • Flushing Linux’s buffer caches just before mysqld startup with sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache.
  • Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immediately, while the buffer cache is still clean from the above flush.

These changes are implemented in Twitter MySQL 5.5 as the mysqld_safe options numa-interleave and flush-caches, and the mysqld option innodb_buffer_pool_populate, respectively.
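
The three steps can be sketched roughly as follows. This is a minimal illustration in Python, not the Twitter MySQL implementation: startup_commands and prefault_anonymous are hypothetical names, the mysqld path in the usage note is a placeholder, and the sketch assumes Linux. It only builds the command lines rather than running them, since flushing caches requires root.

```python
import mmap


def startup_commands(mysqld_argv):
    """Return the commands a wrapper would run, in order (hypothetical helper)."""
    return [
        # Flush the page cache, dentries and inodes just before startup.
        ["sysctl", "-q", "-w", "vm.drop_caches=3"],
        # Start mysqld with its allocations interleaved across all NUMA nodes.
        ["numactl", "--interleave=all", *mysqld_argv],
    ]


def prefault_anonymous(length):
    """Map `length` bytes of anonymous memory and force every page to be backed."""
    # mmap.MAP_POPULATE is only exposed by newer Python versions on Linux;
    # fall back to 0 (no flag) when it is missing.
    populate = getattr(mmap, "MAP_POPULATE", 0)
    # fileno -1 requests an anonymous mapping.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | populate)
    if not populate:
        # Fallback in the spirit of memset: write one byte to each page so the
        # kernel has to allocate it now instead of on first use.
        for offset in range(0, length, mmap.PAGESIZE):
            region[offset] = 0
    return region
```

A wrapper could pass each list from startup_commands(["/usr/sbin/mysqld"]) to subprocess.run, while the prefault step belongs inside the server process itself, because the pages have to be touched by the process that runs under the interleave policy.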

The results

On a production machine with 144GB of RAM and a 120GB InnoDB buffer pool, all used memory was allocated within 152 pages (0.00045%) of a perfect balance across both NUMA nodes:

    N0        :     16870335 ( 64.36 GB)
    N1        :     16870183 ( 64.35 GB)
    active    :           81 (  0.00 GB)
    anon      :     33739094 (128.70 GB)
    dirty     :     33739094 (128.70 GB)
    mapmax    :          221 (  0.00 GB)
    mapped    :         1467 (  0.01 GB)
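
The balance figure can be re-derived from the two per-node page counts. The short Python check below assumes 4 KiB pages and GB meaning 2**30 bytes, which is what the GB column in the listing is consistent with; the function names are illustrative only.

```python
PAGE_SIZE = 4096  # bytes; an assumption, consistent with the GB column above


def imbalance(pages_by_node):
    """Return (page difference, difference as a percentage of all pages)."""
    counts = list(pages_by_node.values())
    diff = max(counts) - min(counts)
    return diff, 100.0 * diff / sum(counts)


def pages_to_gb(pages, page_size=PAGE_SIZE):
    """Convert a page count to GB, taking 1 GB as 2**30 bytes."""
    return pages * page_size / 2**30


nodes = {"N0": 16870335, "N1": 16870183}
diff, pct = imbalance(nodes)
print(diff, f"{pct:.5f}%")                   # 152 0.00045%
print(f"{pages_to_gb(nodes['N0']):.2f} GB")  # 64.36 GB
```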
    

The buffer pool itself was allocated within 4 pages of a perfect balance (line-wrapped for clarity):

    2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486
      N0=16679245 N1=16679241
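
The line above looks like an entry from Linux's /proc/<pid>/numa_maps (an assumption; the post does not name its source). A minimal Python sketch for pulling the per-node counts out of such a line might look like this, with parse_numa_maps_line being a hypothetical helper rather than an existing tool:

```python
import re


def parse_numa_maps_line(line):
    """Split one numa_maps-style line into address, policy, and integer fields."""
    address, policy, *fields = line.split()
    counts, nodes = {}, {}
    for field in fields:
        key, sep, value = field.partition("=")
        if not sep or not value.isdigit():
            continue  # skip non-numeric fields such as "heap" or "file=/path"
        if re.fullmatch(r"N\d+", key):
            nodes[key] = int(value)   # per-node page counts (N0, N1, ...)
        else:
            counts[key] = int(value)  # totals such as anon= and dirty=
    return address, policy, counts, nodes


line = ("2aaaab2db000 interleave=0-1 anon=33358486 dirty=33358486 "
        "N0=16679245 N1=16679241")
address, policy, counts, nodes = parse_numa_maps_line(line)
print(policy)                                     # interleave=0-1
print(max(nodes.values()) - min(nodes.values()))  # 4
print(sum(nodes.values()) == counts["anon"])      # True
```

The last check confirms that the two node counts add up to the anon total, so no pages of this mapping are unaccounted for.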
    

Much more importantly, these systems have been extremely stable and have not experienced the “random” stalls under heavy load that we had seen before.



    Planet MySQL © 1995, 2013, Oracle Corporation and/or its affiliates   Legal Policies | Your Privacy Rights | Terms of Use

    Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.