A journey in debugging a Django issue.
Yesterday, we started to see this crash in develop
after we deployed a new backend.
error (2027, 'Malformed packet')
<class 'django.db.utils.OperationalError'> on line 513
Where line 513 executed the QuerySet.
This issue only occurs when querying two IDs, 0475776
and 0975933
. All other single ID and all multiple ID queries work, even those with either or both of these IDs.
This issue also only occurred on develop
, not locally, even though the database is the same.
According to MySQL bug 77298, prior to 5.7.8, a bug eerily similar to mine caused this issue:
For small values of the read_rnd_buffer_size system variable, internal caching of temporary results could fail and cause query execution failure.
Since we are using a newer version, this couldn't be the issue.
According to a StackExchange thread, we may have max_allowed_packet
set too low. However, we are able to get the data for this query by modifying some parameters, so this cannot be the issue either. Further, our Wireshark analysis shows all packets are smaller than the default value of 4194304
bytes.
LIMIT
, even if it is larger than the data returned, will make it workprov_id
items have this issueprov_id
with a prov_id
that has no rows worksThis is a perplexing problem. I did not want to monkey patch it like this when the QuerySet gets executed:
try:
ingested_df = pd.DataFrame.from_records(data=ingested, coerce_float=True)
except Exception as e:
LOGGER.error('%s: %s, on line %s', e, e.__class__, e.__traceback__.tb_lineno)
LOGGER.error('Monkey patching query with `LIMIT = 99999999999')
ingested = ingested[:99999999999]
ingested_df = pd.DataFrame.from_records(data=ingested, coerce_float=True)
Instead, I tried to make a few changes to the QuerySet chaining steps.
# Initial QuerySet, filter on dates, med status, and provider
ingested = Model.objects.filter(Q(beg_date__range=prior_range) |
Q(beg_date__range=current_range),
med_cat='N',
prov_id__in=facility_ids,
).values(
'prov_id',
'beg_date',
'method',
'category',
'code',
'allowed',
'count',
'charge'
)
# Handle removing nulls and blanks
if remove_null:
ingested = ingested.exclude(Q(category__isnull=True) | Q(code__isnull=True))
if remove_blank:
# Handle case where empty strings were stored as string literal `""`
ingested = ingested.exclude(Q(category__exact='') | Q(code__exact='') |
Q(category__exact='""') | Q(code__exact='""'))
# Handle excluding categories from api list param
ingested = ingested.exclude(category__in=excluded_categories)
# Handle payment type filter
if payment_method != 'all':
ingested = ingested.filter(method=payment_method)
Manually making the query by writing the same thing in raw SQL yields the same results.
SELECT `tables_ingesting`.`clm_id_svc_ln_num`,
`tables_ingesting`.`prov_id`,
`tables_ingesting`.`beg_date`,
`tables_ingesting`.`method`,
`tables_ingesting`.`category`,
`tables_ingesting`.`code`,
`tables_ingesting`.`allowed`,
`tables_ingesting`.`count`,
`tables_ingesting`.`charge`,
`tables_ingesting`.`count`
FROM `tables_ingesting`
WHERE ( ( `tables_ingesting`.`beg_date` BETWEEN '2018-10-01' AND '2018-12-31'
OR `tables_ingesting`.`beg_date` BETWEEN '2019-01-01' AND '2019-03-31' )
AND `tables_ingesting`.`prov_id` IN ( '0475776' )
AND `tables_ingesting`.`medcr_ind` = 'N'
AND NOT (( `tables_ingesting`.`category` IS NULL
OR `tables_ingesting`.`code` IS NULL ))
AND NOT (( `tables_ingesting`.`category` = '""'
OR `tables_ingesting`.`code` = '""'
OR `tables_ingesting`.`category` = ""
OR `tables_ingesting`.`code` = "" ))
);
Changing anything about the queries make them successful: for example, 2019-03-31
end date fails, but using 2019-03-30
works, even though neither of those dates have any data (thus the data is the same). Removing any of the filters will make it work as well, even when the filter does not match any rows.
Adding a LIMIT
clause to the query makes it work, even if the limit is greater than the number of rows returned.
The query Django prints out returns the correct data when sent directly to MySQL:
mysql> SELECT `tables_ingesting`.`prov_id`, `tables_ingesting`.`beg_date`, `tables_ingesting`.`method`, `tables_ingesting`.`category`, `tables_ingesting`.`code`, `tables_ingesting`.`allowed`, `tables_ingesting`.`count`, `tables_ingesting`.`charge` FROM `tables_ingesting` WHERE ((`tables_ingesting`.`beg_date` BETWEEN '2018-10-01' AND '2018-12-31' OR `tables_ingesting`.`beg_date` BETWEEN '2019-01-01' AND '2019-03-31') AND `tables_ingesting`.`prov_id` IN (0975933) AND `tables_ingesting`.`medcr_ind` = 'N' AND NOT ((`tables_ingesting`.`category` IS NULL OR `tables_ingesting`.`code` IS NULL)) AND NOT ((`tables_ingesting`.`category` = '' OR `tables_ingesting`.`code` = '' OR `tables_ingesting`.`category` = '""' OR `tables_ingesting`.`code` = '""')));
...
12587 rows in set (0.29 sec)
I noticed that when I built the QuerySet I asked for the same column twice:
ingested.values('prov_id',
'beg_date',
'method',
'category',
'code',
'allowed',
'count',
'charge',
'count')
I removed the second count
. Since the query started working with this change, we merged it to develop
. Once we verified it was working there, we deployed the new UI/Backend this was part of to production
.
After this deploy, both production
and develop
worked, but local
no longer worked. Again, local
and develop
use the same backend.
In the local
server, when the QuerySet got executed, we would get back no data, even though the QuerySet had preview data before it got executed. Further, the same behavior where adding a limit made it work came back. Instead of the malformed packet error, however, we now were getting this:
error (2013, 'Lost connection to MySQL server during query')
<class 'django.db.utils.OperationalError'> on line 511
Where line 511 was where the QuerySet got executed.
Again, changing any aspect of the query made it work with the exact same behavior of the previous problem in develop
. Since this only happened to single providers, I thought it might have been an issue with the following code:
ingested = Model.objects.filter(Q(beg_date__range=prior_range) |
Q(beg_date__range=current_range),
med_cat='N',
prov_id__in=facility_ids,
)
The parameter prov_id__in=facility_ids
generates the following SQL:
...
AND `tables_ingesting`.`prov_id` IN ( '0475776' )
...
Replacing that Model block with the following yielded a solution:
ingested = Model.objects.filter(Q(beg_date__range=prior_range) |
Q(beg_date__range=current_range),
med_cat='N')
if len(facility_ids) > 1:
ingested = ingested.filter(prov_id__in=facility_ids)
else:
ingested = ingested.filter(prov_id=facility_ids[0])
This means we now generate this SQL for a single provider:
...
AND `tables_ingesting`.`prov_id` = '0475776'
...
While this works, it does not explain either MySQL error, which worries me with regard to using this code in production
.
This patch did not last long, as over the weekend we started to see (2027, 'Malformed packet')
show up in the logs again, but only in production
and develop
, not locally, which is the same as before we implemented either fix.
RDS Logs showed the following:
2019-10-28T17:28:05.868458Z 1308445 [Note] Aborted connection 1308445 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
2019-10-28T17:28:10.332017Z 1308446 [Note] Aborted connection 1308446 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
2019-10-28T17:28:14.701596Z 1308447 [Note] Aborted connection 1308447 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
2019-10-28T17:28:19.259719Z 1308448 [Note] Aborted connection 1308448 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
2019-10-28T17:29:25.401275Z 1308449 [Note] Aborted connection 1308449 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
2019-10-28T17:29:25.413090Z 1308451 [Note] Aborted connection 1308451 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error writing communication packets)
2019-10-28T17:29:29.053986Z 1308494 [Note] Aborted connection 1308494 to db: 'tables_ingesting' user: 'client' host: '10.x.x.x' (Got an error reading communication packets)
So, something is going wrong with the MySQL instance in reading the packets that Django sends. The Oracle reference1 tells us the following info:
Error number: 1158; Symbol: ER_NET_READ_ERROR; SQLSTATE: 08S01
Message: Got an error reading communication packets
However, making additional bad requests does not get anything logged to this file, so I am not sure that it is related.
To analyze the packets sent to and from the Django server, we can leverage tcpdump
:
sudo tcpdump -i any -w ~/captures/capture_%Y-%m-%d-%H-%M-%S.cap -G 30 -n -X -Z $USER "port 8000"
We can download this file with:
scp -i cert.pem [email protected]:captures/capture_2019-10-29-13-30-15.cap .
This will download captures/capture_2019-10-29-13-30-15.cap
to our local .
Opening up this file in Wireshark gave us some interesting information.
There are a ton of packets that look fine:
0000 31 00 00 33 07 30 39 37 35 39 33 33 0a 32 30 31 1..3.0975933.201
0010 39 2d 30 33 2d 30 38 00 01 31 05 39 33 30 31 37 9-03-08..1.93017
0020 06 33 35 33 2e 30 30 05 31 2e 30 30 30 07 33 36 .353.00.1.000.36
0030 37 30 2e 38 31 70.81
Or, more easily understood in the Wireshark tree:
MySQL Protocol
Packet Length: 49
Packet Number: 52
Catalog: 0975933
Database: 2019-03-08
Table:
Original table: 1
Name: 93306
Original name: 708
Charset number: Unknown (11825)
Length: 120598576
Type: Unknown (51)
Flags: 0x3133
.... .... .... ...1 = Not null: Set
.... .... .... ..1. = Primary key: Set
.... .... .... .0.. = Unique key: Not set
.... .... .... 0... = Multiple key: Not set
.... .... ...1 .... = Blob: Set
.... .... ..1. .... = Unsigned: Set
.... .... .0.. .... = Zero fill: Not set
.... .... 0... .... = Binary: Not set
.... ...1 .... .... = Enum: Set
.... ..0. .... .... = Auto increment: Not set
.... .0.. .... .... = Timestamp: Not set
.... 0... .... .... = Set: Not set
Decimals: 51
All of these packets were smaller (about 50
bytes each) than the max_allowed_packet
of 4194304
bytes. However, these good packets were punctuated by several warnings in Wireshark for Malformed Packets, but no other information:
[Malformed Packet: MySQL]
[Expert Info (Error/Malformed): Malformed Packet (Exception occurred)]
[Malformed Packet (Exception occurred)]
[Severity level: Error]
[Group: Malformed]
The end of file packet in WireShark throws a dissector is incomplete
error for broken queries as well:
MySQL Protocol
Packet Length: 7
Packet Number: 53
Response Code: EOF Packet (0xfe)
EOF marker: 254
Warnings: 0
Server Status: 0x0022
.... .... .... ...0 = In transaction: Not set
.... .... .... ..1. = AUTO_COMMIT: Set
.... .... .... .0.. = Multi query / Unused: Not set
.... .... .... 0... = More results: Not set
.... .... ...0 .... = Bad index used: Not set
.... .... ..1. .... = No index used: Set
.... .... .0.. .... = Cursor exists: Not set
.... .... 0... .... = Last row sent: Not set
.... ...0 .... .... = Database dropped: Not set
.... ..0. .... .... = No backslash escapes: Not set
.... .0.. .... .... = Metadata changed: Not set
.... 0... .... .... = Query was slow: Not set
...0 .... .... .... = PS Out Params: Not set
..0. .... .... .... = In Trans Readonly: Not set
.0.. .... .... .... = Session state changed: Not set
Payload: 0000
[Expert Info (Warning/Undecoded): FIXME - dissector is incomplete]
[FIXME - dissector is incomplete]
[Severity level: Warning]
[Group: Undecoded]
However, this may be a problem with Wireshark not understanding the end of a MySQL TCP stream when the stream contains malformed packets.
At this point, I did not know where to turn, so I filed a bug report on the Django project. They closed it as an issue in a MySQL driver, database engine, or in database schema.
The only recommendation I found online was to increase the value for max_allowed_packet
2. I did not do this at first because of the reasons listed under Observed Behaviors. By default, my instance was set like:
mysql> SHOW VARIABLES LIKE '%_packet';
+--------------------------+------------+
| Variable_name | Value |
+--------------------------+------------+
| max_allowed_packet | 4194304 |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set (0.09 sec)
Not understanding what units were used when setting the constant, I assumed 1024
would be 1Gb, so I set it to:
mysql> SHOW VARIABLES LIKE '%_packet';
+--------------------------+------------+
| Variable_name | Value |
+--------------------------+------------+
| max_allowed_packet | 1024 |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
This made all of my requests start working! However, other people using this database started reporting issues where they would get an error like (1153, "Got a packet larger than 'max_allowed_packet' bytes")
. This made me realize that I entered the units wrong, so I set the value to 50mb:
mysql> SHOW VARIABLES LIKE '%_packet';
+--------------------------+------------+
| Variable_name | Value |
+--------------------------+------------+
| max_allowed_packet | 49999872 |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set (0.04 sec)
This also worked, which confused me. How is it possible that both large and small values, but not the default, will break my queries? to test this, I set max_allowed_packet
back to the default of 4194304
and all of the queries still worked, but after 10-15 minutes would start failing again. Back to the drawing board.
Unfortunately, we can't directly access the RDS nodes, so the root cause of this issue is impossible to diagnose. Sometimes, the advice "turn it on and off again" really does fix your problem, but this case it was not the solution.
The docs for the recommended Django MySQL interface, mysqlclient, tell us to run sudo yum install python-devel mysql-devel
to ensure we have all of the dependencies. This line was in the CloudFormation file, so we assumed it was working as we expected. However, when we manually tried these install steps:
[ec2-user@ip-10-x-x-x ~]$ sudo yum install mysql-devel
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
amzn2-core | 2.4 kB 00:00:00
Package 1:mariadb-devel-5.5.64-1.amzn2.x86_64 already installed and latest version
Instead of getting mysql-devel
, we actually get mariadb-devel
, specifically an old version 2 full releases and a hard fork behind the database on RDS3. To fix this, we have to first uninstall the Amazon Linux repo's version of maria-devel
and install MySQL's repository and then install the mysql-devel
package from them:
To uninstall the maria-devel
package:
[ec2-user@ip-10-x-x-x ~]$ yum list installed mariadb\*
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Installed Packages
mariadb-devel.x86_64 1:5.5.64-1.amzn2 @amzn2-core
mariadb-libs.x86_64 1:5.5.64-1.amzn2 @amzn2-core
[ec2-user@ip-10-x-x-x ~]$ sudo yum remove mariadb-libs.x86_64 mariadb-devel.x86_64
Next, install the latest MySQL RPM package from the MySQL Yum repository download page. First, we need to install the community certificate so that we can download from these repositories:
cd /etc/pki/rpm-gpg/
sudo wget https://repo.mysql.com/RPM-GPG-KEY-mysql
Next, we can install the repository itself:
wget https://dev.mysql.com/get/mysql80-community-release-el6-3.noarch.rpm
sudo yum localinstall mysql80-community-release-el6-3.noarch.rpm
We specifically need the REHL 6 repository; the REHL 7 and 8 repositories expect newer versions of system libraries and will result in errors like this:
--> Finished Dependency Resolution
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libssl.so.1.1(OPENSSL_1_1_0)(64bit)
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libssl.so.1.1()(64bit)
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libssl.so.1.1(OPENSSL_1_1_1)(64bit)
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libc.so.6(GLIBC_2.28)(64bit)
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libcrypto.so.1.1()(64bit)
Error: Package: mysql-community-libs-8.0.18-1.el8.x86_64 (mysql80-community)
Requires: libcrypto.so.1.1(OPENSSL_1_1_0)(64bit)
...74 lines truncated...
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
To verify that the repo was installed, we can run:
[ec2-user@ip-10-x-x-x ~]$ yum repolist enabled | grep "mysql.*-community.*"
[ec2-user@ip-10-x-x-x ~]$ yum repolist enabled | grep "mysql.*-community.*"
mysql-connectors-community/x86_64 MySQL Connectors Community 84+29
mysql-tools-community/x86_64 MySQL Tools Community 84
mysql80-community/x86_64 MySQL 8.0 Community Server 127
[ec2-user@ip-10-x-x-x ~]$ yum repolist all | grep mysql
mysql-cluster-7.5-community/x86_64 MySQL Cluster 7.5 Community disabled
mysql-cluster-7.5-community-source MySQL Cluster 7.5 Community disabled
mysql-cluster-7.6-community/x86_64 MySQL Cluster 7.6 Community disabled
mysql-cluster-7.6-community-source MySQL Cluster 7.6 Community disabled
mysql-cluster-8.0-community/x86_64 MySQL Cluster 8.0 Community disabled
mysql-cluster-8.0-community-source MySQL Cluster 8.0 Community disabled
mysql-connectors-community/x86_64 MySQL Connectors Community enabled: 84+29
mysql-connectors-community-source MySQL Connectors Community disabled
mysql-tools-community/x86_64 MySQL Tools Community enabled: 84
mysql-tools-community-source MySQL Tools Community - Sou disabled
mysql-tools-preview/x86_64 MySQL Tools Preview disabled
mysql-tools-preview-source MySQL Tools Preview - Sourc disabled
mysql55-community/x86_64 MySQL 5.5 Community Server disabled
mysql55-community-source MySQL 5.5 Community Server disabled
mysql56-community/x86_64 MySQL 5.6 Community Server disabled
mysql56-community-source MySQL 5.6 Community Server disabled
mysql57-community/x86_64 MySQL 5.7 Community Server disabled
mysql57-community-source MySQL 5.7 Community Server disabled
mysql80-community/x86_64 MySQL 8.0 Community Server enabled: 127
mysql80-community-source MySQL 8.0 Community Server disabled
This defaults to only activating the 8.x
version of MySQL, which is not the match for our database version. We need to use the 5.7.x
version, so we have to edit /etc/yum.repos.d/mysql-community.repo
to say:
# Enable to use MySQL 5.7
[mysql57-community]
name=MySQL 5.7 Community Server
baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/6/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
[mysql80-community]
name=MySQL 8.0 Community Server
baseurl=http://repo.mysql.com/yum/mysql-8.0-community/el/6/$basearch/
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
Once we verified the proper repositories are enabled worked, we can run sudo yum install mysql-devel
to get the proper package. I ran into issues downloading this:
... (truncated)
mysql-community-devel-5.7.28-1 FAILED
http://repo.mysql.com/yum/mysql-5.7-community/el/6/x86_64/mysql-community-devel-5.7.28-1.el6.x86_64.rpm: [Errno 12] Timeout on http://repo.mysql.com/yum/mysql-5.7-community/el/6/x86_64/mysql-community-devel-5.7.28-1.el6.x86_64.rpm: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
Trying other mirror.
mysql-community-devel-5.7.28-1 FAILED
http://repo.mysql.com/yum/mysql-5.7-community/el/6/x86_64/mysql-community-devel-5.7.28-1.el6.x86_64.rpm: [Errno 12] Timeout on http://repo.mysql.com/yum/mysql-5.7-community/el/6/x86_64/mysql-community-devel-5.7.28-1.el6.x86_64.rpm: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
Trying other mirror.
Error downloading packages:
mysql-community-devel-5.7.28-1.el6.x86_64: [Errno 256] No more mirrors to try.
To solve this, I set timeout = 60
and minrate=1
in /etc/yum.conf
which solved the issue.
Once MysQL 5.7 was installed, we needed to uninstall and reinstall the pip
's instance of mysqlclient
because it expects to see maria-devel
and not our newer mysql-devel
.
This seemed promising, as now our database version and driver version are aligned, but when we try and start Django we get this traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 83, in _execute
return self.cursor.execute(sql)
File "/usr/local/lib/python3.7/site-packages/django/db/backends/mysql/base.py", line 71, in execute
return self.cursor.execute(query, args)
File "/usr/local/lib64/python3.7/site-packages/MySQLdb/cursors.py", line 209, in execute
res = self._query(query)
File "/usr/local/lib64/python3.7/site-packages/MySQLdb/cursors.py", line 315, in _query
db.query(q)
File "/usr/local/lib64/python3.7/site-packages/MySQLdb/connections.py", line 226, in query
_mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
Other than that, though, the server would start fine and all of our requests went through okay, so why was this crashing so hard? I reverted back to the default versions by spinning up a new instance. After redeploying, I noticed a new log message.
Once all of the modules were installed, Django would start with the following warning log:
MySQL Strict Mode is not set for database connection 'default'
The server would run fine, but this warning did not happen on local. To address it, we can add an init_command
to the default
database configuration
'default': {
'ENGINE': 'django.db.backends.mysql',
'NAME': 'model_data',
'USER': get_env_variable('DBUSER'),
'PASSWORD': get_env_variable('DBPW'),
'HOST': 'dev.sql.internal.net',
'OPTIONS': {'init_command': "SET sql_mode='STRICT_TRANS_TABLES'"}
}
The above change to the init_command
solved this. SSH
ing in and removing that line will instantly kill the server connection and start logging the malformed packet error, and putting it back will make the queries start working again.
Django does not properly format the query when printing QuerySet.query()
, rather it string formats the entire thing, which removes all of the quotes. The query under troubleshooting looks like this when printed 4
SELECT `tables_ingesting`.`prov_id`,
`tables_ingesting`.`beg_date`,
`tables_ingesting`.`method`,
`tables_ingesting`.`category`,
`tables_ingesting`.`code`,
`tables_ingesting`.`allowed`,
`tables_ingesting`.`count`,
`tables_ingesting`.`charge`
FROM `tables_ingesting`
WHERE ((
`tables_ingesting`.`beg_date` BETWEEN 2018-10-01 AND 2018-12-31
OR `tables_ingesting`.`beg_date` BETWEEN 2019-01-01 AND 2019-03-31)
AND `tables_ingesting`.`medcr_ind` = N
AND `tables_ingesting`.`prov_id` IN (0975933)
AND NOT ((
`tables_ingesting`.`category` IS NULL
OR `tables_ingesting`.`code` IS NULL))
AND NOT ((
`tables_ingesting`.`category` =
OR `tables_ingesting`.`code` =
OR `tables_ingesting`.`category` = ""
OR `tables_ingesting`.`code` = ""))
);
However, this is due to the way the strings are formatted when calling .query()
and not what is passed to the SQL server. If this was the case, all queries with quoted parameters would fail with ERROR 1064 (42000): You have an error in your SQL syntax
.