LaraconEU 2016 - Overview

The last time I went to LaraconEU was in 2014, as a speaker in the community track. It was my first time ever speaking in English at an international event, and I had the jitters.

A lot of friends were made, and it was nice to see all of them again this year.

What I also like about LaraconEU is that it is a Laravel conference that doesn’t always talk about Laravel, making it a bit less insular than one might imagine.

Shawn McCool and his team did a great job this year, and I was very honoured to be in Track #1 (the Blue Track).

The first round of applause, when I was introducing myself and mentioned this was my second Laracon, made me quite emotional and triggered something in my mind:

These people are here to soak up the knowledge you have to share, so give them your best; I confess I tried to do exactly that.

The venue was amazing. I don’t have a single complaint about the staff or the organisers; they did everything possible to make people comfortable, and I could see them working hard behind the scenes.

The problem with speaking on the second day of a two-day conference is that you don’t get the most out of the talks that come before yours: insecurity, anxiety and fear all play a role.

The talks I saw were really well delivered, given by people who “knew their shit”, as we joked among ourselves. There were also a lot of cheesy jokes to break the ice.

My only regret is not having seen more of the speakers’ talks, which I will definitely rectify as soon as the videos come out. Making new friends this year was also awesome; keep in touch!

Joind.in: Making the most out of MySQL

Slides:


Tags: laravel laraconeu conference mysql

How I became a Data Engineer

A reminiscence of a personal timeline of events; please excuse minor errors, and let me know if my memory has shuffled things around a bit.

My first job was at a company that had its own CRM, and they wanted a web version of some parts of the system. At the time, I only had “experience” with ASP and Microsoft Access (I know, tough). They wanted the web version in PHP; their reasoning, I think, was that they wanted the integration to run directly against the database, with the web app writing straight into the DB. The CRM system was written in Delphi with Firebird. So I learned PHP and my first database, which wasn’t MySQL (I don’t count MS Access as a DB). After that I got a job where MySQL was used. I was really fresh on MySQL and didn’t know about storage engines and such, so it was a bit odd learning about MyISAM (which doesn’t have foreign keys, for instance).

After that I got a job at a huge multinational that had a project to migrate every Excel spreadsheet to a PHP program. VBA was heavily used there, and they had entire programs running on it. What they didn't tell us was that it was cheaper for them to have a whole team of PHP developers building an internal system than to have the features built into their ERP. For “security” reasons no business logic could live in the PHP code, so I had to write tons of stored procedures. They also had integrations with MS SQL Server; the workflow system used it together with a PHP tool called Scriptcase.

Another job I had was with a different multinational, where I had to build a program that read from various sources and stored the data in a DB for later reporting. It was a rough first take on an ETL (Extract, Transform, Load), but at the time I wasn’t yet well versed in data warehouse techniques. For that job I used PostgreSQL. At the same company we later built white-label Magento stores for our clients (other big companies), and they had to integrate with our ERP (unlike my first job). That integration went through a service bus written in Java, and the ERP used Oracle as its DB.

One of my employers noticed my interest in database tasks and how quickly I became familiar with the schema and our data architecture. Our main relational database had millions and millions of records, terabytes of data, and they created a Data Team to make it more efficient. We would create workers to sync data into Elasticsearch, for instance. Officially I was still a PHP Developer, but I was mainly curating the DB and writing workers in NodeJS (we needed the async behaviour, and indexing time was crucial for the business).

I could go on and on through every job I've had. Most of my jobs have been at big corporations, which has put me in contact with many flavours of relational databases, and NoSQL too (MongoDB, Dynamo and Cassandra being the main ones).

In the end, this kind of exposure has made me the “DB” person among my fellow engineers, and anything more involved than CRUD tends to fall into my lap. I'm happy with that.

That’s how I discovered that I wanted to work with data. I had no idea what that job title would be, but I knew I didn’t want to be a DBA, since many of the infrastructure tasks involved didn’t sound fun to me. Once the opportunity to work officially as a Data Engineer appeared, I grabbed it with both hands, and I knew it was the right fit.


Tags: Data Engineering Software Development

Creating Migrations with Liquibase

Liquibase

Liquibase is a database versioning tool. Currently it's on version 3.5 and is installed as a JAR. It has been on the market since 2006 and recently celebrated its 10th anniversary. Its feature list includes:

  • Code branching and merging
  • Multiple database types
  • Supports XML, YAML, JSON and SQL formats
  • Supports context-dependent logic
  • Generate Database change documentation
  • Generate Database "diffs"
  • Run through your build process, embedded in your application or on demand
  • Automatically generate SQL scripts for DBA code review
  • Does not require a live database connection

Why do you need it?

Some frameworks come with built-in migration solutions out of the box, such as Eloquent and Doctrine. There is nothing wrong with using something like that when you have only one DB per project, but when you have multiple systems it starts to get complicated.

Since Liquibase works as a versioning tool, you can branch and merge as needed (like you would with code in git). You have contexts, which means changes can be applied to specific environments only, and tagging capabilities allow you to perform rollbacks.
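To make this concrete, here is a minimal command-line sketch of tagging a known-good state, applying changes for a specific context and rolling back to that tag. It assumes the connection settings live in the liquibase.properties file shown later in this post; the tag name and context are hypothetical:

$ liquibase --changeLogFile=changes/changelog.json tag v1.2.0
$ liquibase --changeLogFile=changes/changelog.json --contexts=staging migrate
$ liquibase --changeLogFile=changes/changelog.json rollback v1.2.0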

A rollback is a tricky thing; you can either do an automatic rollback or define a script. Scripted rollbacks are useful when dealing with MySQL, for instance, where DDL changes are NOT transactional.
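For example, a changeset written in Liquibase's JSON format (covered further down) can declare its own rollback as raw SQL. This is only a sketch, with hypothetical table and column names:

{
  "changeSet": {
    "id": "add_users_last_login",
    "author": "your.name",
    "changes": [
      { "addColumn": {
          "tableName": "users",
          "columns": [
            { "column": { "name": "last_login", "type": "DATETIME" } }
          ]
      } }
    ],
    "rollback": [
      { "sql": { "sql": "ALTER TABLE users DROP COLUMN last_login" } }
    ]
  }
}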

Guidelines for changelogs and migrations

  • MUST be written using the JSON format. Exceptions are changes/legacy/base.xml and changes/legacy/base_procedures_triggers.sql.
  • MUST NOT be edited. If a new column is to be added, a new migration file must be created and the file MUST be added AFTER the last run transaction.

Branching

There could be 3 main branches:

  • production (master)
  • staging
  • testing

Steps:

  1. Create your changelog branch;
  2. Merge it into testing;
  3. When the feature is ready for staging, merge it into staging;
  4. When the feature is ready, merge it into production.

Example:

(Diagram: Liquibase branching example)

Rules:

  • testing, staging and production DO NOT merge amongst themselves in any capacity;
  • DO NOT rebase the main branches;
  • Custom branches MUST be deleted after being merged into production.

The downside of this approach is the diverging state between the branches. The current process is to compare the branches from time to time and manually check the diffs for unplanned discrepancies.
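One way to spot such discrepancies is Liquibase's own diff command, comparing the databases that two branches have been applied to. A sketch, with hypothetical hosts and credentials:

$ liquibase --url=jdbc:mysql://staging-db:3306/mydb --username=deploy --password=secret \
    diff --referenceUrl=jdbc:mysql://production-db:3306/mydb \
    --referenceUsername=deploy --referencePassword=secret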

Procedures for converting a legacy database to Liquibase migrations

Some projects are complete monoliths, with more than one application connecting to the same database, which is not a good practice. If you are working with that sort of project, I recommend treating the database source as its own repository rather than keeping it together with your application.

Writing migrations

This is the way I found to keep the structure reasonably sensible. Suggestions are welcome.

1. Create the property file

It should be in the root of the project and be named liquibase.properties:

driver: com.mysql.jdbc.Driver
classpath: /usr/share/java/mysql-connector-java.jar:/usr/share/java/snakeyaml.jar
url: jdbc:mysql://localhost:3306/mydb
username: root
password: 123

The JAR files in the classpath can be downloaded manually or installed through the server's package manager.

2. Create the migration file

You can choose between different formats; I chose JSON. In this instance I will be running this SQL:
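A representative example (the table and column names below are hypothetical, picked to match the create_mydb_users.json file used further down):

CREATE TABLE users (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    email VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_users_email (email)
) ENGINE=InnoDB;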

Which will translate to this:
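Again as a sketch, the equivalent changeset in Liquibase's JSON format might look like this (the id and author values are placeholders):

{
  "databaseChangeLog": [
    {
      "changeSet": {
        "id": "create_mydb_users",
        "author": "your.name",
        "changes": [
          {
            "createTable": {
              "tableName": "users",
              "columns": [
                { "column": { "name": "id", "type": "INT UNSIGNED", "autoIncrement": true,
                              "constraints": { "primaryKey": true, "nullable": false } } },
                { "column": { "name": "email", "type": "VARCHAR(255)",
                              "constraints": { "unique": true, "nullable": false } } },
                { "column": { "name": "created_at", "type": "DATETIME",
                              "constraints": { "nullable": false } } }
              ]
            }
          }
        ],
        "rollback": [
          { "dropTable": { "tableName": "users" } }
        ]
      }
    }
  ]
}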

Is it verbose? Yes, completely, but in return you get a tool that shows you what the SQL will look like and lets you manage rollbacks.

Save the file as:

.
  /changes
    - changelog.json
    - create_mydb_users.json

Where changelog.json looks like this:
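A master changelog that simply includes each migration file would look roughly like this (relativeToChangelogFile makes the path relative to changelog.json itself):

{
  "databaseChangeLog": [
    {
      "include": {
        "file": "create_mydb_users.json",
        "relativeToChangelogFile": true
      }
    }
  ]
}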

For each new change, you add a new entry to the end of the databaseChangeLog array.

3. Run it

To run, execute:

$ liquibase --changeLogFile=changes/changelog.json migrate

Don't worry if you run it twice; each change is only applied once.
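That is because Liquibase records every applied changeset in a DATABASECHANGELOG table inside the target database; you can inspect it with a plain query:

SELECT ID, AUTHOR, FILENAME, DATEEXECUTED
FROM DATABASECHANGELOG
ORDER BY ORDEREXECUTED;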

Next post will cover how to add a legacy DB into Liquibase.

To dig deeper into the Liquibase formats, check the official documentation.


Tags: Data Engineering Liquibase Migrations Doctrine Laravel

Fast data import trick

A few weeks ago my friend Frank de Jonge told me he had managed to bring an import into a MySQL server down from more than 10 hours to 16 minutes. According to him, it had to do with several field types (fields far too long for really small data), the number of indexes, and constraints on the tables. We are talking about roughly 1 million records here. He wondered if it was possible to make it even faster.

The basics

It turns out there are many ways of importing data into a database; it all depends on where you are getting the data from and where you want to put it. Let me give you a bit more context: you may want to get data from a legacy application that exports to CSV into your database server, or even move data between different servers.

If you are pulling data from a MySQL table into another MySQL table (let's assume they are on different servers), you might as well use mysqldump.

To export a single table:

$ mysqldump -h localhost -u root -p --extended-insert --quick --no-create-info mydb mytable | gzip > mytable.sql.gz

A bit more about this line:

  • --extended-insert: ensures the dump is not one INSERT per row; a single statement can carry dozens of rows (see the example after this list).
  • --quick: useful when dumping large tables. By default MySQL reads the whole table into memory and then dumps it into a file; with this flag the data is streamed without consuming much memory.
  • --no-create-info: only the data is exported, no CREATE TABLE statements are added.
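To illustrate the difference --extended-insert makes, here is a made-up table and values:

-- with --extended-insert (the default), many rows per statement:
INSERT INTO mytable VALUES (1,'alice'),(2,'bob'),(3,'carol');

-- with --skip-extended-insert, one row per statement:
INSERT INTO mytable VALUES (1,'alice');
INSERT INTO mytable VALUES (2,'bob');
INSERT INTO mytable VALUES (3,'carol');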

The complex

The problem my friend faced was a bit more complex. He needed to generate the dump file because his data came from somewhere else (later on I pointed him to the benefits of LOAD DATA INFILE), but since 90% of his work was already done he wanted to know:

Why is it faster when I insert in blocks of 50 rows than when I use blocks of 500?

There could be N reasons for that:

  • buffering 500 rows in memory is slower than buffering 50; remember, you are reading from disk, and that is always slow.
  • if no transactions are used, the indexes get rebuilt at the end of each INSERT. For 1 million rows at 50 values per statement that is 20k INSERTs, while with 500 values it would be 2k statements. My speculation here is that, since InnoDB indexes are BTREEs, building them slowly means you "know" where the values sit in the tree, so sorting and organising is a fast operation, while with 500 items you have to reorganise a lot of information at once. Again, this is speculation.

Suggestions

Transactions

My first suggestion was: wrap everything in a single transaction. Put a START TRANSACTION at the beginning and a COMMIT statement at the end. That way the rebuilding of the indexes and the foreign key checks happen at the end of the script.
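In practice that just means framing the import script like this (the INSERT lines are placeholders):

START TRANSACTION;

INSERT INTO mytable VALUES (1,'alice'),(2,'bob');
INSERT INTO mytable VALUES (3,'carol'),(4,'dave');
-- ... the rest of the inserts ...

COMMIT;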

He reported a minor improvement on performance.

The Danger

I knew from the beginning a way to make his import really fast, but since the source of his data wasn't as trustworthy as the database itself, it could result in duplicated data or missing foreign keys; it could end really, really badly.

mysqldump puts this option in place by default, because it's fair to assume you are importing the dump into an empty database, so there are no data integrity problems. That wasn't the case here.

The data had been manipulated before being inserted, so the trick I suggested to him was, and I quote:

SET foreign_key_checks = 0;
/* do your stuff REALLY CAREFULLY */
SET foreign_key_checks = 1;

The import went from 16 minutes down to 6. He was super happy :D

And people on the internet got curious (because Frank is famous now, apparently).

I confess it was fun to see the time cut down by more than half, but use this with caution.

An even faster way

CSV files. Yes, that's faster. Specifically TSV, since any string can have a comma.

To generate:

$ mysqldump -h localhost -u root -p --tab=/tmp mydb mytable 

Or if you are manipulating the data yourself from another source, don't forget to use \N for NULL values.
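For example, a hand-built tab-separated file for a hypothetical three-column table would look like this, with \N marking the NULLs:

1	alice	alice@example.com
2	bob	\N
3	carol	carol@example.com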

To Read:

$ mysql -h localhost -u root -p 
mysql> LOAD DATA INFILE '/tmp/mytable.txt' INTO TABLE mytable;
Query OK, 881426 rows affected (29.30 sec)
Records: 881426  Deleted: 0  Skipped: 0  Warnings: 0

The same data with bulk INSERTs took over a minute. There are many variables when dealing with that statement, such as buffer size and the key checking itself, so for high-volume data, importing straight from a text file is still the fastest option.

Conclusion

As I said before, it was just a matter of disabling the constraint checks in the script. Only do that if you are sure the data is good; otherwise, other options such as net_buffer_length, max_allowed_packet and read_buffer_size can help you import big SQL files. Also, in most cases this should be the rule: Data Integrity > Performance.


Tags: MySQL INSERT file import LOAD FILE

Why you need a Data Engineer

(Dilbert comic strip)

Tech evolves quickly. When the buzzword Big Data started showing up more and more, the market was in need of people able to analyse and give meaning to what was collected. For instance, an article published in 2012 by Harvard Business Review was entitled: Data Scientist: The Sexiest Job of the 21st Century.

Today we have DBA, Data Scientist, Data Engineer, Data Analyst: a wealth of options with "Data" as a prefix. More often than not, people put everyone in the same basket and assume they all have the same set of skills.

From my perspective as a Data Engineer, these are the differences:

  • DBA - Once the person I hated the most on any team. Seriously, why wouldn't that human give me the necessary permissions on the database? If I had access, I would have done my job sooner... Well, that was my thinking as a Software Engineer at the time. It turns out DBAs are what my friends and I used to call the database babysitter. Need to tune and figure out why performance is not what it should be? Need help with a complicated query? That's the go-to person for it. But notice, this is RDBMS-specific and heavily focused on the operational side.

  • Data Scientist - The market usually wants a professional with a PhD in statistics or an otherwise heavily math-oriented person. This person will be responsible for creating prediction models based on current data. Do you know how Amazon knows what you should buy next based on your browsing history? Yeah, this individual probably did the programming around that, has machine learning down to a T, and needs to possess Product, Engineering and Statistics knowledge.

  • Data Analyst - This person also deals with a bit of statistics, but more in the business sense, dealing with and creating reports for Business Intelligence. This role tries to answer business questions; identifying where data acquisition/quality is failing, for example.

  • Data Engineer - This role I can explain with more passion: it is what I do, so this will probably be biased. We are the bridge. We help Software Engineers build the application for storage and retrieval in a manner that provides the Data Scientists and Data Analysts with the information they need to do their jobs.

 So why do you need a Data Engineer on your team?

We do ETL (Extract-Transform-Load). We put data into the Data Warehouse, and it’s from there that Data Analysts and Data Scientists get part of the information they need. We may also end up sending data to Hadoop, for instance. They don’t usually query your main relational database or your MongoDB cluster directly.

I’ve seen queries take hours to run because the main DB is not structured to deliver the information the way they need it. That's because when building a new application we think in terms of a normalised database, while some of those professionals need a star schema, as in a Data Warehouse, or totally denormalised data if you are optimising searches on Solr.
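As a rough illustration of the difference, a minimal star schema for reporting might look like this (tables and columns are hypothetical, not taken from any system mentioned above):

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    name         VARCHAR(255),
    country      VARCHAR(64)
);

CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,   -- e.g. 20160830
    full_date DATE,
    month     TINYINT,
    year      SMALLINT
);

CREATE TABLE fact_orders (
    customer_key INT,            -- points to dim_customer
    date_key     INT,            -- points to dim_date
    total_amount DECIMAL(10,2),
    items        INT
);

-- Reports become simple joins over the fact table instead of
-- crawling the normalised OLTP schema:
SELECT c.country, d.year, d.month, SUM(f.total_amount) AS revenue
FROM fact_orders f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date d     ON d.date_key = f.date_key
GROUP BY c.country, d.year, d.month;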

We devise the best strategy for caching information and design database architectures and NoSQL clusters. Should that JSON returned from the Facebook API really be stored in the relational database? (Short answer: no.) Should that query with a LIKE '%string%' really be running in the application (no) instead of getting the data from Elasticsearch (probably)?

We work with RDBMS, NoSQL, search engines and cache engines. I personally make heavy use of RDBMS, since most of my work has been on legacy applications. As an example, one of our responsibilities is to lower the load on the RDBMS caused by data that is stored there unnecessarily.

It is still necessary to know about topics like indexing, transactions, query profiling and performance tuning.

To sum up: We are the wild card of storage technology.

Developers don't usually care about the precise details of data. They just want what is fast and easy to use. They think about delivery, not so much about long-term data retrieval.

"I am going to store this access log for my website on that table."

They probably didn't stop to think that that table will potentially hold millions of records within a span of months. Why not ELK? Cassandra? Those tools allow the information to be retrieved quickly with Elasticsearch and scaled easily with Cassandra. That way you don’t overload your main RDBMS with the same query being repeated plenty of times over a TEXT field, for instance. See also: Strip your TEXT field.

How should you work with a Data Engineer?

You know that feature you want to implement? Talk to us first. You can design the application the way you want, but we can give you insight into the data layer.

Do not isolate us within your team. We need support from your DevOps, and we need the engineers to be our partners and to collaborate on problem solving; we are not there only to run the queries they deem necessary because they don’t have access to the database. If you want that wickedly fast search by category on your e-commerce site, I guarantee you that LIKE is not the fast approach. The DevOps folks will help us set up the environment so the platform can take best advantage of it; some developers may foresee this, but most don’t. Again, we are the bridge. This is our job.

You can get another view of this in this blog post: The different data science roles in the industry.


Tags: Data Engineer Data Scientist Data Analyst DBA