I got my first map/reduce job running on HDInsight Services for Windows. The material provided by the HDInsight team is very good. :-)
Hadoop gives us a new way to store and process data. We generate a huge amount of data every day, and the data schema changes constantly. Every time we design a database we assume the data will not change that often, but this is not true. A possible solution is to dump all the data into low-cost storage. If a portion of the data suddenly becomes "valuable", a HIVE external table can be created to give that data a schema. By using the ODBC driver, the data can then be copied to SQL Server, which is a high-cost storage solution. Hadoop will not replace the traditional SQL Server; SQL Server only serves the "valuable" data, while the "useless" data stays in Hadoop.
I can't publicly discuss or blog about the HDInsight features right now, so this post just summarizes the public materials.
Personally, I am more interested in finding a good solution to store the data and later process it as fast as possible. I know Hadoop is ready for "big data", so I am more interested in:
1. How to link with other technologies, such as SQL Server and SQL Azure
2. How to move data between different data stores
3. How to run my own map/reduce jobs
The article about SSIS caught my eye. It covers questions 1 and 2. The ASV protocol mentioned in the article is the way to access Azure blob storage from Hadoop. If you have read my previous post about Azure blob, you can tell where that post came from. HIVE can point to an ASV folder by using the Create External Table statement (see section 8.3). You might want to use this link to check all the HIVE data types.
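To make it concrete, a HIVE external table over an ASV folder might look like the sketch below. The table name, columns, and container path are made up for illustration, and the exact asv:// path format depends on how your storage account is configured.

    -- Hypothetical log data sitting in an Azure blob container, exposed to HIVE.
    -- The table is external, so dropping it does not delete the underlying files.
    CREATE EXTERNAL TABLE weblog (
        log_date  STRING,
        user_id   STRING,
        url       STRING,
        duration  INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 'asv://mycontainer/weblog/';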
Once the data is organized as a table, it can be accessed through HIVE ODBC. ODBC enables all kinds of connections, so our existing tools and skills still apply. Note that you can NOT write data by using ODBC.
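As a sketch of the read path: assuming a DSN named "HiveDSN" configured for the HIVE ODBC driver and the hypothetical weblog table from above, a plain System.Data.Odbc query in C# does the job.

    using System;
    using System.Data.Odbc;

    class HiveOdbcRead
    {
        static void Main()
        {
            // "HiveDSN" is a hypothetical data source name set up for the HIVE ODBC driver.
            using (var conn = new OdbcConnection("DSN=HiveDSN"))
            {
                conn.Open();

                // Read-only: the driver is used for querying, not for writing data back.
                using (var cmd = new OdbcCommand("SELECT url, duration FROM weblog LIMIT 10", conn))
                using (OdbcDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0}\t{1}", reader[0], reader[1]);
                    }
                }
            }
        }
    }

The same DSN can then be reused from SSIS, Excel, or any other ODBC-aware tool.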
The map/reduce program is very simple; it is basically a clone of the C# sample. The only problem I found is debugging: the execution log from the UI is not that helpful. My trick is to export debug information to the output file. Once the algorithm is correct, the debug information can be removed.
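I am not reproducing the actual sample here, just the shape of the trick. Below is a minimal sketch of a streaming-style C# mapper (read stdin, write stdout); the "DEBUG" prefix is my own convention, and those lines land in the job's output file where I can inspect them, then delete the Console.WriteLine calls once the algorithm is correct.

    using System;

    // Minimal word-count style mapper: reads lines from stdin, emits "word<TAB>1" to stdout.
    // The extra DEBUG lines go to the same output file, which is the debugging trick.
    class WordCountMapper
    {
        static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                // Debug trick: dump the raw input line into the output file.
                Console.WriteLine("DEBUG\tinput: " + line);

                var words = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                {
                    Console.WriteLine("{0}\t{1}", word, 1);
                }
            }
        }
    }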