|
Posted 9 days ago by Hyehwan Ahn
NHN, the company behind the development of the CUBRID open source database, automates tests for diverse browsers (Chrome, Firefox, Internet Explorer, Opera, iOS Safari and Android) running on a variety of operating systems (Linux, Microsoft Windows, Mac OS, iOS and Android) with a single set of test code, using Hudson and Selenium WebDriver. In this article, I will briefly summarize the test automation process we use at NHN's NAVER portal and the benefits of Hudson and Selenium WebDriver for automation.
How to Automate Tests

I registered test code in Hudson, a Continuous Integration (CI) tool, and configured it to run with Selenium WebDriver and JUnit whenever source is committed to SVN. It takes about 2 seconds to test a browser. When a problem occurs, it is reported within 10 minutes. Hudson is used in most development departments at NHN. By clicking Build Now, users can view the test progress. The report system provides a good environment for "Fast Fail, Fast Feedback".

Figure 1: Multi-browser Test with One Set of Test Code.

Automating Firefox Test in Linux

Install the xserver package on the Linux server that runs Hudson. Change the run level to 5, then add the following setting to the account under which CI runs so that a browser can be executed from a text console:

    Xvfb :1 -screen 0 1024x768x24 > /home1/irteam/log/xvfb.log &

Install the latest Firefox version under the /usr/lib64 directory and replace the symbolic link. Now you can execute automated tests by running Firefox from Hudson.

Firefox Test Automation Demo Video (Youtube)

Automating iPhone and iPad Tests in Mac OS X and Simulator

For these tests, install Hudson on Mac OS X. The tests require the iPhone or iPad Simulator, so install Xcode in advance.

iPhone Test Automation Demo Video by Using Simulator (Youtube)

Automating Internet Explorer and Opera Tests in Microsoft Windows

For more details on how to automate tests in Microsoft Windows, see http://seleniumhq.org/docs/03_webdriver.html.

Why Is Browser Test Automation with Hudson Required?

If tests for several browsers run automatically whenever source code changes, developers can reduce bugs and keep the source tree releasable at any time. In addition, QA time can be significantly reduced. Browser tests with Hudson have benefits beyond quality and cost. The following are the benefits of Hudson.
Keeping a Living Document and Building up Domain Knowledge

Whenever a test runs in Hudson, a document with illustrations and descriptions can be generated automatically by using Javadoc package.htm in the test code. This document is called a "living document": it is continuously updated and is used by operators, developers and QA staff for a variety of purposes, such as handing off domain knowledge and preserving revision history.

Figure 2: Javadoc Document Created from Tests.

Visualizing Service Quality by Browser and Domain Knowledge

Anyone who clicks Build Now in Hudson can check the service quality for each browser.

Figure 3: Verifying Mobile News Service in Firefox.

In some cases, the news service must behave differently depending on the day. For example, the KOSPI/KOSDAQ indexes are not shown in the news on weekends; they reappear after the market opens on Monday morning. It is very difficult to hand off all of this domain information to every related staff member. Anyone who clicks the Build Now button, however, can see how the news service operates in each state. Tests with Hudson therefore help users hand off and accumulate domain knowledge, as well as enhance quality and cut costs.

Best Practices and Examples

We once needed to update the version of Spring Framework used for the mobile news service. We changed the version information in the Maven pom.xml file and ran the unit and integration tests for the web server source, and they reported no problems. Developers started the web server, opened a few web pages, and confirmed that all of them looked fine. However, there were in fact more than 80 runtime errors. Fortunately, we had the multi-browser test in Hudson, so we could fix all bugs and problems before releasing the service.

Selenium WebDriver

Why Selenium WebDriver?

We use Selenium WebDriver because it has matured over many versions into a reliable tool, and because it is easy to use.
The test is performed by running a real browser, so it is easy to run UI tests for each browser. A virtual browser such as HtmlUnit cannot support a variety of browser versions (it is limited to Firefox 3.6 and lower and Internet Explorer 8 and lower). In addition, it does not support mobile browsers. Therefore, I recommend Selenium WebDriver over HtmlUnit, even though Selenium WebDriver requires more effort. With the Java API that Selenium WebDriver provides, you can easily write browser test code in JUnit. Code 1 below is currently used for the mobile news service test.

Code 1. News Service Test Code.

    // Method name (Korean): "the news home shows the last-edited time for the last 3 days"
    @Test(timeout = 1000 * 60)
    public void 뉴스홈에_최근3일_최종편집시간이_표시된다() {
        Driver.get(driver, "http://m.news.naver.com/home.nhn");
        // "최종편집" is the "Last edited" label expected on the page
        String expectedResult = "최종편집";
        String result = Driver.getTextByClass(driver, "last_update");
        assertThat(result.startsWith(expectedResult), is(true));
    }

Benefits of Selenium WebDriver

Selenium WebDriver has various benefits: it allows developers to write test code in a variety of languages (Java, C#, Python, PHP, Ruby, Perl) and provides fast feedback through the implicit wait function. Further benefits are described below.

Test by Specifying ID/Class/XPath

With HTML maintenance in mind, ID is the best way to specify HTML/CSS elements, Class is the second best, and XPath comes last. Selenium WebDriver supports all three.

Multi-browser Tests with One Set of Test Code

You can test several browsers with one set of test code because switching the test browser only requires replacing the WebDriver. As of 2012, WebDriver supports Firefox, Internet Explorer, Chrome, Opera, iPhone Safari, iPad Safari and Android browsers. Code 2 below is part of the code that creates the WebDriver.

Code 2. Example of Creating WebDriver.
    public WebDriverFactory() throws Exception {
        // The browser type comes from the test configuration
        String browsetype = TestConfigParam.getBrowseType();
        if ("firefox".equals(browsetype)) {
            this.driver = new FirefoxDriver();
        } else if ("chrome".equals(browsetype)) {
            this.driver = new ChromeDriver();
        } else if ("ie".equals(browsetype)) {
            this.driver = new InternetExplorerDriver();
        } else if ("iphone".equals(browsetype)) {
            this.driver = new IPhoneDriver();
        } else if ("android".equals(browsetype)) {
            this.driver = new AndroidDriver();
        } else if ("htmlunit".equals(browsetype)) {
            this.driver = new HtmlUnitDriver(false); // JavaScript disabled
        } else {
            this.driver = new FirefoxDriver(); // fall back to Firefox
        }
        // Apply the implicit wait to whichever driver was created
        driver.manage().timeouts().implicitlyWait(Driver.TIMEOUT, TimeUnit.SECONDS);
    }

Conclusion

So far, we have reviewed test automation as applied to NAVER services. NHN has already automated tests for many services by using Hudson and Selenium WebDriver. I assume other companies have automated their tests in similar ways. I therefore hope this article becomes an opportunity to exchange ideas with others and build best practices together, rather than merely a showcase of our test methods.

By Hyehwan Ahn, Software Engineer at News Service Development Team, NHN Corporation. I am a developer who is interested in automating routine tasks in the Alpha/Beta/QA/Distribution release phases to improve efficiency.
|
||||||
|
Posted 10 days ago by Lukas Eder
This is a guest post by Lukas Eder, the creator of jOOQ open source Java API for typesafe SQL modeling. If you develop or use an open source application and would like to tell the world about it, CUBRID open source database project is accepting guest posts.
Big Data, the Web and SQL

In recent years, software companies have raised millions, even billions, of dollars by being acquired by a big player such as Google, Facebook, Yahoo! or Microsoft. Very often, the assumed value of such deals lay in the fact that Big Data could be purchased along with the acquisition. "Social" Big Data was generated by millions of users over the web. It seemed too big to fit in classic relational databases, which is why the purchases also included the proprietary, rather short-lived technologies used to maintain Big Data. Most of the new companies thus experimented with NoSQL in one form or another.

SQL, on the other hand, has come a long way. SQL is a very expressive and powerful language used to model queries against any type of data, albeit mostly relational. At the same time, SQL is standardised and quite open. CUBRID is a good example of an object-relational database that combines the expressiveness of SQL with high availability, sharding, and many other features needed to manage Big Data! In other words, CUBRID is proof that SQL can be an adequate technology for the modern web.

Querying CUBRID with jOOQ

jOOQ is a Java API modelling SQL as an internal domain-specific language directly in Java. It features a built-in code generator to generate Java classes from your database model. These generated classes can then be used to create typesafe SQL queries directly in Java. A simple example of how this works with CUBRID can be seen in this jOOQ CUBRID tutorial.

The idea of creating fluent APIs in Java is not new. Usually, Martin Fowler gets most of the credit for his elaborations on the subject. After that, many approaches towards building internal domain-specific languages have surfaced, mostly in unit testing environments (e.g. JMock and Mockito). Apart from jOOQ, there are also a couple of fluent APIs that model SQL as a language in Java.
These include:

- JaQu
- OneWebSQL
- Quaere
- QueryDSL
- Squill

Among the above, QueryDSL is the only other API with traction comparable to jOOQ's. While QueryDSL hides the full SQL expressiveness behind a LINQesque API, jOOQ strongly focuses on SQL only. Unlike any of the above SQL abstraction APIs, jOOQ combines these features:

A BNF defines jOOQ's fluent API

jOOQ uses next generation techniques to implement its fluent API. These techniques involve a formal BNF notation specifying API type and method hierarchies. With a formal BNF, jOOQ's fluent API is much more robust and typesafe, as it will dictate syntax correctness in a more formal way than ordinary builder APIs.

jOOQ embraces usage of stored procedures

When closely coupling with your favourite relational database, you will likely want to make use of stored procedures and functions, directly in your SQL. jOOQ embraces this fact and allows for typesafe embedding of stored functions.

jOOQ embraces usage of row value expressions

Row value expressions (also called tuples, records) are at the heart of SQL.
Few libraries outside of the SQL world will be able to model the fact that the following predicates are type-safe:

    SELECT * FROM t1 WHERE t1.a = (SELECT t2.a FROM t2)
    -- Types must match:   ^^^^           ^^^^

    SELECT * FROM t1 WHERE (t1.a, t1.b) IN (SELECT t2.a, t2.b FROM t2)
    -- Types must match:   ^^^^^^^^^^^^            ^^^^^^^^^^

    SELECT t1.a, t1.b FROM t1 UNION SELECT t2.a, t2.b FROM t2
    --     ^^^^^^^^^^   Types must match   ^^^^^^^^^^

jOOQ will leverage the Java compiler to help you check the above:

    select().from(t1).where(t1.a.eq(select(t2.a).from(t2)));
    // Type-check here: -----------------> ^^^^

    select().from(t1).where(row(t1.a, t1.b).in(select(t2.a, t2.b).from(t2)));
    // Type-check here: ----------------------------> ^^^^^^^^^^

    select(t1.a, t1.b).from(t1).union(select(t2.a, t2.b).from(t2));
    // Type-check here: -------------------> ^^^^^^^^^^

jOOQ emulates built-in functions and SQL clauses

Providing support for simple SQL clauses is easy: SELECT, DISTINCT, FROM, JOIN, GROUP BY, etc. Implementing "real" SQL is much harder, though. Take the above row value expressions, for instance. They are currently not supported in CUBRID, but you can use them nonetheless with jOOQ. jOOQ emulates missing functions and SQL clauses for you, as can be seen in this syndicated blog post.

jOOQ renders specialised SQL for 14 major RDBMS vendors

Instead of generalising and abstracting away advanced standard and vendor-specific SQL features, as JPA and tools built upon JPA do, jOOQ sees good things in each vendor-specific syntax element. You know your database well, so you want to leverage it, not abstract it.

jOOQ is a platform

jOOQ is much more than just a SQL library. For example, it features the very useful jOOQ Console, which helps you debug and profile your jOOQ-generated SQL statements in any environment, without the need for expensive third-party tools. The jOOQ Console also includes on-the-fly SQL editing tools as well as breakpoint capability for advanced debugging.
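To make the compile-time checking of row value expressions more concrete, here is a minimal plain-Java sketch of how generics can carry column types through a fluent API. It is not jOOQ's actual implementation: the `Field`, `Row2` and `Condition` classes and the `row` helper are hypothetical names invented for this illustration, and only an equality predicate is modelled.

```java
public class TypedSqlSketch {
    /** Hypothetical column type: carries its SQL type as the Java type parameter T. */
    static final class Field<T> {
        final String sql;
        Field(String sql) { this.sql = sql; }

        /** Accepts only a Field of the same T, so a type mismatch fails to compile. */
        Condition eq(Field<T> other) { return new Condition(sql + " = " + other.sql); }
    }

    /** Hypothetical two-column row value expression, typed by both column types. */
    static final class Row2<A, B> {
        final String sql;
        Row2(Field<A> a, Field<B> b) { this.sql = "(" + a.sql + ", " + b.sql + ")"; }

        /** Both column types must match position by position. */
        Condition eq(Row2<A, B> other) { return new Condition(sql + " = " + other.sql); }
    }

    /** Hypothetical predicate holder; it only keeps the rendered SQL text. */
    static final class Condition {
        final String sql;
        Condition(String sql) { this.sql = sql; }
    }

    static <A, B> Row2<A, B> row(Field<A> a, Field<B> b) { return new Row2<>(a, b); }

    /** Builds one well-typed predicate and returns its rendered SQL. */
    static String demo() {
        Field<Integer> t1a = new Field<>("t1.a");
        Field<String> t1b = new Field<>("t1.b");
        Field<Integer> t2a = new Field<>("t2.a");
        Field<String> t2b = new Field<>("t2.b");

        Condition ok = row(t1a, t1b).eq(row(t2a, t2b));
        // row(t1a, t1b).eq(row(t2b, t2a)); // rejected by the compiler: column types are swapped
        return ok.sql;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "(t1.a, t1.b) = (t2.a, t2.b)"
    }
}
```

The design point is that the mismatch is caught before any SQL is sent to the database, because the compiler has to unify the type parameters of both rows.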
More feature comparisons

More feature comparisons can be found here, in this blog post.

Getting productive with jOOQ

jOOQ is a vision where SQL matters again to the Java developer. While some have called ORM the Vietnam of Computer Science, jOOQ is the Peace Treaty Between SQL and Java. Using the above and many more features, you can be productive again when writing high-performing, specialised SQL against your favourite database directly in Java, typesafely compiled by your Java compiler.

By Lukas Eder, the creator of jOOQ. Follow him on Twitter @JavaOOQ. I'm a Java and SQL enthusiast developer currently contracting for Adobe Systems in Basel, Switzerland. Originating from the E-Banking field, I have a strong Oracle SQL background. I'm the creator of jOOQ, a comprehensive SQL library for Java.
|
||||||
|
Posted 10 days ago by Phil Jackson
This is a guest post by Phil Jackson, the creator of ApiAxle open source API management and analytics solution. If you develop or use an open source application and would like to tell the world about it, CUBRID open source database project is accepting guest posts.
ApiAxle is an API management solution which is open source and free. The basic premise is that you build your API, put ApiAxle in front of it and it will handle user authentication, rate limiting, statistics, etc. I represent a business which generates revenue through support and consultancy, but first and foremost I am a developer who loves APIs and the idea of companies exposing their data to enable people to build some brilliant things. As a company we are enjoying building a great product and watching it gain traction amongst fellow hackers.

The API problem

We noticed a space in the market for an open source, on-premise proxy which did not cost the earth and did not involve sending your data over multiple, high-latency hops out into the cloud. Where a developer these days can type apt-get install nginx to get a solid webserver, there was not really an equivalent for an API management system. Building out the features we provide can be a time-consuming, monotonous and error-prone process that we really hope people do not keep having to perform.

The solution

That is where we come in. We want to bury our heads in security documentation and RFCs so that you can concentrate on making a great API. With a few simple commands you can be up and running within 20 minutes. Within 30 you can be on-boarding customers, authenticating them and getting detailed statistics about their usage. You will also get caching, rate limiting, HTTPS support and a highly configurable logging system.

How it works

There are three components in ApiAxle:

The REPL

You probably want this first. The REPL allows you to configure ApiAxle from the command line. Setting up an API and API keys ready for the proxy to work with is easy. It fires up an instance of ApiAxle's own HTTP API in the background and uses that to modify aspects of the system. Anything you can do in the REPL you can do programmatically with the API too.
Install:

    $ sudo npm install apiaxle-repl

Start:

    $ apiaxle

Configure an API and a new key to use the API with:

    axle> key "05050c14643dc" create
    axle> api "acme" create endPoint="localhost:81"
    axle> api "acme" linkkey "05050c14643dc"

The Proxy

The kernel of the system. This goes between the Internet and your API and does the authentication, throttling, caching and statistics collection. It is fast, secure and easy to set up. You will need either the REPL or the API to configure it first. It does not matter what your API actually outputs - ApiAxle never modifies the body of a response. With regard to errors (e.g. user over quota), you can tell ApiAxle what format they should be in. We support XML or JSON.

If you have Node.js installed, installation is as simple as:

    $ sudo npm install apiaxle-proxy

Then start the proxy with:

    $ apiaxle-proxy

The API

Bear with me, this gets a bit meta. This is ApiAxle's own HTTP API which gives you full control over your APIs and the API keys and keyrings used to access them. You can view statistics about individual APIs and API keys from week-long granularity right down to near-real-time hits at single-second granularity.

Install:

    $ sudo npm install apiaxle-api

Start:

    $ apiaxle-api

Find out which APIs you have configured:

    $ curl "localhost:3000/v1/apis"

Where we are headed

We have lots planned. To summarise:

- Client drivers for the API. Demand is high for PHP and Ruby so we will get them done ASAP.
- Now that OAuth2 has been ratified we will be working on getting that in as an authentication method.
- We are working on a dashboard which will give you a way to manage your APIs, keys and keyrings and to view real-time and historical statistics - this will be a paid-for service.
- We will be pushing out a user registration system soon so that you can on-board and bill customers without any manual intervention.

So the future is exciting.
We are really looking forward to meeting more developers that are interested in APIs and the huge ecosystem that is formed around them. Please feel free to get in touch with any questions or just to say hi!

By Phil Jackson, the creator of ApiAxle. Follow him on Twitter @philjackson. After receiving his computing degree from Teesside University, Phil became a software engineer. As he moved through industry, he found himself becoming increasingly fascinated with APIs and, after helping the BBC write their iPlayer API, founded Qwerly, a company that aggregated social profile information and offered it to companies for insight/marketing purposes. After selling Qwerly to a competitor, taking some time off, Phil wrote the code which would eventually make up ApiAxle. Now he's running ApiAxle full-time and hopes it will become the ubiquitous, open source tool for managing APIs.
|
||||||
|
Posted 22 days ago by Esen Sagynov
At CUBRID we know how important, and at the same time how difficult, it is to find the right audience for what your open source project has to offer. By their nature, most open source projects, especially those driven by a community of like-minded enthusiasts, often do not have the budget to reach out to target users. If you are a member of such a community and would like to tell the world about it, read on. We have good news for you!
Last year we announced the CUBRID Affiliates Program, through which we have already donated 3000 USD to 14 open source projects. Some of these projects provide very useful features to CUBRID Database users. This time we want to periodically introduce to our CUBRID community members one or two handpicked open source projects that we think have potential and are worth talking about. Starting today, we will accept guest posts on our CUBRID Blog from open source communities, where they can introduce their software to our readers. We will handle the promotion.

Step 1: Pitch it!

We are looking for great software that will make our eyes and our readers' eyes sparkle. Before your post gets published, we would like to read an overview of your project. Send us a brief email introducing your project: site links, introductory videos, presentations if available, what it does, and what benefits it can provide to our readers. We will get back to you in no time.

Step 2: Write it!

If we accept your pitch, we will ask you to write a full post covering your software. You can include images and even video. In your introduction you need to tell the readers what problem your software solves, where and how a reader can get started, and, very importantly, how it differs from other existing solutions (I am sure there is at least one product you can compare with). Together with your article, send us your bio and a Gravatar-enabled email address.

Step 3: Get published!

We will create an account for you on our CUBRID community site and associate your email address with it. Then we will publish your post under your name so that you can receive notifications for comments as the post author. Once published, we will share your story on the CUBRID Facebook, Twitter and Google+ pages as well as on other networking sites like the popular DZone.
On top of that, we will place a 223x170 pixel banner at the top right of the CUBRID Blog site, if you provide one, and pay at least $5 (about 11K impressions) to promote your post on the Facebook timeline. Like it? Then go on and contact us!
|
||||||
|
Posted 24 days ago by Jaehong Kim
Last month a colleague of mine covered Vert.x, a relatively new Java application framework that offers a noticeable performance advantage over competing technologies and supports multiple programming languages. The previous article explained the philosophy of Vert.x, its performance compared with Node.js, its internal structure, and more. Today, I would like to continue this conversation and talk more about the Vert.x architecture.

Considerations Used to Develop Vert.x

Polyglot support is the feature that makes Vert.x stand out from other server frameworks. In the past, server frameworks could not support multiple languages. Supporting several languages does more than expand the range of users. More importantly, services written in different languages can communicate with each other easily in a distributed environment.

Of course, supporting a variety of languages is not sufficient for supporting a distributed environment. Functions of greater priority for a distributed environment include an address system and a message bus, and the Vert.x framework provides them. Because Vert.x provides these functions as well as polyglot support, it is worth considering for a distributed environment.

As Vert.x aims to be a universal server framework, a variety of workloads must be considered, including unusual cases that differ from those of Nginx, which is typically used as a Web server, or Node.js. The goal is to build a universal server application that processes a variety of protocols besides HTTP (i.e., not a Web server that executes simple operations with scalability in a 3-tier environment in mind). To accomplish this, Vert.x provides an additional thread pool while using the Run Loop method. We will discuss the Vert.x architecture starting from the thread pool and the considerations for a distributed environment.
Run Loop and Thread Pool

Vert.x and other asynchronous server applications (or frameworks), including Nginx and Node.js, use the Run Loop method. Vert.x uses the term 'Event Loop' instead of 'Run Loop'. However, as Run Loop is the more familiar term among some developers, I use it here. Run Loop, as you can guess from the name, is a method of checking for new events in an infinite loop and calling an event handler whenever an event is received. As such, 'asynchronous server application' and 'event-based server application' are different terms for the same thing, similar to an 'enharmonic' in music.

To use the Run Loop method, all I/O should be managed as events. For example, imagine a general Web server application that queries a database to respond to an HTTP request from a Web browser. The CPU of the Web server is used while a thread analyzes the HTTP request, executes the proper business logic, and creates a query statement. However, the CPU is not used while the thread sends the query to the database and waits for a response. If one thread is created per HTTP request (Thread per Connection), another thread can use the Web server CPU while one thread is waiting for a response from the database. In the end, the Web server CPU is kept busy processing HTTP requests. As you know, the weakness of Thread per Connection is the cost of context switching at the kernel level, since many threads must be created. This is waste. The asynchronous event handling method can overcome this weakness (figuratively speaking, 'asynchronous event handling' is the 'purpose' and 'Run Loop' is the 'means').
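As a rough illustration of the Run Loop idea described above, here is a minimal, self-contained Java sketch. It is not Vert.x code: the `RunLoopSketch` class, its `on`/`post`/`stop` methods and the event names are hypothetical, and the database reply is simulated by posting a follow-up event rather than by real non-blocking I/O.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class RunLoopSketch {
    /** Hypothetical event shape: a type tag plus a text payload. */
    static final class Event {
        final String type;
        final String payload;
        Event(String type, String payload) { this.type = type; this.payload = payload; }
    }

    private static final String STOP = "stop";
    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
    private final Map<String, Consumer<String>> handlers = new HashMap<>();

    void on(String type, Consumer<String> handler) { handlers.put(type, handler); }
    void post(String type, String payload) { queue.add(new Event(type, payload)); }
    void stop() { queue.add(new Event(STOP, "")); }

    /** The run loop: wait for the next event, then call its handler on this same thread. */
    void run() {
        while (true) {
            Event e;
            try {
                e = queue.take();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                return;
            }
            if (STOP.equals(e.type)) return;
            Consumer<String> handler = handlers.get(e.type);
            if (handler != null) handler.accept(e.payload);
        }
    }

    /** Runs a tiny request/response scenario and returns the order in which handlers fired. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        RunLoopSketch loop = new RunLoopSketch();
        loop.on("http-request", path -> {
            log.add("request " + path);
            // A real server would start non-blocking I/O here; the sketch simulates
            // the database reply by posting a follow-up event instead of waiting.
            loop.post("db-response", "rows for " + path);
        });
        loop.on("db-response", rows -> {
            log.add("response " + rows);
            loop.stop();
        });
        loop.post("http-request", "/news");
        loop.run();
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because every handler runs on the single loop thread, no handler may block for long; that constraint is what motivates a separate thread pool for slow work.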
If both 'the HTTP request itself' and 'receiving a response from the database' are treated as events, and the Run Loop calls the corresponding event handler whenever an event is received, the application's performance improves because unnecessary context switching is avoided. In this fashion, to utilize the CPU efficiently, the number of Run Loops required equals the number of cores (i.e., as many threads as cores should be created, and each thread should run a Run Loop).

However, creating only as many threads as there are cores, which is done to prevent as much context switching as possible, brings another problem. If a handler that uses server resources takes a long time to handle an event, other events received while that handler is running are not handled in a timely manner. A popular example is searching for files on the server disk. In this case, it is better to create a separate thread for searching files.

Therefore, to build a universal server framework with asynchronous event handling, the framework should have a function for managing a thread pool. This is the aim of Vert.x. Apart from polyglot support, thread pool management is the biggest difference between Vert.x and Node.js. Vert.x creates as many Run Loops (Event Loops) as there are cores and provides thread pool functions for tasks that use server resources and take a long time to handle an event.

Why is Hazelcast Used?

Vert.x uses Hazelcast, an In-Memory Data Grid (IMDG). The Hazelcast API is not directly exposed to users but is used inside Vert.x. When Vert.x starts, Hazelcast is started as an embedded element.

Hazelcast is a type of distributed storage. When storage is embedded and used in a server framework, we can obtain the effects expected of a distributed environment. The most popular case is session data processing. Vert.x calls it Shared Data. It allows multiple Vert.x instances to share the same data.
Of course, using an additional RDBMS instead of Hazelcast would have the same effect functionally. Naturally, though, embedded memory storage can consistently provide results faster than a remote RDBMS. Therefore, users who need sessions for e-commerce or chat servers can build a system with a simple configuration by using only Vert.x.

Hazelcast also allows the use of a message queue without additional costs or investments (without server costs or monitoring of message queue instances). As mentioned before, Hazelcast is distributed storage, and it can duplicate stored data for reliability. By using this distributed storage as a queue, a server application implemented with Vert.x becomes both a message processing server application and a distributed queue.

These benefits make Vert.x a strong framework in a distributed environment.

Understanding Vert.x Components

Figure 1: Vert.x Architecture (Component) Diagram.

Figure 1 above shows a diagram of Vert.x components. As shown in the figure, Hazelcast is embedded and runs in every Vert.x instance (an instance can be understood as a JVM). The embedded Hazelcast is connected to Hazelcast in other Vert.x instances. The Event Bus uses functions of Hazelcast. Hazelcast itself provides a certain level of reliability (because of WAL records and data duplication), so events can be forwarded with a certain level of reliability.

HTTP Server and Net Server

HTTP Server and Net Server control network events and event handlers. A Net Server handles events and handlers for private protocols, and an HTTP Server allows registering a handler for an HTTP event such as GET or POST. An HTTP Server is provided because HTTP itself is universal and because it removes the need to define additional event types. HTTP Server supports WebSocket as well as HTTP.

Figure 2: Event and Handler of HTTP Server.

Vert.x Thread Pool

Vert.x has three types of thread pools:

Acceptor: A thread to accept a socket. One thread is created for one port.
Event Loops: (same as Run Loop) One per core. When an event occurs, it executes the corresponding handler. When the handler finishes, it goes back to reading the next event.

Background: Used when an Event Loop executes a handler and an additional thread is required. Users can specify the number of threads in vertx.backgroundPoolSize, an environmental variable. The default is 20. Using too many threads increases context switching costs, so be cautious.

Event Loops can be described in more detail as follows. Event Loops use Netty's NioWorker as it is. All handlers specified by verticles run on Event Loops. Each verticle instance has its own assigned NioWorker, so a verticle instance is guaranteed to always execute on the same thread. Therefore, verticles can be written in a thread-safe manner.

Conclusion

So far, I have briefly described the Vert.x architecture. Since the Vert.x framework is not yet widely used, I believed it would be better to explain the design concepts behind Vert.x than to detail each Vert.x component. Even if you have no interest in network server frameworks, it is helpful to review new products and determine how they differ from existing ones. Doing so helps in understanding the evolution and direction of the software products that are flooding today's market.

By Jaehong Kim, Senior Engineer at Web Platform Development Lab, NHN Corporation.
|
||||||
|
Posted about 1 month ago by Kim Sung Kyu
PostgreSQL shows excellent functionalities and performance. Considering its high quality, it may seem strange that PostgreSQL is not more popular. However, PostgreSQL continues to make progress. This article will discuss this database.
Why You Should Know about PostgreSQL

PostgreSQL is an RDBMS that is popular mainly in North America and Japan. It is not used much in Korea yet, but as it is an excellent RDBMS in terms of functionality and performance, it is worth learning what kind of database PostgreSQL is.

PostgreSQL (pronounced as [Post-Gres-Q-L]) is an object-relational database system (ORDBMS), and is an open-source DBMS that provides enterprise-level DBMS functionalities and many other functionalities you can find only in advanced DBMSs. PostgreSQL is also known as the open-source DBMS that Oracle users can adapt to most easily, as it has many functionalities similar to those of Oracle.

History

PostgreSQL has many ancestors, and of them, Ingres (INteractive Graphics REtrieval System) can be said to be its progenitor. Ingres was a project launched by Michael Stonebraker (Picture 1), a great master in the area of databases who is still working hard today.

Picture 1: Michael Stonebraker started Ingres project.

The Ingres project was launched at the University of California, Berkeley, in the US in 1977. After Ingres, Michael Stonebraker started another project called Postgres (Post-Ingres). When Postgres version 3 was released in 1991, its user base had grown quite large. But as the burden of providing support to users became too high, the project was terminated in 1993. (Postgres is known to have had a huge influence on the current Informix product, even after the end of the project. Illustra, a commercial version of POSTGRES, was taken over by Informix in 1997, and then by IBM in 2001.)

Figure 1: Product History.

Although the project had ended, Postgres users and students continued its development and finally created Postgres95, which achieved 40% better performance than Postgres by supporting SQL and improving its structure.
When Postgres95 became an open-source system in 1996, it was given its current name, PostgreSQL, to reflect the fact that it succeeded Postgres and supports SQL (Postgres supported a QUEL-based language called POSTQUEL instead of SQL). In 1997, PostgreSQL was finally released, with its first version numbered 6.0. Since then, PostgreSQL has been actively developed by an open-source community, and the latest release is 9.2, as of May 2013. In addition, thanks to its open license (like the BSD or MIT license, it allows commercial use and modification, while making clear that the original developers are not liable for any problem that may occur in its use), there have been more than 20 forks, some of which have influenced PostgreSQL and some of which have disappeared.

PostgreSQL's logo is an elephant named 'Slonik' ("baby elephant" in Russian). The true reason why an elephant was chosen is not known, but it is said that just after the project became open source, one of its users was inspired by Agatha Christie's novel "Elephants Can Remember" and suggested it. Since then, the elephant logo has been visible at every official PostgreSQL event. As elephants are thought of as large, strong and reliable animals with a good memory, Hadoop and Evernote also use an elephant as their official logo.

Functionalities and Limitations

PostgreSQL supports transactions and ACID, which are the basic functionalities of a relational DBMS. Beyond basic reliability and stability, it also has many progressive or extended functionalities that come from academic research. Even a general list of PostgreSQL functionalities is long:
- Nested transactions (savepoints)
- Point in time recovery
- Online/hot backups, Parallel restore
- Rules system (query rewrite system)
- B-tree, R-tree, hash, GiST method indexes
- Multi-Version Concurrency Control (MVCC)
- Tablespaces
- Procedural Language
- Information Schema
- I18N, L10N
- Database & Column level collation
- Array, XML, UUID type
- Auto-increment (sequences)
- Asynchronous replication
- LIMIT/OFFSET
- Full text search
- SSL, IPv6
- Key/Value storage
- Table inheritance

In addition to these, it features a variety of other functionalities, including new enterprise-level DBMS functionalities. In general, PostgreSQL has the following limits:

Table 1: Basic Limits of PostgreSQL.

Limit                     Value
Max. Database Size        Unlimited
Max. Table Size           32 TB
Max. Row Size             1.6 TB
Max. Field Size           1 GB
Max. Rows per Table       Unlimited
Max. Columns per Table    250~1600
Max. Indexes per Table    Unlimited

Roadmap

As of May 2013, the latest release is 9.2. Figure 2 provides some brief information on the progress of PostgreSQL by year.

Figure 2: Progress of PostgreSQL by Year.

The main functionalities of each version are as follows:

Table 2: Main Functionalities by Version.
Version    Release Year   Main Functionalities
0.01       1995           Postgres95 release
1.0        1995           Copyright change, open source
6.0~6.5    1997~1999      Renamed PostgreSQL; Index, VIEWs and RULEs; Sequences, Triggers; Genetic Query Optimizer; Constraints, Subselect; MVCC, JDBC interface
7.0~7.4    2000~2010      Foreign keys; SQL92 syntax JOINs; Write-Ahead Log; Information Schema, Internationalization
8.0~8.4    2005~2012      Native Support for MS Windows; Savepoint, Point-in-time recovery; Two-phase commit; Table spaces, Partitioning; Full text search; Common table expressions (CTE); SQL/XML, ENUM, UUID Type; Window functions; Per-database collation; Replication, Warm standby
9.0        2010-09        Streaming replication, Hot standby; Support for 64bit MS Windows; Per-column conditional trigger
9.1        2011-09        Functionality differentiation; Synchronous replication; Per-column collations; Unlogged tables; K-nearest-neighbor indexing; Serializable isolation level; Writeable CTE (WITH); SQL/MED External Data; SE-Linux integration
9.2        2012-09        Performance optimization; linear scalability to 64 cores; Reduction in CPU power consumption; Cascade streaming replication; JSON, Range Type; Improved lock management; Space-partitioned GiST index; Index-only scans (covering)

The next PostgreSQL release under development is PostgreSQL 9.3, which is due to be released in the third quarter of 2013. This release features many functionalities, including an enhanced management functionality, parallel query, MERGE/UPSERT, multi-master replication, materialized view, and enhanced multi-language support.

Internal Structure

The following shows the process structure:

Figure 3: Process Structure.

When a client requests a connection to the server through an interface library (1) (one of a variety of interfaces, including libpq, JDBC and ODBC), the Postmaster process relays the connection to a server process (2). The client then executes queries over its connection to the allocated server process (Figure 3).
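The hand-off described above, where a supervising process gives each client its own server process, can be illustrated with a small toy model. This is only a sketch and not PostgreSQL code: the names `postmaster_connect`, `backend_main` and `run_query`, the `None` terminate message, and the use of pipes instead of the real client/server protocol are all assumptions made for illustration. It also assumes a POSIX system where processes can be forked.

```python
import multiprocessing as mp
import os

# Assumption: a POSIX system, where the "fork" start method is available.
_ctx = mp.get_context("fork")


def backend_main(conn):
    """Toy server process: serve exactly one client until it says goodbye."""
    while True:
        try:
            query = conn.recv()
        except EOFError:
            break
        if query is None:  # hypothetical "terminate" message
            break
        # A real backend would parse, plan and execute; here we only echo.
        conn.send((os.getpid(), f"executed: {query}"))
    conn.close()


def postmaster_connect():
    """Toy postmaster: start a dedicated backend and hand its channel to the client."""
    client_end, backend_end = _ctx.Pipe()
    proc = _ctx.Process(target=backend_main, args=(backend_end,), daemon=True)
    proc.start()
    backend_end.close()  # the parent keeps only the client side of the channel
    return client_end, proc


def run_query(client_end, query):
    """Send one query over the client's own channel and wait for the reply."""
    client_end.send(query)
    return client_end.recv()
```

Calling `postmaster_connect()` twice yields two channels served by two different process IDs, which mirrors the point of Figure 3: after the initial hand-off, each client talks only to its own allocated server process.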
The following shows the process of query execution in the server:

Figure 4: Query Execution Procedure.

When the server receives a query request from a client, it creates a parse tree through syntax analysis (1), then starts a new transaction and creates a query tree through semantic checking (2). Next, the query tree is rewritten according to the rules defined in the server (3), and from the many available execution plans, the optimal plan tree is created (4). The server executes this plan (5) and sends the result of the requested query to the client.

While the server executes a query, it frequently uses the system catalog in the database. In the system catalog, users can directly define functions and data types, as well as index access methods and rules. In PostgreSQL, therefore, the system catalog is the key point for adding or expanding functionalities.

A file that stores data consists of multiple pages, and a single page has a scalable slotted page structure (Figures 5 and 6).

Figure 5: Data Page Structure.

Figure 6: Index Page Structure.

Development Process

The development process model of PostgreSQL can be summed up in the following sentence: ‘A community-based open-source project led by a few.’ Like the Linux, Apache and Eclipse projects, the PostgreSQL project is composed of a few administrators, a variety of developers and a large number of users. The small administrator group (Core Team) collects requests and feedback from a large number of users (the group sometimes takes a vote at http://postgresql.uservoice.com to determine priorities), determines the direction of the product, has the right of final approval for code, and decides on releases. This model differs from company-managed development processes such as those of MySQL and JBoss. The developer group consists of code committers and code developers/contributors. They are located in many regions, including the U.S., Japan and Europe.
Figure 7: Distribution of PostgreSQL Developers by Region.

Code written by the various developers goes through a series of review processes (Submission Review, Usability Review, Feature Test, Performance Review, Coding Review, Architecture Review, Review Review) and is reflected in the product after approval by the Core Team. The community mainly communicates through its long-established mailing lists, and a variety of documents, including manuals, are well maintained on the official website.

Products in Competition

PostgreSQL wants to be compared with enterprise-level commercial DBs, but it has been compared mainly with popular open-source DBMSs. The following are the catchphrases of these open-source DBMSs, each of which reflects its features:

- PostgreSQL: The world's most advanced open source database
- MySQL: The world's most popular open source database
- CUBRID: Open Source Database Highly Optimized for Web Applications
- Firebird: The true open source database
- SQLite: self-contained library, serverless, zero-configuration, transactional SQL database engine

It is not easy to compare these products by their catchphrases alone, but you can see that PostgreSQL seeks progressiveness and openness. The following is a brief comparison of PostgreSQL and its competitors:

Table 3: Comparison of Products in Competition.

Product                  Characteristics
Oracle                   An enormous amount of long-proven code and a variety of references. High cost.
DB2, MS SQL              Similar to Oracle.
MySQL                    A variety of applications and references. Corporate development model and the burden of licensing.
CUBRID                   An alternative to MySQL. Built-in HA and database sharding. Dual licensing.
Other commercial DBs     Show a downtrend due to open-source DBMSs.
Other open source DBs    Struggle to attract developers.

For a long time, the PostgreSQL community has made attempts to enter the enterprise DBMS market.
In 2004, EnterpriseDB, a company built around PostgreSQL, was established, and it is striving to strengthen its position in the enterprise DBMS market. One of the company's main products is Postgres Plus Advanced Server. It was developed by adding Oracle-compatible functionalities (PL/SQL, SQL statements, functions, DB Links, OCI library, etc.) to the open-source PostgreSQL, and features easy data and application migration and a cost reduction of 20% compared to Oracle (Figure 8).

Figure 8: Cost Reduction Compared to Oracle.

In addition, Postgres Plus Advanced Server comes with differentiated services from PostgreSQL experts, including training, consulting, migration and technical support. Through approximately 300 reference sites in a variety of areas, the product is promoted as a database for all industries, with a growing base of users across the world.

Present Status and Trend

As you can see from most posts on PostgreSQL, most PostgreSQL users have a developer-like tendency and are very loyal to the product. In fact, they have good reason for their loyalty. PostgreSQL provides sufficient functionalities and conservative performance compared to other products, and one of its advantages is that it offers beginners good enough conditions to attract new developers. These include a well-written manual on the project page, related documents, over 300 reference publications, and over 10 seminars and conferences held in a variety of countries every year. More recently, a PostgreSQL magazine has even appeared. All of these are the results of the active PostgreSQL community.
The representative features that PostgreSQL users identify as being important are as follows:

- Reliability is the top priority of the product
- ACID and transaction
- A variety of indexing techniques
- Flexible full-text search
- MVCC for better concurrency performance
- Diverse and flexible replication methods
- A variety of procedural languages (PL/pgSQL, Perl, Python, Ruby, TCL, etc.) and interface languages (JDBC, ODBC, C/C++, .Net, Perl, Python, etc.)
- Excellent community and commercial support
- Well-made documents and a thorough manual

A variety of expansion functionalities, and the ease of developing them, are also advantages of PostgreSQL. The following are the differentiated expansion functionalities of PostgreSQL:

- GIS add-on (PostGIS)
- Key-Value store expansion (HStore)
- DBLink
- Support for a variety of functions and types, including Crypto and UUID

There are many other practical and experimental expansion functionalities as well. Of these, here is a brief account of GIS (Geographic Information System), which has recently become a hot topic.

PostGIS is a middleware expansion functionality that enables PostgreSQL to conform to the OpenGIS standard and support geographic objects (Figure 9).

Figure 9: PostGIS Structure.

Development of PostGIS began in 2001, and after many functionality and performance improvements, it currently has the most users among open-source products of its kind. There are some commercial products, such as Oracle Spatial, DB2 and MS SQL Server, but they have not been as well received in terms of price-performance ratio. In addition, you can easily find benchmark data showing that the functionalities and performance of PostGIS/PostgreSQL are worthy of comparison to Oracle.

Recently, PostgreSQL has also been much talked about in relation to the cloud as well as GIS.
With the recent increase in the number of companies providing DBaaS (Database as a Service), the demand for PostgreSQL, which has advantages in terms of cost and license, has increased. Accordingly, EnterpriseDB has released Postgres Plus Cloud Database for the cloud market, with the following features:

- Simple setup & web-based management
- Automatic scaling, load balancing and failover
- Automated online backup
- Database Cloning

It is used on many cloud platforms, including Amazon EC2, Eucalyptus cloud, and the Red Hat OpenShift development platform cloud. Other cloud service providers such as Heroku and dotCloud also provide services using PostgreSQL.

Conclusion

After Sun, which had acquired MySQL, was itself acquired by Oracle in 2009, MySQL began to be developed as a more closed corporate project, and many MySQL developers left the community around the same time. Wary of this change, MySQL users are paying attention not only to the MySQL forks (MariaDB, Drizzle, Percona, etc.) to which they can easily migrate, but also to migration to PostgreSQL. Looking at the trend of help-wanted ads related to PostgreSQL and MySQL on http://www.indeed.com, the most popular job search portal (Figure 10), we can see that the increase in help-wanted ads related to MySQL is slowing down, while help-wanted ads related to PostgreSQL continue to increase.

Figure 10: Trend of Help-wanted Ads.

According to the trend of search frequency on search sites (Figure 11), MySQL shows a continued downtrend, while PostgreSQL shows almost no change. In Korea, however, the search frequency for PostgreSQL has shown an upward trend since mid-2010.

Figure 11: Search Frequency Trend (source).

Of course, the popularity and usage of MySQL are still much higher than those of PostgreSQL. Although you cannot determine the true status or prospects of these products from the above data alone, you could infer that if the popularity of MySQL declines, the popularity of PostgreSQL will increase.
PostgreSQL is not yet powerful enough to surpass MySQL in popularity, but the PostgreSQL open source project community continues to make the following efforts:

- Improvement of the reliability of basic DBMS functionalities
- Provision of progressive and differentiated functionality expansion
- Continuous attraction of more open source developers

In addition, EnterpriseDB, which has stronger business purposes, is also striving to achieve the following objectives:

- Expansion of its share in the enterprise market
- Expansion of its share in the cloud market
- Efforts to replace Oracle and MySQL

By Kim Sung Kyu, Senior Software Engineer at CUBRID DBMS Lab, NHN Corporation.
||||||
|
Posted
about 1 month
ago
by
Esen Sagynov
Three weeks ago, a few of my colleagues and I attended and gave talks at two international conferences on behalf of the CUBRID team. Today I would like to share my impressions of these events. I will write a separate post about the various sharding solutions introduced at these conferences. So, stay tuned!
RIT++

The first presentation, at RIT++ (Russian Internet Technologies), was held on Monday, April 22nd, 2013, in Moscow, Russia. The second one, at Percona MySQL Conference & Expo, was held the same week on Wednesday, April 24th, 2013, in Santa Clara, CA, US. At both conferences the topic was the same: "Easy MySQL Database Sharding with CUBRID SHARD". At RIT++, though, the presentation was given in Russian. Very exciting! The following is a list of resources related to the talks.

- The presentation abstract in English
- The presentation abstract in Russian
- Slideshare in English
- Slideshare in Russian

This was the third time we, the CUBRID team, have attended a conference organized by the Russian company Ontico. Previously we attended the RIT++ 2012 and HighLoad++ 2012 conferences. This year at RIT++ 2013 there were over 800 attendees and 13 categories of talks, ranging from client-side development to server-side, database scalability, project management, analytics, and so on. Every year, after the conference is over, the organizers conduct a post-event survey to assess the experience. I think that because of this feedback from users, the RIT++ organizers accepted more talks related to client-side development this year than usual.

Besides us from Korea, there were presenters from the States, representing Facebook, and from Brazil, representing PUC-Rio University. My personal impression was that this year there were fewer foreign speakers than last year at RIT++ or HighLoad++.

At my session about MySQL database sharding with CUBRID SHARD there were over 100 attendees, close to 150 I would guess. The audience welcomed my speech in Russian very well. Next time I should talk in Russian again. They like it! When the presentation was over, there was a slew of questions. I think CUBRID SHARD, as an easy sharding middleware for MySQL, was received very well. To my surprise, there were many questions unrelated to CUBRID SHARD.
The audience asked a lot about CUBRID itself and its HA feature. Later I learned that many attendees had listened to my talks about the CUBRID open source relational database system last year. One member of the audience said that he had been looking into CUBRID for a while already and was considering using it in production. His favorite features in CUBRID were its built-in support for HA and its very clever 3-tier architecture. Overall, the unofficial Q&A session lasted for over 1 hour 30 minutes. It was a great experience for me to present CUBRID SHARD at RIT++ this year, and a great opportunity for our CUBRID team. The conference lasted two days, but I could not attend the second day as I had to head to Santa Clara, CA, to give a talk at Percona MySQL Conference & Expo.

Percona

It was the first time I have talked at Percona. Previously we spoke at OSCON 2011 about CUBRID HA, and at the 2010 MySQL Conference & Expo about CUBRID Database. Compared to OSCON, the Percona MySQL conference was a lot more specific (obviously about MySQL). There were more quality talks about scalability and performance tuning. If I were to choose where to go next year, I would definitely select Percona. That is how interesting it was!

Unlike at RIT++, our session at the Percona conference attracted only about 20 attendees. The presentation went well, but I have to admit that the number of listeners plays a big role. There were fewer questions and less enthusiasm. On the other hand, the Facebook, Continuent, Tokutek and two Percona presentations, which were held at the same time at 3:30 PM, attracted hundreds of listeners each. After realizing this, I came to the conclusion that brand recognition plays a significant role in attracting listeners. Even though NHN is very popular in Korea and Asia in general, it is almost unknown in Western countries. In fact, when I asked the audience at Percona if they had ever heard about NHN, their answer was negative. What a pity.
I think NHN has to seriously reconsider its strategy for increasing its worldwide brand recognition. Nevertheless, I am very glad we had this chance to present our open source sharding middleware at a well-known conference like Percona. As I mentioned at the beginning of this post, I will write another post covering the various sharding solutions presented at the Percona conference. It was very interesting to learn about the different techniques used by large-scale service providers who have developed their own sharding solutions.

After my presentation was over and I had answered all the questions, I headed to one of the lounge rooms, where I had made an appointment to meet Ryan Walsh, a Corporate Account Executive at Percona. We discussed various opportunities for cooperation between Percona and NHN, the company behind CUBRID development. Percona is a widely known and reputable MySQL support and consulting company. It is known to be the oldest and largest independent company that not only provides MySQL support, consulting and training but also develops a custom MySQL server, i.e. provides patches and "backport changes to older MySQL versions to obtain a key patch without a full version upgrade".

During our conversation, Ryan introduced his company and told me about the large-scale cases it has worked on so far. One that I would like to mention today is that some of the services at Amazon Web Services have actually been developed by Percona. Amazon RDS was said to have been developed by the Percona team, and Percona database tools seem to work with RDS natively. Percona is also cooperating with HP to build RedDwarf DaaS as part of the OpenStack open source cloud project. At the Percona conference, HP engineers presented how to use the RedDwarf APIs to use and administer the features of Percona Server. Percona's vast knowledge and experience in building cloud database services may be quite beneficial to NHN in developing and providing its own cloud computing service.
Overall, both presentations went well. I talked to many attendees and answered quite a lot of their questions about CUBRID SHARD and CUBRID Database. One thing that requires more attention from NHN is its global brand recognition. The more developers recognize NHN and its services, the more of them will be eager to listen to and learn from NHN engineers. If you have any feedback or suggestions, feel free to comment below. You can also follow us on Twitter.
||||||
|
Posted
about 1 month
ago
by
Lee Jae Ik
At NHN we have a service called NELO (NHN Error Log System) to manage and search logs pushed to the system by various applications and other Web services. The search performance and functionality of NELO2, the second generation of the system, have been significantly improved through ElasticSearch. Today I would like to share our experience at NHN in deploying ElasticSearch in Log Search Systems.
ElasticSearch is a distributed search engine based on Lucene, developed by Shay Banon. Shay and his team have recently released the long-awaited version 0.90. There is a one-hour recorded webinar in which Clinton Gormley, one of the core ElasticSearch developers, explains what's new in ElasticSearch 0.90. If you are developing a system that requires search functionality, I would recommend ElasticSearch, as its installation and server expansion are very easy. Since it is a distributed system, ElasticSearch can easily cope with an increase in the volume of search targets. At NHN, all logs coming into NELO2 are stored and indexed by ElasticSearch for faster, near real-time search results.

Features of ElasticSearch

Let's get started by familiarizing ourselves with the terms widely used in ElasticSearch. For those who are familiar with relational database systems, the following table compares the terms used in relational databases with the terms used in ElasticSearch.

Table 1: Comparison of the terms of RDBMS and ElasticSearch.

Relational DB    ElasticSearch
Database         Index
Table            Type
Row              Document
Column           Field
Schema           Mapping
Index            Everything is indexed
SQL              Query DSL

JSON-based Schemaless Storage

ElasticSearch is a search engine, but it can be used like a NoSQL store. Since the data model is represented in JSON, both requests and responses are exchanged as JSON documents, and the source documents are also stored as JSON. Although no schema is defined in advance, JSON documents are automatically indexed when they are transferred. Number and date types are mapped automatically.

Multitenancy

ElasticSearch supports multitenancy. Multiple indexes can be stored in a single ElasticSearch server, and the data of multiple indexes can be searched with a single query. NELO2 stores logs in a separate index for each date. When executing a search, NELO2 queries the indexes for all dates within the search range with a single request.

Code 1: Multitenancy Example Query.
    # Store logs in the log-2012-12-26 index
    curl -XPUT http://localhost:9200/log-2012-12-26/hadoop/1 -d '{
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-26T14:12:12",
        "host": "host1.nelo2",
        "body": "org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile"
    }'

    # Store logs in the log-2012-12-27 index
    curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }'

    # Search the log-2012-12-26 and log-2012-12-27 indexes at once
    curl -XGET http://localhost:9200/log-2012-12-26,log-2012-12-27/_search

Scalability and Flexibility

ElasticSearch provides excellent scalability and flexibility. It allows functionality to be expanded through plugins, a mechanism that was further improved in the recent 0.90 release. For example, by using the Thrift or Jetty plugin, you can change the transfer protocol. If you install BigDesk or Head, which are essential plugins, you can monitor ElasticSearch. As shown in Code 2, the number of shards and replicas is set when an index is created, and the number of replicas can also be adjusted dynamically afterwards. The number of shards cannot be changed, as it is fixed for each index, so an appropriate number of shards should be allocated from the start, taking the number of nodes and future server expansion into account.

Code 2: Dynamic Configuration Change Query.

    # Set the number of shards and replicas when the index is created
    $ curl -XPUT http://localhost:9200/log-2012-12-27/ -d '{
        "settings": {
            "number_of_shards": 10,
            "number_of_replicas": 1
        }
    }'

    # Later, change only the number of replicas of the existing index
    $ curl -XPUT http://localhost:9200/log-2012-12-27/_settings -d '{
        "index": { "number_of_replicas": 2 }
    }'

Distributed Storage

ElasticSearch is a distributed search engine. It distributes data across multiple shards according to keys. An index is configured for each shard, and each shard has 0 or more replicas.
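To see why a key-based distribution ties an index to a fixed number of shards, consider the following minimal sketch. It assumes that a document's shard is computed as a hash of its key modulo the number of shards; `zlib.crc32` is only a stand-in for whatever hash function ElasticSearch uses internally, and `shard_for`, `moved_keys` and the `log-N` keys are hypothetical names for illustration.

```python
import zlib


def shard_for(doc_key: str, number_of_shards: int) -> int:
    """Map a document key to a shard number in [0, number_of_shards)."""
    # Assumption: crc32 stands in for the engine's real hash function.
    return zlib.crc32(doc_key.encode("utf-8")) % number_of_shards


def moved_keys(keys, old_shards: int, new_shards: int):
    """Return the keys whose shard would differ if the shard count changed."""
    return [k for k in keys
            if shard_for(k, old_shards) != shard_for(k, new_shards)]


keys = [f"log-{i}" for i in range(1000)]  # hypothetical document keys
placement = {k: shard_for(k, 10) for k in keys}
relocated = moved_keys(keys, 10, 11)
print(f"{len(relocated)} of {len(keys)} keys would map to a different shard")
```

Under this assumption, changing the shard count sends most existing keys to a different shard, so documents could no longer be found where they were stored. Replicas are unaffected by the formula, which fits the fact that only the replica count can be changed on a live index.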
ElasticSearch also supports clustering. When a cluster runs, one of its nodes is selected as the master node to manage metadata. If the master node fails, another node in the cluster automatically becomes the master. It is also very easy to add nodes: when a node is added to the same network, it automatically finds the cluster through multicast and joins it. If the same network is not used, the master node address should be specified through unicast (see a related video: http://youtu.be/l4ReamjCxHo).

Installing

Quick Start

ElasticSearch supports zero-configuration installation. As shown in the following code snippets, all you have to do is download a file from the official homepage, unzip it, and run the server.

Download

    ~$ wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.1.tar.gz
    ~$ tar xvzf elasticsearch-0.20.1.tar.gz

Executing Server

    ~$ cd elasticsearch-0.20.1
    ~/elasticsearch-0.20.1$ bin/elasticsearch -f

Installing Plugins

You can easily expand the functionality of ElasticSearch through plugins. You can add management functionalities, change the Lucene analyzer, and change the basic transfer module from Netty to Jetty. The following are the commands we use to install plugins for NELO2. Head and bigdesk, in the first and second lines, are the plugins required for ElasticSearch monitoring. It is strongly recommended to install them and check their functionalities. After installing them, visit http://localhost:9200/plugin/head/ and http://localhost:9200/plugin/bigdesk/ to see the status of ElasticSearch in your Web browser.

    bin/plugin -install Aconex/elasticsearch-head
    bin/plugin -install lukas-vlcek/bigdesk
    bin/plugin -install elasticsearch/elasticsearch-transport-thrift/1.4.0
    bin/plugin -install sonian/elasticsearch-jetty/0.19.9

Main Configurations

You don't need to change configurations when conducting a simple functionality test.
When you carry out a performance test or apply ElasticSearch to production services, however, you should change some default configurations. See the following snippet and try to find for yourself the configurations that should be changed from the initial configuration file.

Code 5: Main Configurations (config/elasticsearch.yml).

    # This name is used to identify clusters, so use a unique and meaningful name.
    cluster.name: ElasticSearch-nelo2

    # A node name is created automatically, but it is recommended to use a name
    # that is discernible in a cluster, such as a host name.
    node.name: "xElasticSearch01.nelo2"

    # Both of the following default to true. node.master sets whether the node
    # can be the master, while node.data sets whether the node stores data.
    # Usually both are set to true; if the cluster is big, adjust these values
    # per node to configure three types of node. More details are given in the
    # account of topology configuration later.
    node.master: true
    node.data: true

    # You can change the number of shards and replicas.
    # The following values are the defaults.
    index.number_of_shards: 5
    index.number_of_replicas: 1

    # To prevent JVM swapping, set the following value to true.
    bootstrap.mlockall: true

    # Timeout for checking the status of each node in a cluster. Set an
    # appropriate value; if it is too small, nodes may frequently drop out of
    # the cluster. The default value is 3 seconds.
    discovery.zen.ping.timeout: 10s

    # The default is multicast, but in an actual environment unicast should be
    # used because of the possibility of overlapping with other clusters. It is
    # recommended to list the servers that can be a master in the second setting.
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3[portX-portY]"]

Using REST API

ElasticSearch provides a REST API as shown below.
It provides most of its functionalities through the REST API, including the creation and deletion of indexes and mappings, as well as search and the change of settings. In addition to the REST API, it also provides client APIs for various languages, including Java, Python and Ruby.

Code 6: REST API Format in ES.

    http://host:port/(index)/(type)/(action|id)

As mentioned earlier, NELO2 separates indexes (databases in RDBMS terms) by date and types (tables) by project. Code 7 below shows how the logs that came into the hadoop project on December 27, 2012 are handled, document by document, through the REST API.

Code 7: An Example of Using ElasticSearch REST API.

    # Creating, getting and deleting a document
    curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1
    curl -XGET http://localhost:9200/log-2012-12-27/hadoop/1
    curl -XDELETE http://localhost:9200/log-2012-12-27/hadoop/1

    # Search
    curl -XGET http://localhost:9200/log-2012-12-27/hadoop/_search
    curl -XGET http://localhost:9200/log-2012-12-27/_search
    curl -XGET http://localhost:9200/_search

    # Seeing the status of indexes
    curl -XGET http://localhost:9200/log-2012-12-27/_status

Creating Documents and Indexes

As shown in Code 8, when the request is sent, ElasticSearch creates the log-2012-12-27 index and the hadoop type automatically, without any pre-defined index or type. If you want to create them explicitly instead of relying on auto creation, set action.auto_create_index and index.mapper.dynamic to false in the configuration file.

Code 8: Creating Documents.

    # Request
    curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }'

    # Result
    {
        "ok": true,
        "_index": "log-2012-12-27",
        "_type": "hadoop",
        "_id": "1",
        "_version": 1
    }

As shown in Code 9 below, you can also include the type in the document when making a request.
Code 9: A Query Including Type.

    curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
        "hadoop": {
            "projectName": "hadoop",
            "logType": "hadoop-log",
            "logSource": "namenode",
            "logTime": "2012-12-27T02:02:02",
            "host": "host2.nelo2",
            "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
        }
    }'

If the id value is omitted, as in Code 10, an id is created automatically when the document is created. Note that the POST method is used instead of PUT for this request.

Code 10: A Query Creating a Document without an ID.

    # Request
    curl -XPOST http://localhost:9200/log-2012-12-27/hadoop/ -d '{
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }'

    # Result
    {
        "ok": true,
        "_index": "log-2012-12-27",
        "_type": "hadoop",
        "_id": "kgfrarduRk2bKhzrtR-zhQ",
        "_version": 1
    }

Deleting a Document

Code 11 below shows how to delete a document (a record in RDBMS terms) of a type (a table). You can delete the hadoop type document with id=1 in the log-2012-12-27 index by using the DELETE method.

Code 11: A Query to Delete a Document.

    # Request
    $ curl -XDELETE 'http://localhost:9200/log-2012-12-27/hadoop/1'

    # Result
    {
        "ok": true,
        "_index": "log-2012-12-27",
        "_type": "hadoop",
        "_id": "1",
        "found": true
    }

Getting a Document

You can get the hadoop type document with id=1 in the log-2012-12-27 index by using the GET method, as shown in Code 12.

Code 12: A Query to Get a Document.
    # Request
    curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/1'
    # Result
    {
      "_index": "log-2012-12-27",
      "_type": "hadoop",
      "_id": "1",
      "_source": {
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
      }
    }
Search
When the Search API is called, ElasticSearch executes the query and returns the results that match it. Code 13 shows examples of using the Search API.
Code 13: An Example Query of Using Search API.
    # All types of a specific index
    $ curl -XGET 'http://localhost:9200/log-2012-12-27/_search?q=host:host2.nelo2'
    # Specific types of a specific index
    $ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop,apache/_search?q=host:host2.nelo2'
    # A specific type of all indexes
    $ curl -XGET 'http://localhost:9200/_all/hadoop/_search?q=host:host2.nelo2'
    # All indexes and types
    $ curl -XGET 'http://localhost:9200/_search?q=host:host2.nelo2'
Search API by Using URI Request
Table 2: Main Parameters.
    Name              Description
    q                 Query string.
    default_operator  The operator used as a default (AND or OR). The default is OR.
    fields            The field to get as a result. The default is the "_source" field.
    sort              Sort method. Ex) fieldName:asc/fieldName:desc.
    timeout           Search timeout value. The default is "unlimited".
    size              The number of result values. The default is 10.
With a URI request, you can search easily by combining the parameters in Table 2 with a query string. Because a URI request does not expose every search option, it is mainly useful for tests.
Code 14: Search Query by Using URI Request.
    # Request
    $ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_search?q=host:host2.nelo2'
    # Result
    {
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1,
        "hits": [
          {
            "_index": "log-2012-12-27",
            "_type": "hadoop",
            "_id": "1",
            "_source": {
              "projectName": "hadoop",
              "logType": "hadoop-log",
              "logSource": "namenode",
              "logTime": "2012-12-27T02:02:02",
              "host": "host2.nelo2",
              "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
            }
          }
        ]
      }
    }
Search API by Using Request Body
When the HTTP body is used, the search is expressed in the query DSL. The query DSL is extensive, so you are advised to refer to the guide on the official website.
Code 15: Search by Using Query DSL.
    # Request
    $ curl -XPOST 'http://localhost:9200/log-2012-12-27/hadoop/_search' -d '{
      "query": {
        "term": { "host": "host2.nelo2" }
      }
    }'
    # Result
    {
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1,
        "hits": [
          {
            "_index": "log-2012-12-27",
            "_type": "hadoop",
            "_id": "1",
            "_source": {
              "projectName": "hadoop",
              "logType": "hadoop-log",
              "logSource": "namenode",
              "logTime": "2012-12-27T02:02:02",
              "host": "host2.nelo2",
              "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
            }
          }
        ]
      }
    }
Mapping
Put Mapping API
To add a mapping to a specific type, you can define the mapping in the form shown in Code 16.
Code 16: Query to Register a Mapping.
    $ curl -XPUT 'http://localhost:9200/log-2012-12-27/hadoop/_mapping' -d '
    {
      "hadoop": {
        "properties": {
          "projectName": {"type": "string", "index": "not_analyzed"},
          "logType": {"type": "string", "index": "not_analyzed"},
          "logSource": {"type": "string", "index": "not_analyzed"},
          "logTime": {"type": "date"},
          "host": {"type": "string", "index": "not_analyzed"},
          "body": {"type": "string"}
        }
      }
    }'
Get Mapping API
To get the defined mapping information, you can use a query in the form shown in Code 17.
Code 17: Query to Get a Mapping.
    $ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_mapping'
Delete Mapping API
Code 18 shows an example of deleting a defined mapping.
Code 18: Query to Delete a Mapping.
    $ curl -XDELETE 'http://localhost:9200/log-2012-12-27/hadoop'
How to Optimize Performance
Memory and the Number of Open Files
As the amount of data to search increases, you will need more memory, and memory usage is a common source of problems when running ElasticSearch. The ElasticSearch community recommends that, when a server runs ElasticSearch exclusively, you allocate only half of the memory to ElasticSearch and leave the other half to the OS for the system cache. You can set the memory size with the ES_HEAP_SIZE environment variable or with the -Xms and -Xmx options of the JVM.
Code 19: Execution by Specifying Heap Size.
    bin/elasticsearch -Xmx2g -Xms2g
When using ElasticSearch, you will see OutOfMemory errors frequently. This error occurs when the field cache exceeds the maximum heap size. If you change the index.cache.field.type setting from resident (the default) to soft, soft references are used, so the cache area is reclaimed preferentially by the garbage collector, and this problem can be resolved.
Code 20: Configuring Field Cache Type.
    index.cache.field.type: soft
As the amount of data increases, the number of index files also increases. This is because Lucene, which is used by ElasticSearch, manages indexes in units of segments. Sometimes the number of files even exceeds the MAX_OPEN files limit. For this reason, you need to raise the maximum open file limit by using the ulimit command. The recommended value is 32000-64000, but you may need to set a larger value depending on the size of the system or the data.
Index Optimization
NELO2 manages indexes by date. If indexes are managed by date, you can easily and quickly delete old logs that no longer need to be kept, as shown in Code 21.
In this case, the overhead imposed on the system is smaller than when deleting logs by specifying the TTL value for each document. Code 21: Deleting an Index. $ curl -XDELETE 'http://localhost:9200/log-2012-10-01/' If index optimization is performed, segments are incorporated. Using this method, you can enhance search performance. As index optimization can impose a burden on the system, it is better to perform it when the system is being used less. Code 22: Index Optimization. $ curl -XPOST 'http://localhost:9200/log-2012-10-01/_optimize' Shards and Replicas You can't change the number of shards after setting it. For this reason, you need to decide this value carefully by taking the current number of nodes in the system and the number of nodes expected to be added in the future into account. For example, if there are 5 nodes and the number is expected to reach 10 in the future, it is recommended to set the number of shards as 10 from the beginning. If you set it as 5 in the beginning, you can add 5 more nodes later, but you won't be able to use the added 5 nodes. If you set the number of replicas to 1, of course, you can utilize the added 5 nodes as nodes exclusively for replication. If the number of shards increases, it is more advantageous to process a large amount of data because queries are distributed as much as the number of shards. But you need to set this value appropriately, because the performance could be deteriorated due to increasing traffic if the value is too high. Configuring Cluster Topologies The content of the configuration file of ElasticSearch is shown in Code 23 below. There are three types of nodes: data node This does not act as the master, and only stores data. When it receives a request from a client, it searches data from shards or creates an index. master node It functions to maintain a cluster, and requests indexing or search to data nodes. 
search balancer node If it receives a search request, it requests data, gathers data and delivers the result. You can have one node which will function both like a master and a data node. But if you use the three types of node separately, you can reduce the burden of the data node. In addition, if you configure the master node separately, you can improve the stability of a cluster. Also, you can reduce operation costs by using low-spec. server equipment for the master and search node. Code 23: Settings Related to Topology. # You can exploit these settings to design advanced cluster topologies. # # 1. You want this node to never become a master node, only to hold data. # This will be the "workhorse" of your cluster. # # node.master: false # node.data: true # # 2. You want this node to only serve as a master, to not store any data and # to have free resources. This will be the "coordinator" of your cluster. # # node.master: true # node.data: false # # 3. You want this node to be neither a master nor a data node, but # to act as a "search load balancer" (fetching data from nodes, # aggregating results, etc.) # # node.master: false # node.data: false Figure 1 below shows the configuration of NELO2 topologies that use ElasticSearch. The efficiency of equipment use and the stability of the entire cluster has been improved as follows: only ElasticSearch runs on the 20 data nodes (server) so that they can achieve sufficient performance, while other daemon server processes in addition to ElasticSearch run on the 4 master nodes and 3 search nodes. Figure 1: NELO2 ElasticSearch Topologies. Configuring Routing When a large amount of data needs to be indexed, increasing the number of shards will improve the overall performance. On the other hand, if the number of shards increases, the traffic among nodes will also go up. 
For example, when there are 100 shards, if it receives a single search request, it sends the request to all the 100 shards and aggregates data, and this imposes a burden on the entire cluster. If you use routing, data will be stored only in a specific shard. Even if the number of shards increases, the application will still send a request only to a single shard, and consequently the traffic can be reduced dramatically. Figure 2, 3, and 4 are excerpted from the slides Rafal Kuc presented at Berlin Buzzwords 2012. If you don't use routing, as shown in Figure 2, the application will send a request to all the shards. But if you use routing, it will send a request only to a specific shard, as shown in Figure 3. According to the material cited, in Figure 4 when there are 200 shards, the response time is over 10 times faster with routing than without routing. If routing is applied, the number of threads will increase by 10 to 20 times compared to when it is not applied, but the CPU usage is much smaller. In some cases, however, the performance will be better when routing is not applied. For a search query whose result should be collected from multiple shards, it could be more advantageous in terms of performance to send the request to multiple shards. To complement this, NELO2 determines the use of routing depending on the log usage of the project. Figure 2: Before Using Routing. Figure 3: After Using Routing. Figure 4: Performance Comparison before and after Using Routing. Conclusion The number of users of ElasticSearch is increasing rapidly, thanks to its easy installation and high scalability. It was several days only since the release of the latest ElasticSearch version 0.90. Its functionality is improving very quickly thanks to its active community. In addition, more and more companies are beginning to use ElasticSearch for their services. 
Recently, some committers, including the developer Shay Banon, gathered together and established ElasticSearch.com, which provides consulting and training services. In this article I have explained the basics of installing ElasticSearch, how to use it, and how to tune its performance. We have started testing the latest 0.90 release and will soon migrate the current 0.20.1 ES deployment. In the next post I will continue this topic and tell you about our experience with 0.90 as well as the critical split-brain problem we have previously experienced. Because solutions for this problem are scarce, I believe it will be very useful for our readers.
By Lee Jae Ik, Senior Software Engineer at Global Platform Development Lab, NHN Corporation.
References
Official guide: http://www.ElasticSearch.org/guide/
Introduction to ElasticSearch and comparison of the terms of ElasticSearch and RDB: http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-ElasticSearch
About ElasticSearch: http://www.slideshare.net/dadoonet/ElasticSearch-devoxx-france-2012-english-version
Shay Banon's articles: http://2011.berlinbuzzwords.de/sites/2011.berlinbuzzwords.de/files/ElasticSearch-bbuzz2011.pdf
Using ElasticSearch for logs: http://www.ElasticSearch.org/tutorials/2012/05/19/ElasticSearch-for-logging.html
Concept of multitenancy: http://en.wikipedia.org/wiki/Multitenancy
Shay Banon's ElasticSearch optimization: https://github.com/logstash/logstash/wiki/ElasticSearch-Storage-Optimization
Rafal Kuc's article on performance tuning presented at Berlin Buzzwords 2012: http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
||||||
|
Posted
about 1 month
ago
by
Esen Sagynov
We are very glad to announce the immediate availability of CUBRID ALL-IN-ONE Windows Downloader version 1.0 beta. You can download CUBRID ALL-IN-ONE Windows Downloader from http://www.cubrid.org/wiki_tools/entry/cubrid-all-in-one-windows-downloader. The source code is available at http://svn.cubrid.org/cubridtools/cubrid-downloader/ and is open sourced under the BSD license, just like all other CUBRID Tools.
CUBRID ALL-IN-ONE Windows Downloader is an application that allows our users to easily download CUBRID components, including the server engine, drivers and GUI tools. All you have to do is select the components you want to download to your local Windows machine, and the Downloader will download them for you, one by one, without any other action required.
Application key features:
- Auto-updates itself whenever a new version is available (it uses the ClickOnce technology).
- Retrieves all component information from a remote CUBRID online location, so it is always up-to-date with the latest application releases.
- Detects local machine specifics (CUBRID version, OS architecture) and automatically selects the appropriate list of components.
- Handles software prerequisite dependencies and downloads them as well.
- Supports both HTTP and FTP protocols for downloads.
- Provides additional information to users, such as links to online resources.
- Handles download errors and auto-retries in case of failures.
- Supports alternate download locations to try in case of failures.
- Saves the user preferences and re-uses them next time.
- Provides a comprehensive operations log.
- Supports UI localization.
Here is a silent video which shows how to use CUBRID ALL-IN-ONE Downloader. If you have questions or suggestions, leave your comments below.
||||||
|
Posted
2 months
ago
by
Jaehee Ahn
For a long time, Java has provided security-related functions. Among them, Java Cryptography Architecture (JCA) is the core one. JCA uses a provider structure with a variety of APIs related to security. These functions are essential for modern IT communication encryption technology, including Digital Signature, Message Digest (hashes), Certificate, Certificate Validation, creation and management of Key, and creation of Secure Random Number.
With JCA, even developers who do not have specialized knowledge of encryption can successfully implement security-related functions. You don't need to use algorithms like those you had to rack your brain for a long time to understand in computer science classes and cryptology-related classes. JCA allows you to implement the algorithms with a few lines of codes. Of course, utilizing the APIs well will be highly valuable for business. But, it does not mean that you do not need to understand how JCA runs. Understanding how JCA runs internally will be important to using the functions more efficiently. To be a better software developer and architect, you may need to trace how the result, JCA, was created from the cryptology and security-related algorithms. This article is a summary of JCA architecture that I learned while producing the nClavis (Symmetric-key cryptography) at NHN. Of course, I do not understand all of JCA yet. However, I was so happy to understand JCA at this level that I decided to write this article to share my experience with you. Design Principles As I mentioned, JCA is a Java security platform, based on the provider structure, having implementation independence, implementation interoperability, and algorithm extensibility. An application can utilize the information protection encryption technology just by requesting security services on the Java platform, without implementing security algorithms. JCA-provided security services are implemented by the provider mounted on the Java security platform. An application can introduce a variety of security functions by using several independent providers. The list of providers is described in the jre/lib/security/java.security file. The Java platform includes many providers and installs them by default when JRE is installed. Code 1: java.security file. 
# # List of providers and their preference orders (see above): # security.provider.1=sun.security.provider.Sun security.provider.2=sun.security.rsa.SunRsaSign security.provider.3=com.sun.net.ssl.internal.ssl.Provider security.provider.4=com.sun.crypto.provider.SunJCE security.provider.5=sun.security.jgss.SunProvider security.provider.6=com.sun.security.sasl.Provider security.provider.7=org.jcp.xml.dsig.internal.dom.XMLDSigRI security.provider.8=sun.security.smartcardio.SunPCSC security.provider.9=sun.security.mscapi.SunMSCAPI The providers mounted on the Java security platform by default are compatible with all Java applications and so widely used to regard them as trusted. Of course, JCA supports mounting of custom providers for applications which want to introduce the latest security technology that has not been implemented yet. Architecture Cryptographic Service Providers All providers are an implementation of java.security.Provider. This provider implementation includes the list of security algorithm implementations. When an instance of a specific algorithm is necessary, the JCA framework searches the proper implementation class of the corresponding algorithm from the provider repository and creates a class instance. The providers defined in the java.security file are included in the repository by default. In this way, a provider can be statically included. In addition, it can be dynamically added in runtime. When several providers are defined, they may implement an identical encryption algorithm in different ways. In this case, an application can specify the provider or specify the preference in the repository. To use JCA, an application simply requests a specific object type (such as MessageDigest) and an algorithm or service (e.g., MD5). Then, the application obtains an implementation from one of the installed providers. Of course, it can explicitly request an object of a specific provider. Code 2: Requesting an Object of a Provider. 
    md = MessageDigest.getInstance("MD5");
    md = MessageDigest.getInstance("MD5", "ProviderC");
Figure 1: Provider Framework of JCA (Source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).
Oracle JRE (Sun JDK) includes a variety of providers (Sun, SunJSSE, SunJCE, and SunRsaSign) by default. The providers are classified according to how they were produced; the functions or algorithms used by each provider are not so different from each other. JREs from vendors other than Oracle are not required to include these providers. Therefore, it is not recommended to implement an application in a provider-dependent way. All encryption technology implementations required by an application are provided by default and implemented at a fully reliable level. Therefore, developers do not need to pay attention to the provider itself.
Key Management
Two of the most important things in JCA are the provider and key management. Java uses a kind of key database, called "keystore", to manage the key/certificate repository. A KeyStore is useful in an application that needs information for authentication, encryption, or signature. An application can access the KeyStore by using an implementation of the java.security.KeyStore class. The default KeyStore implementation is provided by Sun Microsystems (the package name still starts with com.sun even though Oracle acquired it a long time ago :D). The KeyStore is created as a file of the "jks" type. It can also be converted to the "jceks" or "pkcs12" type in order to support applications which use another KeyStore implementation format. The "jceks" format protects the KeyStore with password-based encryption (PBE) using triple DES, which is a stronger encryption algorithm than the one used by the "jks" type. The "pkcs12" format is a standard syntax for exchanging personal information, from the PKCS family of standards published by RSA Laboratories.
For machines, applications, and browser Internet kiosks that support this standard, users can export, import, or activate personally identifying information (certificates for identification, pkcs12 format certificates). Safari, Chrome, and IE browsers follow this standard. Therefore, when the pkcs12 certificate file is installed once, it is applied to all of these browsers. However, Firefox does not follow this standard.
A KeyStore saves and manages, in a file, a SecretKey (SymmetricKey), a Public/Private KeyPair (AsymmetricKey), self-signed certificates, and certificates signed by a trusted CA (Certificate Authority) or a private CA. Here, experienced developers may recall OpenSSL, which is used to create certificate files. Both kinds of certificate file have the same purpose but a different file format, and the formats are convertible. Java even provides the 'keytool' command-line utility in the JDK_HOME/bin directory, which is similar to OpenSSL's utility. Therefore you can handle certificates with keytool on Windows as well, whereas OpenSSL is typically used on Linux. Of course, this tool runs with the KeyStore implementation provided by Oracle JRE. If the JDK is installed on the system, a certificate can be created by using keytool. Note, however, that keytool provides, from Java, functions at about the same level as those provided by OpenSSL. A KeyStore file created by using keytool is compatible with earlier Java versions, so you do not need to worry about it.
In-depth 1: Certificate.
Here, I need to address the correct meaning of certificate. In a narrow sense, a KeyStore is a kind of certificate. A certificate is used for two purposes: as a "lock" required for encrypting information and as a tool for "identification" to technically identify the other party. The authenticated certificate used for bank transactions is a security technology which utilizes both purposes of a certificate.
The cryptographic meaning of a certificate is an electronic document that uses a digital signature to sign the public key created by the RSA algorithm (asymmetric-key cryptography) with the private key of the certificate authority (CA). It is popular practice that a pair of keys created by using the RSA algorithm is solid. It had been proven long ago that calculating the other key by using one key within a meaningful time is impossible. So, why is electronic signature required? When "A" and "B" communicate with each other by encrypting their data, A opens its public key and then protects the private key paired with the public key. Then B encrypts the data to be sent to A by using the public key of A. In this case, there is a problem of how B will know it can trust the public key; is it really provided by A? When a malicious attacker "C" disguises its public key as A's to deceive B, the public key cannot be used. To solve this problem, a trusted certificate authority "D" is necessary. The CA has a chain structure from the top root certificate authority to the sub certificate authority. In this structure, the upper layer certifies the lower layer by using signature. As it had been rooted as a worldwide standard so long ago, the signature chain of the certificates around the world includes a few common top root CAs (e.g. VeriSign, Thawte). Countries of which IT communication environment reaches a certain standard have their national root CA (e.g., KISA in Korea). The top root CAs may have a circular structure that allows signatures with each other. In some cases, one CA can sign the public key with its private key. So, the signature of CA should not be valid limitlessly but updated regularly or irregularly by other security events. It is possible for an individual or a company to establish a private certificate authority if required. Certificates signed by the private certificate authority have a more complex certification process. 
You may have seen the following Figure 2 on your browser: Figure 2: Browser Display when Private Certificate Authority is used. It is displayed when the website uses the private certificate authority. On the browser (even though it is not recommended), the user can skip this situation by clicking the mouse. However, for server-to-server connection, the opponent's certificate issued by the private certificate authority should be imported to JAVA_HOME/jre/lib/security/cacerts or added to SSLContext when creating the connection of the program code. The certificates from official certificate authorities are widely granted as trusted CAs and included to the OS or JRE by default. Therefore, the situation illustrated in Figure 2 does not occur. If a private certificate authority can obtain the pkcs12 format certificate, the private certificate authority is considered as a trusted CA. Therefore, it can be easily installed in the system. Code 2: Certificate Verification Program installed in the system. @Windows>certmgr.msc @MAC OS X>Keychain Access @Linux> keychain In-depth 2: HTTPS To understand encryption, you need to understand SSL/TLS, the certificate-based cryptographic protocols, as well as certificate. The purpose of a certificate can be easily misunderstood when encrypting the HTTPS protocol communication section. When a server and a client communicate via HTTPS, if only the communication section is encrypted without identifying the client (setting the clientAuth attribute of HTTPS Connector to false in the server.xml setting of Tomcat), only a simple certification verification is executed. Symmetric-key cryptography is used for encryption of data. Here, I will describe how a certificate is used through HTTPS. A client connects to a server via the HTTPS protocol (at this time, the SSL connection-defined server port is used). 
The server sends its certificate public key to the client.(including several meta information for validation and Cipher supported by the server) The client validates the server with the public key and meta information sent by the server.(checking whether the public key is signed with a trusted official Root CA) When the certificate of the server is signed with the official CA, it is passed (the official CAs are registered to the system as a trusted authority by default). When the certificate of the server is signed with a private CA, checking the trust manager of the SSL socket (SSLContext) created by the client. If the certificate is registered, it is passed. When the server's certificate passes the validation check, the client creates a symmetric key, encrypts the symmetric key and the cipher as a public key of the server, and then sends them to the server.(the symmetric key is created by selecting one of cipher algorithms supported by the server) The server decodes the [symmetric key and cipher: encrypted as the public key of the server] to its private key and acquires the symmetric key to be used for encrypted communication. After that, data communication between the server and the client is made to be encrypted with the symmetric key created by the client. JCA Structure JCA structure can be described with Engine class and algorithm. In JCA, an Engine class provides interfaces for all encryption service types regardless of a specific encryption algorithm or provider. The engine class provides one of the following functions: Encryption operations (encryption, digital signatures, message digests, etc.) Creating and converting the elements (keys and algorithm parameters) required for encryption Objects (keystores or certificates) which imply the encryption data or can be used by an object or the upper abstraction layer Let's take a look at the Engine classes provided by JCA and talk about encryption in detail. 
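As a minimal sketch of the engine-class idea described above, the following Java program asks the JCA framework for a MessageDigest engine by algorithm name only and then reports which installed provider supplied the implementation. The class name EngineClassSketch and its helper methods are hypothetical names written for this illustration; the providers that are listed depend on the JRE in use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.security.Security;

// Hypothetical illustration class, not part of JCA itself.
public class EngineClassSketch {

    // Requests a MessageDigest engine by algorithm name only and returns
    // the name of the provider whose implementation the framework selected.
    static String providerOf(String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            return md.getProvider().getName();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("No provider offers " + algorithm, e);
        }
    }

    // Returns the digest length in bytes reported by the selected implementation.
    static int digestLength(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm).getDigestLength();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("No provider offers " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // Installed providers, in preference order.
        for (Provider p : Security.getProviders()) {
            System.out.println(p.getName() + ": " + p.getInfo());
        }
        System.out.println("SHA-256 is served by " + providerOf("SHA-256")
                + ", digest length " + digestLength("SHA-256") + " bytes");
    }
}
```

The application code names only the engine type and the algorithm, never an implementation class, which is what allows a provider to be replaced without changing the application.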
SecureRandom SecureRandom class is used to create a Pseudo Random number. In Java, random refers to Pseudo Random, as a more accurate expression. If so, are random and pseudo random different? Technically, they are different. There are two random types: True Random and Pseudo Random. True Random is a random number which cannot be forecasted. You may say that Pseudo Random cannot be forecasted. Pseudo Random is a random progression determined by seed and a mathematical algorithm. It has a sequence which is eventually repeated even if it takes a very long time or its probability is very low. In addition, if you know the seed and the random algorithm, you can forecast the sequence of the Pseudo Random. True Random creates a random number based on atomic physical phenomena, not the mathematical way used by the Pseudo Random. If there is no hardware equipment to measure the atomic physical phenomena, e.g., electromagnetic noises and radioactive element decay, it is impossible to create True Random. JCA SecureRandom class is an engine class that provides a powerful function to create random numbers. As I described, it is not easy to implement a True Random Number Generator (TRNG). Therefore, many implementations implement Pseudo Random Number Generator (PRNG). As mentioned before, the random level of the pseudo random is incomplete. The popular random class cannot satisfy the minimum level that is required cryptographically. Therefore, the implementation of SecureRandom should be verified that it satisfies the requirements of cryptographic level (CSPRNG). Figure 3: Classification of Encryption Type (source: http://en.wikipedia.org/wiki/Cipher). Now, you may ask why creating the random number is considered such an very important thing for encryption. The core of modern cryptology is the key used for encryption. Previously, cryptology had been based on a conversion table like Base64 or UTF8 encoding. 
For modern cryptology, that kind of traditional method is not considered as encryption any more. The key is a random sequence generated by the random sequence generator. We naturally think of an encryption algorithm as thinking of cryptology. However, the open symmetric key encryption algorithm can be simplified to XOR operation (or multiplication/division operation) for the input values and key streams. As I said, the core is the key. If the random level of a random sequence generator is not ensured, the entire outline of the key may be revealed to an attacker when a part of created key or some sequential random sequences are leaked out to the attacker. For modern cryptology, the key used for encryption is the core. So, the random sequence generator is very important. JCA uses SecureRandom as the random sequence generator. Implementation of an algorithm to create random numbers is provided by a provider, like other encryption algorithms. MessageDigest MessageDigest is used to calculate the message digest (hash) of input data. The purpose of message digest is the integrity check to check whether the original file is reserved as it is. Message digest algorithm processes a variable-length original message into a fixed-length hash output. Message digest algorithm consists of a unidirectional hash function, so it is not possible to draw the original value from the hash value. When A and B are communicating with each other, A sends the original message, message hash value of the original message, and the message digest algorithm to B. B calculates the message hash value by using the algorithm and original message sent from A. When the message hash value calculated by B is identical to the message hash value sent from A, it means that the original message sent from A has not been changed or modified until B receives it via the network. Figure 4: Example of Using MessageDigest at a Download Site. You can frequently see Checksums or digital fingerprints at download sites. 
Both are alternative names for a message digest. MD5 and SHA1 are well-known message digest algorithms.

Signature

Signature is used to sign data and to verify a digital signature with the key supplied during initialization. Because it takes a key, it performs key-based cryptography.

Figure 5: Flow of Actions of Signature Object (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

To sign, the Signature object is initialized with the private key and is then given the original data to be signed. The sign() method of Signature signs the original data with the private key and returns the signature bytes. To validate the signed data, a verifying Signature object is initialized with the public key paired with the private key used for signing. That object is then given the original data and the signature bytes, and the verify() method checks whether the two match to determine whether the original data can be trusted. Only the person who holds the private key can create a signature. Verification, however, uses the public key, so anyone who has obtained the public key can perform it.

Digital Signature vs Cryptography vs MessageDigest

For encryption, users can choose either the symmetric key method or the asymmetric key method according to their requirements. A digital signature is also a kind of cryptography, but asymmetric keys are a prerequisite for it. More precisely, a digital signature is a combination of MessageDigest and asymmetric key encryption: large, variable-length data is compressed by the MessageDigest into a fixed-length value that is easy to handle, and that value is then signed with a private key to create fixed-length signature bytes.
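The digest-then-sign relationship described above can be sketched with the JCA classes named in this section. This is only a minimal sketch, not code from the original article: the class name DigestThenSignSketch and its helper methods are hypothetical, the RSA key pair is a throwaway generated on the spot, and SHA-256 / SHA256withRSA are swapped in for the SHA1 and MD5 variants mentioned in the text, on the assumption that the installed provider supports them.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical illustration class; not part of JCA or of the original article.
public class DigestThenSignSketch {

    // Hash variable-length input into a fixed-length digest (32 bytes for SHA-256).
    public static byte[] digest(byte[] data) throws GeneralSecurityException {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    // Generate a throwaway 2048-bit RSA key pair for the demonstration.
    public static KeyPair newRsaKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        return generator.generateKeyPair();
    }

    // Sign: initialize with the private key, feed the data, get the signature bytes.
    public static byte[] sign(PrivateKey privateKey, byte[] data)
            throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(privateKey);
        signer.update(data);
        return signer.sign();
    }

    // Verify: initialize with the paired public key, feed the same data,
    // then check the signature bytes against it.
    public static boolean verify(PublicKey publicKey, byte[] data, byte[] signatureBytes)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(publicKey);
        verifier.update(data);
        return verifier.verify(signatureBytes);
    }

    public static void main(String[] args) throws Exception {
        byte[] message = "This is the original data".getBytes(StandardCharsets.UTF_8);
        KeyPair keyPair = newRsaKeyPair();
        byte[] signatureBytes = sign(keyPair.getPrivate(), message);

        System.out.println("digest length: " + digest(message).length + " bytes");
        System.out.println("signature length: " + signatureBytes.length + " bytes");
        System.out.println("original verifies: "
                + verify(keyPair.getPublic(), message, signatureBytes));

        // Flip one bit of the message; the old signature should no longer match.
        byte[] tampered = message.clone();
        tampered[0] ^= 1;
        System.out.println("tampered verifies: "
                + verify(keyPair.getPublic(), tampered, signatureBytes));
    }
}
```

The Signature object hashes the data internally, so the signer is fed the original data rather than a precomputed hash; digest() is included only to show the fixed-length output that the signing step works from.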
When you create a Signature instance, the algorithm name passed to Signature.getInstance() shows the principle of digital signature: names such as SHA1withRSA, MD5withRSA, and SHA1withDSA combine a MessageDigest algorithm (SHA1 or MD5) with an asymmetric key algorithm (RSA or DSA).

    Signature dsa = Signature.getInstance("SHA1withDSA");

Signed Certificate

    Keystore Type: jks
    Keystore Provider: SUN

    Keystore includes the following two items:

    Alias: rootcaalias
    Written on: 2012. 9. 26
    Input Type: trustedCertEntry
    Holder: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
    Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
    Serial Number:
    Opened on: Fri Apr 06 10:17:08 KST 2012
    Expired on: Sun Mar 13 10:17:08 KST 2112
    Certificate Fingerprint:
        MD5: 0C:FC:12:C5:68:E5:95:0B:95:7D:B0:2F:FA:4F:DB:B4
        SHA1: 90:37:1C:E6:F4:64:AD:E6:27:AA:4F:58:88:16:11:24:6D:A5:EB:2B

    *******************************************
    *******************************************

    Alias: nplatform
    Written on: 2012. 9. 26
    Input Type: keyEntry
    Length of Certificate Chain: 2
    Certificate[1]:
    Holder: O=NHN INC, OU=NHN NBP, CN= NPLAFORM, UID=1
    Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
    Serial Number:
    Opened on: Fri Sep 21 17:26:22 KST 2012
    Expired on: Sun Aug 28 17:26:22 KST 2112
    Certificate Fingerprint:
        MD5: 48:8C:46:A3:E7:54:58:97:60:0D:5C:56:08:B0:D1:E7
        SHA1: 12:64:3C:DA:C1:2C:94:1A:2B:EB:E9:98:2B:DA:8F:06:78:6E:26:1E
    Certificate[2]:
    Holder: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
    Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
    Serial Number:
    Opened on: Fri Apr 06 10:17:08 KST 2012
    Expired on: Sun Mar 13 10:17:08 KST 2112
    Certificate Fingerprint:
        MD5: 0C:FC:12:C5:68:E5:95:0B:95:7D:B0:2F:FA:4F:DB:B4
        SHA1: 90:37:1C:E6:F4:64:AD:E6:27:AA:4F:58:88:16:11:24:6D:A5:EB:2B

Let's review certificates and signatures, which were described in depth in the previous section, in terms of the JCA Signature object mechanism.
The text box above shows the KeyStore certificate file created with Java's keytool. From the Holder and Issuer of Certificate[1], you can see that “CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC” has signed the certificate of the Holder “O=NHN INC, OU=NHN NBP, CN= NPLAFORM, UID=1” with its private key. The Certificate Fingerprint listed for each certificate is not the signature itself but a MessageDigest of the encoded certificate, which is why its length is decided by the digest algorithm (16 bytes for MD5, 20 bytes for SHA1). Following the certificate chain, you can see that the Holder and Issuer of Certificate[2] are identical. This means that “CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC” signed its own certificate with its private key. Because Certificate[2] is self-signed, the certificate chain ends there.

Cipher Class

Figure 6: Flow of Actions of Cipher Object (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

The Cipher class provides encryption and decryption. The algorithms fall into the following categories: symmetric bulk encryption (AES, DES, DESede, Blowfish, IDEA), stream encryption (RC4), asymmetric encryption (RSA), and password-based encryption (PBE). I will not explain the distinction between symmetric and asymmetric encryption because it is so well known.

Stream vs. Block Cipher

Symmetric encryption can be divided into stream ciphers and block ciphers. A block cipher encrypts data in fixed-length blocks. Data whose length does not fit the block length is padded with filler bytes, and the padding bytes are removed when the data is decrypted. The padding follows the padding scheme (e.g., PKCS5PADDING) that is specified when the Cipher is created. In contrast, a stream cipher processes input data a byte or a bit at a time, so it can handle variable-length data without padding.

Modes Of Operation

An important block cipher concept you should know is feedback modes. Assume a very simple block cipher.
If the input data is identical, the encrypted result is identical. Repeated patterns in the encrypted data therefore give attackers a hint for breaking the encryption. Feedback modes were introduced to remove this vulnerability and make the cipher output harder to analyze. A feedback mode combines (by XOR) the Nth input data block (or the Nth encrypted block) with the (N-1)th input data block (or the (N-1)th encrypted block) during the Nth encryption step. As a result, identical input blocks produce different results, depending on the values carried over from the previous step. One more point: when N = 1, there is no previous step to take a value from. In that case an initialization vector (IV) plays the role of the previous value. To use a feedback mode, the IV should be randomly generated before encryption, and it must be stored, because decryption needs the same IV. The feedback modes provided by JCA are CBC, CFB, and OFB. Using no feedback mode is called ECB mode for distinction. I will not describe each mode in more detail here.

Figure 7 shows why feedback modes matter. If the original image data is encrypted without a feedback mode (ECB mode), identical input blocks yield identical encrypted blocks, so the overall outline of the image remains visible.

Figure 7: Image Encryption (source: http://en.wikipedia.org/wiki/Modes_of_operation).

Creating Cipher Object

The essential step in creating a Cipher instance is specifying the transformation. A transformation has the form algorithm/feedback mode/padding, using the elements described above. You may specify only the encryption algorithm, in which case the default feedback mode and padding (ECB/PKCS5Padding) are applied internally.
    Cipher c1 = Cipher.getInstance("DES/ECB/PKCS5Padding");

or

    Cipher c1 = Cipher.getInstance("DES");

A Cipher instance is initialized in one of four modes (opmode): ENCRYPT_MODE, DECRYPT_MODE, WRAP_MODE, or UNWRAP_MODE.

WRAP_MODE: Wraps a java.security.Key into bytes for secure key transmission
UNWRAP_MODE: Unwraps a wrapped key back into a java.security.Key object

A Cipher instance is initialized by calling its init() method, which takes opmode, key (or certificate), params, and random as parameters. Note the params parameter of type AlgorithmParameters. It stores the IV of a feedback mode, and the salt and iteration count of a PBE algorithm. These values are not required when initializing a Cipher in ENCRYPT_MODE: they can be generated randomly with SecureRandom and used for the encryption process, and the generated values are stored in the AlgorithmParameters of the encrypting Cipher object. Initializing a Cipher in DECRYPT_MODE, on the other hand, requires params, because decryption needs the same values that were used for encryption. Calling init() discards all existing state of the Cipher object. Therefore, before initializing the Cipher instance again, call getParameters() and keep the AlgorithmParameters object used for encryption.

To make this simpler, a SealedObject can be used to hold the encryption result. The SealedObject constructor takes the object to encrypt and a Cipher object as arguments (the sealing process). The SealedObject itself is the encrypted data, and it also manages the algorithm parameters used for encryption. Given the key used for encryption, you can get the decrypted data back (the unsealing process).

Code 4: Encryption using SealedObject.
    import javax.crypto.Cipher;
    import javax.crypto.SealedObject;

    //Note: sKey is a DES SecretKey prepared beforehand
    //Create Cipher object
    Cipher c = Cipher.getInstance("DES");
    c.init(Cipher.ENCRYPT_MODE, sKey);
    //Create SealedObject: it is the encrypted data
    SealedObject so = new SealedObject("This is a secret", c);

Code 5: Decryption using SealedObject.

    //Note: sKey is the same key that was used for encryption
    //Note: so is the SealedObject which was previously created

    //Decryption using SealedObject #1
    //Decrypt using Cipher object
    c.init(Cipher.DECRYPT_MODE, sKey);
    try {
        String s = (String) so.getObject(c);
    } catch (Exception e) {
        //do something
    }

    //Decryption using SealedObject #2
    //Decrypt using encryption key
    try {
        String s = (String) so.getObject(sKey);
    } catch (Exception e) {
        //do something
    }

Message Authentication Codes (MAC)

MAC is similar to MessageDigest in that it creates a hash value, but it differs in that it requires a SecretKey (symmetric key) for initialization. MessageDigest allows any receiving party to check the integrity of a received message, whereas MAC allows only a party that holds the identical SecretKey to do so. MAC is therefore used among parties that share a SecretKey.

Figure 8: Flow of Actions of MAC (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

HMAC is a MAC based on a cryptographic hash function (a MessageDigest algorithm such as MD5 or SHA1); it combines a MessageDigest algorithm with a shared SecretKey. Signature differs from HMAC in that Signature uses asymmetric keys. HMAC can authenticate the other party faster than a signature that uses the RSA algorithm, so some services deliberately choose HMAC.

Conclusion

So far, I have described about half of the JCA functions. The rest are just as important, even though they are not described here. This article does not cover the other core parts of JCA, such as Key, KeyPair, KeyFactory, KeyGenerator, KeyStore, CertificateFactory, and CertStore.
I think these functions deserve a full, in-depth description, but for lack of space I won't deal with them here. They will be described in the next article if possible. Studying JCA and preparing this article was very difficult. The more I prepared, the more I felt there was to study and research, and writing left me with even more questions. I hope this article helps you understand the "vague" concept of encryption more clearly.

By Jaehee Ahn, Software Engineer at Web Platform Development Lab, NHN Corporation.

References
http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html
http://en.wikipedia.org/wiki/Modes_of_operation
http://en.wikipedia.org/wiki/Cipher
http://en.wikipedia.org/wiki/Stream_Cipher
http://luxsci.com/blog/how-does-secure-socket-layer-ssl-or-tls-work.html